What if Zcash stopped flying blind and had production-grade monitoring?

@outgoing.doze @shieldedmark @zancas Well, I hope you liked it, and if you have any questions, I’m here to answer them. And if you want to support my proposal, I would really appreciate it.

This is a good proposal. I have already said so, but I want to mention a few points too. In my opinion, Zebra and zcashd already have native support for Prometheus, but the user experience is sometimes really bad. For example, tools like zcashd_exporter expose a lot of metrics that you can’t filter, which makes Datadog refuse them. Some scripts also have a lot of conflicts: zcash_monitor, for example, contains things that are obsolete, and the configuration is really hard for someone who is just starting out in the Zcash ecosystem. I also think @shieldedmark is right about the alerts: if you have too many alerts, your tool will stop being useful even if it is good, so you need to be careful with that. If you build this tool, I can help you test it before you launch.

I would also like to hear @zancas’s opinion, since he was here before. Use what @outgoing.doze and @shieldedmark said, along with the rest of the feedback, so you have something more solid. Congratulations and good luck!

I’ll jump in here because I’ve had this on my plate for a while now. I will also take some blame for not making the community aware that this is something we have been thinking about and planning for a few months.

Although there are some issues in our repo referencing this, there wasn’t clear information on when work would begin. We have already started identifying what needs to be done, for example:

Besides this, we’ve been working with the Z3 stack to identify all the missing information from a metrics perspective. We are also working on exposing more metrics from Zebra to better understand how to improve performance in specific areas, identify bottlenecks, and more.

I agree that there is a lot to be done regarding overall monitoring and observability. However, I don’t believe this requires an additional platform. It must be part of the existing ecosystem. Each application in the Z3 stack must have the underlying implementation correctly done so it can be leveraged by Prometheus, Grafana, Datadog, Sentry, OpenTelemetry, or whatever tools node operators might be using.

I don’t see this as an external tool, platform, or something built on top of existing applications. Instead, it must be approached from the bottom up: modifying applications to enable the metrics needed by node operators. This will also help development teams identify regressions and performance issues before releases reach production.

Additionally, we are fully aligned with (and working on) making Zebra production-grade, and our recent release is the best proof of that. Upcoming releases will include a lot of observability work.

Hi @gustavovalverde, thanks a lot for your answer and for all the links.

Just to be clear, this is not an attempt to create a parallel ecosystem. For me this is more like a tool for operators, built on top of the work that you and the Zebra team are already doing.

My proposal is meant to sit in that “second layer”. It is not a new ecosystem; it is more like a toolkit for operators that:
• uses the observability work and the issues you already have
• tries to unify the metric names between zcashd and Zebra so the same dashboards can work for both
• and spends time on the part that usually nobody has time for: talking with real operators, testing in production, improving UX and writing simple docs
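
To make the second point concrete, here is a minimal sketch of what unifying metric names could look like. Every metric name in the mapping is hypothetical and only for illustration; the names that zcashd exporters and Zebra actually expose may differ, and the parser only handles simple lines in the Prometheus text exposition format.

```python
import re

# Hypothetical mapping: these source and target names are illustrative
# assumptions, not the names zcashd exporters or Zebra actually expose.
UNIFIED_NAMES = {
    "zcashd_block_height": "zcash_node_block_height",
    "zebra_chain_tip_height": "zcash_node_block_height",
    "zcashd_peer_count": "zcash_node_peers",
    "zebra_connected_peers": "zcash_node_peers",
}

# Matches "name value" or "name{labels} value" sample lines.
SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)")


def unify(exposition_text: str) -> dict[str, float]:
    """Return {unified_name: value} for every sample that has a mapping."""
    unified: dict[str, float] = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        name, _labels, value = match.groups()
        target = UNIFIED_NAMES.get(name)
        if target is not None:
            unified[target] = float(value)
    return unified
```

With a mapping like this, one dashboard query against the unified name could serve both node implementations, whichever one the operator runs.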

The recent incidents with Nighthawk, Zecwallet or the hashrate concentration show that only having metrics in the repo is not enough. If operators don’t have something already packaged, with good alerts and step-by-step guides, the ecosystem still reacts late. I think this grant can help exactly in that part, without going into the Foundation’s territory.

If the word “platform” is confusing, I am happy to change it to something like “monitoring toolkit for Zcash operators”, and also I am ok if the dashboards, alert files and config examples live directly in the Zebra / zcashd repos if you prefer that.

Because of this, I feel the work is not “too early” or “too separate”. It is more a way to speed up the observability roadmap and to make sure that your work actually reaches exchanges, lightwalletd operators, pools and the other important operators in the ecosystem.

Zebra and zcashd are already doing the difficult work inside the code: more metrics, better tracing, better observability.

What I want is to be on the other side of the chain, with the operators, and take all that work you are doing and turn it into something practical (dashboards, alerts, guides). So the things you build don’t stay only inside the repo, but actually run in production on exchanges, pools, explorers, lightwalletd and all the important services.

You continue pushing the core, and I focus on the packaging and adoption part. That way the work is not duplicated; it is completed.

What I’m building already works with the metrics zcashd and Zebra have right now.

I don’t need any new Zebra updates for this.

My part is more for the operators: docs, dashboards, alerts, and making things easy, so it doesn’t depend on how fast Zebra moves.

These dashboards, alerts, and guides are something I’m already working on, at least from the Zebra side, and some for the Z3 stack. Part of it will see the light in the next 2-3 weeks, so I’d highly suggest waiting for that, as any following work should be built upon it. And I don’t think it would take months to do, in any case.

Cool! Would you mind clarifying whether we’re talking about Prometheus alerts or Grafana alerts?

Thanks for the update, Gustavo. I don’t want to start a big debate or go in circles, but I would like to know your honest opinion, because I still see this grant as something viable mainly because:

with the metrics that zcashd and Zebra already have today, plus what you are going to publish, there is enough material to build a serious monitoring toolkit: not only some dashboards, but also well-defined alert rules and ready-made channels like Telegram, Discord, webhooks, etc., so an operator can just copy the config and start getting useful alerts;

and my focus is the operator layer: bringing zcashd + Zebra together, turning all of that into something you can install in minutes (packaging, best practices, clear docs) and testing it with real infra like lightwalletd, pools and explorers, which is normally outside the scope of one or two issues in the repo.

I’m not trying to duplicate or compete with your work, but to stand on top of what you publish and bring it well-packaged and maintained to the operators in the ecosystem. How do you see it?
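
As a rough illustration of what a “well-defined alert rule plus ready-made channel” could mean, here is a minimal sketch of a stalled-chain-tip check and a webhook message. The function names, the threshold, and the message wording are all hypothetical; the `content` field assumes a Discord-style incoming webhook, and nothing here is taken from an existing Zebra or zcashd tool.

```python
import json


def tip_is_stalled(samples: list[tuple[int, int]], stall_seconds: int) -> bool:
    """Return True if the latest height has not changed for stall_seconds.

    samples is a list of (unix_timestamp, block_height) pairs sorted by
    timestamp, e.g. collected by scraping a node's height metric.
    """
    if not samples:
        return False
    latest_ts, latest_height = samples[-1]
    first_seen = latest_ts
    # Walk backwards to find when the current height was first observed.
    for ts, height in reversed(samples):
        if height != latest_height:
            break
        first_seen = ts
    return latest_ts - first_seen >= stall_seconds


def build_webhook_payload(node: str, height: int, stalled_for: int) -> str:
    """Build a JSON body for a chat webhook (the "content" key is an assumption)."""
    message = f"{node}: chain tip stuck at height {height} for {stalled_for}s"
    return json.dumps({"content": message})


# Example: the height has stayed at 101 from t=60 to t=700, i.e. for 640 seconds.
samples = [(0, 100), (60, 101), (120, 101), (700, 101)]
if tip_is_stalled(samples, stall_seconds=600):
    body = build_webhook_payload("zebra-1", 101, 640)
```

In a real deployment this logic would more likely live in a Prometheus alerting rule, with Alertmanager handling delivery; the sketch only shows the shape of the check an operator would be copying.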

Thanks for the info, Gustavo. If you don’t mind, could you explain a bit more what you are planning to ship exactly?

It would help me a lot to understand whether this will be Prometheus alert rules with Alertmanager, or more like Grafana alerts and dashboards; whether it’s focused only on Zebra or also on zcashd; and whether you plan things like example deployments (Docker, docker-compose) and simple guides for operators, with alert channels like Telegram / Discord.

I’m asking because I want to see clearly where my grant can sit on top of your work instead of overlapping.

Prometheus alerting rules

I honestly don’t think that the current metrics from either are enough; there’s a lot to be added before good dashboards can be built (even including OTel with traces).

If we’re going to deprecate zcashd, you should not be focusing effort there, unless it’s specifically focused on the deprecation.

Thanks Gustavo, that helps. I’m not trying to design a second set of rules. I want to build on the metrics and Prometheus rules you ship and turn them into an easy operator toolkit (dashboards, Alertmanager/Telegram/Discord examples, docker-compose, simple docs), mainly for Zebra, and only cover zcashd during the deprecation phase.

Cool, so it seems you understand the Proposal (as-is) wouldn’t suffice. There are other things you’d have to consider, like the new stack. And there’s WIP that would overlap.

Well then, maybe this proposal can be improved once the Zcash Foundation publishes the repository with what you mentioned. Do they have an exact date? That way you can work on improving the ecosystem using the new documentation as a basis.

Two weeks ago I promised (well, not exactly :sweat_smile:) that we’d be delivering some of the groundwork needed for observability, and mentioned that some of the planning wasn’t yet public in the repo.

Now you’ll be able to see the planning for most of this work here:

And also the PRs adding some of the missing metrics and automations:

One important piece of all this is the traces, which make it possible to identify performance issues and errors, and to better understand not just what is happening, but also how it is happening.

This is a WIP, but I’d highly recommend reading this to fully understand the impact of this information: zebra/docker/observability/jaeger/README.md at 7f8bcbadd403c1b27870175c90e6c6cb856254e4 · ZcashFoundation/zebra · GitHub



I hope this also answers your question about alerting, @outgoing.doze (TL;DR: alerts with Prometheus and Alertmanager integration).
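
For operators who want to forward these alerts to a chat channel, a minimal sketch of turning an Alertmanager webhook body into readable lines might look like the following. It relies on the `alerts`, `status`, `labels` and `annotations` fields of Alertmanager's webhook payload; the alert name and instance in the example are made up, and the function name is hypothetical.

```python
import json


def summarize_alertmanager_webhook(body: str) -> list[str]:
    """Turn an Alertmanager webhook JSON body into one text line per alert."""
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unnamed")
        instance = labels.get("instance", "unknown instance")
        line = f"[{status}] {name} on {instance}"
        # The summary annotation is optional; append it only when present.
        summary = annotations.get("summary")
        if summary:
            line += f": {summary}"
        lines.append(line)
    return lines


# Example body with a made-up alert name and instance.
example_body = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "ZebraTipStalled", "instance": "zebra-1:9999"},
        "annotations": {"summary": "chain tip has not advanced"},
    }],
})
messages = summarize_alertmanager_webhook(example_body)
```

Each resulting line can then be posted to whatever channel the operator already uses.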

Awesome, thanks a lot @gustavovalverde!!

Looks great, ty! :heart_eyes: :student:
