What if Zcash stopped flying blind and had production-grade monitoring?

sudo-julien · November 17, 2025, 1:38pm

@outgoing.doze @shieldedmark @zancas Well, I hope you liked it, and if you have any questions, I’m here to answer them. And if you want to support my proposal, I would really appreciate it.

anon72607568 · November 18, 2025, 7:00am

This is a good proposal, as I already said, but I want to mention a few points too. Zebra and zcashd already have native support for Prometheus, but the user experience is sometimes really bad. For example, tools like zcashd_exporter expose a lot of metrics that you can't filter, which makes Datadog refuse them. Some scripts also have a lot of conflicts: zcash_monitor, for example, has obsolete parts, and the configuration is really hard for someone who is just starting out in the Zcash ecosystem. I also think @shieldedmark is right about the alerts: if you have too many alerts, your tool will stop being useful even if it is good, so you need to be careful with that. If you build this tool, I can help you test it before you launch.

I would also like to hear @zancas's opinion, since he was here before. Use the things @outgoing.doze and @shieldedmark said, along with the other feedback, so you have something more solid. Congratulations and good luck!
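
To illustrate the filtering problem described above, here is a minimal Python sketch that trims a Prometheus text-format scrape down to an allowlist of metric-name prefixes before it is forwarded anywhere. It is not how zcashd_exporter or Datadog work internally; the function name `filter_metrics` and every `demo_` metric name are made up for the example.

```python
from typing import Iterable


def filter_metrics(exposition: str, allowed_prefixes: Iterable[str]) -> str:
    """Keep only samples (and their # HELP / # TYPE lines) whose metric
    name starts with one of the allowed prefixes."""
    prefixes = tuple(allowed_prefixes)
    kept = []
    for line in exposition.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            # Metadata lines look like "# HELP <name> ..." or "# TYPE <name> ..."
            parts = line.split(maxsplit=3)
            is_meta = len(parts) >= 3 and parts[1] in ("HELP", "TYPE")
            name = parts[2] if is_meta else ""
        else:
            # Sample lines look like "<name>{labels} <value>" or "<name> <value>"
            name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefixes):
            kept.append(line)
    return "\n".join(kept) + "\n" if kept else ""


# Hypothetical scrape output; these metric names are invented for the example.
SAMPLE = """\
# HELP demo_block_height Current chain tip height
# TYPE demo_block_height gauge
demo_block_height 2500000
# HELP demo_debug_cache_hits Internal cache hits
# TYPE demo_debug_cache_hits counter
demo_debug_cache_hits{cache="utxo"} 918
"""

filtered = filter_metrics(SAMPLE, ["demo_block_", "demo_peer_"])
```

Here only the three `demo_block_height` lines survive, so a downstream agent would receive a much smaller payload.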

gustavovalverde · November 19, 2025, 2:19pm

I’ll jump in here because I’ve had this on my plate for a while now. I will also take some blame for not making the community aware that this is something we have been thinking about and planning for a few months.

Although there are some issues in our repo referencing this, there wasn’t clear information on when work would begin. We have already started identifying what needs to be done, for example:

Tracking: DevOps Observation Environment Setup · Issue #9640 · ZcashFoundation/zebra · GitHub
Track the deferred chain value pool in Zebra's metrics · Issue #8820 · ZcashFoundation/zebra · GitHub
feat: Implement Monitoring Stage in Dockerfile · Issue #9643 · ZcashFoundation/zebra · GitHub
feat: Adapt Monitoring Stack for GCP Deployment · Issue #9641 · ZcashFoundation/zebra · GitHub
Simplify metrics and tracing tests · Issue #8715 · ZcashFoundation/zebra · GitHub
Add metrics for chain fork work and lengths · Issue #5297 · ZcashFoundation/zebra · GitHub

Besides this, we’ve been working with the Z3 stack to identify all the missing information from a metrics perspective. We are also working on exposing more metrics from Zebra to better understand how to improve performance in specific areas, identify bottlenecks, and more.

I agree that there is a lot to be done regarding overall monitoring and observability. However, I don’t believe this requires an additional platform. It must be part of the existing ecosystem. Each application in the Z3 stack must have the underlying implementation correctly done so it can be leveraged by Prometheus, Grafana, Datadog, Sentry, OpenTelemetry, or whatever tools node operators might be using.

I don’t see this as an external tool, platform, or something built on top of existing applications. Instead, it must be approached from the bottom up: modifying applications to enable the metrics needed by node operators. This will also help development teams identify regressions and performance issues before releases reach production.

Additionally, we are really aligned on (and working toward) making Zebra production-grade, and our recent release is the best way to prove it. The following releases will include a lot around observability.

sudo-julien · November 19, 2025, 2:55pm

Hi @gustavovalverde, thanks a lot for your answer and for all the links.

Just to be clear, this is not an attempt to create a parallel ecosystem. For me, this is more like a tool for the operators, built on top of the work that you and the Zebra team are already doing.

My proposal wants to be in that “second layer”. It is not a new ecosystem, it is more like a toolkit for operators that:
• uses the observability work and the issues you already have
• tries to unify the metric names between zcashd and Zebra so the same dashboards can work for both
• and spends time in the part that usually nobody has time for: talking with real operators, testing in production, improving UX and writing simple docs
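
As a rough illustration of the second bullet, the sketch below renames node-specific metrics to one shared set of names so a single dashboard query could serve both nodes. Every metric name here is a placeholder invented for the example; none is taken from zcashd or Zebra, and `normalize` is a hypothetical helper.

```python
from typing import Dict

# Hypothetical name mapping: these metric names are placeholders, not the
# real names exported by zcashd or Zebra.
UNIFIED_NAMES: Dict[str, Dict[str, str]] = {
    "zcashd": {
        "example_zcashd_chain_height": "node_block_height",
        "example_zcashd_peer_total": "node_peer_count",
    },
    "zebra": {
        "example_zebra_tip_height": "node_block_height",
        "example_zebra_connected_peers": "node_peer_count",
    },
}


def normalize(source: str, samples: Dict[str, float]) -> Dict[str, float]:
    """Rename one node's samples to the shared names; drop unmapped metrics."""
    mapping = UNIFIED_NAMES[source]
    return {
        mapping[name]: value
        for name, value in samples.items()
        if name in mapping
    }


zcashd_view = normalize(
    "zcashd",
    {"example_zcashd_chain_height": 2500000, "example_zcashd_peer_total": 8},
)
zebra_view = normalize(
    "zebra",
    {"example_zebra_tip_height": 2500000, "example_zebra_internal_debug": 1},
)
```

Both views now use the same keys, which is the property a shared dashboard needs. In a real deployment the renaming would more likely be done with Prometheus relabeling or inside the dashboards than in application code.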

The recent incidents with Nighthawk, Zecwallet or the hashrate concentration show that only having metrics in the repo is not enough. If operators don’t have something already packaged, with good alerts and step-by-step guides, the ecosystem still reacts late. I think this grant can help exactly in that part, without going into the Foundation’s territory.

If the word “platform” is confusing, I am happy to change it to something like “monitoring toolkit for Zcash operators”, and also I am ok if the dashboards, alert files and config examples live directly in the Zebra / zcashd repos if you prefer that.

Because of this, I feel the work is not “too early” or “too separate”. It is more a way to speed up the observability roadmap and to make sure that your work really reaches exchanges, lightwalletd operators, pools and the other important operators in the ecosystem.

sudo-julien · November 19, 2025, 3:35pm

Zebra and zcashd are already doing the difficult work inside the code: more metrics, better tracing, better observability.

What I want is to be on the other side of the chain, with the operators, and take all that work you are doing and turn it into something practical (dashboards, alerts, guides). So the things you build don’t stay only inside the repo, but actually run in production on exchanges, pools, explorers, lightwalletd and all the important services.

You continue pushing the core, and I focus on the packaging and adoption part. This way the work is not duplicated; it is completed.

What I’m building already works with the metrics zcashd and Zebra have right now.

I don’t need any new Zebra updates for this.

My part is more for the operators: docs, dashboards, alerts, and making things easy, so it doesn’t depend on how fast Zebra moves.

gustavovalverde · November 19, 2025, 4:39pm

These dashboards, alerts, and guides are something I’m already working on, at least from the Zebra side, and some for the Z3 stack. Part of it will see the light in the next 2-3 weeks, so I’d highly suggest waiting for that, as any subsequent work should be built upon it. And I don’t think it would take months to do, in any case.

outgoing.doze · November 19, 2025, 4:53pm

Cool! Would you mind clarifying whether we’re talking about Prometheus alerts or Grafana alerts?

sudo-julien · November 19, 2025, 4:56pm

Thanks for the update, Gustavo. I don’t want to start a big debate or go in circles, but I would like to know your honest opinion, because I still see this grant as something viable mainly because:

with the metrics that zcashd and Zebra already have today, plus what you are going to publish, there is enough material to build a serious monitoring toolkit: not only some dashboards, but also well-defined alert rules and ready-made channels like Telegram, Discord, webhooks, etc., so an operator can just copy the config and start getting useful alerts

and my focus is the operator layer: bringing zcashd + Zebra together, turning all of that into something you can install in minutes (packaging, best practices, clear docs) and testing it with real infra like lightwalletd, pools and explorers, which normally is outside the scope of one or two issues in the repo.
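
To make the "ready channels" idea concrete, here is a small sketch of the formatting step such a toolkit might sit on: it turns an Alertmanager webhook payload into a short text message that a Telegram or Discord relay could post. Actually sending the message is left out. The `alerts`, `status`, `labels` and `annotations` fields follow Alertmanager's webhook JSON; the alert name, the `severity` label and the `summary` annotation are assumptions about how the rules would be written.

```python
from typing import Any, Dict


def format_alerts(payload: Dict[str, Any]) -> str:
    """Render each alert in an Alertmanager webhook payload as one text line."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unnamed alert")
        # "severity" and "summary" are common conventions, not guaranteed fields.
        severity = labels.get("severity", "none")
        summary = annotations.get("summary", "")
        lines.append(f"[{status}] {name} (severity: {severity}) {summary}".rstrip())
    return "\n".join(lines)


# Hypothetical payload; the alert name and text are invented for the example.
example_payload = {
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "ExampleNodeBehindTip", "severity": "warning"},
            "annotations": {"summary": "Node is 12 blocks behind the network tip"},
        }
    ],
}

message = format_alerts(example_payload)
```

Keeping the formatting separate from delivery means the same function could feed any channel that accepts plain text.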

I’m not trying to duplicate or compete with your work, but to stand on top of what you publish and bring it well-packaged and maintained to the operators in the ecosystem. How do you see it?

sudo-julien · November 19, 2025, 5:13pm

Thanks for the info, Gustavo. If you don’t mind, could you explain a bit more what you are planning to ship exactly?

It would help me a lot to understand whether this will be Prometheus alert rules with Alertmanager or more like Grafana alerts and dashboards, whether it’s focused only on Zebra or also on zcashd, and whether you plan things like example deployments (Docker, docker-compose) and simple guides for operators, with alert channels like Telegram / Discord.

I’m asking because I want to see clearly where my grant can sit on top of your work instead of overlapping.

gustavovalverde · November 19, 2025, 5:30pm

Prometheus alerting rules

gustavovalverde · November 19, 2025, 6:28pm

I honestly don’t think that the current metrics from either are enough; there’s a lot to be added to be able to build good dashboards (even including OTel with traces).

If we’re going to deprecate zcashd, you should not be focusing effort there, unless it’s specifically focused on the deprecation.

sudo-julien · November 19, 2025, 6:50pm

Thanks Gustavo, that helps. I’m not trying to design a second set of rules. I want to build on the metrics and Prometheus rules you ship and turn them into an easy operator toolkit (dashboards, Alertmanager/Telegram/Discord examples, docker-compose, simple docs), mainly for Zebra, and only cover zcashd during the deprecation phase.

gustavovalverde · November 19, 2025, 7:20pm

Cool, so it seems you understand the Proposal (as-is) wouldn’t suffice. There are other things you’d have to consider, like the new stack. And there’s WIP that would overlap.

sudo-julien · November 19, 2025, 8:49pm

Well, then maybe this proposal can be improved once the Zcash Foundation publishes the repository with what you mentioned. Is there an exact date? That way the proposal can build on it to improve the ecosystem, using the new documentation as a basis.

gustavovalverde · December 3, 2025, 9:07am

Two weeks ago I promised (well, not exactly) that we’d be delivering some of the groundwork needed for observability, and mentioned that some of the planning wasn’t yet public in the repo.

Now you’ll be able to see the planning for most of this work here:

Tracking: Zebra Observability Implementation · Issue #10160 · ZcashFoundation/zebra · GitHub

And also the PRs adding some of the missing metrics and automations:

feat(observability): Add Grafana auto-provisioning and AlertManager by gustavovalverde · Pull Request #10171 · ZcashFoundation/zebra · GitHub
feat(tracing): add OpenTelemetry distributed tracing support by gustavovalverde · Pull Request #10174 · ZcashFoundation/zebra · GitHub
feat(metrics): add value pool, RPC, and peer health metrics by gustavovalverde · Pull Request #10175 · ZcashFoundation/zebra · GitHub

One important piece of all this is the traces, which make it possible to identify performance issues and errors, and to better understand not just what is happening, but also how it is happening.

This is a WIP, but I’d highly recommend reading this to fully understand the impact of this information: zebra/docker/observability/jaeger/README.md at 7f8bcbadd403c1b27870175c90e6c6cb856254e4 · ZcashFoundation/zebra · GitHub

gustavovalverde · December 3, 2025, 9:10am

I hope this also answers your question about alerting, @outgoing.doze (TL;DR: alerts with Prometheus and Alertmanager integration).

outgoing.doze · December 3, 2025, 9:59am

Awesome, thanks a lot @gustavovalverde !!

dismad · December 3, 2025, 4:12pm

Looks great, ty!
