What if Zcash stopped flying blind and finally had standard observability (Prometheus + Grafana) for its critical infrastructure?
Zcash’s privacy is world-class, but its infrastructure is still run largely without proper instrumentation.
Zcashd and Zebra already expose metrics via Prometheus, and there are scattered scripts and demos. What’s missing is something other ecosystems already have: a packaged, maintained, easy-to-adopt monitoring stack for the operators who actually keep the network running.
The idea isn’t to “invent new metrics” but to take everything that already exists and turn it into a production tool that any operator can use in minutes, without depending on a centralised service.
What happens when there’s NO observability (real examples)
These aren’t theoretical problems; we’ve already seen them in Zcash without standard monitoring:
Nighthawk infrastructure shutdown
Lightwalletd servers and the associated explorer went dark. Mobile wallets had to scramble with emergency migrations instead of planned transitions.
Zecwallet server incidents
Multiple outages forced an emergency grant request just to keep the service online. With clear alerts for desync, load, and resources, many of these problems could’ve been spotted early.
Pool with more than 51% hashrate
A single pool exceeded the 50% hashrate threshold, and the response came through manual warnings and ad hoc analysis, not automated concentration-risk alerts.
Drop in full node count
The count fell from roughly 170 to roughly 130 nodes without clear data about what was happening: versions, regions, connectivity, operator exits. It was all guesswork.
In all these cases, the same thing was missing: consolidated metrics, dashboards, and standard alerts that would’ve warned before the problem exploded.
What this project adds on top of what already exists
Since metrics and some earlier attempts already exist, this project focuses on adding:
A unified exporter for zcashd and Zebra that normalises metrics and saves each operator from wrestling with the differences between clients and endpoints.
Grafana dashboards designed for production, not just examples: node health, network, mempool, resources, zcashd→Zebra migration, shielded pool usage, and historical network trends.
A reference alert set based on these real incidents: node down, zero peers, prolonged desync, disk nearly full, and early signals of mining concentration, with integrations for email/Slack/Discord.
Self-hosted and open-source deployment (MIT): Docker, bare metal, or Kubernetes, without sending metrics to third parties and without creating a new centralised piece that the whole network depends on (a minimal deployment sketch follows below).
“Copy-paste” documentation so that both small teams and large infrastructure can go from “no monitoring” to “decent observability” without becoming Prometheus experts first.
This is meant to complement and “productize” the Prometheus support that Zcash already has, not to replace any existing tool or lock anyone in.
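To make the “use it in minutes” goal concrete, here is a minimal sketch of what the Docker flavour of the deployment could look like. Image tags, ports, and mounted file paths are illustrative assumptions, not final design choices:

```yaml
# docker-compose.yml sketch: self-hosted Prometheus + Alertmanager + Grafana.
# Image tags, ports, and mounted paths are placeholders, not final choices.
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./zcash-alerts.yaml:/etc/prometheus/zcash-alerts.yaml:ro
    ports:
      - "9090:9090"
  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports:
      - "9093:9093"
  grafana:
    image: grafana/grafana-oss:latest
    volumes:
      - ./dashboards:/var/lib/grafana/dashboards:ro
    ports:
      - "3000:3000"
```

Grafana would additionally need a small provisioning file pointing at the mounted dashboards folder; that is omitted here for brevity.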
Any feedback, comments, or support is greatly appreciated! Together we can make Zcash even stronger and more resilient.
Yeah totally, that’d be nice. I’m old school though, Grafana is a mess, prometheus is where the action is for decent nerds. So I’ll be crossing fingers for comprehensive prometheus alerts as well…
Yeah, I totally get you. Grafana is mostly there so we don’t have to look at raw metrics all day.
The important part of this project, just like you said, is really Prometheus and the alerting, not just pretty graphs.
The idea is to:
treat Prometheus alerts as something central, not an extra
ship with a solid default ruleset (node down, zero peers, desync, disk almost full, high load, early mining-concentration signals, etc.; a sketch of two such rules follows this list)
and make it easy for more “old school” operators like you to just run Prometheus + Alertmanager, even if they barely touch Grafana
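As a taste of what that default ruleset could look like, here is a sketch of two of the rules. The `up` metric is standard Prometheus and the `node_filesystem_*` metrics come from node_exporter; the job names and thresholds are my assumptions and would be tuned with operator feedback:

```yaml
# zcash-alerts.yaml sketch; job names and thresholds are starting points
groups:
  - name: zcash-node-core
    rules:
      - alert: NodeDown
        expr: up{job=~"zcashd|zebrad"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Zcash node {{ $labels.instance }} is unreachable"
      - alert: DiskAlmostFull
        # Requires node_exporter running next to the node
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Less than 10% disk space left on {{ $labels.instance }}"
```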
If you have ideas for alerts that you think would be good to include, please let me know so I can try to add them and make this as complete as possible. And thanks for the support — I also feel Zcash really needs this, because without proper alerting we’re going to keep seeing avoidable outages.
I don’t know personally, but I do run zebra nodes and I have no idea when they have any issues, and I guess it’s not great for the network if they don’t work properly. So hopefully someone more technical can help determine what are the most critical alerts and reasonable thresholds. We need the right balance obviously, being alerted all the time for minor things tends to have people silence alerts.
Both zebrad and zcashd already expose prometheus endpoints for consumption.
Any proposal towards this effort should include a thorough survey of prior art in the Zcash / DevOps space - for example the reproducible builds work that @antonleviathan has been landing in PRs.
1. Prior art I’m aware of:
@antonleviathan’s reproducible builds work (great reference!)
Community tools: zcash-monitor.py, zcashd_exporter, hosh.zec.rocks
The gap isn’t “metrics don’t exist” - it’s that they’re scattered, undocumented, and not packaged for production use. My proposal is explicitly about productizing existing capabilities, not reinventing.
2. Including a thorough survey in the proposal:
100% agree - I’ll add a detailed “Prior Art & Differentiation” section showing:
What zcashd/Zebra already expose
What community tools exist and their status
Specific gaps this fills (unified platform, docs, alerts, migration dashboards)
How this complements (not replaces) existing work
Also - are there specific observability gaps you’ve noticed in the DevOps space that should be prioritized?
I think the standard move there is to run alertmanager alongside prometheus, and I haven’t yet checked if there are pre-defined alert definitions anywhere that you can import.
@outgoing.doze Great point on balance: alert fatigue is very real, and the last thing we want is people muting everything.
The idea is to start with a very small core ruleset: things like “node unreachable,” “zero peers for a while,” “far behind chain tip,” and “disk almost full.” Everything else would be opt-in and adjustable.
I’d rather tune thresholds based on real incidents and feedback from operators like you than guess them in advance. Since you’re running Zebra nodes today, is there any kind of failure you usually notice too late that you’d actually want an alert for? That’s exactly what I’d like to capture in the first version.
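Since the failure mode you describe is “the node looks alive but something is wrong,” a “chain tip stalled” rule is probably the single highest-value alert for Zebra operators. A sketch, assuming Zebra exports a verified-tip-height gauge; `zcash_chain_verified_block_height` is my guess at the exported name, so verify it against your node’s metrics endpoint before relying on it:

```yaml
# Fire when the verified tip height has not advanced in 30 minutes.
# The metric name below is an assumption; check your zebrad metrics output.
groups:
  - name: zcash-sync
    rules:
      - alert: ChainTipStalled
        expr: delta(zcash_chain_verified_block_height[30m]) == 0
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has not verified a new block in 30m"
```

The 30-minute window is deliberately generous (Zcash targets 75-second blocks), so this should only fire when something is genuinely stuck, not on normal block-time variance.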
That’s what I’m looking for. I cannot come up with alerts, it takes someone with a deep knowledge of how zebra works and what values are normal vs what values should trigger an alert.
Hey, sorry if I use an LLM sometimes. It’s just that I always think in a very technical way, you know? I understand the things in my head, but it’s hard for me to explain them well, so that’s why I use it.
Like, something I could explain in maybe 100 lines of code, I end up writing like 500 because I wanna explain my idea well, and sometimes I repeat things a lot. So yeah, that’s why I use the LLM.
Also I want the info to be easy to read for everyone. Believe me, the knowledge is mine, I just use the LLM to make it more clear, because sometimes I see posts with super long paragraphs and honestly it’s too heavy for me. And I don’t wanna do something to you guys that I don’t like when other people do it.
Sorry about that. I will try to answer by myself now without using an LLM. If you have any question, just tell me and I’ll answer you.
And btw I also have a document where I wrote all the info before, so sometimes I copy some code lines from there just to answer faster. But I didn’t use the LLM to answer you exactly.
Everything I respond with, I already have in a document from my previous investigation. But I don’t want to give you like 61 pages, so I use the LLM to answer faster and give you a more concrete response.
And sorry if I say the same things again, but sometimes I don’t know if you understood those points or maybe you missed them, so I repeat them to make it clear.
@NoFace @shieldedmark Okay, I want to say sorry to all of you. Thank you so much for your understanding @zancas. Honestly, I would like to answer you in Spanish because I can explain things much better. I will try to use Google Translate in a way that doesn’t change my answer too much. The bad thing is that it also uses AI, but I will try to make it not sound too dry. I will answer all your questions in my own words; it will just take a little more time.
If you have any question you want to ask me, or any question about this topic, you can tell me. Today I will focus on answering everything, so I can give it all the time possible and it helps me answer faster later, because I need a lot of time to answer correctly in my own words and in English.
Thank you, @sudo-julien. You seem like a capable dev, and I hate to see that hidden behind a wall of AI-generated text. I should have also sent that message in private.
Before I continue, I want to say that I made a long post to answer all the questions. Maybe you think I added more questions, but to do this I checked your posts and made a list of the questions you asked or the ones I think you probably have.
@outgoing.doze’s questions:
For the questions and comments from @outgoing.doze, here are answers to the basic questions you are probably asking.
1. Is your project only nice dashboards, or are you also going to deliver a complete and serious set of Prometheus alerts ready to use?
No, the project is not only nice dashboards; I will also deliver a full set of Prometheus alert rules ready to use.
2. Who is going to define which alerts are critical for Zebra?
The alerts will be defined by me in the platform, but not in isolation. I will base them on things that, if detected in time, could have stopped or limited past problems: the incidents Zcash has already had, like the hashrate concentration in one pool or the issues Nighthawk had. I will also use the metrics that Zebra and zcashd already expose on their native Prometheus endpoints, plus feedback from operators who already run Zebra, especially from the forum, because there is a lot of important information there but it is very scattered. Later I want operators to help me adjust the alerts, so the people with real experience can tell me what to add.
3. How are you going to define reasonable thresholds so there is no alert fatigue?
At the beginning there will be only a few alert types, only the ones linked to serious incidents: node down, zero peers, big sync delay, disk almost full, mining concentration. Those things have already caused problems and are the most serious. From there, the data tells you whether something is critical or just a warning. For example, having zero peers for many minutes is critical, while having fewer than eight peers for fifteen minutes is a warning, not something to wake people up for. I also want to limit notification noise by setting time windows inside the alert rules. And everything will be separated and filtered by critical, warning, and info, so the information is routed differently depending on the category (a sketch of the two-tier peer rule follows below).
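To illustrate that critical/warning split, here is a sketch of exactly that pair of rules. `zcash_peers_connected` is a hypothetical unified metric name; neither client necessarily exposes it under that name today:

```yaml
# Two-tier peer alerting sketch; the metric name is a placeholder
groups:
  - name: zcash-peers
    rules:
      - alert: ZeroPeers
        expr: zcash_peers_connected == 0
        for: 10m
        labels:
          severity: critical        # page someone
        annotations:
          summary: "{{ $labels.instance }} has had zero peers for 10 minutes"
      - alert: LowPeerCount
        # The > 0 guard keeps this from firing alongside ZeroPeers
        expr: zcash_peers_connected < 8 and zcash_peers_connected > 0
        for: 15m
        labels:
          severity: warning         # chat/ticket, don't wake anyone
        annotations:
          summary: "{{ $labels.instance }} has fewer than 8 peers"
```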
4. Is your project going to help people like him, who run Zebra but don’t know which values are “normal” or “critical”?
Actually, that is a good point. The main goal is that someone who runs Zebra or zcashd but is not a monitoring expert gets “normal” and “critical” values already coded into the rules and dashboards.
5. Is there already a rule file ready to import, or is your project going to be the one that provides those alerts?
The project ships a file with the alert rules, including the critical and warning alerts, what they mean, and why they usually fire. It also includes an Alertmanager setup so operators only need to change their webhooks or emails.
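A minimal sketch of what that Alertmanager setup could look like, routing on the severity label; every address and webhook URL below is a placeholder:

```yaml
# alertmanager.yml sketch: critical goes to email, everything else to chat
global:
  smtp_smarthost: "smtp.example.org:587"   # placeholder
  smtp_from: "alerts@example.org"          # placeholder
route:
  receiver: chat-warnings
  group_by: [alertname, instance]
  routes:
    - matchers:
        - severity = critical
      receiver: oncall-email
receivers:
  - name: oncall-email
    email_configs:
      - to: "ops@example.org"              # placeholder
  - name: chat-warnings
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX"  # placeholder webhook
        channel: "#zcash-alerts"
```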
Also, while checking, I noticed there is no standard alert rule file for Zcash; this proposal wants to fill that gap.
@shieldedmark’s questions:
@shieldedmark, these are all the things I think you are asking me. If you have another question you can tell me, but I might respond a little late.
Did your proposal already take into account that zebrad and zcashd expose native metrics?
Yes, that is part of the design. The exporter does not ignore the native metrics; it uses them as the main source when they are available and combines them with RPC data. The hard part of this project is unifying the metric names and schemas of zcashd and Zebra so one dashboard works for both clients and you can compare them side by side. And when a metric already exists natively, using it is better because it adds less overhead (one possible mechanism for the unification is sketched below).
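One way part of that unification could happen without extra moving parts is Prometheus metric relabeling at scrape time. A sketch, where `zcashd_connections` is a hypothetical source metric name and the port is a placeholder; the real names have to be read off each client’s metrics output:

```yaml
# prometheus.yml fragment: rewrite a hypothetical zcashd metric name into the
# unified scheme so one dashboard query can cover both clients
scrape_configs:
  - job_name: zcashd
    static_configs:
      - targets: ["localhost:9969"]     # placeholder port
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "zcashd_connections"     # hypothetical original name
        target_label: __name__
        replacement: "zcash_peers_connected"
```

Whether relabeling alone is enough or a dedicated exporter binary is needed depends on how far apart the two clients’ schemas really are; the exporter route also lets us merge in RPC data, as described above.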
Does your grant include a serious survey of prior art (previous scripts, exporters, dashboards, reproducible builds, etc.)?
In the documentation and in my research on this problem, I analysed zcash-monitor.py from Ageis, zcashd_exporter from zcash-hackworks, the hosh.zec.rocks demo, and the infra work from other teams. I will also compare it with Bitcoin exporters and the Ethereum ecosystem so people have a baseline to compare against.
Can you explain how your project is different and what it adds compared to what already exists?
This is a good question, and I hope it doesn’t cause confusion. What already exists are loose pieces, like a puzzle spread across the forum and GitHub with some pieces missing: native endpoints, old scripts, and a couple of demos. My project builds one single exporter that works for both zcashd and Zebra, and on top of it around ten or more production-ready Grafana dashboards, including one specifically for the zcashd→Zebra migration and another for privacy metrics like Sapling, Orchard, and shielded usage. It also adds ready-to-use alert rules with justified thresholds, separated by severity, plus a production package with support for Kubernetes and bare-metal deployment and documentation that takes people from zero to a full dashboard, with a clear focus on transaction metrics for both zcashd and Zebra.
Is your stack going to be Prometheus + Alertmanager + Grafana as the standard, or something different?
In the proposed stack, Prometheus is the time-series database and alert engine, Alertmanager routes alerts by severity and channel and handles silencing and grouping, and Grafana is the visualization and dashboard layer. Everything is self-hosted to avoid centralization (a sketch of how the pieces wire together follows below).
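A sketch of how the three pieces could be wired together in prometheus.yml. The two ports are assumptions: both zcashd’s -prometheusport and zebrad’s [metrics] endpoint_addr are operator-configured, so substitute your own values:

```yaml
# prometheus.yml sketch: load the rule file, talk to Alertmanager,
# and scrape both node implementations
rule_files:
  - zcash-alerts.yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
scrape_configs:
  - job_name: zcashd
    static_configs:
      - targets: ["zcashd-host:9969"]   # whatever -prometheusport is set to
  - job_name: zebrad
    static_configs:
      - targets: ["zebrad-host:9999"]   # whatever metrics.endpoint_addr is set to
```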
Are you going to deliver alert rule files that can be imported as drop-ins (like zcash-alerts.yaml), or only loose examples?
The plan is to deliver files ready to use: one file with the rule groups for critical, warning, and info alerts, and an example configuration so operators only need to replace their Slack, Discord, email, or PagerDuty channels. The documentation will aim to let operators literally copy the files, adjust a few parameters for their case, and have the alerts working.
Have you already checked if there are alert definitions you can reuse or improve, instead of creating everything from zero?
The alert design is based on rules and patterns used in Bitcoin and Ethereum exporters, for example block lag, zero peers, disk usage, and high mempool. From there we will reuse the public configs that already exist in projects like zcashd_exporter and other community tools. For best practices we will use time windows and severities and avoid duplicate alerts for the same condition. Basically, we will look at everything that already works for people, reuse it, treat operator feedback as input, and add the things we see are missing.
Okay, I don’t know if I should take this the wrong way, but I think I understand the concern, and sorry if I used AI. All these explanations were made by me; I used the translator, but I tried to write everything myself and give it the human touch an LLM can’t give. I did the research on this problem and thought about it a lot, because before sending this grant I was working on the technical idea and everything we have to check and what is important. The AI was only used for translation and style. If you look, I like to have everything ordered, and I tried to make this message clearer and nicer to read; that’s why it took me more time, but I did all of this by myself. If you have any technical question you want me to answer, just tell me and I will answer it.
I hope you like this way of presenting the information. I will wait for your feedback and questions and I will try to answer as fast as I can. If you have another question I’m happy to help. Sorry for the misunderstanding and thank you for your understanding. And @zancas if you want to read it or give support I really appreciate it, and I also appreciate the support from all of you. I hope we can bring this proposal to reality. Thank you so much.