Zebra State Snapshot And Fast Sync Infrastructure

Hello everyone:

I am submitting a proposal to Zcash Community Grants (ZCG) to build a State Snapshot and Fast Sync infrastructure module for Zebra, the Rust full node implementation of Zcash.

This project focuses on exporting, distributing, verifying, and restoring finalized chain state, allowing new Zebra nodes to bootstrap quickly from trusted snapshots instead of syncing from genesis.

Project Overview

This proposal introduces a reusable, secure state snapshot system for Zebra that enables:

•   Point-in-time snapshots of finalized Zcash chain state

•   Fast node bootstrap from verified snapshots

•   Dramatically reduced cold-start time for full nodes

•   More efficient CI/CD, testing, and elastic deployment workflows

Motivation

Running a new Zebra full node today requires replaying the entire blockchain from genesis, which creates several challenges:

High Cold-Start Cost

•   Full sync can take many hours or days

•   Heavy CPU, IO, and memory usage

•   Impractical for frequent or automated deployments

Poor Developer CI/CD Experience

•   Tests and integration environments must wait for full sync

•   Hard to spin up multiple nodes with identical state

•   Slows development, testing, and experimentation

Lack of Standardized Snapshot Infrastructure

•   No official or community-maintained snapshot solution exists for Zebra

•   Different teams re-implement partial solutions

High-Level Technical Approach

The proposed solution consists of:

1. Snapshot Generation

•   Extract finalized state from Zebra’s RocksDB storage

•   Support Transparent, Sapling, and Orchard pools

•   Streaming export for large state sizes
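As a rough illustration of the streaming export step, the sketch below reads one RocksDB column family in key order and writes length-prefixed key/value frames to disk. The column family name, the on-disk framing, and the use of the rust-rocksdb and anyhow crates are assumptions for this proposal, not Zebra's actual schema; it also assumes a recent rust-rocksdb version whose iterators yield Results.

```rust
// Sketch only: the column family name and frame layout are illustrative
// assumptions, not Zebra's actual RocksDB schema.
use rocksdb::{IteratorMode, Options, DB};
use std::fs::File;
use std::io::{BufWriter, Write};

fn export_column_family(db_path: &str, cf_name: &str, out_path: &str) -> anyhow::Result<()> {
    // Open the finalized state read-only so a running node is not disturbed.
    let db = DB::open_cf_for_read_only(&Options::default(), db_path, [cf_name], false)?;
    let cf = db.cf_handle(cf_name).expect("column family exists");
    let mut out = BufWriter::new(File::create(out_path)?);

    // Stream key/value pairs in key order; memory stays bounded because
    // nothing is collected into an in-memory map.
    for item in db.iterator_cf(cf, IteratorMode::Start) {
        let (key, value) = item?;
        out.write_all(&(key.len() as u32).to_le_bytes())?;
        out.write_all(&key)?;
        out.write_all(&(value.len() as u32).to_le_bytes())?;
        out.write_all(&value)?;
    }
    out.flush()?;
    Ok(())
}
```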

2. Custom Snapshot Format (.zsnap)

•   Designed for large-scale data

•   Segmented layout for incremental verification

•   Built-in metadata (network, height, tree roots, versioning)

•   Optional compression (LZ4 / Zstd)
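To make the format concrete, here is one possible shape of the .zsnap header. All field names, the serde-based encoding, and the segment layout are illustrative assumptions for this proposal, not a finalized specification.

```rust
// Illustrative .zsnap header; names and encoding are proposal-level assumptions.
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
pub struct SnapshotHeader {
    /// Format version, so future revisions remain detectable.
    pub format_version: u16,
    /// "mainnet" or "testnet".
    pub network: String,
    /// Finalized block height the snapshot was taken at.
    pub height: u32,
    /// Hash of the block at `height`.
    pub block_hash: [u8; 32],
    /// Note commitment tree roots at `height`.
    pub sapling_root: [u8; 32],
    pub orchard_root: [u8; 32],
    /// Compression applied to the data segments.
    pub compression: Compression,
    /// Per-segment sizes and checksums, enabling incremental verification
    /// and resumable downloads.
    pub segments: Vec<SegmentInfo>,
}

#[derive(Debug, Serialize, Deserialize)]
pub enum Compression {
    None,
    Lz4,
    Zstd,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct SegmentInfo {
    pub offset: u64,
    pub length: u64,
    /// BLAKE2b-256 checksum of the (compressed) segment bytes.
    pub checksum: [u8; 32],
}
```

Keeping per-segment checksums in the header is what makes incremental verification and resumable downloads possible further down the pipeline.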

3. Multi-Level Verification

•   Integrity checks (checksums)

•   Tree root verification (Sapling / Orchard / Sprout)
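A minimal sketch of the integrity-check layer, assuming the hypothetical SegmentInfo metadata above and the blake2 crate: each segment is re-hashed in fixed-size chunks and compared against the checksum recorded in the header.

```rust
// Re-hash one segment of the snapshot file and compare it to the checksum
// stored in the header; the SegmentInfo fields come from the header sketch above.
use blake2::{digest::consts::U32, Blake2b, Digest};
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

type Blake2b256 = Blake2b<U32>;

fn verify_segment(
    file: &mut File,
    offset: u64,
    length: u64,
    expected: &[u8; 32],
) -> std::io::Result<bool> {
    file.seek(SeekFrom::Start(offset))?;
    let mut hasher = Blake2b256::new();
    let mut remaining = length;
    let mut buf = vec![0u8; 1 << 20]; // 1 MiB read buffer keeps memory bounded

    while remaining > 0 {
        let to_read = remaining.min(buf.len() as u64) as usize;
        file.read_exact(&mut buf[..to_read])?;
        hasher.update(&buf[..to_read]);
        remaining -= to_read as u64;
    }
    Ok(hasher.finalize().as_slice() == expected.as_slice())
}
```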

4. Snapshot Loading & Fast Sync

•   Restore Zebra state directly from snapshot

•   Pre-load verification

•   Streaming import with controlled memory usage

•   Designed to handle 200–250GB+ state sizes predictably
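The loading side could look roughly like the sketch below: a streaming restore loop that reads the length-prefixed frames from the export sketch and applies them in fixed-size RocksDB write batches, so memory use stays flat regardless of snapshot size. The frame format, batch threshold, and crate choices are assumptions, and column family routing is omitted for brevity.

```rust
// Streaming restore with bounded memory: read frames, buffer them into a
// WriteBatch, and flush the batch at a fixed size threshold.
use rocksdb::{WriteBatch, DB};
use std::io::{BufReader, Read};

fn restore_stream<R: Read>(db: &DB, reader: R) -> anyhow::Result<()> {
    let mut reader = BufReader::new(reader);
    let mut batch = WriteBatch::default();
    let mut batch_bytes = 0usize;

    loop {
        let Some(key) = read_frame(&mut reader)? else { break };
        let value = read_frame(&mut reader)?
            .ok_or_else(|| anyhow::anyhow!("truncated snapshot stream"))?;
        batch_bytes += key.len() + value.len();
        batch.put(&key, &value);

        // Flush in fixed-size batches so memory use does not grow with the
        // total snapshot size.
        if batch_bytes >= 64 * 1024 * 1024 {
            db.write(std::mem::take(&mut batch))?;
            batch_bytes = 0;
        }
    }
    db.write(batch)?;
    Ok(())
}

/// Reads one length-prefixed frame; returns None on clean end-of-stream.
fn read_frame<R: Read>(reader: &mut R) -> anyhow::Result<Option<Vec<u8>>> {
    let mut len = [0u8; 4];
    if let Err(e) = reader.read_exact(&mut len) {
        if e.kind() == std::io::ErrorKind::UnexpectedEof {
            return Ok(None);
        }
        return Err(e.into());
    }
    let mut buf = vec![0u8; u32::from_le_bytes(len) as usize];
    reader.read_exact(&mut buf)?;
    Ok(Some(buf))
}
```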

5. Distribution CLI Tooling

•   Snapshot publishing to HTTP / S3 / CDN

•   Resume-capable downloads

•   CLI commands for generate / verify / load / inspect
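The CLI surface could be a thin wrapper over the pieces above. The sketch below uses clap to outline the four proposed subcommands; the binary name, subcommands, and argument names are hypothetical proposals, not an existing zebrad interface.

```rust
// Hypothetical CLI layout for the snapshot tooling, sketched with clap's
// derive API; all names are proposal-level placeholders.
use clap::{Parser, Subcommand};

#[derive(Parser)]
#[command(name = "zebra-snapshot")]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Export finalized state from a Zebra data directory into a .zsnap file.
    Generate { state_dir: String, out: String },
    /// Check integrity and consensus invariants of a snapshot file.
    Verify { snapshot: String },
    /// Restore a snapshot into a fresh Zebra state directory.
    Load { snapshot: String, state_dir: String },
    /// Print snapshot metadata (network, height, roots, segments).
    Inspect { snapshot: String },
}

fn main() {
    let cli = Cli::parse();
    match cli.command {
        Command::Generate { state_dir, out } => todo!("stream state from {state_dir} into {out}"),
        Command::Verify { snapshot } => todo!("verify {snapshot}"),
        Command::Load { snapshot, state_dir } => todo!("restore {snapshot} into {state_dir}"),
        Command::Inspect { snapshot } => todo!("print header of {snapshot}"),
    }
}
```

A publishing workflow would then be something like `zebra-snapshot generate <state_dir> <out>` followed by `zebra-snapshot verify <out>`, while a consuming node would run `zebra-snapshot load <snapshot> <state_dir>` before starting Zebra.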

Expected Impact

For the Zcash ecosystem, this enables:

•   Faster onboarding of new full node operators

•   4–10× faster bootstrap compared to syncing from genesis

•   Node ready in ~2–3 hours instead of 10–24 hours

Deliverables & Milestones

The proposal is structured into three milestones covering:

1.  Snapshot format, core read/write engine, compression, generation

2.  Verification system and snapshot loader

3.  Download system, CLI tools, documentation, and release

Full proposal details are available here:

https://github.com/ZcashCommunityGrants/zcashcommunitygrants/issues/187

Thank you for your time and input.

1 Like

I’m curious, how does this process work now? Manually, with standard shell commands only?

1 Like

Currently, Zebra nodes are typically run either via Docker or directly from the shell, and they sync from genesis using the standard workflow. There is no official snapshot or fast-sync mechanism today.

From a node operator’s perspective, once this proposal is implemented, the goal is that everything can still be done via shell commands: download a snapshot, verify it, and start Zebra.

Snapshot maintenance is a separate role. Snapshot operators need to run a fully synced Zebra node, generate and verify snapshots from the current state, and publish them. Most node operators would only consume snapshots.

I’m interested in the second role; do you know if a step-by-step tutorial exists? I think a direct comparison between “how to do it today” and “how it could be done tomorrow” would also be useful for your proposal.

1 Like

Currently, Zebra nodes are typically run either via Docker or directly from the shell, and they sync from genesis.

After implementation, the operational cost of the second role (snapshot maintainers) is intentionally designed to be very low.

Snapshots do not need to be generated frequently. They could be produced once every three or even six months by running a fully synced Zebra node, executing a single command to generate a snapshot at a finalized height, verifying it, and then uploading it to an official or community mirror. This is closer to publishing an infrastructure artifact than operating a long-running online service.

For regular node operators, using snapshots will be a completely optional optimization. When starting a node, they can choose to enable a “fast bootstrap” option, which will automatically download, verify, and load a snapshot, and then sync the incremental blocks from the snapshot height to the current tip. If startup speed is not a concern, operators can continue to sync from genesis as they do today, with no change in behavior.

1 Like

The idea is neat but the big blocker for this is hosting - who is going to host it and pay for the bandwidth? The DB is huge.

1 Like

How much does a new node operator save in bandwidth & time compared to downloading and processing the blocks?

My concern is that the db does not trim much data, because the privacy set is not trimmable (at the moment). Proofs could be removed, but I think zebra stores the complete block anyway. Therefore, isn’t the zebra db larger than the blockchain?

My second and bigger concern is the loss of security. I wouldn’t trust a third-party db which may have been tampered with. (There could be hashes, but who will ensure they were not tampered with too?)

I think calling it a snapshot is not exact.

2 Likes

There’s already a snapshot on zecrocks on storjshare. It’s uncompressed, so it’s the full 260 GB, but at ~20 MB/s that’s like 3.5 hours. That also assumes a constant rate and a non-stop download; it can’t be paused, it seems (edit: I think that’s not correct, but it’s still inherently unreliable). Normal syncing takes longer but obviously can stop and restart, and the bandwidth bottlenecks are more spread out. I suppose clicking download before bed and hoping the whole 260 GB gets down without incident is probably the way to go. There’s no subsequent verification process, and that, as it stands, would still rely on you trusting zecrocks’ hash of their own thing anyway. However, it is the one that’s part of their self-hosting LWD infra project, and I haven’t heard of any issues with it. In any case, it circumvents full verification of the chain, which node operators should be all about anyway.

1 Like

Thanks for raising this — hosting and bandwidth are important practical considerations for snapshot-based approaches, especially given the size of the database. This question helps clarify the operational assumptions and makes the proposal more complete.

Cloudflare R2 can be used as the snapshot object storage and distribution layer, avoiding the need to operate complex infrastructure. This requires only basic bucket configuration, access control, periodic snapshot uploads, and a custom domain bound to the bucket from which users download snapshots.

Assuming snapshots of approximately 500 GB (allowing for future growth), quarterly uploads, and around 10,000 downloads per month, the estimated cost is ~$7.65 per month: storage at R2’s roughly $0.015 per GB-month comes to ~$7.50, request fees are negligible, and R2 charges no egress. This corresponds to a base annual cost of ~$110, with a conservative upper bound of $200 per year including incidental expenses.

These storage and distribution costs are modest and could be covered as part of a grant if appropriate. From an operational perspective, existing community operators already running similar infrastructure are welcome to host and maintain snapshots on a voluntary basis. If no such maintainer is available, I am willing to handle snapshot uploads and distribution myself to ensure continuity and reliability.

1 Like

1. On compression

Around 80–90% of Zcash chain state consists of cryptographic hash–based data, which is essentially non-compressible. Snapshots are therefore not intended to significantly reduce storage size.

2. On security and verification

Modifying a snapshot can at most put the local node into an invalid state; it cannot affect network consensus validity. From an attack-surface perspective, snapshot tampering can only target double spends or inflation, both of which are rejected by local nullifier and value-invariant verification.

Snapshots are never trusted directly. After download, a node performs full local verification, including file integrity checks (full BLAKE2b scan) and validation of consensus-critical invariants such as nullifier and UTXO uniqueness, value pool bounds, historical consistency (anchors), and chain continuity. Any tampered or inconsistent state will be detected and rejected.

3. On performance

Full verification is faster not by weakening security, but by changing the verification model. It relies on sequential I/O, batched writes, and in-memory deduplication, and skips PoW, signature, and ZK proof checks that were already performed when blocks were accepted, validating only final state invariants.

Traditional block-by-block sync is dominated by repeated cryptographic verification and heavy random I/O, so avoiding these steps yields orders-of-magnitude improvements in end-to-end sync time.
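As one concrete example of the in-memory deduplication mentioned above, a nullifier-uniqueness pass over the restored state can be a single sequential scan into a set rather than per-block random lookups. The iterator, the 32-byte nullifier representation, and the hex crate are assumptions in this sketch.

```rust
// Sequential nullifier-uniqueness check: a repeated nullifier would mean the
// snapshot encodes a double spend, so the snapshot is rejected.
use std::collections::HashSet;

fn check_nullifier_uniqueness(
    nullifiers: impl Iterator<Item = [u8; 32]>,
) -> Result<(), String> {
    let mut seen: HashSet<[u8; 32]> = HashSet::new();
    for nf in nullifiers {
        if !seen.insert(nf) {
            return Err(format!("duplicate nullifier: {}", hex::encode(nf)));
        }
    }
    Ok(())
}
```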

Illustrative Cost Drivers of Block-by-Block Sync (Order-of-Magnitude)

Traditional block-by-block synchronization is dominated by a small number of structurally expensive operations:

•   Zero-knowledge proof verification: approximately 10 milliseconds per shielded transaction, or roughly 28 hours for around 10 million shielded transactions

•   Transaction signature verification: approximately 100 microseconds per signature, or roughly 3 hours for around 100 million signatures

•   Random I/O for UTXO and nullifier lookups: approximately 100 microseconds per lookup, or roughly 3 hours for around 100 million accesses

These order-of-magnitude estimates show that synchronization time is dominated by repeated cryptographic verification and heavy random I/O, rather than raw data transfer.

In the future, a practical snapshot design must also support large state sizes (300–500 GB) with chunked, resumable downloads to avoid the failure risks of monolithic transfers.

More importantly, as described above, snapshots are never trusted directly: after download, the node performs full local verification of consensus-critical invariants (nullifier and UTXO uniqueness, value bounds and value pool constraints, shielded state consistency via anchors, and block height continuity) without replaying chain history. Because that verification relies on sequential I/O, batched processing, and in-memory deduplication rather than the heavy random I/O and repeated cryptographic verification of block-by-block sync, end-to-end bootstrap time is much shorter.

I’d rather not have the full node verification skipped, because it leverages the validation code in zcashd/zebra. If it is skipped, we need to do a threat analysis to check that the “lightweight” checks are sufficient. I’d like to have zebra support reading the blockchain data from a file (maybe it does; bitcoind supports reading a bootstrap.dat file).

2 Likes

This concern is valid. The intent is not to skip verification, but to perform full verification of a core set of consensus-critical invariants during snapshot loading, with the goal of ensuring correctness of the monetary state and preventing double spends and inflation. Snapshots are never trusted. A node first performs full local state-level verification, focusing on nullifier uniqueness for double-spend prevention, UTXO uniqueness and value bounds for inflation prevention, value pool constraints, shielded state consistency (anchors and commitments), and block height continuity, thereby ensuring that account balances and the overall monetary state are correct and internally consistent.

Once the snapshot has been successfully verified and loaded, the node transitions to normal incremental block synchronization. All subsequent blocks are validated using the standard full-node consensus pipeline, identical to syncing from genesis in terms of security guarantees. Even in the extreme case where a snapshot has been maliciously tampered with prior to verification, subsequent consensus validation would quickly surface the inconsistency. The node would refuse to continue serving or accepting transactions, and any transactions generated from or propagated based on an invalid state would be rejected by other honest nodes under consensus rules, preventing them from being confirmed or finalized on the network.

It would be a distinct code base that needs to perform the right subset of the validation rules. Too much and it won’t be fast; too little and it won’t be safe.
It needs to be audited, maintained, tested, etc., while being used only during bootstrap. Protocol upgrades will impact it too.
All in all, I would rather use the regular validation code path that is used by every node and every block, because I don’t think the time it saves me is worth the risk.

It’d be good to save on the download though.

Your concerns are entirely reasonable, and I agree with the engineering and security value of reusing the existing full-node validation code path, especially to avoid introducing a separate validation implementation that is only used during bootstrap but still requires long-term maintenance and auditing.

I would like to clarify that the goal here is not to introduce a new “lightweight” validation implementation. Instead, the design continues to rely on existing consensus rules and validation logic, while changing the data source and verification granularity during bootstrap, shifting from block-by-block historical replay to state-level invariant verification. Once incremental synchronization begins, the node fully returns to the standard full-node validation pipeline.

I understand and respect the preference to always use the regular block-level validation path during startup, which is a very conservative and sound choice. This proposal is intended to address use cases with stronger requirements around cold-start time, automation, and operational cost, rather than to replace the existing startup model for all nodes.

It doesn’t, but this is something we would probably like to support.

I think the trust issues could be streamlined by using the checkpoints as a comparison.

Zebra supports checkpoints, which are basically a list of block hashes. Checkpointed blocks are validated by simply checking their block hashes, and they are trusted if the hash matches, which allows skipping most of the consensus rule checks. ZF trusts the checkpoints because we update them ourselves, and the community trusts the checkpoints because they trust ZF (and anyone can validate the hashes if they want).

Something similar could be done for the snapshots, by having checkpoints on them which would be updated by ZF (and verifiable by anyone). This would remove the need for a different validation path for them; they would be validated by simply checking the hash. The only tricky part would be generating a “stable” snapshot format (i.e. snapshots from different nodes at the same height need to be identical). I have no idea how to do that with RocksDB, but it seems doable.

1 Like

Thanks for the suggestion. After further technical investigation, I believe that building snapshot verification on top of Zebra’s existing checkpoint mechanism is a more robust and broadly acceptable direction.

This design fully reuses an already accepted trust model by treating snapshots analogously to checkpointed blocks and verifying them via deterministic hash comparison, thereby avoiding the introduction of a new or bootstrap-only validation path.

The key challenge lies in producing a deterministic snapshot format, such that snapshots generated by different nodes at the same height are byte-identical. While RocksDB does not provide this out of the box, investigation suggests it is a feasible, well-scoped engineering problem that can be pursued independently. Once addressed, snapshot verification would align cleanly with existing consensus trust boundaries.

The database files are portable between hosts but I doubt they are byte identical. It is difficult (and usually not required) because db engines use multiple threads for better concurrency and performance.

You’re right that RocksDB does not guarantee byte-identical database files across different nodes, and I agree with that assessment.

This proposal does not rely on the determinism of RocksDB’s physical files, nor on copying database directories. Instead, it follows the same principle as checkpoints: at a higher level, a deterministic representation of the consensus-critical state is defined, and a hash of that representation is used for verification.

Concretely, determinism is achieved through a canonical traversal and serialization of the consensus state (for example, fixed component ordering, deterministic key ordering, and inclusion of only consensus-critical fields), producing a state digest that can be compared by hash. As long as honest nodes agree on the state at a given height, they will compute the same result.
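A minimal sketch of what such a canonical digest could look like, assuming the consensus-critical entries can be grouped into named components of key/value byte pairs; the component tags, length-prefixed framing, and BLAKE2b-256 output size are illustrative choices rather than a finalized format.

```rust
// Canonical state digest: hash components and keys in a deterministic order
// so any two nodes with the same logical state produce the same digest.
use blake2::{digest::consts::U32, Blake2b, Digest};
use std::collections::BTreeMap;

type Blake2b256 = Blake2b<U32>;

fn state_digest(components: &BTreeMap<&str, BTreeMap<Vec<u8>, Vec<u8>>>) -> [u8; 32] {
    let mut hasher = Blake2b256::new();
    // BTreeMap iteration is sorted, so both the component order and the key
    // order within each component are deterministic across nodes.
    for (tag, entries) in components {
        hasher.update(tag.as_bytes());
        for (key, value) in entries {
            hasher.update((key.len() as u64).to_le_bytes());
            hasher.update(key);
            hasher.update((value.len() as u64).to_le_bytes());
            hasher.update(value);
        }
    }
    hasher.finalize().into()
}
```

Because the traversal order is fixed by the format rather than by RocksDB’s file layout, any two honest nodes that agree on the state at a given height compute the same digest.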

In this sense, determinism is enforced by the snapshot format and verification rules, not by the storage engine, and snapshot verification aligns with the existing checkpoint trust model.