Hackmas: Pushing Zebra’s Initial Sync Performance Further

This post features work completed by @arya2.

This hackathon was all about one thing: making Zebra sync faster. Initial sync is one of the most demanding phases for any node implementation, and Zebra is no exception. With millions of blocks to download, verify, and commit—especially in the wake of the historic spam attack—the bottlenecks become painfully visible. So I spent the hackathon diving deep into Zebra’s sync pipeline, profiling behavior, testing hypotheses, and experimenting with improvements.

What follows is a walkthrough of what I explored, what I learned, and where Zebra could go next.

:puzzle_piece: Understanding the Baseline

Zebra’s initial sync relies heavily on the checkpoint verifier. In this mode, Zebra doesn’t perform full validation on every block; instead, it focuses on:

  • Downloading block data, and
  • Updating the chain state.

This makes the checkpoint sync much faster than a full validation sync—but it also means the performance bottlenecks shift elsewhere.

With transaction data written to foyer, an object storage engine, instead of RocksDB, Zebra completed a full sync on a very large machine in ~5.5 hours, a significant improvement. Profiling the initial sync with those changes revealed something interesting:

:stopwatch: Writing to the database wasn’t the main bottleneck

Zebra spent far less time writing blocks to disk than the total sync time, so initial sync time wasn’t purely or even mostly constrained by time spent writing to the database.

During most of the initial sync, the queue of blocks waiting to be written to the database was empty, indicating a bottleneck in finding and downloading blocks.

:magnifying_glass_tilted_left: Syncer Behavior: A Hidden Constraint

One of the biggest discoveries was how the syncer requests blocks:

:red_exclamation_mark: Zebra requests one block at a time from a single peer, waits for it, then requests the next.

Even with a large peer set, this serialized request pattern can throttle throughput.

To test this, I ran a comparison: a full sync using only RocksDB, requesting blocks from multiple peers concurrently. It completed in 6 hours on the same large machine.

Despite being slower than the initial sync using foyer, this test showed that parallel block fetching can, on its own, significantly improve initial sync time relative to the current baseline.

This suggests that Zebra’s single-block request pattern is a meaningful constraint.

Additionally, measurements showed that the block write task was still starved of blocks to validate and commit during much of the initial sync, while about 80 peer connections sat unready in the peer set. This points to a need to request multiple blocks at once per network request.
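The contrast between the two request patterns can be sketched with a toy model (Python for illustration; `fetch_block`, the simulated latency, and the peer count are stand-ins, not Zebra's actual networking code):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_block(height: int, latency: float = 0.01) -> bytes:
    """Stand-in for requesting one block from one peer."""
    time.sleep(latency)  # simulated network round trip
    return height.to_bytes(4, "big")

def fetch_serial(heights):
    # Current pattern: request one block, wait for it, request the next.
    return [fetch_block(h) for h in heights]

def fetch_parallel(heights, peers: int = 8):
    # Alternative pattern: keep requests in flight to many peers at once.
    with ThreadPoolExecutor(max_workers=peers) as pool:
        return list(pool.map(fetch_block, heights))

heights = range(32)

t0 = time.monotonic()
serial = fetch_serial(heights)
t_serial = time.monotonic() - t0

t0 = time.monotonic()
parallel = fetch_parallel(heights)
t_parallel = time.monotonic() - t0

assert serial == parallel    # same blocks either way
assert t_parallel < t_serial # concurrency hides per-request latency
```

The same blocks arrive either way; the parallel version simply overlaps the network round trips instead of paying for them one after another.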

:bug: A Syncer Bug That Hurts Performance

Another issue surfaced during testing:

:red_exclamation_mark: Sometimes all block download and verify tasks time out simultaneously

When this happens, the syncer restarts the entire process. This obviously hurts performance and adds unnecessary churn.

Fixing this bug alone could smooth out sync behavior significantly.

:package: Improving Block Flow: Checkpoints, Lookahead, and Buffers

A major theme of the hackathon was improving the flow of blocks through the pipeline.

:small_blue_diamond: More frequent checkpoint commits

Currently, Zebra commits checkpoints every ~400 blocks. Reducing this to 50 blocks or fewer helps maintain a steadier flow of:

  • Downloaded blocks

  • Validated blocks

  • Blocks ready to write

This reduces stalls and keeps the pipeline fed.
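The effect of denser checkpoints can be modeled with a toy commit loop (Python for illustration; the interval values mirror the ~400 vs 50 block figures above, and `commit_bursts` is a hypothetical helper, not Zebra code):

```python
def commit_bursts(total_blocks: int, interval: int) -> list[int]:
    """Group verified blocks into commit bursts of `interval` blocks.

    Returns the burst sizes the write task would see. A smaller interval
    means more frequent, smaller writes and a steadier pipeline.
    """
    bursts = []
    pending = 0
    for _ in range(total_blocks):
        pending += 1
        if pending == interval:
            bursts.append(pending)
            pending = 0
    if pending:
        bursts.append(pending)
    return bursts

coarse = commit_bursts(2000, 400)  # current: 5 large bursts of 400 blocks
dense = commit_bursts(2000, 50)    # denser: 40 small bursts of 50 blocks
```

For the same 2,000 blocks, the denser interval trades a few large, bursty writes for many small, regular ones, which is what keeps the downstream stages busy.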

:small_blue_diamond: Increasing the lookahead limit

A larger lookahead window means Zebra can download and prepare more blocks in advance. This also smooths out the pipeline—but it comes at the cost of higher memory usage.
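The memory cost of a larger lookahead window is easy to estimate. A minimal sketch, where the 2 MB average block size and the window sizes are assumptions for illustration, not measured Zcash figures:

```python
def lookahead_memory_mb(lookahead_blocks: int, avg_block_mb: float = 2.0) -> float:
    """Rough upper bound on memory held by blocks buffered in the lookahead window."""
    return lookahead_blocks * avg_block_mb

# Doubling the window doubles the worst-case buffer memory.
small = lookahead_memory_mb(2_000)  # 4000.0 MB
large = lookahead_memory_mb(4_000)  # 8000.0 MB
```

The linear relationship is the whole trade-off: a bigger window smooths the pipeline, but worst-case memory grows in direct proportion.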

:small_blue_diamond: Adding a buffer of validated blocks

By keeping a buffer of non-finalized, validated blocks ready to write, Zebra can:

  • Write blocks more consistently

  • Validate dependent blocks more quickly

  • Reduce idle time in the pipeline

This change alone reduced initial sync time by ~13%.
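The buffering idea can be sketched as a bounded queue between a validation stage and a write stage (Python threads as a conceptual model; the stage names and buffer size are illustrative, not Zebra's actual tasks):

```python
import queue
import threading

def run_pipeline(n_blocks: int, buffer_size: int = 100) -> list[int]:
    """Decouple validation from writing with a bounded buffer of validated blocks."""
    validated = queue.Queue(maxsize=buffer_size)  # validated, not-yet-written blocks
    written = []

    def validator():
        for height in range(n_blocks):
            validated.put(height)  # blocks only if the writer falls far behind
        validated.put(None)        # sentinel: validation finished

    def writer():
        while (block := validated.get()) is not None:
            written.append(block)  # stand-in for the database commit

    threads = [threading.Thread(target=validator), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return written

assert run_pipeline(500) == list(range(500))
```

Because the writer always has a stock of validated blocks to drain, it never idles waiting on validation, which is the consistency gain described above.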

:gear: Smarter State Access and Parallelization

A few more targeted optimizations showed promising potential:

:small_blue_diamond: Pre-reading UTXOs

UTXOs spent by a block can be read from the database in advance by a separate task, rather than just before the block is written, so these reads never block block writes.
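A minimal sketch of the prefetching idea, with a dict standing in for the database and hypothetical block/UTXO shapes (none of these names come from Zebra):

```python
import queue
import threading

def write_blocks_with_prefetch(blocks, utxo_db):
    """Read each block's spent UTXOs on a separate task, ahead of the write step."""
    prefetched = queue.Queue(maxsize=8)

    def prefetcher():
        for block in blocks:
            # Look up spent UTXOs ahead of time so the write step
            # never has to stop and read from the database.
            spent = {outpoint: utxo_db[outpoint] for outpoint in block["spends"]}
            prefetched.put((block, spent))
        prefetched.put(None)  # sentinel: no more blocks

    results = []
    worker = threading.Thread(target=prefetcher)
    worker.start()
    while (item := prefetched.get()) is not None:
        block, spent = item
        results.append((block["height"], sorted(spent)))  # stand-in for the commit
    worker.join()
    return results

utxo_db = {"a": 10, "b": 20, "c": 30}
blocks = [{"height": 1, "spends": ["a"]}, {"height": 2, "spends": ["b", "c"]}]
assert write_blocks_with_prefetch(blocks, utxo_db) == [(1, ["a"]), (2, ["b", "c"])]
```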

:small_blue_diamond: Offloading RPC-only data updates

Some state updates are only needed for RPC methods. Moving these updates to a separate task prevents them from blocking block writes.
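The offloading pattern is the same producer/consumer shape: the write path only enqueues the RPC-only update and moves on. A sketch under assumed names (the "RPC index" here is a placeholder for whatever RPC-only data is being maintained):

```python
import queue
import threading

rpc_updates = queue.Queue()
rpc_index = {}  # stand-in for RPC-only state (e.g. an extra lookup index)

def rpc_index_worker():
    """Apply RPC-only updates off the critical block write path."""
    while (update := rpc_updates.get()) is not None:
        key, value = update
        rpc_index[key] = value

worker = threading.Thread(target=rpc_index_worker)
worker.start()

committed = []
for height in range(5):
    committed.append(height)                       # critical path: commit the block
    rpc_updates.put((height, f"index-{height}"))   # cheap, non-blocking enqueue

rpc_updates.put(None)  # sentinel: shut down the worker
worker.join()
assert committed == [0, 1, 2, 3, 4]
```

The commit loop never waits on the index update; at worst the RPC data lags slightly behind the chain tip while the worker catches up.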

:small_blue_diamond: Parallelizing block hash computation

Checkpoint-verified block hashes are currently computed sequentially. This is low-hanging fruit: they could be parallelized easily, reducing latency in the checkpoint pipeline.
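Hashing is embarrassingly parallel because each block's hash depends only on its own bytes. A sketch using double SHA-256 over dummy headers as a stand-in for the real block hash computation:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def block_hash(header: bytes) -> bytes:
    """Double SHA-256, a stand-in for computing a block hash from its header."""
    return hashlib.sha256(hashlib.sha256(header).digest()).digest()

def hash_sequential(headers):
    # Current approach: hash one header after another.
    return [block_hash(h) for h in headers]

def hash_parallel(headers, workers: int = 4):
    # Independent inputs, so the work can be spread across workers
    # while preserving the original order of results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(block_hash, headers))

headers = [i.to_bytes(80, "big") for i in range(64)]  # dummy 80-byte headers
assert hash_parallel(headers) == hash_sequential(headers)
```

Since there are no data dependencies between hashes, parallelizing changes only the latency, never the results.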

:chequered_flag: Takeaways and Next Steps

This hackathon made it clear that Zebra’s initial sync performance is already strong—but there’s still meaningful room for improvement. The biggest opportunities lie in:

  • Downloading multiple blocks per network request and concurrently requesting blocks from all available peers until reaching the lookahead limit.

  • Smoothing the flow of downloaded and verified blocks toward the database with denser checkpoints, larger lookahead limits, and an in-memory chain state during the checkpoint sync, and fixing the syncer bug that causes download and verify tasks to time out together.

  • Narrowing the responsibilities of the block write task, increasing RocksDB’s configured parallelism, and using RocksDB’s integrated BlobDB object storage for some column families so downloaded blocks can be added to the database as quickly as possible.

Each of these changes chips away at the bottlenecks that slow down initial sync, especially during high-block-count periods like the spam attack.

The work isn’t finished, but the path forward is clearer than ever. With a few targeted improvements, Zebra can become even faster, more resilient, and more efficient during initial sync.


Outstanding analysis! It would be cool if operators could opt in to sending syncing time data in a unified way so we can start to visualize and graph the data from around the world. If this could be done in a privacy-preserving way, I would be all for it.
