Ziggurat: the Zcash Network Stability Framework

A Ziggurat is a rectangular stepped tower that uses precise measurements to ensure that each of the foundational platforms can support the layers above.

Based on Equilibrium’s experience stress testing projects such as Rust IPFS and Aleo, this metaphor can be applied to network testing by defining three layers, each one building upon the one below it.

Ziggurat will start by testing conformance, making sure that each tested node adheres to the network protocol. With only zcashd this was perhaps simple, but with zebra coming out, enforcing a specification that a bona fide Zcash node must satisfy becomes critical. Once this foundation is established, Ziggurat will stress test performance using an arbitrary number of test nodes in various topologies. Finally, it will test resistance to bad actors by simulating malicious behavior.
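
As a rough illustration of what a conformance check can look like, here is a minimal Rust sketch that validates the 24-byte message header Zcash inherits from Bitcoin's wire format (4-byte magic, 12-byte NUL-padded command, 4-byte little-endian body length, 4-byte checksum). The names `Header`, `HeaderError` and `parse_header` are hypothetical and are not taken from Ziggurat. The sketch does not verify the checksum, because that needs SHA-256, which is not in the Rust standard library.

```rust
/// Length of the message header in the Bitcoin-style wire format.
const HEADER_LEN: usize = 24;

/// Parsed header fields (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
pub struct Header {
    pub command: String,
    pub body_len: u32,
    pub checksum: [u8; 4],
}

#[derive(Debug, PartialEq)]
pub enum HeaderError {
    TooShort,
    WrongMagic,
    BadCommand,
}

/// Parses a header and rejects the framing errors a conformant node
/// should never produce. The checksum is extracted but NOT verified.
pub fn parse_header(bytes: &[u8], expected_magic: [u8; 4]) -> Result<Header, HeaderError> {
    if bytes.len() < HEADER_LEN {
        return Err(HeaderError::TooShort);
    }
    if bytes[0..4] != expected_magic {
        return Err(HeaderError::WrongMagic);
    }

    // The command is printable ASCII followed only by NUL padding.
    let raw = &bytes[4..16];
    let end = raw.iter().position(|&b| b == 0).unwrap_or(raw.len());
    let (name, padding) = raw.split_at(end);
    let name_ok = !name.is_empty() && name.iter().all(|b| b.is_ascii_graphic());
    let padding_ok = padding.iter().all(|&b| b == 0);
    if !name_ok || !padding_ok {
        return Err(HeaderError::BadCommand);
    }
    let command = String::from_utf8(name.to_vec()).map_err(|_| HeaderError::BadCommand)?;

    let body_len = u32::from_le_bytes([bytes[16], bytes[17], bytes[18], bytes[19]]);
    let checksum = [bytes[20], bytes[21], bytes[22], bytes[23]];
    Ok(Header { command, body_len, checksum })
}
```

The magic is passed in as a parameter because it differs per network. A test would feed the bytes received from the node under test into `parse_header` and fail on any `Err`.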

Read on; we’ll be happy to answer any questions (likely on EU time).

Just a quick message to say THANK YOU to ECC, ZOMG, and the community for funding our proposal. We’re really looking forward to engaging with people as we move through the process of enriching the Zcash network.

Hey folks! :wave:

We believe we have reached the successful end of Milestone 1, and thus I’d like to post an update on the Ziggurat project.

First, we inspected the code and runtime behavior of both zebra and zcashd, and then we drafted a series of tests to batter the nodes with. After initial review and feedback from the core devs, we are proud to announce that we have made the eqlabs/ziggurat repository public, and that you can now read the first draft of the Ziggurat spec.

So far we have:

  • 17 conformance tests
  • 2 performance tests
  • 6 resistance tests

There’s still plenty of potential for additional tests, and plenty of work to do even beyond implementation. There are also still a few open questions that we would now like the community’s feedback on, particularly from those with experience running zcashd and zebra:

  • Any particular considerations or requirements around node setup and teardown (CI, caches, test data or preloaded state)?
  • Any notable known differences between Zcashd and Zebra nodes, especially with respect to the network protocol or assumptions connected to it?
  • Any relevant complex peering or sync cases to give particular attention to?
  • Any particular malicious angles, design compromises or potential problem areas in need of extra test coverage?
  • If you do have experience running a node, how do you think we should define “reasonable load” and “heavy load” for load testing? This can be in terms of number of peers, message frequency, or any metric you select.

We would love to hear from people out in the wild, and use your experiences to inform our future work.

Thank you so much, grantors and community alike.

Thank you for the update!

Can you elaborate on where the feedback/conversation loop with the ECC and ZFND is taking place so ZOMG can better follow along?

Sure, it’s mostly in the #testing channel of the Zcash Dev Discord.

Can you please add some details on what preloaded state means?

Let’s say I want to speed up the initial sync and I already have a “trusted” copy of the full chain that I can import (think S3 or a different machine). Are you considering adding a test for validating the state?

Hi vamsi, thank you for the question. By preloaded state, we mean testing a node that isn’t starting from scratch with its chain state. This could be useful for testing block propagation and, more broadly, the chain syncing mechanism. We haven’t currently planned a test specifically for validating a full chain’s state beyond a few blocks, though we’re open to suggestions regarding more complex scenarios. The one you mention could definitely be envisaged, assuming a full chain is available. It should probably also be capped in size to avoid overly long-running tests.

Hi aphelionz,

Welcome! Total Rust newb here, but this looks like an exciting project! :nerd_face: :crab: :zcash: :zebra:

Wanted to pass along some information regarding your questions.

“Any particular considerations or requirements around node setup and teardown (CI, caches, test data or preloaded state)?”

CI
If you intend to use Docker for CI-related tasks, Docker Hub has all the builders for various platforms (any image named zcashd-build-* will build for the default Linux host). Note these do not have Python layers to run RPC tests, but we will be adding these soon as zcashd-worker-*. On most platforms this is trivial to add, but to avoid issues on older platforms I typically recommend the zcashd Ubuntu 20.04 image (Docker Hub).

Caches
It is strongly encouraged to cache the output of fetch-params.sh in something you operate, as the default mirror is rate limited. We have used a few options to cache these and other artifacts, and IPFS has worked well for this, depending on the requirements.

Test Data or Preloaded State
In general, I never preserve cache between tests and the majority of the underlying scripts “should” gracefully clean this up for you. However there are ways to disable this if you want to archive test cache for other purposes.

Preloading the node with a given chain can save a TON of time, depending on your ISP and system hardware. It is recommended to cache these in something you operate/manage per your requirements. Also, depending on your test requirements, it is generally recommended to have two chain copies per network. For example, on mainnet, have one chain that is built from a node without txindex=1 in zcash.conf, and another with txindex=1 in zcash.conf. This spares the operator a reindex/rescan for preloaded nodes that need chain metadata from txindex=1. When generating these chain snapshots from blocks and chainstate, it is important to ensure the node is completely stopped. Otherwise, you risk corrupting the snapshots.
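
To illustrate the two-snapshots-per-network idea, here is a small Rust sketch that picks a snapshot based on whether a zcash.conf enables txindex. The function names and the snapshot naming scheme (`mainnet-txindex.tar` and so on) are made up for this example. The parsing is deliberately simplistic and assumes plain `key=value` lines with `#` comments, which may not cover everything zcashd accepts.

```rust
/// Returns true if any non-comment line of the config sets `txindex=1`.
/// Simplified parsing: assumes `key=value` lines and `#` comments.
pub fn txindex_enabled(conf: &str) -> bool {
    conf.lines()
        // Drop everything after a '#' and trim surrounding whitespace.
        .map(|line| line.split('#').next().unwrap_or("").trim())
        .filter_map(|line| line.split_once('='))
        .any(|(key, value)| key.trim() == "txindex" && value.trim() == "1")
}

/// Chooses which cached chain snapshot to preload.
/// The file naming scheme is hypothetical.
pub fn snapshot_for(network: &str, conf: &str) -> String {
    let flavor = if txindex_enabled(conf) { "txindex" } else { "plain" };
    format!("{}-{}.tar", network, flavor)
}
```

With this, `snapshot_for("mainnet", "txindex=1")` selects the txindex snapshot, so a preloaded node never needs a reindex just because its config differs from the one the snapshot was built with.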

“Any notable known differences between Zcashd and Zebra nodes, especially with respect to the network protocol or assumptions connected to it?”

I’m not familiar enough with zebrad or Rust to speak to this.

“Any relevant complex peering or sync cases to give particular attention to?”

Operating nodes on the expected best chain is generally fairly straightforward. There are some minor issues with operating nodes with Tor configurations. If you intend to spin up your own testnets operating N nodes, there is a whole other layer of cases to consider.

“Any particular malicious angles, design compromises or potential problem areas in need of extra test coverage?”

I can’t speak to specific malicious angles or design compromises; other core devs and/or security folks could potentially provide this information. We are slowly getting the majority of the pieces together to finish up the last mile of longer-running tests, but we have yet to overlap all the items mentioned above with zebrad, so these are uncharted waters.

A couple of tools that can aid tremendously for the scope of work you mentioned:

“If you do have experience running a node, how do you think we should define “reasonable load” and “heavy load” for load testing? This can be in terms of number of peers, message frequency, or any metric you select.”

I typically baseline this with a default zcash.conf on a 2-CPU, 8 GB system, as this is the minimum hardware needed to build/run a zcash node. From there you can start to model some of the bounds based on the test criteria to better understand “reasonable”, “average”, and “heavy” load. It then becomes clearer how to model the given load for a given system/network config as it scales up or down. This also helps to isolate peering/network issues that may come up in the wild with these nodes, if they aren’t in an isolated environment.
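
One way to express this baseline-then-scale approach is sketched below in Rust. Everything in it is an assumption made for illustration: the `Baseline` fields, the choice of peers and messages per second as metrics, and especially the 1x and 4x thresholds, which would have to come from actual measurements on the baseline machine.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Load {
    Reasonable,
    Average,
    Heavy,
}

/// What the baseline machine (e.g. default zcash.conf, minimum hardware)
/// was measured to handle comfortably. Values are supplied by the operator.
#[derive(Debug, Clone, Copy)]
pub struct Baseline {
    pub peers: u32,
    pub msgs_per_sec: u32,
}

/// Classifies an observed load relative to the baseline.
/// The 1x and 4x cut-offs are placeholders, not measured values.
pub fn classify(baseline: Baseline, peers: u32, msgs_per_sec: u32) -> Load {
    // Guard against a zero baseline so the ratios stay finite.
    let peer_ratio = peers as f64 / baseline.peers.max(1) as f64;
    let msg_ratio = msgs_per_sec as f64 / baseline.msgs_per_sec.max(1) as f64;

    // The most stressed dimension decides the category.
    let ratio = peer_ratio.max(msg_ratio);
    if ratio <= 1.0 {
        Load::Reasonable
    } else if ratio <= 4.0 {
        Load::Average
    } else {
        Load::Heavy
    }
}
```

Keeping the baseline as data rather than hard-coding it means the same classification can be reused when the hardware or network config scales up or down.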

Please let us know if you have any other questions. For whatever reason my Discord is not functioning, so I am unable to message in that portal :frowning_face:

Hey again,

I wanted to quickly post an update on our progress with Ziggurat.

We are working publicly in the eqlabs/ziggurat GitHub repo, which includes a handy table detailing the project status. So far, 13 of the 25 test cases are done.

We had originally scheduled between 4 and 8 weeks for Milestones 2a and 2b, but we definitely hit some of the risks we stated in the proposal, in particular discrepancies between the zcashd and zebra implementations. Communicating this prompted valuable discussion in the Zcash Dev discord, and hopefully implementations will become more aligned (from Ziggurat’s point of view) over time.

Additionally, we submitted a number of vulnerability disclosures via the preferred channels. They were exclusive to Zebra, so no cause for alarm. We trust they’ve made it into the correct hands.

PS @mdr0id thank you so much for all of that context - that gives us a really good sense of how Ziggurat fits into the lay of the land.

Hello again, another update from Equilibrium!

Status

Milestone 2a - Implement the test suite

At this point, we have accomplished what we set out to do, and more. Right now, Ziggurat is as complete as it can be given the status of the implementations, and over the course of this work we have discovered over a dozen discrepancies between the zebra and zcashd implementations. This has catalyzed the creation of several issues and PRs.

Some examples:

This is a huge success, even if it was more work than we’d originally planned.

Milestone 2b - Develop Output and Visualizations

This was planned before the outset of any implementation work, and due to a number of factors we don’t believe visualizations will add value at this point.

  1. There isn’t much data involved beyond simple pass/fail metrics. We have a total of six tables (three for each of the 2 node types), but we don’t believe that’s enough to satisfy the entire grant milestone.
  2. Despite relying on the available time of the ZF/ECC core devs, the continued work of finding more discrepancies and security flaws is far more important. Our work has rekindled this discussion, and we believe the next 2-3 weeks of our work will be better spent supporting this effort.

Request

We’d like to request an amendment to the grant milestones based on our current status. As I’m not sure who exactly to ask, and since the guidance states we should default to posting publicly on the forum, we’re asking here. :pray:

What we’d like to do, given the above context, is to simply expand Milestone 2a into 2b, fusing them together as essential and continued work. We will work in parallel on Milestone 3 in the meantime, thus allowing us to finish the grant work more-or-less on time.

Thank you for your consideration, and we’re really looking forward to finishing off this critically important work!

Milestone 2b Submission

As discussed previously, we merged 2a and 2b into a single deliverable because output and visualizations were not as useful as originally thought.

Instead, we focused on:

  • Completing the test suite (25/25 test cases done)
  • Refactoring to use SyntheticNode as much as possible across the board
  • Prometheus-compatible metrics export (which could be used in Grafana, so there are some visualization opportunities after all)
  • Various refactors and organization not based on SyntheticNode

Complete Test Suite

All of the tests (25/25) are now implemented. You can now read, in full detail, all about the tests that are performed in the Test Index section of the Ziggurat manual.

Versions used: zcashd v4.4.1 (0dade79ce) and Zebra 1.0.0-alpha.11 (6396ac2).

SyntheticNode usage across the board

Much of the refactoring that took place was to place SyntheticNode at the core of the Ziggurat tests. A SyntheticNode is a Rust struct and corresponding implementation that “mocks” a Zcash node to the extent that is necessary for testing, including a bidirectional communication channel and a graceful shutdown. This allows us to be much more flexible with our testing.
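
For readers who want a feel for the idea, here is a toy Rust sketch of a mock node with a bidirectional channel and a graceful shutdown. It is not the actual SyntheticNode from the Ziggurat repo: the `MockNode` and `Message` types are hypothetical, and it uses standard-library threads and channels instead of real network connections.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// A tiny stand-in for protocol messages (hypothetical, for illustration).
#[derive(Debug, PartialEq)]
pub enum Message {
    Ping(u64),
    Pong(u64),
    Shutdown,
}

/// A toy mock node: one channel in each direction plus a worker thread.
pub struct MockNode {
    to_node: Sender<Message>,
    from_node: Receiver<Message>,
    worker: Option<JoinHandle<()>>,
}

impl MockNode {
    pub fn start() -> Self {
        let (to_node, inbox) = mpsc::channel::<Message>();
        let (outbox, from_node) = mpsc::channel::<Message>();
        let worker = thread::spawn(move || {
            // Runs until a Shutdown arrives or all senders are dropped.
            for msg in inbox {
                match msg {
                    Message::Ping(nonce) => {
                        if outbox.send(Message::Pong(nonce)).is_err() {
                            break;
                        }
                    }
                    Message::Pong(_) => {}
                    Message::Shutdown => break,
                }
            }
        });
        MockNode { to_node, from_node, worker: Some(worker) }
    }

    /// Sends a message to the node; returns false if it has already stopped.
    pub fn send(&self, msg: Message) -> bool {
        self.to_node.send(msg).is_ok()
    }

    /// Waits up to one second for the node's next reply.
    pub fn recv(&self) -> Option<Message> {
        self.from_node.recv_timeout(Duration::from_secs(1)).ok()
    }

    /// Graceful shutdown: ask the worker to stop, then wait for it to exit.
    pub fn shut_down(mut self) {
        let _ = self.to_node.send(Message::Shutdown);
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}
```

A test using this sketch would call `MockNode::start()`, send `Message::Ping(7)`, expect `Message::Pong(7)` back from `recv()`, and finish with `shut_down()` so no worker thread outlives the test.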

Prometheus-compatible metrics

The latest changes introduce tracing into the code base, with Prometheus exporting built on top. Despite our decision to not focus on data visualization, this would indeed make it very easy to bring the metrics into a tool like Grafana.
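
As a rough picture of what "Prometheus-compatible" means at the wire level, the Rust sketch below renders a set of counters in the Prometheus text exposition format (a `# TYPE` line followed by a `name value` line). The `Counters` type and the metric name used in the usage note are invented for this example and are not the ones Ziggurat exports.

```rust
use std::collections::BTreeMap;
use std::fmt::Write;

/// A minimal in-memory counter registry (hypothetical, for illustration).
#[derive(Default)]
pub struct Counters {
    values: BTreeMap<String, u64>,
}

impl Counters {
    /// Increments the named counter, creating it at zero if needed.
    pub fn inc(&mut self, name: &str) {
        *self.values.entry(name.to_string()).or_insert(0) += 1;
    }

    /// Renders all counters in the Prometheus text exposition format,
    /// sorted by name because BTreeMap iterates in key order.
    pub fn render(&self) -> String {
        let mut out = String::new();
        for (name, value) in &self.values {
            // Writing into a String cannot fail, so the results are ignored.
            let _ = writeln!(out, "# TYPE {} counter", name);
            let _ = writeln!(out, "{} {}", name, value);
        }
        out
    }
}
```

After two calls to `inc("example_pings_sent_total")`, `render()` returns a `# TYPE example_pings_sent_total counter` line followed by `example_pings_sent_total 2`. Serving that text over HTTP is what lets a Prometheus server scrape it and a tool like Grafana chart it.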

Other Refactoring and Organization

In addition to all of the above, the team refactored, organized, and fixed several items, including but not limited to:

  • Rustdoc additions and improvements
  • “Clippy” tests up to the latest version of Rust
  • Code organization and streamlining
  • Removing unnecessary assertions (i.e. ping/pong)
  • Additional fuzzing and fuzzing improvements