The Trailing Finality Layer

You should also take into account more next-gen consensus protocols like Narwhal/Bullshark, Jolteon, AptosBFT, HotShot, etc.

A theoretically higher validator count is not necessarily the best metric to look at.

What advantages do they have over Snowman? From what I gather they are all fundamentally similar to Tendermint, but with more optimizations, because they are all classical consensus protocols.

I am not claiming that it is the only thing that matters, but I think I have already outlined why it is important, especially for a privacy preserving currency.

1 Like

Here is a document I wrote up about disadvantages of leaving a potentially unbounded gap between the last finalized block and the chain tip (which is the simplest form of an ebb-and-flow protocol as described in @nathan-at-least’s blog post), and what I think should be done instead. It also argues that it is necessary to allow for the possibility of overriding finalization in order to respond to certain attacks, and that this should be explicitly modelled and subject to a well-defined governance process.

The Argument for Bounded Best-Effort Consensus

Background

An “ebb-and-flow protocol”, as described in [NTT2021], is essentially a composition of two consensus protocols, one of which provides finality and the other dynamic availability. The protocol with finality validates a prefix of the blocks validated by the one with dynamic availability.

This is claimed by the paper to “resolve” the tension between finality and dynamic availability. However, a necessary consequence is that in a situation where the “final” protocol stalls and the “available” protocol does not, the “finality gap” between the finalization point and the chain tip can grow without bound.

In this note, we argue that this is unacceptable, and that it is preferable to sacrifice strict dynamic availability. However, we also argue that the main idea behind ebb-and-flow protocols is a good one, and that maintaining a bounded gap between the finalization point and the chain tip does in fact make sense. That is, we argue that allowing the current chain tip to run ahead of the finalization point has practical advantages, but that losing strict dynamic availability is actually preferable to the consequences of the unbounded finality gap, if/when a “long stall” in finalization occurs.

We also argue that it is beneficial to explicitly allow “finalization overrides” under the control of a well-documented governance process. Such overrides allow long rollbacks that may be necessary in the case of an exploited security flaw — because the time needed to detect such a flaw and decide whether to roll back will almost certainly be greater than the finality gap time at the point of detection. The governance process can impose a limit on the length of this long rollback if desired.

Finality + Dynamic availability implies an unbounded finality gap

Since network partitions between the nodes needed for finalization cannot be prevented, the CAP theorem implies that any consistent protocol (and therefore any protocol with finality) may stall for at least as long as the partition takes to heal.

Dynamic availability implies that the current tip will continue to advance, and so the finality gap increases without bound.
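
In symbols (a sketch, under the simplifying assumption that the available protocol produces one block per target interval T): during a finalization stall of duration \Delta, the tip advances by about \Delta/T blocks while the finalization point stays fixed, so

```latex
\mathsf{gap}(t + \Delta) \approx \mathsf{gap}(t) + \Delta/T ,
```

which grows without bound in \Delta.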

Partition is not necessarily the only condition that could cause a finalization stall; it is just the one that most easily shows that this conclusion is impossible to avoid.

Problems with an unbounded finality gap

Both the available protocol and the subprotocol that provides finality will be used in practice — otherwise, one or both of them might as well not exist. There is always a risk that blocks may be rolled back to the finalization point, by definition.

Suppose, then, that there is a long finalization stall. The final and available protocols are not separate: there is no duplication of tokens between protocols, but the rules about how to determine best-effort balance and guaranteed balance depend on both protocols, how they are composed, and how the history after the finalization point is interpreted.

The guaranteed minimum balance of a given party is not just the minimum of their balance at the finalization point and their balance at the current tip. It is the minimum balance taken over all possible transaction histories that extend the finalized chain – taking into account that a party’s previously published transactions might be able to be reapplied in a different context without its explicit consent. The extent to which published transactions can be reapplied depends on technical choices that we must make, subject to some constraints (for example, we know that shielded transactions cannot be reapplied after their anchors have been invalidated). It may be desirable to further constrain re-use in order to make guaranteed minimum balances easier to compute.
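
To make the pessimistic flavour of this concrete, here is a toy sketch in Rust. All types and the single-debit model are hypothetical simplifications of mine; a real computation would have to consider interactions between transactions and which anchors the extension history invalidates.

```rust
/// Hypothetical simplified model of a party's published transaction.
struct PublishedTx {
    /// Net amount this transaction debits from the party if it is reapplied.
    debit: u64,
    /// Whether the transaction could still be reapplied in some extension of
    /// the finalized chain (e.g. its anchor has not been invalidated).
    reapplicable: bool,
}

/// Worst case over all extension histories, in this toy model: every
/// still-reapplicable published transaction gets reapplied.
fn guaranteed_min_balance(finalized_balance: u64, published: &[PublishedTx]) -> u64 {
    let worst_case_debits: u64 = published
        .iter()
        .filter(|tx| tx.reapplicable)
        .map(|tx| tx.debit)
        .sum();
    finalized_balance.saturating_sub(worst_case_debits)
}
```

Constraining reapplication, as suggested above, shrinks the set of histories the minimum is taken over, which is what would make guaranteed minimum balances easier to compute.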

As the finality gap increases, the negative consequences of a rollback increase. There are several possible (not mutually exclusive) outcomes:

  • Users of the currency start to consider the available protocol increasingly unreliable.
  • Users start to consider a rollback to be untenable, and lobby to prevent it or cry foul if it occurs.
  • Users start to consider finalization increasingly irrelevant. Services that depend on finalization become unavailable.
    • There is no free lunch that would allow us to avoid availability problems for services that also depend on finality.
  • Service providers adopt temporary workarounds that may not have had adequate security analysis.

Any of these might precipitate a crisis of confidence, and there are reasons to think this effect might be worse than if the chain had simply stalled. Any such crisis may have a negative effect on token prices and long-term adoption.

Note that adding finalization in this way does not by itself increase the probability of a rollback in the available chain, provided the PoW remains as secure against rollbacks of a given length as before. But that is a big proviso. We have a design constraint (motivated by limiting token devaluation and by governance issues) to limit issuance to be no greater than that of the original Zcash protocol up to a given height. Since some of the issuance is likely needed to reward staking, the amount of money available for mining rewards is reduced, which may reduce overall hash rate and security of the PoW. Independently, there may be a temptation for design decisions to rely on finalization in a way that reduces security of PoW (“risk compensation”). There is also pressure to reduce the energy usage of PoW, which necessarily reduces the global hash rate, and therefore the cost of performing an attack that depends on the adversary having any given proportion of global hash rate.

It could be argued that the issue of availability of services that depend on finality is mainly one of avoiding over-claiming about what is possible. Nevertheless I think there are also real usability issues if balances as seen by those services can differ significantly and for long periods from balances at the chain tip.

Regardless, incorrect assumptions about the extent to which the finalized and available states can differ are likely to be exposed if there is a finalization stall. And those who made the assumptions may (quite reasonably!) not accept “everything is fine, those assumptions were always wrong” as a satisfactory response.

What is Bounded Best-Effort Consensus?

The idea is simple to describe: if an unbounded finality gap is a problem, then just enforce a bound on it. In this approach, progress of the chain tip will, eventually, also stall if a finalization stall lasts for long enough — and so we are sacrificing strict dynamic availability.

A way of describing this that may be intuitive for many people is that it works like video streaming. All video streaming services use a buffer to paper over short-term interruptions or slow-downs of network access. In most cases, this buffer is bounded. This allows the video to be watched uninterrupted and at a constant rate in most circumstances. But if there is a longer-term network failure or insufficient sustained bandwidth, the playback will unavoidably stall.
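
In the same spirit, here is a minimal sketch of the bound itself. This is only a model of the rule, not of the enforcement mechanism I actually propose (which works via finalization information in block headers, as described in the Addendum); `k` is a hypothetical consensus parameter.

```rust
/// May a node accept a block extending the current tip, given the last
/// finalized height and a finality gap bound of `k` blocks?
fn may_extend_tip(tip_height: u64, finalized_height: u64, k: u64) -> bool {
    // The gap after accepting the new block must not exceed the bound.
    (tip_height + 1).saturating_sub(finalized_height) <= k
}
```

Like the streaming buffer, this absorbs short finalization stalls invisibly, and halts the chain only when a stall outlasts the bound.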

So, why do I advocate this over:

  a. A protocol that only provides dynamic availability;
  b. A protocol that only provides finality;
  c. An unmodified ebb-and-flow protocol?

The reason to reject a) is straightforward: finality is a valuable security property that is necessary for some use cases.

If a protocol only provides finality (option b), then short-term availability is directly tied to finalization. It may be possible to make finalization stalls sufficiently rare or short-lived that this is tolerable. But that is more likely to be possible if and when there is a well-established staking ecosystem. Before that ecosystem is established, the protocol may be particularly vulnerable to stalls. Furthermore, it’s difficult to get to such a protocol from a pure PoW system like current Zcash.

We argued in the previous section that an unbounded finality gap is bad, and that c) entails an unbounded finality gap. However, that isn’t sufficient to argue that a bounded best-effort protocol is better. Perhaps there are no good solutions! What are we gaining from a bounded best-effort approach that would justify the complexity of a hybrid protocol without obtaining strict dynamic availability?

My argument goes like this:

  • It is likely that a high proportion of the situations in which a sustained finalization stall happens will require human intervention. If the finality protocol were going to recover without intervention, there is no reason to think that it wouldn’t do so in a relatively short time.
  • When human intervention is required, the fact that the chain tip is still proceeding apace (in protocols with strict dynamic availability) makes restarting the chain harder, for many potential causes of a finalization stall. Those problems may be easier to fix when the chain tip is also stalled. This argument carries even more force when the protocol also allows “finalization overrides”, as discussed later in the Complementarity section.
  • Nothing about the bounded best-effort option prevents us from working hard to design a system that makes finalization stalls as infrequent and short-lived as possible, just as we would for any other option that provides finality.
  • We want to optimistically minimize the finality gap under good conditions, because this improves the usability of services that depend on finality. This argues against protocols that try to maintain a fixed gap, and motivates letting the gap vary up to a bound.
  • In practice, the likelihood of short finalization stalls is high enough that heuristically retaining dynamic availability in those situations is useful.

The argument that it is difficult to completely prevent finalization stalls is supported by experience on Ethereum in May 2023, when there were two stalls within 24 hours, one for about 25 minutes and one for about 64 minutes. This experience is consistent with my arguments:

  • Neither stall required short-term human intervention, and the network did in fact recover from them quickly.
  • The stalls were caused by a resource exhaustion problem in the Prysm consensus client when handling attestations. It’s plausible to think that if this bug had been more serious, or possibly if Prysm clients had made up more of the network, then it would have required a hotfix release (and/or a significant proportion of nodes switching to another client) in order to resolve the stall. So this lines up with my hypothesis that longer stalls are likely to require manual intervention.
  • A bounded best-effort protocol would very likely have resulted in either a shorter or no interruption in availability. If, say, the finality gap bound were set to be roughly an hour, then the first finalization stall would have been “papered over” and the second would have resulted in only a short chain stall.

Retaining short-term availability does not result in a risk compensation hazard:

  • A finalization stall is still very visible, and directly affects applications relying on finality.
  • Precisely because of the bounded finality gap, it is obvious that it could affect chain progress if it lasted long enough.

A potential philosophical objection to lack of dynamic availability is that it creates a centralization risk to availability. That is, it becomes more likely that a coalition of validators can deliberately cause the chain to halt. I think this objection may be more prevalent among people who would object to adding a finality layer or PoS at all.

Finalization Overrides

Consensus protocols sometimes fail. Potential causes of failure include:

  • A design problem with the finality layer that causes a stall, or allows a stall to be provoked.
  • A balance violation or spend authorization flaw that is being exploited or is sufficiently likely to be exploited.
  • An implementation bug in a widely used node implementation that causes many nodes to diverge from consensus.

In these situations, overriding finality may be better than any other alternative.

An example is a balance violation flaw due to a 64-bit integer overflow that was exploited on Bitcoin mainnet on 15th August 2010. The response was to roll back the chain to before the exploitation, which is widely considered to have been the right decision. The time between the exploit (at block height 74638) and the forked chain overtaking the exploited chain (at block height 74691) was 53 blocks, or around 9 hours.

Of course, Bitcoin used and still uses a pure PoW consensus. But the applicability of the example does not depend on that: the flaw was independent of the consensus mechanism.

Another example of a situation that prompted this kind of override was the DAO recursion exploit on the Ethereum main chain in June 2016. The response to this was the forced balance adjustment hard fork on 20th July 2016 commonly known as the DAO fork. Although this adjustment was not implemented as a rollback, and although Ethereum was using PoW at the time and did not make any formal finality guarantees, it did override transfers that would heuristically have been considered final at the fork height. Again, this flaw was independent of the consensus mechanism.

The DAO fork was of course much more controversial than the Bitcoin fork, and a substantial minority of mining nodes split off to form Ethereum Classic. In any case, the point of this example is that it’s always possible to override finality in response to an exceptional situation, and that a chain’s community may decide to do so. The fact that Ethereum 2.0 now does claim a finality guarantee would not in practice prevent a similar response in future that would override that guarantee.

The question then is whether the procedure to override finality should be formalised or ad hoc. I argue that it should be formalised, including specifying the governance process to be used.

This makes security analysis — of the consensus protocol per se, of the governance process, and of their interaction — much more feasible. Arguably a complete security analysis is not possible at all without it.

It also front-loads debate about what procedure should be followed, making it more likely that stakeholders will agree to follow the process in any time-critical incident.

A way of modelling overrides that is insufficient

There is another possible way to model a protocol that claims finality but can be overridden in practice. We could say that the protocol after the override is a brand new protocol and chain (inheriting balances from the previous one, possibly modulo adjustments such as those that happened in the DAO fork).

Although that would allow saying that the finality property has technically not been violated, it does not match how users think about an override situation. They are more likely to think of it as a protocol with finality that can be violated in exceptional cases — and they would reasonably want to know what those cases are and how they will be handled. It also does nothing to help with security analysis of such cases.

Complementarity

Finalization overrides and bounded best-effort consensus are complementary in the following way: if a problem is urgent enough, then validators can be asked to stop validating. For genuinely harmful problems, it is likely to be in the interests of enough validators to stop that this causes a finalization stall. If this lasts longer than the finality gap bound then the chain will halt, giving time for the defined governance process to occur and decide what to do. And because the unfinalized chain will not have run too far ahead, the option of a long rollback remains realistically open.

If, on the other hand, there is time pressure to make a governance decision about a rollback in order to reduce its length, that may result in a less well-considered decision.

A possible objection is that there might be a coalition of validators who ignore the request to stop (possibly including the attacker or validators that an attacker can bribe), in which case the finalization stall would not happen. But that just means that we don’t gain the advantage of more time to make a governance decision; it isn’t actively a disadvantage relative to alternative designs. This outcome can also be thought of as a feature rather than a bug: halting the chain should be a last resort, and if the argument given for the request to stop failed to convince a sufficient number of validators that it was reason enough to halt the chain, then perhaps it wasn’t a good enough reason.

It is also possible to make the argument that the threshold of stake needed is imposed by technical properties of the finality protocol and by the resources of the attacker, which might not be ideal for the purpose described above. However, I would argue that it does not need to be ideal, and will be in the right ballpark in practice.

Addendum

Since I posted the above note, I’ve been thinking about potential attacks on protocols using the bounded best-effort model.

Tail-thrashing attacks

There is an important class of potential attacks based on the fact that when the unfinalized chain stalls, an adversary has more time to find blocks, and this might violate security assumptions of the more-available protocol. For instance, if the more-available protocol is PoW-based, then its security in the steady state is predicated on the fact that an adversary with a given proportion of hash power has only a limited time to use that power, before the rest of the network finds another block. During a chain stall this is not the case. If, say, the adversary has 10% hash power, then it can on average find a block in 10 block times. And so in 100 block times it can create a 10-block fork.
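
As a back-of-the-envelope model (my simplification, assuming difficulty stays at its pre-stall level and the honest chain does not advance during the stall): an adversary with fraction \alpha of the pre-stall global hash rate finds blocks at rate \alpha/T, where T is the target block interval, so

```latex
\mathsf{E}[\text{fork length after time } t] = \alpha \cdot t/T ,
\qquad \text{e.g. } \alpha = 0.1,\ t = 100T \;\Rightarrow\; 10 \text{ blocks}.
```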

It may in fact be worse than this: once miners know that a finalization stall is happening, their incentive to continue mining is reduced, since they know that there is a greater chance that their blocks might be rolled back. So we would expect the global hash rate to fall — even before the finality gap bound is hit — and then the adversary would have a greater proportion of hash rate.

Even in a pure ebb-and-flow protocol, a finalization stall could cause miners to infer that their blocks are more likely to be rolled back, but the fact that the chain is continuing would make this more difficult to exploit. This issue with the global hash rate is mostly specific to the more-available protocol being PoW: if it were PoS, then its validators might as well continue proposing blocks because it is cheap to do so. There might be other attacks when the more-available protocol is PoS; I haven’t spent much time analysing that case.

The problem here is that it may have been assumed from the earlier description that the more-available chain would just halt during a chain stall. But in fact, for a finality gap bound of k blocks, an adversary could cause the k-block “tail” of the chain as seen by any given node to “thrash” between different chains. I will call this a tail-thrashing attack.

If a protocol allowed such attacks then it would be a regression relative to the security we would normally expect from an otherwise similar PoW-based protocol. It only occurs during a chain stall, but note that we cannot exclude the possibility of an adversary being able to provoke a chain stall.

Let’s put a pin in solving this problem, because there is another issue that we need to consider, and my preferred approach addresses both.

Finalized Prefix Availability

In the absence of security flaws and under the security assumptions required by the finality layer, the finalization point will not be seen by any honest node to roll back. However, that does not imply that all nodes will see the same finalized height — which is impossible given network delays and unreliable messaging.

In order to optimize the availability of applications that require finality, we need to consider availability of the information, e.g. validator proposals and (possibly aggregate) signatures, needed to finalize the chain up to a particular point. It is possible to incentivize distribution of that information by piggy-backing it on block headers, so that block producers have to include it in order to obtain the block production reward (e.g. mining reward) for the more-available protocol.

Obviously the piggy-backed information has to be for finalization only up to some previous block, because the information for later blocks wasn’t yet available to the block producer.

Optionally, we could incentivize the block producer to include the latest information it has, for example by burning part of the block reward or by giving the producer some limited mining advantage that depends on how many blocks back the finalization information is. Alternatively we could just rely on the fact that some proportion of block producers are honest and will include the latest information they have.

Suppose that for a k-block finality gap bound, we required each block header to include the information necessary for a node to finalize to k blocks back, given that the node has already finalized to k+1 blocks back. This would automatically implement the finality gap bound without any further explicit check, because it would be impossible to produce a block after the bound. But it would also guarantee that availability of the finalized prefix, at least up to the bound, is incentivized to the same extent as the more-available protocol.
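
A sketch of that rule as a header validity check follows. The types are hypothetical, and verifying the evidence itself (e.g. checking an aggregate signature against the validator set) is elided.

```rust
/// Hypothetical evidence that a given height is finalized, e.g. an aggregate
/// signature by a sufficient proportion of validators/stake.
struct FinalizationEvidence {
    finalizes_height: u64,
    // aggregate signature, validator set reference, ...
}

struct Header {
    height: u64,
    /// Finalization evidence carried by this header, if any.
    evidence: Option<FinalizationEvidence>,
}

/// Simple rule: a block at height h is valid only if its header carries
/// evidence finalizing height h - k. A producer that cannot supply this
/// evidence cannot produce a valid block, so the gap bound is enforced
/// without any separate check.
fn header_valid(header: &Header, k: u64) -> bool {
    match &header.evidence {
        // Cryptographic verification of the evidence is elided here.
        Some(ev) => Some(ev.finalizes_height) == header.height.checked_sub(k),
        None => false,
    }
}
```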

In the Ebb-and-Flow paper, we also have information from the more-available protocol being used in the BFT protocol (top of right column, page 7):

In addition, \mathsf{ch}^t_i is used as side information in \Pi_{\mathrm{bft}} to boycott the finalization of invalid snapshots proposed by the adversary.

This does not cause any circularity. In fact, it means that BFT validators have to listen to the block transmission protocol anyway, so that could be also the protocol over which BFT communication occurs. That is, BFT validators could listen for block headers and get all the information needed to make and broadcast their own signatures or proposals. (A possible reason not to broadcast individual signatures to all nodes is that with large numbers of validators, the proof that a sufficient proportion of validators/stake has signed can use an aggregate signature, which could be much smaller.)

Back to the tail-thrashing problem

Note that in ebb-and-flow-based protocols, snapshots of the “longest chain” protocol \Pi_{\mathrm{lc}} are used as input to the BFT protocol. That implies that the tail-thrashing problem could also affect the input to that protocol, which would be bad (not least for security analysis of availability, which seems somewhat intractable in that case).

Also, when we restart the more-available chain, we would need to take account of the fact that the adversary has had an arbitrary length of time to build long chains from every block that we could potentially restart from. It could be possible to invalidate those chains by requiring blocks after the restart to be dependent on fresh randomness, but that sounds quite tricky (especially given that we want to restart without manual intervention if possible), and there may be other attacks I haven’t thought of.

So, instead of trying to directly solve the tail-thrashing problem, we will avoid it. We will say that if a block producer cannot give the information needed to advance the finalization point to at least k blocks back, then it must produce a coinbase-only block.

This achieves pretty much the same effect, for our purposes, as actually stalling the more-available chain. Since funds cannot be spent in coinbase-only blocks, the vast majority of attacks that we are worried about would not be exploitable in this state.

It is possible that a security flaw could affect coinbase transactions. We might want to turn off shielded coinbase for those blocks in order to reduce the chance of that.

Also, mining rewards cannot be spent in a coinbase-only block; in particular, mining pools cannot distribute rewards. So there is a risk that an unscrupulous mining pool might try to do a rug-pull after mining of non-coinbase-only blocks resumes, if there was a very long finalization stall. But this approach works at least in the short term, and probably for long enough to allow manual intervention into the finalization protocol, or governance processes if needed.

There’s a caveat: part of the reason we wanted the more-available chain to also stall is to make it more acceptable to do a rollback, possibly as far as the stalled finalization point — maybe earlier if the stall was triggered deliberately in response to noticing a sufficiently serious attack on-chain, and if governance decisions allow it. What happens to incentives of miners on a chain that might be rolled back in that way?

This is actually fairly easy to solve. We have the governance procedures say that if we do an intentional rollback, the coinbase-only mining rewards will be preserved. I.e. we mine a block or blocks that include those rewards paid to the same addresses (adjusting the consensus to allow them to be created from thin air if necessary), have everyone check it thoroughly, and require the chain to restart from that block. So as long as miners believe that this procedure will be followed and that the chain will eventually recover at a reasonable coin price, they will still have incentive to mine on the \Pi_{\mathrm{lc}} chain, at least for a time.

There’s one more thing to solve: when we resume mining non-coinbase-only blocks, the rule that

Each block header must include the information necessary for a node to finalize to k blocks back, given that the node has already finalized to k+1 blocks back.

will not be sufficient to catch up the finalization point using only information from block headers. We can solve this by adjusting the rule as follows (this also includes not enforcing it for coinbase-only blocks):

Let \mathsf{thisHeight} be the height of this block, and let \mathsf{prevFinal} be the height to which finalization information has been given by the header chain up to and including this block’s parent. If the block is not coinbase-only, then the block header MUST include the information necessary for a node to finalize to height \mathsf{min}(\mathsf{prevFinal} + 100, \mathsf{thisHeight} - k), given that the node has already finalized to height \mathsf{prevFinal}. (No information is needed if \mathsf{thisHeight} - k \leq \mathsf{prevFinal}.)

This allows nodes to catch up quickly (at a rate of 100 finalized blocks per new block) while still keeping the size of finalization information in each block bounded.
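
Here is a sketch of the adjusted rule, again with hypothetical types; `prev_final` is \mathsf{prevFinal} and `this_height` is \mathsf{thisHeight} from the rule above.

```rust
/// Height to which a non-coinbase-only block at `this_height` must carry
/// finalization information, given that the header chain up to its parent
/// already finalizes to `prev_final`. `None` means no information is needed.
fn required_final_height(this_height: u64, prev_final: u64, k: u64) -> Option<u64> {
    let target = this_height.saturating_sub(k);
    if target <= prev_final {
        None
    } else {
        // Catch up by at most 100 finalized blocks per new block, keeping the
        // size of finalization information in each header bounded.
        Some(target.min(prev_final + 100))
    }
}

/// Block validity under the adjusted rule. `carried_final` is the height
/// finalized by the evidence in this block's header, if any.
fn block_valid(
    coinbase_only: bool,
    this_height: u64,
    prev_final: u64,
    k: u64,
    carried_final: Option<u64>,
) -> bool {
    // The rule is not enforced for coinbase-only blocks.
    if coinbase_only {
        return true;
    }
    match required_final_height(this_height, prev_final, k) {
        None => true,
        // The rule pins the exact height, bounding per-header data.
        Some(required) => carried_final == Some(required),
    }
}
```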

5 Likes

Exploring each method’s advantages over Snowman is a research paper in its own right and, if it seemed useful, would be part of any research toward transitioning to PoS.

These are all very different consensus protocols; maybe the only clear difference is that Snowman is leaderless while the aforementioned are not.

1 Like

Hi all,

I wanted to share an update on PoS research at ECC.

The most recent event is that I gave this presentation at Zcon4 on TFL. I’d like to share some of the feedback I got in that session and from multiple conversations at Zcon, my reaction to it, what we’ve been up to since then, and what we’re planning next.

Zcon4 Feedback

At Zcon I got two kinds of feedback:

  • Could it be better to take an alternative approach where we transition Zcash from pure PoW to pure PoS in one step?
  • Does TFL have security flaws, and if so, are they show-stoppers or something we can mitigate by improving the design?

Alternative Approach: A Single Complete Transition

I’ve thought a bit about these suggestions since Zcon4: mainly about the perceived trade-offs vs a TFL-style approach (or any multi-step approach), about timelines and resources, and about the trade-off between considering many alternatives vs making further progress on specific ones. You can read more about the trade-offs we’ve identified in this github ticket comment.

The gist of it is that we intend to continue developing TFL until it’s either a viable ZIP-worthy proposal or we identify specific blockers. We provide more explanation for this decision in this github ticket comment.

This doesn’t mean we wouldn’t welcome others to flesh out more concrete alternative proposals! If anyone does so, that could be helpful for Zcash.

However, given our limited bandwidth at ECC (especially since we’re currently highly focused on Emergency Mode) our goal is to produce one good specific proposal for everyone to evaluate, so we’ll continue refining the TFL approach.

TFL Security

The second category of feedback on TFL was around security concerns. As you can see in my presentation, there are multiple unresolved considerations to understand the security of TFL.

One thing to keep in mind about security: sometimes security is more of a continuum from weaker to stronger along some specific dimension. But at the end of the day, what users need is a simple “yes / no” answer to “is it safe enough for my needs?” So we might argue some design is safe enough, even though it doesn’t have maximal security compared to other known alternatives. Any kind of argument about “safe enough” deserves a lot of scrutiny, because it can be extremely hard to know what users’ needs and risk tolerances are, and even when we do, it can be very hard to measure security against them. Nevertheless, we believe the best designs make conscientious trade-offs, and that includes security considerations. None of this is to suggest we should diminish the Zcash development community’s top-notch record and culture of strong security.

Ok, with that, let’s dig into the primary concern we took out of the Zcon4 feedback about security. While there are multiple ways to frame the concern, a plain English summary goes like this:

Often security is guaranteed by the weakest link in a system. In the TFL design there are two core subsystems: PoW and PoS. So the core concern is: will the security be limited by the weakest guarantees provided by these two subsystems?

And that brings me to what we’ve been focused on since Zcon4:

Progress since Zcon4

All of our PoS research effort since Zcon4 has been focused on two goals: organizing the TFL roadmap and analyzing the “weakest link” security issue above.

If you want to see under the covers of the crude/early roadmap, we’re using github milestones, starting with these two:

  • Design Phase 1 specifies a simplified subset of TFL more precisely, based primarily on Ebb-and-Flow (and odds and ends we’ve picked up from studying Ethereum’s transition and its Gasper protocol). This milestone excludes most of the mechanics of PoS itself.
  • Implementation Phase 1 specifies a codebase for simulating attacks. We want to get started on coding early in the design process so that as the design improves we’ll have working, though incomplete, prototypes. As the design matures we hope to have a fully functional testnet.

What’s Next?

We’ll keep posting updates as we go, and we’ll be refining and adding new R&D milestones. ECC will also be publishing more top-level roadmap posts that will give people a higher level understanding of PoS R&D progress (along with all our other efforts), so if you’d prefer the high level of only major milestones, stay tuned for those.

16 Likes

I added a long Addendum section to this note, which describes attacks and mitigations specific to Bounded Best-Effort consensus. If you liked the original note, you’ll want to read it (or the version of it in the TFL Book)!

2 Likes

Just wanted to post the link to the Zapa GitHub in case you hadn’t seen it, @nathan-at-least: a project from a while back where someone modified zcashd to use Snowman consensus.

1 Like

We’ve just published v0.1.0 of the TFL Book.

This version of the design includes the Crosslink construction, a hybrid consensus protocol which integrates PoW and PoS subprotocols. The Crosslink doc has security proofs of safety and liveness properties. This represents the first milestone with a specific abstract hybrid PoW/PoS protocol defined.

This v0.1.0 version has the content split between the TFL Book and a separate hackmd defining the Crosslink construction. The next milestone will integrate the Crosslink definition into the book, ensuring it’s self-consistent.

The short-to-medium term roadmap for the Trailing Finality Layer is to bolster this initial version of Crosslink with simulations and more in-depth analysis of various attacks, as well as to begin examining real-world PoS protocols to integrate into the hybrid construction.

For the medium-term roadmap (within the next ~3 or so months), we’re considering a variety of steps beyond the roadmap defined above:

  • Complete TFL protocol:
    • Define a specific PoS subprotocol integration.
    • Define staking operations which are integrated with Zcash transaction semantics.
    • Define staking parameters and analyze their security and economic properties.
  • Begin creating a prototype / testnet.
  • Crosslink:
    • Get broader review of Crosslink security.
    • Compare Crosslink to other known viable hybrid protocols.
  • Begin designing a plan for transitioning Zcash from PoW to PoW+TFL.

In addition to the direct work on Trailing Finality Layer, we’re aware of multiple projects or people interested in bridging protocols or systems, so we hope to engage with them and ensure the design of Trailing Finality Layer can integrate and/or support trust-minimized bridge protocols.

As always, feedback is appreciated!

Shout out to Daira Emma Hopwood who has done most of the research and design behind Crosslink. <3

24 Likes

I have been hoping for movement in this direction for nearly four years, so this is really exciting to hear! Zcash <-> Cosmos Integration

Although I’m now on the sidelines of the Zcash effort, I would implore the Zcash community to choose Tendermint/CometBFT off-the-shelf as the PoS implementation, rather than putting effort into building or customizing it in any way. It is written in Go, but there is no problem interacting with it from other languages like Rust using ABCI.

This approach has two enormous advantages:

  • Light Client Compatibility: using CometBFT off the shelf means that the state of the PoS subprotocol can be verified using an off-the-shelf light client, and importantly, it can be verified on-chain by every Cosmos chain. This opens the door to IBC connections to Zcash.
  • No custom engineering: using CometBFT off the shelf means that Zcash can skip the engineering effort of designing and building a new consensus mechanism, and can reuse the investment made by the entire Cosmos ecosystem.

At this point there’s no problem with using CometBFT to drive the state of a Rust application (and have that Rust application implement custom staking logic). We do so for Penumbra; you could look at our source code as an example and potentially also reuse our stack. For instance, Namada uses our tower-abci library for interfacing with CometBFT, and Astria uses penumbra-storage for state management, penumbra-component for application structuring, and penumbra-ibc for an IBC implementation.

If the Zcash TFL design pulled CometBFT off-the-shelf, it could potentially go the Astria route, reuse parts of the Penumbra stack, and have a drop-in IBC implementation for Zcash on a very accelerated timeline (completion within your medium term roadmap). We’d love for the Zcash community to be able to take advantage of the code we’ve written and we’d be happy to answer questions about it.

17 Likes

Thanks for the feedback.

For the record, I’ve personally preferred each of the CometBFT components (SDK, consensus protocol, IBC) as prime candidates for Zcash. You can see that prototyping with CometBFT is in the first set of tickets for the TFL Book. On our current roadmap we need to first do the design work (aka “adapting it” for Crosslink), and that doesn’t start until the fourth milestone out.

I appreciate you laying out the strengths, and I concur with all of those. (Also, I’ve stumbled across Penumbra rust crates a few times and started experimenting with them before realizing where they came from. :wink: They seem well designed from my cursory explorations.)

Do you think there are any drawbacks to selecting CometBFT consensus protocol vs any alternatives? (Assume for the sake of argument other candidate protocols had well engineered off the shelf SDKs and supported IBC.)

In particular I’ve seen a fair amount of discussion on these forums about different PoS protocol characteristics (and I haven’t examined all of the suggested alternatives yet).

Are you aware of efforts to productionize alternative consensus protocols in CometBFT SDK?

Are there any alternatives to IBC worth considering?

8 Likes

I’ve personally preferred each of the CometBFT components (SDK, … )

To clarify, I would not recommend trying to use the Cosmos SDK; it’s not well-shaped to your problem and it’s written in Go. I’d recommend building a Rust app directly on top of CometBFT, where you can have complete control of the application layer, and where you could reuse parts of the Penumbra stack:

  • cometbft, the out-of-process, standalone consensus server, which is an ABCI client talking to your Rust app;
  • tendermint-rs datatypes for modeling Tendermint data like block headers, which you’ll need to implement your custom staking logic;
  • tower-abci, an ABCI server library that runs a listener and converts ABCI wire data into async request/response pairs handled by your application;
  • penumbra-storage as the storage backend, which manages chain state in a verifiable K/V store backed by a Jellyfish tree, allows transactional writes using CoW snapshots, and also provides a GRPC interface for making provable queries about chain state (for use by, e.g., IBC relayers);
  • penumbra-component, whose Component and ActionHandler traits allow you to structure the application logic into components that can depend on each other (e.g., in Penumbra, the shielded pool is a component, the staking system is a component, etc.);
  • penumbra-ibc, a Component that provides an off-the-shelf IBC implementation. It handles all of the IBC state machine and all of the IBC events needed to inform relayers about IBC data they might need to handle. It also provides GRPC interfaces that relayers can use to relay packets back and forth. The actual packet handling logic (what to do if someone sends you an IBC packet) is up to your application.

All of these libraries were built for our use case originally but don’t have other Penumbra-specific logic in them; this is the same part of our stack that Astria is currently reusing. Our Penumbra-specific logic is built in other crates on top of these, which wouldn’t make sense for Zcash to reuse.
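
For readers who haven’t worked with ABCI, here is a deliberately simplified sketch of the shape of such an application. The request/response enums below are illustrative stand-ins of mine, not the real tower-abci / tendermint-rs types; a real integration implements a tower::Service over the actual ABCI types and serves it with tower-abci, as Penumbra and Astria do.

```rust
// Simplified stand-ins for the ABCI request/response types. In a real app
// these arrive from the consensus engine (CometBFT) over the ABCI connection.
enum Request {
    Info,
    CheckTx(Vec<u8>),
    DeliverTx(Vec<u8>),
    Commit,
}

enum Response {
    Info { last_block_height: u64 },
    CheckTx { accept: bool },
    DeliverTx { ok: bool },
    Commit { app_hash: Vec<u8> },
}

struct App {
    height: u64,
    // Application state (e.g. a verifiable K/V store) would live here.
}

impl App {
    fn handle(&mut self, req: Request) -> Response {
        match req {
            // The engine uses Info to learn where the app left off on restart.
            Request::Info => Response::Info { last_block_height: self.height },
            // Mempool admission: cheap validation of an incoming transaction.
            Request::CheckTx(tx) => Response::CheckTx { accept: !tx.is_empty() },
            // Block execution: apply the transaction to application state.
            Request::DeliverTx(_tx) => Response::DeliverTx { ok: true },
            // Persist state and return a commitment to it.
            Request::Commit => {
                self.height += 1;
                Response::Commit { app_hash: Vec::new() }
            }
        }
    }
}

fn main() {
    let mut app = App { height: 0 };
    let _ = app.handle(Request::CheckTx(b"tx".to_vec()));
    let _ = app.handle(Request::DeliverTx(b"tx".to_vec()));
    let _ = app.handle(Request::Commit);
    println!("app height after one block: {}", app.height);
}
```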

Do you think there are any drawbacks to selecting CometBFT consensus protocol vs any alternatives? (Assume for the sake of argument other candidate protocols had well engineered off the shelf SDKs and supported IBC.)

Yes, assuming for the sake of argument that other candidate protocols had well-engineered, off-the-shelf SDKs and supported IBC, there would be various drawbacks to using CometBFT:

In the end though, none of this matters, because there are no other candidate protocols with well-engineered, off-the-shelf SDKs and IBC support. CometBFT is the only mature BFT consensus protocol developed independently of a specific chain. The Narwhal/Bullshark implementation I linked above, for instance, is just a research prototype. So in practical terms you are stuck with either using CometBFT or going off into the wilderness.

Are there any alternatives to IBC worth considering?

No, for kind of similar reasons. One thing to highlight is that there are two versions of IBC: “IBC as specified”, which is a very flexible mechanism for verifiable messaging between verifiable state machines, and “IBC as deployed”, where communications between Cosmos-SDK-on-Tendermint chains are much smoother than in other configurations. At a high level, there are three parts of IBC:

  • the light client part (chains A and B run on-chain light clients of each other to verify each others’ state roots);
  • the Merkle proofs part, usually implemented with ICS23, a DSL for specifying generic Merkle proof verification programs and program specifications, so that chains can verify each others’ proofs without special support for each Merkle tree;
  • the application data part, specific applications like ICS20 (fungible token transfers).

For the Penumbra stack, for instance, the light client part is off-the-shelf, the Merkle proofs part is provided by the jmt and penumbra-storage crates, which provide an ICS23 proof specification and query interfaces respectively, and the penumbra-ibc crate handles writing all the messages in the right places and provides hooks for customizing the behavior of the application.

To actually operationalize this you’ll also need a relayer implementation, which can subscribe to events on both chains and post data back and forth. The relayer implementation needs to have endpoints to subscribe to; if you use penumbra-ibc those come included, and since we’ve upstreamed changes to Hermes to make it chain-agnostic, you could reuse that work too.

There are no general alternatives to IBC, because nobody else has tried to build a fully general-purpose networking system for verifiable state machines. There are a lot of subtle implementation-specific behaviors inherited from ibc-go, so it would save a lot of time to reuse an existing IBC implementation.

13 Likes

I’d argue Substrate meets the definition of a chain-independent BFT layer as much as CometBFT does. CometBFT defines the ABCI; Substrate defines a (non-networked) runtime API (which could be networked? theoretically?).

Substrate is a single SDK, in Rust, which would meet all of the above features of CometBFT (and directly include an SDK, unlike the Cosmos SDK in Go or the collection of works from Penumbra, whom I quite respect as a project) except IBC.

The P2P layer is a LibP2P-based solution, not a completely custom one.

The mempool logic in Substrate is already modular.

As is consensus, meaning it can be used with BABE+GRANDPA (which trades off per-block finality for much greater performance, at least when comparing literal implementations) or anything else implemented. I previously implemented Tendermint for Substrate, and there are a few other consensus algorithms around (Aura instead of BABE for block production, SASSAFRAS as a research WIP, Aleph Zero’s AlephBFT (which is asynchronous, like Narwhal and Bullshark)…).

Do I recommend Substrate? Definitely not if you want IBC at this time.

Do I recommend Substrate if Zcash doesn’t want IBC at this time? Maybe. There is work on a GRANDPA-IBC solution, meaning it wouldn’t be cutting IBC off forever. I have my variety of complaints/comments on it, but I’m sure people experienced with CometBFT/Cosmos have their own complaints/comments. I really don’t care to advocate for it at depth in this context, solely to say that CometBFT isn’t the only option.

My main question is whether Zcash should have a consensus-layer blockchain at all. Once a block has n confirmations, can’t a commit be made on top of it without an entire new blockchain? I’d advocate staying so minimal for now, if possible (though I’ll confess to not having read the 0.1 book yet due to time constraints, and apologize if this is raising a discussion already had).

5 Likes

I am very happy and appreciative that we’re having this conversation on the forum.

I am not particularly well versed in the difference between PoS systems, however I understand how much of a tectonic shift it is for Zcash, and how it has the potential to enable critical features.

Bridging is a major feature I am looking for. Being able to exchange with other ecosystems is indeed critical. For a long time now we’ve seen bridging, but the way it is currently implemented hasn’t been practical or necessarily safe, and it hasn’t seen much real usage. So, to put it simply: bridging hasn’t been solved.

The creator of Session seems to understand that difference in bridging technologies:

Let’s make sure we select a PoS algorithm compatible with that collaborative routing.

1 Like

“that collaborative routing” doesn’t require any specific consensus mechanism for the integrated projects. All of those projects use collateralized multisigs and can be connected through mutual networks/explicit integrations (which can be PoW or their own PoS or…).

EDIT: Just saw I’m 2m late to comment, sorry for the necrobump.

2 Likes

Is anything real going to happen in 2024 that takes the protocol closer to Proof of Stake?
There’s a lot of good research, but nobody can see anything tangible happening for Zcash right now. Is late 2025 a good estimate for when changes might actually get into the Zcash software?

@nathan-at-least @daira or anybody else who would have a clue

Yes, work is being done, but we haven’t committed to an official roadmap yet. We’ll be discussing the research completed to date and our intended path ahead at our Zeboot event at the end of the month. I’ll post a more detailed agenda this Friday. The current set of activities is documented in the TFL research DAG here.

11 Likes

I think that something that ought to be rediscussed at Zeboot is TFL vs transitioning in one step. I have been supportive of the two-step approach in the past, but personally I have begun leaning more towards transitioning in one step because of speed and complexity.

5 Likes

I never really understood why we are looking for a hybrid PoS & PoW system. Doesn’t this require rebuilding everything when we decide to switch to PoS only?

Why not go to PoS now?

The hybrid system would also only remove half of the selling pressure from miners.
And as someone on the Telegram group recently said, those who currently hold ZEC are the true believers, and I tend to agree. Best conditions for stakers in the network!

1 Like

Nate listed the pros and cons here.

So the main benefit is potentially not disrupting the ecosystem if something goes wrong with the PoS protocol, as the PoW protocol could keep going. I am not sure how valuable this benefit is in practice. I mean, there are probably 0 (or very close to 0) people in the world who rely on Zcash for their day-to-day purchases, we are already seen as a risky asset, and we don’t have any on-chain DeFi that would be disrupted either. We would take a reputational hit for sure, but our reputation is already pretty low, and I don’t see any catastrophic consequences that would justify the increased time and complexity of the two-step solution. We are already trying to increase development speed and simplicity by switching to Zebra; the hybrid protocol could cancel out those benefits.

I say just develop a pure PoS Zcash and let the difficulty bomb kill the PoW chain.

6 Likes

That was a great read, thanks @Milton.

I agree on all your points; PoS is the way to go! It’s less complex than a hybrid, which either way would most likely switch to pure PoS later on.

One thing that I would like to note: Nate listed “losing miners” among the cons.
That point needs some additional comment… we already lost the miners a long time ago (at the time we decided to go with ASICs).

They sell their ZEC for BTC as soon as they get it, and they also don’t care about security, as we can all see from how they choose mining pools (ViaBTC currently has 58% of the hashrate).

Replacing them with diamond-handed believers in Zcash is a no-brainer to me!
Let’s make it happen.

5 Likes