The Trailing Finality Layer

Hi all, we’ve published The Trailing Finality Layer: A stepping stone to proof of stake in Zcash and I’m preparing a Zcon4 Workshop: Interactive Design of a Zcash Trailing Finality Layer (Hybrid PoW+PoS) (Session Video).

This is a design that I hope can be a fairly safe / non-disruptive transition to a hybrid PoW/PoS protocol. The current state of TFL is a “design idea” at this point, with many details yet to be figured out. Please see the blog for the overview.

At this early stage, I’m especially interested in finding out if there’s community support for the general approach, and what kinds of questions, suggestions, or concerns people may have. Also, if you’re interested in collaborating or you have more concrete requests, let me know. (For example, I’ve received multiple inbound queries or requests from people working on cross-chain bridges.)

Feel free to chime in here on this topic, on this Twitter announcement post, this Bluesky announcement post, on the Zcash R&D Discord, or you can email me directly.

You can also “pre-ask” questions on the Whova page for the Zcon4 workshop! That’s a cool way for me to tailor my presentation to address questions up front. (I’m not sure if anyone can join the app to ask questions; let me know if you try and it doesn’t work.)


I will be indisposed during that particular time slot at Zcon4. However, the idea of a hybrid PoW/PoS mechanism is not new, though it never developed much, so it's more-or-less the same deal here as a design idea 👍


First, I just want to say that I support the concept of the trailing finality layer. I think it is the safest way to transition, even though it will introduce more complexity. I don’t know of any protocol that went from PoW to PoS completely cold turkey.

Now I am going to repeat my concerns from this thread.

The only real choices we have for a finalizing consensus design are Snowman (Avalanche) and Tendermint (Cosmos). Ethereum’s design has far too long a time to finality (about 15 minutes), which makes it unsuitable in my opinion. I don’t think a cryptocurrency with a long time to finality will gain any traction in day-to-day use.

Why do I believe that Snowman is a better protocol than Tendermint?

  1. Snowman supports millions of fully participating validator nodes (at least theoretically; I haven’t found the highest number actually tested), while Tendermint only supports around 200 currently. Zaki Manian has claimed on Twitter that Tendermint could support at least 1000 validators (without degrading performance) by optimizing the network stack, but that is work that the Zcash devs would have to do, because nobody else is doing it. Either way, it is clear that Snowman supports many more nodes than Tendermint.
    Why do I believe this is important? For one, it makes us more censorship resistant. Even with concentrated stake, a large number of up-and-running validators is good, because in the event we need to fork out compromised stake, we already have a big pool of alternative validators to run the fork on. I also think it is more conducive to competition between validators. Zcash, as a privacy-preserving cryptocurrency, is among the first in line for attack by nation-state-level actors. If you believe that Bitcoin or Ethereum need the number of validators they support, you must believe that Zcash needs at least as large a number of validators.
    Another reason I think this is important is because it gives a perception of decentralization and public ownership of the network. Many will reject the notion that we should give perceptions any consideration in protocol design, but the notion of a network that belongs equally to everyone, where a solo hobby home staker can participate in running the network on equal terms with a large institution, is part of what gave crypto any legitimacy. I don’t think Bitcoin would have been where it is today or gained the same public legitimacy if it had only supported 200 validators. Sure, you can argue that delegating, and verifying blocks without proposing or voting on them, are ways of participating in and validating the network, but it doesn’t feel as egalitarian as being able to fully participate in consensus.
    My last reason for believing that support for a large node count is good is that it creates more incentive to run your own full node that you can connect your mobile wallet to, thus helping reduce some of the privacy concerns with using public lightwallet servers.

  2. Snowman has a lower time to finality. From searching around and from what I can tell from Cosmos block explorers, Tendermint has a time to finality of about 6 seconds. The Avalanche C-chain has a time to finality of about 2 seconds. Some of this time difference may be due to the different virtual machines, but I think it is reasonable to assume that Snowman is a few seconds faster than Tendermint, at least with the current node counts (180 in Cosmos and 1231 in Avalanche). 4 seconds may not seem like much, but it actually matters for the user experience in day-to-day activities like buying a coffee.

The only advantage I see that Tendermint has over Snowman is that it will enable easy interop with other Tendermint chains. Some will also argue that Tendermint is more tested, but the Cosmos chain is only 1.5 years older than Avalanche, so I don’t see a big difference. Avalanche mainnet launched September 2020 while Cosmos launched March 2019. I think the disadvantages I have mentioned regarding Tendermint are more fundamental than bridges and interop and should take precedence.


Thanks! I didn’t follow that whole thread, so I’ll go back and catch up.


Thanks for the feedback!

My current thoughts.


First, I’ll address time-to-finality: for the initial trailing finality design, the PoS protocol will come to consensus about already mined PoW blocks. This means it cannot be “faster” than PoW. Furthermore, I think there may be reasons it’s safer to finalize only after a certain depth of PoW blocks.

So basically the bad news is that Trailing Finality isn’t fast. My off-the-cuff guess is that if we’re aiming to finalize at, say, ~10 blocks of depth, finality will arrive a bit under ~15min after a transaction. :-/
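As a back-of-the-envelope check (this assumes Zcash's current 75-second target block spacing, and treats the ~10-block depth as a placeholder guess, not a decided parameter):

```python
# Rough time-to-finality estimate for trailing finality.
# Assumption: Zcash's current (post-Blossom) 75-second target block spacing.
BLOCK_SPACING_SECONDS = 75

def trailing_finality_ttf_minutes(depth_blocks: int) -> float:
    """Approximate minutes until a block is finalized, if the PoS layer
    finalizes blocks that are `depth_blocks` deep in the PoW chain."""
    return depth_blocks * BLOCK_SPACING_SECONDS / 60

# Finalizing at a depth of ~10 blocks works out to roughly 12.5 minutes.
assert trailing_finality_ttf_minutes(10) == 12.5
```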

I definitely want fast finality in the future, but I think we should do trailing finality first, because it’s a safer step. My thinking is that this doesn’t help use cases that need fast transactions, such as point-of-sale purchases, but it helps with use cases that already have worse time scales (a prime example is exchange deposits).

It will also enable bridges to be built, and while a ~15m bridge transfer is not as nice as it could be, it seems feasible enough to me to be practical.

I have received one positive request that TTF is less than 30m from a cross-chain DEX / bridging project, so that’s my current primary data point there. I would really like to find out, though, from people who have product or user research experience with bridging how acceptable that is. So if anyone has info on this front, please chime in!

Now if we’re thinking longer term about a next step after “Trailing Finality” then, yeah, I think we should aim for fast finality because of the obvious usability benefits. So considering that, maybe we want a fast-finality protocol for Trailing Finality so that the subsequent transition is less disruptive (ie we don’t have to switch out the PoS protocol a second time…).

Validator Decentralization and Delegation

I’m still a fan of delegation over large validator counts for two main reasons:

  • The most fundamental reason is that I believe delegation leads to greater decentralization of staking rewards.
  • A “second order” rationale is that I believe delegation is inevitable, and if the protocol has it built in, then it can at least make the playing field more level and protect users better.

Digging in on the first:

Basically it boils down to expert specialized validator operators being able to run more efficiently than hobbyists, and especially more so than non-tech-savvy users.

This problem is much worse in an Ethereum-style protocol that requires X stake per node, because professionals can run tens of thousands of nodes at much lower cost and larger scale than hobbyists. (Also, requiring X per node shuts out anyone with less than X, which gets worse the larger X is.)

For a protocol that allows a large number of validators and allows a single validator to scale up / down their stake, this problem is greatly reduced (but I still think it’s there due to gotchas like power outages, hobbyists upgrading their home computer, etc…). Also that case still shuts out non-techies.

For the second point about “delegation is inevitable”: we already see this as a major force in Ethereum, which doesn’t have built-in delegation. So what happens? Everyone just gave their money to Coinbase or whomever. Terrible for centralization. This evolved into Lido and the decentralized delegation protocols. Those are better, but still problematic in ways.

I think a “Lido-style” approach might address the risks I see with non-delegating protocols if the separate “delegation protocol” were released at the same time as the base PoS protocol activates. (At that point, the distinction is more nuanced and is basically about how modular delegation is.)

Providing delegation directly in the protocol seems important to me especially for non-techie users: it lets them participate even if they can’t run validator software on their home system reliably.

Feedback To Consider

But you bring up three great points I haven’t yet thought about in depth:

  • Running a “validator” directly feels good and helps keep people engaged and contributing. This seems like it could be really important for a swath of hobbyist / power users, and one painful lesson from the transition to ASICs is that Zcash lost a lot of hobbyist contributors who were running small scale GPU miners. I’ll have to think more about this for sure!
  • An incentive to run a full node: I definitely think it would be a healthy development if more people ran full nodes. I don’t think it’s absolutely necessary for resilience or decentralization past a certain point, but I still come back to the previous idea: giving people a way to contribute helps them stay engaged.
  • If we have a large pool of small-time validators, the network is more resilient and capable of forking out bad actors. This also seems important, and is a weakness of the few-validators-with-delegation model.

I just want to spell out my beliefs about delegation above for comparison.

Tying it all together

Ok, all this said, here’s how my position is evolving:

  • Delegation built into the protocol seems important for decentralization (supporting non-techies, preventing centralization in big players like exchanges).
  • Enabling users to scale the amount they stake up or down is important, so I’m against “X stake per node” designs (they make economies of scale worse: Coinbase can run a hundred thousand nodes at an operational cost of pennies per node, while hobbyists pay much more per node; whereas if everyone only needs a single node, that cost disparity levels out).
  • Assuming those two conditions are met, the larger the number of validators that is possible, the better.
  • Enabling good mechanisms to allow a “user revolt” to fork out bad validators is important, and a low-validator-capped protocol might have more risk here: if the validator set is acting badly, how can “UASF-style” users set up an alternative validator and slash the bad actors?

And I think you bring up super important values and I am wondering how much the PoS protocol itself can/needs to meet them, versus if there are other ways to meet them:

  • Enable hobbyists to directly and actively contribute: running a validator is one way, can we enable other ways? Off-the-cuff brainstorms: what if wallets paid fees to lightwalletd servers, so hobbyists could run them? What if hobbyists can operate network privacy nodes (like Nym?) and get paid in ZEC?
  • Provide more benefits to running full nodes.

Thanks for replying. I just want to point out that Avalanche has delegation built into the protocol and that you can scale up and down staking amounts. Avalanche currently has a minimum of 2000 AVAX to stake, but this can be lowered to Zcash’s needs. They also currently have a minimum of 25 AVAX to delegate, and I am not actually sure why there is a minimum to delegate at all, so you might want to look into why.

You should also take into account more next-gen consensus methods like Narwhal/Bullshark, Jolteon, AptosBFT, HotShot, etc.

A theoretically higher validator amount is not necessarily the best metric to look at.

What advantages do they have over Snowman? From what I gather they are all fundamentally similar to Tendermint, but with more optimizations, because they are all classical consensus protocols.

I am not claiming that it is the only thing that matters, but I think I have already outlined why it is important, especially for a privacy preserving currency.


Here is a document I wrote up about disadvantages of leaving a potentially unbounded gap between the last finalized block and the chain tip (which is the simplest form of an ebb-and-flow protocol as described in @nathan-at-least’s blog post), and what I think should be done instead. It also argues that it is necessary to allow for the possibility of overriding finalization in order to respond to certain attacks, and that this should be explicitly modelled and subject to a well-defined governance process.

The Argument for Bounded Best-Effort Consensus


An “ebb-and-flow protocol”, as described in [NTT2021], is essentially a composition of two consensus protocols, one of which provides finality and the other dynamic availability. The protocol with finality validates a prefix of the blocks validated by the one with dynamic availability.

This is claimed by the paper to “resolve” the tension between finality and dynamic availability. However, a necessary consequence is that in a situation where the “final” protocol stalls and the “available” protocol does not, the “finality gap” between the finalization point and the chain tip can grow without bound.

In this note, we argue that this is unacceptable, and that it is preferable to sacrifice strict dynamic availability. However, we also argue that the main idea behind ebb-and-flow protocols is a good one, and that maintaining a bounded gap between the finalization point and the chain tip does in fact make sense. That is, we argue that allowing the current chain tip to run ahead of the finalization point has practical advantages, but that losing strict dynamic availability is actually preferable to the consequences of the unbounded finality gap, if/when a “long stall” in finalization occurs.

We also argue that it is beneficial to explicitly allow “finalization overrides” under the control of a well-documented governance process. Such overrides allow long rollbacks that may be necessary in the case of an exploited security flaw — because the time needed to detect such a flaw and decide whether to roll back will almost certainly be greater than the finality gap time at the point of detection. The governance process can impose a limit on the length of this long rollback if desired.

Finality + Dynamic availability implies an unbounded finality gap

Since partition between nodes sufficient for finalization cannot be prevented, the CAP theorem implies that any consistent protocol (and therefore any protocol with finality) may stall for at least as long as the partition takes to heal.

Dynamic availability implies that the current tip will continue to advance, and so the finality gap increases without bound.

Partition is not necessarily the only condition that could cause a finalization stall; it is just the one that most easily proves that this conclusion is impossible to avoid.

Problems with an unbounded finality gap

Both the available protocol and the subprotocol that provides finality will be used in practice — otherwise, one or both of them might as well not exist. There is always a risk that blocks may be rolled back to the finalization point, by definition.

Suppose, then, that there is a long finalization stall. The final and available protocols are not separate: there is no duplication of tokens between protocols, but the rules about how to determine best-effort balance and guaranteed balance depend on both protocols, how they are composed, and how the history after the finalization point is interpreted.

The guaranteed minimum balance of a given party is not just the minimum of their balance at the finalization point and their balance at the current tip. It is the minimum balance taken over all possible transaction histories that extend the finalized chain – taking into account that a party’s previously published transactions might be able to be reapplied in a different context without its explicit consent. The extent to which published transactions can be reapplied depends on technical choices that we must make, subject to some constraints (for example, we know that shielded transactions cannot be reapplied after their anchors have been invalidated). It may be desirable to further constrain re-use in order to make guaranteed minimum balances easier to compute.
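To make that distinction concrete, here is a toy sketch (all names and values are made up for illustration): the guaranteed minimum is a minimum over every balance the party could be left with under replayable extensions of the finalized chain, not just the minimum of the two observed balances.

```python
# Toy model of "guaranteed minimum balance". The key point: it is the
# minimum over ALL possible extensions of the finalized chain in which
# the party's published transactions could be reapplied -- not just
# min(finalized balance, current-tip balance).
def guaranteed_min_balance(finalized_balance: int,
                           possible_extension_balances: list[int]) -> int:
    """Minimum balance over the finalized state and every balance
    reachable by some valid extension of the finalized chain."""
    return min([finalized_balance] + possible_extension_balances)

# A party holds 100 at the finalization point and 90 at the current tip,
# but some reordering of its published transactions could leave it with 40:
assert guaranteed_min_balance(100, [90, 40]) == 40
```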

As the finalization gap increases, the negative consequences of a rollback increase. There are several possible (not mutually exclusive) outcomes:

  • Users of the currency start to consider the available protocol increasingly unreliable.
  • Users start to consider a rollback to be untenable, and lobby to prevent it or cry foul if it occurs.
  • Users start to consider finalization increasingly irrelevant. Services that depend on finalization become unavailable.
    • There is no free lunch that would allow us to avoid availability problems for services that also depend on finality.
  • Service providers adopt temporary workarounds that may not have had adequate security analysis.

Any of these might precipitate a crisis of confidence, and there are reasons to think this effect might be worse than if the chain had simply stalled. Any such crisis may have a negative effect on token prices and long-term adoption.

Note that adding finalization in this way does not by itself increase the probability of a rollback in the available chain, provided the PoW remains as secure against rollbacks of a given length as before. But that is a big proviso. We have a design constraint (motivated by limiting token devaluation and by governance issues) to limit issuance to be no greater than that of the original Zcash protocol up to a given height. Since some of the issuance is likely needed to reward staking, the amount of money available for mining rewards is reduced, which may reduce overall hash rate and security of the PoW. Independently, there may be a temptation for design decisions to rely on finalization in a way that reduces security of PoW (“risk compensation”). There is also pressure to reduce the energy usage of PoW, which necessarily reduces the global hash rate, and therefore the cost of performing an attack that depends on the adversary having any given proportion of global hash rate.

It could be argued that the issue of availability of services that depend on finality is mainly one of avoiding over-claiming about what is possible. Nevertheless I think there are also real usability issues if balances as seen by those services can differ significantly and for long periods from balances at the chain tip.

Regardless, incorrect assumptions about the extent to which the finalized and available states can differ are likely to be exposed if there is a finalization stall. And those who made the assumptions may (quite reasonably!) not accept “everything is fine, those assumptions were always wrong” as a satisfactory response.

What is Bounded Best-Effort Consensus?

The idea is simple to describe: if an unbounded finalization gap is a problem, then just enforce a bound on it. In this approach, progress of the chain tip will, eventually, also stall if a finalization stall lasts for long enough — and so we are sacrificing strict dynamic availability.

A way of describing this that may be intuitive for many people is that it works like video streaming. All video streaming services use a buffer to paper over short-term interruptions or slow-downs of network access. In most cases, this buffer is bounded. This allows the video to be watched uninterrupted and at a constant rate in most circumstances. But if there is a longer-term network failure or insufficient sustained bandwidth, the playback will unavoidably stall.
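As a minimal sketch of the rule itself (the function name and the bound value here are purely illustrative, not from any specification):

```python
# Sketch of bounded best-effort consensus: nodes refuse to extend the
# chain tip once it would run more than GAP_BOUND blocks ahead of the
# last finalized block. GAP_BOUND is an illustrative placeholder value.
GAP_BOUND = 100  # hypothetical finality gap bound, in blocks

def may_extend_tip(tip_height: int, finalized_height: int) -> bool:
    """True iff a new block at tip_height + 1 keeps the finality gap
    within the bound; otherwise the chain stalls until the finality
    layer catches up."""
    return (tip_height + 1) - finalized_height <= GAP_BOUND

assert may_extend_tip(150, 100)      # gap would be 51: chain advances
assert not may_extend_tip(200, 100)  # gap would be 101: chain stalls
```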

So, why do I advocate this over:

  a. A protocol that only provides dynamic availability;
  b. A protocol that only provides finality;
  c. An unmodified ebb-and-flow protocol?

The reason to reject a) is straightforward: finality is a valuable security property that is necessary for some use cases.

If a protocol only provides finality (option b), then short-term availability is directly tied to finalization. It may be possible to make finalization stalls sufficiently rare or short-lived that this is tolerable. But that is more likely to be possible if and when there is a well-established staking ecosystem. Before that ecosystem is established, the protocol may be particularly vulnerable to stalls. Furthermore, it’s difficult to get to such a protocol from a pure PoW system like current Zcash.

We argued in the previous section that an unbounded finality gap is bad, and that c) entails an unbounded finality gap. However, that isn’t sufficient to argue that a bounded best-effort protocol is better. Perhaps there are no good solutions! What are we gaining from a bounded best-effort approach that would justify the complexity of a hybrid protocol without obtaining strict dynamic availability?

My argument goes like this:

  • It is likely that a high proportion of the situations in which a sustained finalization stall happens will require human intervention. If the finality protocol were going to recover without intervention, there is no reason to think that it wouldn’t do so in a relatively short time.
  • When human intervention is required, the fact that the chain tip is still proceeding apace (in protocols with strict dynamic availability) makes restarting the chain harder, for many potential causes of a finalization stall. Those problems may be easier to fix when the chain tip is also stalled. This argument carries even more force when the protocol also allows “finalization overrides”, as discussed later in the Complementarity section.
  • Nothing about the bounded best-effort option prevents us from working hard to design a system that makes finalization stalls as infrequent and short-lived as possible, just as we would for any other option that provides finality.
  • We want to optimistically minimize the finality gap under good conditions, because this improves the usability of services that depend on finality. This argues against protocols that try to maintain a fixed gap, and motivates letting the gap vary up to a bound.
  • In practice, the likelihood of short finalization stalls is high enough that heuristically retaining dynamic availability in those situations is useful.

The argument that it is difficult to completely prevent finalization stalls is supported by experience on Ethereum in May 2023, when there were two stalls within 24 hours, one for about 25 minutes and one for about 64 minutes. This experience is consistent with my arguments:

  • Neither stall required short-term human intervention, and the network did in fact recover from them quickly.
  • The stalls were caused by a resource exhaustion problem in the Prysm consensus client when handling attestations. It’s plausible to think that if this bug had been more serious, or possibly if Prysm clients had made up more of the network, then it would have required a hotfix release (and/or a significant proportion of nodes switching to another client) in order to resolve the stall. So this lines up with my hypothesis that longer stalls are likely to require manual intervention.
  • A bounded best-effort protocol would very likely have resulted in either a shorter or no interruption in availability. If, say, the finality gap bound were set to be roughly an hour, then the first finalization stall would have been “papered over” and the second would have resulted in only a short chain stall.

Retaining short-term availability does not result in a risk compensation hazard:

  • A finalization stall is still very visible, and directly affects applications relying on finality.
  • Precisely because of the bounded finality gap, it is obvious that it could affect chain progress if it lasted long enough.

A potential philosophical objection to lack of dynamic availability is that it creates a centralization risk to availability. That is, it becomes more likely that a coalition of validators can deliberately cause the chain to halt. I think this objection may be more prevalent among people who would object to adding a finality layer or PoS at all.

Finalization Overrides

Consensus protocols sometimes fail. Potential causes of failure include:

  • A design problem with the finality layer that causes a stall, or allows a stall to be provoked.
  • A balance violation or spend authorization flaw that is being exploited or is sufficiently likely to be exploited.
  • An implementation bug in a widely used node implementation that causes many nodes to diverge from consensus.

In these situations, overriding finality may be better than any other alternative.

An example is a balance violation flaw due to a 64-bit integer overflow that was exploited on Bitcoin mainnet on 15th August 2010. The response was to roll back the chain to before the exploitation, which is widely considered to have been the right decision. The time between the exploit (at block height 74638) and the forked chain overtaking the exploited chain (at block height 74691) was 53 blocks, or around 9 hours.

Of course, Bitcoin used and still uses a pure PoW consensus. But the applicability of the example does not depend on that: the flaw was independent of the consensus mechanism.

Another example of a situation that prompted this kind of override was the DAO recursion exploit on the Ethereum main chain in June 2016. The response to this was the forced balance adjustment hard fork on 20th July 2016 commonly known as the DAO fork. Although this adjustment was not implemented as a rollback, and although Ethereum was using PoW at the time and did not make any formal finality guarantees, it did override transfers that would heuristically have been considered final at the fork height. Again, this flaw was independent of the consensus mechanism.

The DAO fork was of course much more controversial than the Bitcoin fork, and a substantial minority of mining nodes split off to form Ethereum Classic. In any case, the point of this example is that it’s always possible to override finality in response to an exceptional situation, and that a chain’s community may decide to do so. The fact that Ethereum 2.0 now does claim a finality guarantee would not in practice prevent a similar response in future that would override that guarantee.

The question then is whether the procedure to override finality should be formalised or ad hoc. I argue that it should be formalised, including specifying the governance process to be used.

This makes security analysis — of the consensus protocol per se, of the governance process, and of their interaction — much more feasible. Arguably a complete security analysis is not possible at all without it.

It also front-loads arguing about what procedure should be followed, and so it is more likely that stakeholders will agree to follow the process in any time-critical incident.

A way of modelling overrides that is insufficient

There is another possible way to model a protocol that claims finality but can be overridden in practice. We could say that the protocol after the override is a brand new protocol and chain (inheriting balances from the previous one, possibly modulo adjustments such as those that happened in the DAO fork).

Although that would allow saying that the finality property has technically not been violated, it does not match how users think about an override situation. They are more likely to think of it as a protocol with finality that can be violated in exceptional cases — and they would reasonably want to know what those cases are and how they will be handled. It also does nothing to help with security analysis of such cases.


Complementarity

Finalization overrides and bounded best-effort consensus are complementary in the following way: if a problem is urgent enough, then validators can be asked to stop validating. For genuinely harmful problems, it is likely that stopping will be in the interests of enough validators that this causes a finalization stall. If this lasts longer than the finality gap bound then the chain will halt, giving time for the defined governance process to occur and decide what to do. And because the unfinalized chain will not have run too far ahead, the option of a long rollback remains realistically open.

If, on the other hand, there is time pressure to make a governance decision about a rollback in order to reduce its length, that may result in a less well-considered decision.

A possible objection is that there might be a coalition of validators who ignore the request to stop (possibly including the attacker or validators that an attacker can bribe), in which case the finalization stall would not happen. But that just means that we don’t gain the advantage of more time to make a governance decision; it isn’t actively a disadvantage relative to alternative designs. This outcome can also be thought of as a feature rather than a bug: halting the chain should be a last resort, and if the argument given for the request to stop failed to convince a sufficient number of validators that it was reason enough to halt the chain, then perhaps it wasn’t a good enough reason.

It is also possible to make the argument that the threshold of stake needed is imposed by technical properties of the finality protocol and by the resources of the attacker, which might not be ideal for the purpose described above. However, I would argue that it does not need to be ideal, and will be in the right ballpark in practice.


Since I posted the above note, I’ve been thinking about potential attacks on protocols using the bounded best-effort model.

Tail-thrashing attacks

There is an important class of potential attacks based on the fact that when the unfinalized chain stalls, an adversary has more time to find blocks, and this might violate security assumptions of the more-available protocol. For instance, if the more-available protocol is PoW-based, then its security in the steady state is predicated on the fact that an adversary with a given proportion of hash power has only a limited time to use that power, before the rest of the network finds another block. During a chain stall this is not the case. If, say, the adversary has 10% hash power, then it can on average find a block in 10 block times. And so in 100 block times it can create a 10-block fork.
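The arithmetic behind that example is just a simple expected-value estimate (ignoring variance in block-finding times):

```python
# During a chain stall, an adversary with fraction `p` of the global
# hash power is no longer racing the honest chain; the only limit on a
# private fork of `d` blocks is time. Expected time is d / p block times.
def expected_fork_time_in_block_times(d: int, p: float) -> float:
    """Expected time, in block times, to find d blocks at rate p."""
    return d / p

# With 10% of the hash power: ~10 block times per block on average,
# so a 10-block fork in ~100 block times.
assert abs(expected_fork_time_in_block_times(1, 0.10) - 10.0) < 1e-9
assert abs(expected_fork_time_in_block_times(10, 0.10) - 100.0) < 1e-9
```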

It may in fact be worse than this: once miners know that a finalization stall is happening, their incentive to continue mining is reduced, since they know that there is a greater chance that their blocks might be rolled back. So we would expect the global hash rate to fall —even before the finality gap bound is hit— and then the adversary would have a greater proportion of hash rate.

Even in a pure ebb-and-flow protocol, a finalization stall could cause miners to infer that their blocks are more likely to be rolled back, but the fact that the chain is continuing would make this more difficult to exploit. This issue with the global hash rate is mostly specific to the more-available protocol being PoW: if it were PoS, then its validators might as well continue proposing blocks, because it is cheap to do so. There might be other attacks when the more-available protocol is PoS; I haven’t spent much time analysing that case.

The problem here is that it may have been assumed from the earlier description that the more-available chain would just halt during a chain stall. But in fact, for a finality gap bound of k blocks, an adversary could cause the k-block “tail” of the chain as seen by any given node to “thrash” between different chains. I will call this a tail-thrashing attack.

If a protocol allowed such attacks then it would be a regression relative to the security we would normally expect from an otherwise similar PoW-based protocol. It only occurs during a chain stall, but note that we cannot exclude the possibility of an adversary being able to provoke a chain stall.

Let’s put a pin in solving this problem, because there is another issue that we need to consider, and my preferred approach addresses both.

Finalized Prefix Availability

In the absence of security flaws and under the security assumptions required by the finality layer, the finalization point will not be seen by any honest node to roll back. However, that does not imply that all nodes will see the same finalized height — which is impossible given network delays and unreliable messaging.

In order to optimize the availability of applications that require finality, we need to consider availability of the information, e.g. validator proposals and (possibly aggregate) signatures, needed to finalize the chain up to a particular point. It is possible to incentivize distribution of that information by piggy-backing it on block headers, so that block producers have to include it in order to obtain the block production reward (e.g. mining reward) for the more-available protocol.

Obviously the piggy-backed information has to be for finalization only up to some previous block, because the information for later blocks wasn’t yet available to the block producer.

Optionally, we could incentivize the block producer to include the latest information it has, for example by burning part of the block reward or by giving the producer some limited mining advantage that depends on how many blocks back the finalization information is. Alternatively we could just rely on the fact that some proportion of block producers are honest and will include the latest information they have.

Suppose that for a k-block finality gap bound, we required each block header to include the information necessary for a node to finalize to k blocks back, given that the node has already finalized to k+1 blocks back. This would automatically implement the finality gap bound without any further explicit check, because it would be impossible to produce a block after the bound. But it would also guarantee that availability of the finalized prefix, at least up to the bound, is incentivized to the same extent as the more-available protocol.

In the Ebb-and-Flow paper, we also have information from the more-available protocol being used in the BFT protocol (top of right column, page 7):

In addition, \mathsf{ch}^t_i is used as side information in \Pi_{\mathrm{bft}} to boycott the finalization of invalid snapshots proposed by the adversary.

This does not cause any circularity. In fact, it means that BFT validators have to listen to the block transmission protocol anyway, so that could be also the protocol over which BFT communication occurs. That is, BFT validators could listen for block headers and get all the information needed to make and broadcast their own signatures or proposals. (A possible reason not to broadcast individual signatures to all nodes is that with large numbers of validators, the proof that a sufficient proportion of validators/stake has signed can use an aggregate signature, which could be much smaller.)
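As a rough illustration of why aggregation matters at scale — the byte sizes here are illustrative assumptions (64-byte individual signatures versus a 48-byte BLS-style aggregate plus a signer bitmap); the actual signature scheme is not specified in this design:

```python
def individual_sigs_bytes(n_validators: int, sig_size: int = 64) -> int:
    """Total size if every validator's signature is broadcast separately."""
    return n_validators * sig_size

def aggregate_sig_bytes(n_validators: int, agg_size: int = 48) -> int:
    """One aggregate signature plus a bitmap recording which validators signed."""
    bitmap = (n_validators + 7) // 8
    return agg_size + bitmap

n = 1000
print(individual_sigs_bytes(n))  # → 64000
print(aggregate_sig_bytes(n))    # → 173
```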

Back to the tail-thrashing problem

Note that in ebb-and-flow-based protocols, snapshots of the “longest chain” protocol \Pi_{\mathrm{lc}} are used as input to the BFT protocol. That implies that the tail-thrashing problem could also affect the input to that protocol, which would be bad (not least for security analysis of availability, which seems somewhat intractable in that case).

Also, when we restart the more-available chain, we would need to take account of the fact that the adversary has had an arbitrary length of time to build long chains from every block that we could potentially restart from. It could be possible to invalidate those chains by requiring blocks after the restart to be dependent on fresh randomness, but that sounds quite tricky (especially given that we want to restart without manual intervention if possible), and there may be other attacks I haven’t thought of.

So, instead of trying to directly solve the tail-thrashing problem, we will avoid it. We will say that if a block producer cannot give the information needed to advance the finalization point to at least k blocks back, then it must produce a coinbase-only block.

This achieves pretty much the same effect, for our purposes, as actually stalling the more-available chain. Since funds cannot be spent in coinbase-only blocks, the vast majority of attacks that we are worried about would not be exploitable in this state.
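A minimal sketch of this fallback rule, using hypothetical names (`best_finalizable` stands for the highest height for which the producer can supply finalization information):

```python
def block_kind(best_finalizable: int, this_height: int, k: int) -> str:
    """A producer that cannot advance the finalization point to at least
    k blocks back must produce a coinbase-only block."""
    if best_finalizable >= this_height - k:
        return "normal"
    return "coinbase-only"

# With k = 3, a producer at height 100 needs finalization info up to height 97.
print(block_kind(97, 100, 3))  # → normal
print(block_kind(90, 100, 3))  # → coinbase-only
```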

It is possible that a security flaw could affect coinbase transactions. We might want to turn off shielded coinbase for those blocks in order to reduce the chance of that.

Also, mining rewards cannot be spent in a coinbase-only block; in particular, mining pools cannot distribute rewards. So there is a risk that an unscrupulous mining pool might try to do a rug-pull after mining of non-coinbase-only blocks resumes, if there was a very long finalization stall. But this approach works at least in the short term, and probably for long enough to allow manual intervention into the finalization protocol, or governance processes if needed.

There’s a caveat: part of the reason we wanted the more-available chain to also stall is to make it more acceptable to do a rollback, possibly as far as the stalled finalization point — maybe earlier if the stall was triggered deliberately in response to noticing a sufficiently serious attack on-chain, and if governance decisions allow it. What happens to incentives of miners on a chain that might be rolled back in that way?

This is actually fairly easy to solve. We have the governance procedures say that if we do an intentional rollback, the coinbase-only mining rewards will be preserved. I.e. we mine a block or blocks that include those rewards paid to the same addresses (adjusting the consensus to allow them to be created from thin air if necessary), have everyone check it thoroughly, and require the chain to restart from that block. So as long as miners believe that this procedure will be followed and that the chain will eventually recover at a reasonable coin price, they will still have incentive to mine on the \Pi_{\mathrm{lc}} chain, at least for a time.

There’s one more thing to solve: when we resume mining non-coinbase-only blocks, the rule that

Each block header must include the information necessary for a node to finalize to k blocks back, given that the node has already finalized to k+1 blocks back.

will not be sufficient to catch up the finalization point using only information from block headers. We can solve this by adjusting the rule as follows (this also includes not enforcing it for coinbase-only blocks):

Let \mathsf{thisHeight} be the height of this block, and let \mathsf{prevFinal} be the height to which finalization information has been given by the header chain up to and including this block’s parent. If the block is not coinbase-only, then the block header MUST include the information necessary for a node to finalize to height \mathsf{min}(\mathsf{prevFinal} + 100, \mathsf{thisHeight} - k), given that the node has already finalized to height \mathsf{prevFinal}. (No information is needed if \mathsf{thisHeight} - k \leq \mathsf{prevFinal}.)

This allows nodes to catch up quickly (at a rate of 100 finalized blocks per new block) while still keeping the size of finalization information in each block bounded.
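The adjusted rule can be sketched as follows. This is an illustrative reading of the rule above, not normative; the 100-block catch-up rate and the names follow the text:

```python
from typing import Optional

CATCH_UP_RATE = 100  # finalized blocks that can be caught up per new block

def required_finalization_target(this_height: int, prev_final: int, k: int,
                                 coinbase_only: bool) -> Optional[int]:
    """Height to which this block's header must supply finalization
    information, or None if no information is required."""
    if coinbase_only:
        return None  # the rule is not enforced for coinbase-only blocks
    if this_height - k <= prev_final:
        return None  # header chain is already finalized far enough back
    return min(prev_final + CATCH_UP_RATE, this_height - k)

# Steady state with k = 3: each block finalizes to exactly k blocks back.
print(required_finalization_target(100, 96, 3, False))     # → 97
# Catching up after a long stall: capped at 100 blocks per new block.
print(required_finalization_target(5000, 1000, 3, False))  # → 1100
```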


Exploring each method’s advantages over Snowman is a research paper in its own right, and if it seemed useful, would be part of any research made towards transitioning to PoS.

These are all very different consensus protocols; maybe the only clear difference is that Snowman is leaderless while the aforementioned are not.


Hi all,

I wanted to share an update on PoS research at ECC.

The most recent event is that I gave this presentation at Zcon4 on TFL. I’d like to share some of the feedback I got in that session and from multiple conversations at Zcon, my reaction to it, what we’ve been up to since then, and what we’re planning next.

Zcon4 Feedback

At Zcon I got two kinds of feedback:

  • Could it be better to take an alternative approach where we transition Zcash from pure PoW to pure PoS in one step?
  • Does TFL have security flaws, and if so, are they show-stoppers or something we can mitigate by improving the design?

Alternative Approach: A Single Complete Transition

I’ve thought a bit about these suggestions since Zcon4, mainly about the perceived trade-offs versus a TFL-style approach (or any multi-step approach), as well as timelines, resources, and the trade-off between considering many alternatives versus making further progress on specific ones. You can read more about the trade-offs we’ve identified on this github ticket comment.

The gist of it is that we intend to continue developing TFL until it’s either a viable ZIP-worthy proposal or we identify specific blockers. We provide more explanation for this decision in this github ticket comment.

This doesn’t mean we wouldn’t welcome others to flesh out more concrete alternative proposals! If anyone does so, that could be helpful for Zcash.

However, given our limited bandwidth at ECC (especially since we’re currently highly focused on Emergency Mode) our goal is to produce one good specific proposal for everyone to evaluate, so we’ll continue refining the TFL approach.

TFL Security

The second category of feedback on TFL was around security concerns. As you can see in my presentation, there are multiple unresolved considerations to understand the security of TFL.

One thing to keep in mind about security: it is often more of a continuum, weaker or stronger along some specific axis. But at the end of the day, what users need is a simple “yes / no” answer to “is it safe enough for my needs?” So we might argue some design is safe enough, even though it doesn’t have maximal security compared to other known alternatives. Any argument about “safe enough” deserves a lot of scrutiny, because it can be extremely hard to know what users’ needs and risk tolerances are, and even when we do, it can be very hard to measure security against them. Nevertheless, we believe the best designs make conscientious trade-offs, including for security. None of this is to suggest we should diminish the Zcash development community’s top-notch record and culture on strong security.

Ok, with that, let’s dig into the primary concern we took away from the Zcon4 feedback about security. While there are multiple ways to frame the concern, a plain-English summary goes like this:

Often the security of a system is limited by its weakest link. In the TFL design there are two core subsystems: PoW and PoS. So the core concern is: will the overall security be limited by the weaker of the guarantees provided by these two subsystems?

And that brings me to what we’ve been focused on since Zcon4:

Progress since Zcon4

All of our PoS research effort since Zcon4 has been focused on two goals: organizing the TFL roadmap and analyzing the “weakest link” security issue above.

If you want to see under the covers of the crude/early roadmap, we’re using github milestones with these two initial milestones:

  • Design Phase 1 specifies a simplified subset of TFL more precisely, based primarily on Ebb-and-Flow (and odds and ends we’ve picked up from studying Ethereum’s transition and its Gasper protocol). This milestone excludes most of the mechanics of PoS itself.
  • Implementation Phase 1 specifies a codebase for simulating attacks. We want to get started on coding early in the design process so that as the design improves we’ll have working, though incomplete, prototypes. As the design matures we hope to have a fully functional testnet.

What’s Next?

We’ll keep posting updates as we go, and we’ll be refining and adding new R&D milestones. ECC will also be publishing more top-level roadmap posts that will give people a higher-level understanding of PoS R&D progress (along with all our other efforts), so if you’d prefer a high-level view of only the major milestones, stay tuned for those.


I added a long Addendum section to this note, which describes attacks and mitigations specific to Bounded Best-Effort consensus. If you liked the original note, you’ll want to read it (or the version of it in the TFL Book)!


Just wanted to post the link to the Zapa GitHub in case you hadn’t seen it @nathan-at-least: a project from a while back where someone modified zcashd to use Snowman consensus.


We’ve just published v0.1.0 of the TFL Book.

This version of the design includes the Crosslink construction, a hybrid consensus protocol which integrates PoW and PoS subprotocols. The Crosslink doc has security proofs of safety and liveness properties. This represents the first milestone with a specific abstract hybrid PoW/PoS protocol defined.

This v0.1.0 version has the content split between the TFL Book and a separate hackmd defining the Crosslink construction. The next milestone will integrate the Crosslink definition into the book, ensuring it’s self-consistent.

The short-to-medium term roadmap for Trailing Finality Layer is to bolster this initial version of Crosslink with simulations of various attacks and more in-depth analysis of attacks, as well as beginning to examine real-world PoS protocols to integrate into the hybrid construction.

For the medium-term roadmap (within the next ~3 months), we’re considering a variety of steps beyond those defined above:

  • Complete TFL protocol:
    • Define a specific PoS subprotocol integration.
    • Define staking operations which are integrated with Zcash transaction semantics.
    • Define staking parameters and analyze their security and economic properties.
  • Begin creating a prototype / testnet.
  • Crosslink:
    • Get broader review of Crosslink security.
    • Compare Crosslink to other known viable hybrid protocols.
  • Begin designing a transition plan for transitioning Zcash from PoW to PoW+TFL.

In addition to the direct work on Trailing Finality Layer, we’re aware of multiple projects or people interested in bridging protocols or systems, so we hope to engage with them and ensure the design of Trailing Finality Layer can integrate and/or support trust-minimized bridge protocols.

As always, feedback is appreciated!

Shout out to Daira Emma Hopwood who has done most of the research and design behind Crosslink. <3


I have been hoping for movement in this direction for nearly four years, so this is really exciting to hear! Zcash <-> Cosmos Integration

Although I’m now on the sidelines of the Zcash effort, I would implore the Zcash community to choose Tendermint/CometBFT off-the-shelf as the PoS implementation, rather than putting effort into building or customizing it in any way. It is written in Go, but there is no problem interacting with it from other languages like Rust using ABCI.

This approach has two enormous advantages:

  • Light Client Compatibility: using CometBFT off the shelf means that the state of the PoS subprotocol can be verified using an off-the-shelf light client, and importantly, it can be verified on-chain by every Cosmos chain. This opens the door to IBC connections to Zcash.
  • No custom engineering: using CometBFT off the shelf means that Zcash can skip the engineering effort of designing and building a new consensus mechanism, and can reuse the investment made by the entire Cosmos ecosystem.

At this point there’s no problem with using CometBFT to drive the state of a Rust application (and have that Rust application implement custom staking logic). We do so for Penumbra; you could look at our source code as an example and potentially also reuse our stack. For instance, Namada uses our tower-abci library for interfacing with CometBFT and Astria uses penumbra-storage for state management, penumbra-component for application structuring, and penumbra-ibc for an IBC implementation.

If the Zcash TFL design pulled CometBFT off-the-shelf, it could potentially go the Astria route, reuse parts of the Penumbra stack, and have a drop-in IBC implementation for Zcash on a very accelerated timeline (completion within your medium term roadmap). We’d love for the Zcash community to be able to take advantage of the code we’ve written and we’d be happy to answer questions about it.


Thanks for the feedback.

For the record, I’ve personally preferred each of the CometBFT components (SDK, consensus protocol, IBC) as prime candidates for Zcash. You can see that prototyping with CometBFT is in the first set of tickets for the TFL Book. On our current roadmap we need to first do the design work (aka “adapting it” for Crosslink), and that doesn’t start until the fourth milestone out.

I appreciate you laying out the strengths, and I concur with all of those. (Also, I’ve stumbled across Penumbra rust crates a few times and started experimenting with them before realizing where they came from. :wink: They seem well designed from my cursory explorations.)

Do you think there are any drawbacks to selecting CometBFT consensus protocol vs any alternatives? (Assume for the sake of argument other candidate protocols had well engineered off the shelf SDKs and supported IBC.)

In particular I’ve seen a fair amount of discussion on these forums about different PoS protocol characteristics (and I haven’t examined all of the suggested alternatives yet).

Are you aware of efforts to productionize alternative consensus protocols in CometBFT SDK?

Are there any alternatives to IBC worth considering?


I’ve personally preferred each of the CometBFT components (SDK, … )

To clarify, I would not recommend trying to use the Cosmos SDK; it’s not well-shaped to your problem and it’s written in Go. I’d recommend building a Rust app directly on top of CometBFT, where you can have complete control of the application layer, and where you could reuse parts of the Penumbra stack:

  • cometbft, an out-of-process, standalone consensus server that acts as an ABCI client talking to your Rust app;
  • tendermint-rs datatypes for modeling Tendermint data like block headers, which you’ll need to implement your custom staking logic;
  • tower-abci, an ABCI server library that runs a listener and converts ABCI wire data into async request/response pairs handled by your application;
  • penumbra-storage as the storage backend, which manages chain state in a verifiable K/V store backed by a Jellyfish tree, allows transactional writes using CoW snapshots, and also provides a GRPC interface for making provable queries about chain state (for use by, e.g., IBC relayers);
  • penumbra-component, whose Component and ActionHandler traits allow you to structure the application logic into components that can depend on each other (e.g., in Penumbra, the shielded pool is a component, the staking system is a component, etc.);
  • penumbra-ibc as a Component that provides an off-the-shelf IBC implementation. It handles all of the IBC state machine and all of the IBC events needed to inform relayers about IBC data they might need to handle. It also provides GRPC interfaces that relayers can use to relay packets back and forth. The actual packet handling logic (what to do if someone sends you an IBC packet) is up to your application.

All of these libraries were built for our use case originally but don’t have other Penumbra-specific logic in them; this is the same part of our stack that Astria is currently reusing. Our Penumbra-specific logic is built in other crates on top of these, which wouldn’t make sense for Zcash to reuse.

Do you think there are any drawbacks to selecting CometBFT consensus protocol vs any alternatives? (Assume for the sake of argument other candidate protocols had well engineered off the shelf SDKs and supported IBC.)

Yes, assuming for the sake of argument that other candidate protocols had well-engineered, off-the-shelf SDKs and supported IBC, there would be various drawbacks to using CometBFT:

In the end though, none of this matters, because there are no other candidate protocols with well-engineered, off-the-shelf SDKs and IBC support. CometBFT is the only mature BFT consensus protocol developed independently of a specific chain. The Narwhal/Bullshark implementation I linked above, for instance, is just a research prototype. So in practical terms you are stuck with either using CometBFT or going off into the wilderness.

Are there any alternatives to IBC worth considering?

No, for kind of similar reasons. One thing to highlight is that there are two versions of IBC: “IBC as specified”, which is a very flexible mechanism for verifiable messaging between verifiable state machines, and “IBC as deployed”, where communications between Cosmos-SDK-on-Tendermint chains are much smoother than in other configurations. At a high level, there are three parts of IBC:

  • the light client part (chains A and B run on-chain light clients of each other to verify each others’ state roots);
  • the merkle proofs part, usually implemented with ICS23, a DSL for specifying generic Merkle proof verification programs and program specifications, so that chains can verify each others’ proofs without special support for each Merkle tree;
  • the application data part: specific applications like ICS20 (fungible token transfers).

For the Penumbra stack, for instance, the light client part is off-the-shelf; the merkle proofs part is provided by the jmt and penumbra-storage crates, which provide an ICS23 proof specification and query interfaces respectively; and the penumbra-ibc crate handles writing all the messages in the right places and provides hooks for customizing application behavior.

To actually operationalize this you’ll also need a relayer implementation, which can subscribe to events on both chains and post data back and forth. The relayer implementation needs endpoints to subscribe to; if you use penumbra-ibc those come included. We’ve also upstreamed changes to Hermes to make it chain-agnostic, so you could reuse that work too.

There are no general alternatives to IBC, because nobody else has tried to build a fully general-purpose networking system for verifiable state machines. There are a lot of subtle implementation-specific behaviors inherited from ibc-go, so it would save a lot of time to reuse an existing IBC implementation.


I’d argue Substrate meets the definition of a chain-independent BFT layer as much as CometBFT does. CometBFT defines the ABCI; Substrate defines a (non-networked) runtime API (which could, theoretically, be networked).

Substrate is a single SDK, in Rust, which would meet all of the above features of CometBFT except IBC (and it directly includes an SDK, unlike the Cosmos SDK in Go or the collection of works from Penumbra, who I quite respect as a project).

The P2P layer is a LibP2P-based solution, not a completely custom one.

The mempool logic in Substrate is already modular.

As is consensus, meaning it can be used with BABE+GRANDPA (which trades off per-block finality for much greater performance, at least when comparing literal implementations) or anything else implemented. I previously implemented Tendermint for Substrate, and there are a few other consensus algorithms around (Aura instead of BABE for block production, SASSAFRAS as a research WIP, Aleph Zero’s AlephBFT (which is asynchronous, like Narwhal and Bullshark)…).

Do I recommend Substrate? Definitely not if you want IBC at this time.

Do I recommend Substrate if Zcash doesn’t want IBC at this time? Maybe. There is work on a GRANDPA-IBC solution, meaning it wouldn’t be cutting IBC off forever. I have my variety of complaints/comments on it, but I’m sure people experienced with CometBFT/Cosmos have their own complaints/comments. I really don’t care to advocate for it at depth in this context, solely to say CometBFT isn’t the only option.

My main question is whether Zcash should have a consensus-layer blockchain at all. Once a block has n confirmations, can’t a commit be made on top of it without an entire new blockchain? I’d advocate staying that minimal for now, if possible (though I’ll confess to not having read the 0.1 book yet due to time constraints, and apologize if this is raising a discussion already had).


I am very happy and appreciative that we’re having this conversation on the forum.

I am not particularly well versed in the difference between PoS systems, however I understand how much of a tectonic shift it is for Zcash, and how it has the potential to enable critical features.

Bridging is a major feature I am looking for. Being able to exchange with other ecosystems is indeed critical. For a long time now we’ve seen bridging attempts, but the way bridging is currently implemented hasn’t been practical or necessarily safe, and hasn’t seen much real usage. So, to put it simply: it hasn’t been solved.

The creator of Session seems to understand that difference in bridging technologies:

Let’s make sure we select a PoS algorithm compatible with that collaborative routing.


“that collaborative routing” doesn’t require any specific consensus mechanism for the integrated projects. All of those projects use collateralized multisigs and can be connected through mutual networks/explicit integrations (which can be PoW or their own PoS or…).

EDIT: Just saw I’m 2m late to comment, sorry for the necrobump.