The Trailing Finality Layer

Here is a document I wrote up about disadvantages of leaving a potentially unbounded gap between the last finalized block and the chain tip (which is the simplest form of an ebb-and-flow protocol as described in @nathan-at-least’s blog post), and what I think should be done instead. It also argues that it is necessary to allow for the possibility of overriding finalization in order to respond to certain attacks, and that this should be explicitly modelled and subject to a well-defined governance process.

The Argument for Bounded Best-Effort Consensus

Background

An “ebb-and-flow protocol”, as described in [NTT2021], is essentially a composition of two consensus protocols, one of which provides finality and the other dynamic availability. The protocol with finality validates a prefix of the blocks validated by the one with dynamic availability.

This is claimed by the paper to “resolve” the tension between finality and dynamic availability. However, a necessary consequence is that in a situation where the “final” protocol stalls and the “available” protocol does not, the “finality gap” between the finalization point and the chain tip can grow without bound.

In this note, we argue that this is unacceptable, and that it is preferable to sacrifice strict dynamic availability. However, we also argue that the main idea behind ebb-and-flow protocols is a good one, and that maintaining a bounded gap between the finalization point and the chain tip does in fact make sense. That is, we argue that allowing the current chain tip to run ahead of the finalization point has practical advantages, but that losing strict dynamic availability is actually preferable to the consequences of the unbounded finality gap, if/when a “long stall” in finalization occurs.

We also argue that it is beneficial to explicitly allow “finalization overrides” under the control of a well-documented governance process. Such overrides allow long rollbacks that may be necessary in the case of an exploited security flaw — because the time needed to detect such a flaw and decide whether to roll back will almost certainly be greater than the finality gap time at the point of detection. The governance process can impose a limit on the length of this long rollback if desired.

Finality + Dynamic availability implies an unbounded finality gap

Since partition between nodes sufficient for finalization cannot be prevented, the CAP theorem implies that any consistent protocol (and therefore any protocol with finality) may stall for at least as long as the partition takes to heal.

Dynamic availability implies that the current tip will continue to advance, and so the finality gap increases without bound.

Partition is not necessarily the only condition that could cause a finalization stall, it is just the one that most easily proves that this conclusion is impossible to avoid.

Problems with an unbounded finality gap

Both the available protocol, and the subprotocol that provides finality, will be used in practice — otherwise, one or both of them might as well not exist. There is always a risk that blocks may be rolled back to the finalization point, by definition.

Suppose, then, that there is a long finalization stall. The final and available protocols are not separate: there is no duplication of tokens between protocols, but the rules about how to determine best-effort balance and guaranteed balance depend on both protocols, how they are composed, and how the history after the finalization point is interpreted.

The guaranteed minimum balance of a given party is not just the minimum of their balance at the finalization point and their balance at the current tip. It is the minimum balance taken over all possible transaction histories that extend the finalized chain – taking into account that a party’s previously published transactions might be able to be reapplied in a different context without its explicit consent. The extent to which published transactions can be reapplied depends on technical choices that we must make, subject to some constraints (for example, we know that shielded transactions cannot be reapplied after their anchors have been invalidated). It may be desirable to further constrain re-use in order to make guaranteed minimum balances easier to compute.

As the finalization gap increases, the negative consequences of a rollback increase. There are several possible (not mutually exclusive) outcomes:

  • Users of the currency start to consider the available protocol increasingly unreliable.
  • Users start to consider a rollback to be untenable, and lobby to prevent it or cry foul if it occurs.
  • Users start to consider finalization increasingly irrelevant. Services that depend on finalization become unavailable.
    • There is no free lunch that would allow us to avoid availability problems for services that also depend on finality.
  • Service providers adopt temporary workarounds that may not have had adequate security analysis.

Any of these might precipitate a crisis of confidence, and there are reasons to think this effect might be worse than if the chain had simply stalled. Any such crisis may have a negative effect on token prices and long-term adoption.

Note that adding finalization in this way does not by itself increase the probability of a rollback in the available chain, provided the PoW remains as secure against rollbacks of a given length as before. But that is a big proviso. We have a design constraint (motivated by limiting token devaluation and by governance issues) to limit issuance to be no greater than that of the original Zcash protocol up to a given height. Since some of the issuance is likely needed to reward staking, the amount of money available for mining rewards is reduced, which may reduce overall hash rate and security of the PoW. Independently, there may be a temptation for design decisions to rely on finalization in a way that reduces security of PoW (“risk compensation”). There is also pressure to reduce the energy usage of PoW, which necessarily reduces the global hash rate, and therefore the cost of performing an attack that depends on the adversary having any given proportion of global hash rate.

It could be argued that the issue of availability of services that depend on finality is mainly one of avoiding over-claiming about what is possible. Nevertheless I think there are also real usability issues if balances as seen by those services can differ significantly and for long periods from balances at the chain tip.

Regardless, incorrect assumptions about the extent to which the finalized and available states can differ are likely to be exposed if there is a finalization stall. And those who made the assumptions may (quite reasonably!) not accept “everything is fine, those assumptions were always wrong” as a satisfactory response.

What is Bounded Best-Effort Consensus?

The idea is simple to describe: if an unbounded finalization gap is a problem, then just enforce a bound on it. In this approach, progress of the chain tip will, eventually, also stall if a finalization stall lasts for long enough — and so we are sacrificing strict dynamic availability.

A way of describing this that may be intuitive for many people is that it works like video streaming. All video streaming services use a buffer to paper over short-term interruptions or slow-downs of network access. In most cases, this buffer is bounded. This allows the video to be watched uninterrupted and at a constant rate in most circumstances. But if there is a longer-term network failure or insufficient sustained bandwidth, the playback will unavoidably stall.

So, why do I advocate this over:

  1. A protocol that only provides dynamic availability;
  2. A protocol that only provides finality;
  3. An unmodified ebb-and-flow protocol?

The reason to reject a) is straightforward: finality is a valuable security property that is necessary for some use cases.

If a protocol only provides finality (option b), then short-term availability is directly tied to finalization. It may be possible to make finalization stalls sufficiently rare or short-lived that this is tolerable. But that is more likely to be possible if and when there is a well-established staking ecosystem. Before that ecosystem is established, the protocol may be particularly vulnerable to stalls. Furthermore, it’s difficult to get to such a protocol from a pure PoW system like current Zcash.

We argued in the previous section that an unbounded finality gap is bad, and that c) entails an unbounded finality gap. However, that isn’t sufficient to argue that a bounded best-effort protocol is better. Perhaps there are no good solutions! What are we gaining from a bounded best-effort approach that would justify the complexity of a hybrid protocol without obtaining strict dynamic availability?

My argument goes like this:

  • It is likely that a high proportion of the situations in which a sustained finalization stall happens will require human intervention. If the finality protocol were going to recover without intervention, there is no reason to think that it wouldn’t do so in a relatively short time.
  • When human intervention is required, the fact that the chain tip is still proceeding apace (in protocols with strict dynamic availability) makes restarting the chain harder, for many potential causes of a finalization stall. Those problems may be easier to fix when the chain tip is also stalled. This argument carries even more force when the protocol also allows “finalization overrides”, as discussed later in the Complementarity section.
  • Nothing about the bounded best-effort option prevents us from working hard to design a system that makes finalization stalls as infrequent and short-lived as possible, just as we would for any other option that provides finality.
  • We want to optimistically minimize the finality gap under good conditions, because this improves the usability of services that depend on finality. This argues against protocols that try to maintain a fixed gap, and motivates letting the gap vary up to a bound.
  • In practice, the likelihood of short finalization stalls is high enough that heuristically retaining dynamic availability in those situations is useful.

The argument that it is difficult to completely prevent finalization stalls is supported by experience on Ethereum in May 2023, when there were two stalls within 24 hours, one for about 25 minutes and one for about 64 minutes. This experience is consistent with my arguments:

  • Neither stall required short-term human intervention, and the network did in fact recover from them quickly.
  • The stalls were caused by a resource exhaustion problem in the Prysm consensus client when handling attestations. It’s plausible to think that if this bug had been more serious, or possibly if Prysm clients had made up more of the network, then it would have required a hotfix release (and/or a significant proportion of nodes switching to another client) in order to resolve the stall. So this lines up with my hypothesis that longer stalls are likely to require manual intervention.
  • A bounded best-effort protocol would very likely have resulted in either a shorter or no interruption in availability. If, say, the finality gap bound were set to be roughly an hour, then the first finalization stall would have been “papered over” and the second would have resulted in only a short chain stall.

Retaining short-term availability does not result in a risk compensation hazard:

  • A finalization stall is still very visible, and directly affects applications relying on finality.
  • Precisely because of the bounded finality gap, it is obvious that it could affect chain progress if it lasted long enough.

A potential philosophical objection to lack of dynamic availability is that it creates a centralization risk to availability. That is, it becomes more likely that a coalition of validators can deliberately cause the chain to halt. I think this objection may be more prevalent among people who would object to adding a finality layer or PoS at all.

Finalization Overrides

Consensus protocols sometimes fail. Potential causes of failure include:

  • A design problem with the finality layer that causes a stall, or allows a stall to be provoked.
  • A balance violation or spend authorization flaw that is being exploited or is sufficiently likely to be exploited.
  • An implementation bug in a widely used node implementation that causes many nodes to diverge from consensus.

In these situations, overriding finality may be better than any other alternative.

An example is a balance violation flaw due to a 64-bit integer overflow that was exploited on Bitcoin mainnet on 15th August 2010. The response was to roll back the chain to before the exploitation, which is widely considered to have been the right decision. The time between the exploit (at block height 74638) and the forked chain overtaking the exploited chain (at block height 74691) was 53 blocks, or around 9 hours.

Of course, Bitcoin used and still uses a pure PoW consensus. But the applicability of the example does not depend on that: the flaw was independent of the consensus mechanism.

Another example of a situation that prompted this kind of override was the DAO recursion exploit on the Ethereum main chain in June 2016. The response to this was the forced balance adjustment hard fork on 20th July 2016 commonly known as the DAO fork. Although this adjustment was not implemented as a rollback, and although Ethereum was using PoW at the time and did not make any formal finality guarantees, it did override transfers that would heuristically have been considered final at the fork height. Again, this flaw was independent of the consensus mechanism.

The DAO fork was of course much more controversial than the Bitcoin fork, and a substantial minority of mining nodes split off to form Ethereum Classic. In any case, the point of this example is that it’s always possible to override finality in response to an exceptional situation, and that a chain’s community may decide to do so. The fact that Ethereum 2.0 now does claim a finality guarantee, would not in practice prevent a similar response in future that would override that guarantee.

The question then is whether the procedure to override finality should be formalised or ad hoc. I argue that it should be formalised, including specifying the governance process to be used.

This makes security analysis — of the consensus protocol per se, of the governance process, and of their interaction — much more feasible. Arguably a complete security analysis is not possible at all without it.

It also front-loads arguing about what procedure should be followed, and so it is more likely that stakeholders will agree to follow the process in any time-critical incident.

A way of modelling overrides that is insufficient

There is another possible way to model a protocol that claims finality but can be overridden in practice. We could say that the protocol after the override is a brand new protocol and chain (inheriting balances from the previous one, possibly modulo adjustments such as those that happened in the DAO fork).

Although that would allow saying that the finality property has technically not been violated, it does not match how users think about an override situation. They are more likely to think of it as a protocol with finality that can be violated in exceptional cases — and they would reasonably want to know what those cases are and how they will be handled. It also does nothing to help with security analysis of such cases.

Complementarity

Finalization overrides and bounded best-effort consensus are complementary in the following way: if a problem is urgent enough, then validators can be asked to stop validating. For genuinely harmful problems, it is likely to be in the interests of enough validators to stop that this causes a finalization stall. If this lasts longer than the finality gap bound then the chain will halt, giving time for the defined governance process to occur and decide what to do. And because the unfinalized chain will not have run too far ahead, the option of a long rollback remains realistically open.

If, on the other hand, there is time pressure to make a governance decision about a rollback in order to reduce its length, that may result in a less well-considered decision.

A possible objection is that there might be a coalition of validators who ignore the request to stop (possibly including the attacker or validators that an attacker can bribe), in which case the finalization stall would not happen. But that just means that we don’t gain the advantage of more time to make a governance decision; it isn’t actively a disadvantage relative to alternative designs. This outcome can also be thought of as a feature rather than a bug: halting the chain should be a last resort, and if the argument given for the request to stop failed to convince a sufficient number of validators that it was reason enough to halt the chain, then perhaps it wasn’t a good enough reason.

It is also possible to make the argument that the threshold of stake needed is imposed by technical properties of the finality protocol and by the resources of the attacker, which might not be ideal for the purpose described above. However, I would argue that it does not need to be ideal, and will be in the right ballpark in practice.

Addendum

Since I posted the above note, I’ve been thinking about potential attacks on protocols using the bounded best-effort model.

Tail-thrashing attacks

There is an important class of potential attacks based on the fact that when the unfinalized chain stalls, an adversary has more time to find blocks, and this might violate security assumptions of the more-available protocol. For instance, if the more-available protocol is PoW-based, then its security in the steady state is predicated on the fact that an adversary with a given proportion of hash power has only a limited time to use that power, before the rest of the network finds another block. During a chain stall this is not the case. If, say, the adversary has 10% hash power, then it can on average find a block in 10 block times. And so in 100 block times it can create a 10-block fork.

It may in fact be worse than this: once miners know that a finalization stall is happening, their incentive to continue mining is reduced, since they know that there is a greater chance that their blocks might be rolled back. So we would expect the global hash rate to fall —even before the finality gap bound is hit— and then the adversary would have a greater proportion of hash rate.

Even in a pure ebb-and-flow protocol, a finalization stall could cause miners to infer that their blocks are more likely to be rolled back, but the fact that the chain is continuing would make more difficult to exploit. This issue with the global hash rate is mostly specific to the more-available protocol being PoW: if it were PoS, then its validators might as well continue proposing blocks because it is cheap to do so. There might be other attacks when the more-available protocol is PoS; I haven’t spent much time analysing that case.

The problem here is that it may have been assumed from the earlier description that the more-available chain would just halt during a chain stall. But in fact, for a finality gap bound of k blocks, an adversary could cause the k-block “tail” of the chain as seen by any given node to “thrash” between different chains. I will call this a tail-thrashing attack.

If a protocol allowed such attacks then it would be a regression relative to the security we would normally expect from an otherwise similar PoW-based protocol. It only occurs during a chain stall, but note that we cannot exclude the possibility of an adversary being able to provoke a chain stall.

Let’s put a pin in solving this problem, because there is another issue that we need to consider, and my preferred approach addresses both.

Finalized Prefix Availability

In the absence of security flaws and under the security assumptions required by the finality layer, the finalization point will not be seen by any honest node to roll back. However, that does not imply that all nodes will see the same finalized height — which is impossible given network delays and unreliable messaging.

In order to optimize the availability of applications that require finality, we need to consider availability of the information, e.g. validator proposals and (possibly aggregate) signatures, needed to finalize the chain up to a particular point. It is possible to incentivize distribution of that information by piggy-backing it on block headers, so that block producers have to include it in order to obtain the block production reward (e.g. mining reward) for the more-available protocol.

Obviously the piggy-backed information has to be for finalization only up to some previous block, because the information for later blocks wasn’t yet available to the block producer.

Optionally, we could incentivize the block producer to include the latest information it has, for example by burning part of the block reward or by giving the producer some limited mining advantage that depends on how many blocks back the finalization information is. Alternatively we could just rely on the fact that some proportion of block producers are honest and will include the latest information they have.

Suppose that for a k-block finality gap bound, we required each block header to include the information necessary for a node to finalize to k blocks back, given that the node has already finalized to k+1 blocks back. This would automatically implement the finality gap bound without any further explicit check, because it would be impossible to produce a block after the bound. But it would also guarantee that availability of the finalized prefix, at least up to the bound, is incentivized to the same extent as the more-available protocol.

In the Ebb-and-Flow paper, we also have information from the more-available protocol being used in the BFT protocol (top of right column, page 7):

In addition, \mathsf{ch}^t_i is used as side information in \Pi_{\mathrm{bft}} to boycott the finalization of invalid snapshots proposed by the adversary.

This does not cause any circularity. In fact, it means that BFT validators have to listen to the block transmission protocol anyway, so that could be also the protocol over which BFT communication occurs. That is, BFT validators could listen for block headers and get all the information needed to make and broadcast their own signatures or proposals. (A possible reason not to broadcast individual signatures to all nodes is that with large numbers of validators, the proof that a sufficient proportion of validators/stake has signed can use an aggregate signature, which could be much smaller.)

Back to the tail-thrashing problem

Note that in ebb-and-flow-based protocols, snapshots of the “longest chain” protocol \Pi_{\mathrm{lc}} are used as input to the BFT protocol. That implies that the tail-thrashing problem could also affect the input to that protocol, which would be bad (not least for security analysis of availability, which seems somewhat intractable in that case).

Also, when we restart the more-available chain, we would need to take account of the fact that the adversary has had an arbitrary length of time to build long chains from every block that we could potentially restart from. It could be possible to invalidate those chains by requiring blocks after the restart to be dependent on fresh randomness, but that sounds quite tricky (especially given that we want to restart without manual intervention if possible), and there may be other attacks I haven’t thought of.

So, instead of trying to directly solve the tail-thrashing problem, we will avoid it. We will say that if a block producer cannot give the information needed to advance the finalization point to at least k blocks back, then it must produce a coinbase-only block.

This achieves pretty much the same effect, for our purposes, as actually stalling the more-available chain. Since funds cannot be spent in coinbase-only blocks, the vast majority of attacks that we are worried about would not be exploitable in this state.

It is possible that a security flaw could affect coinbase transactions. We might want to turn off shielded coinbase for those blocks in order to reduce the chance of that.

Also, mining rewards cannot be spent in a coinbase-only block; in particular, mining pools cannot distribute rewards. So there is a risk that an unscrupulous mining pool might try to do a rug-pull after mining of non-coinbase-only blocks resumes, if there was a very long finalization stall. But this approach works at least in the short term, and probably for long enough to allow manual intervention into the finalization protocol, or governance processes if needed.

There’s a caveat: part of the reason we wanted the more-available chain to also stall is to make it more acceptable to do a rollback, possibly as far as the stalled finalization point — maybe earlier if the stall was triggered deliberately in response to noticing a sufficiently serious attack on-chain, and if governance decisions allow it. What happens to incentives of miners on a chain that might be rolled back in that way?

This is actually fairly easy to solve. We have the governance procedures say that if we do an intentional rollback, the coinbase-only mining rewards will be preserved. I.e. we mine a block or blocks that include those rewards paid to the same addresses (adjusting the consensus to allow them to be created from thin air if necessary), have everyone check it thoroughly, and require the chain to restart from that block. So as long as miners believe that this procedure will be followed and that the chain will eventually recover at a reasonable coin price, they will still have incentive to mine on the \Pi_{\mathrm{lc}} chain, at least for a time.

There’s one more thing to solve: when we resume mining non-coinbase-only blocks, the rule that

Each block header must include the information necessary for a node to finalize to k blocks back, given that the node has already finalized to k+1 blocks back.

will not be sufficient to catch up the finalization point using only information from block headers. We can solve this by adjusting the rule as follows (this also includes not enforcing it for coinbase-only blocks):

Let \mathsf{thisHeight} be the height of this block, and let \mathsf{prevFinal} be the height to which finalization information has been given by the header chain up to and including this block’s parent. If the block is not coinbase-only, then the block header MUST include the information necessary for a node to finalize to height \mathsf{min}(\mathsf{prevFinal} + 100, \mathsf{thisHeight} - k), given that the node has already finalized to height \mathsf{prevFinal}. (No information is needed if \mathsf{thisHeight} - k \leq \mathsf{prevFinal}.)

This allows nodes to catch up quickly (at a rate of 100 finalized blocks per new block) while still keeping the size of finalization information in each block bounded.

5 Likes