V4.1.1 : CopyPreviousWitnesses & Initial Sync problem

Ooooops !

zcashd: wallet/wallet.cpp:1115: void CopyPreviousWitnesses(NoteDataMap &, int, int64_t) [NoteDataMap = std::__1::map<SaplingOutPoint, SaplingNoteData, std::__1::less, std::__1::allocator<std::__1::pair<const SaplingOutPoint, SaplingNoteData> > >]: Assertion `(nd->witnessHeight == -1) || (nd->witnessHeight == indexHeight - 1)’ failed.

Reindexing…

2 Likes

Another reincarnation of https://github.com/zcash/zcash/issues/4301.

Please file a bug.

ChileBob, sorry you hit this; it should be fixed in v4.1.0, but the fix doesn’t fix existing corruption. So this can even happen after upgrading from pre-v4.1.0, even though you’re now running v4.1.0 or later. I’m guessing (hoping) that you saw this just after restarting the node and that, previous to the restart, the node was running pre-v4.1.0. See PR 4690 for details.

You would need to restart with -rescan (or -reindex but that takes longer) and then you should never see it again. “Corruption” is a harsh term, but there’s no danger of losing funds (it’s a limited type of corruption that’s easily fixed).

If you were already running v4.1.0 or later and were fully synced with the network, and then hit this assertion, there’s a new problem (or the fix doesn’t fully work). So please let me know (or file a bug, as rex4539 suggests) if that’s the case!

I was running v4.1.1 (upgraded days ago), previously running v4.1.0 with no problems.

Reindexing now to recover & preserved debug.log just in case.

EDIT: Probably a red herring, IP addr (external) changed after the node was started but before it finished syncing.

EDIT: It just died during the reindex (at block 230035), here’s the error :-

zcashd: main.cpp:2597: DisconnectResult DisconnectBlock(const CBlock &, CValidationState &, const CBlockIndex *, CCoinsViewCache &, const CChainParams &, const bool): Assertion `pindex->nCachedBranchId’ failed.
Aborted

EDIT: Building a new node from scratch & will import the old wallet.dat (its gonna take a while)
EDIT: 12 hrs later & its at 18%, really should upgrade that machine…
EDIT: Syncing during a total solar eclipse, this is gonna be a very special node…19%…

EDIT: 365GB total RX & 48Mb/s RX (according to iftop), at block 212653 (19%) after 20 hrs

That’s a LOT of data, very different to the last time I synced a node from scratch!!!

Attaching a screenshot showing the node console, iftop, debug.log & top :-

EDIT:

I’m calling it ‘officially pooched’ & aborting the sync.
The node had been running for 24 hours, it synced to block 217784 & used 440GB download - something is broken.

EDIT:

Just for the sake of completeness I restarted the node, it resumed syncing at a MUCH faster rate.

I’ll leave it running, but for now I would say ‘initial sync is pooched’, ‘resumed/restarted sync works’.

EDIT: Confirmed. Its been 2hrs since restarting the node & its synced to block 298300, much faster.

EDIT: Checked it again this morning & it’d bogged down, syncing slowly & had reached block 373610. Restarted the node & it was much faster.

Addtional: Noticed warnings in debug.log ‘overwinter version too high’, which is new to me.

EDIT: Restarted with ‘-connect=node.IP.address.here’ so it could sync from another local node, much MUCH faster & will let it sync to completion.

EDIT: I’m seeing BAN messages in debug.log ‘Misbehaving: BAN THRESHOLD EXCEEDED’, the node its syncing from is running v4.1.1. I’m using the ‘-connect’ option so the node just reconnects automatically & continues, but the ban shouldn’t be happening.

Here’s what the misbehaving is about from debug.log :-
Dec 15 11:31:29.197 ERROR main: CScriptCheck(): d493018c4c53af2825d54560108d5e5523d7ebb6a89098f2af1691ef8de1ef0a:0 VerifySignature failed: Script evaluated without error but finished with a false/empty top stack element
Dec 15 11:31:29.197 ERROR main: AcceptToMemoryPool: ConnectInputs failed d493018c4c53af2825d54560108d5e5523d7ebb6a89098f2af1691ef8de1ef0a
Dec 15 11:31:29.197 INFO main: Misbehaving: 192.168.0.71 (0 -> 100) BAN THRESHOLD EXCEEDED

EDIT: Looks to me like the syncing node is seeing forwarded Canopy TXNs as invalid as it hasn’t yet reached the activation block (1046400), which is causing it to flag the sending node as ‘naughty’ and after 100 txns received it issues a ban.

SUMMARY: The node is back, synced, old wallet installed and rescanned.

The CopyPreviousWitnesses problem was probably caused by ‘bad things’ happening to this machine, its been killed several times, power outages, all sorts. I can easily imagine some disc corruption caused by that.

The Initial Sync problem, is a real problem - one I think we’ve seen before ? @LarryRuane

1 Like

Hi ChileBob, thanks for the detailed reports; you’re probably remembering these two PRs:

But the fix needed here isn’t a straightforward extension of either of those. (I thought perhaps it would be; just extend those to account for Canopy for example.)

The large amount of receive data – could that be related to this?

Are you aware if you could have been connected to a zebra node by any chance? Perhaps that would have caused the problem?

Its possible but not sure, I’ll build another this evening & watch for that.

Same bug reported by Andreas Brekken on Telegram. sideshift.ai stopped all Zcash trading because of “blockchain index corruption” after updating to 4.2.0

CC @LarryRuane

2 Likes

FWIW this one has tripped me up straight up ~10 times now

I’m almost convinced the corruption was caused by me killing my test machine in new, horrible and imaginative ways. Last time magic smoke escaped after the mobo was shorted by a dropped paperclip.

Running v4.2.0 now, no problems.

It’s important to determine if they upgraded from 4.0.0 (or earlier); if so, then this is the known problem (that’s fixed in 4.1.0 and beyond). However, “blockchain index corruption” would be an entirely different problem. The known problem isn’t in the index, it’s in the wallet db.

I’m still seeing initial sync problems - built another v4.2.0 node & synced it directly from another v4.2.0 node using ‘-connect’

It downloads headers and very quickly gets to 90% while using 90Mb/s of network (watching that with iftop) - whats weird is the network traffic stays at that level while the progress meter (for headers) stops.

I eventually gave up, cloned an existing node, replaced the wallet & reindexed.

1 Like

While this doesn’t have to do with the crash, I did notice some comments about the excessive download during initial sync.

I recall a while back “bigbrain” opened an issue for improving the download performance on initial sync and even offered a workaround code and explanation. This issue is still open - perhaps some devs can take a look if it is worth merging.

See The download traffic is several times larger than the downloaded data · Issue #4624 · zcash/zcash (github.com)

I’m investigating the slow download problem. I tried IBD (initial sync) on testnet using v4.2.0 and after 15 hours it was only able to download 23% of the headers, and less than 1% of blocks. During this time, it downloaded 479 GB while generating 0.5 GB of blocks data (the sum of the sizes of the blknnnn.dat files). I think those two numbers should be close (I did IBD on bitcoin core testnet, and they were the same, 25 GB).

2 Likes

I just created PR 4935 which is just the fix recommended in issue 4624. It does improve the IBD performance greatly, both time and network usage. Please review and test if you can; that would be very helpful, thanks in advance.

I made some updates to PR 4935. I think this is a good fix for the slow / high-download-data IBD (initial block download) problem; we just need reviews and then it can be merged.

1 Like