V4.1.1 : CopyPreviousWitnesses & Initial Sync problem

ChileBob · December 13, 2020, 2:30pm

Ooooops !

zcashd: wallet/wallet.cpp:1115: void CopyPreviousWitnesses(NoteDataMap &, int, int64_t) [NoteDataMap = std::__1::map<SaplingOutPoint, SaplingNoteData, std::__1::less, std::__1::allocator<std::__1::pair<const SaplingOutPoint, SaplingNoteData> > >]: Assertion `(nd->witnessHeight == -1) || (nd->witnessHeight == indexHeight - 1)’ failed.

Reindexing…

anon16456014 · December 13, 2020, 6:00pm

Another reincarnation of zcashd assertion failure CopyPreviousWitnesses() Assertion (nd->witnessHeight == -1) || (nd->witnessHeight == indexHeight - 1) · Issue #4301 · zcash/zcash · GitHub.

Please file a bug.

LarryRuane · December 13, 2020, 8:37pm

ChileBob, sorry you hit this; it should be fixed in v4.1.0, but the fix doesn’t fix existing corruption. So this can even happen after upgrading from pre-v4.1.0, even though you’re now running v4.1.0 or later. I’m guessing (hoping) that you saw this just after restarting the node and that, previous to the restart, the node was running pre-v4.1.0. See PR 4690 for details.

You would need to restart with -rescan (or -reindex but that takes longer) and then you should never see it again. “Corruption” is a harsh term, but there’s no danger of losing funds (it’s a limited type of corruption that’s easily fixed).

If you were already running v4.1.0 or later and were fully synced with the network, and then hit this assertion, there’s a new problem (or the fix doesn’t fully work). So please let me know (or file a bug, as rex4539 suggests) if that’s the case!

ChileBob · December 13, 2020, 9:46pm

I was running v4.1.1 (upgraded days ago), previously running v4.1.0 with no problems.

Reindexing now to recover & preserved debug.log just in case.

EDIT: Probably a red herring, IP addr (external) changed after the node was started but before it finished syncing.

EDIT: It just died during the reindex (at block 230035), here’s the error :-

zcashd: main.cpp:2597: DisconnectResult DisconnectBlock(const CBlock &, CValidationState &, const CBlockIndex *, CCoinsViewCache &, const CChainParams &, const bool): Assertion `pindex->nCachedBranchId’ failed.
Aborted

EDIT: Building a new node from scratch & will import the old wallet.dat (its gonna take a while)
EDIT: 12 hrs later & its at 18%, really should upgrade that machine…
EDIT: Syncing during a total solar eclipse, this is gonna be a very special node…19%…

EDIT: 365GB total RX & 48Mb/s RX (according to iftop), at block 212653 (19%) after 20 hrs

That’s a LOT of data, very different to the last time I synced a node from scratch!!!

Attaching a screenshot showing the node console, iftop, debug.log & top :-

EDIT:

I’m calling it ‘officially pooched’ & aborting the sync.
The node had been running for 24 hours, it synced to block 217784 & used 440GB download - something is broken.

EDIT:

Just for the sake of completeness I restarted the node, it resumed syncing at a MUCH faster rate.

I’ll leave it running, but for now I would say ‘initial sync is pooched’, ‘resumed/restarted sync works’.

EDIT: Confirmed. Its been 2hrs since restarting the node & its synced to block 298300, much faster.

EDIT: Checked it again this morning & it’d bogged down, syncing slowly & had reached block 373610. Restarted the node & it was much faster.

Addtional: Noticed warnings in debug.log ‘overwinter version too high’, which is new to me.

EDIT: Restarted with ‘-connect=node.IP.address.here’ so it could sync from another local node, much MUCH faster & will let it sync to completion.

EDIT: I’m seeing BAN messages in debug.log ‘Misbehaving: BAN THRESHOLD EXCEEDED’, the node its syncing from is running v4.1.1. I’m using the ‘-connect’ option so the node just reconnects automatically & continues, but the ban shouldn’t be happening.

Here’s what the misbehaving is about from debug.log :-
Dec 15 11:31:29.197 ERROR main: CScriptCheck(): d493018c4c53af2825d54560108d5e5523d7ebb6a89098f2af1691ef8de1ef0a:0 VerifySignature failed: Script evaluated without error but finished with a false/empty top stack element
Dec 15 11:31:29.197 ERROR main: AcceptToMemoryPool: ConnectInputs failed d493018c4c53af2825d54560108d5e5523d7ebb6a89098f2af1691ef8de1ef0a
Dec 15 11:31:29.197 INFO main: Misbehaving: 192.168.0.71 (0 → 100) BAN THRESHOLD EXCEEDED

EDIT: Looks to me like the syncing node is seeing forwarded Canopy TXNs as invalid as it hasn’t yet reached the activation block (1046400), which is causing it to flag the sending node as ‘naughty’ and after 100 txns received it issues a ban.

SUMMARY: The node is back, synced, old wallet installed and rescanned.

The CopyPreviousWitnesses problem was probably caused by ‘bad things’ happening to this machine, its been killed several times, power outages, all sorts. I can easily imagine some disc corruption caused by that.

The Initial Sync problem, is a real problem - one I think we’ve seen before ? @LarryRuane

LarryRuane · December 17, 2020, 1:10am

Hi ChileBob, thanks for the detailed reports; you’re probably remembering these two PRs:

https://github.com/zcash/zcash/pull/3410
don't ban peers when loading pre-sapling (and pre-blossom) blocks by LarryRuane · Pull Request #3670 · zcash/zcash · GitHub

But the fix needed here isn’t a straightforward extension of either of those. (I thought perhaps it would be; just extend those to account for Canopy for example.)

The large amount of receive data – could that be related to this?

github.com/ZcashFoundation/zebra

state: dodge a bug in zcashd

ZcashFoundation:main ← ZcashFoundation:dodge-zcashd-bug

opened 08:27PM - 02 Dec 20 UTC

hdevalence

+8 -2

Zcashd will blindly request more block headers as long as it got 160 block head…ers in response to a previous query, EVEN IF THOSE HEADERS ARE ALREADY KNOWN. To dodge this behavior, return slightly fewer than the maximum, to get it to go away. https://github.com/zcash/zcash/blob/0ccc885371e01d844ebeced7babe45826623d9c2/src/main.cpp#L6274-L6280 Without this change, communication between a partially-synced `zebrad` and fully-synced `zcashd` looked like this: 1. `zebrad` connects to `zcashd`, which sends an initial `getheaders` request; 2. `zebrad` correctly computes the intersection of the provided block locator with the node's current chain and returns 160 following headers; 3. `zcashd` does not check whether it already has those headers and assumes that any provided headers are new and re-validates them; 4. `zcashd` assumes that because `zebrad` responded with 160 headers, the `zebrad` node is ahead of it, and requests the next 160 headers. 5. Because block locators are sparse, the intersection between the `zcashd` and `zebrad` chains is likely well behind the `zebrad` tip, so this process continues for thousands of blocks. To avoid this problem, we return slightly fewer than the protocol maximum (158 rather than 160, to guard against off-by-one errors in zcashd). This does not interfere with use of the returned headers by peers that check the headers, but does prevent `zcashd` from trying to download thousands of block headers it already has. This problem does not occur in the `zcashd<->zcashd` case only because `zcashd` does not respond to `getheaders` messages while it is syncing. However, implementing this behavior in Zebra would be more complicated, because we don't have a distinct "initial block sync" state (we do poll-based syncing continuously) and we don't have shared global variables to modify to set that state. Relevant links (thanks @str4d): - The PR that introduced this behavior: https://github.com/bitcoin/bitcoin/pull/4468/files#r17026905 - https://github.com/bitcoin/bitcoin/issues/6861 - https://github.com/bitcoin/bitcoin/issues/6755 - https://github.com/bitcoin/bitcoin/pull/8306#issuecomment-614916454

Are you aware if you could have been connected to a zebra node by any chance? Perhaps that would have caused the problem?

ChileBob · December 17, 2020, 2:26pm

Its possible but not sure, I’ll build another this evening & watch for that.

anon16456014 · December 24, 2020, 9:47am

Same bug reported by Andreas Brekken on Telegram. sideshift.ai stopped all Zcash trading because of “blockchain index corruption” after updating to 4.2.0

CC @LarryRuane

BrunchTime · December 24, 2020, 1:56pm

FWIW this one has tripped me up straight up ~10 times now

ChileBob · December 24, 2020, 2:01pm

I’m almost convinced the corruption was caused by me killing my test machine in new, horrible and imaginative ways. Last time magic smoke escaped after the mobo was shorted by a dropped paperclip.

Running v4.2.0 now, no problems.

LarryRuane · December 25, 2020, 2:01am

It’s important to determine if they upgraded from 4.0.0 (or earlier); if so, then this is the known problem (that’s fixed in 4.1.0 and beyond). However, “blockchain index corruption” would be an entirely different problem. The known problem isn’t in the index, it’s in the wallet db.

ChileBob · December 26, 2020, 1:23am

I’m still seeing initial sync problems - built another v4.2.0 node & synced it directly from another v4.2.0 node using ‘-connect’

It downloads headers and very quickly gets to 90% while using 90Mb/s of network (watching that with iftop) - whats weird is the network traffic stays at that level while the progress meter (for headers) stops.

I eventually gave up, cloned an existing node, replaced the wallet & reindexed.

secured2k · December 27, 2020, 6:42am

While this doesn’t have to do with the crash, I did notice some comments about the excessive download during initial sync.

I recall a while back “bigbrain” opened an issue for improving the download performance on initial sync and even offered a workaround code and explanation. This issue is still open - perhaps some devs can take a look if it is worth merging.

See The download traffic is several times larger than the downloaded data · Issue #4624 · zcash/zcash (github.com)

LarryRuane · December 29, 2020, 7:36pm

I’m investigating the slow download problem. I tried IBD (initial sync) on testnet using v4.2.0 and after 15 hours it was only able to download 23% of the headers, and less than 1% of blocks. During this time, it downloaded 479 GB while generating 0.5 GB of blocks data (the sum of the sizes of the blknnnn.dat files). I think those two numbers should be close (I did IBD on bitcoin core testnet, and they were the same, 25 GB).

LarryRuane · December 30, 2020, 7:23am

I just created PR 4935 which is just the fix recommended in issue 4624. It does improve the IBD performance greatly, both time and network usage. Please review and test if you can; that would be very helpful, thanks in advance.

LarryRuane · January 9, 2021, 3:56am

I made some updates to PR 4935. I think this is a good fix for the slow / high-download-data IBD (initial block download) problem; we just need reviews and then it can be merged.

Topic		Replies	Views
Zcashd 3.0.0 - CopyPreviousWitness problem Technical Support	4	665	May 30, 2020
V5.2.0 : UpdateWitnessHeights : Assertion `nd.witnessHeight == -1' failed Technical Support	2	424	August 3, 2022
Stuck at 98% Sync (Zecwallet Fullnode - Windows 10) Technical Support	2	406	January 12, 2021
Started getting this bug running zcashd, someone knows about it? zcashd client	4	674	May 1, 2020
Wallet/wallet.cpp:648: void CWallet::IncrementNoteWitnesses(const CBlockIndex, const CBlock, ZCIncrementalMerkleTree): Assertion `nWitnessCacheSize >= nd->witnesses.size()' failed." zcashd client	11	1810	October 21, 2016

V4.1.1 : CopyPreviousWitnesses & Initial Sync problem

Related topics