ZecWallet stopped on block 903001, what's the problem?

At the moment, it appears that the issue requires very specific (and as-yet undetermined) build environment conditions to create a binary that triggers the problem. Without understanding the cause, it could easily occur again in future ZecWallet releases, so it definitely needs to be figured out. We’re meeting with @adityapk00 tomorrow to continue debugging.

We’re pretty sure that the problem is related to how ZecWallet is built for Windows, so the invalidate/reconsider workaround isn’t going to solve the problem for ZecWallet users on Windows unless we understand the cause better.

That seems like a bug in its own right. Changes to the wallet database won’t be flushed in that case.

Does this release also handle the issue?

Sorry, no, not yet. We’re still trying to get to the bottom of why this is happening, so we can fix it for good.

Everyone having this problem: If you have funds stuck in a Sapling address, you can now download the latest Zecwallet Lite client, and then import the spending (private) key into the lightwallet, so that you don’t have to get stuck with the Fullnode not syncing.

Download: www.zecwallet.co

I am late to the party on this. I have checked out the GitHub comments and I think my runtime analysis stuff might help. I'm going to spend the evening instrumenting a build and grabbing the blockchain.

Please keep me in the loop. I will let you know what I find. I am going to start by looking into what @secured2k has been suggesting on GitHub.

Starting with this:

This should not take me long to verify if something is getting screwy in memory or when reading/writing the drive. Setting everything up will take longer, though.

Has anyone checked this bug on anything previous to winver 2004?

Please let me know if I'm working on outdated info, or if there is anywhere else I can catch up while I'm churning through the setup.

I was shown the issue on a Windows Server 2016 machine. In these 2 cases, I confirmed the issue visually and via debug logs. When I closed the GUI wallet, I noticed that no additional shutdown logging occurred and the process was confirmed to not be running. When starting zcashd up, it rewound blocks and started at a block well before 903000 (e.g. 902019). Fortunately, I only had to run the updated zcashd 3.1.0-rc2 and it proceeded into and through Heartwood blocks successfully.

I built and posted a build of 3.1.0-rc2 for win32 on the GitHub thread in case anyone wanted to use it.
If the user actually got bad data written to the disk (through a flush, a proper exit of zcashd, or zcash-cli stop), they need to rewind (invalidate) the last block and reconsider it on 3.1.0-rc2 to continue. Trying the exact same step in 3.0.0 results in the same reproducible problem.
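A minimal sketch of the invalidate/reconsider step described above, written as a small Python wrapper around `zcash-cli`. The RPC names `getbestblockhash`, `invalidateblock` and `reconsiderblock` are the standard zcashd ones; the helper names, and the choice to target the current tip, are assumptions made for illustration.

```python
import subprocess


def workaround_commands(block_hash, cli="zcash-cli"):
    """Build the two zcash-cli invocations for the invalidate/reconsider workaround."""
    return [
        [cli, "invalidateblock", block_hash],
        [cli, "reconsiderblock", block_hash],
    ]


def current_tip_hash(cli="zcash-cli"):
    """Ask the running node for the hash of its current best block."""
    result = subprocess.run(
        [cli, "getbestblockhash"], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def run_workaround(cli="zcash-cli"):
    """Invalidate the current tip, then ask the node to reconsider it.

    Assumes zcashd is running and that zcash-cli can reach it.
    """
    block_hash = current_tip_hash(cli)
    for argv in workaround_commands(block_hash, cli):
        subprocess.run(argv, check=True)
    return block_hash
```

Calling `run_workaround()` against a stuck node would issue both RPCs in order. Per the post above, this is only expected to help on a build that does not itself exhibit the bug.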

So it looks like the issue has already been fixed; if you look at the diffs between 3.0.0 and 3.1.0-rc2, I can see some changes in some code sections that appear to be relevant to this issue. This is not the first time this has happened; it has happened to various other blockchains as well, but it's sometimes hard to find what was actually fixed.

In my tests, I started dumping variables to debug or console and everything is good until writing to the disk (or in-memory cache). Once a block is loaded from disk, the code seems to have an incorrect value for view.GetHistoryRoot(prevConsensusBranchId). Note that prevConsensusBranchId appeared to be correct at the function call. In the GitHub posts, my output is in decimal, while many may expect hex.
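Since the logged values are decimal while branch IDs are usually quoted in hex, a small conversion helper may make the GitHub output easier to compare. This is only a sketch: the function names are hypothetical, and `0xf5b9230b` is used on the assumption that it is the Heartwood consensus branch ID.

```python
def branch_id_to_hex(value):
    """Format a 32-bit consensus branch ID as zero-padded hex."""
    return f"0x{value:08x}"


def branch_id_from_text(text):
    """Parse a branch ID written either as decimal or as 0x-prefixed hex."""
    # Base 0 lets int() pick the base from the prefix.
    return int(text, 0)


heartwood = 0xF5B9230B  # assumed Heartwood branch ID, for illustration only
as_decimal = str(heartwood)  # the form a decimal debug dump would show
round_trip = branch_id_to_hex(branch_id_from_text(as_decimal))
print(as_decimal, "->", round_trip)
```

Both spellings parse to the same integer, so values copied from a decimal dump can be checked directly against the hex constants in the source.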

I have 2 backups of the blockchain state.

  1. Problem at 903001 using 3.0.0; written to disk.
  2. Full txindex/insightexplorer/etc at around 902850.

But I doubt you will need these, as the issue is reproducible. Just use a Linux or updated Windows zcashd to get to any point after Heartwood. Then try to run the affected 3.0.0 build and the issue will occur. If the check level is 4, it will fail before downloading a block and will not write data. The default check level of 3 allows the program to download and write at least 1 block before failing in the connect block section. As proof of the write to disk, if you go back to using the same block data on disk with Zcash 3.1.0-rc2, it too will fail. However, you can successfully invalidate/reconsider on this build, whereas you cannot on 3.0.0.

What I haven’t tested - due to time/resource constraints… is whether custom build options or specific libraries were used in the 3.0.0 build. Basically, I haven’t built 3.0.0 from source and compared the behaviors.

Let me know if I can be of any help.

This is a lot of help. I fell asleep early yesterday and only just woke up.

When you say this is fixed, do you mean there is a workaround, or that the problem has been identified and a patch has been issued? @daira and @str4d, can you shed any light please? (Tagging you because you are in my timezone.)

From what you have posted so far, it sounds like the disk writing is not the issue: the block gets rejected before that. (It might be a separate issue that it gets written to disk, but that helps narrow down where the logic might be failing.)

Anyway, I have a lot of catching up to do. Thanks again for posting your work; you have done all the stuff I would do to help me narrow the problem down. You have saved me a great deal of time and work.

When I said it’s fixed, I meant that the problem happens in 3.0.0 but does not in 3.1.0-rc2. It would appear that some code changes over time may have already addressed the issue. However, I don’t think there is specific documentation of which changes fixed/mitigated the problem.

Also, you are correct about the disk writing not being the main issue - it’s rather some corruption in data processing that causes the failure. Writing to disk is just what can cause the issue to persist on 3.x builds, even after a reindex, because the bad data is stored on disk.

See: https://github.com/zcash/zcash/pull/4628

First of all, my apologies to anyone affected by this bug. I know that some of you have funds locked up by it, and I understand the frustration and possible financial hardship that can cause.

My understanding is that @adityapk00 will soon release a version of ZecWallet that will automatically fix the problem (since it’s possible for the ZecWallet build process to produce correct builds, just not reliably). 0.9.14 is not that version.

Technical details

This has been a particularly frustrating bug because it has been so difficult for ECC engineers to reproduce. It occurs only for zcashd executables on Windows that were built as part of the ZecWallet build process, or some standalone builds by Aditya using a similar process. The same source code, built by the same compiler on the same OS distribution using the same Docker configuration, does not exhibit the bug when compiled by ECC engineers, even after considerable effort to reproduce the ZecWallet build process. There’s also some degree of nondeterminism or dependence on unknown factors in whether a particular build of ZecWallet will exhibit the bug. We’re still investigating, with Aditya, why this is.

3.1.0-rc2 does not fix the bug. There have been changes, in 3.1.0-rc2 and some additional ones that are not yet released, that fix unspecified behaviour (in the C++ standards sense) in the FlyClient code. We’ve verified to my satisfaction that this unspecified behaviour, while technically incorrect, did not make any difference to this particular bug. Also, Aditya has produced builds of 3.1.0-rc2 that do reproduce the bug. The behaviour @secured2k is seeing can be explained by the fact that some builds reproduce the bug and some do not, but this is not solely dependent on which source is being built.

We’ve established that the issue is the calculation of incorrect hashChainHistoryRoot values (typically, although not exclusively, the value computed from block 903001 that should appear in the header of block 903002); we just don’t know why this happens on some ZecWallet builds. Without understanding why, there’s the risk that this problem could recur for future builds, so that’s what we’re focusing on now.
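To illustrate why a miscomputed root stalls a node, here is a toy sketch of the check: a header carries a root committing to earlier blocks, and a node that computes a different root locally rejects an otherwise valid block. The hash chain below is a deliberately simplified stand-in, not the real hashChainHistoryRoot construction, and every name in it is hypothetical.

```python
import hashlib


def toy_history_root(block_hashes):
    """Toy commitment to a list of earlier block hashes (NOT the real construction)."""
    acc = bytes(32)
    for block_hash in block_hashes:
        acc = hashlib.sha256(acc + block_hash).digest()
    return acc


def header_is_consistent(claimed_root, earlier_block_hashes, compute_root=toy_history_root):
    """Accept a header only if the locally computed root matches the one it claims."""
    return compute_root(earlier_block_hashes) == claimed_root


def buggy_root(block_hashes):
    """Simulate a miscompiled build: same inputs, but one bit of the result is wrong."""
    root = bytearray(toy_history_root(block_hashes))
    root[0] ^= 0x01
    return bytes(root)


# Three stand-in block hashes, and the root an honest miner would put in the next header.
earlier_blocks = [hashlib.sha256(bytes([i])).digest() for i in range(3)]
claimed_root = toy_history_root(earlier_blocks)

good_node_accepts = header_is_consistent(claimed_root, earlier_blocks)
bad_node_accepts = header_is_consistent(claimed_root, earlier_blocks, compute_root=buggy_root)
print(good_node_accepts, bad_node_accepts)
```

In this toy model the faulty node rejects the block even though the header is correct, which matches the symptom of a node getting stuck at a fixed height while the rest of the network moves on.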

I expect to release a new Zecwallet later today that works around this problem, and can get a stuck node syncing again.

Still trying to work through how exactly to reproduce the build environment that’s causing this bug.

Thanks for the info/update.

For Aditya/ECC - What build environment and toolchain versions are used? Also, what CPU is used?
I used a Haswell-E CPU virtualized in VMware, running Debian 10.4 amd64 with default packages, in my tests. I have also had success with Debian testing/buster 11.

The Zecwallet builds are usually via this dockerfile (https://github.com/adityapk00/zcash/blob/zecwallet-build/docker/Dockerfile) and this script (zcash/buildall.sh at zecwallet-build · adityapk00/zcash · GitHub)

The problem is very likely the build environment or a compiler version.

Thanks, I’ll try to get it reproduced in a VM over the weekend based on the 3.0.0 source.

There’s now a new version of Zecwallet (v0.9.15) that works around this problem on Windows. Please see the release here: Release Zecwallet: v0.9.15: Windows zcashd bugfix · ZcashFoundation/zecwallet · GitHub

Which version would be best for me to test for the bug on? 3.0.0? Should I merge the off-by-one fix too?

@mistfpga, @adityapk00 - I tried to reproduce the problem using a VM per the Docker files provided. I installed Ubuntu 16.04.4 LTS, updated all packages and the kernel, and was able to check out the official v3.0.0 and build it for Windows without problems, but the resulting build did not reproduce the bug (just as ECC has found and mentioned).

Checking out adityapk00’s zcash-win zecwallet-build currently gives code from 3.1.0-rc2, so I don’t have a copy of the 3.0.0 source that was used. Even if the source were identical to the v3.0.0 main branch, we would need adityapk00’s build environment and toolchain to see if a specific script, compiler, linker or something else is the problem.

I would assume adityapk00 is already working with ECC on this; but it is that person’s call to share more info if additional help is needed.

If anyone has a Windows binary that can reproduce this issue, I would greatly appreciate a copy, plus the source and build options/environment if possible (and debug symbols if you can generate them), so I can compare it to the binaries I generate.

I will post the results, but seeing as the ECC seem to have this under control, it is for personal academic purposes. (This really is very close to my day job, so I am exceptionally interested.)