I started running a full node sometime in January 2022. I bought a cheap NUC knockoff from Amazon with 4 GB of RAM and 128 GB of SSD and installed Debian 11 on it. This ran fine for several months.
Then probably around the time the spam attacks started happening, I noticed my zcashd process would get killed about 15 minutes after I’d restart it. I realized it was maxing out the RAM and SSD, so I upgraded to a mini desktop with 4 cores, 16 GB of RAM, and a 1 TB HDD. I copied my .zcash folder over to it and started it running.
It seemed like zcashd would run for anywhere from a few minutes to a few hours, then all my SSH sessions to the box would get dropped. Sometimes it looked like the whole box had rebooted; other times (according to the uptime in htop) it seemed like only zcashd was killed.
I’m looking for advice on how to figure out why I’m having these issues. If I can find the specific error that caused the reboot, then I can figure out how to fix it. I looked at messages in debug.log and /var/log/syslog but they don’t really show anything that looks out of the ordinary. This may be more of a general Linux question: What logs can I look at to figure out why my node crashed? Is there any sort of crash reporting that I can turn on?
Also what is the best way to relaunch zcashd after an unexpected exit? Should I always include the -reindex option, even though that adds hours to the startup time?
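For the relaunch part, a minimal sketch of a wrapper that restarts a process after a non-zero exit could look like the following. It is an illustration only: `run_with_restarts` and its parameters are made-up names, and it assumes zcashd is started in the foreground (without -daemon). It leaves out -reindex on the assumption that a reindex is only worth the hours it costs when the logs actually point to a corrupted database.

```python
import subprocess
import time


def run_with_restarts(cmd, max_restarts=3, delay_seconds=30.0):
    """Run cmd, relaunching it after a non-zero exit, up to max_restarts times.

    Returns the list of exit codes observed, in order. A process killed by a
    signal (for example by the kernel OOM killer) shows up as a negative code.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        code = subprocess.run(cmd).returncode
        exit_codes.append(code)
        if code == 0:
            break  # clean shutdown: do not relaunch
        if attempt < max_restarts:
            # Pause between attempts so a crash loop does not hammer the disk.
            time.sleep(delay_seconds)
    return exit_codes
```

Hypothetical usage would be `run_with_restarts(["zcashd"])`. A wrapper like this does not survive a reboot of the whole machine; if zcashd runs under systemd, the unit's `Restart=` setting covers the same ground.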
There were multiple out-of-memory issues reported with the 5.2.0 release; I’ll gather up some links. The upcoming 5.3.0 release contains fixes to mitigate the issue and should be out very soon; the RC1 tags were just removed today. You can git checkout the current master branch, which already contains the fixes, though the release should catch up with it very soon anyway. Comparing 654180df2102...35186b00928f · zcash/zcash · GitHub
I think @ChileBob was the first to report it here, on Aug 11: V5.2.0 - Segmentation fault
A similar issue was reported the day before on GitHub, but there was less info to go on, though it’s still presumed the most likely cause. I posted about encountering this issue on the forum as well (idk if you can see the lounge, so I screenshotted mine).
I took a look at dmesg, but it apparently only goes as far back as when my machine booted up, so I don’t know what caused the restart.
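Since dmesg only covers the current boot, one option is to ask the systemd journal for kernel messages from the boot before it. Here is a rough sketch, assuming journalctl is installed and persistent journaling is enabled (for example, /var/log/journal exists); the function names are made up for illustration.

```python
import shutil
import subprocess


def previous_boot_kernel_log_cmd(boots_back=1):
    """Build a journalctl command for kernel messages from an earlier boot."""
    if boots_back < 1:
        raise ValueError("boots_back must be >= 1")
    # --dmesg: kernel messages only; --boot=-N: the Nth boot before this one.
    return ["journalctl", "--dmesg", f"--boot=-{boots_back}", "--no-pager"]


def read_previous_boot_kernel_log(boots_back=1):
    """Return the log text, or None if journalctl is missing or the call fails."""
    if shutil.which("journalctl") is None:
        return None
    result = subprocess.run(
        previous_boot_kernel_log_cmd(boots_back),
        capture_output=True,
        text=True,
    )
    return result.stdout if result.returncode == 0 else None
```

If the machine lost power abruptly, the last few messages may never have reached the disk, so an empty or ordinary-looking tail does not rule out a hardware cause.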
One thing I’m doing that is kind of unusual is that I tried setting up this box to run zcash as a service, a technique I learned from a guide to setting up an ETH staking node. I created a user like this:
Maybe I should first try running it from my own user’s home, but I think if this setup was not possible, then it would fail quickly and deterministically.
The thread that @Autotunafish posted has some of the earlier discussions I had when one of my nodes had no issues upgrading to 5.2.0, while the other kept crashing. I think the zcash.conf parameters played a role: when I copied them from the good node to the crashing one, the crashing one stopped crashing.
I did get the out of memory error while trying to sync through the sandblasting blocks, but once it synced, it’s been working fine.
The .deb package is finally available. While waiting for its release I continued to let my zcashd service run 5.2.0 and verify as many blocks as it could before my machine inevitably crashed again. It got to the point where my machine was rebooting every 4 minutes. I think maybe 4 minutes is about enough time for zcash to do all its bootup housekeeping, and then as soon as it starts processing new blocks my machine crashes.
I started 5.3.0 in hopes that it would fix the issue, but no, it crashed the machine as soon as it was done starting up. I’m beginning to wonder if I have a bad disk. (This computer was a refurb.) Now that my block DB is a certain size maybe zcashd keeps trying to write to the same bad sector or something.
I think maybe this is a more general Linux question. A userspace app shouldn’t be able to just reboot the whole machine so easily, right? If it were out of memory or a bad pointer, the OS should just kill the app I would think.
I’m going to run it with -reindex one last time tonight and see how far it gets.
Yeah, most of the issues were just about the process terminating from an OOM, not the whole system rebooting at startup. You could check the debug log, or the syslog or dmesg logs in /var/log, for more insight perhaps.
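To make the log check concrete, here is a small sketch that scans log text for kernel OOM-killer entries. It assumes the kernel's usual "Out of memory: Killed process <pid> (<name>)" wording, which can vary between kernel versions; `find_oom_kills` is a made-up helper name.

```python
import re

# Matches the "Killed process 1234 (zcashd)" part of an OOM-killer line.
KILLED_PROCESS = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text):
    """Return (pid, process_name) pairs for OOM kills found in log_text."""
    kills = []
    for line in log_text.splitlines():
        match = KILLED_PROCESS.search(line)
        # Require the "out of memory" wording so unrelated lines are skipped.
        if match and "out of memory" in line.lower():
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

It could be fed the contents of /var/log/syslog or the output of dmesg. An empty result on a box that still reboots would suggest the cause is something other than the OOM killer.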
Just curious what makes you suggest it’s the RAM? I assumed the hard drive since it’s a mechanical device, and I’ve definitely known them to go bad in the past. Replacing the RAM is basically the only thing I haven’t tried yet.
Things I’ve tried that haven’t worked:
Upgrading to Zcash 5.3.0
Replacing the 1 TB HDD with a 500 GB SSD I had on hand
Syncing blocks DB entirely from scratch. (Machine rebooted around 30% synced, can’t sync farther.)
Copying blocks DB (96% synced) from my old under-resourced node. (Machine reboots as soon as sync starts.)
I guess I’ll try putting new RAM in this thing, and if that doesn’t work, I guess I need to buy a new computer.
The logs haven’t shown me anything useful. The machine pretty much reboots without warning. I would guess if it was something like an OOM error, the operating system would kill the process and print a message and not reboot the whole machine. This is what is leading me to think it’s some issue with my specific hardware that zcashd is somehow triggering.
That did cross my mind. That would be something external to the OS that wouldn’t generate any logs. Maybe when zcash starts doing the really heavy math for later versions of the protocol it starts drawing too much current and trips something. Unfortunately that’s not a part I think I can replace on this mini PC.
One last update: I took the hard drive out of the mini PC and put it in a different computer. I was able to finish syncing in a day or two, so that tells me my database wasn’t corrupt and the disk was fine. I replaced the RAM and the SSD in the mini PC I was originally trying to use and copied over the updated blocks database. I tried one more time to run zcashd and as soon as it started processing the first block it restarted. So I think that narrows it down to the PSU, but I don’t think that part is easily replaceable on a Lenovo mini PC. I know people have used less powerful machines as zcash nodes, but I think the difference is that those machines know how to run within their limits. I don’t recommend the Lenovo ThinkCentre M700 tiny PC for hosting a Zcash node.