Figuring out why my full node keeps rebooting

milk · October 20, 2022, 6:13pm

I started running a full node sometime in January 2022. I bought a cheap NUC knockoff from Amazon with 4 GB of RAM and 128 GB of SSD and installed Debian 11 on it. This ran fine for several months.

Then probably around the time the spam attacks started happening, I noticed my zcashd process would get killed about 15 minutes after I’d restart it. I realized it was maxing out the RAM and SSD, so I upgraded to a mini desktop with 4 cores, 16 GB of RAM, and a 1 TB HDD. I copied my .zcash folder over to it and started it running.

It seemed like zcashd would run for anywhere from a few minutes to a few hours, and then all my SSH sessions to the box would get dropped. Sometimes it looked like the whole box rebooted; sometimes (according to the uptime in htop) it seemed like only zcashd was killed.

I’m looking for advice on how to figure out why I’m having these issues. If I can find the specific error that caused the reboot, then I can figure out how to fix it. I looked at messages in debug.log and /var/log/syslog but they don’t really show anything that looks out of the ordinary. This may be more of a general Linux question: What logs can I look at to figure out why my node crashed? Is there any sort of crash reporting that I can turn on?

Also what is the best way to relaunch zcashd after an unexpected exit? Should I always include the -reindex option, even though that adds hours to the startup time?

pitmutt · October 20, 2022, 6:24pm

Hi!

Which version of zcashd are you running? Version 5.2.0 rolled out in the early summer.

What settings are you using in zcash.conf?

When I was having trouble with zcashd crashing, I could see the error in the logs using dmesg: it was an out-of-memory error.
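
For anyone following along, here is a minimal sketch of what that check might look like. The function name find_oom_lines is made up for illustration, and the patterns are just the usual wording of Linux OOM-killer messages, so adjust them if your kernel phrases things differently.

```shell
# Filter kernel log text (read from stdin) for typical OOM-killer messages.
# find_oom_lines is a hypothetical helper name, not part of any tool.
find_oom_lines() {
    grep -Ei 'out of memory|oom-killer|oom_reaper|killed process'
}

# Usage on a live system (reading dmesg may need root):
#   sudo dmesg -T | find_oom_lines
```

If zcashd was the victim, its name should show up in the matching lines.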

Autotunafish · October 20, 2022, 6:56pm

There were multiple out-of-memory issues reported with the 5.2.0 release; I’ll gather up some links. The coming 5.3.0 release contains fixes to mitigate the issue and should be out very soon; the RC1 tags were just removed today. You can also git checkout the current master branch, which already contains the fixes, but that should be synced to the release very soon anyway. Comparing 654180df2102...35186b00928f · zcash/zcash · GitHub

Upgrading from 5.2.0 is the only known solution.

Autotunafish · October 20, 2022, 7:11pm

I think @ChileBob was the first to report it here, on Aug 11: V5.2.0 - Segmentation fault
A similar issue was reported on GitHub the day before, but there was less info to go on, though it’s still presumed to be the most likely cause. I posted about encountering this issue on the forum as well (I don’t know if you can see the lounge, so I screenshotted mine):
https://forum.zcashcommunity.com/t/this-started-as-a-rant/42792?u=autotunafish
Can't sync with the message "You have validated 2240 transactions!" - #11 by str4d
What is up Next? - #13 by str4d
Str4d explains it here:
Zcash Arborist Call - 23 September 2022 - YouTube
Most recent:
Zcash Arborist Call - 10 October 2022 - YouTube

milk · October 20, 2022, 9:02pm

I’m running 5.2.0. I think older versions will refuse to run past a certain block height.

Here’s my zcash.conf. Nothing special:

mainnet=1
listen=1
listenonion=0
server=1
addnode=mainnet.z.cash
gen=0

I took a look at dmesg, but it apparently only goes as far back as when my machine booted up, so I don’t know what caused the restart.
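
dmesg only holds the kernel ring buffer for the current boot, so messages from before a reboot have to come from logs on disk. A rough sketch, assuming a systemd-based system such as Debian 11: with journald's default Storage=auto setting, logs survive a reboot only if /var/log/journal exists. The helper name journal_is_persistent is hypothetical.

```shell
# Report whether journald appears to keep logs across reboots.
# Assumption: Storage=auto (the default), where persistence depends on
# this directory existing. The optional argument is only there for testing.
journal_is_persistent() {
    [ -d "${1:-/var/log/journal}" ]
}

if journal_is_persistent; then
    echo "persistent journal found; previous boot's kernel log: journalctl -k -b -1"
else
    echo "no persistent journal; create /var/log/journal and restart systemd-journald (see journald.conf(5))"
fi
```

If the previous boot's log simply stops with no shutdown messages, that would point toward a hard reset rather than an orderly reboot.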

One thing I’m doing that is kind of unusual is that I tried setting up this box to run zcash as a service, a technique I learned from a guide to setting up an ETH staking node. I created a user like this:

sudo useradd --no-create-home --shell /bin/false zcash

And I launch zcashd like this (I’m not using the service yet):

sudo -u zcash /usr/bin/zcashd -datadir=/var/lib/zcash -paramsdir=/var/lib/zcash-params

Maybe I should first try running it from my own user’s home, but I think if this setup was not possible, then it would fail quickly and deterministically.
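
On the question of relaunching after an unexpected exit: a minimal sketch of what the service could look like, reusing the user, binary path, and directories from the command above. The unit name zcashd.service, the helper write_zcashd_unit, and the Restart/RestartSec values are illustrative assumptions, not an official Zcash unit file. It leaves out -reindex on the assumption that reindexing is only needed when the block database is actually reported as corrupt.

```shell
# Write a sample systemd unit for zcashd to the given path.
# write_zcashd_unit is a hypothetical helper; the restart policy values
# are illustrative, not recommendations from the zcashd developers.
write_zcashd_unit() {
    cat > "$1" <<'EOF'
[Unit]
Description=zcashd full node
Wants=network-online.target
After=network-online.target

[Service]
User=zcash
ExecStart=/usr/bin/zcashd -datadir=/var/lib/zcash -paramsdir=/var/lib/zcash-params
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target
EOF
}

# Usage (then install with root privileges and enable it):
#   write_zcashd_unit ./zcashd.service
#   sudo cp ./zcashd.service /etc/systemd/system/
#   sudo systemctl daemon-reload && sudo systemctl enable --now zcashd
```

Restart=on-failure only helps when zcashd itself exits abnormally; it does nothing for a machine that resets underneath it.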

milk · October 20, 2022, 9:05pm

That’s great news. Hopefully they’ll post the apt package soon and maybe that will fix my issues.

Autotunafish · October 20, 2022, 9:20pm

I’ll post here when it drops

pitmutt · October 20, 2022, 9:30pm

The thread that @Autotunafish posted has some of the earlier discussions I had when one of my nodes had no issues upgrading to 5.2.0 while the other kept crashing. I think the zcash.conf parameters played a role: when I copied them from the good node to the crashing one, the crashing one stopped crashing.

I did get the out of memory error while trying to sync through the sandblasting blocks, but once it synced, it’s been working fine.

Autotunafish · October 21, 2022, 12:07am

@milk @pitmutt 5.3.0 is released: Releases · zcash/zcash · GitHub

pitmutt · October 21, 2022, 1:32pm

Cool! Will try it out as soon as the .deb is available.

milk · October 25, 2022, 1:32am

The .deb package is finally available. While waiting for its release I continued to let my zcashd service run 5.2.0 and verify as many blocks as it could before my machine inevitably crashed again. It got to the point where my machine was rebooting every 4 minutes. I think maybe 4 minutes is about enough time for zcash to do all its bootup housekeeping, and then as soon as it starts processing new blocks my machine crashes.

I started 5.3.0 in hopes that it would fix the issue, but no, it crashed the machine as soon as it was done starting up. I’m beginning to wonder if I have a bad disk. (This computer was a refurb.) Now that my block DB is a certain size maybe zcashd keeps trying to write to the same bad sector or something.

I think maybe this is a more general Linux question. A userspace app shouldn’t be able to just reboot the whole machine so easily, right? If it were out of memory or a bad pointer, the OS should just kill the app I would think.

I’m going to run it with -reindex one last time tonight and see how far it gets.

hanh · October 25, 2022, 1:54am

It’s more likely faulty memory than disk.

Autotunafish · October 25, 2022, 2:39am

Yeah, most of the issues were just about the process terminating from an OOM, not the whole system rebooting at startup. You could check the debug log, syslog, or dmesg log in /var/log for more insight, perhaps.

milk · October 25, 2022, 8:22pm

Just curious what makes you suggest it’s the RAM? I assumed the hard drive since it’s a mechanical device, and I’ve definitely known them to go bad in the past. Replacing the RAM is basically the only thing I haven’t tried yet.

Things I’ve tried that haven’t worked:

Upgrading to Zcash 5.3.0
Replacing 1 TB HDD with 500 GB SSD I had on hand
Syncing blocks DB entirely from scratch. (Machine rebooted around 30% synced, can’t sync farther.)
Copying blocks DB (96% synced) from my old under-resourced node. (Machine reboots as soon as sync starts.)

I guess I’ll try putting new RAM in this thing, and if that doesn’t work, I guess I need to buy a new computer.

milk · October 25, 2022, 8:28pm

The logs haven’t shown me anything useful. The machine pretty much reboots without warning. I would guess if it was something like an OOM error, the operating system would kill the process and print a message and not reboot the whole machine. This is what is leading me to think it’s some issue with my specific hardware that zcashd is somehow triggering.
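
One thing that can be ruled in or out here: Linux does normally kill a process on OOM, but two sysctls, vm.panic_on_oom and kernel.panic, can turn an OOM into a kernel panic followed by an automatic reboot. A small sketch that interprets those two values; the function name oom_reaction and its output labels are made up for illustration, and with panic_on_oom set to 1 the kernel may still just kill a process in some cpuset or memory-policy cases.

```shell
# Describe how the kernel reacts to memory exhaustion.
#   $1 = value of vm.panic_on_oom, $2 = value of kernel.panic
oom_reaction() {
    if [ "$1" -eq 0 ]; then
        echo "kill-process"        # OOM killer picks a process; machine stays up
    elif [ "$2" -ne 0 ]; then
        echo "panic-then-reboot"   # kernel panics, then reboots on its own
    else
        echo "panic-and-hang"      # kernel panics and waits for a manual reset
    fi
}

# Inspect the live settings where /proc is available.
if [ -r /proc/sys/vm/panic_on_oom ] && [ -r /proc/sys/kernel/panic ]; then
    oom_reaction "$(cat /proc/sys/vm/panic_on_oom)" "$(cat /proc/sys/kernel/panic)"
fi
```

If this prints kill-process, an OOM alone should not be rebooting the machine, which fits the hardware theory.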

Autotunafish · October 25, 2022, 9:08pm

Hopefully nothing too serious… PSU maybe?

milk · October 25, 2022, 9:13pm

That did cross my mind. That would be something external to the OS that wouldn’t generate any logs. Maybe when zcash starts doing the really heavy math for later versions of the protocol it starts drawing too much current and trips something. Unfortunately that’s not a part I think I can replace on this mini PC.
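
One way to test the power-draw theory without zcashd in the picture is to load every core with something else and see whether the box still resets. A crude sketch: burn_cpu is a made-up helper, and a shell busy loop likely draws less power than real cryptographic work, so surviving it would not clear the PSU.

```shell
# Keep N workers spinning for a fixed number of seconds.
#   $1 = number of busy-loop workers, $2 = duration in seconds
# burn_cpu is a hypothetical helper; it relies on coreutils `timeout`.
burn_cpu() {
    workers=$1
    seconds=$2
    i=0
    while [ "$i" -lt "$workers" ]; do
        # Each worker is a tight loop that timeout terminates when time is up.
        timeout "$seconds" sh -c 'while :; do :; done' &
        i=$((i + 1))
    done
    wait   # block until every worker has been stopped
}

# Usage: one worker per core for ten minutes
#   burn_cpu "$(nproc)" 600
```

If the machine resets under this load too, zcashd is probably just the trigger rather than the cause.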

hanh · October 26, 2022, 1:36am

It’s usually PSU > RAM > MB. Disk failures don’t generally reboot the machine. They make clickety noises.

milk · November 4, 2022, 6:21pm

One last update: I took the hard drive out of the mini PC and put it in a different computer. I was able to finish syncing in a day or two, so that tells me my database wasn’t corrupt and the disk was fine. I replaced the RAM and the SSD in the mini PC I was originally trying to use and copied over the updated blocks database. I tried one more time to run zcashd and as soon as it started processing the first block it restarted. So I think that narrows it down to the PSU, but I don’t think that part is easily replaceable on a Lenovo mini PC. I know people have used less powerful machines as zcash nodes, but I think the difference is that those machines know how to run within their limits. I don’t recommend the Lenovo ThinkCentre M700 tiny PC for hosting a Zcash node.

Autotunafish · November 4, 2022, 10:05pm

Yeah, it looks a little small. I ran a quick search and replacements may be available (if it searched for the right thing, that is).
