EWBF Linux failure modes

I have finally solved the last failure mode that my auto-restart software had trouble dealing with. All of my watchdogs use the sysrq magic keys to reboot Linux safely and avoid the Nvidia login loop. EWBF in Linux has four main failure modes:

  1. A GPU dies and EWBF tries to restart (I have never seen that work), so I always restart immediately via the watchdog.

  2. GPU0 dies and locks the X session. In this case the watchdog from #1 will fail, so a second watchdog, run as a root cron job, detects that the miner's last log entry is more than 30 sec old and restarts.

  3. A GPU drops its Sol/s to less than maximum (usually half). This one is trickier: you need to monitor each GPU's performance and restart when its Sol/s hits the lower limit.

  4. A sysrq magic key restart from the watchdog results in a hung reboot. This really sucks when dealing with remote rigs. I traced this issue to an update of the default nouveau driver combined with an EFI install; a legacy BIOS install does not have this problem. To fix it, set “nomodeset” in GRUB2 and the problem goes away.
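
The checks behind failure modes 2 and 3 could be sketched roughly like this. It is only a minimal sketch, not the actual watchdog: the function names are hypothetical, it assumes GNU stat (`stat -c %Y`), and it assumes you have already pulled a per-GPU Sol/s number out of the miner output yourself.

```shell
#!/bin/bash
# Hypothetical helpers; the names and the limits passed in are assumptions.

# Return 0 (true) if FILE has not been modified for more than MAX_AGE
# seconds, or if it does not exist at all.
log_is_stale() {
    local file=$1 max_age=$2 now mtime
    [ -e "$file" ] || return 0
    now=$(date +%s)
    mtime=$(stat -c %Y "$file")   # GNU stat: last modification, epoch seconds
    [ $(( now - mtime )) -gt "$max_age" ]
}

# Return 0 (true) if the Sol/s value is below the lower limit.
# awk does the comparison so fractional values work too.
below_sol_limit() {
    local sols=$1 limit=$2
    awk -v s="$sols" -v l="$limit" 'BEGIN { exit !(s + 0 < l + 0) }'
}
```

A cron watchdog could call something like `log_is_stale "$MINER_LOG" 30` (with `MINER_LOG` pointing at wherever your miner logs) and start the sysrq sequence when it returns true.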

Now you have completely automated rigs. Add a Raspberry Pi and you can control your rigs via your cell, and they will let you know if they have any problem they can't fix on their own.

I would really appreciate help in such setup… :slight_smile:

Example code:
#!/bin/bash
#code to determine fault

#code to log error and or send you txt msg of a restart

#code to kill appropriate miner depending on what coin is currently running.
sudo killall miner #Nvidia_ZEC_EWBF=miner, Genoil_ETC=ethminer, AMD_Optiminer=optiminer-zcash etc etc

#begin controlled restart
sleep 5
echo 1 > /proc/sys/kernel/sysrq
echo "Taking keyboard from X11"
echo r > /proc/sysrq-trigger
echo "Syncing disks"
echo s > /proc/sysrq-trigger
echo "Remounting filesystems RO"
echo u > /proc/sysrq-trigger
sleep 1
echo "Rebooting"
echo b > /proc/sysrq-trigger

Replace the last line with “echo o > /proc/sysrq-trigger” if you want to shut down instead of reboot.

You need to run this as root or with sudo privileges, and the sysrq keys have to be enabled in the kernel. Since these are low-level kernel commands, if they ever fail, your PC is a brick anyway. You need to pay attention to how you implement the script: if it's called from a script that is running in X11 and X11 locks up, it may never get called.

So I use two watchdogs: one that does the heavy lifting (miner logging, decision making, rebooting) and a second root cron job that times out when the first watchdog is not doing what it should and can execute the reboot within itself (in memory, not by calling a script from the HD). Both watchdogs need to monitor themselves (how often they run) to prevent a reboot loop when there is a serious problem. If my rigs reboot three or more times in 6 min, they enter a non-mining safe mode and send me a message so I can log in and see what the problem is (I have never had that happen, but I have tested it).
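
The reboot-loop guard could look something like the sketch below. This is an illustration under assumptions, not my exact code: the function names and the history file are hypothetical, and it assumes each boot appends one epoch timestamp per line to that file (for example with `date +%s >> "$HISTORY"`).

```shell
#!/bin/bash
# Hypothetical reboot-loop guard; names, file layout and defaults are assumptions.

# Print how many timestamps (epoch seconds, one per line) in HISTORY fall
# within the last WINDOW seconds relative to NOW.
recent_reboots() {
    local history=$1 window=$2 now=$3
    [ -f "$history" ] || { echo 0; return; }
    awk -v now="$now" -v win="$window" \
        '($1 + 0) > (now - win) { n++ } END { print n + 0 }' "$history"
}

# Return 0 (true) if MAX or more reboots were recorded inside the window.
# Defaults follow the "three or more times in 6 min" rule.
in_reboot_loop() {
    local history=$1 window=${2:-360} max=${3:-3} now=${4:-$(date +%s)}
    [ "$(recent_reboots "$history" "$window" "$now")" -ge "$max" ]
}
```

At startup the watchdog would record the boot, then check `in_reboot_loop "$HISTORY"` and, if it returns true, skip starting the miner and send the alert instead.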
