EWBF Linux failure modes

ZC93 · August 10, 2017, 5:27am

I have finally solved the last failure mode that my auto-restart software had trouble dealing with. All of my watchdogs use the sysrq magic keys to reboot Linux safely and avoid the Nvidia login loop.

EWBF on Linux has four main failure modes:

1. A GPU dies. EWBF tries to restart it (I have never seen that work), so I always restart immediately via the watchdog.
2. GPU0 dies and locks the X session. In this case the #1 watchdog fails, so a second watchdog, running as a root cron job, detects that the miner's last log entry is more than 30 seconds old and restarts.
3. A GPU's Sol/s drops below its maximum (usually to about half). This one is trickier: you need to monitor each GPU's performance and restart when the lower Sol/s limit is reached.
4. A sysrq magic key restart from the watchdog results in a hung reboot. This really sucks when dealing with remote rigs. I traced the issue to an update of the default nouveau driver combined with an EFI install; a legacy BIOS install does not have this problem. To fix it, set "nomodeset" in GRUB2 and the problem goes away.
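
A minimal sketch of the two detection checks described for failure modes 2 and 3, not the actual watchdog code. It assumes GNU stat and awk are available; the function names, the log file passed in, and the "GPU0: 430 Sol/s" status line format are assumptions for illustration, so adjust the pattern to whatever your miner actually logs.

```shell
#!/bin/bash
# Sketch only: function names, log path, and the "GPUn: NNN Sol/s" line
# format are assumptions, not taken from EWBF documentation.

# Return 0 (stale) if the log file is missing or was last modified more
# than max_age seconds ago; return 1 otherwise.
log_is_stale() {
    local logfile=$1 max_age=$2 now mtime
    [ -f "$logfile" ] || return 0
    now=$(date +%s)
    mtime=$(stat -c %Y "$logfile")   # GNU stat: modification time in epoch seconds
    [ $(( now - mtime )) -gt "$max_age" ]
}

# Read miner status lines on stdin and print the index of every GPU whose
# rate is below min_sols. Assumed entry format: "GPU0: 430 Sol/s".
gpus_below_limit() {
    local min_sols=$1
    awk -v min="$min_sols" '
        {
            line = $0
            # A line may hold several GPU entries, so consume them one by one.
            while (match(line, /GPU[0-9]+: [0-9]+ Sol\/s/)) {
                split(substr(line, RSTART, RLENGTH), f, /[: ]+/)
                if (f[2] + 0 < min) print substr(f[1], 4)
                line = substr(line, RSTART + RLENGTH)
            }
        }'
}
```

A root cron job could call log_is_stale with the miner log and a limit of 30, while the main watchdog could pipe the last few log lines into gpus_below_limit with a limit chosen per card. Either check returning a hit would then trigger the controlled restart.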

Now you have completely automated rigs. Add a Raspberry Pi and you can control your rigs via your cell, and they will let you know if they have any problem they can't fix on their own.

gandotratushar · August 10, 2017, 8:09pm

I would really appreciate help with such a setup…

ZC93 · August 10, 2017, 11:20pm

Example code:
#!/bin/bash
# code to determine fault

# code to log error and/or send you a txt msg of a restart

# code to kill appropriate miner depending on what coin is currently running.
sudo killall miner # Nvidia_ZEC_EWBF=miner, Genoil_ETC=ethminer, AMD_Optiminer=optiminer-zcash etc etc

# begin controlled restart
sleep 5
# enable all sysrq functions
echo 1 > /proc/sys/kernel/sysrq
echo "Taking keyboard from X11"
echo r > /proc/sysrq-trigger
echo "Syncing disks"
echo s > /proc/sysrq-trigger
echo "Remounting filesystems RO"
echo u > /proc/sysrq-trigger
sleep 1
echo "Rebooting"
echo b > /proc/sysrq-trigger

Replace the last line with "echo o > /proc/sysrq-trigger" if you want to shut down instead of reboot.

You need to run this as root or with sudo privileges, and the sysrq keys have to be enabled in the kernel. Since these are low-level kernel commands, if they ever fail, your PC is a brick anyways. You need to pay attention to how you implement the script: if it's called from a script that is running in X11 and X11 locks up, then it may not get called. So I use two watchdogs. The first does the heavy lifting (miner logging, decision making, rebooting). The second is a root cron job that times out when the first watchdog is not doing what it should, and it can execute the reboot within itself (in memory, without calling a script from the HD).

Each watchdog needs to monitor itself (how often it runs) to prevent a reboot loop when there is a serious problem. If my rigs reboot three or more times in 6 min, they enter a non-mining safe mode and send me a message so I can log in and see what the problem is (I have never had that happen, but I have tested it).
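
The reboot-loop protection can be sketched roughly as below. This is an illustration under assumptions, not the author's code: the function names and the history file are hypothetical, and the optional last argument (the current time in epoch seconds) exists only to make the logic easy to test.

```shell
#!/bin/bash
# Sketch only: the history file and function names are hypothetical.

# Append a reboot timestamp (epoch seconds, default: now) to the history file.
record_reboot() {
    local history=$1 now=${2:-$(date +%s)}
    echo "$now" >> "$history"
}

# Return 0 if at least max_reboots timestamps fall within the last
# window seconds; return 1 otherwise (including when no history exists).
in_reboot_loop() {
    local history=$1 max_reboots=$2 window=$3 now=${4:-$(date +%s)} recent
    [ -f "$history" ] || return 1
    recent=$(awk -v now="$now" -v win="$window" \
        'now - $1 <= win { n++ } END { print n + 0 }' "$history")
    [ "$recent" -ge "$max_reboots" ]
}
```

With the limits from the post, the watchdog would call record_reboot before each restart and check in_reboot_loop with 3 and 360 at startup, skipping the miner and sending a message when it returns success. The history file has to live on storage that survives a reboot, so a tmpfs-backed /tmp is not suitable.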
