I have finally solved the last failure mode that my auto restart software had issues dealing with. EWBF in Linux has four main failure modes:
All watchdogs use sysrq magic keys to reboot Linux safely and avoid Nvidia login loop.
GPU dies, EWBF tries to restart (never seen that work), I always restart immediately via watchdog.
GPU0 dies and locks the X session. In this case #1 watchdog will fail, a second root cron watchdog timer will detect > 30 sec log entry from miner and restart.
GPU drops Sol/s to less than maximum (usually half). More tricky, need to monitor each GPU performance and restart when Sol/s lower limit is reached.
sysrq magic key restart from watchdog results in hung reboot. Really sucks when dealing with remote rigs. I traced this issue to an update of the nouveau default driver and EFI install. Legacy BIOS install does not have this problem. To fix set “nomodeset” in GRUB2 and problem goes away.
Now you have completely automated rigs, add a raspberry Pi and you can control your rigs via your cell, and they will let you know if they have any problem they cant fix on their own.