Troubleshooting GPU stability?

I just finished my 8-GPU rig, consisting of 4x ASUS 1080 Ti Turbo, 2x ASUS 1080 Ti Strix and 2x Gainward 1080 Ti GS. I’ve added more and more cards over several months. Now, with the installation of the two latest cards, my GPU numbering in MSI Afterburner got scrambled, and I had to start all over again. I haven’t been good about checking for stability or taking notes of the clock/power settings as I went along, since the EWBF watchdog script took care of restarting the miner and the computer. The downtime when the computer reboots is so small that I didn’t bother to perfect the settings.

But now, when overclocking, the computer sometimes just freezes. Normally, I can check on my rig by remote desktop, but when this happens I have to manually power the rig off and on. I believe it has to do with the clock speeds and power limits I’ve set.

How can I check which card is the culprit? Is there some kind of log somewhere that can tell me which GPU is giving me trouble? Right now, I have to lower everything on all cards and make small adjustments to one card at a time, let the miner run for hours, then adjust again. It’ll take days. Any tips?

You can run each card in its own miner process and make sure to turn off any failure restart scripts. Then, when one fails, only that card’s miner will go down and the others will keep going.
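A rough sketch of the one-process-per-card idea, in Python. It assumes the miner is a CUDA program that honors the standard `CUDA_VISIBLE_DEVICES` environment variable; `miner.exe` and the log file names are placeholders, not a real miner command line, so swap in your own executable and pool arguments. Setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` makes CUDA number the cards by PCI bus position, which may still not match the numbering Afterburner shows.

```python
import os
import subprocess

# Placeholder command (assumption): replace with your miner and its arguments.
MINER_CMD = ["miner.exe"]


def build_env(gpu_index, base_env=None):
    """Return an environment in which CUDA can see only one GPU."""
    env = dict(os.environ if base_env is None else base_env)
    # Number devices by PCI bus position instead of CUDA's default ordering.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch_all(gpu_count, miner_cmd=MINER_CMD):
    """Start one miner process per GPU, each writing to its own log file."""
    procs = {}
    for gpu in range(gpu_count):
        log = open(f"miner_gpu{gpu}.log", "a")
        procs[gpu] = subprocess.Popen(
            miner_cmd,
            env=build_env(gpu),
            stdout=log,
            stderr=subprocess.STDOUT,
        )
    return procs


def crashed(procs):
    """Return the GPU indices whose miner process has exited."""
    return [gpu for gpu, proc in procs.items() if proc.poll() is not None]
```

Calling `launch_all(8)` once and then checking `crashed(procs)` every minute or so would show which card’s miner died first, as long as nothing restarts the miners automatically.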

I had a similar issue, but for me it was power related: as card temps went up, so did the power draw, causing the miner and driver to crash.

Thanks for the tip. I can try it, but the problem is that the whole system locks up when this happens. I can’t connect to the rig at all to check the status, and I need to do a hard reboot. So I have no way to monitor which miner goes down, as the whole rig goes down. A log file of some sort might do the trick, but I don’t know if such a thing exists.
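One way to get such a log, sketched in Python: poll `nvidia-smi` every few seconds and force each reading to disk, so the last lines survive a hard freeze. This assumes `nvidia-smi` is on the PATH and that the driver reports these fields for your cards (some may come back as not supported). The file name and function names are made up for illustration. The log will not name the culprit directly; it only shows each card’s temperature, power draw and clocks just before the lockup.

```python
import csv
import io
import os
import subprocess
import time

# Fields to request from nvidia-smi's --query-gpu option.
FIELDS = "index,temperature.gpu,power.draw,clocks.sm,clocks.mem"


def parse(text):
    """Split nvidia-smi CSV output into one list of strings per GPU."""
    return [[cell.strip() for cell in row]
            for row in csv.reader(io.StringIO(text)) if row]


def read_gpus():
    """Query all GPUs once; raises if nvidia-smi fails or is missing."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse(result.stdout)


def append_durably(path, rows, stamp):
    """Append rows and force them to disk so they survive a hard reset."""
    with open(path, "a") as f:
        for row in rows:
            f.write(",".join([stamp] + row) + "\n")
        f.flush()
        os.fsync(f.fileno())


def run_logger(path="gpu_log.csv", interval=5):
    """Log every GPU's readings until the process (or the rig) stops."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        append_durably(path, read_gpus(), stamp)
        time.sleep(interval)
```

After a freeze and reboot, the tail of `gpu_log.csv` shows the final readings for each card, which may hint at a card running unusually hot or drawing unusually high power.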