My miners would always drop a GPU over time, usually GPU2 but not always. I have seen this with both Claymore and optiminer, so it looks like an Ubuntu and/or AMD driver issue rather than a miner issue. Below are some tips I have learned and a step by step for a rock solid optiminer 1.4.0 rig using RX480 8GB cards, getting ~300 S/s per card (2,100 S/s for a 7 GPU rig at 1,100 W at the wall).
Here is what I know:
Ubuntu 16.04 Server is very difficult to install and get working with a 7 GPU rig (or even 6 for that matter). I have gotten it to work, but there are so many issues to fix that it’s just not worth the time. Ubuntu 16.04 Desktop installs with relative ease and has just a few minor issues.
So here is a step by step to set up a stable optiminer 1.4.0 rig:
1. Install Ubuntu Desktop 16.04.1 with no GPUs installed.
2. Install OpenSSH, Byobu, vim (unless you like nano), and lm-sensors (you can do everything over ssh from this point).
3. Perform an apt update and upgrade, then reboot.
4. Install the amdgpu-pro 16.50 drivers, add yourself to the video group, and shut down.
5. Install all GPUs and boot. Check that lm-sensors shows all your GPUs; if some are missing, a reboot will usually fix this. On my rig, booting with the ethernet port disabled in the BIOS gets my 7th GPU to show. If I then reboot and re-enable the ethernet, the 7th GPU stays (probably particular to my MBO, but it seems something needs to be freed up on the PCI bus).
6. Once you are at this point, enable Byobu and go headless via ssh.
7. Set the default GRUB behavior to boot into a console, i.e. do not load the desktop / GUI.
8. Load optiminer and your run scripts.
9. Via ssh and Byobu, start your miner and fine tune fans, verify S/s, temps, etc. (system check).
10. Reboot via ssh and Byobu and start your miner.
11. Immediately disconnect via F6 and leave the session running.
That is it. Leave it running like this and monitor your S/s via your pool statistics, NOT ssh. If you do connect via ssh to check on things or perform maintenance, reboot, start your miner, and then disconnect again as in step 11. If you see a GPU go down, try step 11 again until things become stable (then leave it alone). I was restarting almost every day, but now my miners just run and run and never drop a GPU. However, if I connect to a rig via ssh I can drop a GPU fairly quickly just as before, even if I disconnect quickly.
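The check that lm-sensors shows every GPU can be scripted. Below is a minimal Python sketch, not something from the original setup. It assumes each GPU appears in the output of `sensors` as a chip header line such as `amdgpu-pci-0100`; that naming, the function names, and the expected count of 7 are assumptions, so adjust them to whatever your own `sensors` output looks like.

```python
import re
import subprocess

# Assumption: each GPU shows up as a chip header line like "amdgpu-pci-0100".
# Change the pattern if your driver reports a different chip name.
GPU_CHIP_PATTERN = re.compile(r"^amdgpu-pci-[0-9a-f]+[ \t]*$", re.MULTILINE)


def count_gpus(sensors_output: str) -> int:
    """Count GPU chip headers in the text printed by `sensors`."""
    return len(GPU_CHIP_PATTERN.findall(sensors_output))


def missing_gpus(sensors_output: str, expected: int) -> int:
    """Return how many GPUs are missing compared with the expected count."""
    return max(expected - count_gpus(sensors_output), 0)


def read_sensors() -> str:
    """Run the lm-sensors `sensors` command locally and return its output."""
    result = subprocess.run(
        ["sensors"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Example use on the rig itself (7 is the GPU count of the rig described above):
#   print(missing_gpus(read_sensors(), expected=7))
```

If this reports a missing card right after boot, the advice above still applies: reboot before starting the miner.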
A couple pointers:
Do not use sudo reboot; use sudo shutdown -r now instead. I had issues with reboot hanging, especially when a GPU dropped and I was out of town. If you do need to ssh into a rig remotely, shut the miner down as soon as you connect. A soft reboot can fail with a dropped GPU, so it’s better to shut down the miner before that can happen. Always reboot, start the miner, and immediately disconnect when you are done. If you want to monitor things a little closer, set up your rigs to email you performance statistics. The trick is: don’t use a GPU for a display or ssh connection and things will just keep running.
That’s what has worked for me, and I have not had any more issues. I would expect Claymore to be stable doing things this way as well.
I am having an issue with optiminer 1.5.0 that drops my performance to ~1 S/s per card no matter what I try, so for now I will stay with 1.4.0 and troubleshoot 1.5.0 another time.
I had noticed the same thing wrt ssh logins. I was away over the holidays for 11 days. At that time I was running v0.6. I just monitored the rigs from my pool Web site stats - no remote ssh logins. They (20 full rigs plus a couple small test/dev systems) ran with no issues the entire time I was gone. What does seem to work, without rig issues, is to issue remote commands via ssh. So, you can “ssh user@host command” to see stats, uptime, etc., but you are not logging into the box.
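For anyone who wants to script the “ssh user@host command” approach, here is a minimal Python sketch. It only wraps the standard ssh client. The function names, the 15 second timeout, and the example user and host are made up for illustration, and key-based login is assumed. The `-T` option asks ssh not to allocate a pseudo-terminal and `BatchMode=yes` makes it fail rather than prompt for a password. Whether either option matters for the GPU drops is not something anyone in this thread has tested.

```python
import subprocess
from typing import List


def build_ssh_command(user: str, host: str, remote_command: str) -> List[str]:
    """Build the argv for a one-shot `ssh user@host command` call."""
    return ["ssh", "-T", "-o", "BatchMode=yes", f"{user}@{host}", remote_command]


def run_remote(user: str, host: str, remote_command: str,
               timeout: float = 15.0) -> str:
    """Run one command on the rig without a login shell; return its stdout."""
    result = subprocess.run(
        build_ssh_command(user, host, remote_command),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout


# Example (hypothetical user and host names):
#   print(run_remote("miner", "rig1", "uptime"))
```

Each call opens a connection, runs the one command, and closes again, so nothing stays logged in on the rig.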
I’m sure everyone appreciates your effort putting this post together. Thank you!
Edit: Btw, my small test/dev systems with just one or two GPUs don’t seem to have any problem with a “permanent” ssh login, only the 5-6 card rigs.
Yes, I have a small test system with Windows and Ubuntu for BIOS mods and GPU testing. It too seems to run forever with an active ssh connection and just a couple of GPUs. I tried removing cards from rigs and this issue persisted down to 5 cards, possibly 4, but I did not let that setup run as long. Thanks.
Honestly, I cannot imagine what the fatal interaction could be; it makes no sense. The OP is seeing it on Ubuntu 16.04 Desktop, while I’m running 14.04.3 server, go figure. As I said, using ssh to issue remote commands, rather than logging in, doesn’t seem to cause the issue.
I just checked on one of my rigs as flypool was showing a drop in hash rate. I ignored my own advice and connected for just a few seconds to confirm all GPUs were still up (they were). Sure enough, a short time later I lost a GPU on that rig, so it’s very sensitive to ssh connections. I rebooted and then disconnected even before the miner came up, and it’s running fine now.
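Since the pool stats are the safe way to watch a rig, a drop in reported hash rate can be turned into a rough guess at how many GPUs are down without connecting. The Python sketch below is only arithmetic on the numbers quoted in this thread (about 300 S/s per card, 7 cards). The function name is made up, and pool hash rates are noisy averages, so treat the result as a hint rather than a measurement.

```python
def estimate_dropped_gpus(reported_rate: float, gpu_count: int,
                          per_gpu_rate: float) -> int:
    """Estimate how many GPUs are down from the rig's reported hash rate.

    The shortfall against the expected rate is divided by the per-card rate
    and rounded to the nearest whole card, then clamped to 0..gpu_count.
    """
    expected_rate = gpu_count * per_gpu_rate
    shortfall = expected_rate - reported_rate
    dropped = round(shortfall / per_gpu_rate)
    return min(max(dropped, 0), gpu_count)


# A 7 card rig at ~300 S/s per card should report ~2,100 S/s.
# If the pool shows ~1,800 S/s, one card has probably dropped.
print(estimate_dropped_gpus(1800, gpu_count=7, per_gpu_rate=300))
```

Use a pool average taken over a long enough window that normal variance does not look like a missing card.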
I like dlehenky’s idea of sending commands via ssh without actually logging in. I think I will give that a try and also use it to monitor the rigs. I can get to my rigs via a VPN from anywhere.
I’m not 100% sure it’s actually the ssh, as I have also seen GPUs drop when I use a local keyboard and monitor. I think I’ll take a closer look into that as well.
I have one rig that was stable but now refuses to settle down. It’s always the same GPU that drops, and it’s a different model than my others. If I swap its position around it has become stable once or twice, but mostly not. If I remove it and put it in my test rig it runs for days and days with a perpetual ssh connection, go figure.
All are on risers. The GPU with problems is now in an x16 slot in my test rig and running fine, and the rig with only 6 cards is now running like a champ. I have other nearly identical rigs with 7 GPUs that run for several weeks as long as I don’t ssh to them. I have swapped risers and moved cards, and this one rig always gives me problems. Its serial number is one off from other rigs that just run and run (same MBO, same PSU, same SSD, same OS config, just a few XFX cards instead of all Sapphire Nitro), but the XFX cards perform better and run forever on my test rig. I may order a complete new set of risers to try. I have heard of others having problems with risers; however, that does not explain my stable rigs… as long as I don’t ssh to them. They are so stable I can forget about them, they just run. I only have one rig that is a pain in the ass.
But does the MB have a x16 slot? I have Asrock H97, mostly, with 1 GPU in the x16 slot and 5 on powered riser. I also have a couple Asrock H81 rigs with all 6 on risers. I just thought if the MB has a x16 slot, try the card in there and the rest on risers. Personally, 6 card rigs are sensitive enough for me; I can’t imagine running 7, especially when the weather warms up.
I do not have much experience with Ubuntu. Is there a way to break this step by step down so a layman can accomplish this goal? Not to mention it would be nice to understand the breakdown of Ubuntu functions. Ugh.
dlehenky, you are a genius! I moved my problem GPU to the last PCIe 16x slot on my MBO (MSI Z97 Gaming 5) so it would not block any other slots and I could still have 6 other GPUs on risers. Wouldn’t you know it, the miner comes up with all 7 GPUs and no sign of a stability issue so far. This last x16 slot is GPU2 in my systems and is almost always the GPU that drops.
My heat dissipation is off a bit now, and I have one GPU at 73C, but I think I just need to tweak the fan speeds to even things out, and then start a long term test.
I am getting ~305 - 310 S/s on the XFX RX480’s and ~295 - 300 S/s on the Sapphire Nitro RX480’s.
On my problem rig I moved my dropping GPU to a 16x PCIe slot on the MOBO as suggested by dlehenky. This seems to have resolved the stability issues on that rig (so it was a separate issue from ssh).
I then decided to do the same thing on one of my stable rigs (one that works as long as I don’t ssh into it) to see what would happen. Unfortunately, the same issue is still present: if I keep an ssh connection to the rig, a GPU will drop. However, it is interesting that it is now a different GPU that drops, and no longer the one plugged directly into the 16x slot.
I don’t run my miner as root and I use Byobu instead of screen. I can’t think of anything else that might be different on your setup.
So at this point I have my 7 GPU rigs up and running smoothly, we will see how long they can keep that up.
Did I mention I run one instance of Optiminer per card? This is something I’ve been doing for a long time, with different miners, including ETH. I have proven to myself time and again that a miner per GPU is always more stable than one miner per rig if you are running 4 or more GPUs per rig. The threading in the miner, with a higher number of GPUs, seems to hang the whole rig if 1 GPU hangs. With one miner per GPU, only the hung GPU stops mining, while the others continue normally. In the event of a hung GPU, often you can recover with just a miner restart, rather than rebooting or hard-resetting the whole rig. You might want to try that, just for grins :))
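A rough Python sketch of the one-miner-per-GPU idea, for anyone who wants to try it. This is not how the poster actually launches Optiminer. The wrapper script `./run-miner.sh` and its `--device` flag in the usage comment are hypothetical placeholders, so substitute whatever command line your miner really needs to pin it to one card. The sketch only restarts miners whose process has exited. A miner that hangs without exiting would need a separate hash-rate check.

```python
import subprocess
from typing import Dict, List


def per_gpu_commands(template: List[str], gpu_count: int) -> Dict[int, List[str]]:
    """Fill the "{gpu}" placeholder in the template once per GPU index."""
    return {
        gpu: [part.format(gpu=gpu) for part in template]
        for gpu in range(gpu_count)
    }


def start_all(commands: Dict[int, List[str]]) -> Dict[int, subprocess.Popen]:
    """Start one miner process per GPU."""
    return {gpu: subprocess.Popen(argv) for gpu, argv in commands.items()}


def restart_exited(procs: Dict[int, subprocess.Popen],
                   commands: Dict[int, List[str]]) -> List[int]:
    """Restart only the miners that have exited; return their GPU indices."""
    restarted = []
    for gpu, proc in list(procs.items()):
        if proc.poll() is not None:  # process ended, the other GPUs keep mining
            procs[gpu] = subprocess.Popen(commands[gpu])
            restarted.append(gpu)
    return restarted


# Example with a hypothetical wrapper script and flag:
#   commands = per_gpu_commands(["./run-miner.sh", "--device", "{gpu}"], 6)
#   procs = start_all(commands)
#   ...then call restart_exited(procs, commands) every minute or so.
```

Because each card has its own process, one dead miner can be relaunched without touching the rest.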
That’s interesting. I tried that as well but found it did not make a difference on my rigs. However, my rigs also drop a GPU differently than yours. If I lose a GPU, the rest mine happily along, and when I was out of town I would just leave them running that way, as they would rarely drop a second GPU (Claymore also did this if I disabled the watchdog). However, when my rig drops a GPU, I can’t restart the miner as it will always hang. I have to reboot, and in some cases it has to be a hard reboot, as I’ve had soft reboots hang on some rigs, especially if I try to restart the miner. Big issue when you are out of town and trying to talk the wife or kids through a hard reset.
I am thinking we are pushing the hardware hard enough that slight differences in hardware are precipitating very different problems. I think the MOBO and risers are behind most problems. I can say I don’t much like the MSI boards and intend to try ASUS for my next 7 GPU rig build.
I have two 7 GPU rigs that are almost identical (MOBO are 1 serial number different) and they behave completely differently.