So is anyone else seeing the Ubuntu AMD pro 17.04 driver suddenly stop working on multiple rigs simultaneously?
My AMD miners have pretty much been running nonstop for two years now. The last significant update I did was to the 17.04 drivers that improved AMD mining performance late last year.
So Thursday night my AMD test rig texts me that it's rebooting and then does not come back online (it's a test rig, so it reboots many times a day). I send a manual restart from my phone and then get a text that the miner cannot start and the rig cannot repair itself. I manually ssh in to check what is going on, and it's getting “no AMD openCL platform found” when the miner tries to start. Long story short, I could not get it working that night, so I just shut it down.
Then Friday morning one of the production AMD rigs texts me that it can't restart (production rigs only text me when they have a problem they can't fix). So I log in and it has the exact same problem. I spent several hours reinstalling AMD drivers and updating the system, but the openCL platform refused to start. It seemed very odd that two independent systems would suddenly get the same issue. So I logged into a third, still working AMD rig and tested the openCL platform, and everything was fine. I then issued a restart command… and that rig then had the exact same problem as the other two. To make a long story short, EVERY AMD rig I have that has rebooted since Thursday night has the exact same problem.
So can someone explain how multiple rigs get the exact same driver problem, all at the exact same time?
AMD killing the driver with a time-based trojan is the only possible explanation. These rigs are isolated and only allowed to make connections to the IPs and ports I specify in the firewall. If they attempt any other connection I get notified.
FIX:
Turns out the fix was a bit of a pain. Just updating Ubuntu to 16.04.5 and installing the AMD 18.40 driver does not work. Apparently the new AMD drivers need a Linux kernel newer than 4.4.0, so you also have to install the HWE stack.
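If you want to check whether a rig is still on the old kernel before starting, something like the following rough sketch should do it. The `version_gt` helper is just a name made up for this example, and it assumes GNU `sort` with `-V` (version sort), which stock Ubuntu has. The 4.4.0 threshold is simply the number from the paragraph above, not something taken from AMD's documentation.

```shell
#!/bin/sh
# version_gt A B: succeeds only if version A is strictly newer than version B.
# Hypothetical helper for this example; relies on GNU sort -V.
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# uname -r prints something like "4.4.0-137-generic"; keep only the part
# before the first dash so it can be compared as a plain version.
running="$(uname -r | cut -d- -f1)"

if version_gt "$running" "4.4.0"; then
    echo "kernel $running is newer than 4.4.0"
else
    echo "kernel $running is still 4.4.0 or older, install the HWE stack first"
fi
```

A plain string comparison would get this wrong (it would rank 4.15.0 below 4.4.0), which is why the sketch uses version sort.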
Update Ubuntu to 16.04.5 via the normal apt methods
Get the HWE stack with “sudo apt install --install-recommends linux-generic-hwe-16.04”, then reboot
If you are running desktop Ubuntu you need the HWE xorg package as well: “xserver-xorg-hwe-16.04”
Remove the old AMD driver “amdgpu-pro-uninstall” (watch this if you are running local)
Get the latest AMD driver and extract with “tar -Jxvf amdgpu-pro-18.40-XXXXXX.tar.xz”
Switch to the extracted AMD directory and install with “./amdgpu-pro-install -y”
Use the --opencl=legacy or --opencl=pal option based on your hardware
Use the --headless option if you are running headless
Restart your miners
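
For anyone who wants to script the steps above, here is a rough sketch, not a tested installer. It only prints the commands unless you set DRY_RUN=0. Several things in it are assumptions: the `run`, `pre_reboot` and `post_reboot` names are made up for this example, "the normal apt methods" is taken to mean `apt update` plus `apt full-upgrade`, the extracted directory is assumed to have the same name as the archive, and the XXXXXX build number is left as a placeholder that you have to replace with your actual download.

```shell
#!/bin/sh
# Dry run by default: each command is printed with a leading "+ " instead of
# being executed. Set DRY_RUN=0 to really run them.
DRY_RUN="${DRY_RUN:-1}"
# Assumption: mining rigs are headless unless told otherwise.
HEADLESS="${HEADLESS:-1}"
# "legacy" or "pal", depending on your hardware.
OPENCL="${OPENCL:-legacy}"
# Placeholder from the post; substitute the real build number.
DRIVER="${DRIVER:-amdgpu-pro-18.40-XXXXXX}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Everything that has to happen before the reboot into the HWE kernel.
pre_reboot() {
    run sudo apt update
    run sudo apt full-upgrade
    run sudo apt install --install-recommends linux-generic-hwe-16.04
    if [ "$HEADLESS" != "1" ]; then
        # Only needed on desktop Ubuntu.
        run sudo apt install --install-recommends xserver-xorg-hwe-16.04
    fi
    run sudo reboot
}

# Everything that happens once the rig is back up on the new kernel.
post_reboot() {
    run amdgpu-pro-uninstall
    run tar -Jxvf "$DRIVER.tar.xz"
    run cd "$DRIVER"
    if [ "$HEADLESS" = "1" ]; then
        run ./amdgpu-pro-install -y --opencl="$OPENCL" --headless
    else
        run ./amdgpu-pro-install -y --opencl="$OPENCL"
    fi
}

case "${1:-}" in
    pre)  pre_reboot ;;
    post) post_reboot ;;
    *)    echo "usage: pre | post (DRY_RUN=$DRY_RUN)" ;;
esac
```

Run it with "pre", let the rig reboot, then run it again with "post" and restart your miners. The split into two phases is there because the driver install has to happen on the new kernel, so one script cannot just run straight through.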
It was just awesome that AMD forced me to do all that, for the exact same performance, when the rigs had been working perfectly, nonstop, for the last year.
Way to go AMD! That is why I don’t buy AMD any longer!