This is the way I prefer to run them also. I’ve done it both ways, and I drop GPUs either way.
These rigs that hang GPUs under optiminer @ 800w TDP can run ETH+DCR all day long @ 1030w. So I find it real hard to believe it’s hardware related.
Different mining algorithms, even different implementations of the same algo, can stress GPUs in a multitude of ways. Different GPU models may react to that specific stress differently, as well. I’m not sure most folks realize just how complex a beast the modern GPU is, and mining is pushing the limits 24x7. I’ve worked on ETH and ZEC algos, and they are as different as night and day.
Ya, I don’t know… I’ve been dropping GPUs w/ optiminer as long as I can remember… like optiminer 0.3x @ 80 sol (or whatever it was) and optiminer 1.5.0 @ 260 sol.
It’s a huge pain in the neck, but I also drop GPUs with the same frequency using Claymore. However, on both they would at least keep mining on the other GPUs after dropping one (very different from what I hear from others).
I would disable the watchdog in Claymore, since restarting the miner after a GPU drop would hang the whole rig. I tried running multiple instances and restarting the miner periodically (Claymore); nothing seemed to help. If I let them run long enough with a dropped GPU they could sometimes drop a second GPU, but not often (so 6 GPUs is more stable than 7). However, now both my 7 GPU rigs have been running since yesterday on all GPUs with no signs of instability (I usually start to see a drop in S/s on a GPU when it starts to drop, and I should have dropped a GPU by now). If this keeps up I may consider building more 7 GPU rigs.
I have a cronjob that checks for miner processes that drop out (the H= and N= lines use command substitution; I normally write them with the back-tic character “`”, which I can’t ever seem to paste into comments on this site, so they are shown here in the equivalent $(...) form):
#!/bin/bash
# Check for dropped miner process(es)
H=$(ps -C optiminer-zcash | grep -c opti)   # count running optiminer-zcash processes
N=$(cat ~/.ngpu)                            # number of GPUs expected on this rig
if [ "$H" != "$N" ]
then
touch ~/.no-reboot
echo "_____________________________________"
echo "****** Miner Process(es) Quit ******"
touch ~/.restart
fi
exit 0
Note: ~/.ngpu is a file with the number of gpus running on this rig, i.e. just that file with a single number, no .
Touching ~/.restart (or ~/.reboot) causes a root script, that I start from /etc/rc.local, to restart (not reboot) the miner process(es); it will also reboot with a hard reset if ~/.reboot exists, if that’s what you want to do. I can provide my /etc/rc.local script, if you are interested.
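The /etc/rc.local script itself is not shown in this thread, so the following is only a minimal sketch of how a root-side watcher could decide what to do with those flag files. It is not the poster's script. The function names `decide_action` and `clear_flags` are made up for illustration, and the rule that ~/.no-reboot vetoes a pending reboot is an assumption; the thread does not say how that file is used.

```bash
#!/bin/bash
# Hypothetical sketch, NOT the poster's rc.local script.
# decide_action DIR prints "reboot", "restart", or "none" depending on
# which flag files exist in DIR (normally the miner user's home directory).
# Assumption: .no-reboot blocks a reboot request; the thread does not
# document what that file is really for.
decide_action() {
    local dir="$1"
    if [ -e "$dir/.reboot" ] && [ ! -e "$dir/.no-reboot" ]; then
        echo reboot
    elif [ -e "$dir/.restart" ]; then
        echo restart
    else
        echo none
    fi
}

# clear_flags DIR removes the request flags once they have been handled.
clear_flags() {
    rm -f "$1/.restart" "$1/.reboot"
}
```

A root loop started from /etc/rc.local could call `decide_action /home/user` every few seconds, restart the miner or reboot accordingly, and then call `clear_flags /home/user` so the same request is not handled twice.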
There is a new amdgpu-pro 16.60 driver if ye dare try…
I’ll try 16.60, but only on my test rig. However, since my 7 GPU rigs are running like champs, I don’t have any spare GPUs for my test rig. Looks like I need to order a couple and start testing.
I head out of town this week, so it will have to wait till I get back. I’m looking forward to things just working for once. It’s 3D iterative image reconstruction all week, so I will not have time to babysit my rigs. It’s all Ubuntu based and bleeding edge, so I’m not expecting things to go smoothly; they never do. At least I expect to come back with some 3U and 4U ATX systems I can re-purpose for mining.
For well over 24 hours, I have not seen any stability issues on my 7 GPU rigs unless I ssh into the system. I upgraded to 1.5.0 once I figured out that I can’t mine with an intensity over -i 7 in 1.5.0. I installed 1.5.0, started the miner, and only had to reboot 1 rig to get them stable again. They are running so well I may consider trying for 8 GPUs (I have plenty of extra PSU wattage). Either of you ever try a PCI splitter?
I’ll let you know how long they stay stable.
Interesting. Yes, I would like to test, but I don’t have any extra GPUs now for my test rig. They are all running on my 7 GPU rigs. I’ll order some so I can test, maybe next weekend. Thanks!
Sorry for not getting back to you sooner. dlehenky has an excellent tutorial for Ubuntu server setup that he provided a link to.
I can also dumb down my process:
- Install Ubuntu Desktop 16.04.1 with no GPUs installed.
a) Default config, but select “whole drive guided”, NOT LVM, for partitioning.
b) I would NOT encrypt your home directory, as it is not critical on a miner (PC with wallet yes, but not a miner).
- From the local console, install openssh-server, byobu, vim (unless you like nano), and lm-sensors.
$ sudo apt install openssh-server vim byobu lm-sensors
- Perform an apt update and upgrade, then reboot.
$ sudo apt update
$ sudo apt upgrade
$ sudo shutdown -r now #reboot
- Set a static IP for your miner on your router / firewall via DHCP.
- Transfer the amdgpu-pro 16.50 drivers, optiminer, scripts, etc., and install.
I like to transfer files via ssh from another PC, so:
$ scp amdgpu-pro-16.50-362463.tar.xz user@host_IP:/home/user/
$ scp optiminer-zcash-1.5.0.tar.gz user@host_IP:/home/user/
$ scp startminer.sh user@host_IP:/home/user/ #start mining script, if you have one
$ ssh user@host_IP #connect to miner
$ byobu-enable #enable for next login; you could also use screen, but I have not tested it
$ tar -Jxvf amdgpu-pro-16.50-362463.tar.xz #unpack AMD drivers
$ cd amdgpu-pro-16.50-362463 #change to driver directory
$ ./amdgpu-pro-install -y #install drivers
$ sudo usermod -a -G video user #add user to video group
$ sudo shutdown -h now #shutdown miner
- Install all GPUs and reboot (check that lm-sensors shows all your GPUs). If some are missing, a reboot will usually fix this. Some like to do one at a time, or at least start with one; it’s your preference and whatever works best for you.
$ ssh user@host_IP
$ sensors #your GPUs should show up here, else reboot
- Via ssh and byobu, start your miner and fine tune fans, verify S/s, temps, etc. (system check).
$ ssh user@host_IP
$ bash startminer.sh #your startup script
Press F2 to open another terminal in byobu.
$ sensors #monitor your GPU temps
Then press F2 to open another terminal window in byobu.
$ vim startminer.sh #edit your start script as needed and save (F3 will move through the three open terminals in byobu)
Tip: you can modify the start script and also type in the commands to fine tune the fan speed vs temp. I find I don’t need an extra program to control fan speed and temps, as a static setting that is tuned for your rig will keep your temps fairly stable (I run upper 50’s to lower 60’s on all my GPUs this way). I could go lower, but it starts to get noisy.
- OPTIONAL: Set the default GRUB behavior to boot into a console, AKA do not load the desktop / GUI.
- Reboot via ssh and byobu and start your miner.
Ctrl-C #shutdown your miner
$ sudo shutdown -r now #reboot miner
$ ssh user@host_IP
$ bash startminer.sh #your startup script; on Ubuntu the default /bin/sh is dash rather than bash, so I like to specify bash to be safe
- Immediately disconnect via F6 and leave the session running. I like to open a new empty term window with F2, then disconnect with F6.
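The "check that all your GPUs show up" step above can also be scripted instead of eyeballing the sensors output. This is a minimal sketch, assuming the amdgpu driver registers one hwmon device per GPU whose `name` file reads `amdgpu`; the function names `count_gpus` and `gpus_ok` are made up for illustration, and the optional directory argument exists only so the functions can be pointed at a test directory.

```bash
#!/bin/bash
# count_gpus [HWMON_ROOT] prints how many hwmon devices report the name
# "amdgpu". HWMON_ROOT defaults to /sys/class/hwmon.
# Assumption: one such hwmon device exists per detected AMD GPU.
count_gpus() {
    local root="${1:-/sys/class/hwmon}" n=0 f
    for f in "$root"/hwmon*/name; do
        [ -r "$f" ] || continue          # glob matched nothing, or unreadable
        if [ "$(cat "$f")" = "amdgpu" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# gpus_ok EXPECTED [HWMON_ROOT] succeeds only if the count matches EXPECTED.
gpus_ok() {
    [ "$(count_gpus "$2")" -eq "$1" ]
}
```

On a rig that keeps its expected GPU count in a file such as the ~/.ngpu mentioned earlier in the thread, something like `gpus_ok "$(cat ~/.ngpu)" || echo "GPU missing, try a reboot"` would flag a missing card right after boot.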
I use a startup script that modifies the permissions on my fan control so I can change them via software without having to run as root. An example of my start script is below:
#!/bin/bash
sudo chmod 777 /sys/class/drm/card0/device/hwmon/hwmon1/pwm1 #change permissions (your directories and numbering may be different)
sudo chmod 777 /sys/class/drm/card1/device/hwmon/hwmon2/pwm1
sudo chmod 777 /sys/class/drm/card2/device/hwmon/hwmon3/pwm1
sudo chmod 777 /sys/class/drm/card3/device/hwmon/hwmon4/pwm1
sudo chmod 777 /sys/class/drm/card4/device/hwmon/hwmon5/pwm1
sudo chmod 777 /sys/class/drm/card5/device/hwmon/hwmon6/pwm1
sudo chmod 777 /sys/class/drm/card6/device/hwmon/hwmon7/pwm1
echo 210 > /sys/class/drm/card0/device/hwmon/hwmon1/pwm1 #set fan speeds, don’t go over 250
echo 210 > /sys/class/drm/card1/device/hwmon/hwmon2/pwm1
echo 210 > /sys/class/drm/card2/device/hwmon/hwmon3/pwm1
echo 210 > /sys/class/drm/card3/device/hwmon/hwmon4/pwm1
echo 210 > /sys/class/drm/card4/device/hwmon/hwmon5/pwm1
echo 210 > /sys/class/drm/card5/device/hwmon/hwmon6/pwm1
echo 210 > /sys/class/drm/card6/device/hwmon/hwmon7/pwm1
cd optiminer-zcash/
export GPU_FORCE_64BIT_PTR=1
./optiminer-zcash -s zstratum+tls://us1-zcash.flypool.org:3443 -u your_t_address.rig_name -p x -i 7
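Since the card and hwmon numbering differs from rig to rig, the fourteen hard-coded lines above could be replaced by a loop that finds the pwm1 files with a glob. This is only a sketch of that idea, not the poster's script: `set_fans` is a made-up name, the optional second argument exists so the function can be tried against a test directory, and it assumes the pwm1 files are already writable (run it as root, or after the chmod step).

```bash
#!/bin/bash
# set_fans PWM [DRM_ROOT] writes PWM to every pwm1 file found under
# DRM_ROOT/card*/device/hwmon/hwmon*/ and prints how many were updated.
# DRM_ROOT defaults to /sys/class/drm.
# Values above 250 are refused, following the advice in the script above.
set_fans() {
    local pwm="$1" root="${2:-/sys/class/drm}" f count=0
    case "$pwm" in
        ''|*[!0-9]*)
            echo "set_fans: PWM must be a whole number" >&2
            return 1 ;;
    esac
    if [ "$pwm" -gt 250 ]; then
        echo "set_fans: refusing PWM above 250" >&2
        return 1
    fi
    for f in "$root"/card*/device/hwmon/hwmon*/pwm1; do
        [ -w "$f" ] || continue      # skip unmatched globs and read-only files
        echo "$pwm" > "$f"
        count=$((count + 1))
    done
    echo "$count"                    # number of fans updated
}
```

Calling `set_fans 210` would then do the same job as the seven echo lines, whatever the hwmon numbers happen to be on that boot.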
Hope that helps.
And btw, how do you control the fans under linux? Is there a separate app or with the miner options?
Ooops, now I saw the script you provided. Sorry.
So that was a long run of stability (48 hours).
I had several rigs drop a GPU and one pop a power strip breaker, all at the same time. All my systems except my rigs are on UPS backup; the rigs just have a power strip. Each rig has its own 20A rated circuit from my main 200A service (I ran the lines myself). So since only one popped a breaker on the power strip, but several dropped a GPU, I must have had a surge, brownout, or some other power issue. I checked my main UPS event log and it’s not showing anything, but that just means the mains did not drop. The only way several rigs, on separate lines, could have issues all at the same time is if I had a power problem that precipitated a GPU drop and tripped a breaker. That would make a lot of sense, but it would suck, since a UPS on each rig is not feasible. Perhaps a line conditioner would suffice?
Mine are just plugged into the outlets. However, I have suspected power variations might be part of the stability issues, every now and then. I did have one rig on a surge protector/power strip, and it popped the surge breaker, too, a couple times, so I took it off. Also, with v1.4, I did have rather frequent (once a day) events where 4 - 5 rigs would all reboot at exactly the same time, while the other 15-16 rigs kept mining happily. I have not seen that once since going to v1.5. In fact, v1.5 is proving to be the most stable mining I’ve experienced since SAv5, back in early November. None of my rigs has rebooted/reset in over 3 days!
thanks for the tip on byobu instead of screen!
How do you check gpu temps on cli in ubuntu 16.04 with amdgpu-pro driver?
You have to read the appropriate /sys location:
cat /sys/class/hwmon/hwmon0/temp1_input
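The value in temp1_input is in millidegrees Celsius, and the hwmon number is not fixed, so hwmon0 is not necessarily a GPU on every rig. A minimal sketch that walks every hwmon device and prints whole degrees follows; `gpu_temps` is a made-up name, and note that it will also list non-GPU sensors (CPU, for example) if they expose temp1_input.

```bash
#!/bin/bash
# gpu_temps [HWMON_ROOT] prints one line per hwmon device that exposes
# temp1_input, converting millidegrees Celsius to whole degrees.
# HWMON_ROOT defaults to /sys/class/hwmon.
gpu_temps() {
    local root="${1:-/sys/class/hwmon}" f raw
    for f in "$root"/hwmon*/temp1_input; do
        [ -r "$f" ] || continue          # glob matched nothing, or unreadable
        raw=$(cat "$f")
        echo "$(basename "$(dirname "$f")") $((raw / 1000))C"
    done
}
```

The one-line-per-sensor output is easy to feed to a monitoring script; each line is the hwmon directory name followed by the temperature.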
thanks! Now I can supplement the optiminer logs with temperatures for my monitoring scripts in zabbix!
Sorry to jump in late, but here at Zeropond we have lots of experience with the “SSH issue” and GPUs dropping off.
My first guess is that you have a bad riser, or maybe a bad GPU, which is dropping off. The reason moving to a 16x slot “helps” is due to the bus architecture on the motherboard. Generally, a 16x slot will get its own bus & controller, while the 1x and 4x slots will be on a shared bus, and they will be “downstream” from the primary I/O controller on a secondary controller. So if you have a bad GPU/riser on this shared bus, it can confound the other peripherals. Usually only one CPU/core will have I/O responsibility, and I/O to the other CPU/core goes across a special interconnect on-die. So if you are having I/O bus problems, you might see one of the CPUs lock up at 100% in an iowait state, and this can cause SSH to stop connecting, since it needs I/O on the PCI bus. This is not the driver’s fault or the CPU’s fault: it’s probably a bus problem caused by a bad riser or GPU. Moving that bad riser/GPU to the 16x slot merely isolates it from the others, masking the problem. I would not be surprised if this card goes down again, but less frequently. We have also seen machines stop responding to pings due to a bad GPU, again because the NIC is on the same bus as the bad card.
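One way to look for the "core stuck in iowait" symptom described above is to compare two copies of /proc/stat taken a few seconds apart. The sketch below is just one possible check, not something Zeropond described using: `iowait_pct` is a made-up name, and it relies on the documented /proc/stat layout where, on each per-core line, the fifth number after the `cpuN` label is the iowait counter.

```bash
#!/bin/bash
# iowait_pct BEFORE AFTER
# BEFORE and AFTER are two saved copies of /proc/stat. For each core the
# function prints "cpuN PCT", where PCT is the share of the elapsed time
# that core spent in iowait between the two snapshots.
iowait_pct() {
    awk '
        /^cpu[0-9]+ / {
            total = 0
            # sum user..steal (fields 2-9); guest time is already in user
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            if (FNR == NR) {             # first file: remember the baseline
                t0[$1] = total
                w0[$1] = $6
            } else if ($1 in t0) {       # second file: report the difference
                dt = total - t0[$1]
                pct = (dt > 0) ? 100 * ($6 - w0[$1]) / dt : 0
                printf "%s %d\n", $1, pct
            }
        }
    ' "$1" "$2"
}
```

For example: `cat /proc/stat > /tmp/a; sleep 5; cat /proc/stat > /tmp/b; iowait_pct /tmp/a /tmp/b`. A core that sits near 100 across repeated runs while SSH is unresponsive would fit the bad riser / bad GPU picture.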
I suggested to the OP to move the problem card directly to a x16 slot, i.e. no longer on a riser, which proved to help that GPU’s stability. I hear you on the risers; they are both a blessing and a curse. Zcash has a lot of bus traffic between the host and GPU, as well, so it’s not surprising that a marginal riser would be more likely to cause problems. Thanks for the very informative and definitive post!
Ah, I didn’t read that carefully. If the rig stays up with the GPU directly connected to the mobo, then it’s probably safe to blame the riser. They have a high failure rate and are cheap to replace.