I have been absent from the forums as of late. I've been very busy with work, not with the rigs; they have been flawless. I have not even thought about my rigs in a couple of months, they take care of themselves (they had a power outage a few weeks ago and restarted fine on their own). I have a couple of close mates that I wrote a Linux / Nvidia instruction manual for, and they both have been running smoothly for several months as well. I have been thinking about posting it, but a pending commercial deal prevents me from doing that.
Anyways, I am converting one rig to a deep learning AI rig via CUDA and TensorFlow. It has been a royal pain in the ass to get the Nvidia GPUs free from X11 for pure compute mode. It is possible, and I even got it working on one mining rig. To the point: the login loop that happens with the Nvidia drivers and the xorg system is caused by the OpenGL drivers and the integrated Intel GPU on the mobo. You can install the Nvidia drivers without OpenGL support and the problem goes away (you also need to blacklist the nouveau drivers when you do this; use: "sudo NVIDIA-Linux-x86_64-38x.xx.run --no-opengl-files"). Mind you, once OpenGL is installed you can't uninstall it, and you have to start over with a fresh install of Ubuntu. If you force the EDID.bin X11 option you can get the integrated graphics to run video and keep the Nvidia cards in a compute-only mode. That gets you an additional 50-60 Sol on GPU0 and eliminates the login loop issues that happen occasionally. It's working great for TensorFlow, but I still see some bugs with the miner… I'm not spending much time mining on this rig as it's now an AI neural net workstation, but once I have the time, I want to convert it to a mining image I can use on the rigs and I will update.
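For anyone trying this: the usual way to blacklist nouveau before running the .run installer is a modprobe config plus an initramfs rebuild. A minimal sketch (the file path and module options below are the standard ones, but double-check against your distro):

```
# /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
```

Then rebuild the initramfs with `sudo update-initramfs -u`, reboot to a text console (the installer will not run while X is up), and run the installer with `--no-opengl-files` as described above.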
I have never been more frustrated with Linux than when trying to set up overclocking and fan speed control. I have never reached a point where I can control the fan speed of more than 1 card using Coolbits. It has never worked for me… and I have spent around 80 hrs just trying to get it up. Now I am running Windows… I have encountered all the problems you can think of… login loops, xorg resets, xorg crashes, and the best part is… following the same process 5 different times, I get 5 different results… I have started to believe Albert Einstein was wrong… you can do the same thing over and over and expect different results when working with Linux.
I have the opposite experience. Windows has been nothing but trouble, so all my rigs run Linux. Stock Ubuntu, to be specific, where I have never had Xorg problems or issues controlling individual card clocks/fans.
I agree, Windows is nice for testing but not for a production rig.
You need auto start and auto restart unless you want to babysit rigs 24x7. They just need to tell you when they have a critical hardware failure like a fan… the rest they should do on their own.
Case in point: the temperature fluctuation in my facility is different in summer than in winter. In summer the temps get higher at night because the AC is not running as much, so the rigs get hotter (it's the air flow and the thermostat location). They automatically down-regulate each rig's power at night so that GPU temps don't go over 70C. During the day they up-regulate (when the AC is running more). In auto-overclock mode the software will find the maximum overclock possible for each GPU that keeps the GPU lockup/freeze rate below once every 24 hours. Once the cycle is complete I can get up to 800 Sol/s out of a GTX 1080 Ti.
Software needs to detect every rig issue and restart smoothly 100% of the time. Try to do that in Windows.
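For the auto start/restart piece, one common approach (not necessarily what the poster above uses) is a systemd unit with `Restart=always`, so the miner comes back both after a crash and after a power-outage reboot. A minimal sketch, where the unit name and script path are made-up examples:

```
# /etc/systemd/system/miner.service  (unit name and paths are examples)
[Unit]
Description=GPU miner
After=network-online.target

[Service]
Type=simple
ExecStart=/home/miner/start-miner.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Enable it once with `sudo systemctl enable --now miner.service`; systemd then relaunches the miner whenever it exits and starts it on every boot.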
Ohh, I absolutely agree, and that's the reason I tried so hard to put Linux on my rigs… my main motivation for Linux was to reboot using the magic key… But something or other never works. The furthest we got was being able to control 1 fan, but we were never able to control the fans on 2 cards, no matter what we tried… We used a 1080 Ti and an Asus Z270-F… we have followed numerous guides online to make it work but nothing worked.
Individual cards. My system has never worked beyond 1 card. As soon as I start controlling the second fan it stops working… nothing works after the command… or it gives an attribute error…
The miner start script commands to set Nvidia fan speeds and overclocks in Linux are below:
nvidia-settings -a [gpu:0]/GPUFanControlState=1
nvidia-settings -a [fan:0]/GPUTargetFanSpeed=100
nvidia-settings -a [gpu:0]/GPUGraphicsClockOffset[3]=200
nvidia-settings -a [gpu:0]/GPUMemoryTransferRateOffset[3]=600
nvidia-settings -a [gpu:1]/GPUFanControlState=1
nvidia-settings -a [fan:1]/GPUTargetFanSpeed=100
nvidia-settings -a [gpu:1]/GPUGraphicsClockOffset[3]=200
nvidia-settings -a [gpu:1]/GPUMemoryTransferRateOffset[3]=600
However, nothing will work if you do not have your xorg.conf set up properly. All your cards should show up in the Nvidia control panel (it should also work fully headless). If you can't set the fans and overclocks in the control panel, then the commands above will not work either. If you can only mine on the card with a monitor plugged in, then you don't have your xorg.conf set up right. You need to spoof a monitor via the edid.bin option in the Screen section:
Option "CustomEDID" "DFP-0:/etc/X11/edid.bin"
Linux is all about getting X11 configured correctly for Nvidia, the rest is easy.
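For reference, the CustomEDID option lives in the Screen section next to the Device section that drives the card. A minimal single-GPU sketch (the BusID and the edid.bin path are examples and will differ per system; check lspci for yours):

```
Section "Device"
    Identifier "nvidia0"
    Driver     "nvidia"
    BusID      "PCI:1:0:0"
EndSection

Section "Screen"
    Identifier "screen0"
    Device     "nvidia0"
    Option     "AllowEmptyInitialConfiguration" "true"
    Option     "Coolbits" "28"
    Option     "CustomEDID" "DFP-0:/etc/X11/edid.bin"
EndSection
```

Repeat the Device/Screen pair per card, changing the BusID and identifiers each time.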
You seem to be very well informed about programming and running things in Linux. I understand just about everything I see you post. However, I still need help on the how a lot of the time. Seeing as you are currently prevented from sharing your code, is there any place you could point me to learn more? I have trouble finding any teaching material at my level. It's either too simple with too much information, or too complex with things missing. Thanks
Set Coolbits to 28; that should enable core/mem overclocking, power changes, and fan speed changes for blower-style cards. GPUs with more than one fan aren't controllable on Linux. EVGA ones should have internal fan-speed curves that adjust themselves.
Also, you can just add this to your ~/.bashrc file:
alias setup='nvidia-settings -a GPUFanControlState=1; nvidia-settings -a GPUTargetFanSpeed=100; nvidia-settings -a GpuPowerMizerMode=1'
# powlvlall [optional_power_level]
powlvlall() {
    # set the power limit in watts, defaulting to 250
    if [ -n "$1" ]; then
        local POWER="$1"
    else
        local POWER="250"
    fi
    sudo nvidia-smi -pl $POWER
}
# oclockcount [zero_indexed_gpu_count] [optional_graphics] [optional_memory]
oclockcount() {
    if [ -n "$1" ]; then
        # set graphics offset
        if [ -n "$2" ]; then
            local GRAPHICSVAL="$2"
        else
            local GRAPHICSVAL="0"
        fi
        # set memory offset
        if [ -n "$3" ]; then
            local MEMORYVAL="$3"
        else
            local MEMORYVAL="0"
        fi
        # loop over gpu:0 .. gpu:$1
        for i in $(seq 0 $1); do
            nvidia-settings -a [gpu:$i]/GPUGraphicsClockOffset[3]=$GRAPHICSVAL
            nvidia-settings -a [gpu:$i]/GPUMemoryTransferRateOffset[3]=$MEMORYVAL
        done
    fi
}
Then, re-source bashrc:
$ source ~/.bashrc
The following would overclock a 12 Pascal GPU system to core +60, mem +120:
$ oclockcount 11 60 120
As ZC93 said, you need your xorg config set properly.
After the reboot you should be able to use the above. If you're ssh'ing in you may need to do a display export:
$ export DISPLAY=:0
You can turn that into a bash alias as well if you want to call it from the setup alias (easier).
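One extra gotcha: if X was started by a different user (typically the display manager), nvidia-settings over ssh may also need that session's Xauthority file, not just DISPLAY. The path below is only an example and varies by display manager and distro:

```
export DISPLAY=:0
export XAUTHORITY=/var/run/lightdm/root/:0   # example path; varies by display manager
```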
I've found that a python loop that checks statuses/power levels/temps/oc by using nvidia-smi and calling these aliases with os.system() calls is pretty effective. Maybe I'll post my python script (for 1080 Tis) if ZC93 shares more about the compute mode setup.
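In the same spirit, here is a minimal watchdog sketch in shell rather than python; it is not the script mentioned above, and the threshold, power limit, and restart target are all made-up examples. The over_limit helper holds the only real logic; watch_gpus wires it to nvidia-smi:

```shell
#!/usr/bin/env bash

# over_limit LIMIT TEMP...  -> succeeds if any TEMP >= LIMIT
over_limit() {
    local limit="$1"; shift
    local t
    for t in "$@"; do
        if [ "$t" -ge "$limit" ]; then
            return 0
        fi
    done
    return 1
}

# Poll temps once a minute; down-regulate power if anything runs hot,
# restart the miner service if nvidia-smi itself stops responding.
watch_gpus() {
    local max_temp="${1:-70}"   # example threshold in C
    local temps
    while true; do
        if ! temps=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader); then
            sudo systemctl restart miner.service   # hypothetical unit name
        elif over_limit "$max_temp" $temps; then
            sudo nvidia-smi -pl 180                # example power limit in watts
        fi
        sleep 60
    done
}
```

`watch_gpus 70` would then run forever in a tmux session or under its own systemd unit.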
alias setup='nvidia-settings -a GpuPowerMizerMode=1 -a GPUFanControlState=1 -a GPUTargetFanSpeed=100'
$ nvidia-xconfig -a --cool-bits=28 --allow-empty-initial-configuration
During my setup of a Linux miner over the last two days, I really struggled to get the machine "real" headless while running a miner + overclocking / underpowering the rig. In the end (after several reinstalls) I just accepted that the rig will need X11 and that those GPUs have a little memory reserved for Xorg.
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1021 G /usr/lib/xorg/Xorg 0MiB |
| 0 2072 C ./dstm/zm 479MiB |
| 0 2101 G /usr/lib/xorg/Xorg 15MiB |
| 1 1021 G /usr/lib/xorg/Xorg 0MiB |
| 1 2072 C ./dstm/zm 479MiB |
| 1 2101 G /usr/lib/xorg/Xorg 7MiB |
| 2 1021 G /usr/lib/xorg/Xorg 0MiB |
| 2 2072 C ./dstm/zm 479MiB |
| 2 2101 G /usr/lib/xorg/Xorg 7MiB |
| 3 1021 G /usr/lib/xorg/Xorg 0MiB |
| 3 2072 C ./dstm/zm 479MiB |
| 3 2101 G /usr/lib/xorg/Xorg 7MiB |
+-----------------------------------------------------------------------------+
Just for efficiency (and for cleaning up nvidia-smi) purposes, I would love to know how to use the iGPU and prevent the Nvidia GPUs from loading Xorg. But my Linux knowledge is at its limit here… @ZC93: would you share a few more details, e.g. how to blacklist the nouveau drivers and how to force the EDID.bin X11 option? (What is EDID.bin?!)