I recently found a new factor for GPU stability and would like to share / discuss.
For background, my rigs run themselves via system software I have written in Linux. Besides hardware maintenance, building new rigs, and writing new code, I have not touched a rig for a problem in nearly a year… until just recently.
When I set up a new rig, I run a routine that overclocks each GPU until it finds its stability limit, then sweeps through the memory OC to find the peak performance settings (yes, memory OC does make a difference). Once these settings are identified they are stable, and I can run the rig right on the upper edge of stability vs. performance. Since I know each GPU's limits, I can choose (statistically) how often I want my rigs rebooting due to a hung GPU. This is handy, as I can push my rigs harder and accept more reboots if it's profitable (a reboot takes less than 15 sec).
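A minimal sketch of that kind of tuning routine, in Python, might look like the following. It is an illustration only, not the actual system software: `is_stable` and `measure` are hypothetical callbacks that would apply a setting, run the miner for a while, and report back, and the step sizes and offset ranges are made-up defaults.

```python
from typing import Callable, Iterable, Optional, Tuple


def find_core_limit(
    is_stable: Callable[[int], bool],
    start: int = 0,
    step: int = 10,
    max_offset: int = 300,
) -> Optional[int]:
    """Raise the core offset (MHz) step by step.

    Returns the last offset that tested stable, or None if even the
    starting offset fails. `is_stable` is a hypothetical test hook.
    """
    last_stable = None
    offset = start
    while offset <= max_offset and is_stable(offset):
        last_stable = offset
        offset += step
    return last_stable


def sweep_memory(
    measure: Callable[[int], Optional[float]],
    offsets: Iterable[int],
) -> Tuple[Optional[int], float]:
    """Try each memory offset and keep the one with the best hash rate.

    `measure` is a hypothetical hook returning the hash rate (Sol/s),
    or None if the GPU hung at that setting.
    """
    best_offset, best_rate = None, 0.0
    for mem in offsets:
        rate = measure(mem)
        if rate is not None and rate > best_rate:
            best_offset, best_rate = mem, rate
    return best_offset, best_rate
```

In practice each stability test would need to run long enough to be statistically meaningful, which is what makes the reboot rate something you can choose.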
I also hold all GPUs at a set temperature, ±1°C, while they are running. This improves performance and reduces wear and tear on the GPUs (heat is bad). Another benefit is that as power levels slowly drop over time, it triggers preventative maintenance for me: things like blowing out the dust and replacing fans. The rig below is due for maintenance.
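One simple way to hold a temperature band like this is to nudge the power limit up or down whenever the GPU leaves the band. The sketch below assumes exactly that; the step size, the min/max clamps, and the 90% maintenance threshold are all hypothetical values chosen for illustration, and reading the temperature or applying the limit is left to whatever tooling the rig uses.

```python
def next_power_limit(
    temp_c: float,
    target_c: float,
    power_w: float,
    step_w: float = 5.0,    # assumed adjustment per control cycle
    band_c: float = 1.0,    # the +/-1 C band around the set point
    min_w: float = 90.0,    # assumed lower clamp
    max_w: float = 200.0,   # assumed upper clamp
) -> float:
    """Return the power limit (W) to apply on the next control cycle."""
    if temp_c > target_c + band_c:
        power_w -= step_w   # running hot: back the power off
    elif temp_c < target_c - band_c:
        power_w += step_w   # running cold: allow more power
    return max(min_w, min(max_w, power_w))


def needs_maintenance(
    power_w: float, baseline_w: float, threshold: float = 0.9
) -> bool:
    """Flag a GPU whose power level has drifted well below its baseline.

    `baseline_w` is the power level recorded when the rig was clean;
    the 0.9 threshold is an assumption, not a measured value.
    """
    return power_w < baseline_w * threshold
```

Note that with a controller of this shape, the only thing stopping the power level from climbing in a cold room is the upper clamp.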
However, with the recent "bomb cyclone" cold snap in the eastern US, I found an additional GPU stability issue that I did not expect.
I heat my home with 8-10 kW of rigs in the winter (it's free heat). This winter, however, they could not keep up, and I saw the power levels start rising as it got cold (as expected). What was not expected was that several GPUs became unstable when they were running cold at higher set power levels.
I have had to turn off the auto temp software and go back to manual. I ran the power curves on a few GPUs to determine the root cause of the issue. Every GPU is different; the one below has the curve for stock settings, the curve using a stable GPU overclock, and the curve for both a GPU OC and a memory OC. The GPU OC curve stops at 180 W, as this GPU can't run with this setting without hitting the dreaded “Slowly falling Sol & Temp issue”. The GPU and Mem OC combination runs stable at 180 W but dies at 190 W.
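Collecting a power curve like this can be sketched as a simple sweep: step through the power limits, record the hash rate at each one, and stop at the first setting where the GPU hangs. The code below is an assumed outline, not the routine actually used here; `measure` is a hypothetical callback that applies a power limit, mines for a while, and returns Sol/s (or None on a hang).

```python
from typing import Callable, Iterable, List, Optional, Tuple

Curve = List[Tuple[int, float]]


def run_power_curve(
    measure: Callable[[int], Optional[float]],
    limits_w: Iterable[int],
) -> Curve:
    """Record (watts, hash rate) points until the GPU becomes unstable."""
    curve: Curve = []
    for watts in limits_w:
        rate = measure(watts)
        if rate is None:        # GPU hung: the curve ends here
            break
        curve.append((watts, rate))
    return curve


def highest_stable_power(curve: Curve) -> Optional[int]:
    """Return the last wattage that completed a measurement, if any."""
    return curve[-1][0] if curve else None
```

Running the sweep once per configuration (stock, GPU OC, GPU plus memory OC) gives one curve each, and the highest stable wattage from each curve is a natural upper bound for any automatic power adjustment.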
I'll not go into a detailed interpretation of this graph, but I would like to point out that there is no denying the impact the memory OC has on the final hash rate. A lot of people argue that memory OC has no impact on the Equihash algorithm, but that is simply not the case, and I have data to show it. In defense of that argument, however, you would never see this without an automated system; manually it's very difficult to discern.
I thought I would share.