GPU stability factors and power curves

I recently found a new factor for GPU stability and would like to share / discuss.

For background, my rigs run themselves via system software I have written in Linux. Besides hardware maintenance, building new rigs, and writing new code, I have not touched a rig for a problem in nearly a year… until just recently.

When I set up a new rig, I run a routine that overclocks each GPU until it finds its stability limit, then sweeps through the memory OC to find the peak performance settings (yes, memory OC does make a difference). Once these settings are identified they are stable, and I can run the rig right on the upper edge of stability vs. performance. Since I know their limits, I can choose (statistically) how often I want my rigs rebooting due to a hung GPU. This is handy, as I can push my rigs harder and accept more reboots if it's profitable (a reboot takes less than 15 sec).
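For anyone who wants to picture the two-phase routine, here is a minimal sketch of the idea, not the actual software. The callbacks `is_stable` and `hashrate_at`, the offset ranges, and the step sizes are all hypothetical; a real version would have to apply the clocks, run the miner for a while, and detect hangs. The fake GPU at the bottom exists only so the sketch runs on its own.

```python
from typing import Callable, Optional, Tuple


def find_core_limit(is_stable: Callable[[int], bool],
                    start: int = 0, stop: int = 250, step: int = 10) -> int:
    """Raise the core offset step by step; return the last offset that was stable.

    Assumes the starting offset (stock) is stable.
    """
    last_good = start
    for offset in range(start, stop + 1, step):
        if not is_stable(offset):
            break
        last_good = offset
    return last_good


def sweep_memory(hashrate_at: Callable[[int, int], Optional[float]],
                 core_offset: int, start: int = 0, stop: int = 800,
                 step: int = 5) -> Tuple[int, float]:
    """Sweep memory offsets at a fixed core offset; return (best offset, best rate).

    Hypothetical convention: hashrate_at returns None when the GPU hangs,
    and the sweep stops at the first hang.
    """
    best_offset, best_rate = start, float("-inf")
    for mem in range(start, stop + 1, step):
        rate = hashrate_at(core_offset, mem)
        if rate is None:
            break
        if rate > best_rate:
            best_offset, best_rate = mem, rate
    return best_offset, best_rate


# Fake GPU for illustration only: stable up to +120 core, hangs above +650 mem,
# and its hash rate peaks at +500 mem.
def fake_is_stable(core: int) -> bool:
    return core <= 120


def fake_hashrate(core: int, mem: int) -> Optional[float]:
    if mem > 650:
        return None
    return 700.0 + core / 10 - ((mem - 500) ** 2) / 1000


core = find_core_limit(fake_is_stable)
mem, rate = sweep_memory(fake_hashrate, core)
print(core, mem, rate)
```

The small memory step matters here: with a coarse step the sweep can jump straight over a narrow peak.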

I also keep all GPUs at a set temperature ±1C when they are running. This improves performance and reduces wear and tear on the GPUs (heat is bad). Another benefit is that as power levels slowly drop over time, it triggers preventative maintenance action for me: things like blowing out the dust and replacing fans. The rig below is due for maintenance.

[Image: tmpset]
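A temperature hold like this could be sketched as a simple deadband controller that nudges the power limit. This is only an illustration under assumptions, not the actual software: the function name, the 100-250 W bounds, and the gain are hypothetical, and the roughly 10 W per 1C gain is borrowed from the figure quoted later in this thread. Reading the temperature and applying the new limit are left out.

```python
def next_power_limit(current_w: float, temp_c: float, target_c: float,
                     deadband_c: float = 1.0, watts_per_degree: float = 10.0,
                     min_w: float = 100.0, max_w: float = 250.0) -> float:
    """Return the power limit to apply next so the GPU drifts toward target_c.

    Inside the +/- deadband nothing changes. Outside it, the limit moves in
    proportion to the temperature error and is clamped to [min_w, max_w]
    (both bounds are hypothetical values).
    """
    error = target_c - temp_c  # positive when the GPU is running too cold
    if abs(error) <= deadband_c:
        return current_w
    proposed = current_w + error * watts_per_degree
    return max(min_w, min(max_w, proposed))


# A GPU running 3C under target gets more power, one 3C over gets less.
print(next_power_limit(180.0, 57.0, 60.0))
print(next_power_limit(180.0, 63.0, 60.0))
```

The upper clamp is the interesting part: a controller without one will keep raising power as the room gets colder.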

However, with the recent cold cyclone bomb in the eastern US I have found an additional GPU stability issue that I did not expect.

I heat my home with 8-10 kW of rigs in the winter (it's free heat). However, this winter they could not keep up, and I saw the power levels start rising as it got cold (as expected). What was not expected was that several GPUs became unstable when they were running cold with higher set power levels.

I was not expecting this and have had to turn off the auto temp software and go back to manual. I ran the power curves on a few GPUs to determine the root cause of the issue. Every GPU is different; the one below has the curve for stock settings, the curve using a stable GPU overclock, and the curve for both a GPU OC and a memory OC. The GPU OC curve stops at 180 watts, as this GPU can't run with this setting: it runs into the dreaded “Slowly falling Sol & Temp issue”. The GPU and Mem OC runs stable at 180 watts but dies at 190 watts.

[Image: power curves]

I'll not go into a detailed interpretation of this graph, but I would like to point out that there is no denying the impact the memory OC has on the final hash rate. A lot of people argue that the mem OC has no impact on the Equihash algorithm, but that is simply not the case, and I have data to show this. However, in defense of that argument, you would never see this without an automated system; manually it's very difficult to discern.

I thought I would share.

I had my rigs in the garage, which I keep heated to 0 to +1C. Here in Canada it dips quite cold: -33C right now, as a matter of fact. So my heater can't keep up, nor can the rigs.

I had a rig die this past Christmas, and I could not figure out why… so I took it inside and took it apart.
I found burn marks around what looked like a water droplet. The only thing I could think of is that the cold vs. the heat made condensation and whammo…
Needless to say, we're in the process of moving all the gear.

I could totally see the cold affecting the GPU itself. While it was cold here at Christmas, my GPUs were running at 108% and 38-41C, which in my opinion is phenomenal hehe. I should have recorded hash rates at the time… I'm in the process of sorting out logging software and monitoring tools. I would be interested in what you're using to remotely monitor and run them all.

So I pulled the temp data from the power sweeps. I am not running as cold as you; my target temp is between 50C and 62C. I heat the house with these rigs, so if it's too warm I lower the target temp, and if it's too cold I raise it (sometimes we still have to open windows). In the summer I move them back out… sometimes.

[Image: tempvspower]

Temp strongly correlates with power draw, and as you can see, overclocking does not change the temps. While you can see some change in wattage draw at the wall when you overclock, it does not significantly change the temps. For rigs at room temp it's about 1C per 10 watts.
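A slope like that 1C per 10 watts can be pulled out of sweep data with an ordinary least-squares line fit. The sketch below shows one way to do it; the sample readings are made-up numbers chosen for illustration, not measurements from the plot above, and `fit_line` is just a name picked for this example.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up sweep readings: power limit in watts, GPU temperature in C.
watts = [150, 160, 170, 180, 190]
temps = [52.0, 53.1, 53.9, 55.0, 56.0]

slope, intercept = fit_line(watts, temps)
print(f"{slope * 10:.2f} C per 10 W")
```

Fitting each curve (stock, GPU OC, GPU and mem OC) separately and comparing the slopes is a quick check on whether the overclock really leaves the temps alone.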

The stability issue is not necessarily from the temp but more from the throttling the GPU does to hit the desired wattage. There is a complex relationship between the power level, GPU OC, mem OC, miner software, and GPU stability. Even with an automated system it can still take a week or more to find the optimum settings on a 7-GPU rig. The memory OC is the most complex, as there is a desired ratio between the GPU clock and the memory clock (just like with CPU and RAM clock ratios). The step size on your memory clock sweep has to be really small, and the best setting is different for every GPU and every GPU overclock setting.

What is also interesting is that the stock power curve shows you the power stability limit when overclocked (at least for GTX 1080 Tis). Every GPU is different, and I have ones (cold-running cards) that go flat even before 250 watts; they also go unstable when overclocked with a power setting past the limit indicated by the stock power curve.
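One way to read "where the stock curve goes flat" off the data is to look for the first step where the hash rate gained per extra watt drops below some threshold. This is a rough sketch under assumptions: the function name, the threshold, and the sample curve are all made up for illustration, and the right threshold would depend on the card.

```python
from typing import List, Optional, Tuple


def find_plateau(curve: List[Tuple[float, float]],
                 min_gain_per_watt: float) -> Optional[float]:
    """Return the wattage where the curve flattens, or None if it never does.

    curve is a list of (watts, hashrate) points sorted by watts. The plateau
    is taken to start at the first point whose next step gains less than
    min_gain_per_watt hash rate per added watt.
    """
    for (w0, h0), (w1, h1) in zip(curve, curve[1:]):
        gain = (h1 - h0) / (w1 - w0)
        if gain < min_gain_per_watt:
            return w0
    return None


# Made-up stock curve: (power limit in watts, hash rate in Sol/s).
stock_curve = [(150, 600.0), (170, 640.0), (190, 670.0),
               (210, 690.0), (230, 692.0), (250, 692.5)]

print(find_plateau(stock_curve, min_gain_per_watt=0.5))
```

If the observation above holds for a given card, the wattage this returns for the stock curve would be a reasonable cap for the overclocked power setting.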

I may share some of this data in a future post.

I agree, I know it does as well. I also know there are diminishing returns if you overclock too high before GPU failure.

I would completely agree with both of you on this point for sure. I have found that I can leave CPU clock alone and pump mem OC to 600+, and that's where the hashes come from. CPU overclock does do a little bit, just not like memory does.

I wish I had grabbed my hashes at 40C. I think there is a point of no return… that being where you get fewer hashes as the temp goes down because the chip is not at optimal performance. That would be the relationship you describe.

Too cold, too hot, or just right… sounds like a story about bears :)