MSI Z170A-M7 with four 1080ti cards: can anyone help solve this problem?

I’ve been having a strange problem while mining Zcash using four 1080ti GPUs.
After a random amount of time, the EWBF 0.3.4b miner gets stuck. It only happens when mining with 4 cards. It will mine just fine using only 2 GPUs if they are plugged into the x16 slots on the motherboard.
EWBF doesn’t hang, quit, or freeze the system. It keeps trying to restart and issues the following errors:

  • CUDA DEVICE 2 Thread exited with code 4

  • CUDA DEVICE 0 Thread exited with code 4

  • CUDA DEVICE 1 Thread exited with code 4

Followed by:
The power report, which shows all cards using less than 70 watts each and about 0 Sol/s.

Which is then followed with this list for each GPU installed:

Error: Looks like GPU2 are stopped. Restart attempt
Info: GPU2 are restarted.
CUDA DEVICE: 2 User selected resolver 0
CUDA DEVICE 2 Thread exited with code 46

From here, it doesn’t mine until the machine is entirely powered off and then restarted. (Or the miner is closed and restarted after the cards attached to the risers are disabled in Control Panel.) So I guess you can say that the problem only affects the GPUs on the risers.

My hardware is:
Motherboard is an MSI Z170A-M7 with 8 GB RAM and a Celeron G3930 Kaby Lake processor.
Power supply for the motherboard and the two on-board GPUs is a Cobra Power 700 watt Gold 80 Plus.
Power supply for the two GPUs connected through PCIe 1x to 16x risers is an HP 1200 watt.
Found here at NewEgg

The risers are PCIE164P-NO3 Ver 006.

Steps I’ve taken to resolve the issue:

  1. Followed instructions for Z170 motherboard listed here in this thread…

  2. Replaced all power supplies with brand new ones listed above.

  3. Completely reloaded Windows in UEFI mode on GPT.

  4. Tried all of the drivers on NVidia site that I could download. Currently using 384.94

  5. Tried changing the slot location that the cards are connected to, as well as doing a clean install for each card one at a time.

  6. Replaced the riser cards. Unless I have 2 bad risers, the problem is not isolated to one riser or one card.

I’m considering purchasing another MB and risers but I sure would like to narrow this problem down before I throw a bunch of money at TRYING to find whatever is causing the issue accidentally…

If anyone out there has any ideas, I am all ears. My next step is to start looking in Event Viewer for recurring errors that might help me.

Thanks for reading!

Another troubleshooting step you could try is booting from a device with Ubuntu installed and seeing if it works from there. I have several rigs running EWBF on Ubuntu and they all run flawlessly. One of them has 6x 1080ti. If it works for you in Ubuntu but not Windows then you have narrowed it down to a software issue and can eliminate the possibility of it being related to a hardware problem. You can obtain the needed drivers in Ubuntu via the following command

sudo apt install nvidia-367

EWBF is crashing because of an overclock. No need to change operating systems for that. You just need to use MSI Afterburner and tweak the settings on your cards. I would recommend doing so individually rather than all with the same settings.

You don’t mention which GPUs you have (manufacturer and model), but you may also have to slightly dial back a factory overclock if you haven’t overclocked them yourself.

I didn’t see him say anything about overclocking any of the cards. In either case, I installed Lubuntu minimal on a new mining rig today and happened to time how long it took. It was only 12 minutes from the time I booted the Ubuntu minimal install iso till the time I booted into the Lubuntu minimal desktop (LXDE). It’s a fairly quick and painless way to narrow down the problem to determine whether it is software or hardware related.

I didn’t say he said anything about overclocking. I said that’s the reason for his error. While he may not be overclocking the cards himself, they could be factory overclocked which can cause the cards to be unstable under the load of the miner. This isn’t an issue with his OS.

The error he’s receiving is because one or more of his GPUs is unstable due to overclocking.

I didn’t say it was an issue with the OS. I simply gave him a fast and easy way to narrow down whether it is software or hardware related since he is talking about replacing hardware.

In nearly all cases of the error codes he’s receiving, the resolution has been to dial back overclocking. He needs to stabilize his GPUs. That can be done in the OS he’s currently in. I understand you were suggesting a troubleshooting step for him to try. :slight_smile:

You provided a suggestion and I provided the cause of his error. Unless I’m missing something here?

Wow! Great responses!
Thank you for offering your suggestions.

The GPUs are 1 Gigabyte 1080Ti 11gb and 3 Aorus 1080Ti 11gb. Found here

An update on the status…

I DID use the Aorus software to attempt to adjust the clock settings, but lowering the memory clock on ANY card only seemed to make things worse.
It was late when I got home, so I’ll have to work on it tomorrow morning. I haven’t given up on that being the answer…

I will also do the Ubuntu load. 12 minutes is less time than I’ve spent just booting this computer up and shutting it down. So it can’t hurt and will tell me a lot about whether I have a hardware or software issue. Unless the hardware is causing the software to blow up?

Again… thanks for the responses… I will post an update after I have more information to share.

Use MSI’s Afterburner. Don’t bother with the Aorus software. You’re also going to want to do more than just adjust the core or memory clock. Adjusting the power will also be necessary to stabilize the cards.

Had you overclocked the cards? If yes, start from scratch. Set all cards to their default clock speeds and run the miner. Make sure your fans are running at 75% or higher and then start raising your core clock in +5 increments. Once the miner starts crashing, dial it back to the previous setting. Then do the same with the memory clock. What power setting are you currently using? 70%? 80%? 100%?
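
The step-up-then-back-off procedure described above can be sketched in Python. This is only an illustration of the search logic: `is_stable` is a hypothetical callback (in practice it would mean "run the miner at this offset for a while and watch for thread-exit errors"), and the +5 step and the 200 cap are assumed values, not anything EWBF or Afterburner prescribes. It also assumes the card is stable at stock (offset 0), which may not hold for a factory-overclocked card.

```python
from typing import Callable


def find_highest_stable_offset(
    is_stable: Callable[[int], bool],
    step: int = 5,
    max_offset: int = 200,
) -> int:
    """Raise the clock offset in `step` increments and return the last
    offset that passed the stability check (0 if the first step fails)."""
    last_good = 0  # assumption: stock settings are treated as the baseline
    offset = step
    while offset <= max_offset:
        if not is_stable(offset):
            break  # first failing step: fall back to the previous setting
        last_good = offset
        offset += step
    return last_good


if __name__ == "__main__":
    # Stand-in for a real test run: pretend the card fails above +35.
    print(find_highest_stable_offset(lambda mhz: mhz <= 35))
```

Running the same search once for the core clock and once for the memory clock, per card, matches the advice to tune each card individually.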

Well… Either Nekkid truth is right, or I have a bad card.

Until I started this thread, I had not overclocked anything. I never wanted to risk blowing up the cards by getting them too hot.

Results from testing:
After changing the mem clock to the lowest setting on each card and setting the power to 100% on each, the miner gave even more errors and stopped mining a lot faster. (A few seconds or never even started properly.)

I replaced the riser card on the suspected problem area with yet a 3rd ver 006. So I think I can rule out that part now.

I DID run the test in Ubuntu, which worked fine for about 3 hours… but then it too went offline in exactly the same way.

The HP 1200 watt supply only powers the 6+2 connectors for the GPUs. It doesn’t also power the riser boards. Could THAT be the problem?
I experienced the shutdown problem with two previous power supplies. As a test, I powered each of the off board GPUs with their own individual 450 watt bronze supply. Experienced the same result.

How does EWBF decide which card is card 0, 1, 2, 3… etc? That would help me towards knowing which of the cards is failing… They aren’t listed by PCI bus, or any other sort of ID that I can find.

Thanks for staying with this thread! It’s driving me nuts! (I am considering purchasing a different Motherboard to test suspect cards.)

Some things to consider:

  • You shouldn’t change settings so extremely. If you’re running stock, you just need to lower core or memory settings in small increments. Underclocking can cause just as many issues. That’s why I refer to a sweet spot.
  • A good practice to follow as well is to put both GPU and the riser it’s connected to on ONE PSU. Don’t separate them.
  • I’ve noticed that EWBF generally places the cards in the order in which they’re plugged in on the board (but your mileage may vary). That being said, take a look at this post that might help in identifying individual cards.
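
One way to tie a device number to a physical slot is to ask the NVIDIA driver for each card's index and PCI bus ID. The sketch below wraps `nvidia-smi --query-gpu=index,pci.bus_id,name,uuid --format=csv,noheader` in Python; the function names are made up for illustration. Note the caveat: `nvidia-smi` lists cards in PCI bus order, while CUDA applications by default enumerate "fastest first", so the two numberings can differ. Setting the environment variable `CUDA_DEVICE_ORDER=PCI_BUS_ID` before starting a CUDA program normally makes them agree; whether EWBF's own device numbers follow it is an assumption worth checking on your rig.

```python
import csv
import io
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,pci.bus_id,name,uuid",
    "--format=csv,noheader",
]


def parse_gpu_list(text):
    """Turn nvidia-smi CSV output into a list of dicts, one per GPU."""
    gpus = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        index, bus_id, name, uuid = (f.strip() for f in fields)
        gpus.append(
            {"index": int(index), "bus_id": bus_id, "name": name, "uuid": uuid}
        )
    return gpus


def list_gpus():
    """Run nvidia-smi and return the parsed GPU list."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    try:
        for gpu in list_gpus():
            print(f"GPU{gpu['index']}: {gpu['name']} at PCI {gpu['bus_id']}")
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi is not available on this machine")
```

The bus ID can then be matched against the slot layout in the motherboard manual, or by pulling one card at a time and seeing which bus ID disappears.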

There are a few key things missing from your posts so far, though, that might help paint a better picture. What are your temperatures like? What are the stock settings of your cards? All of my cards have entirely different settings to reach the same Sol/s, and some cards can take much higher overclocks while others I’ve had to underclock slightly. Some of those cards are the same model and manufacturer.

ok… Here’s an update.

Out of frustration, I have gone back to square one.

After unsuccessfully getting anywhere with the clock settings, I simply moved the 2 riser based cards onto the motherboard (as a test). As I expected, they have been mining fine for a few days with no drops. So my cards are good and working. (But now I have 2 cards not in operation.)
Though I have many motherboards, they only have PCI-e version 2.0 and 1080ti’s don’t seem to work in those slots. So they are down for a few days.
I have ordered new version 7 risers with the 6 pin power connectors. I will be able to plug them directly into the HP power supply that is providing power to the GPUs. Maybe I was having a problem with the version 6c riser or the power. Either way, neither will be a variable once my shipment arrives.

Temps. All cards except 1 (GPU0) are always about 74 degrees. GPU0 is typically 83. But never more than that.
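
For anyone who wants to log these readings instead of eyeballing them, here is a minimal Python sketch that parses the output of `nvidia-smi --query-gpu=index,temperature.gpu,power.draw --format=csv,noheader,nounits` and flags hot cards. The helper names and the 80 degree limit are assumptions chosen for illustration, not a vendor threshold.

```python
import csv
import io
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
]


def to_float(value):
    """Return a float, or None when the driver reports a non-numeric value."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_readings(text):
    """Parse CSV lines into (index, temperature_c, power_w) tuples."""
    readings = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        index, temp, power = (f.strip() for f in fields)
        readings.append((int(index), to_float(temp), to_float(power)))
    return readings


def hot_gpus(readings, limit_c=80.0):
    """Indices of GPUs whose temperature is above limit_c (assumed limit)."""
    return [i for i, temp, _ in readings if temp is not None and temp > limit_c]


if __name__ == "__main__":
    try:
        out = subprocess.run(
            QUERY, capture_output=True, text=True, check=True
        ).stdout
        print("GPUs above limit:", hot_gpus(parse_readings(out)))
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi is not available on this machine")
```

Logging the power column alongside temperature would also capture the drop to under 70 watts per card that shows up when the miner stalls.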

I will post all of the default settings tomorrow when the new risers have arrived.

I still appreciate your assistance.