Back in September 2017 I started to get into mining. I have the nheqminer source code on the PC and got a 1060ti GFX card to do some hashing too.
I am a hardware designer and also write software and firmware, and I have been doing FPGA design work for almost two decades now. So I thought I’d set myself a challenge: can I make an FPGA outperform an i7 processor, or maybe even a GFX card?
Well, through many ups and downs, I am getting close. The main bottleneck is access to RAM. The FPGA doesn’t have enough RAM to do all the processing internally. Really you need around 180 MB. FPGAs with that amount of RAM are available, but at a huge cost, in the £1000s. My aim was to run an FPGA costing around £30 with 256 MB of DDR RAM.
I have been looking at Tromp’s CPU code, in particular the Digit1 to Digit8 code. It reads all the generated hashes every time: reading 64 MB and writing 64 MB in the early stages, and reading/writing less towards the later stages.
So we need to move around 500 MB through RAM to get around 2 solutions. The DDR3 RAM I’m using runs at 1066 Mbit/s per data pin (16-bit chip), which is about 2 GB/s max, not allowing for refresh cycles. So I’m not going to get more than about 8 sol/s using this RAM. I could use more than one RAM chip to widen the data bus, but that needs more FPGA pins and therefore costs more. The Core i7 is doing around 14 sol/s.
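To sanity-check those figures, here is the same back-of-envelope sum as a few lines of Python. It only uses the rough numbers quoted above (1066 Mbit/s per pin, 16 data pins, about 500 MB moved for about 2 solutions). The function names are just mine for illustration, and it ignores refresh and bus turnaround, so it is a ceiling rather than a prediction.

```python
# Back-of-envelope ceiling on sol/s from DDR bandwidth alone.
# All inputs are the rough figures quoted in the post, not measurements.

def peak_bandwidth_bytes(mbit_per_pin: float, data_pins: int) -> float:
    """Peak transfer rate in bytes per second for one DDR device."""
    return mbit_per_pin * 1e6 * data_pins / 8

def max_sols_per_sec(bandwidth_bytes: float,
                     bytes_moved_per_run: float,
                     sols_per_run: float) -> float:
    """Upper bound on sol/s if the solver is purely bandwidth bound."""
    runs_per_sec = bandwidth_bytes / bytes_moved_per_run
    return runs_per_sec * sols_per_run

bw = peak_bandwidth_bytes(1066, 16)              # one 16-bit DDR3 chip
one_chip = max_sols_per_sec(bw, 500e6, 2)        # ~500 MB moved per ~2 solutions
two_chips = max_sols_per_sec(2 * bw, 500e6, 2)   # assumes a second chip doubles bandwidth

print(f"peak bandwidth: {bw / 1e9:.2f} GB/s")
print(f"one chip:  {one_chip:.1f} sol/s")
print(f"two chips: {two_chips:.1f} sol/s")
```

That comes out at roughly 8.5 sol/s for one chip before any overhead, and about 17 sol/s for two, which is the point where it would start to pass the i7.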
What I am wondering is whether there is another way to do the Digit1 to Digit8 sorting. I thought it was just looking for duplicates, but tracking back the solution indexes, the source 192 bits are not identical.
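For what it’s worth, here is a small Python sketch of how I currently understand one Digit stage. It assumes the Equihash 200,9 parameters (so 20-bit digits) and a Wagner-style round: entries are paired when they agree on one digit only, and the pair is XORed so that digit cancels out. The names `digit` and `collide_round` are made up for the sketch, and random numbers stand in for the real hash output. It is not Tromp’s code, just a model to reason with.

```python
import random
from collections import defaultdict

HASH_BITS = 200
DIGIT_BITS = 20   # assumed: n=200, k=9 gives 200 / (9 + 1) = 20 bits per digit

def digit(value: int, index: int) -> int:
    """Return the index-th 20-bit digit, counting from the most significant end."""
    shift = HASH_BITS - DIGIT_BITS * (index + 1)
    return (value >> shift) & ((1 << DIGIT_BITS) - 1)

def collide_round(entries, index):
    """Pair entries that agree on digit `index`, XOR each pair, merge leaf lists.

    entries: list of (value, tuple_of_leaf_indices)
    """
    groups = defaultdict(list)
    for value, leaves in entries:
        groups[digit(value, index)].append((value, leaves))
    out = []
    for group in groups.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                (va, la), (vb, lb) = group[i], group[j]
                if set(la) & set(lb):      # skip pairs that reuse a leaf
                    continue
                out.append((va ^ vb, la + lb))
    return out

rng = random.Random(1)
# Toy input: 64 "hashes" whose first digit takes only 16 values,
# so that collisions are plentiful in a small example.
values = [(rng.randrange(16) << (HASH_BITS - DIGIT_BITS)) | rng.getrandbits(180)
          for _ in range(64)]
entries = [(v, (i,)) for i, v in enumerate(values)]

stage1 = collide_round(entries, 0)
print(len(stage1), "pairs; first digit of every XOR is zero:",
      all(digit(v, 0) == 0 for v, _ in stage1))
```

In this model the two sources of every pair match on the current digit and differ everywhere else, which would explain why tracing the indexes back does not give identical source hashes.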
I still have much code to convert to FPGA code, but working from the maximum DDR rate and how much data I need to move around gives me an idea of what hashing speed is to be expected.
I have the hashing code in VHDL and it’s generating hashes as quickly as the DDR can store them.
The sorting code is still in C so I can work out the best route forward. I have around 5 Mb of internal FPGA Block RAM, so I will pipeline reading and writing from DDR.
What I need to do is reduce the data transferred to and from the DDR. The data is stored in 4096 buckets. I can take a whole bucket and sort it, and it would be nice to process everything one bucket at a time, instead of moving to the next ‘Digit’ and going through all the buckets again.
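To get a feel for whether a bucket can be followed through on its own, I tried a toy model in Python. The assumptions are mine: a 20-bit digit split into 12 bucket bits (4096 buckets) plus 8 "rest" bits that are matched inside the bucket, about 512 slots per bucket on average, and random data in place of real hashes. `split_digit` and `collide_bucket` are hypothetical names, not anything from the C code.

```python
import random
from collections import defaultdict

DIGIT_BITS = 20                        # assumed Equihash 200,9 digit width
BUCKET_BITS = 12                       # 2**12 = 4096 buckets
REST_BITS = DIGIT_BITS - BUCKET_BITS   # 8 bits left to match inside a bucket

def split_digit(d: int):
    """Split a 20-bit digit into (bucket number, rest bits)."""
    return d >> REST_BITS, d & ((1 << REST_BITS) - 1)

def collide_bucket(slots):
    """slots: list of (rest_bits, next_digit) that already share one bucket.

    Returns the XOR of next_digit for every pair that agrees on rest_bits,
    i.e. the digit that decides where each output goes in the next stage.
    """
    by_rest = defaultdict(list)
    for rest, nxt in slots:
        by_rest[rest].append(nxt)
    out = []
    for group in by_rest.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                out.append(group[i] ^ group[j])
    return out

rng = random.Random(7)
# One bucket with 512 slots (2**21 hashes spread over 4096 buckets).
slots = [(rng.getrandbits(REST_BITS), rng.getrandbits(DIGIT_BITS))
         for _ in range(512)]

next_digits = collide_bucket(slots)
dest_buckets = {split_digit(d)[0] for d in next_digits}
print(len(next_digits), "pairs from one bucket land in",
      len(dest_buckets), "different next-stage buckets")
```

If this model is right, the outputs of a single bucket scatter across many next-stage buckets, so one bucket cannot simply be carried through Digit1 to Digit8 in isolation. The saving would have to come from batching the writes per destination bucket in Block RAM before they go out to DDR.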
The binary tree generation is working well after Digit1-8. The SHA256 is then complete in 30 µs and compared with the target, so that part is nice and whizzy, and matches how the PC code does it.
I have written many documents over the year with plans on what to do, so I am tackling this from many angles. I am using a Xilinx Zynq Z020, which gives me an lwIP TCP/IP connection to the internet, and the fetched work will then be shared out to other FPGAs, perhaps a Spartan-7 or Artix-7, each with its own dedicated DDR3 RAM.
One issue I have at the moment: it will connect to Slushpool, sign in, grab work, process that work and send back a result, but that result is neither refused nor accepted. I’m not sure what is going on there. I see in Wireshark that the packet gets split as it’s over 1500 bytes. The PC sends results in 2800-byte packets, so I was thinking it was the packet size, but then I ran the PC code again this morning, and that was fragmenting the TCP/IP packets and was still being accepted. So it looks like Slushpool accepts fragmented packets from the PC, but not from my miner.
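Thinking about it some more, TCP is a byte stream, so the pool should not care where the message was split. As far as I know, Stratum messages are JSON lines terminated by a newline, so a receiver would just buffer bytes until it sees one. The Python sketch below is my guess at that kind of receiver (`LineFramer` is a made-up name, the submit fields are dummy placeholders, and it is certainly not Slushpool’s actual code). It shows the same message arriving whole and split, and also what happens if the terminating newline never arrives.

```python
import json

class LineFramer:
    """Reassemble newline-terminated JSON messages from arbitrary TCP chunks."""

    def __init__(self):
        self._buf = b""

    def feed(self, chunk: bytes):
        """Add received bytes; return any complete messages decoded so far."""
        self._buf += chunk
        messages = []
        while b"\n" in self._buf:
            line, self._buf = self._buf.split(b"\n", 1)
            if line.strip():
                messages.append(json.loads(line))
        return messages

# Dummy submit of roughly the size seen on the wire; all fields are placeholders.
submit = {"id": 4, "method": "mining.submit",
          "params": ["worker", "job", "time", "nonce2", "ab" * 1350]}
wire = json.dumps(submit).encode() + b"\n"

whole = LineFramer().feed(wire)                 # arrives in one piece

split = LineFramer()
first = split.feed(wire[:1460])                 # first segment: no newline yet
second = split.feed(wire[1460:])                # rest arrives, message completes

unterminated = LineFramer().feed(wire[:-1])     # same bytes, newline missing

print(len(wire), "bytes;", len(whole), len(first), len(second), len(unterminated))
```

With this kind of receiver the split message is decoded exactly like the whole one, while the unterminated one just sits in the buffer with no reply either way. So the things I want to compare byte for byte against the PC capture are the final newline and any stray bytes, rather than the segment boundaries.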
So still a long way to go, any pointers from anyone will be very useful.
I know I’m not going to compete with ASICs. The FPGA code could be converted to an ASIC, but that has a huge up-front cost. It’s mainly a private project, and I may make a couple of machines with 64 or 128 FPGAs, all communicating with the FPGA that’s connected to the internet.