New Equihash algorithm proposal suitable for ASIC/FPGA

Hi everyone,

I have been playing with the Equihash algorithm for the last week or two. It appears that the only global information that needs to be stored in global memory is the tree of collision pairs. The rest can be kept locally within each of the k layers - thanks to Tromp for the idea.

I’m not in the market, so I don’t think I would profit from implementing it myself. So I am posting it here, and would be interested to see how greedy the community is.

The algorithm is layered into stages, and each stage can be implemented as a separate device. The downside is that each parameter set needs its own arrangement.

Algorithm:
Layer 0: Compute hashes for 2^(n/(k+1)+1) consecutive indexes (2^21 for the current parameters n = 200, k = 9). Submit each hash, together with its index, to the next layer one at a time (currently 200 bits per clock cycle, or 2 cycles, depending on the pin count).

Each hash can be seen as an array of (k+1) digits of n/(k+1) bits each, currently 10 digits of 20 bits each.
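To make the digit layout concrete, here is a minimal Python sketch; the BLAKE2b call and the input layout are assumptions for illustration only, not the actual Equihash personalization:

```python
import hashlib

N, K = 200, 9
DIGIT_BITS = N // (K + 1)   # 20 bits per digit
NUM_DIGITS = K + 1          # 10 digits

def split_into_digits(hash_bytes: bytes) -> list[int]:
    """Interpret the first N bits of the hash as NUM_DIGITS big-endian digits."""
    value = int.from_bytes(hash_bytes[: N // 8], "big")
    return [(value >> (DIGIT_BITS * (NUM_DIGITS - 1 - i))) & ((1 << DIGIT_BITS) - 1)
            for i in range(NUM_DIGITS)]

# Illustration only: the real miner derives the 200-bit hash differently.
digits = split_into_digits(
    hashlib.blake2b(b"header" + (0).to_bytes(4, "little"), digest_size=25).digest())
assert len(digits) == NUM_DIGITS
```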

Layer 1.
This and the later layers have 2^(n/(k+1)) buckets (2^20 currently), each containing, say, 4 or 5 elements - the exact number is to be determined so that it is statistically efficient (on average there are 2 elements per bucket, but we need more slots for the algorithm to work). Each element consists of the following fields {
index - the index that produced this hash,
remaining hash - the last k digits of the hash.
}. On top of that, each bucket has a counter of the number of items it holds.

Fill in buckets:
for each input:
We put the last k digits into the first free slot of the bucket whose index equals the first digit, store the index of this hash, and increment the number of elements in this bucket.

Outcome:
Each bucket contains only elements whose first digit is the same → the XOR of these (and only these) elements has 0 in that digit.
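In software terms, the bucket fill might look like the following sketch (a dict stands in for the fixed array of 2^20 buckets the hardware would use; a slot count of 4 is an assumed value):

```python
from collections import defaultdict

SLOTS = 4                     # slots per bucket; assumed value, to be tuned statistically
buckets = defaultdict(list)   # stand-in for a fixed array of 2^20 buckets

def fill_bucket(index: int, digits: list[int]) -> bool:
    """Place (index, remaining digits) into the bucket chosen by the first digit."""
    first, rest = digits[0], digits[1:]
    slots = buckets[first]
    if len(slots) >= SLOTS:
        return False          # bucket full: drop the element (or stall the pipeline)
    slots.append((index, rest))
    return True
```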

Transmit to upper layer:
for each bucket take all pairs (if the number of elements is greater than 1) and

  1. store a sorted pair of the corresponding indices in global memory
  2. submit to next layer one at a time:
    a. the XOR value of the remaining parts of the two hashes in the pair
    b. smallest index
    c. reference to stored pair
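A sketch of this pair emission, continuing the element layout assumed in the previous snippet (the "global memory" is just a Python list here):

```python
from itertools import combinations

stored_pairs = []   # global memory: the tree of collision pairs

def emit_pairs(bucket_elements):
    """For every pair in a bucket, record the sorted index pair and return what is
    forwarded to the next layer: (xor of remaining digits, smallest index, pair ref)."""
    out = []
    for (idx_a, rest_a), (idx_b, rest_b) in combinations(bucket_elements, 2):
        ref = len(stored_pairs)
        stored_pairs.append(tuple(sorted((idx_a, idx_b))))
        xored = [a ^ b for a, b in zip(rest_a, rest_b)]
        out.append((xored, min(idx_a, idx_b), ref))
    return out
```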

Layer 1 done.

Layers 2-9:
Fill in buckets:
For each received pair, put the obtained index, the reference to the stored pair, and all digits of the received hash except the first one into the first available slot of the bucket corresponding to the first digit. If the bucket is full, either drop the pair or stop the algorithm (which is more efficient depends on statistics that need to be determined in simulated experiments).

Then transmit to the next layer as described for Layer 1, except that we need to check that the smallest indices of the two elements in a pair are different.
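A sketch of the same step on the upper layers, under the assumption that the stored pair there holds references into the previous layer's pair list rather than raw indices:

```python
def emit_pairs_upper(bucket_elements, stored_pairs):
    """bucket_elements holds (smallest index, pair ref, remaining digits) tuples.
    Pairs whose smallest indices coincide are skipped as trivial collisions."""
    out = []
    for i in range(len(bucket_elements)):
        for j in range(i + 1, len(bucket_elements)):
            min_a, ref_a, rest_a = bucket_elements[i]
            min_b, ref_b, rest_b = bucket_elements[j]
            if min_a == min_b:
                continue                         # same smallest index: drop the pair
            ref = len(stored_pairs)
            stored_pairs.append((ref_a, ref_b))  # pair of references, not raw indices
            xored = [a ^ b for a, b in zip(rest_a, rest_b)]
            out.append((xored, min(min_a, min_b), ref))
    return out
```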

Last Layer:
As the hash we receive just a single digit. If it is 0, this is a solution. Using the stored references, we recursively collect the indices of the solution.
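The recursive collection could look like this sketch, assuming one pair list per layer, with layer 1 storing raw index pairs and the later layers storing pairs of references:

```python
def collect_indices(layer_pairs, k, final_ref):
    """layer_pairs[0] holds (index, index) pairs from layer 1; layer_pairs[d] for
    d > 0 holds (ref, ref) pairs pointing into layer_pairs[d - 1]. Expanding a
    final-layer reference yields the 2^k indices of the solution."""
    def expand(depth, ref):
        a, b = layer_pairs[depth][ref]
        if depth == 0:
            return [a, b]              # raw hash indices
        return expand(depth - 1, a) + expand(depth - 1, b)
    return expand(k - 1, final_ref)
```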

Done.

ZCash: t1QcpGrXBhx7K5f5Lgw9mzhyBEx44Bnk73P



Six of one, half dozen of the other.

… and the bandwidth between these devices will be the bottleneck. So instead of being limited by memory bandwidth, you become limited by inter-device bandwidth.

That’s at least 40 Mbit of storage per device under the most optimistic assumptions. Instead of calling them “devices” you should call them what they are – memory chips!

40 Mbit is way, way, way past the point where custom devices cost a lot more than commodity DRAM. Price out CPUs or FPGAs with this much on-die storage – they all cost at least 100 times more than a 500 Mbit GDDR5 DRAM IC.
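For scale, one possible breakdown that lands in that range (an assumed breakdown, not necessarily the arithmetic behind the figure above) is to count only the expected two elements per bucket at roughly one digit each:

```python
# Rough lower-bound arithmetic (assumed breakdown, for illustration only).
buckets = 1 << 20            # 2^20 buckets per layer
elements_per_bucket = 2      # expected occupancy
bits_per_element = 20        # one 20-bit digit, ignoring indices and references
print(buckets * elements_per_bucket * bits_per_element / 1e6)   # ~41.9 Mbit
```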



ZCash’s Equihash, like Ethereum’s Dagger-Hashimoto algorithm, was designed to prevent ASICs from being developed, so as to aid decentralization.

I don’t think they would want to change to something which would make ASIC development easier. Quite the opposite.

An FPGA might work, but the speed will be VERY slow; power consumption, however, will be very low.

The algorithm is online. You have a delay of 2^21 cycles + (last block) × (number of digits) with a slight modification of the algorithm.

Fine. Add a memory chip to every device. Or a few memory chips to increase bandwidth.

Inter-device bandwidth is just a constant transmission delay. You do not need to wait until all the buckets are filled; you can start transmission once the number of elements in a bucket is more than one.

So you need a good memory controller that is optimized for this kind of work.


That is the designers’ expectation, which was partially broken by Tromp. The whole question is whether we should trust them or try to break it to get a better algorithm for decentralization.

ASICs! Well, ASICs are basically just chips designed to do only one thing. So you need a very good memory controller with enough bandwidth (PCIe?) and voilà. But I think the demand is too low, or the interest from the big players is too low, to design an ASIC just for Zcash. Sure, it can be made, no question. A GPU is nothing else, it can just do more. If Zcash makes it into ATMs or other payment gateways in a few years, then we will have the interest and the need. But GPUs are also getting better and better. I don’t think we will see an ASIC for Equihash-algorithm coins in the near future.

I think this is the reason I’m not pursuing it. But potentially even an FPGA would work as a decent memory controller for Equihash. You need an FPGA on each level controlling the memory chips directly; then you will get good performance. There is no global information (if you think about it) that one needs to store, only level-local information.
But I agree that at the current level of demand there is no need for an ASIC/FPGA implementation. It was my curiosity. I’m pretty sure that it is possible to design hardware where the actual bottleneck would be the initial hashing.

I’m pretty sure that it is possible to design hardware where the actual bottleneck would be the initial hashing.

It is possible, of course, but a good designer will add enough BLAKE2b cores to meet or exceed the memory subsystem’s performance (given that the BLAKE2b cost/area is likely negligible compared to the memory subsystem’s).


OK guys, I have said what I wanted to. The design, in general, is in this thread; the details are up to demand. There is no point in discussing it any more.
Thanks for your attention.
The main difference from Tromp and the original paper is that there is no need for sorting per se; it happens automatically. The price is a memory overhead.
When the price and demand rise, one of you will implement it.
Best.

If you were going to do an ASIC, brute force would be better: just generate gigahashes per second and some will fit a solution; no memory needed.

You did not read the original paper. The probability of guessing a valid solution is exponentially small.

Who said all possible? What percentage of all hashes are solutions? If there are many solutions, then brute force can give good results. And yes, I didn’t read the paper, just guessing.

A solution is 512 × 20 bits, and there are only about 2 solutions on average, so the probability of finding a solution by guessing is 1 over the exponential of that first number. Gigahashes would not help. That was discussed many times. If you have free time (I do not have that much), read the original paper; everything is written there.
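Roughly, the arithmetic behind this (a sketch of the order of magnitude only, not the exact counting from the paper):

```python
# A blind guess fixes 512 indices of ~20 bits each, and only ~2 of the
# ~2**(512 * 20) possible tuples are valid, so the success probability of a
# single guess is about 2 / 2**10240 -- far beyond any hash rate.
guess_bits = 512 * 20
print(guess_bits)                  # 10240 bits fixed by one guess
print(len(str(2 ** guess_bits)))   # the guess space has ~3083 decimal digits
```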

Yes, I was misled into thinking that we build solutions from a randomly generated DAG (because if so, it would just be finding the best solution from some data, valid even without solving a block, so I thought that we would not need the whole set of data, just one randomly generated hash to compute a solution from).

One last thing: each device (memory controller plus memory chip) is suitable for Equihash with any parameters your local chip can store. The layered devices are uniform, so you just need enough sockets to stack them to any number of levels. An FPGA would be easy to reprogram, but an ASIC would also work; it just needs to be programmed for the general set of parameters. Your limit is still the memory bandwidth, but it is local to your layer module, which can easily be replaced with new memory technology when needed.