These are great results, @tromp. Regarding the metrics, I’m with @jtoomim - we can’t use just one, and in fact many (or most?) people with CPUs and GPUs will gladly trade a lot of memory (as long as they have it in the system anyway) for a little speedup - e.g., 2x speedup when going from 1 GB to 20 GB usage would be a good tradeoff for someone who has 32 GB RAM installed. This is especially true for GPU cards, where the memory would usually be wasted if not used. Yet another metric to consider is time between sets of solutions - if some hypothetical highly parallel implementation produces 1 million solutions every 1000 seconds, it will be mostly unusable because 1000 seconds is beyond the target time between blocks.
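To make that tradeoff concrete, here's a toy calculation with made-up numbers (the configuration names and figures are purely illustrative, not measurements of any real solver):

```c++
// Toy comparison of two made-up solver configurations, to show why a single
// metric isn't enough: S/s favors the memory-hungry one, S/(GiB*s) the frugal one.
#include <cstdio>

int main() {
    struct { const char *name; double sols, seconds, gib; } cfg[] = {
        {"frugal", 1.9, 20.0,  1.0},  // hypothetical 1 GiB solver
        {"hungry", 1.9, 10.0, 20.0},  // hypothetical 2x faster solver at 20x the memory
    };
    for (const auto &c : cfg) {
        double sps  = c.sols / c.seconds;  // solutions per second
        double spgs = sps / c.gib;         // solutions per (GiB * second)
        printf("%-6s  %.3f S/s  %.5f S/(GiB*s)\n", c.name, sps, spgs);
    }
    // Someone with 32 GB of otherwise-idle RAM (or a GPU with unused memory)
    // will happily pick "hungry" despite its much worse S/(GiB*s).
    return 0;
}
```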
I’ve been playing with the reference implementation of Equihash lately, and it’s also quite a bit faster than zcashd’s for the current 200,9 parameters (this wasn’t the case for the smaller parameters Zcash used previously), but it needs some hacks to make it work for those, and the way it uses BLAKE2b hashes is different from Zcash’s, so it’s not a drop-in. Dmitry committed a fix correcting a bug with the initial list size a few days ago. With that fix in, you only have to increase MAX_N from 32 to 40 and increase LIST_LENGTH and FORK_MULTIPLIER somewhat, such as to 10 and 5, respectively (roughly the tweaks sketched below), to have it find 90%+ (but not 100%) of solutions for 200,9.

With these settings, it’s about 10 seconds at about 1.7 GiB peak (trivial to halve that?) on one core of an i7-4770K (stock clocks, dual-channel DDR3-1600, 4x 8 GB DIMMs installed). Running 4 instances on the i7-4770K, it’s 13 seconds per set of solutions per instance. Running 8 instances, it’s 23 seconds per set of solutions per instance.

We need to multiply this by, say, 1.8 solutions (on average) per instance per invocation (ideally it’s closer to 1.9 solutions, but as I mentioned, this implementation doesn’t find 100% of them with these settings). This gives 8×1.8/23 = 0.63 S/s, which is 10x faster than what you quote for zcashd’s (but perhaps on a different CPU). If we factor in the 8×1.7 = 13.6 GiB total, of course it’d be a lot worse in terms of S/(GiB*s), but many people would not care. (And with only 4 concurrent instances it’d be better in terms of that metric; see the worked numbers below.) I’m listing peak memory usage by the processes here; the average is lower (I think you used averages, which would matter for optimally de-syncing multiple instances?)
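For reference, the parameter tweaks I mean are roughly the following. The constant names and values are the ones mentioned above; I’m writing them as plain defines just for illustration, and the actual declarations in the reference sources may look different:

```c++
// Rough sketch of the compile-time tweaks described above, NOT a literal diff
// against the reference sources -- the real declarations may differ in form.
#define MAX_N           40   // was 32; raised to make 200,9 work at all
#define LIST_LENGTH     10   // increased so that ~90%+ of the solutions are found
#define FORK_MULTIPLIER  5   // increased for the same reason
```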
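And here’s the arithmetic behind the throughput and memory figures above, spelled out with the same numbers so the rounding and the 4-vs-8 instance comparison are explicit:

```c++
// Solutions-per-second and memory arithmetic for the i7-4770K runs above.
#include <cstdio>

static void report(const char *label, int instances, double sols_per_run,
                   double seconds_per_run, double gib_per_instance) {
    double sps = instances * sols_per_run / seconds_per_run;  // total S/s
    double gib = instances * gib_per_instance;                // total peak GiB
    printf("%s: %.2f S/s, %.1f GiB, %.3f S/(GiB*s)\n",
           label, sps, gib, sps / gib);
}

int main() {
    // ~1.8 solutions per invocation on average, ~1.7 GiB peak per instance.
    report("4 instances", 4, 1.8, 13.0, 1.7);  // ~0.55 S/s at  6.8 GiB
    report("8 instances", 8, 1.8, 23.0, 1.7);  // ~0.63 S/s at 13.6 GiB
    return 0;
}
```

So 8 instances win on raw S/s, but 4 instances come out roughly 1.8x ahead on S/(GiB*s).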