I'm still ploughing through the theory behind the algorithm (described in the paper "Equihash: Asymmetric Proof-of-Work Based on the Generalized Birthday Problem"), but the main idea is that there are parameters you can set to determine the memory requirement of the algorithm. Here's an excerpt from the paper:

Our solution is practical and ready to deploy: a reference implementation of a proof-of-work requiring 700 MB of RAM runs in 30 seconds on a 1.8 GHz CPU, increases the computations by the factor of 1000 if memory is halved, and presents a proof of just 120 bytes long.

In contrast, from the figures it's apparent that doubling the memory reduces speed negligibly. The final parameters for v1.0 still need to be chosen to set the limiting memory value, but the described behaviour will not change. It is built into the algorithm and the whole idea of choosing this algorithm is to get this behaviour, with as little chance as possible of circumventing it.
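To get a feel for how steep that tradeoff is, here's a back-of-the-envelope sketch (my own, not from the paper): taking the quoted figure of a ~1000x computation increase per halving of memory, and the 700 MB reference requirement, the penalty compounds per halving.

```python
import math

# Toy model (my own sketch, not from the paper): each halving of memory
# below the reference requirement multiplies the computation by ~1000,
# per the excerpt quoted above. The 700 MB reference is also from the excerpt.
def compute_penalty(memory_mb, reference_mb=700, factor_per_halving=1000):
    """Rough computational penalty for running below the reference memory."""
    if memory_mb >= reference_mb:
        return 1.0  # at or above the reference, assume no tradeoff penalty
    halvings = math.log2(reference_mb / memory_mb)
    return factor_per_halving ** halvings

print(compute_penalty(700))  # → 1.0
print(compute_penalty(350))  # → 1000.0 (one halving)
print(compute_penalty(175))  # → 1000000.0 (two halvings)
```

So a miner trying to run in a quarter of the memory would, on this reading, pay roughly a millionfold in computation, which is the asymmetry the designers are after.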

Increasing the number of threads from 1 to 8 also has little effect, merely halving the run time. Which brings me to my question.

The speed of solution is limited by memory bandwidth. Should one then assume that it is the number of computers, not the number of cores, that is the limiting factor? I.e., are 8 computers, each with 1 core and 1 GB of memory, on the order of 5 times faster than one computer with 8 cores and 8 GB, assuming the same memory and CPU speeds?
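One way to frame the comparison (my own simplification, with made-up bandwidth and demand numbers, purely to illustrate the intuition): if the solver saturates the memory bus, then cores sharing one bus split its bandwidth, while separate machines each get a full bus to themselves.

```python
# Toy bandwidth-limited model (my own simplification; the numbers are invented).
# Assumption: solving throughput is capped by memory bandwidth, not core count.
def throughput(cores, buses, per_core_demand=10.0, bus_bandwidth=25.0):
    """Aggregate solving throughput in arbitrary units.

    per_core_demand: bandwidth one core would consume if never starved.
    bus_bandwidth:   bandwidth one memory bus can supply.
    Cores are split evenly across buses; each bus caps its cores' demand.
    """
    cores_per_bus = cores / buses
    per_bus = min(cores_per_bus * per_core_demand, bus_bandwidth)
    return per_bus * buses

one_box = throughput(cores=8, buses=1)   # 8 cores contending for 1 bus → 25.0
eight_boxes = throughput(cores=8, buses=8)  # each core has its own bus → 80.0
print(eight_boxes / one_box)  # → 3.2x in favour of separate machines
```

Under these invented numbers the separate machines come out ~3x ahead, which matches the direction of the question, though the real ratio depends entirely on the actual bandwidth figures.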

Or to put it differently: assuming one has enough memory, does setting genproclimit=N rather than genproclimit=1 mean that each of the N processes will use 1 GB and the total speed will be N times faster? Or is this the multi-threading mentioned in section IV.B, so that the memory bandwidth limitation applies in this case as well (as I think it does)?