Calculate Blocks & Threads with CUDA Cores

Using NHEQMINER, how do you calculate or tune the number of blocks and threads relative to the number of CUDA cores?

This is the information I have so far:

Hardware Constraints:

This is the easy-to-quantify part. Appendix F of the current CUDA programming guide lists a number of hard limits on how many threads per block a kernel launch can have. If you exceed any of these, your kernel will never run. They can be roughly summarized as:

Each block cannot have more than 512/1024 threads in total (Compute Capability 1.x or 2.x-3.x respectively)
The maximum dimensions of each block are limited to [512,512,64]/[1024,1024,64] (Compute 1.x/2.x)
Each block cannot consume more than 8k/16k/32k registers in total (Compute 1.0, 1.1/1.2, 1.3/2.x)
Each block cannot consume more than 16 KB/48 KB of shared memory (Compute 1.x/2.x)
If you stay within those limits, any kernel you can successfully compile will launch without error.
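
The limits listed above can be turned into a quick sanity check. The following is a minimal Python sketch (plain arithmetic, no GPU required) that uses only the compute 1.x and 2.x numbers quoted above. The names `LIMITS` and `check_block` are hypothetical, and the register check simply multiplies threads by registers per thread, which ignores the hardware's allocation granularity.

```python
# Per-block limits as summarized in the list above (compute 1.x and 2.x only).
# Tuple layout: (max threads per block, max block dimensions,
#                max registers per block, max shared memory bytes per block)
LIMITS = {
    (1, 0): (512, (512, 512, 64), 8 * 1024, 16 * 1024),
    (1, 1): (512, (512, 512, 64), 8 * 1024, 16 * 1024),
    (1, 2): (512, (512, 512, 64), 16 * 1024, 16 * 1024),
    (1, 3): (512, (512, 512, 64), 16 * 1024, 16 * 1024),
    (2, 0): (1024, (1024, 1024, 64), 32 * 1024, 48 * 1024),
    (2, 1): (1024, (1024, 1024, 64), 32 * 1024, 48 * 1024),
}


def check_block(cc, block_dim, regs_per_thread, shared_bytes):
    """Return a list of violated limits; an empty list means none were hit.

    cc              -- compute capability as a (major, minor) tuple
    block_dim       -- block dimensions as an (x, y, z) tuple
    regs_per_thread -- registers used by each thread (assumed known)
    shared_bytes    -- shared memory used by each block, in bytes
    """
    max_threads, max_dims, max_regs, max_shared = LIMITS[cc]
    threads = block_dim[0] * block_dim[1] * block_dim[2]
    problems = []

    if threads > max_threads:
        problems.append(f"{threads} threads per block exceeds {max_threads}")

    for axis, size, limit in zip("xyz", block_dim, max_dims):
        if size > limit:
            problems.append(f"block dimension {axis}={size} exceeds {limit}")

    # Simplification: real hardware allocates registers in coarser units.
    if threads * regs_per_thread > max_regs:
        problems.append(
            f"{threads * regs_per_thread} registers per block exceeds {max_regs}"
        )

    if shared_bytes > max_shared:
        problems.append(f"{shared_bytes} bytes of shared memory exceeds {max_shared}")

    return problems


if __name__ == "__main__":
    # A 256-thread block with 32 registers per thread on compute 2.0.
    print(check_block((2, 0), (256, 1, 1), 32, 0))
    # The same idea on compute 1.0 with 1024 threads violates several limits.
    for problem in check_block((1, 0), (1024, 1, 1), 16, 0):
        print(problem)
```

The registers-per-thread figure is an input here; for a real kernel it would have to come from the compiler or a profiler.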

Performance Tuning:

This is the empirical part. The number of threads per block you choose within the hardware constraints outlined above can and does affect the performance of code running on the hardware. Each code behaves differently, and the only real way to quantify it is by careful benchmarking and profiling. But again, very roughly summarized:

The number of threads per block should be a round multiple of the warp size, which is 32 on all current hardware.
Each streaming multiprocessor unit on the GPU must have enough active warps to sufficiently hide all of the different memory and instruction pipeline latency of the architecture and achieve maximum throughput. The orthodox approach here is to try achieving optimal hardware occupancy (what Roger Dahl’s answer is referring to).
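
The two rules of thumb above can be sketched as a rough occupancy estimate. This is a simplified illustration in plain Python, not the official occupancy calculator. The function names are hypothetical, and the per-multiprocessor defaults (48 resident warps, 8 resident blocks, 32768 registers, 48 KB of shared memory) are assumptions for compute capability 2.x that do not come from the text above. Register and shared memory allocation granularity is ignored.

```python
WARP_SIZE = 32  # warp size stated above for all current hardware


def round_up_to_warp(threads):
    """Round a thread count up to the next multiple of the warp size."""
    return ((threads + WARP_SIZE - 1) // WARP_SIZE) * WARP_SIZE


def estimate_occupancy(threads_per_block, regs_per_thread, shared_per_block,
                       max_warps_per_sm=48, max_blocks_per_sm=8,
                       regs_per_sm=32768, shared_per_sm=49152):
    """Estimate the fraction of a multiprocessor's warp slots kept active.

    The keyword defaults are assumed compute 2.x values; replace them
    with the figures for the actual GPU.
    """
    warps_per_block = round_up_to_warp(threads_per_block) // WARP_SIZE

    # Each resource caps how many blocks can be resident at once.
    limits = [max_blocks_per_sm, max_warps_per_sm // warps_per_block]
    if regs_per_thread > 0:
        regs_per_block = regs_per_thread * warps_per_block * WARP_SIZE
        limits.append(regs_per_sm // regs_per_block)
    if shared_per_block > 0:
        limits.append(shared_per_sm // shared_per_block)

    resident_blocks = min(limits)
    return resident_blocks * warps_per_block / max_warps_per_sm


if __name__ == "__main__":
    # Compare a few candidate block sizes for a kernel that is assumed
    # to use 20 registers per thread and no shared memory.
    for threads in (64, 128, 192, 256, 512):
        occupancy = estimate_occupancy(threads, 20, 0)
        print(f"{threads:4d} threads per block: occupancy {occupancy:.2f}")
```

An estimate like this only narrows down the candidates. As the text says, the final choice still has to be confirmed by benchmarking the actual kernel.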