Quantum Espresso with GPU benchmark 2

Now these are interesting. JRTI has an NVIDIA A100 GPU with 80 GB of RAM. It performs best with 8 CPUs per GPU and NPOOL=8. Extreme oversubscription seems to be good for this architecture. Still, Forever-Diamond runs circles around that GPU with its gaming cards. It seems that for Q-E, quantity matters more than quality. The more shocking result comes from the CPU-only test. Triumphant Coal is a relatively old workstation with 32 CPUs, and it seems to outperform the GPU! This is very strange. Am I doing something wrong?
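For reference, a run like the best JRTI configuration (8 CPUs sharing one GPU, NPOOL=8) could be launched roughly as sketched below. This is a minimal sketch, assuming one MPI rank per CPU core, an `mpirun` launcher, and a hypothetical input file `scf.in`; `-npool` and `-inp` are standard `pw.x` command-line options, while the helper name `build_pw_command` is made up for illustration.

```python
import shlex


def build_pw_command(n_ranks: int, npool: int, input_file: str) -> list[str]:
    """Build a (hypothetical) mpirun command line for pw.x with k-point pools."""
    # pw.x splits the MPI ranks evenly into pools, so the rank count
    # should be divisible by the number of pools.
    if n_ranks % npool != 0:
        raise ValueError(
            f"{n_ranks} ranks cannot be split evenly into {npool} pools"
        )
    return [
        "mpirun", "-np", str(n_ranks),  # assumed launcher, one rank per CPU core
        "pw.x",
        "-npool", str(npool),           # number of k-point pools (NPOOL)
        "-inp", input_file,             # hypothetical input file name
    ]


if __name__ == "__main__":
    # 8 ranks on a single GPU with 8 pools, i.e. one rank per pool.
    cmd = build_pw_command(n_ranks=8, npool=8, input_file="scf.in")
    print(shlex.join(cmd))
```

With 8 ranks and 8 pools, every pool gets exactly one rank, so all eight ranks are effectively competing for the same GPU.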

Note: As of 9/2022, NDIAG still does not seem to be implemented for the GPU. Increasing NDIAG on Forever-Diamond does not change the timings in any meaningful way.
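A test like this amounts to rerunning the same job while varying only `-ndiag`. The sketch below generates such a sweep. It assumes an `mpirun` launcher and a hypothetical `scf.in` input, and it picks perfect squares as candidates on the assumption that the diagonalization group is laid out as a square processor grid. It also assumes the diagonalization group lives inside a single pool, so values larger than the number of ranks per pool are skipped. The helper `ndiag_sweep` is illustrative, not part of Quantum ESPRESSO.

```python
def ndiag_sweep(
    n_ranks: int,
    npool: int,
    candidates: tuple[int, ...] = (1, 4, 9, 16),
    input_file: str = "scf.in",  # hypothetical input file name
) -> list[list[str]]:
    """Return one pw.x command line per usable NDIAG value (hypothetical helper)."""
    if n_ranks % npool != 0:
        raise ValueError(
            f"{n_ranks} ranks cannot be split evenly into {npool} pools"
        )
    ranks_per_pool = n_ranks // npool
    commands = []
    for ndiag in candidates:
        # Assumption: the diagonalization group is carved out of one pool,
        # so it cannot use more ranks than that pool owns.
        if ndiag > ranks_per_pool:
            continue
        commands.append([
            "mpirun", "-np", str(n_ranks),
            "pw.x",
            "-npool", str(npool),
            "-ndiag", str(ndiag),
            "-inp", input_file,
        ])
    return commands


if __name__ == "__main__":
    # 16 ranks in a single pool: every candidate fits.
    for cmd in ndiag_sweep(n_ranks=16, npool=1):
        print(" ".join(cmd))
```

Each generated command can then be timed separately; only the `-ndiag` value differs between runs.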

