How to collect GPU/CPU profiling data using NVHPC SDK in an HPC environment

How to profile GPU applications using NVIDIA HPC (2022 version)

NVIDIA HPC SDK is free, and it has gotten really good recently.

NVIDIA is phasing out nvprof and switching to Nsight Systems (nsys). nvprof can no longer be used for devices with compute capability >= 8.0. nsys is bundled freely with NVHPC; only the GUI needs to be downloaded separately here (user account required). I shall be using "Nsight Systems".
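Before doing anything else, it is worth checking which nsys the cluster actually gives you. A minimal sketch, assuming nsys comes from an environment module (the module name here is an assumption; check your site's module list):

    module load nvhpc    # 'nvhpc' is a guess; run 'module avail' to find the real name
    which nsys           # confirm nsys is on the PATH
    nsys --version       # note the version; available trace options depend on it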

Preliminaries:

  • Download and install NVHPC and Nsight on your PC. Hopefully the HPC center does this for you.
  • Compile your program using NVHPC. I shall be using Quantum Espresso. Although there are no special instructions in the Nsight documentation, the older NVIDIA Profiler documentation did make a recommendation. I don't know if this helps, but here is the relevant passage anyway.

1.5. Profiling CUDA Fortran Applications: CUDA Fortran applications compiled with the PGI CUDA Fortran compiler can be profiled by nvprof and the Visual Profiler. In cases where the profiler needs source file and line information (kernel profile analysis, global memory access pattern analysis, divergent execution analysis, etc.), use the "-Mcuda=lineinfo" option when compiling. This option is supported on Linux 64-bit targets in PGI 2019 version 19.1 or later.

Caveat: `-Mcuda=lineinfo` is not compatible with `-cuda`, and the compiler refuses to proceed if both are given. I disabled `-cuda` in make.inc.
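Before touching a large build, it can help to try the flag on a trivial standalone file. A minimal sketch, assuming nvfortran is on the PATH; saxpy.cuf is just a placeholder name for any small CUDA Fortran source, and the second variant is only something newer NVHPC releases may accept, so check your compiler's documentation:

    # compile one CUDA Fortran file with source line information for the profiler
    nvfortran -fast -Mcuda=lineinfo -o saxpy saxpy.cuf    # saxpy.cuf is a placeholder file name
    # newer NVHPC releases may prefer -gpu=lineinfo together with -cuda instead
    nvfortran -fast -cuda -gpu=lineinfo -o saxpy saxpy.cuf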

In the Quantum Espresso installation, two lines in make.inc need to be edited:

F90FLAGS       = -fast -Mcuda=lineinfo -Mcache_align -Mpreprocess -Mlarge_arrays -mp $(FDFLAGS) $(CUDA_F90FLAGS) $(IFLAGS) $(MODFLAGS) ### added -Mcuda=lineinfo
CUDA_F90FLAGS=-Mcuda=lineinfo -gpu=ccall,cuda11.2 $(MOD_FLAG)$(TOPDIR)/external/devxlib/src $(MOD_FLAG)$(TOPDIR)/external/devxlib/include -acc $(MOD_FLAG)$(TOPDIR)/external/devxlib/src ## removed -cuda, added -Mcuda=lineinfo
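If you want to confirm that the new flags actually reach the compiler, one option is to grep the build output. A minimal sketch, assuming a standard Quantum Espresso build tree; the log file name is just an example:

    # rebuild pw.x and keep a copy of the build log
    make pw 2>&1 | tee build.log
    # check that the profiling flag shows up in the compile lines
    grep -m1 -- "-Mcuda=lineinfo" build.log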

HPC Center

  • How to run an MPI program using the nsys CLI is explained here. The command-line argument scheme is a bit nasty, but I finally managed it after a couple of trials. The generated profiling files are huge! Be prepared!
  • Basically, you add nsys profile after mpirun and its options, before the executable. For example:
    mpirun -np 4 nsys profile -o '%h-%p' -w true -t 'cuda,cublas,openacc,openmp,mpi,nvtx' --cudabacktrace all /share/apps/JRTI/q-e/nvhpc/git-7.1-profiling/bin/pw.x -npool 4 -ndiag 2 -ntg 1 -inp /home/obm/TPP-crystal/ground_state/H2TPP-kanoetal-pbesol/in.H2TPP-kanoetal-pbesol > /home/obm/TPP-crystal/ground_state/H2TPP-kanoetal-pbesol/out.H2TPP-kanoetal-pbesol-130922-1450_988
    • profile: run the application under the profiler.
    • -o: output file name; here it is built from the host name and process ID ('%h-%p'). Since the profiling data will be copied automatically from the hosts' tmp directories to your home directory, some mechanism for making the names unique is necessary.
    • -w true: do not block stdout/stderr. You need this if you want to get the output of Quantum Espresso in the usual way.
    • -t: the APIs you want to trace. The available options can change depending on your NVHPC installation version.
    • --cudabacktrace all: collect backtraces for CUDA API calls, to help identify the calling routines.
  • Once the run is complete, nsys will have generated huge profiling files, e.g. JRTI.cluster-27795.qdrep, one for each MPI process. Copy these back to your workstation (hopefully one with a lot of RAM and a fast disk). Alternatively, you can extract specific information from these files into various formats suitable for further analysis using nsys stats; see here and the sketch after this list.
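A minimal sketch of the nsys stats route, run on one of the generated report files from the example above; the available reports and output options vary between nsys versions, so check nsys stats --help on your installation:

    # print the default summary reports for one MPI rank's profile
    nsys stats JRTI.cluster-27795.qdrep
    # or write the same tables out as CSV files next to the report (option details depend on your nsys version)
    nsys stats --format csv --output . JRTI.cluster-27795.qdrep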

Workstation

  • Invoke the UI with nsys-ui qdrep-file. nsys-ui comes with the Nsight Systems installation; see the example below.
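For example, using the report name from the run above (a sketch; substitute whatever file names your run actually produced):

    # open one MPI rank's report in the Nsight Systems GUI
    nsys-ui JRTI.cluster-27795.qdrep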