I have just compiled VASP 6 using the NVIDIA HPC SDK.
OpenACC is the "next phase" of GPU acceleration for HPC codes. It has some nice features, and there is some very promising talk about this implementation here, here and here. I hope it is not just marketing. At the moment I am more interested in the Gamma-point + GPU capability.
This is the makefile.include I used. Note that this workstation has two NVIDIA RTX 3090 GPUs, i.e. compute capability 8.6; you have to adjust the compute capability for other cards.
# Precompiler options
CPP_OPTIONS= -DHOST=\"Forever-Diamond-cuda\" -DPGI16 \
-DMPI -DMPI_BLOCK=8000 -DMPI_INPLACE \
-Duse_collective \
-DscaLAPACK \
-DCACHE_SIZE=4000 \
-Davoidalloc \
-Dvasp6 \
-Duse_bse_te \
-Dtbdyn \
-DVASP2WANNIER90v2 \
-Dqd_emulate \
-Dfock_dblbuf \
-D_OPENACC \
-DUSENCCL
CPP = pgf90 -Mpreprocess -Mfree -Mextend -E $(CPP_OPTIONS) $*$(FUFFIX) > $*$(SUFFIX)
FC = mpif90 -acc -gpu=cc80
FCL = mpif90 -acc -gpu=cc80 -pgc++libs
FREE = -Mfree
FFLAGS = -Mnoupcase -Mbackslash -Mlarge_arrays
OFLAG = -fast
DEBUG = -Mfree -O0 -traceback
# Use PGI provided BLAS and LAPACK libraries
BLAS = -lblas
LAPACK = -llapack
BLACS =
SCALAPACK = -Mscalapack
CUDA = -Mcudalib=cublas -Mcudalib=cufft -Mcudalib=cusolver -Mcuda
LLIBS = $(SCALAPACK) $(LAPACK) $(BLAS) $(CUDA)
NCCL ?= /share/apps/nvidia_hpc_sdk/Linux_x86_64/20.9/comm_libs/11.0/nccl/lib/
LLIBS += -L$(NCCL) -lnccl
# Software emulation of quadruple precision
QD = /share/apps/nvidia_hpc_sdk/Linux_x86_64/20.9/compilers/extras/qd
LLIBS += $(QD)/lib/libqdmod.a $(QD)/lib/libqd.a
INCS += -I$(QD)/include/qd
# Use the FFTs from fftw
FFTW ?= /opt/gnu/fftw-3.3.6-pl2-GNU-5.4.0
LLIBS += -L$(FFTW)/lib -lfftw3
INCS += -I$(FFTW)/include
LLIBS += /share/apps/wannier90/wannier90-3.1.0-pgi/libwannier-pgf90.a
OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
# Redefine the standard list of O1 and O2 objects
SOURCE_O1 := pade_fit.o
SOURCE_O2 := pead.o
# Workaround a bug in PGI compiler up to and including version 18.10
OFLAG_IN = -fast -gpu=cc80
SOURCE_IN := xcspin.o
# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = pgf90
CC_LIB = pgcc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1 -Mfixed
FREE_LIB = $(FREE)
OBJECTS_LIB= linpack_double.o getshmem.o
# For the parser library
CXX_PARS = pgc++ --no_warnings
# Mandatory for PGI and CUDA
MPIDIR := $(MPIDIR)
MPI_INC = $(MPIDIR)/include
CUDA_ROOT := $(CUDA_ROOT)
GENCODE_ARCH = -gencode=arch=compute_60,code=\"sm_60,compute_60\" -gencode=arch=compute_80,code=\"sm_80,compute_80\"
#CUDA_LIBS := $(CUDA_ROOT)
#NVCC := $(NVCC)
# Normally no need to change this
SRCDIR = ../../src
BINDIR = ../../bin
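With this makefile.include in the VASP root directory, the build itself follows the standard VASP 6 procedure. A minimal sketch, assuming the HPC SDK compilers and MPI wrappers are already on your PATH (the compute_cap query field needs a reasonably recent driver and is shown here only as a convenience):

```shell
# Check the compute capability of the installed GPUs
# (prints e.g. 8.6 for an RTX 3090; needs a recent driver)
nvidia-smi --query-gpu=compute_cap --format=csv,noheader

# From the VASP 6 source root, build the standard and
# Gamma-point-only binaries
make veryclean
make std gam
```

The resulting binaries (vasp_std, vasp_gam) end up in bin/.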
I discovered some caveats along the way.

- For some reason, I can only use 1 CPU per GPU. In my experience, 2 CPUs per GPU always yields better performance.
- In my case,
  export CUDA_VISIBLE_DEVICES="0,1"
  must be set. Otherwise VASP "refuses to run this sick job".
- I had to set
  export UCX_MEMTYPE_CACHE=n
  Otherwise, the program hangs and does nothing. I found the answer in this thread.
- In order to debug issues, I used
  export NCCL_DEBUG=INFO

It seems to be working fine at the moment.
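Putting the environment variables above together, a minimal launch script for this two-GPU box could look like the following sketch (the binary path is a placeholder; adjust it to your install):

```shell
#!/bin/bash
# Expose both GPUs to VASP
export CUDA_VISIBLE_DEVICES="0,1"
# Work around the hang caused by the UCX memory-type cache
export UCX_MEMTYPE_CACHE=n
# Optional: verbose NCCL output while debugging
export NCCL_DEBUG=INFO

# One MPI rank per GPU (see the caveat above)
mpirun -np 2 /path/to/vasp_gam
```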