I have just compiled VASP 6 using the NVIDIA HPC SDK.
OpenACC is the "next phase" of GPU acceleration for HPC codes. It has some nice features, and there is some very promising talk about this implementation here, here and here. I hope it is not just marketing. At the moment I am more interested in the Gamma-point + GPU capability.
This is the makefile.include I used. Note that this workstation has two NVIDIA RTX 3090 GPUs, i.e. compute capability 8.6; you have to adjust the compute capability for other cards.
# Precompiler options
CPP_OPTIONS= -DHOST=\"Forever-Diamond-cuda\" -DPGI16 \
-DMPI -DMPI_BLOCK=8000 -DMPI_INPLACE \
-Duse_collective \
-DscaLAPACK \
-DCACHE_SIZE=4000 \
-Davoidalloc \
-Dvasp6 \
-Duse_bse_te \
-Dtbdyn \
-DVASP2WANNIER90v2 \
-Dqd_emulate \
-Dfock_dblbuf \
-D_OPENACC \
-DUSENCCL
CPP = pgf90 -Mpreprocess -Mfree -Mextend -E $(CPP_OPTIONS) $*$(FUFFIX) > $*$(SUFFIX)
FC = mpif90 -acc -gpu=cc80
FCL = mpif90 -acc -gpu=cc80 -pgc++libs
FREE = -Mfree
FFLAGS = -Mnoupcase -Mbackslash -Mlarge_arrays
OFLAG = -fast
DEBUG = -Mfree -O0 -traceback
# Use PGI provided BLAS and LAPACK libraries
BLAS = -lblas
LAPACK = -llapack
BLACS =
SCALAPACK = -Mscalapack
CUDA = -Mcudalib=cublas -Mcudalib=cufft -Mcudalib=cusolver -Mcuda
LLIBS = $(SCALAPACK) $(LAPACK) $(BLAS) $(CUDA)
NCCL ?= /share/apps/nvidia_hpc_sdk/Linux_x86_64/20.9/comm_libs/11.0/nccl/lib/
LLIBS += -L$(NCCL) -lnccl
# Software emulation of quadruple precision
QD = /share/apps/nvidia_hpc_sdk/Linux_x86_64/20.9/compilers/extras/qd
LLIBS += $(QD)/lib/libqdmod.a $(QD)/lib/libqd.a
INCS += -I$(QD)/include/qd
# Use the FFTs from fftw
FFTW ?= /opt/gnu/fftw-3.3.6-pl2-GNU-5.4.0
LLIBS += -L$(FFTW)/lib -lfftw3
INCS += -I$(FFTW)/include
LLIBS += /share/apps/wannier90/wannier90-3.1.0-pgi/libwannier-pgf90.a
OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
# Redefine the standard list of O1 and O2 objects
SOURCE_O1 := pade_fit.o
SOURCE_O2 := pead.o
# Workaround a bug in PGI compiler up to and including version 18.10
OFLAG_IN = -fast -gpu=cc80
SOURCE_IN := xcspin.o
# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = pgf90
CC_LIB = pgcc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1 -Mfixed
FREE_LIB = $(FREE)
OBJECTS_LIB= linpack_double.o getshmem.o
# For the parser library
CXX_PARS = pgc++ --no_warnings
# Mandatory for PGI and CUDA
MPIDIR := $(MPIDIR)
MPI_INC = $(MPIDIR)/include
CUDA_ROOT := $(CUDA_ROOT)
GENCODE_ARCH = -gencode=arch=compute_60,code=\"sm_60,compute_60\" -gencode=arch=compute_80,code=\"sm_80,compute_80\"
#CUDA_LIBS := $(CUDA_ROOT)
#NVCC := $(NVCC)
# Normally no need to change this
SRCDIR = ../../src
BINDIR = ../../bin
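With this makefile.include in the VASP root directory, the build itself follows the standard VASP 6 procedure. A minimal sketch, assuming the HPC SDK compilers and MPI wrappers are already on your PATH (the compute_cap query field needs a reasonably recent driver and is shown here only as a convenience):

```shell
# Check the compute capability of the installed GPUs
# (prints e.g. 8.6 for an RTX 3090; needs a recent driver)
nvidia-smi --query-gpu=compute_cap --format=csv,noheader

# From the VASP 6 source root, build the standard and
# Gamma-point-only binaries
make veryclean
make std gam
```

The resulting binaries (vasp_std, vasp_gam) end up in bin/.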
I discovered some caveats along the way.

- For some reason, I can only use 1 CPU per GPU. In my experience, 2 CPUs per GPU always yields better performance.
- In my case,
  export CUDA_VISIBLE_DEVICES="0,1"
  must be set. Otherwise VASP "refuses to run this sick job".
- I had to set
  export UCX_MEMTYPE_CACHE=n
  Otherwise, the program hangs and does nothing. I found the answer in this thread.
- In order to debug issues, I used
  export NCCL_DEBUG=INFO

It seems to be working fine at the moment.
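Putting the environment variables above together, a minimal launch script for this two-GPU box could look like the following sketch (the binary path is a placeholder; adjust it to your install):

```shell
#!/bin/bash
# Expose both GPUs to VASP
export CUDA_VISIBLE_DEVICES="0,1"
# Work around the hang caused by the UCX memory-type cache
export UCX_MEMTYPE_CACHE=n
# Optional: verbose NCCL output while debugging
export NCCL_DEBUG=INFO

# One MPI rank per GPU (see the caveat above)
mpirun -np 2 /path/to/vasp_gam
```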