cuQuantum multi-node multi-gpu benchmarks in VEGA

The VEGA A100 GPU cluster is a cost-effective platform for quantum code development

NVIDIA cuQuantum Appliance

The NVIDIA cuQuantum Appliance is a deployment-ready container image for simulating quantum circuits on classical GPU hardware. It bundles NVIDIA's cuQuantum libraries with GPU-accelerated builds of popular simulation frameworks (Qiskit Aer and Google's qsim) and is designed for high performance and scalability: the state vector can be distributed across many GPUs and nodes. The image is published in the NVIDIA NGC catalog and integrates easily into existing data-center infrastructure, offering a cost-effective way for organizations to explore quantum algorithms without investing in quantum hardware.

"NVIDIA/cuQuantum: cuQuantum v22.11", NVIDIA cuQuantum team, 2022, DOI: 10.5281/zenodo.6385574.

AER and CUSVAER

Qiskit Aer is a high-performance quantum simulator developed by IBM for simulating quantum circuits and algorithms. It provides a platform for testing, debugging, and optimizing quantum code before it is run on actual quantum devices. The Aer simulator supports several simulation methods, including statevector, density matrix, and unitary matrix simulations. cuQuantum Appliance 22.11 contains the first release of cusvaer. cusvaer is designed as a Qiskit backend solver and is optimized for distributed state vector simulation; see the cusvaer documentation (https://docs.nvidia.com/cuda/cuquantum/appliance/cusvaer.html) for details.

  • Distributed simulation

    cusvaer distributes state vector simulations to multiple processes and nodes.

  • Power-of-2 configuration for performance

    The number of GPUs, processes, and nodes should always be a power of two. This design was chosen for optimal performance.

  • Shipped with a validated set of libraries

    cusvaer is shipped with the latest cuStateVec and a validated set of MPI libraries. The performance has been validated on NVIDIA SuperPOD.
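The power-of-two constraint is easy to check before submitting a job. A minimal sketch in plain Python (this is a generic bit trick, not a cuQuantum API):

```python
def is_power_of_two(n: int) -> bool:
    """A positive integer is a power of two iff it has exactly one set bit."""
    return n > 0 and (n & (n - 1)) == 0

# Valid cusvaer layouts use power-of-two GPU, process, and node counts:
print(is_power_of_two(4), is_power_of_two(16))  # True True
print(is_power_of_two(12))                      # False
```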

When slicing the state vector, the size of each sub state vector is determined by the number of qubits in the circuit and the number of GPUs used for distribution.

Ex. A 40-qubit complex128 (double-precision) simulation can be computed using 32 DGX A100 nodes with the following configuration.

  • The state vector size : 16 TiB (= 16 bytes/element * (1 << 40))
  • The number of GPUs per node : 8
  • The number of nodes : 32
  • The size of the sub state vector on each GPU : 64 GiB/GPU = 16 TiB / (32 nodes x 8 GPUs/node)
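The sizing arithmetic above can be verified in a few lines of Python (constants taken from the example):

```python
# State-vector sizing for a distributed simulation.
nqubits = 40
bytes_per_amp = 16            # complex128: 2 * 8 bytes per amplitude
gpus_per_node = 8             # DGX A100
nodes = 32

sv_bytes = bytes_per_amp * (1 << nqubits)       # total state vector size
per_gpu = sv_bytes // (nodes * gpus_per_node)   # slice held by each GPU

print(sv_bytes // 2**40, "TiB total")    # 16 TiB total
print(per_gpu // 2**30, "GiB per GPU")   # 64 GiB per GPU
```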

Results

Maximum Number of Qubits and Precision

The memory requirements of a multi-node multi-GPU simulation are detailed in the cuQuantum Appliance documentation. Basically, each additional qubit doubles the required resources.

VEGA GPU nodes contain 4x 40 GB A100 GPUs, so the largest power-of-two state-vector slice that fits is 32 GiB per GPU. I am not entirely sure whether some memory is lost to scaling and/or optimization, so I took an empirical approach and asked the benchmark utility how many qubits are available.

Nodes  GPUs  Precision  Number of qubits possible
1      2x1   Single     33
1      4x1   Single     34
1      2x1   Double     32
1      4x1   Double     33
16     4x16  Double     37
16     4x16  Single     38

For a single node (4 GPUs), the simulator can allocate 33 qubits in double precision and 34 qubits in single precision. The scaling is as expected: each additional qubit doubles the required resources. If there is some overhead, it seems to be handled within the "remainder" memory of the GPUs (remember, we need 32 GiB ≈ 34.4 GB per GPU out of the 40 GB available). So the theoretical maximum number of qubits that can be achieved using 32 nodes in VEGA is 38 qubits in double precision or 39 qubits in single precision.
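This scaling rule can be sketched as a small helper: with power-of-two slices of 32 GiB per GPU, the largest simulable register is the log2 of the aggregate memory divided by the bytes per amplitude (a sketch that ignores any simulator overhead):

```python
import math

def max_qubits(n_gpus, gib_per_gpu=32, bytes_per_amp=16):
    """Largest n such that 2**n amplitudes fit in the aggregate GPU memory
    (power-of-two slices, overhead not accounted for)."""
    total_bytes = n_gpus * gib_per_gpu * 2**30
    return int(math.log2(total_bytes // bytes_per_amp))

print(max_qubits(4))                    # 1 VEGA node, double precision -> 33
print(max_qubits(4, bytes_per_amp=8))   # single precision -> 34
print(max_qubits(32 * 4))               # 32 nodes, double precision -> 38
```

The outputs match the empirically measured limits in the table above.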

If there were 4 more nodes (64 nodes in total), then the (mostly marketing-related) "supremacy" limit of 40 qubits would have been possible.

It remains to be seen if single precision is suitable for our needs.

GPU communication and runtime (all timings are in seconds)

Nodes  GPU/P2P  Precision  Qubits  QFT        iQFT       GHZ       Simon     Hidden shift  QAOA      QPE       QV         random
1      2x1      Double     32      10.135070  10.310648  1.847320  1.960876  1.772713      9.399043  9.399043  10.264266  7.718304
1      2x1      Single     32      4.705123   4.939964   0.515312  0.441822  0.478120      4.340655  4.340655  4.340655   4.340655
1      4x1      Double     32      4.802190   5.194264   0.657862  0.695858  0.607753      4.538931  4.924045  5.359411   4.137720
1      4x1      Single     32      2.434304   2.713469   0.347724  0.344743  0.322620      2.337676  2.578627  2.724031   -
2      8x1      Double     32      6.639592   8.845513   2.442964  -         -             -         -         -          -
2      8x1      Single     32      3.300662   4.455191   1.214069  -         -             -         -         -          -
2      4x2      Double     32      6.837345   8.845399   2.427653  -         -             -         -         -          -

When there are multiple nodes, tests after GHZ fail (the SLURM manager intervenes and cancels the job; the reason is not explained on the user side). The message I get is: notice: cudaErrorInvalidResourceHandle, failed to get device memory pointer by CUDA IPC

Large number of Qubits

I was only able to allocate 16 nodes (4 GPUs each); the problem with internode GPU communication remains an open issue.

 

Nodes  Precision  Qubits  QFT        iQFT       GHZ
16     Single     37      22.494051  30.920763  9.162877
16     Double     37      Failed     -          -

 

In the double-precision 16-node experiment, the system launched the benchmark on all nodes, but behaved as if only 2 nodes were able to communicate with each other. The reason is unclear.

Technical details

Base Singularity image

Download the base image from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuquantum-appliance
The benchmark suite is at https://github.com/NVIDIA/cuQuantum

singularity build cuQuantumApp22.11.sif docker://nvcr.io/nvidia/cuquantum-appliance:22.11
singularity build --sandbox cuQuantumApp22.11-benchmark-obm130323 cuQuantumApp22.11.sif 
apptainer shell --writable --fakeroot cuQuantumApp22.11-benchmark-obm130323/
cd /workspace
git clone https://github.com/NVIDIA/cuQuantum.git
cd cuQuantum
cd benchmarks
pip install .[all]

Disable NVIDIA disclaimer message

In /usr/local/bin/entrypoint.sh, comment out the line that prints the license text:

#cat /cuQuantum_license.txt

Required modules (VEGA)

cusvaer relies on several technologies in order to "meld" multiple GPUs on multiple nodes into a single logical GPU. The details are not entirely clear to me; when I run module load UCX-CUDA/1.12.1-GCCcore-11.3.0-CUDA-11.7.0 OpenMPI/4.1.4-GCC-11.3.0 on VEGA, the loaded modules are:

Currently Loaded Modules:
  1) GCCcore/11.3.0                               8) binutils/2.38-GCCcore-11.3.0      15) libevent/2.1.12-GCCcore-11.3.0
  2) zlib/1.2.12-GCCcore-11.3.0                   9) GCC/11.3.0                        16) libfabric/1.15.1-GCCcore-11.3.0
  3) numactl/2.0.14-GCCcore-11.3.0               10) XZ/5.2.5-GCCcore-11.3.0           17) PMIx/4.1.2-GCCcore-11.3.0
  4) UCX/1.12.1-GCCcore-11.3.0                   11) libxml2/2.9.13-GCCcore-11.3.0     18) UCC/1.0.0-GCCcore-11.3.0
  5) CUDA/11.7.0                                 12) libpciaccess/0.16-GCCcore-11.3.0  19) OpenMPI/4.1.4-GCC-11.3.0
  6) GDRCopy/2.3-GCCcore-11.3.0                  13) hwloc/2.7.1-GCCcore-11.3.0
  7) UCX-CUDA/1.12.1-GCCcore-11.3.0-CUDA-11.7.0  14) OpenSSL/1.1

OpenMPI and UCX are critical; cusvaer provides built-in support for OpenMPI and MPICH.

The versions shown below are validated or expected to work.

  • OpenMPI

    • Validated: v4.1.4 / UCX v1.13.1
    • Expected to work: v3.0.x, v3.1.x, v4.0.x, v4.1.x
  • MPICH

    • Validated: v4.0.2

It seems special considerations must be taken into account when compiling UCX for NVIDIA cards, and P2P communication between the cards within a node must also be enabled. Perhaps this is related to NVIDIA GPUDirect RDMA?

Drivers etc.

Not only the cuQuantum Appliance itself, but also sharing card memory over UCX and P2P sharing, requires the latest NVIDIA driver and CUDA 11.7.

Previous versions of CUDA and UCX compiled using different NVIDIA libraries kill the node!

Job script

Below is an example using 16 nodes, 4 gpus each.

Notice the following line:
BACKEND="--backend cusvaer --ngpus 1 --cusvaer-global-index-bits 2,4 --cusvaer-p2p-device-bits 2"
The Qiskit frontend will see 1 GPU, which is in fact all 16x4 GPUs together. The first value in the global index bits means 2^2 = 4 GPUs can communicate very fast (within a node), and the second means there are 2^4 = 16 nodes (2^6 = 64 GPUs in total). The p2p device bits value is the log2 of the number of devices that can communicate with each other using P2P (in VEGA, this should be equal to the first value of the global index bits).
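As a sanity check, the relationship between these index-bit settings and the device counts is plain arithmetic (not a cuQuantum API):

```python
# Decode cusvaer index-bit settings into device counts.
global_index_bits = [2, 4]   # [intra-node bits, inter-node bits]
p2p_device_bits = 2          # log2(devices reachable via P2P)

gpus_per_group = 1 << global_index_bits[0]   # 2^2 = 4 GPUs per node
nodes = 1 << global_index_bits[1]            # 2^4 = 16 nodes
total_gpus = 1 << sum(global_index_bits)     # 2^6 = 64 GPUs in total
p2p_gpus = 1 << p2p_device_bits              # 4 devices per P2P clique

print(gpus_per_group, nodes, total_gpus, p2p_gpus)  # 4 16 64 4
```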

Details are here https://github.com/NVIDIA/cuQuantum/tree/main/benchmarks
and here https://docs.nvidia.com/cuda/cuquantum/appliance/cusvaer.html

#!/bin/bash
################### cuQuantum Benchmark Job Batch Script Example ###################
# Section for defining queue-system variables:
#-------------------------------------
# SLURM-section
#SBATCH --partition=gpu
#SBATCH --nodes=16
#SBATCH --gres=gpu:4
#SBATCH --mem-per-cpu=20G
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=vega-cuQuantum-37q-16NODE-GRES4-NPN4-isolatedgpus
#SBATCH --time=10:00:00
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

###########################################################
# This section is for defining job variables and settings
# that need to be defined before running the job
###########################################################
#Number of actual processors to be used and threads
export OMP_NUM_THREADS=1
export OMPI_MCA_PML="ucx"
export OMPI_MCA_osc="ucx"
export UCX_TLS="self,rc_x,dc_x,sm,cuda_copy,cuda_ipc"
export HCOLL_CUDA_BCOL=nccl
export HCOLL_BCOL_P2P_CUDA_ZCOPY_ALLREDUCE_ALG=2

# A unique file tag for the created files
file_tag=$( date +"%d%m%y-%H%M" )-$SLURM_JOB_NAME

# We load all the default program system settings with module load:
#module load NVHPC/22.7
#module load OpenMPI/4.0.5-NVHPC-21.2-CUDA-11.2.1 UCX-CUDA 
module load UCX-CUDA/1.12.1-GCCcore-11.3.0-CUDA-11.7.0 OpenMPI/4.1.4-GCC-11.3.0


clean_scratch=1 


singularity_image="/ceph/hpc/home/euosmanbm/Prog/bins/singularity/cuQuantumApp22.11-benchmark-obm.sif"

# Supported Backends:
#  - aer: runs Qiskit Aer's CPU backend
#  - aer-cuda: runs the native Qiskit Aer GPU backend
#  - aer-cusv: runs Qiskit Aer's cuStateVec integration
#  - cusvaer: runs the *multi-GPU, multi-node* custom Qiskit Aer GPU backend, only
#    available in the cuQuantum Appliance container
#  - cirq: runs Cirq's native CPU backend (cirq.Simulator)
#  - cutn: runs cuTensorNet by constructing the tensor network corresponding to the
#    benchmark circuit (through cuquantum.CircuitToEinsum)
#  - qsim: runs qsim's CPU backend
#  - qsim-cuda: runs the native qsim GPU backend
#  - qsim-cusv: runs qsim's cuStateVec integration
#  - qsim-mgpu: runs the *multi-GPU* (single-node) custom qsim GPU backend, only
#    available in the cuQuantum Appliance container

BACKEND="--backend cusvaer --ngpus 1 --cusvaer-global-index-bits 2,4 --cusvaer-p2p-device-bits 2"

# Frontends: cirq,qiskit

FRONTEND="--frontend qiskit"

# 
QTEST="--benchmark all --nqubits 37 --precision double --verbose "

#WORKDIR should be set from the SLURM initialisation in VEGA
#NOTICE : I overwrite workdir, due to JSON files that contain the results
WORKDIR=$SLURM_SUBMIT_DIR/$SLURM_JOB_ID

if [ -z "${WORKDIR}" ]; then 
        scratch_base=/scratch/slurm/$SLURM_JOB_ID
	echo "WORKDIR was not defined, scratch base is $scratch_base"
else
	scratch_base=$WORKDIR
	echo "using WORKDIR=$WORKDIR"
fi
SCRATCH_DIRECTORY=${scratch_base}/${SLURM_JOB_NAME}/
mkdir -p ${SCRATCH_DIRECTORY}


if [ -d "${SCRATCH_DIRECTORY}" ]; then 
	echo "SCRATCH is at ${SCRATCH_DIRECTORY}"
else
	SCRATCH_DIRECTORY=$SLURM_SUBMIT_DIR/${SLURM_JOB_NAME}-$SLURM_JOB_ID-tmp/
	mkdir -p ${SCRATCH_DIRECTORY}
	echo "unable to access ${scratch_base}/${SLURM_JOB_NAME}/, using ${SCRATCH_DIRECTORY} instead"
fi
cd ${SCRATCH_DIRECTORY}
pwd
if [ -z "${clean_scratch}" ]; then 
	echo "Not cleaning scratch, the current contents are:"
	ls -alh
else
	echo "Cleaning scratch" 
	rm -rf ${SCRATCH_DIRECTORY}/*
fi



#srun: MPI types are...
#srun: pmix_v3
#srun: cray_shasta
#srun: pmix
#srun: pmi2
#srun: none
#RUNNER="srun --mpi=pmix singularity run --nv '-B${SCRATCH_DIRECTORY}:/host_pwd' --pwd /host_pwd ${singularity_image} ${QTEST}"
#RUNNER="srun singularity run --nv '-B${SCRATCH_DIRECTORY}:/host_pwd' --pwd /host_pwd ${singularity_image}"
RUNNER="mpirun singularity run --nv '-B${SCRATCH_DIRECTORY}:/host_pwd' --pwd /host_pwd ${singularity_image}"


###############################################################################
# This section actually runs the job. It needs to be after the previous two 
# sections
#################################################################################

echo "starting calculation at $(date)"
mpirun -npernode 1 bash -c 'echo $CUDA_VISIBLE_DEVICES'
mpirun -npernode 1 nvidia-smi

run_line="${RUNNER} cuquantum-benchmarks ${FRONTEND} ${BACKEND} ${QTEST}"
echo $run_line
start_time="$(date -u +%s)"
eval $run_line
end_time="$(date -u +%s)"
elapsed="$(($end_time-$start_time))"
echo "Total of $elapsed seconds elapsed for process"

echo "Job finished at"
date
################### Job Ended ###################
exit 0