NVIDIA cuQuantum Appliance
The NVIDIA cuQuantum Appliance is a containerized software environment for simulating quantum circuits on classical GPU hardware. It is built on NVIDIA's CUDA platform and the cuQuantum SDK (cuStateVec and cuTensorNet) and ships GPU-accelerated versions of widely used simulators, including Qiskit Aer and Cirq/qsim backends, with support for distributed multi-GPU, multi-node state vector simulation. The container is distributed through the NGC catalog and runs on existing GPU clusters, which makes it a practical way to explore large simulations without access to quantum hardware.
NVIDIA cuQuantum team, "NVIDIA/cuQuantum: cuQuantum v22.11", 2022, DOI: 10.5281/zenodo.6385574.
AER and CUSVAER
Qiskit Aer is a quantum circuit simulator developed by IBM for simulating quantum circuits and algorithms. It provides a platform for testing, debugging, and optimizing quantum programs before they are run on actual quantum devices, and supports several simulation methods, including statevector, density matrix, and unitary simulation. cuQuantum Appliance 22.11 contains the first release of cusvaer. cusvaer is designed as a Qiskit Aer backend solver and is optimized for distributed state vector simulation; see the cusvaer documentation linked below for more details. Its main features are:
- Distributed simulation: cusvaer distributes state vector simulations across multiple processes and nodes.
- Power-of-2 configuration for performance: the number of GPUs, processes, and nodes should always be a power of two; this design was chosen for optimal performance.
- Shipped with a validated set of libraries: cusvaer ships with the latest cuStateVec and a validated set of MPI libraries, and its performance has been validated on NVIDIA SuperPOD.
When the state vector is sliced, the size of each sub state vector follows from the number of qubits in the circuit and the number of GPUs used for the distribution.
For example, a 40-qubit complex128 (c128, double-precision) simulation can be run on 32 DGX A100 nodes with the following configuration:
- State vector size: 16 TiB (= 16 bytes/element x 2^40 elements)
- GPUs per node: 8
- Number of nodes: 32
- Sub state vector size per GPU: 64 GiB = 16 TiB / (32 nodes x 8 GPUs/node)
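As a sanity check, this arithmetic can be reproduced with a few lines of shell; the variable names below are purely illustrative.
#!/usr/bin/env bash
# Per-GPU sub state vector size for a distributed state vector simulation.
# Assumes complex128 amplitudes (16 bytes each); use 8 bytes for complex64.
nqubits=40
bytes_per_amplitude=16
nodes=32
gpus_per_node=8
total_bytes=$(( bytes_per_amplitude * (1 << nqubits) ))
per_gpu_bytes=$(( total_bytes / (nodes * gpus_per_node) ))
echo "state vector size : $(( total_bytes / 1024**4 )) TiB"
echo "per-GPU slice     : $(( per_gpu_bytes / 1024**3 )) GiB"
With the values above this prints 16 TiB and 64 GiB, matching the configuration just listed.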
Results
Maximum Number of Qubits and Precision
The memory requirements of multi-node multi-GPU simulation are detailed in the cusvaer documentation; basically, each additional qubit doubles the required resources.
VEGA GPU nodes contain 4x 40 GB A100 GPUs. Since the per-GPU slice of the state vector must be a power of two in size, at most 32 GiB per GPU (out of the roughly 37 GiB available) can actually be used. I was not entirely sure whether additional memory is lost to scaling and/or optimization overhead, so I took an empirical approach and asked the benchmark utility how many qubits it could allocate.
Nodes | GPUs (per node x nodes) | Precision | Max qubits |
---|---|---|---|
1 | 2x1 | Single | 33 |
1 | 4x1 | Single | 34 |
1 | 2x1 | Double | 32 |
1 | 4x1 | Double | 33 |
16 | 4x16 | Double | 37 |
16 | 4x16 | Single | 38 |
For a single node, the simulator can allocate 33 qubits in double precision and 34 qubits in single precision. The scaling is as expected, i.e. each additional qubit doubles the required resources. If there is some overhead, it seems to fit within the "remainder" memory of the GPUs (remember, the 33-qubit double-precision run needs 32 GiB ≈ 34.4 GB per GPU out of the 40 GB available). So the theoretical maximum number of qubits that can be reached using 32 nodes on VEGA is 38 qubits in double precision or 39 qubits in single precision.
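The inverse calculation, i.e. the largest qubit count that fits in a given amount of aggregate GPU memory, can be sketched the same way (illustrative names; it assumes nothing beyond the state vector itself has to fit):
# Largest state vector (in qubits) that fits in the aggregate GPU memory.
mem_per_gpu_gib=32       # usable power-of-two slice on a 40 GB A100
ngpus=128                # 32 nodes x 4 GPUs on VEGA
bytes_per_amplitude=16   # complex128; use 8 for complex64
total_bytes=$(( mem_per_gpu_gib * 1024**3 * ngpus ))
qubits=0
while (( bytes_per_amplitude * (1 << (qubits + 1)) <= total_bytes )); do
  qubits=$(( qubits + 1 ))
done
echo "max qubits: $qubits"   # prints 38 for the values above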
If there were 4 more GPU nodes (64 nodes in total), the (mostly marketing-related) "supremacy" limit of 40 qubits would have been reachable, at least in single precision.
It remains to be seen if single precision is suitable for our needs.
GPU communication and runtime (all timings are in seconds)
Nodes | GPU P2P | Precision | Qubits | QFT | iQFT | GHZ | Simon | Hidden shift | QAOA | QPE | QV | random |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2x1 | Double | 32 | 10.135070 | 10.310648 | 1.847320 | 1.960876 | 1.772713 | 9.399043 | 9.399043 | 10.264266 | 7.718304 |
1 | 2x1 | Single | 32 | 4.705123 | 4.939964 | 0.515312 | 0.441822 | 0.478120 | 4.340655 | 4.340655 | 4.340655 | 4.340655 |
1 | 4x1 | Double | 32 | 4.802190 | 5.194264 | 0.657862 | 0.695858 | 0.607753 | 4.538931 | 4.924045 | 5.359411 | 4.137720 |
1 | 4x1 | Single | 32 | 2.434304 | 2.713469 | 0.347724 | 0.344743 | 0.322620 | 2.337676 | 2.578627 | 2.724031 | |
2 | 8x1 | Double | 32 | 6.639592 | 8.845513 | 2.442964 | ||||||
2 | 8x1 | Single | 32 | 3.300662 | 4.455191 | 1.214069 | ||||||
2 | 4x2 | Double | 32 | 6.837345 | 8.845399 | 2.427653 |
When there are multiple nodes, tests after GHZ fail: the SLURM manager intervenes and cancels the job, and the reason is not explained on the user side. The message I get is: notice: cudaErrorInvalidResourceHandle, failed to get device memory pointer by CUDA IPC
Large number of Qubits
I was only able to allocate 16 GPU nodes; the problem with inter-node GPU communication remains an open issue.
Nodes | Precision | Qubits | QFT | iQFT | GHZ |
---|---|---|---|---|---|
16 | Single | 37 | 22.494051 | 30.920763 | 9.162877 |
16 | Double | 37 | Failed | | |
In the double-precision 16-node experiment, the system launched the benchmark on all nodes but behaved as if only 2 nodes were able to communicate with each other. The reason is unclear.
Technical details
Base Singularity image
Download the base image from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuquantum-appliance
The benchmark suite is at https://github.com/NVIDIA/cuQuantum
singularity build cuQuantumApp22.11.sif docker://nvcr.io/nvidia/cuquantum-appliance:22.11
singularity build --sandbox cuQuantumApp22.11-benchmark-obm130323 cuQuantumApp22.11.sif
apptainer shell --writable --fakeroot cuQuantumApp22.11-benchmark-obm130323/
cd /workspace
git clone https://github.com/NVIDIA/cuQuantum.git
cd cuQuantum
cd benchmarks
pip install .[all]
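Before submitting a multi-node job, the benchmark suite can be smoke-tested on a single GPU directly from the sandbox built above. The benchmark name and qubit count here are only an illustration; the flags are the same ones used in the job script further down.
singularity run --nv cuQuantumApp22.11-benchmark-obm130323/ \
    cuquantum-benchmarks --frontend qiskit --backend cusvaer \
    --benchmark qft --nqubits 30 --ngpus 1 --precision single --verbose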
Disable NVIDIA disclaimer message
In /usr/local/bin/entrypoint.sh
Comment out line
#cat /cuQuantum_license.txt
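The same change can be scripted when preparing the sandbox; a minimal sketch, which simply comments out any line in the entrypoint that references the license file:
sed -i '/cuQuantum_license.txt/ s/^/#/' /usr/local/bin/entrypoint.sh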
Required modules (VEGA)
cusvaer relies on several technologies in order to "meld" multiple GPUs on multiple nodes into a single logical device. The details are not entirely clear to me. When I run
module load UCX-CUDA/1.12.1-GCCcore-11.3.0-CUDA-11.7.0 OpenMPI/4.1.4-GCC-11.3.0
on VEGA, all the loaded module files are:
Currently Loaded Modules:
1) GCCcore/11.3.0 8) binutils/2.38-GCCcore-11.3.0 15) libevent/2.1.12-GCCcore-11.3.0
2) zlib/1.2.12-GCCcore-11.3.0 9) GCC/11.3.0 16) libfabric/1.15.1-GCCcore-11.3.0
3) numactl/2.0.14-GCCcore-11.3.0 10) XZ/5.2.5-GCCcore-11.3.0 17) PMIx/4.1.2-GCCcore-11.3.0
4) UCX/1.12.1-GCCcore-11.3.0 11) libxml2/2.9.13-GCCcore-11.3.0 18) UCC/1.0.0-GCCcore-11.3.0
5) CUDA/11.7.0 12) libpciaccess/0.16-GCCcore-11.3.0 19) OpenMPI/4.1.4-GCC-11.3.0
6) GDRCopy/2.3-GCCcore-11.3.0 13) hwloc/2.7.1-GCCcore-11.3.0
7) UCX-CUDA/1.12.1-GCCcore-11.3.0-CUDA-11.7.0 14) OpenSSL/1.1
OpenMPI and UCX are critical; cusvaer provides built-in support for OpenMPI and MPICH.
The versions shown below are validated or expected to work (commands to check the versions installed on the host follow the list).
- OpenMPI
  - Validated: v4.1.4 / UCX v1.13.1
  - Expected to work: v3.0.x, v3.1.x, v4.0.x, v4.1.x
- MPICH
  - Validated: v4.0.2
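To compare a host installation against this list, the installed versions can be queried with the standard OpenMPI and UCX tools:
ompi_info --version   # OpenMPI version
ucx_info -v           # UCX version and build configuration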
It seems there are special considerations to be taken into account when compiling UCX for NVIDIA cards, and also for enabling P2P communication between the cards within a node. Maybe this is related to NVIDIA GPUDirect RDMA?
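For reference, a CUDA-aware UCX/OpenMPI build typically looks roughly like the sketch below; the prefixes and paths are placeholders, not the recipe actually used for the VEGA modules.
# UCX with CUDA and GDRCopy support (placeholder paths)
./contrib/configure-release --prefix=$HOME/opt/ucx \
    --with-cuda=/usr/local/cuda --with-gdrcopy=/usr
make -j && make install
# OpenMPI built against that UCX and the same CUDA
./configure --prefix=$HOME/opt/openmpi \
    --with-cuda=/usr/local/cuda --with-ucx=$HOME/opt/ucx
make -j && make install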
Drivers etc.
Not only the cuQuantum Appliance, but also sharing GPU memory over UCX and P2P sharing, requires the latest NVIDIA driver and CUDA 11.7.
Previous versions of CUDA, and UCX builds compiled against different NVIDIA libraries, kill the node!
Job script
Below is an example using 16 nodes with 4 GPUs each.
Notice the following line:
BACKEND="--backend cusvaer --ngpus 1 --cusvaer-global-index-bits 2,4 --cusvaer-p2p-device-bits 2"
The Qiskit frontend sees a single backend device even though all 16x4 GPUs work together; each MPI process drives one GPU (--ngpus 1). According to the cusvaer documentation, --cusvaer-global-index-bits describes the network structure: the first value (2) is log2 of the number of GPUs within a node (2^2 = 4) and the second value (4) is log2 of the number of nodes (2^4 = 16), so together they describe 2^(2+4) = 64 processes. --cusvaer-p2p-device-bits 2 declares that groups of 2^2 = 4 GPUs, i.e. the GPUs within one node, can communicate directly via GPUDirect P2P.
Details are here https://github.com/NVIDIA/cuQuantum/tree/main/benchmarks
and here https://docs.nvidia.com/cuda/cuquantum/appliance/cusvaer.html
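Under this interpretation the flags scale with the cluster shape as follows (4 GPUs per node, one process per GPU; this mapping is my reading of the documentation and has only been validated here for the 16-node case):
#  2 nodes x 4 GPUs =  8 processes: --cusvaer-global-index-bits 2,1 --cusvaer-p2p-device-bits 2
#  4 nodes x 4 GPUs = 16 processes: --cusvaer-global-index-bits 2,2 --cusvaer-p2p-device-bits 2
# 16 nodes x 4 GPUs = 64 processes: --cusvaer-global-index-bits 2,4 --cusvaer-p2p-device-bits 2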
#!/bin/bash
################### cuQuantum Appliance Benchmark Job Script Example ###################
# Section for defining queue-system variables:
#-------------------------------------
# SLURM-section
#SBATCH --partition=gpu
#SBATCH --nodes=16
#SBATCH --gres=gpu:4
#SBATCH --mem-per-cpu=20G
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=vega-cuQuantum-37q-16NODE-GRES4-NPN4-isolatedgpus
#SBATCH --time=10:00:00
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
###########################################################
# This section is for defining job variables and settings
# that need to be defined before running the job
###########################################################
#Number of actual processors to be used and threads
export OMP_NUM_THREADS=1
export OMPI_MCA_PML="ucx"
export OMPI_MCA_osc="ucx"
export UCX_TLS="self,rc_x,dc_x,sm,cuda_copy,cuda_ipc"
export HCOLL_CUDA_BCOL=nccl
export HCOLL_BCOL_P2P_CUDA_ZCOPY_ALLREDUCE_ALG=2
# A unique file tag for the created files
file_tag=$( date +"%d%m%y-%H%M" )-$SLURM_JOB_NAME
# We load all the default program system settings with module load:
#module load NVHPC/22.7
#module load OpenMPI/4.0.5-NVHPC-21.2-CUDA-11.2.1 UCX-CUDA
module load UCX-CUDA/1.12.1-GCCcore-11.3.0-CUDA-11.7.0 OpenMPI/4.1.4-GCC-11.3.0
clean_scratch=1
singularity_image="/ceph/hpc/home/euosmanbm/Prog/bins/singularity/cuQuantumApp22.11-benchmark-obm.sif"
# Supported Backends:
# - aer: runs Qiskit Aer's CPU backend
# - aer-cuda: runs the native Qiskit Aer GPU backend
# - aer-cusv: runs Qiskit Aer's cuStateVec integration
# - cusvaer: runs the *multi-GPU, multi-node* custom Qiskit Aer GPU backend, only
# available in the cuQuantum Appliance container
# - cirq: runs Cirq's native CPU backend (cirq.Simulator)
# - cutn: runs cuTensorNet by constructing the tensor network corresponding to the
# benchmark circuit (through cuquantum.CircuitToEinsum)
# - qsim: runs qsim's CPU backend
# - qsim-cuda: runs the native qsim GPU backend
# - qsim-cusv: runs qsim's cuStateVec integration
# - qsim-mgpu: runs the *multi-GPU* (single-node) custom qsim GPU backend, only
# available in the cuQuantum Appliance container
BACKEND="--backend cusvaer --ngpus 1 --cusvaer-global-index-bits 2,4 --cusvaer-p2p-device-bits 2"
# Frontends: cirq,qiskit
FRONTEND="--frontend qiskit"
#
QTEST="--benchmark all --nqubits 37 --precision double --verbose "
#WORKDIR should be set from the SLURM initialisation in VEGA
#NOTICE : I override WORKDIR because the benchmark writes its result JSON files to the working directory
WORKDIR=$SLURM_SUBMIT_DIR/$SLURM_JOB_ID
if [ -z ${WORKDIR} ]; then
scratch_base=/scratch/slurm/$SLURM_JOB_ID
echo "WORKDIR was not defined, scratch base is $scratch_base"
else
scratch_base=$WORKDIR
echo "using WORKDIR=$WORKDIR"
fi
SCRATCH_DIRECTORY=${scratch_base}/${SLURM_JOB_NAME}/
mkdir -p ${SCRATCH_DIRECTORY}
if [ -d ${SCRATCH_DIRECTORY} ]; then
echo "SCRATCH is at ${SCRATCH_DIRECTORY}"
else
SCRATCH_DIRECTORY=$SLURM_SUBMIT_DIR/${SLURM_JOB_NAME}-$SLURM_JOB_ID-tmp/
mkdir -p ${SCRATCH_DIRECTORY}
echo "unable to access ${scratch_base}/${SLURM_JOB_NAME}/, using ${SCRATCH_DIRECTORY} instead"
fi
cd ${SCRATCH_DIRECTORY}
pwd
if [ -z ${clean_scratch} ]; then
echo "Not cleaning scratch, the current contents are:"
ls -alh
else
echo "Cleaning scratch"
rm -rf ${SCRATCH_DIRECTORY}/*
fi
#srun: MPI types are...
#srun: pmix_v3
#srun: cray_shasta
#srun: pmix
#srun: pmi2
#srun: none
#RUNNER="srun --mpi=pmix singularity run --nv '-B${SCRATCH_DIRECTORY}:/host_pwd' --pwd /host_pwd ${singularity_image} ${QTEST}"
#RUNNER="srun singularity run --nv '-B${SCRATCH_DIRECTORY}:/host_pwd' --pwd /host_pwd ${singularity_image}"
RUNNER="mpirun singularity run --nv '-B${SCRATCH_DIRECTORY}:/host_pwd' --pwd /host_pwd ${singularity_image}"
###############################################################################
# This section actually runs the job. It needs to be after the previous two
# sections
#################################################################################
echo "starting calculation at $(date)"
mpirun -npernode 1 bash -c 'echo $CUDA_VISIBLE_DEVICES'
mpirun -npernode 1 nvidia-smi
run_line="${RUNNER} cuquantum-benchmarks ${FRONTEND} ${BACKEND} ${QTEST}"
echo $run_line
start_time="$(date -u +%s)"
eval $run_line
end_time="$(date -u +%s)"
elapsed="$(($end_time-$start_time))"
echo "Total of $elapsed seconds elapsed for process"
echo "Job finished at"
date
################### Job Ended ###################
exit 0