In my latest benchmarks, I noticed that the real-space implementation does not use GPU resources as efficiently as the default (non-real-space) one.
Now look at this code excerpt from PW/src/sum_band_gpu.f90:
CALL start_clock_gpu( 'sum_band:calbec' )
npw = ngk(ik)
IF ( .NOT. real_space ) THEN
   CALL using_evc_d(0)
   CALL using_becp_d_auto(2)
   ! calbec computes becp = <vkb_i|psi_j>
   !$acc data present(vkb(:,:))
   !$acc host_data use_device(vkb)
   CALL calbec_gpu( npw, vkb, evc_d(:,ibnd_start:ibnd_end), becp_d )
   !$acc end host_data
   !$acc end data
ELSE
   CALL using_evc(0)
   CALL using_becp_auto(2)
   if (gamma_only) then
      do ibnd = ibnd_start, ibnd_end, 2
         call invfft_orbital_gamma(evc,ibnd,ibnd_end)
         call calbec_rs_gamma(ibnd,ibnd_end,becp%r)
      enddo
      call mp_sum(becp%r,inter_bgrp_comm)
   else
      current_k = ik
      becp%k = (0.d0,0.d0)
      do ibnd = ibnd_start, ibnd_end
         call invfft_orbital_k(evc,ibnd,ibnd_end)
         call calbec_rs_k(ibnd,ibnd_end)
      enddo
      call mp_sum(becp%k,inter_bgrp_comm)
   endif
ENDIF
CALL stop_clock_gpu( 'sum_band:calbec' )
It seems the real_space flag bypasses the OpenACC region and the GPU variant (calbec_gpu): the ELSE branch works on the host copies (using_evc, using_becp_auto) instead. I'll need to profile to see whether this makes a big difference, but for the moment I am just disabling real_space. TQR seems to call the GPU routines.
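
For anyone who wants to try the same workaround, disabling the flag is an input-file change only. Below is a minimal sketch of the relevant pw.x namelist fragment, assuming real_space is set in the &ELECTRONS namelist as in the standard pw.x input description. The rest of the input is not shown and stays as it is.

```fortran
&ELECTRONS
   ! Assumption: all other &ELECTRONS settings stay as in your input.
   ! With real_space off, sum_band takes the calbec_gpu branch
   ! (the IF ( .NOT. real_space ) part of the excerpt).
   real_space = .false.
   ! tqr is a separate flag and is not touched here.
/
```

Comparing the 'sum_band:calbec' clock between a run with real_space = .true. and one with .false. should show how much this branch actually costs, provided that clock appears in the timing report of your build.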