I am optimizing face recognition code using TensorRT on AGX Orin. However, the host-to-device and device-to-host memory transfers run sequentially. To improve GPU utilization, can these operations be overlapped, i.e. 1) the first cycle performs the first host-to-device transfer, and 2) the second cycle performs the second host-to-device transfer together with the first device-to-host transfer?
Kindly guide.
Optimizing face recognition with TensorRT on Jetson AGX Orin is a good way to leverage the GPU acceleration, and the transfer pattern you describe is achievable.
Understanding the Issue
You are seeing sequential memory transfers between the host (CPU) and device (GPU), which creates a performance bottleneck, and you want to know whether these operations can run in parallel to improve GPU utilization.
TensorRT and Memory Transfers
TensorRT is a software development kit (SDK) for optimizing deep learning models on NVIDIA GPUs. It provides a set of tools and APIs for model optimization, inference, and deployment. When using TensorRT on Jetson AGX Orin, memory transfers between the Host and Device are necessary to move data between the CPU and GPU.
Parallelizing Memory Transfers
To parallelize memory transfers, you can utilize the following techniques:
- Asynchronous memory transfers: Use cudaMemcpyAsync to copy data between the host and device. The call returns immediately and the copy proceeds in the background, so you can keep issuing other work while the transfer is in flight. For the copy to be truly asynchronous, the host buffer must be pinned (allocated with cudaMallocHost or cudaHostAlloc); with pageable memory the runtime falls back to synchronous behavior.
- Streams: Use CUDA streams to execute multiple kernels and memory transfers concurrently. A CUDA stream is an ordered sequence of commands executed on the GPU; commands in different streams may overlap, so a host-to-device copy on one stream can run alongside a device-to-host copy on another.
- Pipelining: Break the face recognition pipeline into smaller stages (upload, inference, download) and double-buffer them so the stages of consecutive frames execute concurrently, overlapping memory transfers with kernel executions (a looped sketch of this appears after the example below).
Example Code
To illustrate the concept, here’s a simplified example using CUDA streams and asynchronous memory transfers:
```cpp
#include <cuda_runtime.h>

const size_t size = 1024 * sizeof(float);  // illustrative buffer size

// Create two CUDA streams
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Allocate pinned host memory and device memory
// (pinned memory is required for the copies to be truly asynchronous)
float *hostData1, *hostData2, *deviceData1, *deviceData2;
cudaMallocHost((void**)&hostData1, size);
cudaMallocHost((void**)&hostData2, size);
cudaMalloc((void**)&deviceData1, size);
cudaMalloc((void**)&deviceData2, size);

// First cycle: host-to-device transfer 1 on stream1
cudaMemcpyAsync(deviceData1, hostData1, size, cudaMemcpyHostToDevice, stream1);

// Second cycle: host-to-device transfer 2 on stream2 overlaps with
// device-to-host transfer 1 on stream1
cudaMemcpyAsync(deviceData2, hostData2, size, cudaMemcpyHostToDevice, stream2);
cudaMemcpyAsync(hostData1, deviceData1, size, cudaMemcpyDeviceToHost, stream1);

// Wait for both streams to finish
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

// Release streams and memory
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
cudaFreeHost(hostData1);
cudaFreeHost(hostData2);
cudaFree(deviceData1);
cudaFree(deviceData2);
```
In this example, we create two CUDA streams and allocate pinned host memory and device memory. In the first cycle we issue host-to-device transfer 1 on stream1; in the second cycle we issue host-to-device transfer 2 on stream2 while device-to-host transfer 1 runs on stream1. Each cudaMemcpyAsync call returns immediately and executes on the stream passed as its last argument, which is what allows the two cycles to overlap.
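To extend this into the pipelining approach described above, a common pattern is to double-buffer inside a loop so that each frame's download overlaps the next frame's upload. Below is a minimal sketch under some assumptions: processFrame is a hypothetical placeholder kernel standing in for your TensorRT inference step, and the host buffers are assumed to be pinned and pre-allocated.
```cpp
#include <cuda_runtime.h>

// Hypothetical placeholder for the per-frame GPU work; in practice this
// would be your TensorRT inference call issued on the same stream.
__global__ void processFrame(float* data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// pinnedIn/pinnedOut: pinned host buffers; deviceBuf: device buffers.
// Two of each, so odd and even frames can ping-pong.
void pipeline(float* pinnedIn[2], float* pinnedOut[2], float* deviceBuf[2],
              size_t n, int numFrames) {
    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (int f = 0; f < numFrames; ++f) {
        int buf = f % 2;                 // ping-pong buffer/stream index
        cudaStream_t s = streams[buf];

        // Stage 1: upload frame f; may overlap the previous frame's
        // kernel and download, which were issued on the other stream
        cudaMemcpyAsync(deviceBuf[buf], pinnedIn[buf], n * sizeof(float),
                        cudaMemcpyHostToDevice, s);
        // Stage 2: process frame f
        processFrame<<<(unsigned)((n + 255) / 256), 256, 0, s>>>(deviceBuf[buf], n);
        // Stage 3: download frame f
        cudaMemcpyAsync(pinnedOut[buf], deviceBuf[buf], n * sizeof(float),
                        cudaMemcpyDeviceToHost, s);
        // NOTE: before the host refills pinnedIn[buf] for frame f + 2,
        // synchronize on this stream (or a recorded cudaEvent) so the
        // in-flight copy above has finished reading the buffer.
    }
    cudaStreamSynchronize(streams[0]);
    cudaStreamSynchronize(streams[1]);
    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
}
```
Because commands within a single stream execute in order, each frame's upload, kernel, and download stay correctly sequenced, while the two streams let adjacent frames overlap.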
Conclusion
By using asynchronous memory transfers, CUDA streams, and pipelining, you can parallelize memory transfers between the Host and Device on Jetson AGX Orin, optimizing GPU performance for your face recognition code. Experiment with these techniques to find the best approach for your specific use case.
Hi,
Yes, you can do it.
Please use different CUDA streams for the cycles so that they can run in parallel.
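For instance, here is a minimal sketch of that advice, assuming TensorRT 8.x (where IExecutionContext::enqueueV2 accepts a stream) and one execution context per stream; the binding layout (input at index 0, output at index 1) and all buffer names are illustrative assumptions:
```cpp
#include <cuda_runtime.h>
#include <NvInfer.h>

// Two execution contexts created from the same ICudaEngine, so two
// inference cycles can be in flight at once. The device buffers in
// bindings1/bindings2 are assumed pre-allocated with cudaMalloc, and
// the host buffers pinned with cudaMallocHost.
void twoCycleInfer(nvinfer1::IExecutionContext* context1,
                   nvinfer1::IExecutionContext* context2,
                   void** bindings1, void** bindings2,
                   float* pinnedIn1, float* pinnedIn2, float* pinnedOut1,
                   size_t inBytes, size_t outBytes) {
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    // Cycle 1: upload input 1 and launch inference on stream1
    cudaMemcpyAsync(bindings1[0], pinnedIn1, inBytes,
                    cudaMemcpyHostToDevice, stream1);
    context1->enqueueV2(bindings1, stream1, nullptr);

    // Cycle 2: upload input 2 and launch inference on stream2 while
    // stream1 is still busy, then download result 1 on stream1
    cudaMemcpyAsync(bindings2[0], pinnedIn2, inBytes,
                    cudaMemcpyHostToDevice, stream2);
    context2->enqueueV2(bindings2, stream2, nullptr);
    cudaMemcpyAsync(pinnedOut1, bindings1[1], outBytes,
                    cudaMemcpyDeviceToHost, stream1);

    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);
    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
}
```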
Please find an example in our TensorRT repo below:
Thanks.