CUDA transfer from device to host is extremely slow

Hello,
I'm using the code below to create a CUDA stream and run inference with an SSD MobileNet V2 320x320 model converted to TensorRT. The inference itself runs fast, but I'm seeing extreme slowness when moving the data back from device to host in the d_to_h step: inference takes about 5 ms while the transfer takes about 20 ms.
Is there anything in the code I can change to improve the transfer speed, or could something else be the issue?

I'm using a Xavier and TensorRT 8.
Thanks

class TensorRTInfer:

    def __init__(self, engine):
        """
        :param engine: The deserialized TensorRT engine to run inference with.
        """
        # Load TRT engine
        self.cfx = cuda.Device(0).make_context()
        self.stream = cuda.Stream()
        self.engine = engine
        self.context = self.engine.create_execution_context()

        # Setup I/O bindings
        self.inputs1 = []
        self.outputs1 = []
        self.allocations1 = []
        for i in range(self.engine.num_bindings):
            name = self.engine.get_binding_name(i)
            dtype = self.engine.get_binding_dtype(i)
            shape = self.engine.get_binding_shape(i)
            size = np.dtype(trt.nptype(dtype)).itemsize
            for s in shape:
                size *= s
            allocation1 = cuda.mem_alloc(size)
            binding1 = {
                'index': i,
                'name': name,
                'dtype': np.dtype(trt.nptype(dtype)),
                'shape': list(shape),
                'allocation': allocation1,
            }
            self.allocations1.append(allocation1)
            if self.engine.binding_is_input(i):
                self.inputs1.append(binding1)
            else:
                self.outputs1.append(binding1)

        # Host-side output buffers (ordinary pageable numpy arrays)
        self.outputs2 = []
        for shape, dtype in self.output_spec():
            self.outputs2.append(np.zeros(shape, dtype))
        print("done building..")

    def input_spec(self):
        """
        Get the specs for the input tensor of the network. Useful to prepare memory allocations.
        :return: Two items, the shape of the input tensor and its (numpy) datatype.
        """
        return self.inputs1[0]['shape'], self.inputs1[0]['dtype']

    def output_spec(self):
        """
        Get the specs for the output tensors of the network. Useful to prepare memory allocations.
        :return: A list with two items per element, the shape and (numpy) datatype of each output tensor.
        """
        specs = []
        for o in self.outputs1:
            specs.append((o['shape'], o['dtype']))
        return specs

    def h_to_d(self, batch):
        # Copy the input batch from host into the device input binding
        self.batch = batch
        cuda.memcpy_htod_async(self.inputs1[0]['allocation'],
                               np.ascontiguousarray(batch), self.stream)

    def destroy(self):
        self.cfx.pop()

    def d_to_h(self):
        # Copy every output binding from device back into the host buffers
        for o in range(len(self.outputs2)):
            cuda.memcpy_dtoh_async(self.outputs2[o], self.outputs1[o]['allocation'], self.stream)
        return self.outputs2

    def infer_this(self):
        # Enqueue inference on the stream (returns before the GPU has finished)
        self.cfx.push()
        self.context.execute_async(batch_size=1, bindings=self.allocations1,
                                   stream_handle=self.stream.handle)
        self.cfx.pop()

Hi,
Could you share the model, script, profiler, and performance output (if not shared already) so that we can help you better?
Alternatively, you can try running your model with the trtexec command.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

While measuring the model performance, make sure you consider the latency and throughput of the network inference only, excluding the data pre- and post-processing overhead.
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-722/best-practices/index.html#measure-performance
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#model-accuracy
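For an engine that is already serialized, a basic trtexec run looks roughly like the command below (the engine path is a placeholder, and flag availability can vary slightly between TensorRT versions). Its summary reports host-to-device and device-to-host latency separately from GPU compute time, which makes it easy to see where the time is going:

trtexec --loadEngine=<your_engine.trt> --iterations=100 --dumpProfile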

Thanks!

model1_trt_16.trt (8.1 MB)

The complete code is:

import os
import sys
import time
from time import sleep
import ctypes
import argparse
import numpy as np
import tensorrt as trt

import pycuda.driver as cuda
import pycuda.autoinit
import threading
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Process, Queue, Manager
import multiprocessing
import cv2


class TensorRTInfer:
    """
    Implements inference for the Model TensorRT engine.
    """

    def __init__(self, engine):
        """
        :param engine: The deserialized TensorRT engine to run inference with.
        """
        # Load TRT engine
        self.cfx = cuda.Device(0).make_context()
        self.stream = cuda.Stream()
        self.engine = engine
        self.context = self.engine.create_execution_context()

        # Setup I/O bindings
        self.inputs1 = []
        self.outputs1 = []
        self.allocations1 = []
        for i in range(self.engine.num_bindings):
            name = self.engine.get_binding_name(i)
            dtype = self.engine.get_binding_dtype(i)
            shape = self.engine.get_binding_shape(i)
            size = np.dtype(trt.nptype(dtype)).itemsize
            for s in shape:
                size *= s
            allocation1 = cuda.mem_alloc(size)
            binding1 = {
                'index': i,
                'name': name,
                'dtype': np.dtype(trt.nptype(dtype)),
                'shape': list(shape),
                'allocation': allocation1,
            }
            self.allocations1.append(allocation1)
            if self.engine.binding_is_input(i):
                self.inputs1.append(binding1)
            else:
                self.outputs1.append(binding1)

        # Host-side output buffers (ordinary pageable numpy arrays)
        self.outputs2 = []
        for shape, dtype in self.output_spec():
            self.outputs2.append(np.zeros(shape, dtype))
        print("done building..")

    def input_spec(self):
        """
        Get the specs for the input tensor of the network. Useful to prepare memory allocations.
        :return: Two items, the shape of the input tensor and its (numpy) datatype.
        """
        return self.inputs1[0]['shape'], self.inputs1[0]['dtype']

    def output_spec(self):
        """
        Get the specs for the output tensors of the network. Useful to prepare memory allocations.
        :return: A list with two items per element, the shape and (numpy) datatype of each output tensor.
        """
        specs = []
        for o in self.outputs1:
            specs.append((o['shape'], o['dtype']))
        return specs

    def h_to_d(self, batch):
        # Copy the input batch from host into the device input binding
        self.batch = batch
        cuda.memcpy_htod_async(self.inputs1[0]['allocation'],
                               np.ascontiguousarray(batch), self.stream)

    def destroy(self):
        self.cfx.pop()

    def d_to_h(self):
        # Copy every output binding from device back into the host buffers
        for o in range(len(self.outputs2)):
            cuda.memcpy_dtoh_async(self.outputs2[o], self.outputs1[o]['allocation'], self.stream)
        return self.outputs2

    def infer_this(self):
        # Enqueue inference on the stream (returns before the GPU has finished)
        self.cfx.push()
        self.context.execute_async(batch_size=1, bindings=self.allocations1,
                                   stream_handle=self.stream.handle)
        self.cfx.pop()


if __name__ == '__main__':
    logger = trt.Logger(trt.Logger.ERROR)
    trt.init_libnvinfer_plugins(logger, namespace="")
    engine1 = None
    with open('/home/zenith/Desktop/model1_16.trt', "rb") as f, trt.Runtime(logger) as runtime:
        engine1 = runtime.deserialize_cuda_engine(f.read())

    mat1 = cv2.imread('/home/zenith/Desktop/tf16/img108.jpg')
    stretch_near1 = cv2.resize(mat1, (640, 640))
    _image1 = np.expand_dims(stretch_near1, axis=0).astype(np.float32)
    images = np.random.rand(1, 640, 640, 3).astype(np.float32)
    trt_infer_big1 = TensorRTInfer(engine1)

    for n in range(100):
        tic = time.perf_counter()

        tiic = time.perf_counter()
        trt_infer_big1.h_to_d(_image1)
        tooc = time.perf_counter()
        print("h_to_d:" + str(tooc - tiic))

        act1 = time.perf_counter()
        trt_infer_big1.infer_this()
        act2 = time.perf_counter()
        print("inference:" + str(act2 - act1))

        teec = time.perf_counter()
        trt_infer_big1.d_to_h()
        toec = time.perf_counter()
        print("d_to_h:" + str(toec - teec))

        toc = time.perf_counter()
        print("whole time:" + str(toc - tic))
        sleep(0.05)

In the above for loop, I'm trying to follow the CUDA concurrency pattern, which should reduce the time considerably compared with a purely sequential approach.
You will notice that d_to_h takes the largest share of time in the loop, while the inference itself takes very little.
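For reference, here is a rough sketch of how the d_to_h stage could be measured with page-locked (pinned) host buffers and explicit stream synchronization. Neither of these is in my current code, so this is only a sketch of what I think a fairer measurement would look like (async copies into ordinary pageable numpy arrays generally fall back to a blocking path, and they also have to wait for the inference queued earlier on the same stream):

# Sketch only: pinned host buffers + explicit sync around the D2H copy.
# pagelocked_empty allocates page-locked memory, which truly asynchronous
# copies require; stream.synchronize() makes the measured interval cover the
# copy itself rather than the tail of the inference still running on the stream.
host_outputs = [cuda.pagelocked_empty(o['shape'], o['dtype'])
                for o in trt_infer_big1.outputs1]

trt_infer_big1.stream.synchronize()      # make sure inference has finished first
t0 = time.perf_counter()
for i, out in enumerate(host_outputs):
    cuda.memcpy_dtoh_async(out, trt_infer_big1.outputs1[i]['allocation'],
                           trt_infer_big1.stream)
trt_infer_big1.stream.synchronize()      # wait for the copies themselves
print("d_to_h (copy only):", time.perf_counter() - t0)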

Any update, please?

Thanks

Hi,

Please refer to the following, which may help you.

Thank you.

Thanks for sharing. However, it still goes through the samples by repeating the full cycle for each one, instead of using a pipeline type of process. The inference takes much less time than moving data between host and device.

That is, the sample code does the following: copy sample1 input from host to device, run inference, copy sample1 output from device to host; then copy sample2 input from host to device, run inference, copy sample2 output from device to host.

Is there a way to do it pipeline style? That is, simultaneously copying sample1 output from device to host while copying sample2 input from host to device.
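For illustration, the pattern I have in mind is roughly the double-buffered sketch below: alternate between two streams, each with its own device and pinned host buffers, so the output copy of one sample can overlap with the input copy of the next. The buffers object, batches list, and execute_async_v2 call are placeholders here, and I believe overlapping the inference itself may also need a separate execution context per stream, so please treat this only as a sketch of the idea:

# Rough sketch of the pipeline idea (placeholders, not tested):
# alternate between two streams so sample N's D2H copy can overlap with
# sample N+1's H2D copy on the other stream.
streams = [cuda.Stream(), cuda.Stream()]

for n, batch in enumerate(batches):          # 'batches' is a placeholder
    s = streams[n % 2]
    buf = buffers[n % 2]                     # per-stream device/pinned-host buffers (placeholder)
    cuda.memcpy_htod_async(buf.d_input, batch, s)
    context.execute_async_v2(bindings=buf.bindings, stream_handle=s.handle)
    cuda.memcpy_dtoh_async(buf.h_output, buf.d_output, s)

for s in streams:
    s.synchronize()                          # collect everything at the end

Thanks!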