CUDA transfer from device to host is extremely slow

Hello,
I'm using the code below to create a CUDA stream and run inference with an SSD MobileNet V2 320x320 model converted to TensorRT. The inference itself runs fast, but I'm seeing extreme slowness when moving the data back from device to host in the d_to_h step: inference takes about 5 ms while the transfer takes about 20 ms.
Is there anything in the code I can change to improve the transfer speed, or could something else be the issue?

I'm using a Xavier and TensorRT 8.
Thanks

class TensorRTInfer:

    def __init__(self, engine):
        """
        :param engine: The deserialized TensorRT engine to run inference with.
        """
        # Load TRT engine
        self.cfx = cuda.Device(0).make_context()
        self.stream = cuda.Stream()
        self.engine = engine
        self.context = self.engine.create_execution_context()

        # Setup I/O bindings
        self.inputs1 = []
        self.outputs1 = []
        self.allocations1 = []
        for i in range(self.engine.num_bindings):
            name = self.engine.get_binding_name(i)
            dtype = self.engine.get_binding_dtype(i)
            shape = self.engine.get_binding_shape(i)
            size = np.dtype(trt.nptype(dtype)).itemsize
            for s in shape:
                size *= s
            allocation1 = cuda.mem_alloc(size)
            binding1 = {
                'index': i,
                'name': name,
                'dtype': np.dtype(trt.nptype(dtype)),
                'shape': list(shape),
                'allocation': allocation1,
            }
            self.allocations1.append(allocation1)
            if self.engine.binding_is_input(i):
                self.inputs1.append(binding1)
            else:
                self.outputs1.append(binding1)

        # Host-side output buffers (ordinary pageable numpy arrays)
        self.outputs2 = []
        for shape, dtype in self.output_spec():
            self.outputs2.append(np.zeros(shape, dtype))
        print("done building..")

    def input_spec(self):
        """
        Get the specs for the input tensor of the network. Useful to prepare memory allocations.
        :return: Two items, the shape of the input tensor and its (numpy) datatype.
        """
        return self.inputs1[0]['shape'], self.inputs1[0]['dtype']

    def output_spec(self):
        """
        Get the specs for the output tensors of the network. Useful to prepare memory allocations.
        :return: A list with two items per element, the shape and (numpy) datatype of each output tensor.
        """
        specs = []
        for o in self.outputs1:
            specs.append((o['shape'], o['dtype']))
        return specs

    def h_to_d(self, batch):
        # Copy the input batch from host into the device input binding
        self.batch = batch
        cuda.memcpy_htod_async(self.inputs1[0]['allocation'],
                               np.ascontiguousarray(batch), self.stream)

    def destroy(self):
        self.cfx.pop()

    def d_to_h(self):
        # Copy every output binding from device back into the host buffers
        for o in range(len(self.outputs2)):
            cuda.memcpy_dtoh_async(self.outputs2[o], self.outputs1[o]['allocation'], self.stream)
        return self.outputs2

    def infer_this(self):
        # Enqueue inference on the stream (returns before the GPU has finished)
        self.cfx.push()
        self.context.execute_async(batch_size=1, bindings=self.allocations1,
                                   stream_handle=self.stream.handle)
        self.cfx.pop()

Hi,
Could you share the model, script, profiler, and performance output (if not shared already) so that we can help you better?
Alternatively, you can try running your model with the trtexec command.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

While measuring the model performance, make sure you consider the latency and throughput of the network inference only, excluding the data pre- and post-processing overhead.
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-722/best-practices/index.html#measure-performance
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#model-accuracy
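For an engine that is already serialized, a basic trtexec run looks roughly like the command below (the engine path is a placeholder, and flag availability can vary slightly between TensorRT versions). Its summary reports host-to-device and device-to-host latency separately from GPU compute time, which makes it easy to see where the time is going:

trtexec --loadEngine=<your_engine.trt> --iterations=100 --dumpProfile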

Thanks!

model1_trt_16.trt (8.1 MB)

The complete code is:

import os
import sys
import time
from time import sleep
import ctypes
import argparse
import numpy as np
import tensorrt as trt

import pycuda.driver as cuda
import pycuda.autoinit
import threading
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Process, Queue, Manager
import multiprocessing
import cv2


class TensorRTInfer:
    """
    Implements inference for the Model TensorRT engine.
    """

    def __init__(self, engine):
        """
        :param engine: The deserialized TensorRT engine to run inference with.
        """
        # Load TRT engine
        self.cfx = cuda.Device(0).make_context()
        self.stream = cuda.Stream()
        self.engine = engine
        self.context = self.engine.create_execution_context()

        # Setup I/O bindings
        self.inputs1 = []
        self.outputs1 = []
        self.allocations1 = []
        for i in range(self.engine.num_bindings):
            name = self.engine.get_binding_name(i)
            dtype = self.engine.get_binding_dtype(i)
            shape = self.engine.get_binding_shape(i)
            size = np.dtype(trt.nptype(dtype)).itemsize
            for s in shape:
                size *= s
            allocation1 = cuda.mem_alloc(size)
            binding1 = {
                'index': i,
                'name': name,
                'dtype': np.dtype(trt.nptype(dtype)),
                'shape': list(shape),
                'allocation': allocation1,
            }
            self.allocations1.append(allocation1)
            if self.engine.binding_is_input(i):
                self.inputs1.append(binding1)
            else:
                self.outputs1.append(binding1)

        # Host-side output buffers (ordinary pageable numpy arrays)
        self.outputs2 = []
        for shape, dtype in self.output_spec():
            self.outputs2.append(np.zeros(shape, dtype))
        print("done building..")

    def input_spec(self):
        """
        Get the specs for the input tensor of the network. Useful to prepare memory allocations.
        :return: Two items, the shape of the input tensor and its (numpy) datatype.
        """
        return self.inputs1[0]['shape'], self.inputs1[0]['dtype']

    def output_spec(self):
        """
        Get the specs for the output tensors of the network. Useful to prepare memory allocations.
        :return: A list with two items per element, the shape and (numpy) datatype of each output tensor.
        """
        specs = []
        for o in self.outputs1:
            specs.append((o['shape'], o['dtype']))
        return specs

    def h_to_d(self, batch):
        # Copy the input batch from host into the device input binding
        self.batch = batch
        cuda.memcpy_htod_async(self.inputs1[0]['allocation'],
                               np.ascontiguousarray(batch), self.stream)

    def destroy(self):
        self.cfx.pop()

    def d_to_h(self):
        # Copy every output binding from device back into the host buffers
        for o in range(len(self.outputs2)):
            cuda.memcpy_dtoh_async(self.outputs2[o], self.outputs1[o]['allocation'], self.stream)
        return self.outputs2

    def infer_this(self):
        # Enqueue inference on the stream (returns before the GPU has finished)
        self.cfx.push()
        self.context.execute_async(batch_size=1, bindings=self.allocations1,
                                   stream_handle=self.stream.handle)
        self.cfx.pop()


if __name__ == '__main__':
    logger = trt.Logger(trt.Logger.ERROR)
    trt.init_libnvinfer_plugins(logger, namespace="")
    engine1 = None
    with open('/home/zenith/Desktop/model1_16.trt', "rb") as f, trt.Runtime(logger) as runtime:
        engine1 = runtime.deserialize_cuda_engine(f.read())

    mat1 = cv2.imread('/home/zenith/Desktop/tf16/img108.jpg')
    stretch_near1 = cv2.resize(mat1, (640, 640))
    _image1 = np.expand_dims(stretch_near1, axis=0).astype(np.float32)
    images = np.random.rand(1, 640, 640, 3).astype(np.float32)
    trt_infer_big1 = TensorRTInfer(engine1)

    for n in range(100):
        tic = time.perf_counter()

        tiic = time.perf_counter()
        trt_infer_big1.h_to_d(_image1)
        tooc = time.perf_counter()
        print("h_to_d:" + str(tooc - tiic))

        act1 = time.perf_counter()
        trt_infer_big1.infer_this()
        act2 = time.perf_counter()
        print("inference:" + str(act2 - act1))

        teec = time.perf_counter()
        trt_infer_big1.d_to_h()
        toec = time.perf_counter()
        print("d_to_h:" + str(toec - teec))

        toc = time.perf_counter()
        print("whole time:" + str(toc - tic))
        sleep(0.05)

In the above for loop, I'm trying to follow the CUDA concurrency pattern, which should reduce the time considerably compared with a purely sequential approach.
You will notice that d_to_h takes the largest share of time in the loop, while the inference itself takes very little.
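For reference, here is a rough sketch of how the d_to_h stage could be measured with page-locked (pinned) host buffers and explicit stream synchronization. Neither of these is in my current code, so this is only a sketch of what I think a fairer measurement would look like (async copies into ordinary pageable numpy arrays generally fall back to a blocking path, and they also have to wait for the inference queued earlier on the same stream):

# Sketch only: pinned host buffers + explicit sync around the D2H copy.
# pagelocked_empty allocates page-locked memory, which truly asynchronous
# copies require; stream.synchronize() makes the measured interval cover the
# copy itself rather than the tail of the inference still running on the stream.
host_outputs = [cuda.pagelocked_empty(o['shape'], o['dtype'])
                for o in trt_infer_big1.outputs1]

trt_infer_big1.stream.synchronize()      # make sure inference has finished first
t0 = time.perf_counter()
for i, out in enumerate(host_outputs):
    cuda.memcpy_dtoh_async(out, trt_infer_big1.outputs1[i]['allocation'],
                           trt_infer_big1.stream)
trt_infer_big1.stream.synchronize()      # wait for the copies themselves
print("d_to_h (copy only):", time.perf_counter() - t0)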

Any update, please?

Thanks

Hi,

Please refer to the following, which may help you.

Thank you.

Thanks for sharing. However, it still goes through the samples by repeating the full cycle for each one, instead of using a pipeline type of process. The inference takes much less time than moving data between host and device.

That is, the sample code does the following: copy sample1 input from host to device, run inference, copy sample1 output from device to host; then copy sample2 input from host to device, run inference, copy sample2 output from device to host.

Is there a way to do it pipeline style? That is, simultaneously copying sample1 output from device to host while copying sample2 input from host to device.
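For illustration, the pattern I have in mind is roughly the double-buffered sketch below: alternate between two streams, each with its own device and pinned host buffers, so the output copy of one sample can overlap with the input copy of the next. The buffers object, batches list, and execute_async_v2 call are placeholders here, and I believe overlapping the inference itself may also need a separate execution context per stream, so please treat this only as a sketch of the idea:

# Rough sketch of the pipeline idea (placeholders, not tested):
# alternate between two streams so sample N's D2H copy can overlap with
# sample N+1's H2D copy on the other stream.
streams = [cuda.Stream(), cuda.Stream()]

for n, batch in enumerate(batches):          # 'batches' is a placeholder
    s = streams[n % 2]
    buf = buffers[n % 2]                     # per-stream device/pinned-host buffers (placeholder)
    cuda.memcpy_htod_async(buf.d_input, batch, s)
    context.execute_async_v2(bindings=buf.bindings, stream_handle=s.handle)
    cuda.memcpy_dtoh_async(buf.h_output, buf.d_output, s)

for s in streams:
    s.synchronize()                          # collect everything at the end

Thanks!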