Description
I have an ONNX model whose output has been verified to be almost identical to that of my original PyTorch model.
After I convert it to a TensorRT engine, the output changes far too much. Is there any tool I can use to debug and locate where the error between the ONNX model and the TRT engine is introduced?
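For reference, NVIDIA's Polygraphy (included in the NGC TensorRT containers and installable with pip install polygraphy) can run the same ONNX model under both ONNX-Runtime and TensorRT and compare the outputs. A minimal invocation, untested on this particular model, would be:

# Compare a TensorRT fp16 build (matching the trtexec flags below) against ONNX-Runtime
polygraphy run onnx/EfficientUNetModel_.onnx --trt --fp16 --onnxrt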
Environment
NVIDIA Docker container 22.12
Relevant Files
Here is the ONNX model: https://cloud.tsinghua.edu.cn/f/8e1a7623952946c7bb76/?dl=1
Steps To Reproduce
Use this script to reproduce:
import os

import torch
import torch.nn as nn
import tensorrt as trt
import onnxruntime as ort
from torch.testing._internal.common_utils import numpy_to_torch_dtype_dict

TRT_LOGGER = trt.Logger()
trt.init_libnvinfer_plugins(TRT_LOGGER, '')


def load_engine(engine_file_path):
    assert os.path.exists(engine_file_path)
    print("Reading engine from file {}".format(engine_file_path))
    with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())


def get_trt_stuff(engine_path):
    # Allocate one persistent CUDA tensor per binding and collect the device
    # pointers that execute_async_v2 expects.
    engine = load_engine(engine_path)
    context = engine.create_execution_context()
    inputs_dict = {}
    outputs_dict = {}
    bindings = []
    for binding in engine:
        binding_idx = engine.get_binding_index(binding)
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        shape = tuple(context.get_binding_shape(binding_idx))
        if engine.binding_is_input(binding):
            inputs_dict[binding] = torch.empty(*shape, dtype=numpy_to_torch_dtype_dict[dtype], device='cuda')
            bindings.append(int(inputs_dict[binding].data_ptr()))
        else:
            outputs_dict[binding] = torch.empty(*shape, dtype=numpy_to_torch_dtype_dict[dtype], device='cuda')
            bindings.append(int(outputs_dict[binding].data_ptr()))
    return context, bindings, inputs_dict, outputs_dict


def run_trt(context, bindings, stream=None):
    if stream is None:
        stream = torch.cuda.default_stream()
    state = context.execute_async_v2(bindings=bindings, stream_handle=stream.cuda_stream)
    stream.synchronize()
    return state


class TRTModule(nn.Module):
    def __init__(self, engine_path):
        super().__init__()
        self.context, self.bindings, self.inputs_dict, self.outputs_dict = get_trt_stuff(engine_path)

    def forward(self, *inputs, **kw_args):
        # Copy positional and keyword inputs into the pre-allocated buffers;
        # bindings are named input_0, input_1, ...
        device = 'cpu'
        for i, inp in enumerate(inputs):
            self.inputs_dict['input_{}'.format(i)].copy_(inp)
            device = inp.device
        shift = len(inputs)
        for k in kw_args:
            self.inputs_dict['input_{}'.format(shift)].copy_(kw_args[k])
            shift += 1
        state = run_trt(self.context, self.bindings)
        if not state:
            raise Exception("trt engine execution failed")
        outputs = []
        for i in range(len(self.outputs_dict)):
            outputs.append(self.outputs_dict['output_{}'.format(i)].cpu().to(device))
        if len(outputs) == 1:
            outputs = outputs[0]
        return outputs


def get_ort_stuff(onnx_path, providers):
    return ort.InferenceSession(onnx_path, providers=providers)


class ORTModule(nn.Module):
    def __init__(self, onnx_path, providers=['CUDAExecutionProvider', 'CPUExecutionProvider']):
        super().__init__()
        self.sess = get_ort_stuff(onnx_path, providers)

    def forward(self, *inputs, **kw_args):
        device = 'cpu'
        for inp in inputs:
            device = inp.device
        for k in kw_args:
            device = kw_args[k].device
        inputs_dict = {'input_{}'.format(i): x.cpu().numpy() if isinstance(x, torch.Tensor) else x
                       for i, x in enumerate(inputs)}
        shift = len(inputs_dict)
        for k in kw_args:
            inputs_dict['input_{}'.format(shift)] = kw_args[k].cpu().numpy()
            shift += 1
        outputs = self.sess.run(None, inputs_dict)
        outputs = [torch.from_numpy(x).to(device) for x in outputs]
        if len(outputs) == 1:
            outputs = outputs[0]
        return outputs


input_0 = torch.randn(2, 3, 256, 256, dtype=torch.float32).cuda()
input_1 = torch.tensor([1, 3], dtype=torch.int32).cuda()
input_2 = torch.randn(2, 81, 640, dtype=torch.float32).cuda()

# Named ort_model/trt_model so the imported `ort`/`trt` modules are not shadowed.
ort_model = ORTModule('onnx/EfficientUNetModel_.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
os.system('trtexec --onnx=onnx/EfficientUNetModel_.onnx --saveEngine=onnx/EfficientUNetModel_.trt --fp16 --buildOnly')
trt_model = TRTModule('onnx/EfficientUNetModel_.trt')

out_ort = ort_model(input_0, input_1, input_2)
out_trt = trt_model(input_0, input_1, input_2)
print((out_ort - out_trt).abs().max().item())
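The script above only compares the final outputs. To narrow down where the divergence starts, Polygraphy can also mark every intermediate tensor as an output and compare layer by layer; a sketch, again untested on this model:

polygraphy run onnx/EfficientUNetModel_.onnx --trt --fp16 --onnxrt \
    --trt-outputs mark all --onnx-outputs mark all

Note that marking all tensors as outputs can change which layers TensorRT fuses, so this locates the first mismatching region rather than reproducing the original engine exactly.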
Before running it, you should pip install onnxruntime-gpu.
It shows a maximum absolute error as large as ~4.6.
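Since the engine is built with --fp16, one quick check is whether the error disappears with a full-precision build: if an fp32 engine matches ONNX-Runtime closely, the divergence is likely an fp16 precision issue in some layer rather than a broken conversion. A sketch (the _fp32 engine name is just a placeholder):

trtexec --onnx=onnx/EfficientUNetModel_.onnx --saveEngine=onnx/EfficientUNetModel_fp32.trt --buildOnly
# then point TRTModule at onnx/EfficientUNetModel_fp32.trt and rerun the comparison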
Also refer to the related topic: Onnx output differs largely to TRT engine output.