Description
I tried to convert a GPT model from PyTorch to ONNX and then to TensorRT. The conversion to a TensorRT engine succeeds, but I can't get the results I want during the inference phase. I can guarantee that the ONNX model is correct. The following two warnings appeared while converting the ONNX model to the TensorRT engine, and I don't know whether they affect the engine conversion:
[05/29/2022-19:08:00] [TRT] [W] onnx2trt_utils.cpp:392: One or more weights outside the range of INT32 was clamped
[05/29/2022-19:08:01] [TRT] [W] ShapedWeights.cpp:173: Weights transformer.h.8.attn.c_attn.weight has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the new weights must be pre-transposed.
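For reference, here is a minimal sketch of the kind of check used to confirm the ONNX model matches PyTorch (assuming onnxruntime is available, the graph inputs are named input_ids and token_type_ids, and they were exported as int64; the dummy shapes follow the optimization profile below):

import numpy as np
import onnxruntime as ort
import torch
from transformers import OpenAIGPTLMHeadModel

# hypothetical dummy inputs; (1, 20) matches the "opt" shape of the profile below
input_ids = np.random.randint(0, 1000, size=(1, 20), dtype=np.int64)
token_type_ids = np.zeros((1, 20), dtype=np.int64)

model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt").eval()
with torch.no_grad():
    torch_logits = model(input_ids=torch.from_numpy(input_ids),
                         token_type_ids=torch.from_numpy(token_type_ids)).logits.numpy()

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
onnx_logits = sess.run(None, {"input_ids": input_ids,
                              "token_type_ids": token_type_ids})[0]

# the two outputs should agree to within floating-point tolerance
print(np.max(np.abs(torch_logits - onnx_logits)))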
The code that converts the ONNX model to a TensorRT engine:
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

success = parser.parse_from_file('model.onnx')
# for idx in range(parser.num_errors):
#     print(parser.get_error(idx))
if not success:
    pass  # Error handling code here

config = builder.create_builder_config()
# config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 20)  # 1 MiB
config.max_workspace_size = 1 << 31

profile = builder.create_optimization_profile()
profile.set_shape("input_ids", (1, 1), (1, 20), (1, 300))
profile.set_shape("token_type_ids", (1, 1), (1, 20), (1, 300))
config.add_optimization_profile(profile)

serialized_engine = builder.build_serialized_network(network, config)
with open("sample4.engine", "wb") as f:
    f.write(serialized_engine)
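For completeness, a sketch of how the serialized engine is deserialized before the inference code below runs (the actual loading lives in RuntensorRT; the variable names engine and context are what the snippet below expects):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
# read back the engine serialized above and create an execution context for it
with open("sample4.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()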
The main inference code; input_ids and token_type_ids are the two inputs to the model:
context.active_optimization_profile = 0
origin_inputshape = context.get_binding_shape(0)
origin_inputshape[0], origin_inputshape[1] = input_ids.shape
context.set_binding_shape(0, origin_inputshape)
context.set_binding_shape(1, origin_inputshape)
inputs, outputs, bindings, stream = common.allocate_buffers(engine)
inputs[1].host = input_ids
inputs[0].host = token_type_ids
logits, *_ = common.do_inference_v2(context, bindings=bindings, inputs=inputs,
                                    outputs=outputs, stream=stream)
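One thing I am not certain about is the binding order: common.allocate_buffers fills inputs in binding-index order, so inputs[1].host = input_ids is only correct if "input_ids" really is binding index 1. A small sketch (using engine and context from above) to print the bindings, so the host buffers can be matched to the right names and dtypes:

# print each binding's index, name, direction, dtype, and current shape
# (also useful for spotting int32 vs int64 input mismatches)
for i in range(engine.num_bindings):
    print(i,
          engine.get_binding_name(i),
          "input" if engine.binding_is_input(i) else "output",
          engine.get_binding_dtype(i),
          context.get_binding_shape(i))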
The model I want to convert is OpenAIGPTLMHeadModel. I can only include one link here, but you can find the model on Hugging Face.
Environment
TensorRT Version: 8.2.5.1
GPU Type: RTX 3060
Nvidia Driver Version: 497.38
CUDA Version: 11.5.1
CUDNN Version: 8.2.1.32
Operating System + Version: Windows 11
Python Version (if applicable): 3.8.13
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.11
Baremetal or Container (if container which image + tag):
Relevant Files
github link to my code
RuntensorRT is the inference-phase code.