TensorRT produces all-zero output for Qwen3-Embedding-0.6B

Description

I downloaded Qwen/Qwen3-Embedding-0.6B from Hugging Face and converted it to a dynamic-shape ONNX model (see the attachment named test_qwen3_embedding.py). Then I used trtexec to convert the ONNX model to a TensorRT engine with the following command:

/usr/src/tensorrt/bin/trtexec --onnx=qwen3_embedding_0.6b.onnx \
    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --optShapes=input_ids:1x1024,attention_mask:1x1024 \
    --maxShapes=input_ids:1x4096,attention_mask:1x4096 \
    --fp16 \
    --saveEngine=qwen3_embedding_0.6b.engine

However, when I run the engine with the TensorRT C++ API or with Triton Server, both output all zeros.

Environment

I tested on an A100 in the NGC container nvcr.io/nvidia/tritonserver:23.07-py3.

TensorRT Version: 8.6.1
GPU Type: dGPU (A100)
Nvidia Driver Version: 535.183.01
CUDA Version: 12.1
Operating System + Version: Ubuntu 22.04
Baremetal or Container (if container which image + tag): Container (nvcr.io/nvidia/tritonserver:23.07-py3)

Steps To Reproduce

  • Pull and run the Docker image nvcr.io/nvidia/tritonserver:23.07-py3.
  • Download Qwen/Qwen3-Embedding-0.6B from Hugging Face.
  • Convert it to an ONNX model with the attached Python script test_qwen3_embedding.txt (4.2 KB).
  • Convert the ONNX model to a TensorRT engine with the trtexec command above.
  • Serve the engine with Triton Inference Server.
  • Send an HTTP or gRPC request with tritonclient (a minimal client sketch follows below).

You will get all-zero outputs.
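
For reference, a minimal client-side sketch of the last step (the model name and tensor names below are assumptions; match them to your config.pbtxt):

import numpy as np
import tritonclient.http as httpclient

# Assumed model/tensor names; adjust to your Triton model repository.
client = httpclient.InferenceServerClient(url="localhost:8000")

input_ids = np.ones((1, 16), dtype=np.int64)
attention_mask = np.ones((1, 16), dtype=np.int64)

inputs = [
    httpclient.InferInput("input_ids", list(input_ids.shape), "INT64"),
    httpclient.InferInput("attention_mask", list(attention_mask.shape), "INT64"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

result = client.infer("qwen3_embedding_0.6b", inputs)
hidden = result.as_numpy("last_hidden_state")
print("all zeros:", not np.any(hidden))  # reportedly True with the FP16 engine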

1 Like

I inspected the weight and activation distributions of the original model and found that Qwen3RMSNorm may overflow FP16, so I built a TensorRT engine without --fp16. The ONNX model contains plenty of Cast nodes to maintain precision; it seems TensorRT removes them? So do I have to build an FP32 TensorRT engine? However, even though the FP32 engine produces non-zero outputs, they still don't match the outputs of the ONNX model.
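
To illustrate the overflow (toy values, not the model's real activations): the squared-mean inside RMSNorm can exceed the FP16 range even when the inputs themselves fit, which is exactly what those Cast nodes protect against.

import torch

# Values around 300 are representable in FP16, but their squares (~90000)
# exceed the FP16 max (~65504), so the RMS reduction blows up to inf.
x = torch.full((1, 1024), 300.0)

rms_fp32 = x.float().pow(2).mean(dim=-1).sqrt()
rms_fp16 = x.half().pow(2).mean(dim=-1).sqrt()

print(rms_fp32)  # tensor([300.])
print(rms_fp16)  # tensor([inf], dtype=torch.float16)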

2 Likes

I pulled the latest tritonserver image nvcr.io/nvidia/tritonserver:25.05-py3, which ships TensorRT v10.0.10, and it works: I now get the correct outputs. However, my target platform is Drive Orin, which only has a TensorRT 8.6.1 deb package released. My questions are:

  1. What are the differences between v10.0.10 and v8.6.1?
  2. Why doesn't --fp16 work well?
1 Like

After confirmation from an expert, I believe there are RMSNorm bugs in TensorRT 8.6, affecting both precision and efficiency, even in FP32. So I implemented an FP16 RMSNorm plugin, replaced all the related ops with a single RMSNorm node, and then built the engine with --fp16; it works well.
However, this still doesn't explain why --fp16 doesn't work well in TensorRT 10.0.
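
A rough sketch of this kind of graph surgery with onnx-graphsurgeon (illustrative only: the tensor names, the plugin op name "RMSNormPlugin", and the epsilon are placeholders to match against your own export, and the plugin itself still has to be implemented and registered with TensorRT separately):

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("qwen3_embedding_0.6b.onnx"))
tensors = graph.tensors()

# Assumed tensor names for one RMSNorm instance; a real pass would pattern-match
# every Pow -> ReduceMean -> Add(eps) -> Sqrt -> Div -> Mul(weight) subgraph.
x      = tensors["/model/layers.0/input_layernorm/Cast_output_0"]
weight = tensors["model.layers.0.input_layernorm.weight"]
y      = tensors["/model/layers.0/input_layernorm/Mul_1_output_0"]

# Detach the output from the decomposed subgraph, then splice in one fused node
# that the custom TensorRT plugin can claim at build time.
y.inputs.clear()
graph.nodes.append(gs.Node(op="RMSNormPlugin", name="layers.0.input_layernorm",
                           attrs={"eps": 1e-6}, inputs=[x, weight], outputs=[y]))

# cleanup() drops the now-dangling RMSNorm decomposition nodes.
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "qwen3_embedding_0.6b_rmsnorm_plugin.onnx")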

1 Like

Hi @RicardoLu, I also hit this problem. My TensorRT model returns wrong outputs at inference time; they differ from both the raw model and the ONNX model (the ONNX model works fine).

My environment:

  • Docker image: nvcr.io/nvidia/deepstream:6.3-triton-multiarch
  • TensorRT: 10.3.0
  • CUDA: 12.6
  • Python: 3.10.12
  • Polygraphy: 0.49.24
  • Transformers: 4.53.1

Here are my steps:

1. Export the ONNX model using optimum-cli:

optimum-cli export onnx --model Qwen/Qwen3-Embedding-0.6B --task feature-extraction --opset 19 models/qwen3_embedding_0.6b_onnx

2. Use trtexec to build the TensorRT engines (one with --fp16, one with --best):

trtexec --onnx=models/qwen3_embedding_0.6b_onnx/model.onnx \
    --minShapes=input_ids:1x1024,attention_mask:1x1024,position_ids:1x1024 \
    --optShapes=input_ids:4x1024,attention_mask:4x1024,position_ids:4x1024 \
    --maxShapes=input_ids:8x1024,attention_mask:8x1024,position_ids:8x1024 \
    --fp16 \
    --saveEngine=qwen3_embedding_0.6b.engine

trtexec --onnx=models/qwen3_embedding_0.6b_onnx/model.onnx \
    --minShapes=input_ids:1x1024,attention_mask:1x1024,position_ids:1x1024 \
    --optShapes=input_ids:4x1024,attention_mask:4x1024,position_ids:4x1024 \
    --maxShapes=input_ids:8x1024,attention_mask:8x1024,position_ids:8x1024 \
    --best \
    --saveEngine=model_repository/qwen3_embedding_0.6b/1/qwen3_embedding_0.6b.engine

3. Use a Python script to check the TensorRT engine:

from transformers import AutoTokenizer, AutoModel
from polygraphy.backend.trt import EngineFromBytes, TrtRunner
import time
import numpy as np
import torch
import sys
import os

# NOTE: pool_embeddings() and calculate_similarity() are helpers from the full
# script and are not shown here.


def run_tensorrt_polygraphy_model(texts, engine_path):
    print("\n" + "=" * 50)
    print("RUNNING TENSORRT MODEL")
    print("=" * 50)

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B")
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    try:
        # Load TensorRT engine using Polygraphy
        with open(engine_path, 'rb') as f:
            engine_bytes = f.read()
        engine = EngineFromBytes(engine_bytes)

        print(f"TensorRT engine loaded from {engine_path}")

        # Create TensorRT runner
        with TrtRunner(engine) as runner:
            # Tokenize inputs
            start_time = time.time()
            inputs = tokenizer(
                texts,
                padding='max_length',
                max_length=1024,
                truncation=True,
                return_tensors="np"
            )
            tokenization_time = time.time() - start_time
            print(f"Tokenization time: {tokenization_time:.4f}s")

            # Print input info
            print("Input shapes:")
            for key, value in inputs.items():
                print(f"  {key}: {value.shape}")

            # Prepare inputs
            input_ids = inputs['input_ids'].astype(np.int64)
            attention_mask = inputs['attention_mask'].astype(np.int64)
            position_ids = np.arange(1024, dtype=np.int64)[np.newaxis, :].repeat(len(texts), axis=0)

            print(f"  position_ids: {position_ids.shape}")

            # Prepare input dictionary for TensorRT
            input_dict = {
                'input_ids': input_ids,
                'attention_mask': attention_mask,
                'position_ids': position_ids
            }

            # Run inference
            start_time = time.time()
            outputs = runner.infer(input_dict)
            inference_time = time.time() - start_time
            print(f"Inference time: {inference_time:.4f}s")

            # Get output (assuming first output is last_hidden_state)
            output_names = list(outputs.keys())
            print(f"Available outputs: {output_names}")

            # Get the main output (should be last_hidden_state)
            last_hidden_state = None
            for output_name in output_names:
                if 'last_hidden_state' in output_name.lower() or len(output_names) == 1:
                    last_hidden_state = outputs[output_name]
                    break

            if last_hidden_state is None:
                # Take the first output if no clear match
                last_hidden_state = outputs[output_names[0]]
                print(f"Using output: {output_names[0]}")

            print(f"TensorRT output shape: {last_hidden_state.shape}")
            print(f"TensorRT output range: [{last_hidden_state.min():.6f}, {last_hidden_state.max():.6f}]")
            print(f"TensorRT output mean: {last_hidden_state.mean():.6f}")
            print(f"TensorRT output std: {last_hidden_state.std():.6f}")

            # Pool embeddings
            embeddings_list = pool_embeddings(last_hidden_state, attention_mask)
            embeddings = np.stack(embeddings_list)
            print(f"Embeddings shape after pooling: {embeddings.shape}")

            # Calculate similarity
            similarity_scores, normalized_embeddings = calculate_similarity(embeddings)

            # Calculate norms
            norms = np.linalg.norm(normalized_embeddings, axis=1)
            print(f"Embedding norms: {norms.tolist()}")

            print("Similarity scores:")
            print(f"  TensorRT scores: {similarity_scores}")

            return {
                'embeddings': normalized_embeddings,
                'raw_outputs': last_hidden_state,
                'inputs': {k: v for k, v in inputs.items()},
                'similarity_scores': similarity_scores,
                'inference_time': inference_time,
                'output_stats': {
                    'min': float(last_hidden_state.min()),
                    'max': float(last_hidden_state.max()),
                    'mean': float(last_hidden_state.mean()),
                    'std': float(last_hidden_state.std())
                }
            }

    except Exception as e:
        print(f"TensorRT Polygraphy model failed: {e}")
        import traceback
        traceback.print_exc()
        return None


def main():
    texts = [
        "What is the capital of China?",
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies toward each other."
    ]
    engine_path = "/qwen3_embedding_0.6b/1/qwen3_embedding_0.6b.engine"
    tensorrt_result = run_tensorrt_polygraphy_model(texts, engine_path)


if __name__ == "__main__":
    main()
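
As an extra check, the same batch can be run through Polygraphy's ONNX Runtime backend and diffed against the engine output (a sketch reusing input_dict and last_hidden_state from the script above; the output name last_hidden_state is assumed from the optimum feature-extraction export):

import numpy as np
from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx

# Feed the exact same tokenized batch to ONNX Runtime and compare element-wise.
with OnnxrtRunner(SessionFromOnnx("models/qwen3_embedding_0.6b_onnx/model.onnx")) as ort_runner:
    ort_outputs = ort_runner.infer(input_dict)

ort_hidden = ort_outputs["last_hidden_state"]
print("max abs diff vs ONNX Runtime:", np.abs(last_hidden_state - ort_hidden).max())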

Hi @nabang1010. Try building an FP32 engine under TRT 10+; it should work fine. BTW, TRT 10+ also supports BF16, which you can try as well, but as far as I know BF16 is less efficient than FP16.

1 Like

@RicardoLu @nabang1010 can we use TensorRT-LLM to deploy this model?

1 Like

I tried, but it was too hard. Finally, I used vLLM.
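
For reference, serving it with vLLM looks roughly like this (a sketch following the Qwen3-Embedding usage examples; requires a recent vLLM build with embedding/pooling support):

from vllm import LLM

# Run Qwen3-Embedding-0.6B as an embedding (pooling) model in vLLM.
model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
outputs = model.embed([
    "What is the capital of China?",
    "The capital of China is Beijing.",
])
embeddings = [o.outputs.embedding for o in outputs]
print(len(embeddings), len(embeddings[0]))  # 2 vectors, 1024 dimensions each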

1 Like

@nabang1010 in vLLM, were you able to deploy the quantised model?
With TensorRT FP16, I am facing the issue of all dimensions of the embedding being 0.

Sorry, my target platform is Drive Orin; it doesn't support TRT-LLM.

Hi @nabang1010, after a long time searching, I found that options like --fp16, --int8, and --bf16 are deprecated and superseded by strong typing. So you can just use --stronglyTyped instead of --fp16 when building the TensorRT engine; it should work fine with TensorRT 10+.
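
For completeness, a minimal Python-API sketch of the same idea (equivalent in spirit to trtexec --stronglyTyped; the input names and shape profile mirror the command at the top of the thread, so adjust them to your own export):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Strongly typed network: per-layer precision follows the types/Cast nodes in
# the ONNX graph instead of the --fp16 builder heuristics.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))

parser = trt.OnnxParser(network, logger)
if not parser.parse_from_file("qwen3_embedding_0.6b.onnx"):
    raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
profile = builder.create_optimization_profile()
for name in ("input_ids", "attention_mask"):
    profile.set_shape(name, (1, 1), (1, 1024), (1, 4096))
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
with open("qwen3_embedding_0.6b.engine", "wb") as f:
    f.write(engine_bytes)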