TensorRT produces all-zero output for Qwen3-Embedding-0.6B

Description

I downloaded Qwen/Qwen3-Embedding-0.6B from Hugging Face and converted it to a dynamic-shape ONNX model (see the attachment named test_qwen3_embedding.py). Then I used trtexec to convert the ONNX model to a TensorRT engine with the following command:

/usr/src/tensorrt/bin/trtexec --onnx=qwen3_embedding_0.6b.onnx \
    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --optShapes=input_ids:1x1024,attention_mask:1x1024 \
    --maxShapes=input_ids:1x4096,attention_mask:1x4096 \
    --fp16 \
    --saveEngine=qwen3_embedding_0.6b.engine

However, when I run the engine with the TensorRT C++ API or with Triton Server, both output all zeros.

Environment

I tested on an A100 in the NGC container nvcr.io/nvidia/tritonserver:23.07-py3.

TensorRT Version: 8.6.1
GPU Type: dGPU (A100)
Nvidia Driver Version: 535.183.01
CUDA Version: 12.1
Operating System + Version: Ubuntu 22.04
Baremetal or Container (if container which image + tag): Container (nvcr.io/nvidia/tritonserver:23.07-py3)

Steps To Reproduce

  • Pull and run the Docker image nvcr.io/nvidia/tritonserver:23.07-py3.
  • Download Qwen/Qwen3-Embedding-0.6B from Hugging Face.
  • Convert it to an ONNX model with the attached Python script test_qwen3_embedding.txt (4.2 KB).
  • Convert the ONNX model to a TensorRT engine with the trtexec command above.
  • Serve the engine with Triton Inference Server.
  • Send an HTTP or gRPC request with tritonclient (a minimal client sketch follows below).

You will get all-zero outputs.
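
For reference, a minimal client-side sketch of the last step (the model name and tensor names below are assumptions; match them to your config.pbtxt):

import numpy as np
import tritonclient.http as httpclient

# Assumed model/tensor names; adjust to your Triton model repository.
client = httpclient.InferenceServerClient(url="localhost:8000")

input_ids = np.ones((1, 16), dtype=np.int64)
attention_mask = np.ones((1, 16), dtype=np.int64)

inputs = [
    httpclient.InferInput("input_ids", list(input_ids.shape), "INT64"),
    httpclient.InferInput("attention_mask", list(attention_mask.shape), "INT64"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

result = client.infer("qwen3_embedding_0.6b", inputs)
hidden = result.as_numpy("last_hidden_state")
print("all zeros:", not np.any(hidden))  # reportedly True with the FP16 engine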

1 Like

I inspected the weight and activation distributions of the original model and found that Qwen3RMSNorm may overflow FP16, so I built a TensorRT engine without --fp16. The ONNX model contains plenty of Cast nodes to maintain precision; it seems TensorRT removes them? So do I have to build an FP32 TensorRT engine? However, even though the FP32 engine produces non-zero outputs, they still don't match the outputs of the ONNX model.
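
To illustrate the overflow (toy values, not the model's real activations): the squared-mean inside RMSNorm can exceed the FP16 range even when the inputs themselves fit, which is exactly what those Cast nodes protect against.

import torch

# Values around 300 are representable in FP16, but their squares (~90000)
# exceed the FP16 max (~65504), so the RMS reduction blows up to inf.
x = torch.full((1, 1024), 300.0)

rms_fp32 = x.float().pow(2).mean(dim=-1).sqrt()
rms_fp16 = x.half().pow(2).mean(dim=-1).sqrt()

print(rms_fp32)  # tensor([300.])
print(rms_fp16)  # tensor([inf], dtype=torch.float16)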

2 Likes

I pulled the latest tritonserver image nvcr.io/nvidia/tritonserver:25.05-py3, which ships TensorRT v10.0.10, and it works: I now get the correct outputs. However, my target platform is Drive Orin, which only has a TensorRT 8.6.1 deb package released. My questions are:

  1. What are the differences between v10.0.10 and v8.6.1?
  2. Why doesn't --fp16 work well?
1 Like

After confirmation from an expert, I believe there are RMSNorm bugs in TensorRT 8.6, affecting both precision and efficiency, even in FP32. So I implemented an FP16 RMSNorm plugin, replaced all the related ops with a single RMSNorm node, and then built the engine with --fp16; it works well.
However, this still doesn't explain why --fp16 doesn't work well in TensorRT 10.0.
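
A rough sketch of this kind of graph surgery with onnx-graphsurgeon (illustrative only: the tensor names, the plugin op name "RMSNormPlugin", and the epsilon are placeholders to match against your own export, and the plugin itself still has to be implemented and registered with TensorRT separately):

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("qwen3_embedding_0.6b.onnx"))
tensors = graph.tensors()

# Assumed tensor names for one RMSNorm instance; a real pass would pattern-match
# every Pow -> ReduceMean -> Add(eps) -> Sqrt -> Div -> Mul(weight) subgraph.
x      = tensors["/model/layers.0/input_layernorm/Cast_output_0"]
weight = tensors["model.layers.0.input_layernorm.weight"]
y      = tensors["/model/layers.0/input_layernorm/Mul_1_output_0"]

# Detach the output from the decomposed subgraph, then splice in one fused node
# that the custom TensorRT plugin can claim at build time.
y.inputs.clear()
graph.nodes.append(gs.Node(op="RMSNormPlugin", name="layers.0.input_layernorm",
                           attrs={"eps": 1e-6}, inputs=[x, weight], outputs=[y]))

# cleanup() drops the now-dangling RMSNorm decomposition nodes.
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "qwen3_embedding_0.6b_rmsnorm_plugin.onnx")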

1 Like

Hi @RicardoLu, I also hit this problem. My TensorRT model returns wrong outputs at inference time; they differ from both the raw model and the ONNX model (the ONNX model works fine).

My environment:

  • Docker image: nvcr.io/nvidia/deepstream:6.3-triton-multiarch
  • TensorRT: 10.3.0
  • CUDA: 12.6
  • Python: 3.10.12
  • Polygraphy: 0.49.24
  • Transformers: 4.53.1

Here are my steps:

1. Export the ONNX model using optimum-cli:

optimum-cli export onnx --model Qwen/Qwen3-Embedding-0.6B --task feature-extraction --opset 19 models/qwen3_embedding_0.6b_onnx

2. Use trtexec to build the TensorRT engines (one with --fp16, one with --best):

trtexec --onnx=models/qwen3_embedding_0.6b_onnx/model.onnx \
    --minShapes=input_ids:1x1024,attention_mask:1x1024,position_ids:1x1024 \
    --optShapes=input_ids:4x1024,attention_mask:4x1024,position_ids:4x1024 \
    --maxShapes=input_ids:8x1024,attention_mask:8x1024,position_ids:8x1024 \
    --fp16 \
    --saveEngine=qwen3_embedding_0.6b.engine

trtexec --onnx=models/qwen3_embedding_0.6b_onnx/model.onnx \
    --minShapes=input_ids:1x1024,attention_mask:1x1024,position_ids:1x1024 \
    --optShapes=input_ids:4x1024,attention_mask:4x1024,position_ids:4x1024 \
    --maxShapes=input_ids:8x1024,attention_mask:8x1024,position_ids:8x1024 \
    --best \
    --saveEngine=model_repository/qwen3_embedding_0.6b/1/qwen3_embedding_0.6b.engine

3. Use a Python script to check the TensorRT engine:

from transformers import AutoTokenizer, AutoModel
from polygraphy.backend.trt import EngineFromBytes, TrtRunner
import time
import numpy as np
import torch
import sys
import os

# NOTE: pool_embeddings() and calculate_similarity() are helpers from the full
# script and are not shown here.


def run_tensorrt_polygraphy_model(texts, engine_path):
    print("\n" + "=" * 50)
    print("RUNNING TENSORRT MODEL")
    print("=" * 50)

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B")
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    try:
        # Load TensorRT engine using Polygraphy
        with open(engine_path, 'rb') as f:
            engine_bytes = f.read()
        engine = EngineFromBytes(engine_bytes)

        print(f"TensorRT engine loaded from {engine_path}")

        # Create TensorRT runner
        with TrtRunner(engine) as runner:
            # Tokenize inputs
            start_time = time.time()
            inputs = tokenizer(
                texts,
                padding='max_length',
                max_length=1024,
                truncation=True,
                return_tensors="np"
            )
            tokenization_time = time.time() - start_time
            print(f"Tokenization time: {tokenization_time:.4f}s")

            # Print input info
            print("Input shapes:")
            for key, value in inputs.items():
                print(f"  {key}: {value.shape}")

            # Prepare inputs
            input_ids = inputs['input_ids'].astype(np.int64)
            attention_mask = inputs['attention_mask'].astype(np.int64)
            position_ids = np.arange(1024, dtype=np.int64)[np.newaxis, :].repeat(len(texts), axis=0)

            print(f"  position_ids: {position_ids.shape}")

            # Prepare input dictionary for TensorRT
            input_dict = {
                'input_ids': input_ids,
                'attention_mask': attention_mask,
                'position_ids': position_ids
            }

            # Run inference
            start_time = time.time()
            outputs = runner.infer(input_dict)
            inference_time = time.time() - start_time
            print(f"Inference time: {inference_time:.4f}s")

            # Get output (assuming first output is last_hidden_state)
            output_names = list(outputs.keys())
            print(f"Available outputs: {output_names}")

            # Get the main output (should be last_hidden_state)
            last_hidden_state = None
            for output_name in output_names:
                if 'last_hidden_state' in output_name.lower() or len(output_names) == 1:
                    last_hidden_state = outputs[output_name]
                    break

            if last_hidden_state is None:
                # Take the first output if no clear match
                last_hidden_state = outputs[output_names[0]]
                print(f"Using output: {output_names[0]}")

            print(f"TensorRT output shape: {last_hidden_state.shape}")
            print(f"TensorRT output range: [{last_hidden_state.min():.6f}, {last_hidden_state.max():.6f}]")
            print(f"TensorRT output mean: {last_hidden_state.mean():.6f}")
            print(f"TensorRT output std: {last_hidden_state.std():.6f}")

            # Pool embeddings
            embeddings_list = pool_embeddings(last_hidden_state, attention_mask)
            embeddings = np.stack(embeddings_list)
            print(f"Embeddings shape after pooling: {embeddings.shape}")

            # Calculate similarity
            similarity_scores, normalized_embeddings = calculate_similarity(embeddings)

            # Calculate norms
            norms = np.linalg.norm(normalized_embeddings, axis=1)
            print(f"Embedding norms: {norms.tolist()}")

            print("Similarity scores:")
            print(f"  TensorRT scores: {similarity_scores}")

            return {
                'embeddings': normalized_embeddings,
                'raw_outputs': last_hidden_state,
                'inputs': {k: v for k, v in inputs.items()},
                'similarity_scores': similarity_scores,
                'inference_time': inference_time,
                'output_stats': {
                    'min': float(last_hidden_state.min()),
                    'max': float(last_hidden_state.max()),
                    'mean': float(last_hidden_state.mean()),
                    'std': float(last_hidden_state.std())
                }
            }

    except Exception as e:
        print(f"TensorRT Polygraphy model failed: {e}")
        import traceback
        traceback.print_exc()
        return None


def main():
    texts = [
        "What is the capital of China?",
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies toward each other."
    ]
    engine_path = "/qwen3_embedding_0.6b/1/qwen3_embedding_0.6b.engine"
    tensorrt_result = run_tensorrt_polygraphy_model(texts, engine_path)


if __name__ == "__main__":
    main()
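
As an extra check, the same batch can be run through Polygraphy's ONNX Runtime backend and diffed against the engine output (a sketch reusing input_dict and last_hidden_state from the script above; the output name last_hidden_state is assumed from the optimum feature-extraction export):

import numpy as np
from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx

# Feed the exact same tokenized batch to ONNX Runtime and compare element-wise.
with OnnxrtRunner(SessionFromOnnx("models/qwen3_embedding_0.6b_onnx/model.onnx")) as ort_runner:
    ort_outputs = ort_runner.infer(input_dict)

ort_hidden = ort_outputs["last_hidden_state"]
print("max abs diff vs ONNX Runtime:", np.abs(last_hidden_state - ort_hidden).max())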

Hi @nabang1010. Try building an FP32 engine under TRT 10+; it should work fine. BTW, TRT 10+ also supports BF16, which you can try as well, but as far as I know BF16 is less efficient than FP16.

1 Like

@RicardoLu @nabang1010 can we use TensorRT-LLM to deploy this model?

1 Like

I tried, but it was too hard. Finally, I used vLLM.
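
For reference, serving it with vLLM looks roughly like this (a sketch following the Qwen3-Embedding usage examples; requires a recent vLLM build with embedding/pooling support):

from vllm import LLM

# Run Qwen3-Embedding-0.6B as an embedding (pooling) model in vLLM.
model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
outputs = model.embed([
    "What is the capital of China?",
    "The capital of China is Beijing.",
])
embeddings = [o.outputs.embedding for o in outputs]
print(len(embeddings), len(embeddings[0]))  # 2 vectors, 1024 dimensions each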

1 Like

@nabang1010 in vLLM, were you able to deploy the quantised model?
With TensorRT FP16, I am facing the issue of all dimensions of the embedding being 0.

Sorry, my target platform is Drive Orin; it doesn't support TRT-LLM.

Hi @nabang1010, after a long time searching, I found that options like --fp16, --int8, and --bf16 are deprecated and superseded by strong typing. So you can just use --stronglyTyped instead of --fp16 when building the TensorRT engine; it should work fine with TensorRT 10+.
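
For completeness, a minimal Python-API sketch of the same idea (equivalent in spirit to trtexec --stronglyTyped; the input names and shape profile mirror the command at the top of the thread, so adjust them to your own export):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Strongly typed network: per-layer precision follows the types/Cast nodes in
# the ONNX graph instead of the --fp16 builder heuristics.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))

parser = trt.OnnxParser(network, logger)
if not parser.parse_from_file("qwen3_embedding_0.6b.onnx"):
    raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
profile = builder.create_optimization_profile()
for name in ("input_ids", "attention_mask"):
    profile.set_shape(name, (1, 1), (1, 1024), (1, 4096))
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
with open("qwen3_embedding_0.6b.engine", "wb") as f:
    f.write(engine_bytes)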