Please provide the following information when creating a topic:
- Hardware Platform (1xA100)
- System Memory (164Gi)
- Ubuntu Version (Ubuntu 22.04.5 LTS)
- NVIDIA GPU Driver Version (535.183.06)
- Issue Type (questions, new requirements, bugs)
I am trying to deploy VSS 2.3.0 as described in Remote LLM Deployment, with the following line in the .env file:
export MODEL_PATH=ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8
I get the following error:
via-server-1 | 2025-05-21 12:38:25,835 INFO Initializing VLM pipeline
via-server-1 | 2025-05-21 12:38:25,842 INFO Have peer access: True
via-server-1 | 2025-05-21 12:38:25,844 INFO Using model cached at /root/.via/ngc_model_cache/nim_nvidia_vila-1.5-40b_vila-yi-34b-siglip-stage3_1003_video_v8_vila-llama-3-8b-lita
via-server-1 | 2025-05-21 12:38:25,844 INFO TRT-LLM Engine not found. Generating engines ...
via-server-1 | INFO: Started server process [215]
via-server-1 | INFO: Waiting for application startup.
via-server-1 | INFO: Application startup complete.
via-server-1 | INFO: Uvicorn running on http://127.0.0.1:60000 (Press CTRL+C to quit)
via-server-1 | Selecting INT4 AWQ mode
via-server-1 | Converting Checkpoint ...
via-server-1 | [2025-05-21 12:38:30,768] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
via-server-1 | df: /root/.triton/autotune: No such file or directory
via-server-1 | [TensorRT-LLM] TensorRT-LLM version: 0.18.0.dev2025020400
via-server-1 | Traceback (most recent call last):
via-server-1 |   File "/opt/nvidia/via/via-engine/models/vila15/trt_helper/quantize.py", line 167, in <module>
via-server-1 |     quantize_and_export(
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 669, in quantize_and_export
via-server-1 |     hf_config = get_hf_config(model_dir)
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 265, in get_hf_config
via-server-1 |     return AutoConfig.from_pretrained(ckpt_path, trust_remote_code=True)
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1091, in from_pretrained
via-server-1 |     raise ValueError(
via-server-1 | ValueError: Unrecognized model in /tmp/tmp.vila.Nh1A9HRc. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: albert, align, altclip, aria, aria_text, audio-spectrogram-transformer, autoformer, bamba, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, bros, camembert, canine, chameleon, chinese_clip, chinese_clip_vision_model, clap, clip, clip_text_model, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, cohere2, colpali, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, dac, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deformable_detr, deit, depth_anything, deta, detr, diffllama, dinat, dinov2, dinov2_with_registers, distilbert, donut-swin, dpr, dpt, efficientformer, efficientnet, electra, emu3, encodec, encoder-decoder, ernie, ernie_m, esm, falcon, falcon_mamba, fastspeech2_conformer, flaubert, flava, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, git, glm, glpn, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gptj, gptsan-japanese, granite, granitemoe, graphormer, grounding-dino, groupvit, hiera, hubert, ibert, idefics, idefics2, idefics3, idefics3_vision, ijepa, imagegpt, informer, instructblip, instructblipvideo, jamba, jetmoe, jukebox, kosmos-2, layoutlm, layoutlmv2, layoutlmv3, led, levit, lilt, llama, llava, llava_next, llava_next_video, llava_onevision, longformer, longt5, luke, lxmert, m2m_100, mamba, mamba2, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, mgp-str, mimi, mistral, mixtral, mllama, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, modernbert, moonshine, moshi, mpnet, mpt, mra, mt5, musicgen, musicgen_melody, mvp, nat, nemotron, nezha, nllb-moe, nougat, nystromformer, olmo, olmo2, olmoe, omdet-turbo, oneformer, open-llama, openai-gpt, opt, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, persimmon, phi, phi3, phimoe, pix2struct, pixtral, plbart, poolformer, pop2piano, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_audio, qwen2_audio_encoder, qwen2_moe, qwen2_vl, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rwkv, sam, seamless_m4t, seamless_m4t_v2, segformer, seggpt, sew, sew-d, siglip, siglip_vision_model, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, table-transformer, tapas, textnet, time_series_transformer, timesformer, timm_backbone, timm_wrapper, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vitpose, vitpose_backbone, vits, vivit, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xmod, yolos, yoso, zamba, zoedepth, intern_vit_6b, v2l_projector, llava_llama, llava_mistral, llava_mixtral
via-server-1 | ERROR: Failed to convert checkpoint
via-server-1 | 2025-05-21 12:38:35,777 ERROR Failed to load VIA stream handler - Failed to generate TRT-LLM engine
via-server-1 | Traceback (most recent call last):
via-server-1 |   File "/opt/nvidia/via/via-engine/via_server.py", line 1368, in run
via-server-1 |     self._stream_handler = ViaStreamHandler(self._args)
via-server-1 |   File "/opt/nvidia/via/via-engine/via_stream_handler.py", line 416, in __init__
via-server-1 |     self._vlm_pipeline = VlmPipeline(args.asset_dir, args)
via-server-1 |   File "/opt/nvidia/via/via-engine/vlm_pipeline/vlm_pipeline.py", line 1270, in __init__
via-server-1 |     raise Exception("Failed to generate TRT-LLM engine")
via-server-1 | Exception: Failed to generate TRT-LLM engine
via-server-1 |
via-server-1 | During handling of the above exception, another exception occurred:
via-server-1 |
via-server-1 | Traceback (most recent call last):
via-server-1 |   File "/opt/nvidia/via/via-engine/via_server.py", line 2880, in <module>
via-server-1 |     server.run()
via-server-1 |   File "/opt/nvidia/via/via-engine/via_server.py", line 1370, in run
via-server-1 |     raise ViaException(f"Failed to load VIA stream handler - {str(ex)}")
via-server-1 | via_exception.ViaException: ViaException - code: InternalServerError message: Failed to load VIA stream handler - Failed to generate TRT-LLM engine
via-server-1 | Killed process with PID 120
Changing the .env file to
export MODEL_PATH=git:Efficient-Large-Model/NVILA-15B
as I did with the previous VSS 2.2.0, I get the following error:
via-server-1 | Selecting INT4 AWQ mode
via-server-1 | Converting Checkpoint ...
via-server-1 | [2025-05-21 12:42:46,406] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
via-server-1 | df: /root/.triton/autotune: No such file or directory
via-server-1 | [TensorRT-LLM] TensorRT-LLM version: 0.18.0.dev2025020400
Loading checkpoint shards: 100% 6/6 [00:20<00:00, 3.35s/it]
via-server-1 | Traceback (most recent call last):
via-server-1 |   File "/opt/nvidia/via/via-engine/models/vila15/trt_helper/quantize.py", line 167, in <module>
via-server-1 |     quantize_and_export(
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 672, in quantize_and_export
via-server-1 |     model = get_model(model_dir, dtype, device=device, device_map=device_map)
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 305, in get_model
via-server-1 |     model = _get_vila_model(ckpt_path)
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 251, in _get_vila_model
via-server-1 |     model = AutoModel.from_pretrained(
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
via-server-1 |     return model_class.from_pretrained(
via-server-1 |   File "/opt/nvidia/via/via-engine/models/vila15/VILA/llava/model/language_model/llava_llama.py", line 61, in from_pretrained
via-server-1 |     return cls.load_pretrained(
via-server-1 |   File "/opt/nvidia/via/via-engine/models/vila15/VILA/llava/model/llava_arch.py", line 127, in load_pretrained
via-server-1 |     vlm = cls(config, *args, **kwargs)
via-server-1 |   File "/opt/nvidia/via/via-engine/models/vila15/VILA/llava/model/language_model/llava_llama.py", line 43, in __init__
via-server-1 |     return self.init_vlm(config=config, *args, **kwargs)
via-server-1 |   File "/opt/nvidia/via/via-engine/models/vila15/VILA/llava/model/llava_arch.py", line 78, in init_vlm
via-server-1 |     self.mm_projector = build_mm_projector(mm_projector_cfg, config)
via-server-1 |   File "/opt/nvidia/via/via-engine/models/vila15/VILA/llava/model/multimodal_projector/builder.py", line 34, in build_mm_projector
via-server-1 |     return MultimodalProjector.from_pretrained(model_type_or_path, config, torch_dtype=eval(config.model_dtype))
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/opt/plugins/huggingface.py", line 84, in _new_from_pretrained
via-server-1 |     model = types.MethodType(cls._modelopt_cache["from_pretrained"].__func__, cls)(
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4245, in from_pretrained
via-server-1 |     ) = cls._load_pretrained_model(
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/opt/plugins/huggingface.py", line 120, in _new__load_pretrained_model
via-server-1 |     return types.MethodType(cls._modelopt_cache["_load_pretrained_model"].__func__, cls)(
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4873, in _load_pretrained_model
via-server-1 |     raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
via-server-1 | RuntimeError: Error(s) in loading state_dict for MultimodalProjector:
via-server-1 |     size mismatch for layers.1.weight: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([4608]).
via-server-1 |     size mismatch for layers.1.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([4608]).
via-server-1 |     size mismatch for layers.2.weight: copying a param with shape torch.Size([5120, 13824]) from checkpoint, the shape in current model is torch.Size([5120, 4608]).
via-server-1 |     You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
via-server-1 | ERROR: Failed to convert checkpoint
via-server-1 | 2025-05-21 12:43:12,903 ERROR Failed to load VIA stream handler - Failed to generate TRT-LLM engine
via-server-1 | Traceback (most recent call last):
via-server-1 |   File "/opt/nvidia/via/via-engine/via_server.py", line 1368, in run
via-server-1 |     self._stream_handler = ViaStreamHandler(self._args)
via-server-1 |   File "/opt/nvidia/via/via-engine/via_stream_handler.py", line 416, in __init__
via-server-1 |     self._vlm_pipeline = VlmPipeline(args.asset_dir, args)
via-server-1 |   File "/opt/nvidia/via/via-engine/vlm_pipeline/vlm_pipeline.py", line 1270, in __init__
via-server-1 |     raise Exception("Failed to generate TRT-LLM engine")
via-server-1 | Exception: Failed to generate TRT-LLM engine
via-server-1 |
via-server-1 | During handling of the above exception, another exception occurred:
via-server-1 |
via-server-1 | Traceback (most recent call last):
via-server-1 |   File "/opt/nvidia/via/via-engine/via_server.py", line 2880, in <module>
via-server-1 |     server.run()
via-server-1 |   File "/opt/nvidia/via/via-engine/via_server.py", line 1370, in run
via-server-1 |     raise ViaException(f"Failed to load VIA stream handler - {str(ex)}")
via-server-1 | via_exception.ViaException: ViaException - code: InternalServerError message: Failed to load VIA stream handler - Failed to generate TRT-LLM engine
via-server-1 | Killed process with PID 120
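For comparison, this second failure is a plain shape mismatch: the projector weights in the downloaded checkpoint are 13824 wide, while the projector built by the bundled VILA-1.5 code expects 4608. A rough way to list what the checkpoint actually ships (just a sketch, assuming the usual VILA layout with an mm_projector/ subfolder; the path below is a placeholder for my local download):

# Sketch: print every tensor name and shape stored in the projector weights so they
# can be compared with the 13824-vs-4608 sizes reported in the traceback above.
import glob
import os

import torch

proj_dir = "/path/to/NVILA-15B/mm_projector"  # placeholder; adjust to the downloaded checkpoint

for path in sorted(glob.glob(os.path.join(proj_dir, "*"))):
    if path.endswith(".safetensors"):
        from safetensors.torch import load_file  # lazy import; only needed for .safetensors shards
        state = load_file(path)
    elif path.endswith(".bin"):
        state = torch.load(path, map_location="cpu")
    else:
        continue
    for name, tensor in state.items():
        print(f"{os.path.basename(path)} :: {name} -> {tuple(tensor.shape)}")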
Trying the Local NGC models (VILA & NVILA) approach, with the following in .env:
export MODEL_PATH=/home/ubuntu/video-search-and-summarization/deploy/docker/remote_llm_deployment/nvila-highres_vnvila-lite-15b-highres-lita
export MODEL_ROOT_DIR=/home/ubuntu/video-search-and-summarization/deploy/docker/remote_llm_deployment
I also receive a similar error:
via-server-1 | Converting Checkpoint ...
via-server-1 | [2025-05-21 12:47:11,468] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
via-server-1 | df: /root/.triton/autotune: No such file or directory
via-server-1 | [TensorRT-LLM] TensorRT-LLM version: 0.18.0.dev2025020400
Loading checkpoint shards: 100% 6/6 [00:20<00:00, 3.37s/it]
via-server-1 | Traceback (most recent call last):
via-server-1 |   File "/opt/nvidia/via/via-engine/models/vila15/trt_helper/quantize.py", line 167, in <module>
via-server-1 |     quantize_and_export(
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 672, in quantize_and_export
via-server-1 |     model = get_model(model_dir, dtype, device=device, device_map=device_map)
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 305, in get_model
via-server-1 |     model = _get_vila_model(ckpt_path)
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 251, in _get_vila_model
via-server-1 |     model = AutoModel.from_pretrained(
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
via-server-1 |     return model_class.from_pretrained(
via-server-1 |   File "/opt/nvidia/via/via-engine/models/vila15/VILA/llava/model/language_model/llava_llama.py", line 61, in from_pretrained
via-server-1 |     return cls.load_pretrained(
via-server-1 |   File "/opt/nvidia/via/via-engine/models/vila15/VILA/llava/model/llava_arch.py", line 127, in load_pretrained
via-server-1 |     vlm = cls(config, *args, **kwargs)
via-server-1 |   File "/opt/nvidia/via/via-engine/models/vila15/VILA/llava/model/language_model/llava_llama.py", line 43, in __init__
via-server-1 |     return self.init_vlm(config=config, *args, **kwargs)
via-server-1 |   File "/opt/nvidia/via/via-engine/models/vila15/VILA/llava/model/llava_arch.py", line 78, in init_vlm
via-server-1 |     self.mm_projector = build_mm_projector(mm_projector_cfg, config)
via-server-1 |   File "/opt/nvidia/via/via-engine/models/vila15/VILA/llava/model/multimodal_projector/builder.py", line 34, in build_mm_projector
via-server-1 |     return MultimodalProjector.from_pretrained(model_type_or_path, config, torch_dtype=eval(config.model_dtype))
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/opt/plugins/huggingface.py", line 84, in _new_from_pretrained
via-server-1 |     model = types.MethodType(cls._modelopt_cache["from_pretrained"].__func__, cls)(
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4111, in from_pretrained
via-server-1 |     model = cls(config, *model_args, **model_kwargs)
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/opt/plugins/huggingface.py", line 61, in new_init_fn
via-server-1 |     cls._original__init__(self, *args, **kwargs)
via-server-1 |   File "/opt/nvidia/via/via-engine/models/vila15/VILA/llava/model/multimodal_projector/base_projector.py", line 107, in __init__
via-server-1 |     raise ValueError(f"Unknown projector type: {mm_projector_type}")
via-server-1 | ValueError: Unknown projector type: mlp_downsample_3x3_fix
via-server-1 | ERROR: Failed to convert checkpoint
via-server-1 | 2025-05-21 12:47:38,044 ERROR Failed to load VIA stream handler - Failed to generate TRT-LLM engine
via-server-1 | Traceback (most recent call last):
via-server-1 |   File "/opt/nvidia/via/via-engine/via_server.py", line 1368, in run
via-server-1 |     self._stream_handler = ViaStreamHandler(self._args)
via-server-1 |   File "/opt/nvidia/via/via-engine/via_stream_handler.py", line 416, in __init__
via-server-1 |     self._vlm_pipeline = VlmPipeline(args.asset_dir, args)
via-server-1 |   File "/opt/nvidia/via/via-engine/vlm_pipeline/vlm_pipeline.py", line 1270, in __init__
via-server-1 |     raise Exception("Failed to generate TRT-LLM engine")
via-server-1 | Exception: Failed to generate TRT-LLM engine
via-server-1 |
via-server-1 | During handling of the above exception, another exception occurred:
via-server-1 |
via-server-1 | Traceback (most recent call last):
via-server-1 |   File "/opt/nvidia/via/via-engine/via_server.py", line 2880, in <module>
via-server-1 |     server.run()
via-server-1 |   File "/opt/nvidia/via/via-engine/via_server.py", line 1370, in run
via-server-1 |     raise ViaException(f"Failed to load VIA stream handler - {str(ex)}")
via-server-1 | via_exception.ViaException: ViaException - code: InternalServerError message: Failed to load VIA stream handler - Failed to generate TRT-LLM engine
via-server-1 | Killed process with PID 119
via-server-1 exited with code 1
Any suggestions?
Thank you