Hi,
I am currently trying to get the Docker container for Llama-3.2-NV-EmbedQA-1B-v2 running, following the steps outlined in the documentation: Get Started With NeMo Retriever Text Embedding NIM
My machine: ASUS Zephyrus G14 with a Ryzen 9 8945HS (Radeon 780M integrated graphics) and an RTX 4070 Laptop GPU
NVIDIA-SMI 575.64.01, driver version 576.88, CUDA version 12.9
I am running Ubuntu 24.04 under WSL2 with Docker Desktop.
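In case it matters, here is a quick way to sanity-check that Docker Desktop is exposing the GPU to containers at all (the CUDA base image tag below is just an example; any recent tag should behave the same):

# Confirm the GPU is visible from inside a container
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If passthrough is working, this prints the same nvidia-smi table as on the host.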
What I think is pointing at the issue is this warning, which repeats about once per second after startup:

W0722 17:43:35.839547 262 metrics.cc:644] "Unable to get power limit for GPU 0. Status:Success, value:0.000000"
"nvidia-smi" output:
laptop:~$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.64.01              Driver Version: 576.88          CUDA Version: 12.9    |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    On  |   00000000:01:00.0  On |                  N/A |
| N/A   51C    P8              2W /  75W  |     793MiB /  8188MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
"nvidia-smi -q -d POWER" output:
laptop:~$ nvidia-smi -q -d POWER

==============NVSMI LOG==============

Timestamp                          : Tue Jul 22 11:04:11 2025
Driver Version                     : 576.88
CUDA Version                       : 12.9

Attached GPUs                      : 1
GPU 00000000:01:00.0
    GPU Power Readings
        Average Power Draw         : 3.20 W
        Instantaneous Power Draw   : 2.47 W
        Current Power Limit        : 75.00 W
        Requested Power Limit      : 75.00 W
        Default Power Limit        : 55.00 W
        Min Power Limit            : 5.00 W
        Max Power Limit            : 90.00 W
    Power Samples
        Duration                   : Not Found
        Number of Samples          : Not Found
        Max                        : Not Found
        Min                        : Not Found
        Avg                        : Not Found
    GPU Memory Power Readings
        Average Power Draw         : N/A
        Instantaneous Power Draw   : N/A
    Module Power Readings
        Average Power Draw         : N/A
        Instantaneous Power Draw   : N/A
        Current Power Limit        : N/A
        Requested Power Limit      : N/A
        Default Power Limit        : N/A
        Min Power Limit            : N/A
        Max Power Limit            : N/A
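The same power fields can also be pulled through nvidia-smi's CSV query interface. As far as I understand, these go through NVML, which is presumably what Triton's metrics collector is calling too, so comparing the output on the host and inside the container might narrow down where the 0.000000 comes from:

# Query the NVML-backed power fields directly
# (run once on the host and once inside the container for comparison)
nvidia-smi --query-gpu=name,power.draw,power.limit,power.default_limit,power.min_limit,power.max_limit --format=csv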
The entire console output:
$ docker login nvcr.io
Authenticating with existing credentials...
Login Succeeded

laptop:~$ # Choose a container name for bookkeeping
export NIM_MODEL_NAME=nvidia/llama-3.2-nv-embedqa-1b-v2
export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)

# Choose a NIM Image from NGC
export IMG_NAME="nvcr.io/nim/$NIM_MODEL_NAME:1.9.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

===================================
== NVIDIA NIM for Text Embedding ==
===================================

NVIDIA Release 1.9.0
Model: nvidia/llama-3.2-nv-embedqa-1b-v2

Container image Copyright (c) 2016-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

The NIM container is governed by the NVIDIA Software License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the Product-Specific Terms for NVIDIA AI Products (found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).

Use of this model is governed by the NVIDIA Community Model License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/).

ADDITIONAL INFORMATION: Llama 3.2 Community License Agreement (https://www.llama.com/llama3_2/license/). Built with Llama.

A copy of this license can be found under /opt/nim/LICENSE. Third Party Software Attributions and Licenses can be found under /opt/nim/acknowledgements.txt.

Overriding NIM_LOG_LEVEL: replacing NIM_LOG_LEVEL=unset with NIM_LOG_LEVEL=INFO
HF_HOME is set to /opt/nim/.cache/huggingface
INFO 2025-07-22 17:43:07.630 __init__.py:413] Appending: /opt/nim to PYTHONPATH: ['/opt/nim/start_server.d', '/usr/lib/python312.zip', '/usr/lib/python3.12', '/usr/lib/python3.12/lib-dynload', '/usr/local/lib/python3.12/dist-packages', '/usr/lib/python3/dist-packages']
NIM_MODEL_PROFILE is set to f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f
INFO 2025-07-22 17:43:08.323 __init__.py:413] Appending: /opt/nim to PYTHONPATH: ['/opt/nim/start_server.d', '/usr/lib/python312.zip', '/usr/lib/python3.12', '/usr/lib/python3.12/lib-dynload', '/usr/local/lib/python3.12/dist-packages', '/usr/lib/python3/dist-packages']
INFO 2025-07-22 17:43:10.610 profiles.py:98] Registered custom profile selectors: []
INFO 2025-07-22 17:43:10.610 profiles.py:208] Matched profile_id in manifest from env NIM_MODEL_PROFILE to: f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f
INFO 2025-07-22 17:43:10.610 nim_sdk.py:299] Using the profile selected by the profile selector: f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f
INFO 2025-07-22 17:43:10.610 nim_sdk.py:308] Downloading manifest profile: f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f
INFO 2025-07-22 17:43:10.652 lib.rs:203] File: tokenizer.json found in cache: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3.2-nv-embedqa-1b-v2/snapshots/tokenizer-4096-f250c002/tokenizer.json"
INFO 2025-07-22 17:43:10.652 public.rs:52] Skipping download, using cached copy of file: tokenizer.json at path: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3.2-nv-embedqa-1b-v2/snapshots/tokenizer-4096-f250c002/tokenizer.json"
INFO 2025-07-22 17:43:10.662 lib.rs:203] File: model.onnx.tar found in cache: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3.2-nv-embedqa-1b-v2/snapshots/onnx-precision.fp16-7c7a1c17/model.onnx.tar"
INFO 2025-07-22 17:43:10.662 public.rs:52] Skipping download, using cached copy of file: model.onnx.tar at path: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3.2-nv-embedqa-1b-v2/snapshots/onnx-precision.fp16-7c7a1c17/model.onnx.tar"
INFO 2025-07-22 17:43:10.669 lib.rs:203] File: special_tokens_map.json found in cache: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3.2-nv-embedqa-1b-v2/snapshots/tokenizer-4096-f250c002/special_tokens_map.json"
INFO 2025-07-22 17:43:10.669 public.rs:52] Skipping download, using cached copy of file: special_tokens_map.json at path: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3.2-nv-embedqa-1b-v2/snapshots/tokenizer-4096-f250c002/special_tokens_map.json"
INFO 2025-07-22 17:43:10.675 lib.rs:203] File: tokenizer_config.json found in cache: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3.2-nv-embedqa-1b-v2/snapshots/tokenizer-4096-f250c002/tokenizer_config.json"
INFO 2025-07-22 17:43:10.675 public.rs:52] Skipping download, using cached copy of file: tokenizer_config.json at path: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3.2-nv-embedqa-1b-v2/snapshots/tokenizer-4096-f250c002/tokenizer_config.json"
INFO 2025-07-22 17:43:10.675 nim_sdk.py:328] Using the workspace specified during init: /opt/nim/workspace
INFO 2025-07-22 17:43:10.677 nim_sdk.py:341] Materializing workspace to: /opt/nim/workspace
INFO 2025-07-22 17:43:13.579 __init__.py:413] Appending: /opt/nim to PYTHONPATH: ['/usr/local/bin', '/usr/lib/python312.zip', '/usr/lib/python3.12', '/usr/lib/python3.12/lib-dynload', '/usr/local/lib/python3.12/dist-packages', '/usr/lib/python3/dist-packages']
INFO 2025-07-22 17:43:13.822 profiles.py:208] Matched profile_id in manifest from env NIM_MODEL_PROFILE to: f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f
INFO 2025-07-22 17:43:13.822 nim_sdk.py:294] Using the profile specified by the user: f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f
INFO 2025-07-22 17:43:13.823 nim_sdk.py:308] Downloading manifest profile: f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f
INFO 2025-07-22 17:43:13.846 lib.rs:203] File: special_tokens_map.json found in cache: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3.2-nv-embedqa-1b-v2/snapshots/tokenizer-4096-f250c002/special_tokens_map.json"
INFO 2025-07-22 17:43:13.846 public.rs:52] Skipping download, using cached copy of file: special_tokens_map.json at path: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3.2-nv-embedqa-1b-v2/snapshots/tokenizer-4096-f250c002/special_tokens_map.json"
INFO 2025-07-22 17:43:13.856 lib.rs:203] File: tokenizer_config.json found in cache: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3.2-nv-embedqa-1b-v2/snapshots/tokenizer-4096-f250c002/tokenizer_config.json"
INFO 2025-07-22 17:43:13.856 public.rs:52] Skipping download, using cached copy of file: tokenizer_config.json at path: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3.2-nv-embedqa-1b-v2/snapshots/tokenizer-4096-f250c002/tokenizer_config.json"
INFO 2025-07-22 17:43:13.864 lib.rs:203] File: model.onnx.tar found in cache: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3.2-nv-embedqa-1b-v2/snapshots/onnx-precision.fp16-7c7a1c17/model.onnx.tar"
INFO 2025-07-22 17:43:13.864 public.rs:52] Skipping download, using cached copy of file: model.onnx.tar at path: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3.2-nv-embedqa-1b-v2/snapshots/onnx-precision.fp16-7c7a1c17/model.onnx.tar"
INFO 2025-07-22 17:43:13.873 lib.rs:203] File: tokenizer.json found in cache: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3.2-nv-embedqa-1b-v2/snapshots/tokenizer-4096-f250c002/tokenizer.json"
INFO 2025-07-22 17:43:13.874 public.rs:52] Skipping download, using cached copy of file: tokenizer.json at path: "/opt/nim/.cache/ngc/hub/models--nim--nvidia--llama-3.2-nv-embedqa-1b-v2/snapshots/tokenizer-4096-f250c002/tokenizer.json"
INFO 2025-07-22 17:43:13.875 nim_sdk.py:328] Using the workspace specified during init: /opt/nim/workspace
INFO 2025-07-22 17:43:13.875 nim_sdk.py:341] Materializing workspace to: /opt/nim/workspace
Extracting /opt/nim/workspace/model/model.onnx.tar... (-xvf)
INFO 2025-07-22 17:43:18.742 __init__.py:413] Appending: /opt/nim to PYTHONPATH: ['/usr/local/bin', '/usr/lib/python312.zip', '/usr/lib/python3.12', '/usr/lib/python3.12/lib-dynload', '/usr/local/lib/python3.12/dist-packages', '/usr/lib/python3/dist-packages']
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_generate_schema.py:628: UserWarning: <google.protobuf.internal.enum_type_wrapper.EnumTypeWrapper object at 0x754f7a837ec0> is not a Python type (it may be an instance of an object), Pydantic will allow any object with no validation since we cannot even enforce that the input is an instance of the given type. To get rid of this error wrap the type with `pydantic.SkipValidation`.
  warn(
WARNING 2025-07-22 17:43:21.033 tokenizer.py:43] triton_python_backend_utils not found
WARNING 2025-07-22 17:43:21.037 batch_embedder.py:28] triton_python_backend_utils not found
WARNING 2025-07-22 17:43:21.039 pytorch.py:25] torch not found
WARNING 2025-07-22 17:43:21.039 pytorch.py:32] triton_python_backend_utils not found
INFO 2025-07-22 17:43:22.595 repository.py:230] Loaded tokenizer from /opt/nim/workspace/tokenizer
INFO 2025-07-22 17:43:22.595 repository.py:238] No processor found, using tokenizer as processor
INFO 2025-07-22 17:43:22.600 onnx_model_builder.py:167] Setting number of models for 'nvidia_llama_3_2_nv_embedqa_1b_v2_model' to: None
INFO 2025-07-22 17:43:22.971 bls_model_builder.py:67] BLS model successfully generated and written to: /opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2/1/model.py
INFO 2025-07-22 17:43:23.037 repository.py:659] Successfully saved model service config to /opt/nim/tmp/run/triton-model-repository/service_config.yaml
INFO 2025-07-22 17:43:23.037 repository.py:660] Sucessfully generated Triton Model Repository at /opt/nim/tmp/run/triton-model-repository
INFO 2025-07-22 17:43:24.015 __init__.py:413] Appending: /opt/nim to PYTHONPATH: ['/usr/local/bin', '/usr/lib/python312.zip', '/usr/lib/python3.12', '/usr/lib/python3.12/lib-dynload', '/usr/local/lib/python3.12/dist-packages', '/usr/lib/python3/dist-packages']
INFO 2025-07-22 17:43:24.373 http_api.py:52] Serving endpoints:
    0.0.0.0:8000/v1/embeddings (POST)
    0.0.0.0:8000/v1/triton-inference-statistics (GET)
    0.0.0.0:8000/v1/models (GET)
    0.0.0.0:8000/v1/health/live (GET)
    0.0.0.0:8000/v1/health/ready (GET)
    0.0.0.0:8000/v1/metrics (GET)
    0.0.0.0:8000/v1/license (GET)
    0.0.0.0:8000/v1/metadata (GET)
    0.0.0.0:8000/v1/manifest (GET)
INFO 2025-07-22 17:43:24.374 http_api.py:73] {'message': 'Starting HTTP Inference server', 'port': 8000, 'workers_count': 8, 'host': '0.0.0.0', 'log_level': 'info', 'SSL': 'disabled'}
I0722 17:43:24.934903 262 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x204200000' with size 268435456"
I0722 17:43:24.935065 262 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0722 17:43:24.971727 262 model_lifecycle.cc:473] "loading: nvidia_llama_3_2_nv_embedqa_1b_v2_model:1"
I0722 17:43:24.971887 262 model_lifecycle.cc:473] "loading: nvidia_llama_3_2_nv_embedqa_1b_v2:1"
I0722 17:43:25.020303 262 onnxruntime.cc:2914] "TRITONBACKEND_Initialize: onnxruntime"
I0722 17:43:25.020393 262 onnxruntime.cc:2924] "Triton TRITONBACKEND API version: 1.19"
I0722 17:43:25.020403 262 onnxruntime.cc:2930] "'onnxruntime' TRITONBACKEND API version: 1.19"
I0722 17:43:25.020436 262 onnxruntime.cc:2960] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"true\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I0722 17:43:25.086486 262 onnxruntime.cc:3025] "TRITONBACKEND_ModelInitialize: nvidia_llama_3_2_nv_embedqa_1b_v2_model (version 1)"
I0722 17:43:25.091959 262 onnxruntime.cc:1014] "skipping model configuration auto-complete for 'nvidia_llama_3_2_nv_embedqa_1b_v2_model': inputs and outputs already specified"
I0722 17:43:25.093460 262 onnxruntime.cc:3090] "TRITONBACKEND_ModelInstanceInitialize: nvidia_llama_3_2_nv_embedqa_1b_v2_model_0_0 (GPU device 0)"
INFO 2025-07-22 17:43:25.593 __init__.py:413] Appending: /opt/nim to PYTHONPATH: ['', '/usr/lib/python312.zip', '/usr/lib/python3.12', '/usr/lib/python3.12/lib-dynload', '/usr/local/lib/python3.12/dist-packages', '/usr/lib/python3/dist-packages', '/opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2/1', '/opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2', '/opt/tritonserver/backends/python']
INFO 2025-07-22 17:43:25.858 profiles.py:98] Registered custom profile selectors: []
INFO 2025-07-22 17:43:25.859 profiles.py:208] Matched profile_id in manifest from env NIM_MODEL_PROFILE to: f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f
INFO 2025-07-22 17:43:25.859 profiles.py:98] Registered custom profile selectors: []
INFO 2025-07-22 17:43:25.859 profiles.py:208] Matched profile_id in manifest from env NIM_MODEL_PROFILE to: f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f
INFO:uvicorn.error:Application startup complete.
INFO:uvicorn.error:Application startup complete.
INFO 2025-07-22 17:43:25.865 profiles.py:98] Registered custom profile selectors: []
INFO 2025-07-22 17:43:25.865 profiles.py:208] Matched profile_id in manifest from env NIM_MODEL_PROFILE to: f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f
INFO:uvicorn.error:Application startup complete.
INFO 2025-07-22 17:43:25.867 profiles.py:98] Registered custom profile selectors: []
INFO 2025-07-22 17:43:25.868 profiles.py:208] Matched profile_id in manifest from env NIM_MODEL_PROFILE to: f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f
INFO:uvicorn.error:Application startup complete.
INFO 2025-07-22 17:43:25.870 profiles.py:98] Registered custom profile selectors: []
INFO 2025-07-22 17:43:25.870 profiles.py:208] Matched profile_id in manifest from env NIM_MODEL_PROFILE to: f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f
INFO:uvicorn.error:Application startup complete.
INFO 2025-07-22 17:43:25.873 profiles.py:98] Registered custom profile selectors: []
INFO 2025-07-22 17:43:25.873 profiles.py:98] Registered custom profile selectors: []
INFO 2025-07-22 17:43:25.873 profiles.py:208] Matched profile_id in manifest from env NIM_MODEL_PROFILE to: f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f
INFO 2025-07-22 17:43:25.873 profiles.py:208] Matched profile_id in manifest from env NIM_MODEL_PROFILE to: f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f
INFO:uvicorn.error:Application startup complete.
INFO:uvicorn.error:Application startup complete.
INFO 2025-07-22 17:43:25.875 profiles.py:98] Registered custom profile selectors: []
INFO 2025-07-22 17:43:25.876 profiles.py:208] Matched profile_id in manifest from env NIM_MODEL_PROFILE to: f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f
INFO:uvicorn.error:Application startup complete.
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_generate_schema.py:628: UserWarning: <google.protobuf.internal.enum_type_wrapper.EnumTypeWrapper object at 0x79e7fe659040> is not a Python type (it may be an instance of an object), Pydantic will allow any object with no validation since we cannot even enforce that the input is an instance of the given type. To get rid of this error wrap the type with `pydantic.SkipValidation`.
  warn(
2025-07-22 17:43:26.358336444 [W:onnxruntime:, transformer_memcpy.cc:83 ApplyImpl] 19 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2025-07-22 17:43:26.371280958 [W:onnxruntime:, session_state.cc:1280 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-07-22 17:43:26.371342767 [W:onnxruntime:, session_state.cc:1282 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
WARNING 2025-07-22 17:43:27.457 pytorch.py:25] torch not found
I0722 17:43:28.163329 262 model_lifecycle.cc:849] "successfully loaded 'nvidia_llama_3_2_nv_embedqa_1b_v2_model'"
I0722 17:43:28.764816 262 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: nvidia_llama_3_2_nv_embedqa_1b_v2_0_0 (CPU device 0)"
I0722 17:43:28.764977 262 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: nvidia_llama_3_2_nv_embedqa_1b_v2_0_1 (CPU device 0)"
I0722 17:43:28.765122 262 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: nvidia_llama_3_2_nv_embedqa_1b_v2_0_2 (CPU device 0)"
I0722 17:43:28.765281 262 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: nvidia_llama_3_2_nv_embedqa_1b_v2_0_3 (CPU device 0)"
I0722 17:43:28.765526 262 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: nvidia_llama_3_2_nv_embedqa_1b_v2_0_4 (CPU device 0)"
I0722 17:43:28.767086 262 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: nvidia_llama_3_2_nv_embedqa_1b_v2_0_5 (CPU device 0)"
I0722 17:43:28.767413 262 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: nvidia_llama_3_2_nv_embedqa_1b_v2_0_6 (CPU device 0)"
I0722 17:43:28.767576 262 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: nvidia_llama_3_2_nv_embedqa_1b_v2_0_7 (CPU device 0)"
INFO 2025-07-22 17:43:29.031 __init__.py:413] Appending: /opt/nim to PYTHONPATH: ['', '/usr/lib/python312.zip', '/usr/lib/python3.12', '/usr/lib/python3.12/lib-dynload', '/usr/local/lib/python3.12/dist-packages', '/usr/lib/python3/dist-packages', '/opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2/1', '/opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2', '/opt/tritonserver/backends/python']
INFO 2025-07-22 17:43:29.069 __init__.py:413] Appending: /opt/nim to PYTHONPATH: ['', '/usr/lib/python312.zip', '/usr/lib/python3.12', '/usr/lib/python3.12/lib-dynload', '/usr/local/lib/python3.12/dist-packages', '/usr/lib/python3/dist-packages', '/opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2/1', '/opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2', '/opt/tritonserver/backends/python']
INFO 2025-07-22 17:43:29.074 __init__.py:413] Appending: /opt/nim to PYTHONPATH: ['', '/usr/lib/python312.zip', '/usr/lib/python3.12', '/usr/lib/python3.12/lib-dynload', '/usr/local/lib/python3.12/dist-packages', '/usr/lib/python3/dist-packages', '/opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2/1', '/opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2', '/opt/tritonserver/backends/python']
INFO 2025-07-22 17:43:29.085 __init__.py:413] Appending: /opt/nim to PYTHONPATH: ['', '/usr/lib/python312.zip', '/usr/lib/python3.12', '/usr/lib/python3.12/lib-dynload', '/usr/local/lib/python3.12/dist-packages', '/usr/lib/python3/dist-packages', '/opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2/1', '/opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2', '/opt/tritonserver/backends/python']
INFO 2025-07-22 17:43:29.091 __init__.py:413] Appending: /opt/nim to PYTHONPATH: ['', '/usr/lib/python312.zip', '/usr/lib/python3.12', '/usr/lib/python3.12/lib-dynload', '/usr/local/lib/python3.12/dist-packages', '/usr/lib/python3/dist-packages', '/opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2/1', '/opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2', '/opt/tritonserver/backends/python']
INFO 2025-07-22 17:43:29.097 __init__.py:413] Appending: /opt/nim to PYTHONPATH: ['', '/usr/lib/python312.zip', '/usr/lib/python3.12', '/usr/lib/python3.12/lib-dynload', '/usr/local/lib/python3.12/dist-packages', '/usr/lib/python3/dist-packages', '/opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2/1', '/opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2', '/opt/tritonserver/backends/python']
INFO 2025-07-22 17:43:29.102 __init__.py:413] Appending: /opt/nim to PYTHONPATH: ['', '/usr/lib/python312.zip', '/usr/lib/python3.12', '/usr/lib/python3.12/lib-dynload', '/usr/local/lib/python3.12/dist-packages', '/usr/lib/python3/dist-packages', '/opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2/1', '/opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2', '/opt/tritonserver/backends/python']
INFO 2025-07-22 17:43:29.107 __init__.py:413] Appending: /opt/nim to PYTHONPATH: ['', '/usr/lib/python312.zip', '/usr/lib/python3.12', '/usr/lib/python3.12/lib-dynload', '/usr/local/lib/python3.12/dist-packages', '/usr/lib/python3/dist-packages', '/opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2/1', '/opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2', '/opt/tritonserver/backends/python']
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_generate_schema.py:628: UserWarning: <google.protobuf.internal.enum_type_wrapper.EnumTypeWrapper object at 0x739038e84e30> is not a Python type (it may be an instance of an object), Pydantic will allow any object with no validation since we cannot even enforce that the input is an instance of the given type. To get rid of this error wrap the type with `pydantic.SkipValidation`.
  warn(
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_generate_schema.py:628: UserWarning: <google.protobuf.internal.enum_type_wrapper.EnumTypeWrapper object at 0x76789d1fd1f0> is not a Python type (it may be an instance of an object), Pydantic will allow any object with no validation since we cannot even enforce that the input is an instance of the given type. To get rid of this error wrap the type with `pydantic.SkipValidation`.
  warn(
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_generate_schema.py:628: UserWarning: <google.protobuf.internal.enum_type_wrapper.EnumTypeWrapper object at 0x7dd4c42ccfe0> is not a Python type (it may be an instance of an object), Pydantic will allow any object with no validation since we cannot even enforce that the input is an instance of the given type. To get rid of this error wrap the type with `pydantic.SkipValidation`.
  warn(
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_generate_schema.py:628: UserWarning: <google.protobuf.internal.enum_type_wrapper.EnumTypeWrapper object at 0x7a78cc31d160> is not a Python type (it may be an instance of an object), Pydantic will allow any object with no validation since we cannot even enforce that the input is an instance of the given type. To get rid of this error wrap the type with `pydantic.SkipValidation`.
  warn(
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_generate_schema.py:628: UserWarning: <google.protobuf.internal.enum_type_wrapper.EnumTypeWrapper object at 0x777b25178ef0> is not a Python type (it may be an instance of an object), Pydantic will allow any object with no validation since we cannot even enforce that the input is an instance of the given type. To get rid of this error wrap the type with `pydantic.SkipValidation`.
  warn(
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_generate_schema.py:628: UserWarning: <google.protobuf.internal.enum_type_wrapper.EnumTypeWrapper object at 0x76ceaba41460> is not a Python type (it may be an instance of an object), Pydantic will allow any object with no validation since we cannot even enforce that the input is an instance of the given type. To get rid of this error wrap the type with `pydantic.SkipValidation`.
  warn(
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_generate_schema.py:628: UserWarning: <google.protobuf.internal.enum_type_wrapper.EnumTypeWrapper object at 0x79c64f7a5250> is not a Python type (it may be an instance of an object), Pydantic will allow any object with no validation since we cannot even enforce that the input is an instance of the given type. To get rid of this error wrap the type with `pydantic.SkipValidation`.
  warn(
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_generate_schema.py:628: UserWarning: <google.protobuf.internal.enum_type_wrapper.EnumTypeWrapper object at 0x7648b75f0a10> is not a Python type (it may be an instance of an object), Pydantic will allow any object with no validation since we cannot even enforce that the input is an instance of the given type. To get rid of this error wrap the type with `pydantic.SkipValidation`.
  warn(
WARNING 2025-07-22 17:43:31.075 pytorch.py:25] torch not found
INFO 2025-07-22 17:43:31.079 bls.py:75] Initializing tokenizer model...
WARNING 2025-07-22 17:43:31.088 pytorch.py:25] torch not found
INFO 2025-07-22 17:43:31.090 bls.py:75] Initializing tokenizer model...
INFO 2025-07-22 17:43:31.105 tokenizer.py:578] Loading tokenizer from /opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2/1/tokenizer
INFO 2025-07-22 17:43:31.105 tokenizer.py:578] Loading tokenizer from /opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2/1/tokenizer
WARNING 2025-07-22 17:43:31.125 pytorch.py:25] torch not found
INFO 2025-07-22 17:43:31.127 bls.py:75] Initializing tokenizer model...
INFO 2025-07-22 17:43:31.136 tokenizer.py:578] Loading tokenizer from /opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2/1/tokenizer
WARNING 2025-07-22 17:43:31.166 pytorch.py:25] torch not found
INFO 2025-07-22 17:43:31.169 bls.py:75] Initializing tokenizer model...
WARNING 2025-07-22 17:43:31.171 pytorch.py:25] torch not found
INFO 2025-07-22 17:43:31.174 bls.py:75] Initializing tokenizer model...
WARNING 2025-07-22 17:43:31.177 pytorch.py:25] torch not found
WARNING 2025-07-22 17:43:31.177 pytorch.py:25] torch not found
WARNING 2025-07-22 17:43:31.178 pytorch.py:25] torch not found
INFO 2025-07-22 17:43:31.179 tokenizer.py:578] Loading tokenizer from /opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2/1/tokenizer
INFO 2025-07-22 17:43:31.180 bls.py:75] Initializing tokenizer model...
INFO 2025-07-22 17:43:31.180 bls.py:75] Initializing tokenizer model...
INFO 2025-07-22 17:43:31.181 bls.py:75] Initializing tokenizer model...
INFO 2025-07-22 17:43:31.186 tokenizer.py:578] Loading tokenizer from /opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2/1/tokenizer
INFO 2025-07-22 17:43:31.191 tokenizer.py:578] Loading tokenizer from /opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2/1/tokenizer
INFO 2025-07-22 17:43:31.192 tokenizer.py:578] Loading tokenizer from /opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2/1/tokenizer
INFO 2025-07-22 17:43:31.192 tokenizer.py:578] Loading tokenizer from /opt/nim/tmp/run/triton-model-repository/nvidia_llama_3_2_nv_embedqa_1b_v2/1/tokenizer
INFO 2025-07-22 17:43:32.620 bls.py:118] Tokenizer is fast: True
INFO 2025-07-22 17:43:32.620 bls.py:120] tokenizer.model_max_length: 4096
INFO 2025-07-22 17:43:32.640 bls.py:118] Tokenizer is fast: True
INFO 2025-07-22 17:43:32.640 bls.py:120] tokenizer.model_max_length: 4096
INFO 2025-07-22 17:43:32.678 bls.py:118] Tokenizer is fast: True
INFO 2025-07-22 17:43:32.678 bls.py:120] tokenizer.model_max_length: 4096
INFO 2025-07-22 17:43:32.701 bls.py:118] Tokenizer is fast: True
INFO 2025-07-22 17:43:32.701 bls.py:120] tokenizer.model_max_length: 4096
INFO 2025-07-22 17:43:32.704 bls.py:118] Tokenizer is fast: True
INFO 2025-07-22 17:43:32.704 bls.py:120] tokenizer.model_max_length: 4096
INFO 2025-07-22 17:43:32.712 bls.py:118] Tokenizer is fast: True
INFO 2025-07-22 17:43:32.713 bls.py:120] tokenizer.model_max_length: 4096
INFO 2025-07-22 17:43:32.721 bls.py:118] Tokenizer is fast: True
INFO 2025-07-22 17:43:32.721 bls.py:120] tokenizer.model_max_length: 4096
INFO 2025-07-22 17:43:32.732 bls.py:118] Tokenizer is fast: True
INFO 2025-07-22 17:43:32.732 bls.py:120] tokenizer.model_max_length: 4096
I0722 17:43:32.745139 262 model_lifecycle.cc:849] "successfully loaded 'nvidia_llama_3_2_nv_embedqa_1b_v2'"
I0722 17:43:32.745375 262 server.cc:611]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0722 17:43:32.745473 262 server.cc:638]
+-------------+-----------------------------------------------+-----------------------------------------------+
| Backend     | Path                                          | Config                                        |
+-------------+-----------------------------------------------+-----------------------------------------------+
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtri | {"cmdline":{"auto-complete-config":"true","ba |
|             | ton_onnxruntime.so                            | ckend-directory":"/opt/tritonserver/backends" |
|             |                                               | ,"min-compute-capability":"6.000000","default |
|             |                                               | -max-batch-size":"4"}}                        |
|             |                                               |                                               |
| python      | /opt/tritonserver/backends/python/libtriton_p | {"cmdline":{"auto-complete-config":"true","ba |
|             | ython.so                                      | ckend-directory":"/opt/tritonserver/backends" |
|             |                                               | ,"min-compute-capability":"6.000000","default |
|             |                                               | -max-batch-size":"4"}}                        |
|             |                                               |                                               |
+-------------+-----------------------------------------------+-----------------------------------------------+
I0722 17:43:32.745656 262 server.cc:681]
+-----------------------------------------+---------+--------+
| Model                                   | Version | Status |
+-----------------------------------------+---------+--------+
| nvidia_llama_3_2_nv_embedqa_1b_v2       | 1       | READY  |
| nvidia_llama_3_2_nv_embedqa_1b_v2_model | 1       | READY  |
+-----------------------------------------+---------+--------+
I0722 17:43:32.810668 262 metrics.cc:890] "Collecting metrics for GPU 0: NVIDIA GeForce RTX 4070 Laptop GPU"
I0722 17:43:32.814717 262 metrics.cc:783] "Collecting CPU metrics"
I0722 17:43:32.815155 262 tritonserver.cc:2598]
+----------------------------------+--------------------------------------------------------------------------+
| Option                           | Value                                                                    |
+----------------------------------+--------------------------------------------------------------------------+
| server_id                        | triton                                                                   |
| server_version                   | 2.59.0                                                                   |
| server_extensions                | classification sequence model_repository model_repository(unload_depende |
|                                  | nts) schedule_policy model_configuration system_shared_memory cuda_share |
|                                  | d_memory binary_tensor_data parameters statistics trace logging          |
| model_repository_path[0]         | /opt/nim/tmp/run/triton-model-repository                                 |
| model_control_mode               | MODE_NONE                                                                |
| strict_model_config              | 0                                                                        |
| model_config_name                |                                                                          |
| rate_limit                       | OFF                                                                      |
| pinned_memory_pool_byte_size     | 268435456                                                                |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                 |
| min_supported_compute_capability | 6.0                                                                      |
| strict_readiness                 | 1                                                                        |
| exit_timeout                     | 30                                                                       |
| cache_enabled                    | 0                                                                        |
+----------------------------------+--------------------------------------------------------------------------+
I0722 17:43:32.845904 262 grpc_server.cc:2562] "Started GRPCInferenceService at 0.0.0.0:8001"
I0722 17:43:32.846729 262 http_server.cc:4832] "Started HTTPService at 0.0.0.0:8080"
I0722 17:43:32.893549 262 http_server.cc:358] "Started Metrics Service at 0.0.0.0:8002"
W0722 17:43:33.825195 262 metrics.cc:644] "Unable to get power limit for GPU 0. Status:Success, value:0.000000"
W0722 17:43:34.836180 262 metrics.cc:644] "Unable to get power limit for GPU 0. Status:Success, value:0.000000"
W0722 17:43:35.839547 262 metrics.cc:644] "Unable to get power limit for GPU 0. Status:Success, value:0.000000"
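Despite the repeated warning, Triton reports both models READY, so the server can be probed once it is up. Something like the following should exercise it (the embedding payload shape, including input_type, is my reading of the Text Embedding NIM docs, so treat it as a sketch rather than the definitive request format):

# Readiness probe
curl http://localhost:8000/v1/health/ready

# Minimal embedding request; payload fields assumed from the NIM documentation
curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/llama-3.2-nv-embedqa-1b-v2", "input": ["Hello world"], "input_type": "query"}'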
Thanks in advance