CUDA initialization failure with error Error 802: system not yet initialized

I am trying to run the TensorRT-LLM benchmark (trtllm-bench) for a DeepSeek model with a custom Docker image; below is my docker command. My CUDA (12.8) GPU driver installation is correct, and all 8 B200 GPUs are shown correctly in the nvidia-smi output.

docker run --rm -it --gpus all --network host --ipc host --privileged --cap-add SYS_PTRACE --security-opt seccomp=unconfined --name test_gpu evuedsoacr.azurecr.io/dc-ecosys-appl-eng/deepseek_tensorrt_llm:release bash

Despite that, I keep getting the failure below from inside the Docker image.

I am having the same issue. Did you figure out what the problem was?

My Problem Summary

Although all 8 NVIDIA B200 GPUs are visible in nvidia-smi, any attempt to initialize CUDA fails across all contexts:

  • Python (PyTorch, Transformers)
  • OpenLLM / inference pipelines
  • Raw CUDA C++ binary (./deviceQuery)

All return: cudaGetDeviceCount() → Error 802: system not yet initialized
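For anyone wanting to reproduce the raw check without building deviceQuery: error 802 corresponds to CUDA_ERROR_SYSTEM_NOT_READY in the driver API. Here is a minimal ctypes sketch; the probe_cuda helper and the small error table are my own illustration, not part of any NVIDIA tool:

```python
import ctypes

# Small subset of CUDA driver API error codes (from cuda.h); 802 is the one seen here.
CUDA_ERRORS = {
    0: "CUDA_SUCCESS",
    100: "CUDA_ERROR_NO_DEVICE",
    802: "CUDA_ERROR_SYSTEM_NOT_READY",  # "system not yet initialized"
}

def probe_cuda():
    """Call cuInit(0) through the driver API and return (code, name)."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        # No driver library on this machine; nothing to probe.
        return None, "libcuda.so.1 not found"
    rc = libcuda.cuInit(0)
    return rc, CUDA_ERRORS.get(rc, f"unrecognized error {rc}")

print(probe_cuda())
```

On an NVSwitch system, 802 from cuInit usually points at the fabric manager not being in a running state, which matches what follows in this thread.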


What I’ve Done

  • Confirmed B200s are visible in nvidia-smi with correct driver (v575.57.08)
  • Installed CUDA Toolkit 12.9
  • Installed and reinstalled PyTorch with CUDA 12.1 and 12.9 compatibility
  • Manually configured and tested Fabric Manager with various nvlsm.conf and fabricmanager.cfg settings
  • Verified required kernel modules are loaded (nvidia, nvidia_uvm, nvidia_drm, nvidia_modeset)
  • Confirmed libcuda.so loads via ctypes (no missing shared library issues)
  • Observed NVLSM receiving SM traps and entering MASTER state
  • NVSwitch topology appears correctly mapped via nvidia-smi topo -m
  • Kernel version: 5.15.0-143-generic

Remaining Issues

  • deviceQuery continues to return: cudaGetDeviceCount() → 802
  • All CUDA applications and libraries still fail to initialize
  • Fabric Manager repeatedly exits in non-operational state
  • nvlsm.log is flooded with unauthorized packet trap warnings

Can someone please provide the recommended BIOS/UEFI settings for B200 GPUs with NVLink, including:

  • Above 4G Decoding: confirm whether this should be enabled.
  • PCIe BAR Size: specify settings for Resizable BAR or BAR1 size.
  • MMIO Windowing: recommend MMIO high/low base addresses (e.g., 56T).
  • UEFI vs. Legacy Mode: confirm whether UEFI mode is required.

Also, what exactly are the latest firmware versions for the GPUs, NVSwitch, and system BMC, and what are the safe flashing instructions?

My Server Configuration is:

  • H14 10U GPU System with NVIDIA HGX B200 8-GPU and Dual AMD EPYC 9575F CPU
  • Dual AMD EPYC 9575F 64-core Processors (128 cores and 256 threads)
  • RAM - 24x Samsung DDR5 6000 MT/s 128GB (Total 3.0TB)
  • Storage/Drives - 8x Micron 7500 Pro 3.8TB
  • GPU - NVIDIA HGX™ B200 8-GPU (180GB HBM3e memory per GPU)

Any help from anyone would be welcome!

Hello @krishoza and @jacehall, welcome to the NVIDIA developer forums.

I don’t have any quick answers for you, I am afraid, besides the things you will also find when checking out the CUDA categories here on the forums.

Making sure FabricManager is running properly is one of those, but at least jacehall already verified that.

Driver version compatibility is another, but that again is something I trust both of you have already checked several times.

Beyond that, I highly recommend contacting Enterprise Support. Given the level of hardware both of you are using, I am sure you have access to that, either directly or through your cloud provider.


Thanks Mark for your input; it was indeed a Fabric Manager issue for me. Below is the error we were getting, and fixing it resolved the issue for me.

Kernel module ‘ib_umad’ has not been loaded, fabric manager cannot be started
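Besides lsmod, a quick way to confirm the module state is to read /proc/modules directly; this small helper is just my own illustration (Linux-only), not part of the fabric manager tooling:

```python
from pathlib import Path

def module_loaded(name: str) -> bool:
    """Return True if the named kernel module appears in /proc/modules."""
    proc = Path("/proc/modules")
    if not proc.exists():  # not Linux, or /proc unavailable
        return False
    # Each line starts with the module name followed by size/refcount fields.
    return any(line.split()[0] == name
               for line in proc.read_text().splitlines())

print(module_loaded("ib_umad"))
```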

Hey Markus…

Just wanted to get back to you with the solution we eventually figured out. I will post this as a separate thread as well:

CUDA HMM Compatibility Issue with Linux Kernel KASLR (Kernel Address Space Layout Randomization)

Problem Overview:

When deploying CUDA applications that use NVIDIA’s CUDA Heterogeneous Memory Management (HMM) features on modern Linux kernels, you may encounter significant stability or performance issues. After investigation, we traced this problem to a compatibility issue between Kernel Address Space Layout Randomization (KASLR) and CUDA’s HMM functionality.

Technical Details:

  • Kernel Address Space Layout Randomization (KASLR) enhances security by randomly positioning the Linux kernel in memory during boot, making certain exploits harder.
  • CUDA’s Heterogeneous Memory Management (HMM) enables GPUs and CPUs to transparently share virtual address spaces, crucial for advanced AI workloads and memory coherence.

The randomized kernel memory addresses created by KASLR can conflict with CUDA’s ability to accurately map and maintain shared GPU/CPU memory references. This interaction causes:

  • Frequent system instability or crashes.
  • Significant performance degradation.
  • Potential memory management errors at runtime.

Solution or Workaround:

Currently, the workaround is one of the following:

  1. Disable KASLR:
  • Temporarily or permanently disable KASLR in your Linux kernel boot configuration (the nokaslr kernel boot parameter).
  2. Adjust kernel and driver versions:
  • Test specific Linux kernel or NVIDIA driver versions known to handle the KASLR/HMM interaction better.
  3. Engage NVIDIA support:
  • Report this compatibility issue to NVIDIA support and seek further guidance and a long-term resolution.
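If you go the nokaslr route, you can verify which state the running kernel is actually in by checking its boot command line. A minimal sketch; the helper name is mine, and the cmdline_path parameter exists only to make it testable:

```python
from pathlib import Path

def kaslr_disabled(cmdline_path: str = "/proc/cmdline") -> bool:
    """Return True if the kernel was booted with the nokaslr parameter."""
    p = Path(cmdline_path)
    if not p.exists():
        return False
    # /proc/cmdline is a single space-separated line of boot parameters.
    return "nokaslr" in p.read_text().split()

print(kaslr_disabled())
```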

We hope sharing this helps other developers and system administrators facing similar challenges. If you’ve encountered this issue or have found additional solutions, please feel free to share your experience below.

-Hall iNtelligence

I’m having the same error. I tried disabling KASLR. Was there something else that was done as well?

bbaez@b200-temp:/persist/bbaez$ sudo modprobe ib_umad
modprobe: ERROR: could not insert 'ib_umad': Invalid argument
modprobe: ERROR: ../libkmod/libkmod-module.c:1084 command_do() Error running install command '/sbin/modprobe --ignore-install ib_umad && (if [ -x /sbin/mlnx_bf_configure ]; then /sbin/mlnx_bf_configure; fi)' for module ib_umad: retcode 1
modprobe: ERROR: could not insert 'ib_umad': Invalid argument

Error when starting fabric manager:

Jul 24 20:28:03 b200-temp nvidia-fabricmanager-start.sh[102206]: Checking mlx5_0...
Jul 24 20:28:03 b200-temp nvidia-fabricmanager-start.sh[102206]: Using device mlx5_0, port 0x5000e603003ae5a0
Jul 24 20:28:03 b200-temp nvidia-fabricmanager-start.sh[102206]: Detected NVL5+ system
Jul 24 20:28:03 b200-temp nvidia-fabricmanager-start.sh[102206]: Kernel module "ib_umad" has not been loaded, fabric manager cannot be started
Jul 24 20:28:03 b200-temp nvidia-fabricmanager-start.sh[102206]: Please run "modprobe ib_umad" before starting fabric manager
Jul 24 20:28:03 b200-temp systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE

packages installed and nvidia-smi

bbaez@b200-temp:/persist/bbaez$ dpkg -l | grep nvidia
ii  libnvidia-cfg1-575-server:amd64          575.57.08-0ubuntu0.24.04.2  amd64  NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-575-server              575.57.08-0ubuntu0.24.04.2  all    Shared files used by the NVIDIA libraries
ii  libnvidia-compute-575-server:amd64       575.57.08-0ubuntu0.24.04.2  amd64  NVIDIA libcompute package
ii  libnvidia-container-tools                1.17.8-1                    amd64  NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64               1.17.8-1                    amd64  NVIDIA container runtime library
ii  libnvidia-decode-575-server:amd64        575.57.08-0ubuntu0.24.04.2  amd64  NVIDIA Video Decoding runtime libraries
ii  libnvidia-egl-wayland1:amd64             1:1.1.13-1build1            amd64  Wayland EGL External Platform library -- shared library
ii  libnvidia-encode-575-server:amd64        575.57.08-0ubuntu0.24.04.2  amd64  NVENC Video Encoding runtime library
ii  libnvidia-extra-575-server:amd64         575.57.08-0ubuntu0.24.04.2  amd64  Extra libraries for the NVIDIA Server Driver
ii  libnvidia-fbc1-575-server:amd64          575.57.08-0ubuntu0.24.04.2  amd64  NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-575-server:amd64            575.57.08-0ubuntu0.24.04.2  amd64  NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  nvidia-compute-utils-575-server          575.57.08-0ubuntu0.24.04.2  amd64  NVIDIA compute utilities
ii  nvidia-container-toolkit                 1.17.8-1                    amd64  NVIDIA Container toolkit
ii  nvidia-container-toolkit-base            1.17.8-1                    amd64  NVIDIA Container Toolkit Base
ii  nvidia-dkms-575-server-open              575.57.08-0ubuntu0.24.04.2  amd64  NVIDIA DKMS package (open kernel module)
ii  nvidia-driver-575-server-open            575.57.08-0ubuntu0.24.04.2  amd64  NVIDIA driver (open kernel) metapackage
ii  nvidia-fabricmanager-575                 575.57.08-0ubuntu0.24.04.1  amd64  Fabric Manager for NVSwitch based systems.
ii  nvidia-firmware-575-server-575.57.08     575.57.08-0ubuntu0.24.04.2  amd64  Firmware files used by the kernel module
ii  nvidia-headless-no-dkms-575-server-open  575.57.08-0ubuntu0.24.04.2  amd64  NVIDIA headless metapackage - no DKMS (open kernel module)
ii  nvidia-kernel-common-575-server          575.57.08-0ubuntu0.24.04.2  amd64  Shared files used with the kernel module
ii  nvidia-kernel-source-575-server-open     575.57.08-0ubuntu0.24.04.2  amd64  NVIDIA kernel source package
ii  nvidia-utils-575-server                  575.57.08-0ubuntu0.24.04.2  amd64  NVIDIA Server Driver support binaries
ii  xserver-xorg-video-nvidia-575-server     575.57.08-0ubuntu0.24.04.2  amd64  NVIDIA binary Xorg driver

bbaez@b200-temp:/persist/bbaez$ systemctl | grep nvidia
  sys-bus-pci-drivers-nvidia.device    loaded active plugged  /sys/bus/pci/drivers/nvidia
● nvidia-fabricmanager.service         loaded failed failed   NVIDIA fabric manager service
  nvidia-persistenced.service          loaded active running  NVIDIA Persistence Daemon

bbaez@b200-temp:/persist/bbaez$ nvidia-smi
Thu Jul 24 20:20:00 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA B200                    Off |   00000000:1A:00.0 Off |                    0 |
| N/A   29C    P0            196W / 1000W |       0MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA B200                    Off |   00000000:3B:00.0 Off |                    0 |
| N/A   36C    P0            202W / 1000W |       0MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA B200                    Off |   00000000:4C:00.0 Off |                    0 |
| N/A   35C    P0            198W / 1000W |       0MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA B200                    Off |   00000000:5D:00.0 Off |                    0 |
| N/A   28C    P0            197W / 1000W |       0MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA B200                    Off |   00000000:9B:00.0 Off |                    0 |
| N/A   26C    P0            192W / 1000W |       0MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA B200                    Off |   00000000:BB:00.0 Off |                    0 |
| N/A   32C    P0            196W / 1000W |       0MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA B200                    Off |   00000000:CA:00.0 Off |                    0 |
| N/A   33C    P0            196W / 1000W |       0MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA B200                    Off |   00000000:DC:00.0 Off |                    0 |
| N/A   26C    P0            197W / 1000W |       0MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The InfiniBand module is not present on the B200. Looking at our H100, it is there. I performed similar installs on both systems.

I thought DOCA 3.0.0 would provide it, now that MLNX_OFED is deprecated.

b200

bbaez@b200-temp:/persist/bbaez$ sudo find / -iname ib_umad 

H100

bbaez@dgx09:~$ sudo find / -iname ib_umad
/sys/kernel/tracing/events/ib_umad
/sys/kernel/debug/tracing/events/ib_umad
/sys/module/mlx_compat/holders/ib_umad
/sys/module/ib_umad
/sys/module/ib_core/holders/ib_umad

I was able to resolve the CUDA 802 issue with the latest MLNX_OFED package, MLNX_OFED_LINUX-24.10-3.2.5.0-ubuntu24.04-x86_64.tgz, which is deprecated. I think I know the issue: the openibd service needs to start in order to load the module. I automated the whole stack with netbooting on the H100s and forgot about this step. I still need to verify that DOCA 3.0.0 will work by starting openibd, which works on the H100.

Glad you were able to figure it out!

For anyone who is reading this thread, here is a summary of what bbaez encountered and did, as I understand it:


The Problem:

Bbaez encountered errors starting NVIDIA’s fabric manager due to a missing kernel module (ib_umad):

modprobe: ERROR: could not insert 'ib_umad': Invalid argument
Kernel module "ib_umad" has not been loaded, fabric manager cannot be started

Bbaez initially thought disabling KASLR might solve this (as I suggested on the forum earlier), but that wasn’t sufficient in this particular scenario.

The Actual Issue:

The ib_umad kernel module, essential for InfiniBand communication used by NVIDIA’s fabric manager, was not installed or enabled on their B200 system, although it existed on Bbaez’s H100 systems. This module is typically provided by the MLNX_OFED (Mellanox OpenFabrics Enterprise Distribution) software package.

How Bbaez Solved It:

  • Bbaez realized the module was missing because MLNX_OFED wasn’t properly installed on the B200.
  • Bbaez installed the latest MLNX_OFED version:

MLNX_OFED_LINUX-24.10-3.2.5.0-ubuntu24.04-x86_64.tgz

  • Once it was installed, Bbaez started the required service (openibd), which loads the necessary kernel module (ib_umad), resolving the problem and enabling NVIDIA’s fabric manager to function correctly.

Key Takeaway:

  • NVIDIA Fabric Manager and CUDA’s InfiniBand features depend on the proper installation and loading of the InfiniBand kernel modules provided by MLNX_OFED (or the equivalent NVIDIA DOCA stack).
  • Simply disabling KASLR was not sufficient to resolve the ib_umad issue - the MLNX_OFED package must be installed and initialized correctly.

Bbaez’s resolution confirms that a proper MLNX_OFED installation and startup procedure is a critical prerequisite for smooth CUDA operation on systems like the B200 and H100.
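For anyone scripting this, the two prerequisites above can be checked with a small preflight sketch before launching CUDA work. This is my own illustration, not an NVIDIA tool; it reads /proc/modules directly and asks systemd about the fabric manager service, degrading gracefully where systemctl is absent:

```python
import subprocess
from pathlib import Path

def preflight() -> dict:
    """Check the two prerequisites discussed in this thread."""
    results = {}

    # 1. Is the ib_umad kernel module loaded?
    mods = Path("/proc/modules")
    results["ib_umad_loaded"] = mods.exists() and any(
        line.split()[0] == "ib_umad"
        for line in mods.read_text().splitlines()
    )

    # 2. Is the nvidia-fabricmanager service active?
    try:
        rc = subprocess.run(
            ["systemctl", "is-active", "--quiet", "nvidia-fabricmanager"],
            check=False,
        ).returncode
        results["fabricmanager_active"] = rc == 0
    except FileNotFoundError:  # systemctl not available on this host
        results["fabricmanager_active"] = None

    return results

print(preflight())
```

If either check fails on an NVSwitch system, cuInit/cudaGetDeviceCount returning 802 is the expected symptom.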

Good stuff