I am trying to run TensorRT-LLM bench for a DeepSeek model with a custom Docker image, and below is my docker command. My CUDA (12.8) GPU driver installation is correct and all 8 B200 GPUs are shown correctly in the nvidia-smi output.
I'm afraid I don’t have any quick answers for you, besides the things you will also find when checking the CUDA categories here on the server.
Making sure FabricManager is running properly is one of those, but at least jacehall already verified that.
Driver version compatibility is another thing, but that again is something I trust both of you checked already several times.
Beyond that, I highly recommend contacting Enterprise Support. Given the level of hardware both of you are using, I am sure you have access to that either directly or through your cloud provider.
Thanks, Mark, for your input. It was indeed an issue with the fabric manager for me. Below is the error we were seeing; fixing it resolved the issue for me.
Kernel module ‘ib_umad’ has not been loaded, fabric manager cannot be started
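For anyone hitting the same message, a quick way to confirm whether the module is actually loaded is to look at /proc/modules. The snippet below is only a minimal Python sketch of that check; the function names are made up for illustration and are not part of any NVIDIA tool.

```python
def loaded_modules(proc_modules_text: str) -> set:
    """Return the module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])  # the first column is the module name
    return names


def is_module_loaded(name: str, proc_modules_text: str) -> bool:
    """True if `name` is listed as a loaded kernel module."""
    return name in loaded_modules(proc_modules_text)
```

On a live host you would pass in the contents of /proc/modules, for example `is_module_loaded("ib_umad", open("/proc/modules").read())`. If that returns False, the fabric manager error above is expected.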
Just wanted to get back to you with the solution we eventually figured out. I will post this as a separate thread as well:
CUDA HMM Compatibility Issue with Linux Kernel KASLR (Kernel Address Space Layout Randomization)
Problem Overview:
When deploying CUDA applications that use NVIDIA’s latest CUDA Heterogeneous Memory Management (HMM) features on modern Linux kernels, you may encounter significant stability or performance issues. After investigation, we traced this problem to a compatibility issue between Kernel Address Space Layout Randomization (KASLR) and CUDA’s HMM functionality.
Technical Details:
Kernel Address Space Layout Randomization (KASLR) enhances security by randomly positioning the Linux kernel in memory during boot, making certain exploits harder.
CUDA’s Heterogeneous Memory Management (HMM) enables GPUs and CPUs to transparently share virtual address spaces, crucial for advanced AI workloads and memory coherence.
The randomized kernel memory addresses created by KASLR can conflict with CUDA’s ability to accurately map and maintain shared GPU/CPU memory references. This interaction causes:
Frequent system instability or crashes.
Significant performance degradation.
Potential memory management errors at runtime.
Solution or Workaround:
Currently, the available workarounds are:
Disable KASLR:
Temporarily or permanently disable KASLR in your Linux kernel boot configuration (nokaslr kernel boot parameter).
Adjust kernel and driver versions:
Test specific Linux kernel or NVIDIA driver versions known to handle the KASLR/HMM interaction better.
Engage NVIDIA support:
Report this compatibility issue explicitly to NVIDIA’s support and seek further guidance and long-term resolutions.
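If you try the nokaslr route, it is worth confirming after the reboot that the parameter really reached the kernel. Here is a minimal Python sketch of that check, assuming the kernel command line is read from /proc/cmdline; the function name is hypothetical.

```python
def kaslr_disabled(cmdline: str) -> bool:
    """True if 'nokaslr' appears as its own parameter on the kernel command line."""
    # Split on whitespace so that a longer parameter that merely contains
    # the substring is not mistaken for the real flag.
    return "nokaslr" in cmdline.split()
```

Usage on a running system would be `kaslr_disabled(open("/proc/cmdline").read())`. Note that a False result only means the parameter is absent; it does not prove KASLR is active, since a kernel can also be built without it.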
We hope sharing this helps other developers and system administrators facing similar challenges. If you’ve encountered this issue or have found additional solutions, please feel free to share your experience below.
I was able to resolve the CUDA 802 issue with the latest MLNX_OFED package, MLNX_OFED_LINUX-24.10-3.2.5.0-ubuntu24.04-x86_64.tgz, which is deprecated. I think I know what the issue was: the openibd service needs to start in order to load the module. I automated the whole stack with netbooting on the H100s and forgot about this step. I still need to verify that DOCA 3.0.0 will work once openibd is started, as it does on the H100s.
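For reference, 802 in the CUDA driver API is CUDA_ERROR_SYSTEM_NOT_READY, which is what you get when required driver daemons such as the fabric manager are not running. The following is a minimal Python sketch for probing this outside of any framework by calling cuInit through ctypes. It assumes the driver library is installed as libcuda.so.1; the helper names and the small result table are illustrative only, not a complete list of CUresult values.

```python
import ctypes

# A few CUresult values from the CUDA driver API header (cuda.h).
CU_RESULTS = {
    0: "CUDA_SUCCESS",
    100: "CUDA_ERROR_NO_DEVICE",
    802: "CUDA_ERROR_SYSTEM_NOT_READY",
    803: "CUDA_ERROR_SYSTEM_DRIVER_MISMATCH",
}


def describe(code: int) -> str:
    """Map a CUresult code to its name, falling back to the raw number."""
    return CU_RESULTS.get(code, f"CUresult {code}")


def probe_cuda(lib_name: str = "libcuda.so.1") -> str:
    """Call cuInit(0) and return a readable result instead of raising."""
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError as exc:
        return f"could not load {lib_name}: {exc}"
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    return describe(lib.cuInit(0))
```

Calling `probe_cuda()` on an affected machine should report CUDA_ERROR_SYSTEM_NOT_READY even though nvidia-smi lists all GPUs, which matches the symptom described at the start of this thread.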
For anyone who is reading this thread, here is a summary of what bbaez encountered and did, as I understand it:
The Problem:
Bbaez encountered errors starting NVIDIA’s fabric manager due to a missing kernel module (ib_umad):
modprobe: ERROR: could not insert 'ib_umad': Invalid argument
Kernel module "ib_umad" has not been loaded, fabric manager cannot be started
Bbaez initially thought disabling KASLR might solve this (as I suggested on the forum earlier), but that wasn’t sufficient in this particular scenario.
The Actual Issue:
The ib_umad kernel module, essential for InfiniBand communication used by NVIDIA’s fabric manager, was not installed or enabled on their B200 system, although it existed on Bbaez’s H100 systems. This module is typically provided by the MLNX_OFED (Mellanox OpenFabrics Enterprise Distribution) software package.
How Bbaez Solved It:
• Bbaez realized the module was missing because MLNX_OFED wasn’t properly installed on the B200.
• Bbaez installed the latest MLNX_OFED version (MLNX_OFED_LINUX-24.10-3.2.5.0-ubuntu24.04-x86_64.tgz).
• Once it was installed, Bbaez started the required service (openibd), which loads the necessary kernel module (ib_umad). This resolved the problem and allowed NVIDIA’s fabric manager to start correctly.
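The dependency chain described here (openibd, then ib_umad, then the fabric manager) can be checked in order. Below is a minimal Python sketch under a few assumptions: the host uses systemd, the InfiniBand service unit is named openibd as in this thread, and the fabric manager unit is named nvidia-fabricmanager. The function names are hypothetical.

```python
import subprocess
from typing import List, Optional, Tuple


def service_active(unit: str) -> bool:
    """True if `systemctl is-active` reports the unit as active."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", "--quiet", unit], check=False
        )
    except FileNotFoundError:  # systemctl is not available on this host
        return False
    return result.returncode == 0


def first_failure(checks: List[Tuple[str, bool]]) -> Optional[str]:
    """Return the label of the first failing check, or None if all passed."""
    for label, ok in checks:
        if not ok:
            return label
    return None


def collect_checks() -> List[Tuple[str, bool]]:
    """Gather the three checks on a live host, in dependency order."""
    try:
        with open("/proc/modules") as fh:
            modules = {line.split()[0] for line in fh if line.strip()}
    except OSError:
        modules = set()
    return [
        ("openibd service", service_active("openibd")),
        ("ib_umad module", "ib_umad" in modules),
        ("nvidia-fabricmanager service", service_active("nvidia-fabricmanager")),
    ]
```

Running `first_failure(collect_checks())` points at the earliest broken link, so a result of "openibd service" suggests fixing that before looking at the fabric manager itself.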
Key Takeaway:
NVIDIA Fabric Manager and CUDA’s InfiniBand features depend on the proper installation and loading of InfiniBand kernel modules provided by MLNX_OFED (or the equivalent NVIDIA DOCA stack).
Simply disabling KASLR was not sufficient for resolving issues with ib_umad - the MLNX_OFED package must be installed and initialized correctly.
Bbaez’s resolution confirms that proper MLNX_OFED installation and startup procedures are critical prerequisites for smooth CUDA operation on systems like B200 and H100.