How to Configure NVLink on a Dedicated Server
NVLink is NVIDIA's proprietary high-speed GPU interconnect that replaces the PCIe bus as the primary communication path between GPUs and, in select architectures, between GPUs and CPUs. It delivers up to 900 GB/s of total bidirectional bandwidth per GPU on Hopper-generation hardware, compared to a theoretical maximum of roughly 64 GB/s per direction (about 128 GB/s bidirectional) on PCIe 5.0 x16. For workloads like large language model training, molecular dynamics simulation, or multi-GPU inference, this difference is not marginal; it is architectural.
This guide provides a complete, production-grade walkthrough for configuring NVLink on a dedicated server: from physical hardware installation and driver stack setup to topology verification, application-layer integration, and ongoing performance monitoring.
What NVLink Actually Is, and What It Is Not
NVLink is not simply a faster cable. It is a coherent, point-to-point interconnect fabric built directly into the GPU die. Each NVLink lane carries data in both directions simultaneously using a serialized differential signaling protocol. Multiple lanes are bonded together into a single logical link, and multiple links can connect the same pair of GPUs for additive bandwidth.
Critically, NVLink supports cache-coherent memory access. This means GPU A can read from GPU B's framebuffer memory without staging data through host RAM or the CPU's memory controller. This property is what makes CUDA's "unified memory" programming model, a single virtual address space that spans multiple physical GPU memories, efficient across GPUs.
What NVLink is not: it is not a replacement for NVSwitch in large-scale systems. In configurations with more than two GPUs, NVIDIA uses NVSwitch, a dedicated crossbar switching chip, to provide all-to-all NVLink connectivity. The DGX A100, for example, uses six NVSwitch chips to give each of its eight A100 GPUs full NVLink bandwidth to every other GPU simultaneously. If you are building a two-GPU workstation or a four-GPU server with a supported bridge, you are working with direct NVLink connections. If you are working with eight or more GPUs, you are almost certainly in NVSwitch territory.
NVLink Bandwidth by GPU Generation
Understanding the bandwidth ceiling of your specific hardware is essential before benchmarking or capacity planning.
| GPU Generation | NVLink Version | Links per GPU | Total Bidirectional Bandwidth |
|---|---|---|---|
| Volta (V100) | NVLink 2.0 | 6 | 300 GB/s |
| Turing (RTX 2080 Ti) | NVLink 2.0 | 2 | 100 GB/s |
| Ampere (A100 SXM) | NVLink 3.0 | 12 | 600 GB/s |
| Ampere (RTX 3090) | NVLink 3.0 | 2 | 112.5 GB/s |
| Ada Lovelace (RTX 4090) | Not supported | 0 | None (PCIe 4.0 only) |
| Hopper (H100 SXM) | NVLink 4.0 | 18 | 900 GB/s |
| Blackwell (B200) | NVLink 5.0 | 18 | 1800 GB/s |
PCIe 4.0 x16 delivers approximately 32 GB/s in each direction (about 64 GB/s bidirectional). PCIe 5.0 x16 doubles that to roughly 64 GB/s per direction. Even the consumer NVLink bridge on RTX 3090 cards, at 112.5 GB/s bidirectional, provides nearly double the bandwidth of the PCIe 4.0 x16 slot those cards plug into, and data center GPUs are in a completely different category.
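To make the comparison concrete, the sketch below derives per-link bandwidth from the table above and computes PCIe x16 ceilings from the published signaling rates (16 GT/s for PCIe 4.0, 32 GT/s for PCIe 5.0, with 128b/130b encoding). The dictionary and function names are illustrative assumptions, and the results are theoretical ceilings rather than measured throughput.

```python
# Theoretical bandwidth arithmetic. Names are illustrative, not part of any NVIDIA tool.

# (links per GPU, total bidirectional GB/s), copied from the table above.
NVLINK = {
    "V100": (6, 300.0),
    "A100 SXM": (12, 600.0),
    "H100 SXM": (18, 900.0),
    "B200": (18, 1800.0),
}


def per_link_bidirectional(gpu: str) -> float:
    """Total bidirectional bandwidth divided by the number of links."""
    links, total = NVLINK[gpu]
    return total / links


def pcie_x16_per_direction(gt_per_s: float, lanes: int = 16) -> float:
    """GB/s in one direction: transfers/s x lanes x 128b/130b encoding / 8 bits."""
    return gt_per_s * lanes * (128 / 130) / 8


for gpu in NVLINK:
    print(f"{gpu}: {per_link_bidirectional(gpu):.1f} GB/s per link (bidirectional)")

for gen, rate in (("PCIe 4.0", 16.0), ("PCIe 5.0", 32.0)):
    one_way = pcie_x16_per_direction(rate)
    print(f"{gen} x16: {one_way:.1f} GB/s per direction, "
          f"{2 * one_way:.1f} GB/s bidirectional")
```

When comparing figures from different sources, check whether each one is quoted per direction or bidirectional; mixing the two is the most common source of inflated ratios.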
Prerequisites and Hardware Compatibility
Before touching a single configuration file, confirm the following:
GPU compatibility. NVLink is not available on all NVIDIA GPUs. Among GeForce cards, support is limited to a few high-end Turing and Ampere models: in the Ampere generation only the RTX 3090 and RTX 3090 Ti have an NVLink connector, and no Ada Lovelace GeForce card has one, including the RTX 4080 and RTX 4090. Always verify against NVIDIA's official GPU specification sheet for your exact SKU.
NVLink bridge. For consumer and prosumer GPUs, a physical NVLink bridge connector is required. These bridges are generation-specific β a Turing bridge will not fit an Ampere card. Data center GPUs (A100, H100) in SXM form factor use a proprietary mezzanine board and do not use a discrete bridge.
Motherboard and PCIe slot spacing. The NVLink bridge requires the two GPUs to be in adjacent PCIe x16 slots with a specific physical gap. Most consumer bridges span two slots. Some high-end bridges span three slots. Verify your motherboard's slot pitch against the bridge dimensions before purchasing.
BIOS settings. Enable "Above 4G Decoding" and "Resizable BAR" (also called Smart Access Memory on AMD platforms) in UEFI. Some systems also require disabling CSM (Compatibility Support Module) to allow full PCIe address space allocation for multiple GPUs.
Power delivery. Two high-end GPUs under full NVLink-accelerated load can draw 600W or more combined. Ensure your PSU has sufficient headroom and that the GPU power connectors are on separate rails where possible.
Operating system. This guide covers Linux (Ubuntu 22.04 LTS / Debian 12) as the primary target, which is the standard environment for AI and HPC workloads on Dedicated Servers. Windows-specific steps are noted where they differ significantly.
Step 1: Physical GPU and Bridge Installation
Power down the server completely and disconnect it from mains power. Ground yourself using an ESD wrist strap before handling any PCIe cards.
- Remove the PCIe slot covers for the target slots.
- Insert the first GPU into the primary x16 slot (typically closest to the CPU).
- Insert the second GPU into the adjacent x16 slot, ensuring the physical gap matches your NVLink bridge.
- Seat both cards firmly until the PCIe retention clips click.
- Connect all required PCIe power connectors from the PSU to each GPU. Do not use daisy-chained connectors for high-TDP cards; use separate cables from the PSU.
- Align the NVLink bridge over the gold connector pads on the top edge of both GPUs. Press down firmly and evenly until it seats fully. A partially seated bridge will cause the link to fail silently or operate at reduced width.
- If your GPUs accept more than one NVLink bridge per pair (e.g., the A100 PCIe uses three bridges), install all of them for maximum bandwidth.
- Close the chassis and reconnect power.
Step 2: BIOS and UEFI Configuration
Boot into UEFI setup (typically Del or F2 at POST).
- Enable Above 4G Decoding.
- Enable Resizable BAR if supported.
- Set PCIe link speed to Auto or Gen4/Gen5 as appropriate for your hardware.
- Disable CSM if your OS boots via UEFI.
- Save and exit.
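Before disabling CSM, it helps to confirm that the installed OS really was booted through UEFI. The sketch below is one way to check on Linux, assuming the standard sysfs layout in which `/sys/firmware/efi` exists only on UEFI boots; the function name is illustrative.

```python
from pathlib import Path


def booted_via_uefi(efi_dir: str = "/sys/firmware/efi") -> bool:
    """Linux exposes this directory only when the kernel was started by UEFI firmware."""
    return Path(efi_dir).is_dir()


if booted_via_uefi():
    print("UEFI boot detected: this install should still boot with CSM disabled.")
else:
    print("Legacy/CSM boot (or not Linux): disabling CSM may leave this install unbootable.")
```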
Step 3: Install NVIDIA Drivers on Linux
NVIDIA provides multiple installation paths. The package manager method is preferred for server environments because it integrates with DKMS (Dynamic Kernel Module Support), which automatically rebuilds the kernel module after kernel updates.
First, add the NVIDIA package repository and install the driver:
sudo apt-get update
sudo apt-get install -y software-properties-common
sudo add-apt-repository -y ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install -y nvidia-driver-545 nvidia-dkms-545
Replace 545 with the latest production branch version available for your GPU. You can query available versions with:
apt-cache search nvidia-driver | grep "^nvidia-driver"
After installation, reboot:
sudo reboot
Post-reboot, verify the driver loaded correctly:
nvidia-smi
The output should list both GPUs with their driver version, CUDA version compatibility, and current power state. If only one GPU appears, the second card may not be seated correctly or may have a power delivery issue.
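The "both GPUs are listed" check can be scripted for use in provisioning or CI. The following is a minimal sketch that parses the CSV form of the `nvidia-smi` query interface; the helper names and the sample text are illustrative assumptions, and the demo runs on the embedded sample so it works without a GPU.

```python
import csv
import io
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,name,driver_version", "--format=csv,noheader"]


def parse_gpu_list(csv_text: str) -> list:
    """Turn nvidia-smi CSV output into a list of {index, name, driver} dicts."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [
        {"index": int(r[0]), "name": r[1].strip(), "driver": r[2].strip()}
        for r in rows
        if len(r) == 3
    ]


def query_gpus() -> list:
    """Run nvidia-smi on the local machine (requires the driver to be loaded)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_list(out)


# Illustrative sample only; real names and versions depend on your hardware.
SAMPLE = "0, NVIDIA GeForce RTX 3090, 545.29.06\n1, NVIDIA GeForce RTX 3090, 545.29.06\n"
gpus = parse_gpu_list(SAMPLE)
drivers = sorted({g["driver"] for g in gpus})
print(f"{len(gpus)} GPU(s) found, driver version(s): {drivers}")
if len(gpus) < 2:
    print("Fewer than two GPUs visible: check seating and power before continuing.")
```

On a live host, replace `parse_gpu_list(SAMPLE)` with `query_gpus()`.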
A critical pitfall: If you have Nouveau (the open-source NVIDIA driver) loaded, it will conflict with the proprietary driver. Blacklist it explicitly:
echo -e "blacklist nouveau\noptions nouveau modeset=0" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
sudo update-initramfs -u
sudo reboot
Step 4: Install the CUDA Toolkit
NVLink's full capability, particularly peer-to-peer memory access and collective communications, requires the CUDA toolkit. Install it via the NVIDIA CUDA repository for the most reliable version matching:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-4
Add CUDA binaries and libraries to your shell environment:
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
Verify the installation:
nvcc --version
You should see output identifying the CUDA compiler version. Also run the CUDA sample deviceQuery if you have the samples installed; it will enumerate both GPUs and report NVLink capability flags.
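If you want to assert the toolkit version in a setup script rather than read it by eye, a small parser is enough. This sketch assumes the usual `release X.Y` wording in `nvcc --version` output; the sample text is illustrative and the build string on your system will differ.

```python
import re


def parse_nvcc_release(text: str):
    """Return (major, minor) from `nvcc --version` output, or None if absent."""
    m = re.search(r"release (\d+)\.(\d+)", text)
    return (int(m.group(1)), int(m.group(2))) if m else None


# Illustrative output; capture the real text with subprocess on a live host.
SAMPLE = (
    "nvcc: NVIDIA (R) Cuda compiler driver\n"
    "Cuda compilation tools, release 12.4, V12.4.131\n"
)
release = parse_nvcc_release(SAMPLE)
print("CUDA toolkit release:", release)
```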
Step 5: Verify NVLink Topology and Status
This is the most diagnostically important step. nvidia-smi provides several subcommands specifically for NVLink inspection.
Check the system topology matrix:
nvidia-smi topo -m
The output is a matrix showing the interconnect type between every pair of devices in the system. Look for NV# labels between your GPUs, where # is the number of bonded NVLink links connecting them. A label of NV2 means two links are active between that pair. A label of PIX, PHB, NODE, or SYS means the GPUs are communicating over PCIe and NVLink is not active.
Example output for a correctly configured dual-GPU system:
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV2     0-23            0
GPU1    NV2      X      0-23            0
Check NVLink link status per GPU:
nvidia-smi nvlink --status -i 0
This shows the state of each NVLink port on GPU 0. Active links will show Active state and the negotiated speed.
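The topology matrix is easy to check programmatically. The sketch below parses text shaped like the `nvidia-smi topo -m` example above and reports the label for each GPU pair. The parser is an assumption about the layout (GPU columns first, whitespace-separated), not an official format specification, so validate it against your own output.

```python
import re

ANSI = re.compile(r"\x1b\[[0-9;]*m")  # nvidia-smi may style the header row
GPU = re.compile(r"GPU\d+")


def parse_topo(text: str) -> dict:
    """Map (row_gpu, col_gpu) -> link label such as 'NV2', 'PIX' or 'PHB'."""
    rows = [ANSI.sub("", ln).split() for ln in text.splitlines()]
    rows = [r for r in rows if r]
    # The header is the first row whose leading token is a GPU name.
    header = next(r for r in rows if GPU.fullmatch(r[0]))
    gpus = [tok for tok in header if GPU.fullmatch(tok)]
    links = {}
    for row in rows:
        if row is header or row[0] not in gpus:
            continue  # skip the header, NIC rows and the legend
        for col, label in zip(gpus, row[1:1 + len(gpus)]):
            if col != row[0]:
                links[(row[0], col)] = label
    return links


SAMPLE = """\
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV2     0-23            0
GPU1    NV2      X      0-23            0
"""
links = parse_topo(SAMPLE)
for pair, label in sorted(links.items()):
    kind = "NVLink" if label.startswith("NV") else "PCIe or inter-socket path"
    print(pair, label, kind)
```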
Check NVLink error counters:
nvidia-smi nvlink --errorcounters -i 0
Non-zero replay or recovery error counts indicate a physical layer problem: a partially seated bridge, a damaged connector, or signal integrity issues from inadequate power delivery.
Monitor NVLink throughput in real time:
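nvidia-smi nvlink -gt d
This prints cumulative data throughput counters (transmitted and received) for each link; note that -s is the short form of --status and does not report throughput. For real-time monitoring, combine the command with watch and compare successive readings:
watch -n 1 nvidia-smi nvlink -gt d
Step 6: Enable and Verify Peer-to-Peer Memory Access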
NVLink's coherent memory access requires peer-to-peer (P2P) to be enabled at the CUDA level. You can verify this programmatically:
cat << 'EOF' > check_p2p.py
import subprocess
result = subprocess.run(['nvidia-smi', 'topo', '-p2p', 'r'], capture_output=True, text=True)
print(result.stdout)
EOF
python3 check_p2p.py
Or use a CUDA C program with cudaDeviceCanAccessPeer(). For quick validation, the CUDA samples simpleP2P and p2pBandwidthLatencyTest are the definitive tools. The path below applies to toolkits that install the samples locally; since CUDA 11.6 the samples are distributed through the NVIDIA/cuda-samples repository on GitHub instead, so adjust the path to your checkout:
cd /usr/local/cuda/samples/1_Utilities/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest
The output will show bidirectional bandwidth between GPU pairs. Over NVLink, you should see values consistent with the bandwidth table above. If you see PCIe-level bandwidth (~10-30 GB/s), P2P over NVLink is not active; check that the bridge is fully seated and that no IOMMU settings are blocking peer access.
IOMMU consideration: On AMD EPYC and some Intel Xeon platforms, IOMMU may be enabled by default and can block GPU P2P access. If P2P is not working, add iommu=pt (passthrough mode) or amd_iommu=on iommu=pt to your kernel command line in /etc/default/grub:
sudo nano /etc/default/grub
# Add iommu=pt to GRUB_CMDLINE_LINUX_DEFAULT
sudo update-grub
sudo reboot
Step 7: Configure Deep Learning Frameworks to Use NVLink
Modern frameworks detect NVLink automatically through NCCL (NVIDIA Collective Communications Library), but understanding how to verify and tune this behavior is essential for production deployments.
NCCL environment variables for NVLink optimization:
export NCCL_DEBUG=INFO
export NCCL_P2P_LEVEL=NVL # Use P2P only between GPUs connected by NVLink
export NCCL_SHM_DISABLE=0 # Keep shared memory enabled
export NCCL_SOCKET_IFNAME=eth0 # Specify network interface for multi-node
Setting NCCL_DEBUG=INFO causes NCCL to print its topology detection at runtime. You will see lines like [0] NCCL INFO Channel 00 : 0[...] -> 1[...] via NVL confirming NVLink is being used for inter-GPU transfers. The exact wording of the transport field varies between NCCL versions, so treat this line as a pattern rather than a literal match.
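Reading NCCL debug output by eye gets tedious on long runs. The sketch below counts which transport each channel line reports, using a regular expression modeled on the example line quoted above. The sample log is illustrative only, and since the transport field differs across NCCL versions, treat the pattern as an assumption to adapt.

```python
import re

# Captures source rank, destination rank and transport from NCCL channel lines.
CHANNEL = re.compile(r"NCCL INFO Channel \S+ : (\S+) -> (\S+) via (\S+)")


def transports(log_text: str) -> dict:
    """Count how many channel lines use each transport."""
    counts = {}
    for line in log_text.splitlines():
        m = CHANNEL.search(line)
        if m:
            counts[m.group(3)] = counts.get(m.group(3), 0) + 1
    return counts


# Illustrative log lines in the style quoted above; not captured from a real run.
SAMPLE = """\
[0] NCCL INFO Channel 00 : 0[1000] -> 1[2000] via NVL
[0] NCCL INFO Channel 01 : 0[1000] -> 1[2000] via NVL
[1] NCCL INFO Channel 00 : 1[2000] -> 0[1000] via SHM
"""
counts = transports(SAMPLE)
print(counts)
if any(name.startswith(("SHM", "NET")) for name in counts):
    print("Some intra-node channels are not using a direct GPU path.")
```

To use it on a real job, redirect the job's stderr and stdout to a file while `NCCL_DEBUG=INFO` is set and pass the file contents to `transports()`.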
PyTorch multi-GPU verification:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    for j in range(torch.cuda.device_count()):
        if i != j:
            can_access = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j} P2P access: {can_access}")
If can_device_access_peer returns True for both directions, PyTorch's DataParallel and DistributedDataParallel will use NVLink for gradient synchronization automatically.
TensorFlow multi-GPU check:
import tensorflow as tf
gpus = tf.config.list_physical_devices('GPU')
print(f"Detected GPUs: {len(gpus)}")
for gpu in gpus:
    print(gpu)
# Enable memory growth to prevent TF from allocating all VRAM at startup
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
TensorFlow uses NCCL for collective operations when running with MirroredStrategy, so the NCCL environment variables above apply equally.
Step 8: Benchmark NVLink Performance
Before committing production workloads, establish a performance baseline. This also serves as a regression test after driver updates or hardware changes.
NCCL all-reduce bandwidth test (the most representative benchmark for distributed training):
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make CUDA_HOME=/usr/local/cuda
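./build/all_reduce_perf -b 8 -e 512M -f 2 -g 2
The -g 2 flag specifies two GPUs. Look at the busbw column; this is the effective bus bandwidth. Because busbw reflects traffic in one direction, its ceiling on RTX 3090 cards is the per-direction NVLink rate of about 56 GB/s (half the bidirectional figure in the bandwidth table), so expect values approaching 50 GB/s. Over PCIe, expect 20-30 GB/s.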
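The busbw column is derived from the measured time, not read from hardware. As a rough illustration, the sketch below applies the convention documented by nccl-tests for all-reduce, where bus bandwidth is algorithm bandwidth scaled by 2(n-1)/n for n ranks. The message size and timing in the demo are hypothetical values chosen for the arithmetic, not benchmark results.

```python
def allreduce_bandwidth(size_bytes: int, seconds: float, n_ranks: int):
    """Return (algbw, busbw) in GB/s using the nccl-tests all-reduce convention."""
    algbw = size_bytes / seconds / 1e9
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


# Hypothetical measurement: a 512 MiB all-reduce that took 12 ms.
size = 512 * 1024 * 1024
for ranks in (2, 8):
    algbw, busbw = allreduce_bandwidth(size, 0.012, ranks)
    print(f"{ranks} GPUs: algbw={algbw:.1f} GB/s busbw={busbw:.1f} GB/s")
```

With two GPUs the scaling factor is exactly 1, so algbw and busbw coincide; the factor grows toward 2 as the rank count increases, which is why busbw is the fairer number to compare across node sizes.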
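Memory copy bandwidth sweep with the bandwidthTest demo (this measures host and device transfer bandwidth across a range of sizes, not GEMM throughput):
/usr/local/cuda/extras/demo_suite/bandwidthTest --mode=shmoo
Step 9: Ongoing Monitoring and Alerting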
For production environments, nvidia-smi in daemon mode or integration with Prometheus via dcgm-exporter is the recommended approach.
Install DCGM (Data Center GPU Manager):
sudo apt-get install -y datacenter-gpu-manager
sudo systemctl enable nvidia-dcgm
sudo systemctl start nvidia-dcgm
Query NVLink-specific metrics via DCGM:
dcgmi dmon -e 1011,1012
Field IDs 1011 and 1012 are the NVLink transmit and receive bandwidth counters (DCGM_FI_PROF_NVLINK_TX_BYTES and DCGM_FI_PROF_NVLINK_RX_BYTES). These can be exported to Prometheus and visualized in Grafana for long-term trend analysis.
For lighter-weight monitoring, a simple nvidia-smi loop captures the essentials:
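nvidia-smi dmon -s pucvmet -d 5
The -s pucvmet selection reports power and temperature (p), utilization (u), clocks (c), power and thermal violations (v), framebuffer memory usage (m), ECC and PCIe replay errors (e), and PCIe throughput (t), sampled at 5-second intervals.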
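For alerting, the NVLink error counters from Step 5 are usually the most actionable signal. The sketch below scans text in the style of `nvidia-smi nvlink --errorcounters` and returns every non-zero counter. The line layout it expects (`Link N: <Kind> Errors: <count>` under a `GPU N:` header) is an assumption based on typical output, so check it against your driver version before relying on it.

```python
import re

COUNTER = re.compile(r"Link (\d+): (\w+) Errors: (\d+)")
GPU_HEADER = re.compile(r"GPU (\d+):")


def nonzero_errors(text: str) -> list:
    """Return (gpu, link, kind, count) tuples for every non-zero counter."""
    problems, gpu = [], None
    for line in text.splitlines():
        head = GPU_HEADER.match(line.strip())
        if head:
            gpu = int(head.group(1))
            continue
        m = COUNTER.search(line)
        if m and int(m.group(3)) > 0:
            problems.append((gpu, int(m.group(1)), m.group(2), int(m.group(3))))
    return problems


# Illustrative sample; the UUID is a placeholder.
SAMPLE = """\
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx)
    Link 0: Replay Errors: 0
    Link 0: Recovery Errors: 0
    Link 1: Replay Errors: 3
"""
for gpu, link, kind, count in nonzero_errors(SAMPLE):
    print(f"ALERT: GPU {gpu} link {link} has {count} {kind.lower()} error(s)")
```

A cron job or systemd timer could run the real command, feed its output to `nonzero_errors()`, and forward any results to your alerting channel.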
NVLink vs. PCIe vs. NVSwitch: When Each Architecture Applies
| Scenario | Recommended Interconnect | Rationale |
|---|---|---|
| 2-GPU consumer workstation | NVLink bridge | Cost-effective, 2x PCIe bandwidth |
| 2-4 GPU prosumer server | NVLink bridge (if supported) | Meaningful bandwidth gain for training |
| 8-GPU data center node | NVSwitch fabric | All-to-all connectivity, no bottleneck |
| Multi-node distributed training | InfiniBand + NVLink | NVLink within node, IB across nodes |
| Inference serving (latency-critical) | NVLink | Reduces inter-GPU synchronization latency |
| Video transcoding (embarrassingly parallel) | PCIe sufficient | No inter-GPU communication needed |
Common Failure Modes and Troubleshooting
NVLink not detected after physical installation. Run nvidia-smi topo -m and check for PIX instead of NV#. Re-seat the bridge. Check that both GPUs are on the same PCIe root complex; GPUs on different CPU sockets connected via QPI/UPI will not form an NVLink pair even with a bridge installed.
P2P bandwidth matches PCIe speeds despite NVLink bridge. IOMMU is almost certainly blocking peer access. Apply the iommu=pt kernel parameter as described above.
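Before rebooting to change the setting, you can confirm whether passthrough mode is already in effect by inspecting the running kernel's command line. This is a minimal sketch that reads `/proc/cmdline` on Linux; the helper names are illustrative, and the demo uses a sample string so it runs anywhere.

```python
from pathlib import Path


def kernel_params(cmdline: str) -> dict:
    """Split a kernel command line into {key: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def iommu_passthrough(cmdline: str) -> bool:
    """True when iommu=pt is present on the command line."""
    return kernel_params(cmdline).get("iommu") == "pt"


def read_cmdline(path: str = "/proc/cmdline") -> str:
    return Path(path).read_text()


# Illustrative command line; use iommu_passthrough(read_cmdline()) on a live host.
SAMPLE = "BOOT_IMAGE=/vmlinuz root=/dev/sda2 ro quiet amd_iommu=on iommu=pt"
print("iommu=pt active:", iommu_passthrough(SAMPLE))
```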
NVLink errors accumulating in nvidia-smi nvlink --errorcounters. Physical layer issue. Inspect the bridge connector pads for debris or damage. Try reseating the bridge. If errors persist, the bridge itself may be faulty.
NCCL not using NVLink despite topology showing NV2. Set NCCL_P2P_LEVEL=NVL explicitly. Also verify NCCL version compatibility with your CUDA version, since mismatches cause NCCL to fall back to shared memory or socket transport.
Driver installation fails with DKMS errors. The kernel headers for your running kernel may not be installed. Fix with:
sudo apt-get install -y linux-headers-$(uname -r)
sudo dkms autoinstall
Choosing the Right Server Infrastructure for NVLink Workloads
NVLink configuration is only as effective as the underlying server platform. Several infrastructure factors directly affect NVLink performance in practice:
PCIe topology. On dual-socket EPYC or Xeon platforms, PCIe lanes are distributed across both CPUs. GPUs connected to different CPUs communicate via the inter-socket fabric (Infinity Fabric or UPI), which adds latency and reduces effective bandwidth for GPU-to-GPU transfers that must cross the socket boundary. Whenever possible, install NVLink-paired GPUs on PCIe slots attached to the same CPU socket.
Memory bandwidth. Even with NVLink handling GPU-to-GPU transfers, the CPU's memory subsystem remains the bottleneck for data ingestion. High-bandwidth DDR5 or HBM-equipped platforms reduce the time spent staging data before it reaches the GPU.
Storage throughput. Large model checkpoints and training datasets require fast storage. NVMe SSDs with sequential read speeds above 7 GB/s prevent the storage layer from becoming the bottleneck during data loading.
Cooling. Two high-TDP GPUs under sustained NVLink-accelerated load generate substantial heat. Ensure adequate airflow or liquid cooling capacity. Thermal throttling will reduce GPU clock speeds and negate the bandwidth advantage NVLink provides.
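The PCIe topology point above can be checked from sysfs: Linux reports the NUMA node of each PCI device in a `numa_node` file. The sketch below compares that value across GPUs. It assumes you obtain bus IDs from `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader`, which prints an 8-digit domain that has to be shortened for sysfs; the helper names are illustrative, and the demo builds a fake sysfs tree so it runs without hardware.

```python
import tempfile
from pathlib import Path

SYSFS = "/sys/bus/pci/devices"


def sysfs_address(bus_id: str) -> str:
    """nvidia-smi prints an 8-digit PCI domain; sysfs uses 4 digits, lowercase."""
    domain, rest = bus_id.strip().split(":", 1)
    return f"{domain[-4:]}:{rest}".lower()


def numa_node(pci_address: str, sysfs_root: str = SYSFS) -> int:
    """NUMA node the kernel reports for a PCI device (-1 means unknown)."""
    return int((Path(sysfs_root) / pci_address / "numa_node").read_text().strip())


def same_socket(addresses, sysfs_root: str = SYSFS) -> bool:
    """True when every device reports the same, known NUMA node."""
    nodes = {numa_node(a, sysfs_root) for a in addresses}
    return len(nodes) == 1 and -1 not in nodes


# Demo against a temporary fake sysfs tree with GPUs on two different nodes.
with tempfile.TemporaryDirectory() as root:
    for addr, node in (("0000:17:00.0", "0"), ("0000:65:00.0", "1")):
        dev = Path(root) / addr
        dev.mkdir()
        (dev / "numa_node").write_text(node + "\n")
    print("Same socket:", same_socket(["0000:17:00.0", "0000:65:00.0"], root))
```

On single-socket systems the kernel often reports -1 for every device, which this sketch treats as "unknown" rather than as a match.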
For teams running multi-GPU AI training or HPC simulations, Dedicated Servers with NVMe storage and root access provide the hardware control necessary to implement the full configuration described in this guide. For GPU-accelerated workloads specifically, GPU Hosting offers pre-configured environments with NVIDIA drivers already installed. Teams that need a flexible base for custom CUDA environments may also find VPS Hosting useful for development and testing before scaling to dedicated hardware.
Key Takeaways and Decision Checklist
Before deploying NVLink in production, verify each item:
- Hardware confirmed: Both GPUs are on NVIDIA's NVLink compatibility list for your specific SKU, not just the product family.
- Bridge generation matched: The NVLink bridge generation matches the GPU generation (Turing bridge for Turing GPUs, Ampere bridge for Ampere GPUs).
- Physical installation verified: nvidia-smi topo -m shows NV1 or NV2 between GPU pairs, not PIX or PHB.
- P2P access confirmed: p2pBandwidthLatencyTest reports NVLink-level bandwidth (not PCIe-level).
- IOMMU addressed: If running on EPYC or Xeon, iommu=pt is set in kernel parameters.
- NCCL transport confirmed: NCCL_DEBUG=INFO output shows via NVL for inter-GPU channels.
- Error counters clean: nvidia-smi nvlink --errorcounters shows zero replay and recovery errors after a burn-in test.
- Monitoring active: DCGM or nvidia-smi dmon is logging NVLink bandwidth and error metrics to a persistent store.
- Thermal headroom confirmed: Both GPUs sustain target clock speeds under full load without thermal throttling.
- Driver and CUDA versions pinned: Production environments use pinned driver versions managed through DKMS to prevent unintended updates from breaking the configuration.
Frequently Asked Questions
Does NVLink work on all NVIDIA RTX cards?
No. NVLink support varies significantly even within the RTX lineup. The RTX 3090 supports NVLink, but no GeForce card in the Ada Lovelace generation does: neither the RTX 4080 nor the RTX 4090 has an NVLink connector. Always verify against the specific GPU's datasheet, not the product family.
Can NVLink be used across different GPU models?
In general, no. NVLink requires both GPUs to be the same model and generation. NVIDIA does not officially support mixed-model NVLink configurations, and the driver stack will not form a peer-to-peer NVLink relationship between dissimilar GPUs even if the physical connectors are compatible.
What happens if the NVLink bridge is removed while the system is running?
The system will not crash immediately, but any active P2P transfers over NVLink will fail, which will typically cause the running CUDA application to throw a CUDA error and terminate. The GPUs will fall back to PCIe for subsequent operations. Hot-removal of the bridge is not supported and risks physical damage to the connector pads.
Is NVLink automatically used by PyTorch and TensorFlow, or does it require explicit configuration?
Both frameworks use NCCL for multi-GPU collective operations, and NCCL detects NVLink topology automatically. However, you should always verify with NCCL_DEBUG=INFO that NCCL is actually selecting the NVLink transport path. In some configurations, particularly with IOMMU enabled or mismatched NCCL/CUDA versions, NCCL will silently fall back to slower transports.
How do I tell if NVLink is actually improving my training throughput?
Run your training job with NCCL_P2P_DISABLE=1 (forces PCIe) and then without it (allows NVLink). Compare iteration time or samples per second. For communication-heavy workloads like large transformer training with frequent all-reduce operations, NVLink typically reduces inter-GPU synchronization time by 40-70% compared to PCIe, translating directly into faster epoch times.
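The A/B comparison can be wrapped in a small harness so both runs use identical commands. The sketch below times a command with and without `NCCL_P2P_DISABLE=1` and reports the percentage reduction. The command shown is a do-nothing placeholder so the sketch runs anywhere; substitute your own training launch command, and prefer several repetitions over a single run, since wall-clock timings are noisy.

```python
import os
import subprocess
import sys
import time


def timed_run(cmd, extra_env=None) -> float:
    """Run cmd with extra environment variables and return wall-clock seconds."""
    env = dict(os.environ, **(extra_env or {}))
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start


def percent_reduction(baseline_s: float, improved_s: float) -> float:
    """How much shorter the improved run is, as a percentage of the baseline."""
    return 100.0 * (baseline_s - improved_s) / baseline_s


# Placeholder command (hypothetical stand-in for a real training job).
CMD = [sys.executable, "-c", "pass"]
p2p_off_s = timed_run(CMD, {"NCCL_P2P_DISABLE": "1"})
p2p_on_s = timed_run(CMD)
print(f"P2P disabled: {p2p_off_s:.3f}s, P2P enabled: {p2p_on_s:.3f}s")
print(f"Reduction: {percent_reduction(p2p_off_s, p2p_on_s):.1f}%")
```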
