1. Install CUDA Toolkit

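The CUDA Toolkit includes the nvcc compiler, the CUDA runtime, GPU-accelerated libraries such as cuBLAS and cuFFT, and debugging and profiling tools. cuDNN and TensorRT are separate downloads and are not part of the toolkit.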

  1. Download the CUDA Toolkit from NVIDIA's CUDA Toolkit Archive and install it.

  2. Set up environment variables (adjust the version in the path to match your installation):

    export PATH=/usr/local/cuda-12.2/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH
    
  3. Test the installation

    nvcc --version # Display CUDA version
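
The exports in step 2 only last for the current shell session. A minimal sketch of one way to persist them, assuming bash and that the startup file is passed in by the caller (for example ~/.bashrc); the helper names `add_line_once` and `persist_cuda_env` are made up for illustration, and the default path is simply the one used above:

```shell
# Append a line to a file only if that exact line is not already present,
# so running the script twice does not duplicate entries.
add_line_once() {
  local file="$1" line="$2"
  touch "$file"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Persist the two CUDA paths from step 2.
# $1: startup file to edit; $2: CUDA install directory (assumed default).
persist_cuda_env() {
  local rc_file="$1" cuda_home="${2:-/usr/local/cuda-12.2}"
  add_line_once "$rc_file" "export PATH=${cuda_home}/bin:\$PATH"
  add_line_once "$rc_file" "export LD_LIBRARY_PATH=${cuda_home}/lib64:\$LD_LIBRARY_PATH"
}
```

After defining the functions, `persist_cuda_env ~/.bashrc` followed by `source ~/.bashrc` should make nvcc available in new shells.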
    

2. Install Drivers

  1. Check if the system detects the NVIDIA GPU:

    lspci | grep -i nvidia
    
  2. Use ubuntu-drivers to check for the recommended NVIDIA driver version:

    ubuntu-drivers devices
    
  3. Install the driver. To let the system install the recommended NVIDIA driver automatically:

    sudo ubuntu-drivers autoinstall
    

    If you need to install a specific driver version manually:

    sudo apt install nvidia-driver-535  # The recommended version on my system is 535
    
  4. After installation, reboot the system:

    sudo reboot
    
  5. Check if the NVIDIA driver is loaded:

    nvidia-smi  # "CUDA Version" here is the highest version the driver supports; it should be at least the nvcc version from Part 1, Step 3
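
The "CUDA Version" field in the nvidia-smi header is the highest CUDA version the driver supports, so it can be newer than the toolkit version reported by nvcc. A rough sketch of comparing the two, assuming GNU `sort -V` is available (it is on Ubuntu); the function names are illustrative, and both parsers read the tool's output from stdin:

```shell
# Extract e.g. "12.2" from the "release 12.2," text printed by `nvcc --version`.
toolkit_cuda_version() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Extract the "CUDA Version: X.Y" field from the nvidia-smi header.
driver_cuda_version() {
  sed -n 's/.*CUDA Version: \([0-9][0-9.]*\).*/\1/p'
}

# Succeed when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

A possible use: `version_ge "$(nvidia-smi | driver_cuda_version)" "$(nvcc --version | toolkit_cuda_version)" || echo "driver too old for this toolkit"`.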
    

At this point, the installation should be complete.

Troubleshooting

  1. nvidia-smi may print nothing or fail with an error such as:

    NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
    
  2. In that case, check whether the driver package is installed:

    dpkg -l | grep nvidia
    

    Look for an entry like nvidia-driver-535.

  3. Check if the NVIDIA module is loaded:

    lsmod | grep nvidia
    
  4. If there’s no output, try manually loading it:

    sudo modprobe nvidia
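
Steps 2 and 3 above can be combined into one diagnostic. This is only a sketch: the function names are hypothetical, and the two parsers read `dpkg -l` and `lsmod` output from stdin. Unlike a plain grep, the package check only accepts lines whose status is "ii" (fully installed), so a removed package with leftover configuration does not count.

```shell
# Succeed if the `dpkg -l` listing on stdin shows an installed nvidia-driver-NNN package.
driver_package_installed() {
  awk '$1 == "ii" && $2 ~ /^nvidia-driver-[0-9]+/ { found = 1 } END { exit !found }'
}

# Succeed if the `lsmod` listing on stdin contains the core nvidia module.
nvidia_module_loaded() {
  awk '$1 == "nvidia" { found = 1 } END { exit !found }'
}

# Run both checks against the live system and report the first failure.
diagnose_driver() {
  dpkg -l | driver_package_installed || { echo "no nvidia-driver package installed"; return 1; }
  lsmod | nvidia_module_loaded || { echo "nvidia kernel module not loaded"; return 2; }
  echo "driver package installed and module loaded"
}
```

Calling `diagnose_driver` narrows the problem down to either a missing package or an unloaded module.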
    

If you encounter this error:

    modprobe: ERROR: could not insert 'nvidia': Operation not permitted

it is likely because Secure Boot is enabled.

  1. Check if Secure Boot is enabled:

    mokutil --sb-state
    

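    If it prints SecureBoot enabled, the kernel refuses to load unsigned modules, which includes an unsigned NVIDIA driver. Either disable Secure Boot using one of the methods below, or enroll a Machine Owner Key (MOK) so the driver module can be signed.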

  2. Disabling Secure Boot:

  • Method 1: BIOS Settings

    1. Restart the computer and enter the BIOS/UEFI settings.
    2. Find the Secure Boot option and set it to Disabled.
    3. Save the changes and exit the BIOS.
  • Method 2: MOK Settings

    1. Run the following command to disable Secure Boot validation:

      sudo mokutil --disable-validation

    2. Set a password when prompted.
    3. Reboot the system:

      sudo reboot

    4. During boot, the MOK management interface appears. Choose the option that matches the command you ran:
      • “Continue Boot” to proceed with normal startup.
      • “Enroll MOK” to register keys (if you chose to import keys).
      • “Disable Secure Boot” (if you ran mokutil --disable-validation).
      • “Change Password” to change the password.
  3. Finally, check if the NVIDIA driver is properly loaded:

    nvidia-smi
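
The Secure Boot check from step 1 can be scripted so that it is easy to repeat after each reboot. A small sketch, assuming `mokutil --sb-state` prints a line containing "SecureBoot enabled" or "SecureBoot disabled" as shown above; the function name is illustrative, and it reads the mokutil output from stdin:

```shell
# Classify the text printed by `mokutil --sb-state`.
# Prints "enabled", "disabled", or "unknown" (e.g. on systems without UEFI).
secure_boot_state() {
  local out
  out="$(cat)"
  case "$out" in
    *"SecureBoot enabled"*)  echo enabled ;;
    *"SecureBoot disabled"*) echo disabled ;;
    *)                       echo unknown ;;
  esac
}
```

For example, `mokutil --sb-state 2>&1 | secure_boot_state` prints a single word that a larger script can branch on.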
    

3. NVIDIA Container Toolkit

The NVIDIA Container Toolkit lets you build and run GPU-accelerated applications in containers such as Docker. It includes a container runtime library and utilities that automatically configure containers to use the host's NVIDIA GPUs.

Installation

  1. Check if nvidia-container-toolkit is installed:

    dpkg -l | grep nvidia-container-toolkit
    
  2. Install the toolkit:

    sudo apt-get install -y nvidia-container-toolkit
    

    If you encounter the error:

    E: Unable to locate package nvidia-container-toolkit
    

    Follow the steps below.

  3. Add NVIDIA GPG key:

    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    
  4. Add the NVIDIA container toolkit repository:

    distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list |
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' |
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    
  5. Update the package list:

    sudo apt-get update
    
  6. Install nvidia-container-toolkit:

    sudo apt-get install -y nvidia-container-toolkit
    
  7. Restart Docker service:

    sudo systemctl restart docker
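
Step 4 builds the repository URL from the ID and VERSION_ID fields in /etc/os-release (for example, ubuntu plus 22.04). The sketch below isolates that logic so the resulting URL can be inspected before it is handed to curl; the function names are hypothetical, and the file path is a parameter only so the logic can be tried against a sample file:

```shell
# Print ID followed by VERSION_ID from an os-release file, e.g. "ubuntu22.04".
# The file is sourced in a subshell so its variables do not leak into the caller.
distribution_id() {
  local os_release="${1:-/etc/os-release}"
  ( . "$os_release" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Print the repository list URL that step 4 downloads.
repo_list_url() {
  printf 'https://nvidia.github.io/nvidia-docker/%s/nvidia-docker.list\n' "$(distribution_id "$1")"
}
```

Running `repo_list_url` with no argument prints the URL for the current system, which helps when the download in step 4 fails for an unsupported distribution string.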
    

Usage

  1. Start a container with GPU support. Pass --gpus all to docker run to expose the GPUs, and use -v to mount the host's CUDA directory into the container. For example, if CUDA is installed at /usr/local/cuda on the host:

    docker run --gpus all -it \
      -v /usr/local/cuda:/usr/local/cuda \
      your_docker_image
    
  2. Set up environment variables inside the container to ensure it can find nvcc:

    export PATH=/usr/local/cuda/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
    
  3. Test CUDA:

    nvcc --version
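
Before mounting CUDA into your own image, it can help to confirm that GPU passthrough works at all. A minimal sketch, assuming the stock ubuntu image is acceptable as a throwaway test image (the toolkit makes nvidia-smi available inside the container); the function name and the DOCKER override variable are illustrative additions, not part of Docker:

```shell
# Run nvidia-smi inside a temporary container that is removed on exit.
# $1: image to use (assumed default: ubuntu).
# Setting DOCKER=echo prints the command instead of running it (dry run).
gpu_smoke_test() {
  local image="${1:-ubuntu}"
  "${DOCKER:-docker}" run --rm --gpus all "$image" nvidia-smi
}
```

If `gpu_smoke_test` prints the same GPU table as on the host, the toolkit is working and any remaining problem is specific to your image or mounts.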
    

4. GPU Power Settings

To limit power usage:

    sudo nvidia-smi -i 0 -pl 100  # -i 0 selects the first GPU; -pl 100 caps power at 100 W (requires root)

To restore the original power limit (160 W in this example):

    sudo nvidia-smi -i 0 -pl 160
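
The 100 W and 160 W values are specific to one card. As a sketch, the default and allowed range can be read from the driver with nvidia-smi's `--query-gpu` fields (`power.default_limit`, `power.min_limit`, `power.max_limit`) instead of being hard-coded. The helper names below are hypothetical, and the sketch assumes the GPU reports these fields as numbers rather than N/A:

```shell
# Drop the fractional part of a wattage such as "160.00".
watts_to_int() {
  printf '%s\n' "${1%%.*}"
}

# Succeed if the requested limit lies within [min, max], all in whole watts.
limit_in_range() {
  local want="$1" min="$2" max="$3"
  [ "$want" -ge "$min" ] && [ "$want" -le "$max" ]
}

# Restore the default limit reported by the driver for GPU $1 (default: GPU 0).
restore_default_power_limit() {
  local gpu="${1:-0}" default
  default="$(nvidia-smi -i "$gpu" --query-gpu=power.default_limit --format=csv,noheader,nounits)"
  sudo nvidia-smi -i "$gpu" -pl "$(watts_to_int "$default")"
}
```

Checking a requested value with `limit_in_range` against the queried minimum and maximum avoids an error from nvidia-smi when the value is outside what the card allows.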