Working with the CUDA Environment
Not all machines have CUDA-capable video cards in them (the FastX jump hosts specifically do NOT have GPUs installed; after logging in via FastX you must ssh to a bigjobs lab computer: ssh bigjobs). Use the lspci command to determine whether an NVIDIA GPU is installed, or run nvidia-smi to query the driver directly:
$ nvidia-smi
[user@l-lnx103 cuda_samples]$ lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GM107GL [Quadro K620] (rev a2)
Where is the documentation?
CUDA (/usr/local/cuda/doc/html/index.html)
CUDA Downloads
CUDA Toolkit Documentation
NVIDIA's CUDA Forum
Release Notes (/usr/local/cuda/doc/html/cuda-toolkit-release-notes/index.html)
SDK Sample programs: /usr/local/cuda/samples
How do I build the examples that are included with the SDK?
You'll need to make a copy of /usr/local/cuda/samples and set up your shell environment for an older, side-loaded gcc 4.9.2 compiler that is compatible with the CUDA libraries:
[user@l-lnx103:~]$ cp -a /usr/local/cuda/samples ~/
[user@l-lnx103:~]$ cd samples
# setup your environment with version 4.9.2 of gcc to correctly build the samples
[user@l-lnx103:~]$ scl enable devtoolset-3 bash
[user@l-lnx103:~/samples]$ make
[user@l-lnx103:~]$ export PATH=$PATH:~/samples/bin/x86_64/linux/release # for bash users
[user@l-lnx103:~]$ setenv PATH ${PATH}:~/samples/bin/x86_64/linux/release # for tcsh users
[user@l-lnx103:~]$ deviceQuery
If you followed the above example, everything is built and the executables are in ~/samples/bin/x86_64/linux/release.
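Once your environment is set up this way, you can also compile your own CUDA code with nvcc. A minimal sketch (the file name hello.cu is hypothetical):

```cuda
// hello.cu -- a minimal sketch; build with: nvcc -o hello hello.cu
#include <cstdio>

__global__ void hello_kernel()
{
    // Device-side printf requires compute capability 2.0 or higher
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    hello_kernel<<<2, 4>>>();    // launch 2 blocks of 4 threads
    cudaDeviceSynchronize();     // wait for the kernel (and its printf) to finish
    return 0;
}
```

Remember to run nvcc from a shell where devtoolset-3 is enabled, as shown above.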
What version of the toolkit is installed?
Look at the /usr/local/cuda symlink or use the rpm command. On this workstation, several toolkit versions are installed, and the symlink points at version 10.1:
[user@l-lnx103 cuda_samples]$ ls -l /usr/local/cuda
lrwxrwxrwx. 1 root root 8 Jun 25 12:58 /usr/local/cuda -> cuda-10.1
[user@l-lnx103 cuda_samples]$ rpm -qav|grep cuda-toolkit
cuda-toolkit-10-0-10.0.130-1.x86_64
cuda-toolkit-10-1-10.1.168-1.x86_64
cuda-toolkit-9-2-9.2.148-1.x86_64
cuda-toolkit-8-0-8.0.61-1.x86_64
How can I tell if the card supports double-precision floating point numbers?
If you built the examples, there's an executable named deviceQuery that will tell you all about the card. For double precision, the card must have compute capability 1.3 or higher. A description of each compute capability level is available in Appendix G of the CUDA C Programming Guide.
Here's sample output of deviceQuery; note the "CUDA Capability Major/Minor version number" line:
[user@l-lnx103:samples]$ bin/x86_64/linux/release/deviceQuery
bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Quadro K620"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 2000 MBytes (2097414144 bytes)
( 3) Multiprocessors, (128) CUDA Cores/MP: 384 CUDA Cores
GPU Max Clock rate: 1124 MHz (1.12 GHz)
Memory Clock rate: 900 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Quadro K620
Result = PASS
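You can also check the compute capability programmatically with the CUDA runtime API. A sketch (the file name cc_check.cu is hypothetical):

```cuda
// cc_check.cu -- query compute capability; build with: nvcc -o cc_check cc_check.cu
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA-capable device found\n");
        return 1;
    }
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    // Double precision requires compute capability 1.3 or higher
    if (prop.major > 1 || (prop.major == 1 && prop.minor >= 3))
        printf("Double precision supported\n");
    return 0;
}
```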
How do I compile my code to use double precision?
By default, nvcc targets a low compute capability, and on devices below compute capability 1.3 doubles are demoted to floats. To make sure double precision is used, pass the architecture flag for your card to nvcc, e.g. -arch=sm_13 (the minimum for doubles) or -arch=sm_50 for the Quadro K620 above. Please see the CUDA FAQ for more information.
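As a quick sanity check, here is a sketch of a kernel that actually uses doubles (the file name dadd.cu and the sm_50 architecture are illustrative; substitute your card's compute capability):

```cuda
// dadd.cu -- build with: nvcc -arch=sm_50 -o dadd dadd.cu
#include <cstdio>

__global__ void dadd(const double *a, const double *b, double *c)
{
    c[0] = a[0] + b[0];   // performed in double precision on sm_13 and later
}

int main()
{
    double ha = 0.1, hb = 0.2, hc = 0.0;
    double *da, *db, *dc;
    cudaMalloc(&da, sizeof(double));
    cudaMalloc(&db, sizeof(double));
    cudaMalloc(&dc, sizeof(double));
    cudaMemcpy(da, &ha, sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(db, &hb, sizeof(double), cudaMemcpyHostToDevice);
    dadd<<<1, 1>>>(da, db, dc);
    cudaMemcpy(&hc, dc, sizeof(double), cudaMemcpyDeviceToHost);
    // With true double precision, 0.1 + 0.2 shows the familiar
    // rounding residue; with demotion to float, fewer digits survive.
    printf("%.17g\n", hc);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```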
deviceQuery gave me an error about API mismatch, what's wrong?
[user@l-lnx103] bin/linux/release/deviceQuery
bin/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Error: API mismatch: the NVIDIA kernel module has version 260.19.36,
but this NVIDIA driver component has version 260.19.29. Please make
sure that the kernel module and all NVIDIA driver components
have the same version.
cudaGetDeviceCount FAILED CUDA Driver and Runtime version may be mismatched.
FAILED
Press <Enter> to Quit...
-----------------------------------------------------------
From time to time we update the NVIDIA drivers on the system to fix various issues. When that happens, you'll need to rebuild deviceQuery and any code that you linked against a previous version of the driver library libcuda.so (which is provided by the driver, not by the toolkit).
What other issues do I need to be aware of?
If the computer is using the video card as a display (there's a graphical login screen), you will be limited to 5 seconds of time per kernel execution (look for the line "Run time limit on kernels" in the output of deviceQuery to see if you'll be limited).
If you are using the computer remotely via ssh and someone sits down at the computer and logs in, you'll no longer be able to use the video card to run CUDA executables. The program deviceQuery will return "There is no device supporting CUDA".
If multiple users are trying to run CUDA programs at the same time, there may be contention problems. It appears to be a first-come, first-served situation. If the first user allocates all of the memory on the video card, no one else will be able to run programs until the first user finishes. If the card can accommodate the needs of multiple programs (memory and processing), then it will run all programs simultaneously. Otherwise, you'll have to wait. If you want to be sure that you can run your code, you'll have to go to the lab and sit at the computer. You will then get exclusive access to the video card, but you will be limited to 5 seconds per kernel execution.
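If you suspect another user has the card's memory tied up, you can check free memory before allocating. A sketch using the runtime API (the file name meminfo.cu is hypothetical):

```cuda
// meminfo.cu -- build with: nvcc -o meminfo meminfo.cu
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        printf("cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("GPU memory: %zu MB free of %zu MB total\n",
           free_bytes >> 20, total_bytes >> 20);
    return 0;
}
```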