You are tasked with validating a newly installed NVIDIA A100 Tensor Core GPU within a server. You need to confirm the GPU is correctly recognized and functioning at its expected performance level. Describe the process, including commands and tools, to verify the following aspects: 1) GPU presence and basic information, 2) PCIe bandwidth and link speed, and 3) Sustained computational performance under load.
A. 1) Use ‘lspci | grep NVIDIA’ for presence, ‘nvidia-smi’ for basic info.
2) Use ‘nvidia-smi -q -d PCIE’ for bandwidth/speed.
3) Run a TensorFlow ResNet50 benchmark.
B. 1) Use ‘nvidia-smi’ for presence and basic info.
2) PCIe speed is irrelevant.
3) Run the ‘nvprof’ profiler during a CUDA application.
C. 1) Check BIOS settings for GPU detection.
2) Use ‘lspci -vv’ to check PCIe speed.
3) Run a PyTorch ImageNet training script.
D. 1) Use ‘nvidia-smi’ for presence and basic info.
2) Use ‘nvidia-smi -q -d PCIE’ for bandwidth/speed.
3) Run a CUDA-based matrix multiplication benchmark (e.g., using cuBLAS) with increasing matrix sizes and monitor performance.
E. 1) Use ‘nvidia-smi’ for presence and basic info.
2) Use ‘nvlink-monitor’ for bandwidth/speed.
3) Run a CPU-bound benchmark to avoid GPU bottlenecks.
Explanation:
The correct answer is D. ‘nvidia-smi’ is the primary tool for confirming that the GPU is recognized and for reading basic information such as driver version, memory, temperature, and power draw. ‘nvidia-smi -q -d PCIE’ reports the PCIe link generation and width; an A100 in a Gen4 x16 slot should negotiate Gen4 at x16. A CUDA-based cuBLAS matrix multiplication benchmark isolates raw GPU compute and sustains load long enough to expose thermal or power throttling. The other options contain elements of truth but are incomplete or poorly targeted: framework-level benchmarks such as ResNet50 or ImageNet training (A, C) add data-pipeline and framework overhead that obscures the GPU itself, option B wrongly dismisses PCIe verification and relies on a profiler rather than a sustained load, ‘nvlink-monitor’ (E) is not a standard NVIDIA tool, and a CPU-bound benchmark would not exercise the GPU’s capabilities at all.
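As a concrete sketch of step 3 in option D, the listing below is a minimal cuBLAS SGEMM benchmark that sweeps increasing matrix sizes and reports achieved FP32 TFLOP/s. The matrix sizes, repetition count, and file name are illustrative assumptions, and error handling is omitted for brevity. Compile with something like ‘nvcc gemm_bench.cu -o gemm_bench -lcublas’ and watch ‘nvidia-smi’ in a second terminal while it runs to confirm sustained utilization, expected clocks, and power draw.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        // Error checking is omitted for brevity; a production test should
        // check every CUDA and cuBLAS return code.
        cublasHandle_t handle;
        cublasCreate(&handle);

        const int sizes[] = {1024, 2048, 4096, 8192, 16384};  // illustrative size sweep
        const int num_sizes = sizeof(sizes) / sizeof(sizes[0]);
        const float alpha = 1.0f, beta = 0.0f;
        const int reps = 10;  // GEMMs timed per size

        for (int i = 0; i < num_sizes; ++i) {
            const int n = sizes[i];
            const size_t bytes = (size_t)n * n * sizeof(float);

            float *dA, *dB, *dC;
            cudaMalloc((void **)&dA, bytes);
            cudaMalloc((void **)&dB, bytes);
            cudaMalloc((void **)&dC, bytes);
            cudaMemset(dA, 0, bytes);  // contents do not affect GEMM throughput
            cudaMemset(dB, 0, bytes);

            // Warm-up call so GPU clocks ramp up before timing.
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                        &alpha, dA, n, dB, n, &beta, dC, n);
            cudaDeviceSynchronize();

            // Time a batch of GEMMs with CUDA events.
            cudaEvent_t start, stop;
            cudaEventCreate(&start);
            cudaEventCreate(&stop);
            cudaEventRecord(start);
            for (int r = 0; r < reps; ++r) {
                cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                            &alpha, dA, n, dB, n, &beta, dC, n);
            }
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);

            // One n x n GEMM performs roughly 2 * n^3 floating-point operations.
            const double tflops = 2.0 * n * n * n * reps / (ms / 1000.0) / 1e12;
            printf("N = %5d  avg %.2f ms/GEMM  ~%.1f TFLOP/s (FP32)\n",
                   n, ms / reps, tflops);

            cudaEventDestroy(start);
            cudaEventDestroy(stop);
            cudaFree(dA); cudaFree(dB); cudaFree(dC);
        }

        cublasDestroy(handle);
        return 0;
    }

Throughput should rise with matrix size and plateau near the GPU’s rated FP32 figure; results far below expectations, or numbers that fall off over repeated runs, point to thermal, power, or PCIe issues that the tools from steps 1 and 2 can help diagnose. To exercise the Tensor Cores specifically, the same loop can be adapted to reduced-precision GEMM, for example via ‘cublasGemmEx’ with FP16 inputs.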