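gpu-burn checks whether a GPU produces computation errors, and it can also be used for stress testing.
Linux GPU stress test
First install the NVIDIA driver and CUDA, set the CUDA environment variables, and run nvcc -V to check the CUDA version.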
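Before downloading anything, it can help to confirm that the prerequisites are actually on PATH. The following is a minimal sketch, not part of gpu-burn: `check_tool` is a hypothetical helper name, and /usr/local/cuda/bin is only the usual default install location, so it may differ on your machine.

```shell
#!/usr/bin/env bash
# check_tool NAME: report whether a command is on PATH (hypothetical helper).
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# nvidia-smi ships with the driver, nvcc with the CUDA toolkit.
check_tool nvidia-smi || echo "Install the NVIDIA driver first."
# Assumption: the toolkit lives in /usr/local/cuda; adjust the hint if yours does not.
check_tool nvcc || echo "Add the CUDA bin directory (often /usr/local/cuda/bin) to PATH."
```

If both tools are reported as found, nvcc -V should print the CUDA version as described above.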
Download the tool
wget -O gpu-burn-master.zip https://codeload.github.com/wilicc/gpu-burn/zip/master
Extract
unzip gpu-burn-master.zip
Compile
cd gpu-burn-master
make
ll
-rw-r--r--. 1 root root 2681 Apr 9 2024 compare.cu
-rw-r--r--. 1 root root 6104 Apr 14 16:33 compare.ptx
-rw-r--r--. 1 root root 341 Apr 9 2024 Dockerfile
-rwxr-xr-x. 1 root root 83232 Apr 14 16:33 gpu_burn
-rw-r--r--. 1 root root 962 Apr 9 2024 gpu-burn.8
-rw-r--r--. 1 root root 31595 Apr 9 2024 gpu_burn-drv.cpp
-rw-r--r--. 1 root root 105408 Apr 14 16:33 gpu_burn-drv.o
-rw-r--r--. 1 root root 1318 Apr 9 2024 LICENSE
-rw-r--r--. 1 root root 1469 Apr 9 2024 Makefile
-rw-r--r--. 1 root root 1801 Apr 9 2024 README.md
After a successful build, the gpu_burn executable is generated (along with compare.ptx and gpu_burn-drv.o).
Test
./gpu_burn 100
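By default all GPUs are tested; the argument 100 sets the test duration to 100 seconds. Use the -d option to run the test with double-precision floating point: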
./gpu_burn -d 100
Testing a specific GPU on a multi-GPU machine
CUDA_VISIBLE_DEVICES=1 ./gpu_burn 100
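On a multi-GPU machine it can be useful to burn each card separately, so that a failure is tied to one device. The sketch below wraps the command above in a loop. `burn_each`, the `DRY_RUN` switch, and the gpu_burn_N.log file names are all hypothetical choices for illustration, not part of gpu-burn.

```shell
#!/usr/bin/env bash
# burn_each SECONDS INDEX...: run gpu_burn on each listed GPU, one at a time.
# Set DRY_RUN=1 to print the commands instead of running them.
burn_each() {
    local seconds=$1
    shift
    local idx
    for idx in "$@"; do
        if [ -n "${DRY_RUN:-}" ]; then
            echo "CUDA_VISIBLE_DEVICES=$idx ./gpu_burn $seconds"
        else
            # One log file per GPU (hypothetical naming scheme).
            CUDA_VISIBLE_DEVICES=$idx ./gpu_burn "$seconds" > "gpu_burn_$idx.log" 2>&1
        fi
    done
}

# Example: show what would run for GPUs 0 and 1 with a 100-second burn.
DRY_RUN=1 burn_each 100 0 1
```

Note that when CUDA_VISIBLE_DEVICES exposes a single card, CUDA renumbers it, so the log will refer to that card as GPU 0 regardless of its physical index.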
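Gflop/s: GPU compute throughput
errors: number of incorrect results detected; any non-zero value indicates a computation problem
temps: GPU temperature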
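To track these three fields over a long run, the progress lines can be reduced to key=value pairs. This is a rough sketch that assumes single-GPU progress lines shaped like the ones in this post; `parse_progress` is a hypothetical helper, and multi-GPU lines (which list several values per field) are not handled.

```shell
#!/usr/bin/env bash
# parse_progress: read gpu_burn output on stdin, print one summary per progress line.
# Assumes the single-GPU line format shown in this post.
parse_progress() {
    # Split on carriage returns too, in case progress updates overwrite each other.
    tr '\r' '\n' |
        sed -n 's/^.*(\([0-9]*\) Gflop\/s).*errors: \([0-9]*\).*temps: \([0-9]*\) C.*$/gflops=\1 errors=\2 temps=\3/p'
}

# Example with a line copied from the faulty-GPU log in this post.
echo "11.0% proc'd: 150 (3392 Gflop/s) errors: 30472096 (WARNING!) temps: 63 C" | parse_progress
```

Lines that do not match the pattern, such as the header and the "Summary at" timestamps, are silently skipped.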
Test result indicating a faulty GPU:
[root@localhost gpu-burn-master]# ./gpu_burn 500
Using compare file: compare.ptx
Burning for 500 seconds.
GPU 0: Tesla T4 (UUID: GPU-899fdfea-f386-7371-8495-9a98620768fb)
Initialized device 0 with 14913 MB of memory (14798 MB available, using 13318 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 50 iterations
11.0% proc'd: 150 (3392 Gflop/s) errors: 30472096 (WARNING!) temps: 63 C
Summary at: Mon Apr 14 04:37:04 PM CST 2025

22.0% proc'd: 300 (3298 Gflop/s) errors: 30537676 (WARNING!) temps: 66 C
Summary at: Mon Apr 14 04:37:59 PM CST 2025

32.2% proc'd: 450 (3194 Gflop/s) errors: 30482853 (WARNING!) temps: 67 C
Summary at: Mon Apr 14 04:38:50 PM CST 2025

43.2% proc'd: 600 (3138 Gflop/s) errors: 30498686 (WARNING!) temps: 69 C
Summary at: Mon Apr 14 04:39:45 PM CST 2025

54.2% proc'd: 750 (3093 Gflop/s) errors: 30404967 (WARNING!) temps: 69 C
Summary at: Mon Apr 14 04:40:40 PM CST 2025

64.2% proc'd: 900 (3090 Gflop/s) errors: 30511521 (WARNING!) temps: 69 C
Summary at: Mon Apr 14 04:41:30 PM CST 2025

74.4% proc'd: 1050 (3077 Gflop/s) errors: 30539726 (WARNING!) temps: 70 C
Summary at: Mon Apr 14 04:42:21 PM CST 2025

85.4% proc'd: 1200 (3067 Gflop/s) errors: 30461343 (WARNING!) temps: 70 C
Summary at: Mon Apr 14 04:43:16 PM CST 2025

95.4% proc'd: 1350 (3065 Gflop/s) errors: 30454599 (WARNING!) temps: 70 C
Summary at: Mon Apr 14 04:44:06 PM CST 2025

100.0% proc'd: 1400 (3064 Gflop/s) errors: 10159088 (WARNING!) temps: 70 C
Killing processes with SIGTERM (soft kill)
Freed memory for dev 0
Uninitted cublas
done

Tested 1 GPUs:
GPU 0: FAULTY
Test result for a healthy GPU:
[root@localhost gpu-burn-master]# ./gpu_burn 500
Using compare file: compare.ptx
Burning for 500 seconds.
GPU 0: NVIDIA RTX A400 (UUID: GPU-227259f4-b4d3-4638-69d1-4c5ceb2e21b0)
Initialized device 0 with 3777 MB of memory (3715 MB available, using 3343 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 11 iterations
10.8% proc'd: 88 (1784 Gflop/s) errors: 0 temps: 70 C
Summary at: Mon Apr 14 04:50:50 PM CST 2025

21.0% proc'd: 165 (1774 Gflop/s) errors: 0 temps: 74 C
Summary at: Mon Apr 14 04:51:41 PM CST 2025

31.2% proc'd: 253 (1770 Gflop/s) errors: 0 temps: 74 C
Summary at: Mon Apr 14 04:52:32 PM CST 2025

42.0% proc'd: 330 (1772 Gflop/s) errors: 0 temps: 74 C
Summary at: Mon Apr 14 04:53:26 PM CST 2025

53.0% proc'd: 418 (1772 Gflop/s) errors: 0 temps: 74 C
Summary at: Mon Apr 14 04:54:21 PM CST 2025

63.0% proc'd: 506 (1771 Gflop/s) errors: 0 temps: 74 C
Summary at: Mon Apr 14 04:55:11 PM CST 2025

73.6% proc'd: 594 (1772 Gflop/s) errors: 0 temps: 74 C
Summary at: Mon Apr 14 04:56:04 PM CST 2025

84.0% proc'd: 671 (1772 Gflop/s) errors: 0 temps: 74 C
Summary at: Mon Apr 14 04:56:56 PM CST 2025

94.2% proc'd: 759 (1773 Gflop/s) errors: 0 temps: 74 C
Summary at: Mon Apr 14 04:57:47 PM CST 2025

100.0% proc'd: 814 (1772 Gflop/s) errors: 0 temps: 74 C
Killing processes with SIGTERM (soft kill)
Freed memory for dev 0
Uninitted cublas
done

Tested 1 GPUs:
GPU 0: OK
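For unattended runs, the final verdict line can be checked automatically. The sketch below relies only on the "GPU n: OK" and "GPU n: FAULTY" lines seen in the two logs above; `burn_verdict` and its return codes are hypothetical, and it makes no assumption about the exit status of gpu_burn itself.

```shell
#!/usr/bin/env bash
# burn_verdict: read a gpu_burn log on stdin and print OK, FAULTY, or UNKNOWN.
# Returns 0 for OK, 1 if any GPU is FAULTY, 2 if no verdict line is found.
burn_verdict() {
    local log
    log=$(cat)
    if printf '%s\n' "$log" | grep -q 'GPU [0-9]*: FAULTY'; then
        echo "FAULTY"
        return 1
    elif printf '%s\n' "$log" | grep -q 'GPU [0-9]*: OK'; then
        echo "OK"
        return 0
    else
        echo "UNKNOWN"
        return 2
    fi
}

# Example on the last two lines of a healthy run.
printf 'Tested 1 GPUs:\nGPU 0: OK\n' | burn_verdict
```

With a real run, the log could be piped in directly, for example `./gpu_burn 500 2>&1 | tee burn.log | burn_verdict`.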