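NVIDIA GPU Deep Learning Performance Benchmarks
1 Introduction
Many graphics cards on the market today can be used for deep learning, and prices have dropped considerably since the crypto-mining crash. This post looks at measured deep learning performance across NVIDIA GPUs, with particular attention to how the newer RTX 4080 and RTX 4090 compare with the 3080 and 3090.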
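2 Test Notes
Vendor-quoted compute figures alone do not fully capture the differences between GPUs. Memory bandwidth, for example, also has a large effect on total run time. The comparison here is therefore based on measurements of typical deep learning workloads.
Test setup:
- Measured with PyTorch 1.9.0 on different AutoDL GPUs (all single-GPU tests)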
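- The network input is dummy data built in memory with torch.zeros, so the results exclude CPU preprocessing load and extra I/O overhead; the GPU's own performance dominates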
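- Two models are tested: ResNet50 and a ViT Transformer. ResNet has many activation layers, so memory bandwidth matters a great deal in addition to raw compute. ViT is dominated by dense matrix multiplications, so raw compute is the main factor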
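- Results cover single-precision FP32 and half-precision FP16 (pure FP16, not mixed precision); compare whichever matches your needs
Data source: https://www.autodl.com/docs/gpu_perf/ (median values taken)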
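The setup described above can be sketched roughly as follows. This is a minimal sketch, not AutoDL's actual benchmark script: the names `time_iterations` and `resnet50_images_per_second`, the batch size, the warm-up count, and the choice to time inference-only forward passes are all assumptions made for illustration. It does follow the points listed above: the input is built with `torch.zeros`, FP16 is pure half precision rather than mixed precision, and the median timing is reported.

```python
import statistics
import time


def time_iterations(step, sync=lambda: None, warmup=5, iters=20):
    """Return the median wall-clock seconds per call of `step`.

    `sync` is called after each step so that asynchronous GPU work
    has finished before the clock is read.
    """
    for _ in range(warmup):  # warm-up runs are not timed
        step()
    sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        step()
        sync()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def resnet50_images_per_second(batch_size=64, half=False, iters=20):
    """Forward-pass throughput of ResNet50 on dummy data (assumes a CUDA GPU)."""
    # Imported lazily so the timing helper above works without PyTorch installed.
    import torch
    import torchvision

    device = torch.device("cuda")
    model = torchvision.models.resnet50().to(device).eval()
    # Dummy input built directly on the GPU: no data loading or CPU preprocessing.
    x = torch.zeros(batch_size, 3, 224, 224, device=device)
    if half:
        # Pure FP16: both weights and inputs are cast, no autocast involved.
        model, x = model.half(), x.half()

    def step():
        with torch.no_grad():
            model(x)

    seconds = time_iterations(step, sync=torch.cuda.synchronize, iters=iters)
    return batch_size / seconds
```

On a machine with a CUDA GPU, calling `resnet50_images_per_second(half=False)` and `resnet50_images_per_second(half=True)` would give a rough FP32 versus FP16 comparison for that card. The explicit synchronization matters because CUDA kernels are launched asynchronously, so timing without it would mostly measure launch overhead.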
First, a simple chart of the results.
3 GPU Specifications
Here are the specifications of the GPUs compared.
Spec | Tesla P40 | Titan Xp | 1080 Ti | 2080 Ti | V100 | 3060 | A4000 | A40 |
Release Date | 2016/9/13 | 2017/4/6 | 2017/3/10 | 2018/9/20 | 2017/6/21 | 2021/1/12 | 2021/4/12 | 2020/10/5 |
GPU Name | GP102 | GP102 | GP102 | TU102 | GV100 | GA106 | GA104 | GA102 |
Architecture | Pascal | Pascal | Pascal | Turing | Volta | Ampere | Ampere | Ampere |
Base Clock | 1303 MHz | 1405 MHz | 1481 MHz | 1350 MHz | 1245 MHz | 1320 MHz | 735 MHz | 1305 MHz |
Boost Clock | 1531 MHz | 1582 MHz | 1582 MHz | 1545 MHz | 1380 MHz | 1777 MHz | 1560 MHz | 1740 MHz |
Memory Clock | 1808 MHz 14.5 Gbps | 1426 MHz 11.4 Gbps | 1376 MHz 11 Gbps | 1750 MHz 14 Gbps | 876 MHz 1752 Mbps | 1875 MHz 15 Gbps | 1750 MHz 14 Gbps | 1812 MHz 14.5 Gbps |
Memory Size | 24 GB | 12 GB | 11 GB | 11 GB | 16 GB | 12 GB | 16 GB | 48 GB |
Memory Type | GDDR5X | GDDR5X | GDDR5X | GDDR6 | HBM2 | GDDR6 | GDDR6 | GDDR6 |
Memory Bus | 384 bit | 384 bit | 352 bit | 352 bit | 4096 bit | 192 bit | 256 bit | 384 bit |
Bandwidth | 694.3 GB/s | 547.6 GB/s | 484.4 GB/s | 616.0 GB/s | 897.0 GB/s | 360.0 GB/s | 448.0 GB/s | 695.8 GB/s |
Shading Units | 3840 | 3840 | 3584 | 4352 | 5120 | 3584 | 6144 | 10752 |
TMUs | 240 | 240 | 224 | 272 | 320 | 112 | 192 | 336 |
ROPs | 96 | 96 | 88 | 88 | 128 | 48 | 96 | 112 |
SM Count | 30 | 30 | 28 | 68 | 80 | 28 | 48 | 84 |
Tensor Cores | N/A | N/A | N/A | 544 | 640 | 112 | 192 | 336 |
RT Cores | N/A | N/A | N/A | 68 | N/A | 28 | 48 | 84 |
L1 Cache (per SM) | 48 KB | 48 KB | 48 KB | 64 KB | 128 KB | 128 KB | 128 KB | 128 KB |
L2 Cache | 3 MB | 3 MB | 2.75 MB | 5.5 MB | 6 MB | 3 MB | 4 MB | 6 MB |
CUDA Compute Capability | 6.1 | 6.1 | 6.1 | 7.5 | 7.0 | 8.6 | 8.6 | 8.6 |
Pixel, GPixel/s | 147.0 | 151.9 | 139.2 | 136.0 | 176.6 | 85.30 | 149.8 | 194.9 |
Texture, GTexel/s | 367.4 | 379.7 | 354.4 | 420.2 | 441.6 | 199.0 | 299.5 | 584.6 |
FP16, TFLOPS | 0.1837 | 0.1898 | 0.1772 | 26.90 | 28.26 | 12.74 | 19.17 | 37.42 |
FP32, TFLOPS | 11.76 | 12.15 | 11.34 | 13.45 | 14.13 | 12.74 | 19.17 | 37.42 |
FP64, TFLOPS | 0.3674 | 0.3797 | 0.3544 | 0.4202 | 7.066 | 0.199 | 0.599 | 0.5846 |
TDP | 250 W | 250 W | 250 W | 250 W | 300 W | 170 W | 140 W | 300 W |
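The Bandwidth row can be cross-checked against the Memory Clock and Memory Bus rows. A minimal sketch, assuming the usual relationship that peak bandwidth equals the per-pin data rate times the bus width, divided by 8 bits per byte. The function name `memory_bandwidth_gb_s` is made up for this example, and the inputs are copied from the table above.

```python
def memory_bandwidth_gb_s(data_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/s.

    data_rate_gbps: effective per-pin data rate in Gbit/s (the "Gbps" value
                    in the Memory Clock row)
    bus_width_bits: memory bus width in bits (the Memory Bus row)
    """
    return data_rate_gbps * bus_width_bits / 8  # 8 bits per byte


# (data rate in Gbps, bus width in bits), taken from the table above
cards = {
    "2080 Ti": (14.0, 352),
    "3060": (15.0, 192),
    "A4000": (14.0, 256),
}

for name, (rate, width) in cards.items():
    print(f"{name}: {memory_bandwidth_gb_s(rate, width):.1f} GB/s")
```

For these three cards the computed values match the Bandwidth row (616.0, 360.0, and 448.0 GB/s). Small differences on other cards come from the data rate being rounded in the table.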
Continued, Table 2
Spec | 3080 | A5000 | 3080 Ti | 3090 | 3090 Ti | A100 | 4080 | 4090 |
Release Date | 2020/9/1 | 2021/4/12 | 2021/5/31 | 2020/9/1 | 2022/1/27 | 2020/6/22 | 2022/9/20 | 2022/9/20 |
GPU Name | GA102 | GA102 | GA102 | GA102 | GA102 | GA100 | AD103 | AD102 |
Architecture | Ampere | Ampere | Ampere | Ampere | Ampere | Ampere | Ada Lovelace | Ada Lovelace |
Base Clock | 1440 MHz | 1170 MHz | 1365 MHz | 1395 MHz | 1560 MHz | 765 MHz | 2205 MHz | 2235 MHz |
Boost Clock | 1710 MHz | 1695 MHz | 1665 MHz | 1695 MHz | 1860 MHz | 1410 MHz | 2505 MHz | 2520 MHz |
Memory Clock | 1188 MHz 19 Gbps | 2000 MHz 16 Gbps | 1188 MHz 19 Gbps | 1219 MHz 19.5 Gbps | 1313 MHz 21 Gbps | 1215 MHz 2.4 Gbps | 1400 MHz 22.4 Gbps | 1313 MHz 21 Gbps |
Memory Size | 10 GB | 24 GB | 12 GB | 24 GB | 24 GB | 40 GB | 16 GB | 24 GB |
Memory Type | GDDR6X | GDDR6 | GDDR6X | GDDR6X | GDDR6X | HBM2e | GDDR6X | GDDR6X |
Memory Bus | 320 bit | 384 bit | 384 bit | 384 bit | 384 bit | 5120 bit | 256 bit | 384 bit |
Bandwidth | 760.3 GB/s | 768.0 GB/s | 912.4 GB/s | 936.2 GB/s | 1,008 GB/s | 1,555 GB/s | 716.8 GB/s | 1,008 GB/s |
Shading Units | 8704 | 8192 | 10240 | 10496 | 10752 | 6912 | 9728 | 16384 |
TMUs | 272 | 256 | 320 | 328 | 336 | 432 | 304 | 512 |
ROPs | 96 | 96 | 112 | 112 | 112 | 160 | 112 | 176 |
SM Count | 68 | 64 | 80 | 82 | 84 | 108 | 76 | 128 |
Tensor Cores | 272 | 256 | 320 | 328 | 336 | 432 | 304 | 512 |
RT Cores | 68 | 64 | 80 | 82 | 84 | N/A | 76 | 128 |
L1 Cache (per SM) | 128 KB | 128 KB | 128 KB | 128 KB | 128 KB | 192 KB | 128 KB | 128 KB |
L2 Cache | 5 MB | 6 MB | 6 MB | 6 MB | 6 MB | 40 MB | 64 MB | 72 MB |
CUDA Compute Capability | 8.6 | 8.6 | 8.6 | 8.6 | 8.6 | 8.0 | 8.9 | 8.9 |
Pixel, GPixel/s | 164.2 | 162.7 | 186.5 | 189.8 | 208.3 | 225.6 | 280.6 | 443.5 |
Texture, GTexel/s | 465.1 | 433.9 | 532.8 | 556.0 | 625.0 | 609.1 | 761.5 | 1,290 |
FP16, TFLOPS | 29.77 | 27.77 | 34.10 | 35.58 | 40.00 | 77.97 | 48.74 | 82.58 |
FP32, TFLOPS | 29.77 | 27.77 | 34.10 | 35.58 | 40.00 | 19.49 | 48.74 | 82.58 |
FP64, TFLOPS | 0.4651 | 0.8678 | 0.5328 | 0.556 | 0.625 | 9.746 | 0.7615 | 1.290 |
TDP | 320 W | 230 W | 350 W | 350 W | 450 W | 250 W | 320 W | 450 W |
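The FP32 row follows directly from the Shading Units and Boost Clock rows, and dividing it by the Bandwidth row gives a rough sense of how much compute each card has per byte of memory traffic. A minimal sketch, assuming each shading unit completes one fused multiply-add (2 FLOPs) per clock at the boost frequency. The names `peak_fp32_tflops` and `flops_per_byte` are made up for this example, and these are theoretical peaks rather than measured results.

```python
def peak_fp32_tflops(shading_units: int, boost_clock_mhz: float) -> float:
    """Theoretical FP32 peak: 2 FLOPs (one FMA) per shading unit per clock."""
    return 2 * shading_units * boost_clock_mhz * 1e6 / 1e12


def flops_per_byte(tflops: float, bandwidth_gb_s: float) -> float:
    """Peak FLOPs available per byte of memory bandwidth."""
    return tflops * 1e12 / (bandwidth_gb_s * 1e9)


# (shading units, boost clock in MHz, bandwidth in GB/s), from the table above
cards = {
    "3090": (10496, 1695, 936.2),
    "4080": (9728, 2505, 716.8),
    "4090": (16384, 2520, 1008.0),
}

for name, (units, boost, bandwidth) in cards.items():
    tflops = peak_fp32_tflops(units, boost)
    ratio = flops_per_byte(tflops, bandwidth)
    print(f"{name}: {tflops:.2f} TFLOPS, {ratio:.1f} FLOP/byte")
```

A higher FLOP/byte figure means compute has grown faster than bandwidth, so bandwidth-sensitive models such as ResNet50 may see a smaller gain on that card than the TFLOPS number alone suggests.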