CUDA


TECHNICAL SPECIFICATIONS

Tesla A30 on THEOR4 ( Micro-architecture Ampere GA100 )

Peak FP64: 5.2 TF
Peak FP64 Tensor Core: 10.3 TF
Peak FP32: 10.3 TF
TF32 Tensor Core: 82 TF | 165 TF*
BFLOAT16 Tensor Core: 165 TF | 330 TF*
Peak FP16 Tensor Core: 165 TF | 330 TF*
Peak INT8 Tensor Core: 330 TOPS | 661 TOPS*
Peak INT4 Tensor Core: 661 TOPS | 1321 TOPS*
Media engines:
1 optical flow accelerator (OFA)
1 JPEG decoder (NVJPEG)
4 Video decoders (NVDEC)
GPU Memory 24GB HBM2
GPU Memory Bandwidth 933GB/s
Interconnect PCIe Gen4: 64GB/s
Third-gen NVIDIA® NVLINK® 200GB/s**
Form Factor 2-slot, full height, full length (FHFL)
Max thermal design power (TDP): 165W
Multi-Instance GPU (MIG):
4 MIGs @ 6GB each
2 MIGs @ 12GB each
1 MIG @ 24GB
Virtual GPU (vGPU) software support:
NVIDIA AI Enterprise
NVIDIA Virtual Compute Server
CUDA compute capability: 8.0

Tesla P100 on THEOR3 ( Micro-architecture Pascal GP100 )

NVIDIA CUDA® Cores: 3584
Double-Precision Performance: 4.7 teraFLOPS
Single-Precision Performance: 9.3 teraFLOPS
Half-Precision Performance: 18.7 teraFLOPS
PCIe x16 Interconnect Bandwidth: 32 GB/s
CoWoS HBM2 Stacked Memory Capacity: 12 GB
CoWoS HBM2 Stacked Memory Bandwidth: 549 GB/s
Enhanced Programmability with Page Migration Engine: yes
ECC Protection for Reliability: yes
Server-Optimized for Data Center Deployment: yes
System Interface: PCIe Gen3
Max Power Consumption: 250 W
Thermal Solution: Passive
Form Factor: PCIe Full Height/Length
Compute APIs: CUDA, DirectCompute, OpenCL™, OpenACC
CUDA compute capability: 6.0

RTX A2000 on i9a ( Micro-architecture Ampere GA106 )

Form factor → PCIe x16 form factor
Number of CUDA cores → 3328
Number of Tensor cores → 104
CUDA compute capability → 8.6
Frequency of CUDA cores → up to 1.2 GHz
Double precision floating point performance (peak) → 249 Gflops
Single precision floating point performance (peak) → 7.987 Tflops
Total dedicated memory → 12GB GDDR6*
Memory speed → 1.5 GHz
Memory interface → 192-bit
Memory bandwidth → 288 GB/sec
Power consumption → 70W TDP
System interface → PCIe x16

Tesla C2075 on Theor2 ( Micro-architecture Fermi GF100 )

Form factor → 9.75" PCIe x16 form factor
Number of CUDA cores → 448
Frequency of CUDA cores → 1.15 GHz
Double precision floating point performance (peak) → 515 Gflops
Single precision floating point performance (peak) → 1.03 Tflops
Total dedicated memory → 6GB GDDR5*
Memory speed → 1.5 GHz
Memory interface → 384-bit
Memory bandwidth → 144 GB/sec
Power consumption → 225W TDP
System interface → PCIe x16 Gen2
Thermal solution → Active Fansink
Display support → Dual-Link DVI-I: 1
Maximum display resolution → 1600x1200
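
The single-precision peaks listed above for the RTX A2000 and the Tesla C2075 are consistent with the common rule of thumb that each CUDA core completes one fused multiply-add (two floating-point operations) per clock. The sketch below reproduces them from the listed core counts and clocks. The A30 line is an extra cross-check that rests on assumptions not stated in the spec sheet: 64 FP32 lanes per multiprocessor for compute capability 8.0, and the 56 multiprocessors and the clock rate of 1440000 (taken to be in kHz) that CUDA:-Properties() reports in the Maple session on this page. The function name peak_gflops is illustrative only.

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int = 2) -> float:
    """Theoretical peak in GFLOP/s: cores x clock x FLOPs per core per cycle.

    flops_per_cycle=2 assumes one fused multiply-add per core per clock.
    """
    return cores * clock_ghz * flops_per_cycle


# Core counts and clocks copied from the spec lists above.
a2000_sp = peak_gflops(3328, 1.2)    # listed: 7.987 Tflops
c2075_sp = peak_gflops(448, 1.15)    # listed: 1.03 Tflops

# A30: 56 multiprocessors and a 1440000 kHz clock as reported by
# CUDA:-Properties(); 64 FP32 lanes per multiprocessor is an assumption
# based on compute capability 8.0, not a value from the spec sheet.
a30_sp = peak_gflops(56 * 64, 1.44)  # listed: 10.3 TF

for name, gflops in [("RTX A2000", a2000_sp),
                     ("Tesla C2075", c2075_sp),
                     ("A30", a30_sp)]:
    print(f"{name}: {gflops / 1000:.3f} TFLOP/s single precision")
```

Under these assumptions the computed values agree with the listed peaks to the precision given in the spec lists. These are theoretical ceilings, not measured rates.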

theor4:> maple test_cuda.mpl
    |\^/|     Maple 2025 (X86 64 LINUX)
._|\|   |/|_. Copyright (c) Maplesoft, a division of Waterloo Maple Inc. 2025
 \  MAPLE  /  All rights reserved. Maple is a trademark of
 <____ ____>  Waterloo Maple Inc.
      |       Type ? for help.
> CUDA:-IsEnabled();
                                     false

> CUDA:-Enable(true);
                                     false

> CUDA:-IsEnabled();
                                     true


> CUDA:-HasDoubleSupport();
                               table([0 = true])


> with(LinearAlgebra):
> M:=RandomMatrix(8000,outputoptions=[datatype=float[4]]);
memory used=245.6MB, alloc=252.5MB, time=1.53
memory used=251.4MB, alloc=285.5MB, time=1.62
M :=

    [46. , -45. , -87. , 53. , 30. , 10. , 55. , -85. , 63. , 16. , ...]

    [12. , -20. , -46. , 92. , 49. , -12. , -36. , 76. , -97. , -15. , ...]

    [-60. , 10. , 25. , -90. , -23. , 22. , 16. , -39. , 0. , 74. , ...]

    [-52. , -99. , -32. , 48. , -19. , 58. , -34. , -39. , 62. , -16. , ...]

    [13. , -48. , 45. , -68. , 99. , 43. , 34. , 45. , -35. , -22. , ...]

    [-4. , -67. , 5. , -50. , -70. , 0. , -7. , -64. , 44. , -88. , ...]

    [-33. , 12. , 83. , -63. , -69. , -31. , 23. , -61. , -49. , 58. , ...]

    [3. , -67. , 40. , -76. , -37. , 98. , 5. , -49. , 72. , 4. , ...]

    [-92. , 43. , 14. , 58. , 94. , -13. , -48. , -77. , 87. , 97. , ...]

    [-50. , 1. , 11. , 16. , -55. , 1. , 78. , 40. , -36. , 80. , ...]

    [: , : , : , : , : , : , : , : , : , : , "8000 x 8000 Matrix"]

> N:=RandomMatrix(8000,outputoptions=[datatype=float[4]]);
memory used=508.0MB, alloc=532.6MB, time=2.92
N :=

    [5. , 96. , 19. , 92. , 64. , 87. , -71. , -98. , 52. , -80. , ...]

    [3. , 1. , 93. , -75. , 87. , -50. , -2. , -99. , -49. , 95. , ...]

    [-46. , 81. , 20. , 69. , -11. , -84. , 82. , -51. , 97. , 74. , ...]

    [-24. , 59. , -41. , 86. , 23. , 65. , 52. , 52. , 61. , 64. , ...]

    [65. , -69. , -57. , -67. , -19. , -80. , -79. , 70. , -44. , 57. , ...]

    [98. , 38. , -81. , 81. , -7. , -97. , -54. , -79. , 97. , -45. , ...]

    [-60. , -2. , 60. , 21. , 36. , -71. , -28. , -15. , 87. , 56. , ...]

    [-57. , -53. , 93. , 78. , 97. , -50. , -98. , -24. , 91. , -69. , ...]

    [-14. , -95. , -95. , -95. , -80. , 42. , -76. , 15. , 94. , -24. , ...]

    [84. , -83. , -30. , -22. , -65. , -90. , -55. , -75. , 38. , 68. , ...]

    [: , : , : , : , : , : , : , : , : , : , "8000 x 8000 Matrix"]


> time[real](MatrixMatrixMultiply(M,N));
memory used=996.7MB, alloc=1020.9MB, time=3.24
memory used=1485.0MB, alloc=1509.2MB, time=5.58
                                     1.810

> CUDA:-Enable(false);
                                     true

> CUDA:-IsEnabled();
                                     false

> time[real](MatrixMatrixMultiply(M,N));
memory used=2949.9MB, alloc=1997.5MB, time=313.69
                                    10.417



> CUDA:-Enable(true);
                                     false

> CUDA:-IsEnabled();
                                     true

> M:=RandomMatrix(8000,outputoptions=[datatype=float[8]]);
M :=

    [85. , -81. , -28. , 71. , -28. , -9. , 11. , 86. , 6. , 65. , ...]

    [25. , -65. , -90. , -21. , -76. , -41. , 54. , -57. , 59. , 63. , ...]

    [26. , 19. , -46. , -70. , -27. , 87. , -3. , -10. , -40. , -5. , ...]

    [-39. , -15. , 15. , 89. , -46. , 59. , 89. , 82. , -8. , 78. , ...]

    [88. , 80. , -99. , 95. , 87. , 89. , -92. , -69. , 3. , 42. , ...]

    [-25. , 48. , 87. , -28. , -61. , -97. , 53. , 68. , -71. , 17. , ...]

    [14. , 82. , 59. , -77. , 57. , -94. , 71. , 58. , -20. , -5. , ...]

    [-65. , -74. , -72. , 64. , -37. , 69. , -56. , -34. , -96. , -30. , ...]

    [-28. , -30. , -18. , -75. , 93. , 91. , -90. , 28. , -81. , 57. , ...]

    [1. , 10. , -79. , -66. , 56. , 69. , -71. , -9. , -61. , 70. , ...]

    [: , : , : , : , : , : , : , : , : , : , "8000 x 8000 Matrix"]

> N:=RandomMatrix(8000,outputoptions=[datatype=float[8]]);
N :=

    [-10. , -65. , 36. , -25. , -51. , 11. , 40. , 66. , -28. , -85. , ...]

    [94. , 27. , 86. , -87. , 17. , -66. , 66. , -51. , -83. , 14. , ...]

    [-41. , -64. , 28. , 79. , 44. , 70. , -45. , 19. , -45. , 10. , ...]

    [-22. , 78. , -64. , 22. , 50. , 40. , 48. , 33. , 74. , 33. , ...]

    [14. , 91. , -66. , -72. , 50. , -38. , -27. , -53. , 89. , -77. , ...]

    [-75. , 17. , 12. , 48. , -92. , -31. , 16. , -20. , -91. , 39. , ...]

    [16. , 55. , -2. , 65. , 74. , -52. , -73. , 3. , -46. , 19. , ...]

    [-35. , 32. , 89. , 58. , 27. , 69. , 57. , -76. , 93. , -71. , ...]

    [-17. , -49. , 47. , 39. , 25. , 19. , 45. , 99. , -74. , 7. , ...]

    [-42. , -36. , -98. , -35. , 99. , -13. , -49. , 5. , -96. , -45. , ...]

    [: , : , : , : , : , : , : , : , : , : , "8000 x 8000 Matrix"]


> time[real](MatrixMatrixMultiply(M,N));
                                     1.293


> CUDA:-Enable(false);
                                     true

> CUDA:-IsEnabled();
                                     false


> time[real](MatrixMatrixMultiply(M,N));
                                     0.733


> CUDA:-Properties();
[table(["Texture Alignment" = 512, "Clock Rate" = 1440000,

    "Total Constant Memory" = 65536, "Device Overlap" = 1, "ID" = 0,

    "Total Global Memory" = 4294967295,

    "Max Threads Dimensions" = [1024, 1024, 64],

    "Shared Memory Per Block" = 49152, "Name" = "NVIDIA A30", "Warp Size" = 32

    , "Kernel Exec Timeout Enabled" = true, "Resisters Per Block" = 65536,

    "Memory Pitch" = 2147483647, "MultiProcessor Count" = 56, "Minor" = 0,

    "Max Grid Size" = [2147483647, 65535, 65535],

    "Max Threads Per Block" = 1024,

    "Major" = 8

    ])]

> quit
memory used=5392.2MB, alloc=4438.9MB, time=343.16
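
One way to read the four timings in the session above is to convert them into time ratios and effective throughput. The sketch below does that arithmetic, assuming the conventional count of 2·n³ floating-point operations for a dense n×n matrix product (the algorithm Maple actually runs is not stated here). The wall-clock values are copied from the time[real] outputs; the names timings and effective_gflops are illustrative.

```python
N = 8000  # matrix dimension used in the session


def effective_gflops(n: int, seconds: float) -> float:
    """Effective GFLOP/s, assuming 2*n**3 operations for a dense n x n product."""
    return 2 * n**3 / seconds / 1e9


# Wall-clock seconds from time[real](MatrixMatrixMultiply(M,N)) on theor4.
timings = {
    "float[4]": {"cuda": 1.810, "cpu": 10.417},
    "float[8]": {"cuda": 1.293, "cpu": 0.733},
}

for dtype, t in timings.items():
    ratio = t["cpu"] / t["cuda"]  # > 1 means the CUDA run was faster
    print(f"{dtype}: CUDA {effective_gflops(N, t['cuda']):.0f} GFLOP/s, "
          f"CPU {effective_gflops(N, t['cpu']):.0f} GFLOP/s, "
          f"CPU/CUDA time ratio {ratio:.2f}")
```

With these numbers, enabling CUDA made the float[4] product between five and six times faster, while for float[8] the run with CUDA disabled was the faster one. Each timing is a single run and covers the whole MatrixMatrixMultiply call, so it presumably includes any host-to-device transfer as well as the arithmetic.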

CUDA

Nvidia GPU computing technology

Introduction

Hardware

Software

Performance

Examples for Maple

Sources of information