Difference between revisions of "Performance"


Revision as of 13:44, 14 July 2010

Carver (Fermi with ECC)

  • Tester: Ben Gamari
  • Test date: 14 Jul 2010
  • Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
  • Hardware: Tesla C2050 (ECC on)
  • CUDA version 3.0
{|
!Kernel
!Configuration
!Bandwidth
!FLOPs
|-
|rowspan="2"|Dslash_cuda
|Dslash (24^4)
|73 GB/s
|32 GFLOP/s
|-
|hopping (24^4)
|74 GB/s
|34 GFLOP/s
|-
|rowspan="3"|Dslash_multi_gpu (double)
|1 node, 24^4 Dslash
|79 GB/s
|35 GFLOP/s
|-
|2 nodes, 24^4 Dslash
|145 GB/s
|64 GFLOP/s
|-
|4 nodes, 24^4 Dslash
|256 GB/s
|114 GFLOP/s
|-
|rowspan="3"|Dslash_multi_gpu (single)
|1 node, 24^4 Dslash
|79 GB/s
|76 GFLOP/s
|-
|2 nodes, 24^4 Dslash
|156 GB/s
|140 GFLOP/s
|-
|4 nodes, 24^4 Dslash
|283 GB/s
|252 GFLOP/s
|-
|rowspan="3"|Vector utilities
|Addition
|82 GB/s
|3.4 GFLOP/s
|-
|Dot product
|88 GB/s
|N/A
|-
|Copy
|84 GB/s
|N/A
|}
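
The multi-node rows can be read as a scaling measurement. The sketch below is only an illustration, not part of the benchmark suite: it divides each measured bandwidth by the 1-node figure multiplied by the node count, which assumes ideal linear scaling as the reference. The function name scaling_efficiency is hypothetical, and the page does not say whether 24^4 is the global or the per-node lattice, so no claim is made about strong versus weak scaling.

```python
def scaling_efficiency(throughput_by_nodes):
    """Measured throughput divided by ideal linear scaling of the 1-node value."""
    base = throughput_by_nodes[1]
    return {n: t / (n * base) for n, t in sorted(throughput_by_nodes.items())}


# Bandwidth in GB/s from the first Dslash_multi_gpu (double) block, ECC on.
ecc_on_double_bw = {1: 79.0, 2: 145.0, 4: 256.0}

for nodes, eff in scaling_efficiency(ecc_on_double_bw).items():
    print(f"{nodes} node(s): {eff:.0%} of linear scaling")
```

With these figures the 2-node and 4-node runs come out at roughly 92% and 81% of linear scaling.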


Carver (Fermi without ECC)

  • Tester: Ben Gamari
  • Test date: 14 Jul 2010
  • Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
  • Hardware: Tesla C2050 (ECC off)
  • CUDA version 3.0
{|
!Kernel
!Configuration
!Bandwidth
!FLOPs
|-
|rowspan="2"|Dslash_cuda
|Dslash (24^4)
|85 GB/s
|38 GFLOP/s
|-
|hopping (24^4)
|86 GB/s
|39 GFLOP/s
|-
|rowspan="3"|Dslash_multi_gpu (double)
|1 node, 24^4 Dslash
|96 GB/s
|43 GFLOP/s
|-
|2 nodes, 24^4 Dslash
|179 GB/s
|82 GFLOP/s
|-
|4 nodes, 24^4 Dslash
|304 GB/s
|139 GFLOP/s
|-
|rowspan="3"|Dslash_multi_gpu (single)
|1 node, 24^4 Dslash
|114 GB/s
|101 GFLOP/s
|-
|2 nodes, 24^4 Dslash
|210 GB/s
|187 GFLOP/s
|-
|4 nodes, 24^4 Dslash
|279 GB/s
|256 GFLOP/s
|-
|rowspan="3"|Vector utilities
|Addition
|110 GB/s
|4.6 GFLOP/s
|-
|Dot product
|119 GB/s
|N/A
|-
|Copy
|114 GB/s
|N/A
|}
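
To compare the two Carver configurations, the single-GPU bandwidth figures can be put side by side. The following is a minimal sketch that copies the numbers from the two tables on this page and computes the fractional bandwidth gain with ECC off. The dictionary layout and the helper name ecc_off_gain are assumptions made for illustration only.

```python
# (ECC on, ECC off) bandwidth in GB/s, copied from the two tables on this page.
bandwidth_gbs = {
    "Dslash_cuda Dslash (24^4)": (73.0, 85.0),
    "Dslash_cuda hopping (24^4)": (74.0, 86.0),
    "Vector addition": (82.0, 110.0),
    "Vector dot product": (88.0, 119.0),
    "Vector copy": (84.0, 114.0),
}


def ecc_off_gain(ecc_on, ecc_off):
    """Fractional bandwidth gain from running with ECC off."""
    return ecc_off / ecc_on - 1.0


for name, (on, off) in bandwidth_gbs.items():
    print(f"{name}: {ecc_off_gain(on, off):+.0%}")
```

On these measurements the gain is about 16% for the Dslash_cuda kernels and about 34% to 36% for the vector utilities.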