Performance

From Gw-qcd-wiki

Revision as of 13:35, 14 July 2010

Carver

Tester: Ben Gamari
Test date: 14 Jul 2010
Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
Hardware: CUDA version 3.0

Kernel                     | Configuration        | Bandwidth | FLOP rate
Dslash_cuda                | Dslash (24^4)        | 73 GB/s   | 32 GFLOP/s
                           | hopping (24^4)       | 74 GB/s   | 34 GFLOP/s
Dslash_multi_gpu (double)  | 1 node, 24^4 Dslash  | 79 GB/s   | 35 GFLOP/s
                           | 2 nodes, 24^4 Dslash | 145 GB/s  | 64 GFLOP/s
                           | 4 nodes, 24^4 Dslash | 256 GB/s  | 114 GFLOP/s
Dslash_multi_gpu (double)  | 1 node, 24^4 Dslash  | 79 GB/s   | 76 GFLOP/s
                           | 2 nodes, 24^4 Dslash | 156 GB/s  | 140 GFLOP/s
                           | 4 nodes, 24^4 Dslash | 283 GB/s  | 252 GFLOP/s
Vector addition            |                      | 82 GB/s   | 3.4 GFLOP/s
Vector dot product         |                      | 88 GB/s   | N/A
Vector copy                |                      | 84 GB/s   | N/A
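The table reports raw rates only. Below is a minimal sketch, assuming the rows are transcribed as shown, of two figures commonly derived from such a table. The first is achieved arithmetic intensity (GFLOP/s divided by GB/s). The second is multi-node scaling efficiency (the rate on N nodes divided by N times the one-node rate). Two details are illustrative assumptions, not part of the original measurements. The labels "A" and "B" stand for the two blocks that the page both calls "Dslash_multi_gpu (double)". The 24-bytes-per-flop model for vector addition assumes a double-precision z = x + y.

```python
"""Derived metrics for the Carver performance table (illustrative sketch)."""

# (kernel, configuration, bandwidth in GB/s, FLOP rate in GFLOP/s or None for N/A),
# transcribed from the table above. Both multi-GPU blocks carry the page's
# "Dslash_multi_gpu (double)" label, so they are tagged "A" and "B" here (hypothetical tags).
ROWS = [
    ("Dslash_cuda", "Dslash (24^4)", 73.0, 32.0),
    ("Dslash_cuda", "hopping (24^4)", 74.0, 34.0),
    ("Dslash_multi_gpu A", "1 node", 79.0, 35.0),
    ("Dslash_multi_gpu A", "2 nodes", 145.0, 64.0),
    ("Dslash_multi_gpu A", "4 nodes", 256.0, 114.0),
    ("Dslash_multi_gpu B", "1 node", 79.0, 76.0),
    ("Dslash_multi_gpu B", "2 nodes", 156.0, 140.0),
    ("Dslash_multi_gpu B", "4 nodes", 283.0, 252.0),
    ("Vector addition", "", 82.0, 3.4),
    ("Vector dot product", "", 88.0, None),
    ("Vector copy", "", 84.0, None),
]


def flops_per_byte(bandwidth_gbs, gflops):
    """Achieved arithmetic intensity in FLOP/byte; None when the FLOP rate is N/A."""
    if gflops is None:
        return None
    return gflops / bandwidth_gbs


def scaling_efficiency(rate_n, rate_1, nodes):
    """Fraction of ideal linear speedup: rate on N nodes / (N * one-node rate)."""
    return rate_n / (nodes * rate_1)


def vector_add_gflops(bandwidth_gbs, bytes_per_flop=24.0):
    """Expected FLOP rate of a purely bandwidth-bound z = x + y.

    ASSUMPTION: double precision, so each element reads x and y and writes z
    (3 * 8 = 24 bytes) for one floating-point addition.
    """
    return bandwidth_gbs / bytes_per_flop


if __name__ == "__main__":
    # Arithmetic intensity for every row that reports a FLOP rate.
    for kernel, config, bw, gf in ROWS:
        ai = flops_per_byte(bw, gf)
        text = "N/A" if ai is None else f"{ai:.2f} FLOP/byte"
        label = f"{kernel} {config}".strip()
        print(f"{label:30s} {text}")

    # Scaling efficiency of each multi-GPU block, based on GFLOP/s.
    for block in ("Dslash_multi_gpu A", "Dslash_multi_gpu B"):
        # "1 node" -> 1, "2 nodes" -> 2, "4 nodes" -> 4
        rates = {int(c.split()[0]): gf for k, c, _, gf in ROWS if k == block}
        for n in (2, 4):
            eff = scaling_efficiency(rates[n], rates[1], n)
            print(f"{block}, {n} nodes: {eff:.0%} of linear scaling")

    print(f"Vector addition model: {vector_add_gflops(82.0):.2f} GFLOP/s")
```

With these numbers, block A reaches about 91% and 81% of linear scaling on 2 and 4 nodes, and block B about 92% and 83%. The vector-addition model gives roughly 3.42 GFLOP/s, close to the measured 3.4 GFLOP/s. At one node, block B's intensity (about 0.96 FLOP/byte) is roughly twice block A's (about 0.44) at the same 79 GB/s. That ratio is what halving the word size would produce, so the "(double)" label on the second block may be worth rechecking.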