Difference between revisions of "Performance"
From Gw-qcd-wiki
Revision as of 13:44, 14 July 2010
Carver (Fermi with ECC)
- Tester: Ben Gamari
- Test date: 14 Jul 2010
- Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
- Hardware: Tesla C2050 (ECC on)
- CUDA version: 3.0
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 73 GB/s | 32 GFLOP/s |
Dslash_cuda | hopping (24^4) | 74 GB/s | 34 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 79 GB/s | 35 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | 145 GB/s | 64 GFLOP/s |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | 256 GB/s | 114 GFLOP/s |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 79 GB/s | 76 GFLOP/s |
Dslash_multi_gpu (single) | 2 nodes, 24^4 Dslash | 156 GB/s | 140 GFLOP/s |
Dslash_multi_gpu (single) | 4 nodes, 24^4 Dslash | 283 GB/s | 252 GFLOP/s |
Vector utilities | Addition | 82 GB/s | 3.4 GFLOP/s |
Vector utilities | Dot product | 88 GB/s | N/A |
Vector utilities | Copy | 84 GB/s | N/A |
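
The multi-node rows can be read as a scaling measurement. The following is a minimal sketch, not part of the benchmark suite, that computes parallel efficiency for the Dslash_multi_gpu (double) rows with ECC on. It assumes the reported bandwidth is the aggregate over all nodes and defines efficiency as the measured value divided by the node count times the 1-node value; the function name `scaling_efficiency` is made up for illustration.

```python
# Aggregate bandwidth in GB/s for Dslash_multi_gpu (double), ECC on,
# copied from the table above; keys are node counts.
ecc_on_double_gbs = {1: 79.0, 2: 145.0, 4: 256.0}


def scaling_efficiency(bandwidth_by_nodes):
    """Return {nodes: measured / (nodes * 1-node value)} (hypothetical helper)."""
    base = bandwidth_by_nodes[1]
    return {n: bw / (n * base) for n, bw in sorted(bandwidth_by_nodes.items())}


efficiency = scaling_efficiency(ecc_on_double_gbs)
for nodes, eff in efficiency.items():
    print(f"{nodes} node(s): {eff:.0%} of ideal linear scaling")
```

A value of 100% would mean the aggregate bandwidth grows linearly with the node count; lower values indicate communication or other overhead.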
Carver (Fermi without ECC)
- Tester: Ben Gamari
- Test date: 14 Jul 2010
- Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
- Hardware: Tesla C2050 (ECC off)
- CUDA version: 3.0
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 85 GB/s | 38 GFLOP/s |
Dslash_cuda | hopping (24^4) | 86 GB/s | 39 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 96 GB/s | 43 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | 179 GB/s | 82 GFLOP/s |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | 304 GB/s | 139 GFLOP/s |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 114 GB/s | 101 GFLOP/s |
Dslash_multi_gpu (single) | 2 nodes, 24^4 Dslash | 210 GB/s | 187 GFLOP/s |
Dslash_multi_gpu (single) | 4 nodes, 24^4 Dslash | 279 GB/s | 256 GFLOP/s |
Vector utilities | Addition | 110 GB/s | 4.6 GFLOP/s |
Vector utilities | Dot product | 119 GB/s | N/A |
Vector utilities | Copy | 114 GB/s | N/A |
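
To compare the two Carver runs directly, the sketch below computes how much bandwidth is lost with ECC on, relative to the ECC-off figure, for the single-GPU kernels. The numbers are copied from the two tables above; the helper name `ecc_cost` and the choice of the ECC-off run as the baseline are assumptions made for illustration.

```python
# (ECC on, ECC off) bandwidth in GB/s, copied from the two tables above.
on_off_gbs = {
    "Dslash_cuda, Dslash (24^4)": (73.0, 85.0),
    "Dslash_cuda, hopping (24^4)": (74.0, 86.0),
    "Vector addition": (82.0, 110.0),
    "Vector dot product": (88.0, 119.0),
    "Vector copy": (84.0, 114.0),
}


def ecc_cost(ecc_on, ecc_off):
    """Fraction of the ECC-off bandwidth lost with ECC on (hypothetical helper)."""
    return 1.0 - ecc_on / ecc_off


ecc_costs = {name: ecc_cost(on, off) for name, (on, off) in on_off_gbs.items()}
for name, cost in ecc_costs.items():
    print(f"{name}: {cost:.1%} lower bandwidth with ECC on")
```

In these measurements the vector utilities lose a larger fraction of their bandwidth with ECC on than the Dslash kernels do.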