Difference between revisions of "Performance"
From Gw-qcd-wiki
Revision as of 23:49, 19 February 2013
Ninja (K20)
- Tester: Andrei Alexandru
- Test date: 19 Feb 2013
- Commit: 4c3956dd3075f49e7bd493a813e8b922a4377877
- Hardware: K20
- CUDA version 5.0
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | N/A | N/A |
Dslash_cuda | hopping (24^4) | 170.6 GB/s | 78.2 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 170.8 GB/s | 76.1 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | N/A | N/A |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | N/A | N/A |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 163.5 GB/s | 145.6 GFLOP/s |
Dslash_multi_gpu (single) | 2 nodes, 24^4 Dslash | N/A | N/A |
Dslash_multi_gpu (single) | 4 nodes, 24^4 Dslash | N/A | N/A |
Vector utilities | Addition | 162.78 GB/s | 6.78 GFLOP/s |
Vector utilities | Dot product | 108.89 GB/s | 27.22 GFLOP/s |
Vector utilities | Copy | 125.27 GB/s | N/A |
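Each measured row pairs a bandwidth with a throughput, so dividing one by the other gives the memory traffic per floating-point operation that the benchmark implies. The following is a minimal sketch of that arithmetic on the K20 figures, not part of the benchmark code: the names `K20_RESULTS` and `bytes_per_flop` are made up for illustration, and the dot-product throughput entry is read as GFLOP/s like the rest of its column.

```python
# Bandwidth (GB/s) and throughput (GFLOP/s) copied from the Ninja (K20) table above.
# Rows listed as not available in the table are left out.
K20_RESULTS = {
    "hopping (24^4)": (170.6, 78.2),
    "Dslash double, 1 node": (170.8, 76.1),
    "Dslash single, 1 node": (163.5, 145.6),
    "vector addition": (162.78, 6.78),
    "dot product": (108.89, 27.22),  # unit assumed to be GFLOP/s
}


def bytes_per_flop(bandwidth_gb_s, gflop_s):
    """Memory traffic per floating-point operation implied by one table row."""
    return bandwidth_gb_s / gflop_s


if __name__ == "__main__":
    for name, (bandwidth, throughput) in K20_RESULTS.items():
        ratio = bytes_per_flop(bandwidth, throughput)
        print(f"{name:24s} {ratio:6.2f} bytes/FLOP")
```

With these inputs, vector addition comes out at about 24 bytes per FLOP and the dot product at about 4. The single-precision Dslash moves roughly half as many bytes per FLOP as the double-precision one (about 1.12 versus 2.24), which is what one would expect if both kernels are limited by memory bandwidth rather than arithmetic.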
Lehman (Fermi)
- Tester: Andrei Alexandru
- Test date: 20 Aug 2011
- Commit: b54a3437eeeebf773271d7a5424b41949f9283ad
- Hardware: GeForce GTX 580
- CUDA version 4.0
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 100 GB/s | 45 GFLOP/s |
Dslash_cuda | hopping (24^4) | 114 GB/s | 53 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 128 GB/s | 57 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | 240 GB/s | 107 GFLOP/s |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | 391 GB/s | 174 GFLOP/s |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 165 GB/s | 147 GFLOP/s |
Dslash_multi_gpu (single) | 2 nodes, 24^4 Dslash | 302 GB/s | 269 GFLOP/s |
Dslash_multi_gpu (single) | 4 nodes, 24^4 Dslash | 402 GB/s | 356 GFLOP/s |
Vector utilities | Addition | 159 GB/s | 6.6 GFLOP/s |
Vector utilities | Dot product | 129 GB/s | N/A |
Vector utilities | Copy | 155 GB/s | N/A |
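The multi-GPU rows can be summarised as a parallel efficiency: the measured aggregate figure divided by the node count times the 1-node figure. The sketch below applies that definition to the Lehman Dslash_multi_gpu bandwidths. It is an illustration only; the names `LEHMAN_DSLASH_BANDWIDTH` and `parallel_efficiency` are hypothetical, and it assumes the table reports aggregate bandwidth over all nodes.

```python
# Aggregate bandwidth (GB/s) by node count, copied from the Lehman (Fermi) table above.
# Assumption: the multi-node figures are totals over all nodes, not per-node values.
LEHMAN_DSLASH_BANDWIDTH = {
    "double": {1: 128, 2: 240, 4: 391},
    "single": {1: 165, 2: 302, 4: 402},
}


def parallel_efficiency(by_nodes):
    """Return {nodes: measured / (nodes * 1-node value)} for one precision."""
    base = by_nodes[1]
    return {nodes: value / (nodes * base) for nodes, value in by_nodes.items()}


if __name__ == "__main__":
    for precision, by_nodes in LEHMAN_DSLASH_BANDWIDTH.items():
        for nodes, efficiency in parallel_efficiency(by_nodes).items():
            print(f"{precision:6s} {nodes} node(s): {efficiency:.2f}")
```

On these numbers the double-precision kernel keeps about 0.94 of ideal scaling on 2 nodes and about 0.76 on 4, while the single-precision kernel drops to about 0.61 on 4 nodes.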
Carver (Fermi with ECC)
- Tester: Ben Gamari
- Test date: 14 Jul 2010
- Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
- Hardware: Tesla C2050 (ECC on)
- CUDA version 3.0
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 73 GB/s | 32 GFLOP/s |
Dslash_cuda | hopping (24^4) | 74 GB/s | 34 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 79 GB/s | 35 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | 145 GB/s | 64 GFLOP/s |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | 256 GB/s | 114 GFLOP/s |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 79 GB/s | 76 GFLOP/s |
Dslash_multi_gpu (single) | 2 nodes, 24^4 Dslash | 156 GB/s | 140 GFLOP/s |
Dslash_multi_gpu (single) | 4 nodes, 24^4 Dslash | 283 GB/s | 252 GFLOP/s |
Vector utilities | Addition | 82 GB/s | 3.4 GFLOP/s |
Vector utilities | Dot product | 88 GB/s | N/A |
Vector utilities | Copy | 84 GB/s | N/A |
Carver (Fermi without ECC)
- Tester: Ben Gamari
- Test date: 14 Jul 2010
- Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
- Hardware: Tesla C2050 (ECC off)
- CUDA version 3.0
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 85 GB/s | 38 GFLOP/s |
Dslash_cuda | hopping (24^4) | 86 GB/s | 39 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 96 GB/s | 43 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | 179 GB/s | 82 GFLOP/s |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | 304 GB/s | 139 GFLOP/s |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 114 GB/s | 101 GFLOP/s |
Dslash_multi_gpu (single) | 2 nodes, 24^4 Dslash | 210 GB/s | 187 GFLOP/s |
Dslash_multi_gpu (single) | 4 nodes, 24^4 Dslash | 279 GB/s | 256 GFLOP/s |
Vector utilities | Addition | 110 GB/s | 4.6 GFLOP/s |
Vector utilities | Dot product | 119 GB/s | N/A |
Vector utilities | Copy | 114 GB/s | N/A |
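Because the two Carver Tesla C2050 runs share the same commit and CUDA version and differ only in the ECC setting, their tables can be compared row by row. The sketch below computes the fraction of bandwidth lost with ECC enabled for the single-node rows. The names `C2050_BANDWIDTH` and `ecc_slowdown` are hypothetical, and attributing the whole difference to ECC is an assumption rather than something the tables state.

```python
# Bandwidth in GB/s from the two Carver Tesla C2050 tables above: (ECC on, ECC off).
C2050_BANDWIDTH = {
    "Dslash_cuda Dslash": (73, 85),
    "hopping": (74, 86),
    "Dslash double, 1 node": (79, 96),
    "Dslash single, 1 node": (79, 114),
    "vector addition": (82, 110),
    "dot product": (88, 119),
    "copy": (84, 114),
}


def ecc_slowdown(ecc_on, ecc_off):
    """Fractional bandwidth lost with ECC enabled, relative to ECC off."""
    return 1.0 - ecc_on / ecc_off


if __name__ == "__main__":
    for name, (on, off) in C2050_BANDWIDTH.items():
        print(f"{name:24s} {100 * ecc_slowdown(on, off):5.1f} % lower with ECC")
```

For these rows the loss ranges from about 14% for the Dslash_cuda kernels to about 31% for the single-precision Dslash, with the vector utilities at roughly 25 to 26%.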
Carver (Tesla)
- Tester: Ben Gamari
- Test date: 14 Jul 2010
- Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
- Hardware: Tesla C1060
- CUDA version 3.0
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 69 GB/s | 31 GFLOP/s |
Dslash_cuda | hopping (24^4) | 68 GB/s | 31 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 71 GB/s | 32 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | 134 GB/s | 60 GFLOP/s |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | 240 GB/s | 107 GFLOP/s |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 65 GB/s | 58 GFLOP/s |
Dslash_multi_gpu (single) | 2 nodes, 24^4 Dslash | 122 GB/s | 112 GFLOP/s |
Dslash_multi_gpu (single) | 4 nodes, 24^4 Dslash | 224 GB/s | 199 GFLOP/s |
Vector utilities | Addition | 83 GB/s | 3.5 GFLOP/s |
Vector utilities | Dot product | 81 GB/s | N/A |
Vector utilities | Copy | 83 GB/s | N/A |
jlab (GF100)
- Tester: Ben Gamari
- Test date: 14 Jul 2010
- Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
- Hardware: GeForce GTX 480
- CUDA version 3.1
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 102 GB/s | 46 GFLOP/s |
Dslash_cuda | hopping (24^4) | 105 GB/s | 48 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 113 GB/s | 51 GFLOP/s |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 140 GB/s | 125 GFLOP/s |
Vector utilities | Addition | 136 GB/s | 5.65 GFLOP/s |
Vector utilities | Dot product | 136 GB/s | N/A |
Vector utilities | Copy | 139 GB/s | N/A |
Samurai (GT200)
- Tester: Ben Gamari
- Test date: 14 Jul 2010
- Commit: e4b3a766ec873f57cc7ef31b4167bc2f032d418e
- Hardware: GeForce GTX 280
- CUDA version 3.1
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 76 GB/s | 34 GFLOP/s |
Dslash_cuda | hopping (24^4) | 75 GB/s | 34 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 77 GB/s | 34 GFLOP/s |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 91 GB/s | 81 GFLOP/s |
Vector utilities | Addition | 113 GB/s | 4.7 GFLOP/s |
Vector utilities | Dot product | 77 GB/s | N/A |
Vector utilities | Copy | 121 GB/s | N/A |