Difference between revisions of "Performance"
From Gw-qcd-wiki
Revision as of 13:35, 14 July 2010
Carver
Tester: Ben Gamari
Test date: 14 Jul 2010
Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
Hardware: CUDA version 3.0
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 73 GB/s | 32 GFLOP/s |
Dslash_cuda | hopping (24^4) | 74 GB/s | 34 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 79 GB/s | 35 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | 145 GB/s | 64 GFLOP/s |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | 256 GB/s | 114 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 79 GB/s | 76 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | 156 GB/s | 140 GFLOP/s |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | 283 GB/s | 252 GFLOP/s |
Vector addition | | 82 GB/s | 3.4 GFLOP/s |
Vector dot product | | 88 GB/s | N/A |
Vector copy | | 84 GB/s | N/A |
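
The multi-node rows can be read as a rough scaling check. The sketch below is illustrative only: it copies the bandwidth figures from the table and compares each N-node result against N times the 1-node result. The names `first_block`, `second_block`, `scaling_efficiency`, and `flops_per_byte` are hypothetical helpers introduced here, not part of the benchmark code, and the calculation assumes the reported bandwidth is the aggregate over all nodes.

```python
# Bandwidth figures (GB/s) copied from the two Dslash_multi_gpu blocks
# in the table, keyed by node count. Assumption: values are aggregate
# bandwidth across all nodes in the run.
first_block = {1: 79, 2: 145, 4: 256}
second_block = {1: 79, 2: 156, 4: 283}


def scaling_efficiency(bandwidth_by_nodes):
    """Return measured / ideal throughput per node count.

    Ideal throughput for N nodes is taken as N times the 1-node figure.
    """
    base = bandwidth_by_nodes[1]
    return {n: bw / (n * base) for n, bw in bandwidth_by_nodes.items()}


def flops_per_byte(gflops, gbytes_per_s):
    """Ratio of reported GFLOP/s to reported GB/s for one table row."""
    return gflops / gbytes_per_s


if __name__ == "__main__":
    for name, block in (("first", first_block), ("second", second_block)):
        for nodes, eff in sorted(scaling_efficiency(block).items()):
            print(f"{name} block, {nodes} node(s): {eff:.2f} of ideal")
    # Single-GPU Dslash row: 32 GFLOP/s at 73 GB/s.
    print(f"Dslash_cuda FLOP/byte: {flops_per_byte(32, 73):.2f}")
```

A ratio near 1.0 means the aggregate throughput grows almost linearly with node count; lower values indicate overhead that grows with the number of nodes.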