Performance
From Gw-qcd-wiki
Latest revision as of 21:06, 2 May 2015
Shogun (Tesla C2075)
- Tester: Andrei Alexandru
- Test date: 2 May 2015
- Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
- Hardware: Tesla C2075
- CUDA version 4.1
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 42.2 GB/s | 18.8 GFLOP/s |
Dslash_cuda | hopping (24^4) | 40.5 GB/s | 18.6 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 104.0 GB/s | 46.3 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | 196.6 GB/s | 87.6 GFLOP/s |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | N/A | N/A |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 120.8 GB/s | 107.6 GFLOP/s |
Dslash_multi_gpu (single) | 2 nodes, 24^4 Dslash | 221.7 GB/s | 197.4 GFLOP/s |
Dslash_multi_gpu (single) | 4 nodes, 24^4 Dslash | N/A | N/A |
Vector utilities | Addition | 119.11 GB/s | 4.96 GFLOP/s |
Vector utilities | Dot product | 120.86 GB/s | 30.22 GFLOP/s |
Vector utilities | Copy | 114.87 GB/s | N/A |
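A derived figure that can help when reading these tables is the ratio of the FLOP rate to the bandwidth, that is, the number of floating-point operations per byte moved. The sketch below is not part of the benchmark code; it only divides the two columns for a few rows of the Shogun (Tesla C2075) table above. The function name `flops_per_byte` and the row labels are illustrative assumptions.

```python
# Illustrative only: derive FLOP-per-byte ratios from the reported columns.
# Values are copied from the Shogun (Tesla C2075) table; names are hypothetical.


def flops_per_byte(gflops: float, gbytes_per_s: float) -> float:
    """Return floating-point operations per byte moved."""
    if gbytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    # GFLOP/s divided by GB/s: the giga prefixes and seconds cancel.
    return gflops / gbytes_per_s


# (GFLOP/s, GB/s) pairs taken from the table.
rows = {
    "Dslash_multi_gpu (double), 1 node": (46.3, 104.0),
    "Dslash_multi_gpu (single), 1 node": (107.6, 120.8),
    "Vector addition": (4.96, 119.11),
}

ratios = {name: flops_per_byte(f, b) for name, (f, b) in rows.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f} FLOP/byte")
```

All three ratios come out below one operation per byte, which is consistent with reading these kernels as limited by memory bandwidth rather than arithmetic throughput. That reading is an interpretation of the numbers, not a statement made on this page.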
Shogun (GTX Titan Black)
- Tester: Andrei Alexandru
- Test date: 2 May 2015
- Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
- Hardware: GeForce GTX TITAN Black
- CUDA version 7.0
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 192.1 GB/s | 85.6 GFLOP/s |
Dslash_cuda | hopping (24^4) | 183.7 GB/s | 84.2 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 207.8 GB/s | 92.5 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | N/A | N/A |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | N/A | N/A |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 231.7 GB/s | 206.3 GFLOP/s |
Dslash_multi_gpu (single) | 2 nodes, 24^4 Dslash | N/A | N/A |
Dslash_multi_gpu (single) | 4 nodes, 24^4 Dslash | N/A | N/A |
Vector utilities | Addition | 229.50 GB/s | 9.56 GFLOP/s |
Vector utilities | Dot product | 218.87 GB/s | 54.72 GFLOP/s |
Vector utilities | Copy | 222.30 GB/s | N/A |
Shogun (GTX Titan X)
- Tester: Andrei Alexandru
- Test date: 2 May 2015
- Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
- Hardware: GeForce GTX TITAN X
- CUDA version 7.0
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 177.8 GB/s | 79.2 GFLOP/s |
Dslash_cuda | hopping (24^4) | 190.2 GB/s | 87.2 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 248.4 GB/s | 110.6 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | N/A | N/A |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | N/A | N/A |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 274.5 GB/s | 244.5 GFLOP/s |
Dslash_multi_gpu (single) | 2 nodes, 24^4 Dslash | N/A | N/A |
Dslash_multi_gpu (single) | 4 nodes, 24^4 Dslash | N/A | N/A |
Vector utilities | Addition | 246.14 GB/s | 10.26 GFLOP/s |
Vector utilities | Dot product | 246.52 GB/s | 61.63 GFLOP/s |
Vector utilities | Copy | 234.50 GB/s | N/A |
Samurai (GTX 680)
- Tester: Andrei Alexandru
- Test date: 2 May 2015
- Commit: 62bada549c777b5a89058299df74a1239cb492cd
- Hardware: GeForce GTX 680
- CUDA version 4.2
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 52.2 GB/s | 23.2 GFLOP/s |
Dslash_cuda | hopping (24^4) | 51.8 GB/s | 23.7 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 91.8 GB/s | 40.9 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | N/A | N/A |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | N/A | N/A |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 124.1 GB/s | 110.6 GFLOP/s |
Dslash_multi_gpu (single) | 2 nodes, 24^4 Dslash | N/A | N/A |
Dslash_multi_gpu (single) | 4 nodes, 24^4 Dslash | N/A | N/A |
Vector utilities | Addition | 138.73 GB/s | 5.78 GFLOP/s |
Vector utilities | Dot product | 146.64 GB/s | 36.66 GFLOP/s |
Vector utilities | Copy | 139.73 GB/s | N/A |
GWU QCD cluster (GTX Titan)
- Tester: Andrei Alexandru
- Test date: 2 May 2015
- Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
- Hardware: GeForce GTX TITAN
- CUDA version 5.0
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 173.0 GB/s | 77.0 GFLOP/s |
Dslash_cuda | hopping (24^4) | 165.4 GB/s | 75.8 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 199.4 GB/s | 88.8 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | 389.3 GB/s | 173.4 GFLOP/s |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | 309.4 GB/s | 137.8 GFLOP/s |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 227.9 GB/s | 202.9 GFLOP/s |
Dslash_multi_gpu (single) | 2 nodes, 24^4 Dslash | 409.2 GB/s | 364.4 GFLOP/s |
Dslash_multi_gpu (single) | 4 nodes, 24^4 Dslash | 319.7 GB/s | 284.7 GFLOP/s |
Vector utilities | Addition | 231.19 GB/s | 9.63 GFLOP/s |
Vector utilities | Dot product | 225.21 GB/s | 56.30 GFLOP/s |
Vector utilities | Copy | 232.76 GB/s | N/A |
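The multi-node rows can be turned into a rough scaling efficiency by dividing the N-node figure by N times the 1-node figure. The sketch below assumes the reported bandwidth is the aggregate over all nodes, which this page does not state explicitly; the helper name `scaling_efficiency` is hypothetical. The numbers are the double-precision Dslash_multi_gpu rows of the GWU QCD cluster table above.

```python
# Illustrative only: scaling efficiency from the GWU QCD cluster (GTX Titan) rows.
# Assumption: the reported GB/s is the aggregate rate over all nodes.


def scaling_efficiency(rate_n: float, rate_1: float, nodes: int) -> float:
    """Fraction of ideal linear scaling achieved on `nodes` nodes."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if rate_1 <= 0:
        raise ValueError("single-node rate must be positive")
    return rate_n / (nodes * rate_1)


# Node count -> reported bandwidth in GB/s (double precision, 24^4 Dslash).
bandwidth_gbs = {1: 199.4, 2: 389.3, 4: 309.4}

efficiency = {
    nodes: scaling_efficiency(rate, bandwidth_gbs[1], nodes)
    for nodes, rate in bandwidth_gbs.items()
}

for nodes, eff in efficiency.items():
    print(f"{nodes} node(s): {eff:.1%} of ideal scaling")
```

Under that assumption, the 2-node run keeps most of the single-node rate per node, while the 4-node run on this cluster falls well below it.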
Ninja (K40c ECC off)
- Tester: Andrei Alexandru
- Test date: 2 May 2015
- Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
- Hardware: K40c
- CUDA version 5.5
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 192.3 GB/s | 85.6 GFLOP/s |
Dslash_cuda | hopping (24^4) | 192.4 GB/s | 88.2 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 211.7 GB/s | 94.3 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | N/A | N/A |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | N/A | N/A |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 201.6 GB/s | 179.6 GFLOP/s |
Dslash_multi_gpu (single) | 2 nodes, 24^4 Dslash | N/A | N/A |
Dslash_multi_gpu (single) | 4 nodes, 24^4 Dslash | N/A | N/A |
Vector utilities | Addition | 210.44 GB/s | 8.77 GFLOP/s |
Vector utilities | Dot product | 197.71 GB/s | 49.43 GFLOP/s |
Vector utilities | Copy | 207.63 GB/s | N/A |
Ninja (K20 ECC off)
- Tester: Andrei Alexandru
- Test date: 19 Feb 2013
- Commit: 4c3956dd3075f49e7bd493a813e8b922a4377877
- Hardware: K20
- CUDA version 5.0
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 160.9 GB/s | 71.7 GFLOP/s |
Dslash_cuda | hopping (24^4) | 156.7 GB/s | 71.8 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 170.8 GB/s | 76.1 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | N/A | N/A |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | N/A | N/A |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 163.5 GB/s | 145.6 GFLOP/s |
Dslash_multi_gpu (single) | 2 nodes, 24^4 Dslash | N/A | N/A |
Dslash_multi_gpu (single) | 4 nodes, 24^4 Dslash | N/A | N/A |
Vector utilities | Addition | 162.78 GB/s | 6.78 GFLOP/s |
Vector utilities | Dot product | 108.89 GB/s | 27.22 GFLOP/s |
Vector utilities | Copy | 125.27 GB/s | N/A |
Lehman (Fermi)
- Tester: Andrei Alexandru
- Test date: 20 Aug 2011
- Commit: b54a3437eeeebf773271d7a5424b41949f9283ad
- Hardware: GeForce GTX 580
- CUDA version 4.0
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 100 GB/s | 45 GFLOP/s |
Dslash_cuda | hopping (24^4) | 114 GB/s | 53 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 128 GB/s | 57 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | 240 GB/s | 107 GFLOP/s |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | 391 GB/s | 174 GFLOP/s |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 165 GB/s | 147 GFLOP/s |
Dslash_multi_gpu (single) | 2 nodes, 24^4 Dslash | 302 GB/s | 269 GFLOP/s |
Dslash_multi_gpu (single) | 4 nodes, 24^4 Dslash | 402 GB/s | 356 GFLOP/s |
Vector utilities | Addition | 159 GB/s | 6.6 GFLOP/s |
Vector utilities | Dot product | 129 GB/s | N/A |
Vector utilities | Copy | 155 GB/s | N/A |
Carver (Fermi with ECC)
- Tester: Ben Gamari
- Test date: 14 Jul 2010
- Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
- Hardware: Tesla C2050 (ECC on)
- CUDA version 3.0
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 73 GB/s | 32 GFLOP/s |
Dslash_cuda | hopping (24^4) | 74 GB/s | 34 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 79 GB/s | 35 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | 145 GB/s | 64 GFLOP/s |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | 256 GB/s | 114 GFLOP/s |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 79 GB/s | 76 GFLOP/s |
Dslash_multi_gpu (single) | 2 nodes, 24^4 Dslash | 156 GB/s | 140 GFLOP/s |
Dslash_multi_gpu (single) | 4 nodes, 24^4 Dslash | 283 GB/s | 252 GFLOP/s |
Vector utilities | Addition | 82 GB/s | 3.4 GFLOP/s |
Vector utilities | Dot product | 88 GB/s | N/A |
Vector utilities | Copy | 84 GB/s | N/A |
Carver (Fermi without ECC)
- Tester: Ben Gamari
- Test date: 14 Jul 2010
- Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
- Hardware: Tesla C2050 (ECC off)
- CUDA version 3.0
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 85 GB/s | 38 GFLOP/s |
Dslash_cuda | hopping (24^4) | 86 GB/s | 39 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 96 GB/s | 43 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | 179 GB/s | 82 GFLOP/s |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | 304 GB/s | 139 GFLOP/s |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 114 GB/s | 101 GFLOP/s |
Dslash_multi_gpu (single) | 2 nodes, 24^4 Dslash | 210 GB/s | 187 GFLOP/s |
Dslash_multi_gpu (single) | 4 nodes, 24^4 Dslash | 279 GB/s | 256 GFLOP/s |
Vector utilities | Addition | 110 GB/s | 4.6 GFLOP/s |
Vector utilities | Dot product | 119 GB/s | N/A |
Vector utilities | Copy | 114 GB/s | N/A |
Carver (Tesla)
- Tester: Ben Gamari
- Test date: 14 Jul 2010
- Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
- Hardware: Tesla C1060
- CUDA version 3.0
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 69 GB/s | 31 GFLOP/s |
Dslash_cuda | hopping (24^4) | 68 GB/s | 31 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 71 GB/s | 32 GFLOP/s |
Dslash_multi_gpu (double) | 2 nodes, 24^4 Dslash | 134 GB/s | 60 GFLOP/s |
Dslash_multi_gpu (double) | 4 nodes, 24^4 Dslash | 240 GB/s | 107 GFLOP/s |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 65 GB/s | 58 GFLOP/s |
Dslash_multi_gpu (single) | 2 nodes, 24^4 Dslash | 122 GB/s | 112 GFLOP/s |
Dslash_multi_gpu (single) | 4 nodes, 24^4 Dslash | 224 GB/s | 199 GFLOP/s |
Vector utilities | Addition | 83 GB/s | 3.5 GFLOP/s |
Vector utilities | Dot product | 81 GB/s | N/A |
Vector utilities | Copy | 83 GB/s | N/A |
jlab (GF100)
- Tester: Ben Gamari
- Test date: 14 Jul 2010
- Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
- Hardware: GeForce GTX 480
- CUDA version 3.1
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 102 GB/s | 46 GFLOP/s |
Dslash_cuda | hopping (24^4) | 105 GB/s | 48 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 113 GB/s | 51 GFLOP/s |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 140 GB/s | 125 GFLOP/s |
Vector utilities | Addition | 136 GB/s | 5.65 GFLOP/s |
Vector utilities | Dot product | 136 GB/s | N/A |
Vector utilities | Copy | 139 GB/s | N/A |
Samurai (GT200)
- Tester: Ben Gamari
- Test date: 14 Jul 2010
- Commit: e4b3a766ec873f57cc7ef31b4167bc2f032d418e
- Hardware: GeForce GTX 280
- CUDA version 3.1
Kernel | Configuration | Bandwidth | FLOPs |
---|---|---|---|
Dslash_cuda | Dslash (24^4) | 76 GB/s | 34 GFLOP/s |
Dslash_cuda | hopping (24^4) | 75 GB/s | 34 GFLOP/s |
Dslash_multi_gpu (double) | 1 node, 24^4 Dslash | 77 GB/s | 34 GFLOP/s |
Dslash_multi_gpu (single) | 1 node, 24^4 Dslash | 91 GB/s | 81 GFLOP/s |
Vector utilities | Addition | 113 GB/s | 4.7 GFLOP/s |
Vector utilities | Dot product | 77 GB/s | N/A |
Vector utilities | Copy | 121 GB/s | N/A |