Performance

From Gw-qcd-wiki
Revision as of 07:31, 2 May 2015 by Alexan

Ninja (K40c ECC off)

  • Tester: Andrei Alexandru
  • Test date: 2 May 2015
  • Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
  • Hardware: Tesla K40c (ECC off)
  • CUDA version: 5.5

Kernel                     Configuration         Bandwidth    FLOPs
Dslash_cuda                Dslash (24^4)         192.3 GB/s   85.6 GFLOP/s
                           hopping (24^4)        192.4 GB/s   88.2 GFLOP/s
Dslash_multi_gpu (double)  1 node, 24^4 Dslash   211.7 GB/s   94.3 GFLOP/s
                           2 nodes, 24^4 Dslash  N/A          N/A
                           4 nodes, 24^4 Dslash  N/A          N/A
Dslash_multi_gpu (single)  1 node, 24^4 Dslash   201.6 GB/s   179.6 GFLOP/s
                           2 nodes, 24^4 Dslash  N/A          N/A
                           4 nodes, 24^4 Dslash  N/A          N/A
Vector utilities           Addition              210.44 GB/s  8.77 GFLOP/s
                           Dot product           197.71 GB/s  49.43 GFLOP/s
                           Copy                  207.63 GB/s  N/A
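
The two columns in the vector-utility rows appear to be tied together by a fixed bytes-per-flop count. The sketch below is only an illustration of that relationship: the ratios (24 bytes per flop for addition, 4 bytes per flop for the dot product) are assumptions inferred from the tabulated numbers, not something the benchmark states, and the function name is hypothetical.

```python
# Bytes moved per floating-point operation.
# ASSUMPTION: these ratios are inferred from the table, not taken from the benchmark code.
BYTES_PER_FLOP = {
    "addition": 24.0,    # e.g. read two 8-byte doubles, write one, for 1 flop
    "dot_product": 4.0,  # e.g. complex doubles: 32 bytes read per 8 flops
}


def gflops_from_bandwidth(kernel, bandwidth_gb_s):
    """Convert a measured bandwidth (GB/s) into the implied rate in GFLOP/s."""
    return bandwidth_gb_s / BYTES_PER_FLOP[kernel]


# Bandwidths taken from the K40c vector-utility rows.
for kernel, bandwidth in (("addition", 210.44), ("dot_product", 197.71)):
    print(f"{kernel}: {gflops_from_bandwidth(kernel, bandwidth):.3f} GFLOP/s")
```

Under these assumptions the computed rates agree with the tabulated 8.77 and 49.43 GFLOP/s to within rounding, which suggests that both kernels are limited by memory bandwidth rather than by arithmetic.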

Ninja (K20 ECC off)

  • Tester: Andrei Alexandru
  • Test date: 19 Feb 2013
  • Commit: 4c3956dd3075f49e7bd493a813e8b922a4377877
  • Hardware: Tesla K20 (ECC off)
  • CUDA version: 5.0

Kernel                     Configuration         Bandwidth    FLOPs
Dslash_cuda                Dslash (24^4)         160.9 GB/s   71.7 GFLOP/s
                           hopping (24^4)        156.7 GB/s   71.8 GFLOP/s
Dslash_multi_gpu (double)  1 node, 24^4 Dslash   170.8 GB/s   76.1 GFLOP/s
                           2 nodes, 24^4 Dslash  N/A          N/A
                           4 nodes, 24^4 Dslash  N/A          N/A
Dslash_multi_gpu (single)  1 node, 24^4 Dslash   163.5 GB/s   145.6 GFLOP/s
                           2 nodes, 24^4 Dslash  N/A          N/A
                           4 nodes, 24^4 Dslash  N/A          N/A
Vector utilities           Addition              162.78 GB/s  6.78 GFLOP/s
                           Dot product           108.89 GB/s  27.22 GFLOP/s
                           Copy                  125.27 GB/s  N/A


Lehman (Fermi)

  • Tester: Andrei Alexandru
  • Test date: 20 Aug 2011
  • Commit: b54a3437eeeebf773271d7a5424b41949f9283ad
  • Hardware: GeForce GTX 580
  • CUDA version: 4.0

Kernel                     Configuration         Bandwidth    FLOPs
Dslash_cuda                Dslash (24^4)         100 GB/s     45 GFLOP/s
                           hopping (24^4)        114 GB/s     53 GFLOP/s
Dslash_multi_gpu (double)  1 node, 24^4 Dslash   128 GB/s     57 GFLOP/s
                           2 nodes, 24^4 Dslash  240 GB/s     107 GFLOP/s
                           4 nodes, 24^4 Dslash  391 GB/s     174 GFLOP/s
Dslash_multi_gpu (single)  1 node, 24^4 Dslash   165 GB/s     147 GFLOP/s
                           2 nodes, 24^4 Dslash  302 GB/s     269 GFLOP/s
                           4 nodes, 24^4 Dslash  402 GB/s     356 GFLOP/s
Vector utilities           Addition              159 GB/s     6.6 GFLOP/s
                           Dot product           129 GB/s     N/A
                           Copy                  155 GB/s     N/A
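
The multi-node rows can be summarized as a parallel efficiency: the rate on N nodes divided by N times the single-node rate. The following is a minimal sketch using the Lehman Dslash_multi_gpu figures. It assumes the 24^4 lattice is the total volume shared across nodes (strong scaling), which the table suggests but does not state; the function and variable names are hypothetical.

```python
def parallel_efficiency(rates_by_nodes):
    """Return {nodes: efficiency}, relative to the smallest node count measured."""
    base_nodes = min(rates_by_nodes)
    base_rate = rates_by_nodes[base_nodes]
    return {
        nodes: rate / (base_rate * nodes / base_nodes)
        for nodes, rate in rates_by_nodes.items()
    }


# GFLOP/s from the Lehman (GTX 580) Dslash_multi_gpu rows.
lehman_double = {1: 57.0, 2: 107.0, 4: 174.0}
lehman_single = {1: 147.0, 2: 269.0, 4: 356.0}

for label, rates in (("double", lehman_double), ("single", lehman_single)):
    for nodes, efficiency in sorted(parallel_efficiency(rates).items()):
        print(f"{label}, {nodes} node(s): {efficiency:.0%}")
```

With these inputs the double-precision kernel keeps roughly 94% efficiency on 2 nodes and 76% on 4, while single precision drops to roughly 91% and 61%. That pattern would be consistent with communication costs weighing more heavily on the faster single-precision kernel.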

Carver (Fermi with ECC)

  • Tester: Ben Gamari
  • Test date: 14 Jul 2010
  • Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
  • Hardware: Tesla C2050 (ECC on)
  • CUDA version: 3.0

Kernel                     Configuration         Bandwidth    FLOPs
Dslash_cuda                Dslash (24^4)         73 GB/s      32 GFLOP/s
                           hopping (24^4)        74 GB/s      34 GFLOP/s
Dslash_multi_gpu (double)  1 node, 24^4 Dslash   79 GB/s      35 GFLOP/s
                           2 nodes, 24^4 Dslash  145 GB/s     64 GFLOP/s
                           4 nodes, 24^4 Dslash  256 GB/s     114 GFLOP/s
Dslash_multi_gpu (single)  1 node, 24^4 Dslash   79 GB/s      76 GFLOP/s
                           2 nodes, 24^4 Dslash  156 GB/s     140 GFLOP/s
                           4 nodes, 24^4 Dslash  283 GB/s     252 GFLOP/s
Vector utilities           Addition              82 GB/s      3.4 GFLOP/s
                           Dot product           88 GB/s      N/A
                           Copy                  84 GB/s      N/A


Carver (Fermi without ECC)

  • Tester: Ben Gamari
  • Test date: 14 Jul 2010
  • Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
  • Hardware: Tesla C2050 (ECC off)
  • CUDA version: 3.0

Kernel                     Configuration         Bandwidth    FLOPs
Dslash_cuda                Dslash (24^4)         85 GB/s      38 GFLOP/s
                           hopping (24^4)        86 GB/s      39 GFLOP/s
Dslash_multi_gpu (double)  1 node, 24^4 Dslash   96 GB/s      43 GFLOP/s
                           2 nodes, 24^4 Dslash  179 GB/s     82 GFLOP/s
                           4 nodes, 24^4 Dslash  304 GB/s     139 GFLOP/s
Dslash_multi_gpu (single)  1 node, 24^4 Dslash   114 GB/s     101 GFLOP/s
                           2 nodes, 24^4 Dslash  210 GB/s     187 GFLOP/s
                           4 nodes, 24^4 Dslash  279 GB/s     256 GFLOP/s
Vector utilities           Addition              110 GB/s     4.6 GFLOP/s
                           Dot product           119 GB/s     N/A
                           Copy                  114 GB/s     N/A


Carver (Tesla)

  • Tester: Ben Gamari
  • Test date: 14 Jul 2010
  • Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
  • Hardware: Tesla C1060
  • CUDA version: 3.0

Kernel                     Configuration         Bandwidth    FLOPs
Dslash_cuda                Dslash (24^4)         69 GB/s      31 GFLOP/s
                           hopping (24^4)        68 GB/s      31 GFLOP/s
Dslash_multi_gpu (double)  1 node, 24^4 Dslash   71 GB/s      32 GFLOP/s
                           2 nodes, 24^4 Dslash  134 GB/s     60 GFLOP/s
                           4 nodes, 24^4 Dslash  240 GB/s     107 GFLOP/s
Dslash_multi_gpu (single)  1 node, 24^4 Dslash   65 GB/s      58 GFLOP/s
                           2 nodes, 24^4 Dslash  122 GB/s     112 GFLOP/s
                           4 nodes, 24^4 Dslash  224 GB/s     199 GFLOP/s
Vector utilities           Addition              83 GB/s      3.5 GFLOP/s
                           Dot product           81 GB/s      N/A
                           Copy                  83 GB/s      N/A


jlab (GF100)

  • Tester: Ben Gamari
  • Test date: 14 Jul 2010
  • Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
  • Hardware: GeForce GTX 480
  • CUDA version: 3.1

Kernel                     Configuration         Bandwidth    FLOPs
Dslash_cuda                Dslash (24^4)         102 GB/s     46 GFLOP/s
                           hopping (24^4)        105 GB/s     48 GFLOP/s
Dslash_multi_gpu (double)  1 node, 24^4 Dslash   113 GB/s     51 GFLOP/s
Dslash_multi_gpu (single)  1 node, 24^4 Dslash   140 GB/s     125 GFLOP/s
Vector utilities           Addition              136 GB/s     5.65 GFLOP/s
                           Dot product           136 GB/s     N/A
                           Copy                  139 GB/s     N/A

Samurai (GT200)

  • Tester: Ben Gamari
  • Test date: 14 Jul 2010
  • Commit: e4b3a766ec873f57cc7ef31b4167bc2f032d418e
  • Hardware: GeForce GTX 280
  • CUDA version: 3.1

Kernel                     Configuration         Bandwidth    FLOPs
Dslash_cuda                Dslash (24^4)         76 GB/s      34 GFLOP/s
                           hopping (24^4)        75 GB/s      34 GFLOP/s
Dslash_multi_gpu (double)  1 node, 24^4 Dslash   77 GB/s      34 GFLOP/s
Dslash_multi_gpu (single)  1 node, 24^4 Dslash   91 GB/s      81 GFLOP/s
Vector utilities           Addition              113 GB/s     4.7 GFLOP/s
                           Dot product           77 GB/s      N/A
                           Copy                  121 GB/s     N/A