Performance

From Gw-qcd-wiki

Latest revision as of 21:06, 2 May 2015

Shogun (Tesla C2075)

  • Tester: Andrei Alexandru
  • Test date: 2 May 2015
  • Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
  • Hardware: Tesla C2075
  • CUDA version 4.1
{| class="wikitable"
! Kernel !! Configuration !! Bandwidth !! FLOPs
|-
| rowspan="2" | Dslash_cuda || Dslash (24^4) || 42.2 GB/s || 18.8 GFLOP/s
|-
| hopping (24^4) || 40.5 GB/s || 18.6 GFLOP/s
|-
| rowspan="3" | Dslash_multi_gpu (double) || 1 node, 24^4 Dslash || 104.0 GB/s || 46.3 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || 196.6 GB/s || 87.6 GFLOP/s
|-
| 4 nodes, 24^4 Dslash || N/A || N/A
|-
| rowspan="3" | Dslash_multi_gpu (single) || 1 node, 24^4 Dslash || 120.8 GB/s || 107.6 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || 221.7 GB/s || 197.4 GFLOP/s
|-
| 4 nodes, 24^4 Dslash || N/A || N/A
|-
| rowspan="3" | Vector utilities || Addition || 119.11 GB/s || 4.96 GFLOP/s
|-
| Dot product || 120.86 GB/s || 30.22 GFLOP/s
|-
| Copy || 114.87 GB/s || N/A
|}
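
Dividing the FLOPs column by the Bandwidth column gives the arithmetic intensity (FLOPs per byte moved) that each row implies, a quick way to see how memory-bound a kernel is. The sketch below does this for three rows of the Tesla C2075 table above. The dictionary and function names are illustrative only and are not part of the benchmark code.

```python
# Rows copied from the Shogun (Tesla C2075) table:
# name -> (bandwidth in GB/s, rate in GFLOP/s).
C2075 = {
    "Dslash_cuda, Dslash (24^4)": (42.2, 18.8),
    "Vector addition": (119.11, 4.96),
    "Vector dot product": (120.86, 30.22),
}


def flops_per_byte(bandwidth_gb_s, gflop_s):
    """Arithmetic intensity implied by one table row (GFLOP/s divided by GB/s)."""
    return gflop_s / bandwidth_gb_s


intensity = {name: flops_per_byte(bw, fl) for name, (bw, fl) in C2075.items()}

for name, value in intensity.items():
    print(f"{name}: {value:.3f} FLOP/byte")
```

For these rows the ratios come out close to 0.45, 1/24 and 1/4 FLOP per byte. The page does not say how bytes and FLOPs are counted for each kernel, so any reading of these values in terms of operand sizes is an assumption.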

Shogun (GTX Titan Black)

  • Tester: Andrei Alexandru
  • Test date: 2 May 2015
  • Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
  • Hardware: GeForce GTX TITAN Black
  • CUDA version 7.0
{| class="wikitable"
! Kernel !! Configuration !! Bandwidth !! FLOPs
|-
| rowspan="2" | Dslash_cuda || Dslash (24^4) || 192.1 GB/s || 85.6 GFLOP/s
|-
| hopping (24^4) || 183.7 GB/s || 84.2 GFLOP/s
|-
| rowspan="3" | Dslash_multi_gpu (double) || 1 node, 24^4 Dslash || 207.8 GB/s || 92.5 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || N/A || N/A
|-
| 4 nodes, 24^4 Dslash || N/A || N/A
|-
| rowspan="3" | Dslash_multi_gpu (single) || 1 node, 24^4 Dslash || 231.7 GB/s || 206.3 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || N/A || N/A
|-
| 4 nodes, 24^4 Dslash || N/A || N/A
|-
| rowspan="3" | Vector utilities || Addition || 229.50 GB/s || 9.56 GFLOP/s
|-
| Dot product || 218.87 GB/s || 54.72 GFLOP/s
|-
| Copy || 222.30 GB/s || N/A
|}

Shogun (GTX Titan X)

  • Tester: Andrei Alexandru
  • Test date: 2 May 2015
  • Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
  • Hardware: GeForce GTX TITAN X
  • CUDA version 7.0
{| class="wikitable"
! Kernel !! Configuration !! Bandwidth !! FLOPs
|-
| rowspan="2" | Dslash_cuda || Dslash (24^4) || 177.8 GB/s || 79.2 GFLOP/s
|-
| hopping (24^4) || 190.2 GB/s || 87.2 GFLOP/s
|-
| rowspan="3" | Dslash_multi_gpu (double) || 1 node, 24^4 Dslash || 248.4 GB/s || 110.6 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || N/A || N/A
|-
| 4 nodes, 24^4 Dslash || N/A || N/A
|-
| rowspan="3" | Dslash_multi_gpu (single) || 1 node, 24^4 Dslash || 274.5 GB/s || 244.5 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || N/A || N/A
|-
| 4 nodes, 24^4 Dslash || N/A || N/A
|-
| rowspan="3" | Vector utilities || Addition || 246.14 GB/s || 10.26 GFLOP/s
|-
| Dot product || 246.52 GB/s || 61.63 GFLOP/s
|-
| Copy || 234.50 GB/s || N/A
|}

Samurai (GTX 680)

  • Tester: Andrei Alexandru
  • Test date: 2 May 2015
  • Commit: 62bada549c777b5a89058299df74a1239cb492cd
  • Hardware: GeForce GTX 680
  • CUDA version 4.2
{| class="wikitable"
! Kernel !! Configuration !! Bandwidth !! FLOPs
|-
| rowspan="2" | Dslash_cuda || Dslash (24^4) || 52.2 GB/s || 23.2 GFLOP/s
|-
| hopping (24^4) || 51.8 GB/s || 23.7 GFLOP/s
|-
| rowspan="3" | Dslash_multi_gpu (double) || 1 node, 24^4 Dslash || 91.8 GB/s || 40.9 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || N/A || N/A
|-
| 4 nodes, 24^4 Dslash || N/A || N/A
|-
| rowspan="3" | Dslash_multi_gpu (single) || 1 node, 24^4 Dslash || 124.1 GB/s || 110.6 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || N/A || N/A
|-
| 4 nodes, 24^4 Dslash || N/A || N/A
|-
| rowspan="3" | Vector utilities || Addition || 138.73 GB/s || 5.78 GFLOP/s
|-
| Dot product || 146.64 GB/s || 36.66 GFLOP/s
|-
| Copy || 139.73 GB/s || N/A
|}


GWU QCD cluster (GTX Titan)

  • Tester: Andrei Alexandru
  • Test date: 2 May 2015
  • Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
  • Hardware: GeForce GTX TITAN
  • CUDA version 5.0
{| class="wikitable"
! Kernel !! Configuration !! Bandwidth !! FLOPs
|-
| rowspan="2" | Dslash_cuda || Dslash (24^4) || 173.0 GB/s || 77.0 GFLOP/s
|-
| hopping (24^4) || 165.4 GB/s || 75.8 GFLOP/s
|-
| rowspan="3" | Dslash_multi_gpu (double) || 1 node, 24^4 Dslash || 199.4 GB/s || 88.8 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || 389.3 GB/s || 173.4 GFLOP/s
|-
| 4 nodes, 24^4 Dslash || 309.4 GB/s || 137.8 GFLOP/s
|-
| rowspan="3" | Dslash_multi_gpu (single) || 1 node, 24^4 Dslash || 227.9 GB/s || 202.9 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || 409.2 GB/s || 364.4 GFLOP/s
|-
| 4 nodes, 24^4 Dslash || 319.7 GB/s || 284.7 GFLOP/s
|-
| rowspan="3" | Vector utilities || Addition || 231.19 GB/s || 9.63 GFLOP/s
|-
| Dot product || 225.21 GB/s || 56.30 GFLOP/s
|-
| Copy || 232.76 GB/s || N/A
|}
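
The multi-node rows above can be turned into a scaling efficiency, rate(N) / (N × rate(1)). The sketch below does this for the double-precision Dslash_multi_gpu rows of the GTX Titan table. It assumes that the multi-node figures are aggregate rates over all nodes and that the 24^4 lattice is held fixed (strong scaling); the page states neither explicitly. The names are illustrative.

```python
# Double-precision Dslash_multi_gpu rates (GFLOP/s) from the
# GWU QCD cluster (GTX Titan) table, keyed by node count.
# Assumption: each figure is the aggregate rate over all nodes.
DOUBLE_GFLOPS = {1: 88.8, 2: 173.4, 4: 137.8}


def scaling_efficiency(rates):
    """Return rate(N) / (N * rate(1)) for every node count N in `rates`."""
    base = rates[1]
    return {n: rate / (n * base) for n, rate in rates.items()}


efficiency = scaling_efficiency(DOUBLE_GFLOPS)

for nodes, eff in sorted(efficiency.items()):
    print(f"{nodes} node(s): {eff:.0%} of ideal scaling")
```

Under these assumptions the 2-node run reaches roughly 98% of ideal scaling, while the 4-node run drops to roughly 39% and is slower in absolute terms than the 2-node run.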


Ninja (K40c ECC off)

  • Tester: Andrei Alexandru
  • Test date: 2 May 2015
  • Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
  • Hardware: K40c
  • CUDA version 5.5
{| class="wikitable"
! Kernel !! Configuration !! Bandwidth !! FLOPs
|-
| rowspan="2" | Dslash_cuda || Dslash (24^4) || 192.3 GB/s || 85.6 GFLOP/s
|-
| hopping (24^4) || 192.4 GB/s || 88.2 GFLOP/s
|-
| rowspan="3" | Dslash_multi_gpu (double) || 1 node, 24^4 Dslash || 211.7 GB/s || 94.3 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || N/A || N/A
|-
| 4 nodes, 24^4 Dslash || N/A || N/A
|-
| rowspan="3" | Dslash_multi_gpu (single) || 1 node, 24^4 Dslash || 201.6 GB/s || 179.6 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || N/A || N/A
|-
| 4 nodes, 24^4 Dslash || N/A || N/A
|-
| rowspan="3" | Vector utilities || Addition || 210.44 GB/s || 8.77 GFLOP/s
|-
| Dot product || 197.71 GB/s || 49.43 GFLOP/s
|-
| Copy || 207.63 GB/s || N/A
|}

Ninja (K20 ECC off)

  • Tester: Andrei Alexandru
  • Test date: 19 Feb 2013
  • Commit: 4c3956dd3075f49e7bd493a813e8b922a4377877
  • Hardware: K20
  • CUDA version 5.0
{| class="wikitable"
! Kernel !! Configuration !! Bandwidth !! FLOPs
|-
| rowspan="2" | Dslash_cuda || Dslash (24^4) || 160.9 GB/s || 71.7 GFLOP/s
|-
| hopping (24^4) || 156.7 GB/s || 71.8 GFLOP/s
|-
| rowspan="3" | Dslash_multi_gpu (double) || 1 node, 24^4 Dslash || 170.8 GB/s || 76.1 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || N/A || N/A
|-
| 4 nodes, 24^4 Dslash || N/A || N/A
|-
| rowspan="3" | Dslash_multi_gpu (single) || 1 node, 24^4 Dslash || 163.5 GB/s || 145.6 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || N/A || N/A
|-
| 4 nodes, 24^4 Dslash || N/A || N/A
|-
| rowspan="3" | Vector utilities || Addition || 162.78 GB/s || 6.78 GFLOP/s
|-
| Dot product || 108.89 GB/s || 27.22 GFLOP/s
|-
| Copy || 125.27 GB/s || N/A
|}


Lehman (Fermi)

  • Tester: Andrei Alexandru
  • Test date: 20 Aug 2011
  • Commit: b54a3437eeeebf773271d7a5424b41949f9283ad
  • Hardware: GeForce GTX 580
  • CUDA version 4.0
{| class="wikitable"
! Kernel !! Configuration !! Bandwidth !! FLOPs
|-
| rowspan="2" | Dslash_cuda || Dslash (24^4) || 100 GB/s || 45 GFLOP/s
|-
| hopping (24^4) || 114 GB/s || 53 GFLOP/s
|-
| rowspan="3" | Dslash_multi_gpu (double) || 1 node, 24^4 Dslash || 128 GB/s || 57 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || 240 GB/s || 107 GFLOP/s
|-
| 4 nodes, 24^4 Dslash || 391 GB/s || 174 GFLOP/s
|-
| rowspan="3" | Dslash_multi_gpu (single) || 1 node, 24^4 Dslash || 165 GB/s || 147 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || 302 GB/s || 269 GFLOP/s
|-
| 4 nodes, 24^4 Dslash || 402 GB/s || 356 GFLOP/s
|-
| rowspan="3" | Vector utilities || Addition || 159 GB/s || 6.6 GFLOP/s
|-
| Dot product || 129 GB/s || N/A
|-
| Copy || 155 GB/s || N/A
|}

Carver (Fermi with ECC)

  • Tester: Ben Gamari
  • Test date: 14 Jul 2010
  • Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
  • Hardware: Tesla C2050 (ECC on)
  • CUDA version 3.0
{| class="wikitable"
! Kernel !! Configuration !! Bandwidth !! FLOPs
|-
| rowspan="2" | Dslash_cuda || Dslash (24^4) || 73 GB/s || 32 GFLOP/s
|-
| hopping (24^4) || 74 GB/s || 34 GFLOP/s
|-
| rowspan="3" | Dslash_multi_gpu (double) || 1 node, 24^4 Dslash || 79 GB/s || 35 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || 145 GB/s || 64 GFLOP/s
|-
| 4 nodes, 24^4 Dslash || 256 GB/s || 114 GFLOP/s
|-
| rowspan="3" | Dslash_multi_gpu (single) || 1 node, 24^4 Dslash || 79 GB/s || 76 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || 156 GB/s || 140 GFLOP/s
|-
| 4 nodes, 24^4 Dslash || 283 GB/s || 252 GFLOP/s
|-
| rowspan="3" | Vector utilities || Addition || 82 GB/s || 3.4 GFLOP/s
|-
| Dot product || 88 GB/s || N/A
|-
| Copy || 84 GB/s || N/A
|}


Carver (Fermi without ECC)

  • Tester: Ben Gamari
  • Test date: 14 Jul 2010
  • Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
  • Hardware: Tesla C2050 (ECC off)
  • CUDA version 3.0
{| class="wikitable"
! Kernel !! Configuration !! Bandwidth !! FLOPs
|-
| rowspan="2" | Dslash_cuda || Dslash (24^4) || 85 GB/s || 38 GFLOP/s
|-
| hopping (24^4) || 86 GB/s || 39 GFLOP/s
|-
| rowspan="3" | Dslash_multi_gpu (double) || 1 node, 24^4 Dslash || 96 GB/s || 43 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || 179 GB/s || 82 GFLOP/s
|-
| 4 nodes, 24^4 Dslash || 304 GB/s || 139 GFLOP/s
|-
| rowspan="3" | Dslash_multi_gpu (single) || 1 node, 24^4 Dslash || 114 GB/s || 101 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || 210 GB/s || 187 GFLOP/s
|-
| 4 nodes, 24^4 Dslash || 279 GB/s || 256 GFLOP/s
|-
| rowspan="3" | Vector utilities || Addition || 110 GB/s || 4.6 GFLOP/s
|-
| Dot product || 119 GB/s || N/A
|-
| Copy || 114 GB/s || N/A
|}


Carver (Tesla)

  • Tester: Ben Gamari
  • Test date: 14 Jul 2010
  • Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
  • Hardware: Tesla C1060
  • CUDA version 3.0
{| class="wikitable"
! Kernel !! Configuration !! Bandwidth !! FLOPs
|-
| rowspan="2" | Dslash_cuda || Dslash (24^4) || 69 GB/s || 31 GFLOP/s
|-
| hopping (24^4) || 68 GB/s || 31 GFLOP/s
|-
| rowspan="3" | Dslash_multi_gpu (double) || 1 node, 24^4 Dslash || 71 GB/s || 32 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || 134 GB/s || 60 GFLOP/s
|-
| 4 nodes, 24^4 Dslash || 240 GB/s || 107 GFLOP/s
|-
| rowspan="3" | Dslash_multi_gpu (single) || 1 node, 24^4 Dslash || 65 GB/s || 58 GFLOP/s
|-
| 2 nodes, 24^4 Dslash || 122 GB/s || 112 GFLOP/s
|-
| 4 nodes, 24^4 Dslash || 224 GB/s || 199 GFLOP/s
|-
| rowspan="3" | Vector utilities || Addition || 83 GB/s || 3.5 GFLOP/s
|-
| Dot product || 81 GB/s || N/A
|-
| Copy || 83 GB/s || N/A
|}


jlab (GF100)

  • Tester: Ben Gamari
  • Test date: 14 Jul 2010
  • Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
  • Hardware: GeForce GTX 480
  • CUDA version 3.1
{| class="wikitable"
! Kernel !! Configuration !! Bandwidth !! FLOPs
|-
| rowspan="2" | Dslash_cuda || Dslash (24^4) || 102 GB/s || 46 GFLOP/s
|-
| hopping (24^4) || 105 GB/s || 48 GFLOP/s
|-
| Dslash_multi_gpu (double) || 1 node, 24^4 Dslash || 113 GB/s || 51 GFLOP/s
|-
| Dslash_multi_gpu (single) || 1 node, 24^4 Dslash || 140 GB/s || 125 GFLOP/s
|-
| rowspan="3" | Vector utilities || Addition || 136 GB/s || 5.65 GFLOP/s
|-
| Dot product || 136 GB/s || N/A
|-
| Copy || 139 GB/s || N/A
|}

Samurai (GT200)

  • Tester: Ben Gamari
  • Test date: 14 Jul 2010
  • Commit: e4b3a766ec873f57cc7ef31b4167bc2f032d418e
  • Hardware: GeForce GTX 280
  • CUDA version 3.1
{| class="wikitable"
! Kernel !! Configuration !! Bandwidth !! FLOPs
|-
| rowspan="2" | Dslash_cuda || Dslash (24^4) || 76 GB/s || 34 GFLOP/s
|-
| hopping (24^4) || 75 GB/s || 34 GFLOP/s
|-
| Dslash_multi_gpu (double) || 1 node, 24^4 Dslash || 77 GB/s || 34 GFLOP/s
|-
| Dslash_multi_gpu (single) || 1 node, 24^4 Dslash || 91 GB/s || 81 GFLOP/s
|-
| rowspan="3" | Vector utilities || Addition || 113 GB/s || 4.7 GFLOP/s
|-
| Dot product || 77 GB/s || N/A
|-
| Copy || 121 GB/s || N/A
|}