Difference between revisions of "Performance"

Latest revision as of 21:06, 2 May 2015

Shogun (Tesla C2075)

Tester: Andrei Alexandru
Test date: 2 May 2015
Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
Hardware: Tesla C2075
CUDA version 4.1

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	42.2 GB/s	18.8 GFLOP/s
Dslash_cuda	hopping (24^4)	40.5 GB/s	18.6 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	104.0 GB/s	46.3 GFLOP/s
	2 nodes, 24^4 Dslash	196.6 GB/s	87.6 GFLOP/s
	4 nodes, 24^4 Dslash	N/A	N/A
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	120.8 GB/s	107.6 GFLOP/s
	2 nodes, 24^4 Dslash	221.7 GB/s	197.4 GFLOP/s
	4 nodes, 24^4 Dslash	N/A	N/A
Vector utilities	Addition	119.11 GB/s	4.96 GFLOP/s
	Dot product	120.86 GB/s	30.22 GFLOP/s
	Copy	114.87 GB/s	N/A
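The Bandwidth and FLOPs columns in each table appear to be linked by a roughly constant bytes-per-FLOP ratio for each kernel. The sketch below checks this for the Tesla C2075 numbers above. The helper name `bytes_per_flop` is an illustrative assumption and is not part of the benchmark code.

```python
def bytes_per_flop(bandwidth_gb_s, gflop_s):
    """Return the number of bytes moved per floating-point operation."""
    return bandwidth_gb_s / gflop_s

# (kernel, GB/s, GFLOP/s) copied from the Tesla C2075 table
rows = [
    ("Dslash_cuda Dslash", 42.2, 18.8),
    ("Dslash_multi_gpu double, 1 node", 104.0, 46.3),
    ("Dslash_multi_gpu single, 1 node", 120.8, 107.6),
    ("Vector addition", 119.11, 4.96),
    ("Vector dot product", 120.86, 30.22),
]

ratios = {name: bytes_per_flop(bw, fl) for name, bw, fl in rows}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} bytes/FLOP")
```

For these rows the vector addition comes out near 24 bytes/FLOP and the dot product near 4 bytes/FLOP, while single-precision Dslash moves about half as many bytes per FLOP as double precision.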

Shogun (GTX Titan Black)

Tester: Andrei Alexandru
Test date: 2 May 2015
Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
Hardware: GeForce GTX TITAN Black
CUDA version 7.0

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	192.1 GB/s	85.6 GFLOP/s
Dslash_cuda	hopping (24^4)	183.7 GB/s	84.2 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	207.8 GB/s	92.5 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	231.7 GB/s	206.3 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Vector utilities	Addition	229.50 GB/s	9.56 GFLOP/s
	Dot product	218.87 GB/s	54.72 GFLOP/s
	Copy	222.30 GB/s	N/A

Shogun (GTX Titan X)

Tester: Andrei Alexandru
Test date: 2 May 2015
Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
Hardware: GeForce GTX TITAN X
CUDA version 7.0

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	177.8 GB/s	79.2 GFLOP/s
Dslash_cuda	hopping (24^4)	190.2 GB/s	87.2 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	248.4 GB/s	110.6 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	274.5 GB/s	244.5 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Vector utilities	Addition	246.14 GB/s	10.26 GFLOP/s
	Dot product	246.52 GB/s	61.63 GFLOP/s
	Copy	234.50 GB/s	N/A

Samurai (GTX 680)

Tester: Andrei Alexandru
Test date: 2 May 2015
Commit: 62bada549c777b5a89058299df74a1239cb492cd
Hardware: GeForce GTX 680
CUDA version 4.2

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	52.2 GB/s	23.2 GFLOP/s
Dslash_cuda	hopping (24^4)	51.8 GB/s	23.7 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	91.8 GB/s	40.9 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	124.1 GB/s	110.6 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Vector utilities	Addition	138.73 GB/s	5.78 GFLOP/s
	Dot product	146.64 GB/s	36.66 GFLOP/s
	Copy	139.73 GB/s	N/A

GWU QCD cluster (GTX Titan)

Tester: Andrei Alexandru
Test date: 2 May 2015
Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
Hardware: GeForce GTX TITAN
CUDA version 5.0

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	173.0 GB/s	77.0 GFLOP/s
Dslash_cuda	hopping (24^4)	165.4 GB/s	75.8 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	199.4 GB/s	88.8 GFLOP/s
	2 nodes, 24^4 Dslash	389.3 GB/s	173.4 GFLOP/s
	4 nodes, 24^4 Dslash	309.4 GB/s	137.8 GFLOP/s
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	227.9 GB/s	202.9 GFLOP/s
	2 nodes, 24^4 Dslash	409.2 GB/s	364.4 GFLOP/s
	4 nodes, 24^4 Dslash	319.7 GB/s	284.7 GFLOP/s
Vector utilities	Addition	231.19 GB/s	9.63 GFLOP/s
	Dot product	225.21 GB/s	56.30 GFLOP/s
	Copy	232.76 GB/s	N/A
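The multi-node rows can be read as a strong-scaling test on a fixed 24^4 lattice. The sketch below computes parallel efficiency from the double-precision GFLOP/s figures in the GTX TITAN table above. The function name `scaling_efficiency` is a hypothetical helper introduced only for illustration.

```python
def scaling_efficiency(throughput_by_nodes):
    """Map node count -> throughput / (nodes * single-node throughput)."""
    base = throughput_by_nodes[1]
    return {n: t / (n * base) for n, t in throughput_by_nodes.items()}

# GFLOP/s for Dslash_multi_gpu (double), copied from the GTX TITAN table
double_gflops = {1: 88.8, 2: 173.4, 4: 137.8}

eff = scaling_efficiency(double_gflops)
for n in sorted(eff):
    print(f"{n} node(s): {eff[n]:.0%} parallel efficiency")
```

With these numbers, two nodes run at about 98% efficiency, while four nodes drop to roughly 39% and deliver less total throughput than two nodes.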

Ninja (K40c ECC off)

Tester: Andrei Alexandru
Test date: 2 May 2015
Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
Hardware: K40c
CUDA version 5.5

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	192.3 GB/s	85.6 GFLOP/s
Dslash_cuda	hopping (24^4)	192.4 GB/s	88.2 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	211.7 GB/s	94.3 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	201.6 GB/s	179.6 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Vector utilities	Addition	210.44 GB/s	8.77 GFLOP/s
	Dot product	197.71 GB/s	49.43 GFLOP/s
	Copy	207.63 GB/s	N/A

Ninja (K20 ECC off)

Tester: Andrei Alexandru
Test date: 19 Feb 2013
Commit: 4c3956dd3075f49e7bd493a813e8b922a4377877
Hardware: K20
CUDA version 5.0

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	160.9 GB/s	71.7 GFLOP/s
Dslash_cuda	hopping (24^4)	156.7 GB/s	71.8 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	170.8 GB/s	76.1 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	163.5 GB/s	145.6 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Vector utilities	Addition	162.78 GB/s	6.78 GFLOP/s
	Dot product	108.89 GB/s	27.22 GFLOP/s
	Copy	125.27 GB/s	N/A

Lehman (Fermi)

Tester: Andrei Alexandru
Test date: 20 Aug 2011
Commit: b54a3437eeeebf773271d7a5424b41949f9283ad
Hardware: GeForce GTX 580
CUDA version 4.0

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	100 GB/s	45 GFLOP/s
Dslash_cuda	hopping (24^4)	114 GB/s	53 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	128 GB/s	57 GFLOP/s
	2 nodes, 24^4 Dslash	240 GB/s	107 GFLOP/s
	4 nodes, 24^4 Dslash	391 GB/s	174 GFLOP/s
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	165 GB/s	147 GFLOP/s
	2 nodes, 24^4 Dslash	302 GB/s	269 GFLOP/s
	4 nodes, 24^4 Dslash	402 GB/s	356 GFLOP/s
Vector utilities	Addition	159 GB/s	6.6 GFLOP/s
	Dot product	129 GB/s	N/A
	Copy	155 GB/s	N/A

Carver (Fermi with ECC)

Tester: Ben Gamari
Test date: 14 Jul 2010
Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
Hardware: Tesla C2050 (ECC on)
CUDA version 3.0

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	73 GB/s	32 GFLOP/s
Dslash_cuda	hopping (24^4)	74 GB/s	34 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	79 GB/s	35 GFLOP/s
	2 nodes, 24^4 Dslash	145 GB/s	64 GFLOP/s
	4 nodes, 24^4 Dslash	256 GB/s	114 GFLOP/s
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	79 GB/s	76 GFLOP/s
	2 nodes, 24^4 Dslash	156 GB/s	140 GFLOP/s
	4 nodes, 24^4 Dslash	283 GB/s	252 GFLOP/s
Vector utilities	Addition	82 GB/s	3.4 GFLOP/s
	Dot product	88 GB/s	N/A
	Copy	84 GB/s	N/A

Carver (Fermi without ECC)

Tester: Ben Gamari
Test date: 14 Jul 2010
Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
Hardware: Tesla C2050 (ECC off)
CUDA version 3.0

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	85 GB/s	38 GFLOP/s
Dslash_cuda	hopping (24^4)	86 GB/s	39 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	96 GB/s	43 GFLOP/s
	2 nodes, 24^4 Dslash	179 GB/s	82 GFLOP/s
	4 nodes, 24^4 Dslash	304 GB/s	139 GFLOP/s
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	114 GB/s	101 GFLOP/s
	2 nodes, 24^4 Dslash	210 GB/s	187 GFLOP/s
	4 nodes, 24^4 Dslash	279 GB/s	256 GFLOP/s
Vector utilities	Addition	110 GB/s	4.6 GFLOP/s
	Dot product	119 GB/s	N/A
	Copy	114 GB/s	N/A
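The two Carver tables for the Tesla C2050 differ only in the ECC setting, so they allow a rough estimate of the bandwidth cost of ECC. The sketch below is a minimal calculation using bandwidth figures copied from those two tables. The name `ecc_penalty` is an assumption made for this example.

```python
def ecc_penalty(on, off):
    """Fraction of the ECC-off bandwidth that is lost when ECC is enabled."""
    return 1.0 - on / off

# (ECC on, ECC off) bandwidth in GB/s, copied from the two Tesla C2050 tables
bandwidth = {
    "Dslash_cuda Dslash": (73, 85),
    "Vector addition": (82, 110),
    "Vector dot product": (88, 119),
    "Vector copy": (84, 114),
}

penalty = {name: ecc_penalty(on, off) for name, (on, off) in bandwidth.items()}
for name, p in penalty.items():
    print(f"{name}: {p:.0%} lower with ECC on")
```

For these rows the vector utilities lose roughly a quarter of their bandwidth with ECC on, while the Dslash kernel loses about 14%.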

Carver (Tesla)

Tester: Ben Gamari
Test date: 14 Jul 2010
Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
Hardware: Tesla C1060
CUDA version 3.0

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	69 GB/s	31 GFLOP/s
Dslash_cuda	hopping (24^4)	68 GB/s	31 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	71 GB/s	32 GFLOP/s
	2 nodes, 24^4 Dslash	134 GB/s	60 GFLOP/s
	4 nodes, 24^4 Dslash	240 GB/s	107 GFLOP/s
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	65 GB/s	58 GFLOP/s
	2 nodes, 24^4 Dslash	122 GB/s	112 GFLOP/s
	4 nodes, 24^4 Dslash	224 GB/s	199 GFLOP/s
Vector utilities	Addition	83 GB/s	3.5 GFLOP/s
	Dot product	81 GB/s	N/A
	Copy	83 GB/s	N/A

jlab (GF100)

Tester: Ben Gamari
Test date: 14 Jul 2010
Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
Hardware: GeForce GTX 480
CUDA version 3.1

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	102 GB/s	46 GFLOP/s
Dslash_cuda	hopping (24^4)	105 GB/s	48 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	113 GB/s	51 GFLOP/s
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	140 GB/s	125 GFLOP/s
Vector utilities	Addition	136 GB/s	5.65 GFLOP/s
	Dot product	136 GB/s	N/A
	Copy	139 GB/s	N/A

Samurai (GT200)

Tester: Ben Gamari
Test date: 14 Jul 2010
Commit: e4b3a766ec873f57cc7ef31b4167bc2f032d418e
Hardware: GeForce GTX 280
CUDA version 3.1

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	76 GB/s	34 GFLOP/s
Dslash_cuda	hopping (24^4)	75 GB/s	34 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	77 GB/s	34 GFLOP/s
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	91 GB/s	81 GFLOP/s
Vector utilities	Addition	113 GB/s	4.7 GFLOP/s
	Dot product	77 GB/s	N/A
	Copy	121 GB/s	N/A

@@ Line 1: / Line 1: @@
-'''Carver'''
+'''Shogun (Tesla C2075)'''
-Tester: Ben Gamari
-Test date: 14 Jul 2010
-Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
-Hardware:
-CUDA version 3.0
-{|
+* Tester: Andrei Alexandru
+* Test date: 2 May 2015
+* Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
+* Hardware: Tesla C2075
+* CUDA version 4.1
+{| class="wikitable"
   !Kernel
   !Configuration
@@ Line 13: / Line 14: @@
   |-
   |rowspan="2"|Dslash_cuda
-  |Dslash (24^4)           |73 GB/s    |32 GFLOP/s
+  |Dslash (24^4)
+ |42.2 GB/s
+ |18.8 GFLOP/s
   |-
-  |hopping (24^4)          |74 GB/s    |34 GFLOP/s
+  |hopping (24^4)
+ |40.5 GB/s
+ |18.6 GFLOP/s
   |-
   |rowspan="3"|Dslash_multi_gpu (double)
-  |1 node, 24^4 Dslash     |79 GB/s    |35 GFLOP/s
+  |1 node, 24^4 Dslash
+ |104.0 GB/s
+ |46.3 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |196.6 GB/s
+ |87.6 GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (single)
+ |1 node, 24^4 Dslash
+ |120.8 GB/s
+ |107.6 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |221.7 GB/s
+ |197.4 GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |rowspan="3"|Vector utilities
+ |Addition
+ |119.11 GB/s
+ |4.96 GFLOP/s
+ |-
+ |Dot product
+ |120.86 GB/s
+ |30.22 GFLOP/s
+ |-
+ |Copy
+ |114.87 GB/s
+ |N/A
+ |-
+ |}
+'''Shogun (GTX Titan Black)'''
+* Tester: Andrei Alexandru
+* Test date: 2 May 2015
+* Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
+* Hardware: GeForce GTX TITAN Black
+* CUDA version 7.0
+{| class="wikitable"
+ !Kernel
+ !Configuration
+ !Bandwidth
+ !FLOPs
   |-
-  |2 nodes, 24^4 Dslash    |145 GB/s   |64 GFLOP/s
+  |rowspan="2"|Dslash_cuda
+ |Dslash (24^4)
+ |192.1 GB/s
+ |85.6 GFLOP/s
   |-
-  |4 nodes, 24^4 Dslash    |256 GB/s   |114 GFLOP/s
+  |hopping (24^4)
+ |183.7 GB/s
+ |84.2 GFLOP/s
   |-
   |rowspan="3"|Dslash_multi_gpu (double)
-  |1 node, 24^4 Dslash     |79 GB/s    |76 GFLOP/s
+  |1 node, 24^4 Dslash
+ |207.8 GB/s
+ |92.5 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (single)
+ |1 node, 24^4 Dslash
+ |231.7 GB/s
+ |206.3 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |rowspan="3"|Vector utilities
+ |Addition
+ |229.50 GB/s
+ |9.56 GFLOP/s
+ |-
+ |Dot product
+ |218.87 GB/s
+ |54.72 GFLOP/s
+ |-
+ |Copy
+ |222.30 GB/s
+ |N/A
+ |-
+ |}
+'''Shogun (GTX Titan X)'''
+* Tester: Andrei Alexandru
+* Test date: 2 May 2015
+* Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
+* Hardware: GeForce GTX TITAN X
+* CUDA version 7.0
+{| class="wikitable"
+ !Kernel
+ !Configuration
+ !Bandwidth
+ !FLOPs
+ |-
+ |rowspan="2"|Dslash_cuda
+ |Dslash (24^4)
+ |177.8 GB/s
+ |79.2 GFLOP/s
+ |-
+ |hopping (24^4)
+ |190.2 GB/s
+ |87.2 GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (double)
+ |1 node, 24^4 Dslash
+ |248.4 GB/s
+ |110.6 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (single)
+ |1 node, 24^4 Dslash
+ |274.5 GB/s
+ |244.5 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |rowspan="3"|Vector utilities
+ |Addition
+ |246.14 GB/s
+ |10.26 GFLOP/s
+ |-
+ |Dot product
+ |246.52 GB/s
+ |61.63 GFLOP/s
+ |-
+ |Copy
+ |234.50 GB/s
+ |N/A
+ |-
+ |}
+'''Samurai (GTX 680)'''
+* Tester: Andrei Alexandru
+* Test date: 2 May 2015
+* Commit: 62bada549c777b5a89058299df74a1239cb492cd
+* Hardware: GeForce GTX 680
+* CUDA version 4.2
+{| class="wikitable"
+ !Kernel
+ !Configuration
+ !Bandwidth
+ !FLOPs
+ |-
+ |rowspan="2"|Dslash_cuda
+ |Dslash (24^4)
+ |52.2 GB/s
+ |23.2 GFLOP/s
+ |-
+ |hopping (24^4)
+ |51.8 GB/s
+ |23.7 GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (double)
+ |1 node, 24^4 Dslash
+ |91.8 GB/s
+ |40.9 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (single)
+ |1 node, 24^4 Dslash
+ |124.1 GB/s
+ |110.6 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |rowspan="3"|Vector utilities
+ |Addition
+ |138.73 GB/s
+ |5.78 GFLOP/s
+ |-
+ |Dot product
+ |146.64 GB/s
+ |36.66 GFLOP/s
+ |-
+ |Copy
+ |139.73 GB/s
+ |N/A
+ |-
+ |}
+'''GWU QCD cluster (GTX Titan)'''
+* Tester: Andrei Alexandru
+* Test date: 2 May 2015
+* Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
+* Hardware: GeForce GTX TITAN
+* CUDA version 5.0
+{| class="wikitable"
+ !Kernel
+ !Configuration
+ !Bandwidth
+ !FLOPs
+ |-
+ |rowspan="2"|Dslash_cuda
+ |Dslash (24^4)
+ |173.0 GB/s
+ |77.0 GFLOP/s
+ |-
+ |hopping (24^4)
+ |165.4 GB/s
+ |75.8 GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (double)
+ |1 node, 24^4 Dslash
+ |199.4 GB/s
+ |88.8 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |389.3 GB/s
+ |173.4 GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |309.4 GB/s
+ |137.8 GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (single)
+ |1 node, 24^4 Dslash
+ |227.9 GB/s
+ |202.9 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |409.2 GB/s
+ |364.4 GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |319.7 GB/s
+ |284.7 GFLOP/s
+ |-
+ |rowspan="3"|Vector utilities
+ |Addition
+ |231.19 GB/s
+ |9.63 GFLOP/s
+ |-
+ |Dot product
+ |225.21 GB/s
+ |56.30 GFLOP/s
+ |-
+ |Copy
+ |232.76 GB/s
+ |N/A
+ |-
+ |}
+'''Ninja (K40c ECC off)'''
+* Tester: Andrei Alexandru
+* Test date: 2 May 2015
+* Commit: 5dcae13af9abada6460a0061e5575af9c101f43a
+* Hardware: K40c
+* CUDA version 5.5
+{| class="wikitable"
+ !Kernel
+ !Configuration
+ !Bandwidth
+ !FLOPs
+ |-
+ |rowspan="2"|Dslash_cuda
+ |Dslash (24^4)
+ |192.3 GB/s
+ |85.6 GFLOP/s
+ |-
+ |hopping (24^4)
+ |192.4 GB/s
+ |88.2 GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (double)
+ |1 node, 24^4 Dslash
+ |211.7 GB/s
+ |94.3 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (single)
+ |1 node, 24^4 Dslash
+ |201.6 GB/s
+ |179.6 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |rowspan="3"|Vector utilities
+ |Addition
+ |210.44 GB/s
+ |8.77 GFLOP/s
+ |-
+ |Dot product
+ |197.71 GB/s
+ |49.43 GFLOP/s
+ |-
+ |Copy
+ |207.63 GB/s
+ |N/A
+ |-
+ |}
+'''Ninja (K20 ECC off)'''
+* Tester: Andrei Alexandru
+* Test date: 19 Feb 2013
+* Commit: 4c3956dd3075f49e7bd493a813e8b922a4377877
+* Hardware: K20
+* CUDA version 5.0
+{| class="wikitable"
+ !Kernel
+ !Configuration
+ !Bandwidth
+ !FLOPs
+ |-
+ |rowspan="2"|Dslash_cuda
+ |Dslash (24^4)
+ |160.9 GB/s
+ |71.7 GFLOP/s
+ |-
+ |hopping (24^4)
+ |156.7 GB/s
+ |71.8 GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (double)
+ |1 node, 24^4 Dslash
+ |170.8 GB/s
+ |76.1 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (single)
+ |1 node, 24^4 Dslash
+ |163.5 GB/s
+ |145.6 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |NA GB/s
+ |NA GFLOP/s
+ |-
+ |rowspan="3"|Vector utilities
+ |Addition
+ |162.78 GB/s
+ |6.78 GFLOP/s
+ |-
+ |Dot product
+ |108.89 GB/s
+ |27.22 GFLOP/s
+ |-
+ |Copy
+ |125.27 GB/s
+ |N/A
+ |-
+ |}
+'''Lehman (Fermi)'''
+* Tester: Andrei Alexandru
+* Test date: 20 Aug 2011
+* Commit: b54a3437eeeebf773271d7a5424b41949f9283ad
+* Hardware: gtx580
+* CUDA version 4.0
+{| class="wikitable"
+ !Kernel
+ !Configuration
+ !Bandwidth
+ !FLOPs
+ |-
+ |rowspan="2"|Dslash_cuda
+ |Dslash (24^4)
+ |100 GB/s
+ |45 GFLOP/s
+ |-
+ |hopping (24^4)
+ |114 GB/s
+ |53 GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (double)
+ |1 node, 24^4 Dslash
+ |128 GB/s
+ |57 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |240 GB/s
+ |107 GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |391 GB/s
+ |174 GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (single)
+ |1 node, 24^4 Dslash
+ |165 GB/s
+ |147 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |302 GB/s
+ |269 GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |402 GB/s
+ |356 GFLOP/s
+ |-
+ |rowspan="3"|Vector utilities
+ |Addition
+ |159 GB/s
+ |6.6 GFLOP/s
+ |-
+ |Dot product
+ |129 GB/s
+ |N/A
+ |-
+ |Copy
+ |155 GB/s
+ |N/A
+ |-
+ |}
+'''Carver (Fermi with ECC)'''
+* Tester: Ben Gamari
+* Test date: 14 Jul 2010
+* Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
+* Hardware: Tesla C2050 (ECC on)
+* CUDA version 3.0
+{| class="wikitable"
+ !Kernel
+ !Configuration
+ !Bandwidth
+ !FLOPs
+ |-
+ |rowspan="2"|Dslash_cuda
+ |Dslash (24^4)
+ |73 GB/s
+ |32 GFLOP/s
+ |-
+ |hopping (24^4)
+ |74 GB/s
+ |34 GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (double)
+ |1 node, 24^4 Dslash
+ |79 GB/s
+ |35 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |145 GB/s
+ |64 GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |256 GB/s
+ |114 GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (single)
+ |1 node, 24^4 Dslash
+ |79 GB/s
+ |76 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |156 GB/s
+ |140 GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |283 GB/s
+ |252 GFLOP/s
+ |-
+ |rowspan="3"|Vector utilities
+ |Addition
+ |82 GB/s
+ |3.4 GFLOP/s
+ |-
+ |Dot product
+ |88 GB/s
+ |N/A
+ |-
+ |Copy
+ |84 GB/s
+ |N/A
+ |-
+ |}
+'''Carver (Fermi without ECC)'''
+* Tester: Ben Gamari
+* Test date: 14 Jul 2010
+* Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
+* Hardware: Tesla C2050 (ECC off)
+* CUDA version 3.0
+{| class="wikitable"
+ !Kernel
+ !Configuration
+ !Bandwidth
+ !FLOPs
+ |-
+ |rowspan="2"|Dslash_cuda
+ |Dslash (24^4)
+ |85 GB/s
+ |38 GFLOP/s
+ |-
+ |hopping (24^4)
+ |86 GB/s
+ |39 GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (double)
+ |1 node, 24^4 Dslash
+ |96 GB/s
+ |43 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |179 GB/s
+ |82 GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |304 GB/s
+ |139 GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (single)
+ |1 node, 24^4 Dslash
+ |114 GB/s
+ |101 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |210 GB/s
+ |187 GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |279 GB/s
+ |256 GFLOP/s
+ |-
+ |rowspan="3"|Vector utilities
+ |Addition
+ |110 GB/s
+ |4.6 GFLOP/s
+ |-
+ |Dot product
+ |119 GB/s
+ |N/A
+ |-
+ |Copy
+ |114 GB/s
+ |N/A
+ |-
+ |}
+'''Carver (Tesla)'''
+* Tester: Ben Gamari
+* Test date: 14 Jul 2010
+* Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
+* Hardware: Tesla C1060
+* CUDA version 3.0
+{| class="wikitable"
+ !Kernel
+ !Configuration
+ !Bandwidth
+ !FLOPs
+ |-
+ |rowspan="2"|Dslash_cuda
+ |Dslash (24^4)
+ |69 GB/s
+ |31 GFLOP/s
+ |-
+ |hopping (24^4)
+ |68 GB/s
+ |31 GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (double)
+ |1 node, 24^4 Dslash
+ |71 GB/s
+ |32 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |134 GB/s
+ |60 GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |240 GB/s
+ |107 GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (single)
+ |1 node, 24^4 Dslash
+ |65 GB/s
+ |58 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |122 GB/s
+ |112 GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |224 GB/s
+ |199 GFLOP/s
+ |-
+ |rowspan="3"|Vector utilities
+ |Addition
+ |83 GB/s
+ |3.5 GFLOP/s
+ |-
+ |Dot product
+ |81 GB/s
+ |N/A
+ |-
+ |Copy
+ |83 GB/s
+ |N/A
+ |-
+ |}
+'''jlab (GF100)'''
+* Tester: Ben Gamari
+* Test date: 14 Jul 2010
+* Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
+* Hardware: GeForce GTX 480
+* CUDA version 3.1
+{| class="wikitable"
+ !Kernel
+ !Configuration
+ !Bandwidth
+ !FLOPs
+ |-
+ |rowspan="2"|Dslash_cuda
+ |Dslash (24^4)
+ |102 GB/s
+ |46 GFLOP/s
+ |-
+ |hopping (24^4)
+ |105 GB/s
+ |48 GFLOP/s
+ |-
+ |rowspan="1"|Dslash_multi_gpu (double)
+ |1 node, 24^4 Dslash
+ |113 GB/s
+ |51 GFLOP/s
+ |-
+ |rowspan="1"|Dslash_multi_gpu (single)
+ |1 node, 24^4 Dslash
+ |140 GB/s
+ |125 GFLOP/s
+ |-
+ |rowspan="3"|Vector utilities
+ |Addition
+ |136 GB/s
+ |5.65 GFLOP/s
+ |-
+ |Dot product
+ |136 GB/s
+ |N/A
+ |-
+ |Copy
+ |139 GB/s
+ |N/A
+ |-
+ |}
+'''Samurai (GT200)'''
+* Tester: Ben Gamari
+* Test date: 14 Jul 2010
+* Commit: e4b3a766ec873f57cc7ef31b4167bc2f032d418e
+* Hardware: GeForce GTX 280
+* CUDA version 3.1
+{| class="wikitable"
+ !Kernel
+ !Configuration
+ !Bandwidth
+ !FLOPs
+ |-
+ |rowspan="2"|Dslash_cuda
+ |Dslash (24^4)
+ |76 GB/s
+ |34 GFLOP/s
+ |-
+ |hopping (24^4)
+ |75 GB/s
+ |34 GFLOP/s
   |-
-  |2 nodes, 24^4 Dslash    |156 GB/s   |140 GFLOP/s
+  |rowspan="1"|Dslash_multi_gpu (double)
+ |1 node, 24^4 Dslash
+ |77 GB/s
+ |34 GFLOP/s
   |-
-  |4 nodes, 24^4 Dslash    |283 GB/s   |252 GFLOP/s
+  |rowspan="1"|Dslash_multi_gpu (single)
+ |1 node, 24^4 Dslash
+ |91 GB/s
+ |81 GFLOP/s
   |-
-  |Vector addition         |82 GB/s    |3.4 GFLOP/s
+  |rowspan="3"|Vector utilities
+ |Addition
+ |113 GB/s
+ |4.7 GFLOP/s
   |-
-  |Vector dot product      |88 GB/s    |N/A
+  |Dot product
+ |77 GB/s
+ |N/A
   |-
-  |Vector copy             |84 GB/s    |N/A
+  |Copy
+ |121 GB/s
+ |N/A
   |-
   |}