Performance

Shogun (GTX Titan Black)

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	192.1 GB/s	85.6 GFLOP/s
Dslash_cuda	hopping (24^4)	183.7 GB/s	84.2 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	207.8 GB/s	92.5 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	231.7 GB/s	206.3 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Vector utilities	Addition	229.50 GB/s	9.56 GFLOP/s
	Dot product	218.87 GB/s	54.72 GFLOP/s
	Copy	222.30 GB/s	N/A
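
The two columns of the vector-utility rows are related by a fixed bytes-per-FLOP ratio, which can be recovered by dividing bandwidth by FLOP rate. The short Python sketch below does this for the figures in the table above. The interpretation given in the closing comment (double-precision operands, complex-valued dot product) is an assumption that happens to match the ratios, not something stated on this page, and the helper name `bytes_per_flop` is purely illustrative.

```python
# Bytes moved per floating-point operation, from the vector-utility rows.
# Values are (bandwidth in GB/s, rate in GFLOP/s) for the GTX Titan Black.
rows = {
    "Addition": (229.50, 9.56),
    "Dot product": (218.87, 54.72),
}

def bytes_per_flop(bandwidth_gbs, gflops):
    """GB/s divided by GFLOP/s gives bytes per FLOP (the giga prefixes cancel)."""
    return bandwidth_gbs / gflops

for name, (bw, rate) in rows.items():
    print(f"{name}: {bytes_per_flop(bw, rate):.1f} bytes/FLOP")

# Assumed reading (not confirmed by the source): 24 bytes/FLOP matches
# z = x + y on 8-byte doubles (two reads and one write per addition);
# 4 bytes/FLOP matches a complex double dot product (32 bytes read per 8 FLOPs).
```

The same ratios hold, to rounding, for the vector-utility rows of the other tables that report both columns.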

Shogun (GTX Titan X)

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	177.8 GB/s	79.2 GFLOP/s
Dslash_cuda	hopping (24^4)	190.2 GB/s	87.2 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	248.4 GB/s	110.6 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	274.5 GB/s	244.5 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Vector utilities	Addition	246.14 GB/s	10.26 GFLOP/s
	Dot product	246.52 GB/s	61.63 GFLOP/s
	Copy	234.50 GB/s	N/A

Samurai (GTX 680)

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	52.2 GB/s	23.2 GFLOP/s
Dslash_cuda	hopping (24^4)	51.8 GB/s	23.7 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	91.8 GB/s	40.9 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	124.1 GB/s	110.6 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Vector utilities	Addition	138.73 GB/s	5.78 GFLOP/s
	Dot product	146.64 GB/s	36.66 GFLOP/s
	Copy	139.73 GB/s	N/A

GWU QCD cluster (GTX Titan)

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	173.0 GB/s	77.0 GFLOP/s
Dslash_cuda	hopping (24^4)	165.4 GB/s	75.8 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	199.4 GB/s	88.8 GFLOP/s
	2 nodes, 24^4 Dslash	389.3 GB/s	173.4 GFLOP/s
	4 nodes, 24^4 Dslash	309.4 GB/s	137.8 GFLOP/s
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	227.9 GB/s	202.9 GFLOP/s
	2 nodes, 24^4 Dslash	409.2 GB/s	364.4 GFLOP/s
	4 nodes, 24^4 Dslash	319.7 GB/s	284.7 GFLOP/s
Vector utilities	Addition	231.19 GB/s	9.63 GFLOP/s
	Dot product	225.21 GB/s	56.30 GFLOP/s
	Copy	232.76 GB/s	N/A
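
The multi-node rows above allow a quick scaling check. Assuming the 24^4 lattice is the total problem size in every row (a strong-scaling reading that the page does not state explicitly), speedup is the N-node rate divided by the 1-node rate, and efficiency is speedup divided by N. A minimal Python sketch using the double-precision Dslash_multi_gpu figures from this table follows; the function name `scaling` is illustrative only.

```python
# Scaling check for Dslash_multi_gpu (double) on the GWU QCD cluster.
# Rates are the GFLOP/s figures from the table above, keyed by node count.
rates_gflops = {1: 88.8, 2: 173.4, 4: 137.8}

def scaling(rates):
    """Return {nodes: (speedup, efficiency)} relative to the 1-node rate."""
    base = rates[1]
    return {n: (r / base, r / base / n) for n, r in sorted(rates.items())}

for nodes, (speedup, eff) in scaling(rates_gflops).items():
    print(f"{nodes} node(s): speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

On these figures, two nodes scale almost linearly, while the four-node rate falls below the two-node rate.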

Ninja (K40c ECC off)

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	192.3 GB/s	85.6 GFLOP/s
Dslash_cuda	hopping (24^4)	192.4 GB/s	88.2 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	211.7 GB/s	94.3 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	201.6 GB/s	179.6 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Vector utilities	Addition	210.44 GB/s	8.77 GFLOP/s
	Dot product	197.71 GB/s	49.43 GFLOP/s
	Copy	207.63 GB/s	N/A

Ninja (K20 ECC off)

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	160.9 GB/s	71.7 GFLOP/s
Dslash_cuda	hopping (24^4)	156.7 GB/s	71.8 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	170.8 GB/s	76.1 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	163.5 GB/s	145.6 GFLOP/s
	2 nodes, 24^4 Dslash	N/A	N/A
	4 nodes, 24^4 Dslash	N/A	N/A
Vector utilities	Addition	162.78 GB/s	6.78 GFLOP/s
	Dot product	108.89 GB/s	27.22 GFLOP/s
	Copy	125.27 GB/s	N/A

Lehman (Fermi)

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	100 GB/s	45 GFLOP/s
Dslash_cuda	hopping (24^4)	114 GB/s	53 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	128 GB/s	57 GFLOP/s
	2 nodes, 24^4 Dslash	240 GB/s	107 GFLOP/s
	4 nodes, 24^4 Dslash	391 GB/s	174 GFLOP/s
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	165 GB/s	147 GFLOP/s
	2 nodes, 24^4 Dslash	302 GB/s	269 GFLOP/s
	4 nodes, 24^4 Dslash	402 GB/s	356 GFLOP/s
Vector utilities	Addition	159 GB/s	6.6 GFLOP/s
	Dot product	129 GB/s	N/A
	Copy	155 GB/s	N/A

Carver (Fermi with ECC)

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	73 GB/s	32 GFLOP/s
Dslash_cuda	hopping (24^4)	74 GB/s	34 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	79 GB/s	35 GFLOP/s
	2 nodes, 24^4 Dslash	145 GB/s	64 GFLOP/s
	4 nodes, 24^4 Dslash	256 GB/s	114 GFLOP/s
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	79 GB/s	76 GFLOP/s
	2 nodes, 24^4 Dslash	156 GB/s	140 GFLOP/s
	4 nodes, 24^4 Dslash	283 GB/s	252 GFLOP/s
Vector utilities	Addition	82 GB/s	3.4 GFLOP/s
	Dot product	88 GB/s	N/A
	Copy	84 GB/s	N/A

Carver (Fermi without ECC)

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	85 GB/s	38 GFLOP/s
Dslash_cuda	hopping (24^4)	86 GB/s	39 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	96 GB/s	43 GFLOP/s
	2 nodes, 24^4 Dslash	179 GB/s	82 GFLOP/s
	4 nodes, 24^4 Dslash	304 GB/s	139 GFLOP/s
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	114 GB/s	101 GFLOP/s
	2 nodes, 24^4 Dslash	210 GB/s	187 GFLOP/s
	4 nodes, 24^4 Dslash	279 GB/s	256 GFLOP/s
Vector utilities	Addition	110 GB/s	4.6 GFLOP/s
	Dot product	119 GB/s	N/A
	Copy	114 GB/s	N/A

Carver (Tesla)

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	69 GB/s	31 GFLOP/s
Dslash_cuda	hopping (24^4)	68 GB/s	31 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	71 GB/s	32 GFLOP/s
	2 nodes, 24^4 Dslash	134 GB/s	60 GFLOP/s
	4 nodes, 24^4 Dslash	240 GB/s	107 GFLOP/s
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	65 GB/s	58 GFLOP/s
	2 nodes, 24^4 Dslash	122 GB/s	112 GFLOP/s
	4 nodes, 24^4 Dslash	224 GB/s	199 GFLOP/s
Vector utilities	Addition	83 GB/s	3.5 GFLOP/s
	Dot product	81 GB/s	N/A
	Copy	83 GB/s	N/A

jlab (GF100)

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	102 GB/s	46 GFLOP/s
Dslash_cuda	hopping (24^4)	105 GB/s	48 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	113 GB/s	51 GFLOP/s
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	140 GB/s	125 GFLOP/s
Vector utilities	Addition	136 GB/s	5.65 GFLOP/s
	Dot product	136 GB/s	N/A
	Copy	139 GB/s	N/A

Samurai (GT200)

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	76 GB/s	34 GFLOP/s
Dslash_cuda	hopping (24^4)	75 GB/s	34 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	77 GB/s	34 GFLOP/s
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	91 GB/s	81 GFLOP/s
Vector utilities	Addition	113 GB/s	4.7 GFLOP/s
	Dot product	77 GB/s	N/A
	Copy	121 GB/s	N/A
