Difference between revisions of "Performance"

Revision as of 14:17, 14 July 2010

Carver (Fermi with ECC)

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	73 GB/s	32 GFLOP/s
Dslash_cuda	hopping (24^4)	74 GB/s	34 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	79 GB/s	35 GFLOP/s
	2 nodes, 24^4 Dslash	145 GB/s	64 GFLOP/s
	4 nodes, 24^4 Dslash	256 GB/s	114 GFLOP/s
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	79 GB/s	76 GFLOP/s
	2 nodes, 24^4 Dslash	156 GB/s	140 GFLOP/s
	4 nodes, 24^4 Dslash	283 GB/s	252 GFLOP/s
Vector utilities	Addition	82 GB/s	3.4 GFLOP/s
	Dot product	88 GB/s	N/A
	Copy	84 GB/s	N/A

Carver (Fermi without ECC)

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	85 GB/s	38 GFLOP/s
Dslash_cuda	hopping (24^4)	86 GB/s	39 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	96 GB/s	43 GFLOP/s
	2 nodes, 24^4 Dslash	179 GB/s	82 GFLOP/s
	4 nodes, 24^4 Dslash	304 GB/s	139 GFLOP/s
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	114 GB/s	101 GFLOP/s
	2 nodes, 24^4 Dslash	210 GB/s	187 GFLOP/s
	4 nodes, 24^4 Dslash	279 GB/s	256 GFLOP/s
Vector utilities	Addition	110 GB/s	4.6 GFLOP/s
	Dot product	119 GB/s	N/A
	Copy	114 GB/s	N/A
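
The two Carver (Fermi) tables above differ only in whether ECC is enabled, so one way to read them together is as a ratio of bandwidths. The following is a minimal sketch of that arithmetic, not part of the benchmark suite; the dictionary and function names are hypothetical, and the numbers are copied from the Bandwidth column of the two tables.

```python
# Bandwidth in GB/s, copied from the two Carver (Fermi) tables above.
# The names ecc_on, ecc_off and ecc_retained are illustrative only.
ecc_on = {
    "Dslash (24^4)": 73,
    "hopping (24^4)": 74,
    "Addition": 82,
    "Dot product": 88,
    "Copy": 84,
}
ecc_off = {
    "Dslash (24^4)": 85,
    "hopping (24^4)": 86,
    "Addition": 110,
    "Dot product": 119,
    "Copy": 114,
}


def ecc_retained(on, off):
    """Return, per kernel, the fraction of non-ECC bandwidth kept with ECC on."""
    return {name: on[name] / off[name] for name in on}


retained = ecc_retained(ecc_on, ecc_off)
for name, frac in retained.items():
    print(f"{name:15s} {frac:.0%} of non-ECC bandwidth")
```

A value below 1 means the kernel ran slower with ECC enabled. The ratio compares single runs only and says nothing about run-to-run variation.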

Carver (Tesla)

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	69 GB/s	31 GFLOP/s
Dslash_cuda	hopping (24^4)	68 GB/s	31 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	71 GB/s	32 GFLOP/s
	2 nodes, 24^4 Dslash	134 GB/s	60 GFLOP/s
	4 nodes, 24^4 Dslash	240 GB/s	107 GFLOP/s
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	65 GB/s	58 GFLOP/s
	2 nodes, 24^4 Dslash	122 GB/s	112 GFLOP/s
	4 nodes, 24^4 Dslash	224 GB/s	199 GFLOP/s
Vector utilities	Addition	83 GB/s	3.5 GFLOP/s
	Dot product	81 GB/s	N/A
	Copy	83 GB/s	N/A
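
The multi-GPU rows above can be summarised as efficiency relative to ideal linear scaling, that is, the measured rate on n nodes divided by n times the 1-node rate. This is a rough sketch using the Carver (Tesla) GFLOP/s figures from the table; the function name is hypothetical, and the table does not state whether the 24^4 volume is per node or global, so the result is only a ratio against a linear ideal.

```python
# GFLOP/s by node count, copied from the Carver (Tesla) table above.
tesla_double = {1: 32, 2: 60, 4: 107}
tesla_single = {1: 58, 2: 112, 4: 199}


def scaling_efficiency(gflops_by_nodes):
    """Measured rate on n nodes divided by n times the 1-node rate."""
    base = gflops_by_nodes[1]
    return {n: rate / (n * base) for n, rate in gflops_by_nodes.items()}


eff_double = scaling_efficiency(tesla_double)
eff_single = scaling_efficiency(tesla_single)
for n in sorted(eff_double):
    print(f"{n} node(s): double {eff_double[n]:.2f}, single {eff_single[n]:.2f}")
```

An efficiency of 1.00 would mean perfectly linear scaling; the 1-node entry is 1.00 by construction.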

Samurai (G80)

Kernel	Configuration	Bandwidth	FLOPs
Dslash_cuda	Dslash (24^4)	76 GB/s	34 GFLOP/s
Dslash_cuda	hopping (24^4)	75 GB/s	34 GFLOP/s
Dslash_multi_gpu (double)	1 node, 24^4 Dslash	77 GB/s	34 GFLOP/s
Dslash_multi_gpu (single)	1 node, 24^4 Dslash	91 GB/s	81 GFLOP/s
Vector utilities	Addition	113 GB/s	4.7 GFLOP/s
	Dot product	77 GB/s	N/A
	Copy	121 GB/s	N/A
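
Dividing the FLOPs column by the Bandwidth column gives the number of floating-point operations per byte moved, which helps explain why the GFLOP/s figures track bandwidth so closely. The sketch below applies this to three Samurai rows from the table above. It assumes both columns were measured in the same run of each kernel, and the variable and function names are hypothetical.

```python
# (bandwidth in GB/s, rate in GFLOP/s), copied from the Samurai table above.
samurai = {
    "Dslash_multi_gpu (double)": (77, 34),
    "Dslash_multi_gpu (single)": (91, 81),
    "Vector addition": (113, 4.7),
}


def flops_per_byte(bandwidth_gb_s, gflop_s):
    """Operations per byte moved: (GFLOP/s) / (GB/s); the giga prefixes cancel."""
    return gflop_s / bandwidth_gb_s


intensity = {name: flops_per_byte(bw, fl) for name, (bw, fl) in samurai.items()}
for name, value in intensity.items():
    print(f"{name:27s} {value:.3f} FLOP/byte")
```

The single-precision Dslash performs roughly twice as many operations per byte as the double-precision one, as expected when the same arithmetic is done on operands half the size.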

@@ Line 124: / Line 124: @@
   |Copy
   |114 GB/s
+ |N/A
+ |-
+ |}
+'''Carver (Tesla)'''
+* Tester: Ben Gamari
+* Test date: 14 Jul 2010
+* Commit: e3e4ffafd158abd004c483694a27f4f6bc7d2185
+* Hardware: Tesla C1060
+* CUDA version 3.0
+{|
+ !Kernel
+ !Configuration
+ !Bandwidth
+ !FLOPs
+ |-
+ |rowspan="2"|Dslash_cuda
+ |Dslash (24^4)
+ |69 GB/s
+ |31 GFLOP/s
+ |-
+ |hopping (24^4)
+ |68 GB/s
+ |31 GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (double)
+ |1 node, 24^4 Dslash
+ |71 GB/s
+ |32 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |134 GB/s
+ |60 GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |240 GB/s
+ |107 GFLOP/s
+ |-
+ |rowspan="3"|Dslash_multi_gpu (single)
+ |1 node, 24^4 Dslash
+ |65 GB/s
+ |58 GFLOP/s
+ |-
+ |2 nodes, 24^4 Dslash
+ |122 GB/s
+ |112 GFLOP/s
+ |-
+ |4 nodes, 24^4 Dslash
+ |224 GB/s
+ |199 GFLOP/s
+ |-
+ |rowspan="3"|Vector utilities
+ |Addition
+ |83 GB/s
+ |3.5 GFLOP/s
+ |-
+ |Dot product
+ |81 GB/s
+ |N/A
+ |-
+ |Copy
+ |83 GB/s
   |N/A
   |-
@@ Line 162: / Line 227: @@
   |81 GFLOP/s
   |-
-  |rowspan="1"|Vector utilities
+  |rowspan="3"|Vector utilities
   |Addition
   |113 GB/s