design: add 14313-benchmark-format.md

rsc · rsc · commit 035860468e5c · 2016-02-12T19:03:39.000Z
For golang/go#14313. Change-Id: Ib9483714bbd004ff2be6cfa0d6e730d2d7f5da42 Reviewed-on: https://go-review.googlesource.com/19490 Run-TryBot: Russ Cox <rsc@golang.org> Reviewed-by: Russ Cox <rsc@golang.org>
diff --git a/design/14313-benchmark-format.md b/design/14313-benchmark-format.md
@@ -0,0 +1,314 @@
+# Proposal: Go Benchmark Data Format
+
+Authors: Russ Cox, Austin Clements
+
+Last updated: February 2016
+
+Discussion at [golang.org/issue/14313](https://golang.org/issue/14313).
+
+## Abstract
+
+We propose to make the current output of `go test -bench` the defined format for recording all Go benchmark data.
+Having a defined format allows benchmark measurement programs
+and benchmark analysis programs to interoperate while
+evolving independently.
+
+## Background
+
+### Benchmark data formats
+
+We are unaware of any standard formats for recording raw benchmark data,
+and we've been unable to find any using web searches.
+One might expect that a standard benchmark suite such as SPEC CPU2006 would have
+defined a format for raw results, but that appears not to be the case.
+The [collection of published results](https://www.spec.org/cpu2006/results/)
+includes only analyzed data ([example](https://www.spec.org/cpu2006/results/res2011q3/cpu2006-20110620-17230.txt)), not raw data.
+
+Go has a de facto standard format for benchmark data:
+the lines generated by the testing package when using `go test -bench`.
+For example, running the `compress/flate` benchmarks produces this output:
+
+	BenchmarkDecodeDigitsSpeed1e4-8   	     100	    154125 ns/op	  64.88 MB/s	   40418 B/op	       7 allocs/op
+	BenchmarkDecodeDigitsSpeed1e5-8   	      10	   1367632 ns/op	  73.12 MB/s	   41356 B/op	      14 allocs/op
+	BenchmarkDecodeDigitsSpeed1e6-8   	       1	  13879794 ns/op	  72.05 MB/s	   52056 B/op	      94 allocs/op
+	BenchmarkDecodeDigitsDefault1e4-8 	     100	    147551 ns/op	  67.77 MB/s	   40418 B/op	       8 allocs/op
+	BenchmarkDecodeDigitsDefault1e5-8 	      10	   1197672 ns/op	  83.50 MB/s	   41508 B/op	      13 allocs/op
+	BenchmarkDecodeDigitsDefault1e6-8 	       1	  11808775 ns/op	  84.68 MB/s	   53800 B/op	      80 allocs/op
+	BenchmarkDecodeDigitsCompress1e4-8	     100	    143348 ns/op	  69.76 MB/s	   40417 B/op	       8 allocs/op
+	BenchmarkDecodeDigitsCompress1e5-8	      10	   1185527 ns/op	  84.35 MB/s	   41508 B/op	      13 allocs/op
+	BenchmarkDecodeDigitsCompress1e6-8	       1	  11740304 ns/op	  85.18 MB/s	   53800 B/op	      80 allocs/op
+	BenchmarkDecodeTwainSpeed1e4-8    	     100	    143665 ns/op	  69.61 MB/s	   40849 B/op	      15 allocs/op
+	BenchmarkDecodeTwainSpeed1e5-8    	      10	   1390359 ns/op	  71.92 MB/s	   45700 B/op	      31 allocs/op
+	BenchmarkDecodeTwainSpeed1e6-8    	       1	  12128469 ns/op	  82.45 MB/s	   89336 B/op	     221 allocs/op
+	BenchmarkDecodeTwainDefault1e4-8  	     100	    141916 ns/op	  70.46 MB/s	   40849 B/op	      15 allocs/op
+	BenchmarkDecodeTwainDefault1e5-8  	      10	   1076669 ns/op	  92.88 MB/s	   43820 B/op	      28 allocs/op
+	BenchmarkDecodeTwainDefault1e6-8  	       1	  10106485 ns/op	  98.95 MB/s	   71096 B/op	     172 allocs/op
+	BenchmarkDecodeTwainCompress1e4-8 	     100	    138516 ns/op	  72.19 MB/s	   40849 B/op	      15 allocs/op
+	BenchmarkDecodeTwainCompress1e5-8 	      10	   1227964 ns/op	  81.44 MB/s	   43316 B/op	      25 allocs/op
+	BenchmarkDecodeTwainCompress1e6-8 	       1	  10040347 ns/op	  99.60 MB/s	   72120 B/op	     173 allocs/op
+	BenchmarkEncodeDigitsSpeed1e4-8   	      30	    482808 ns/op	  20.71 MB/s
+	BenchmarkEncodeDigitsSpeed1e5-8   	       5	   2685455 ns/op	  37.24 MB/s
+	BenchmarkEncodeDigitsSpeed1e6-8   	       1	  24966055 ns/op	  40.05 MB/s
+	BenchmarkEncodeDigitsDefault1e4-8 	      20	    655592 ns/op	  15.25 MB/s
+	BenchmarkEncodeDigitsDefault1e5-8 	       1	  13000839 ns/op	   7.69 MB/s
+	BenchmarkEncodeDigitsDefault1e6-8 	       1	 136341747 ns/op	   7.33 MB/s
+	BenchmarkEncodeDigitsCompress1e4-8	      20	    668083 ns/op	  14.97 MB/s
+	BenchmarkEncodeDigitsCompress1e5-8	       1	  12301511 ns/op	   8.13 MB/s
+	BenchmarkEncodeDigitsCompress1e6-8	       1	 137962041 ns/op	   7.25 MB/s
+
+The testing package always reports ns/op, and each benchmark can request the addition of MB/s (throughput) and also B/op and allocs/op (allocation rates).
+
+### Benchmark processors
+
+Multiple tools have been written that process this format,
+most notably [benchcmp](https://godoc.org/golang.org/x/tools/cmd/benchcmp)
+and its more statistically valid successor [benchstat](https://godoc.org/rsc.io/benchstat).
+There is also [benchmany](https://godoc.org/github.com/aclements/go-misc/benchmany)'s plot subcommand
+and likely more unpublished programs.
+
+### Benchmark runners
+
+Multiple tools have also been written that generate this format.
+In addition to the standard Go testing package,
+[compilebench](https://godoc.org/rsc.io/compilebench)
+generates this data format based on runs of the Go compiler,
+and Austin's unpublished shellbench generates this data format
+after running an arbitrary shell command.
+
+The [golang.org/x/benchmarks/bench](https://golang.org/x/benchmarks/bench) benchmarks
+are notable for _not_ generating this format,
+which has made all analysis of those results
+more complex than we believe it should be.
+We intend to update those benchmarks to generate the standard format,
+once a standard format is defined.
+Part of the motivation for the proposal is to avoid
+the need to process custom output formats in future benchmarks.
+
+## Proposal
+
+A Go benchmark data file is a textual file consisting of a sequence of lines.
+Configuration lines and benchmark result lines, described below,
+have semantic meaning in the reporting of benchmark results.
+
+All other lines in the data file, including but not limited to
+blank lines and lines beginning with a # character, are ignored.
+For example, the testing package prints test results above benchmark data,
+usually the text `PASS`. That line is neither a configuration line nor a benchmark
+result line, so it is ignored.
+
+### Configuration Lines
+
+A configuration line is a key-value pair of the form
+
+	key: value
+
+where key contains no space characters (as defined by `unicode.IsSpace`)
+nor upper case characters (as defined by `unicode.IsUpper`),
+and space characters separate “key:” from “value.”
+Conventionally, multiword keys are written with the words separated by hyphens, as in `cpu-count`.
+There are no restrictions on value, except that it cannot contain a newline character.
+Value can be omitted entirely but the colon must still be present.
+
+The interpretation of a key/value pair is up to tooling, but the key/value pair
+is considered to describe all benchmark results that follow,
+until overwritten by a configuration line with the same key.
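
A minimal sketch of recognizing a configuration line under the rules above. The helper name `parseConfigLine` is hypothetical and not part of any Go library; rejecting an empty key and trimming surrounding space from the value are assumptions the proposal does not state.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// parseConfigLine reports whether line is a configuration line and,
// if so, returns its key and value. Hypothetical helper for illustration.
func parseConfigLine(line string) (key, value string, ok bool) {
	i := strings.Index(line, ":")
	if i <= 0 {
		return "", "", false // no colon, or empty key (assumed invalid)
	}
	key, rest := line[:i], line[i+1:]
	for _, r := range key {
		// Keys contain no space characters and no upper case characters.
		if unicode.IsSpace(r) || unicode.IsUpper(r) {
			return "", "", false
		}
	}
	// Space characters must separate "key:" from the value,
	// but the value may be omitted entirely.
	if rest != "" && strings.TrimLeftFunc(rest, unicode.IsSpace) == rest {
		return "", "", false
	}
	return key, strings.TrimSpace(rest), true
}

func main() {
	lines := []string{
		"goos: darwin",
		"commit-time: 2016-02-11T13:25:45-0500",
		"note:",
		"PASS",
	}
	for _, line := range lines {
		key, value, ok := parseConfigLine(line)
		fmt.Printf("%q -> key=%q value=%q ok=%v\n", line, key, value, ok)
	}
}
```

Because the key may not contain upper case characters, a benchmark result line whose name contains a colon is not mistaken for a configuration line by this check.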
+
+### Benchmark Results
+
+A benchmark result line has the general form
+
+	<name> <iterations> <value> <unit> [<value> <unit>...]
+
+The fields are separated by runs of space characters (as defined by `unicode.IsSpace`),
+so the line can be parsed with `strings.Fields`.
+The line must have an even number of fields, and at least four.
+
+The first field is the benchmark name, which must begin with `Benchmark`
+and is typically followed by a capital letter, as in `BenchmarkReverseString`.
+Tools displaying benchmark data conventionally omit the `Benchmark` prefix.
+The same benchmark name can appear on multiple result lines,
+indicating that the benchmark was run multiple times.
+
+The second field gives the number of iterations run.
+For most processing this number can be ignored, although
+it may give some indication of the expected accuracy
+of the measurements that follow.
+
+The remaining fields report value/unit pairs in which the value
+is a float64 that can be parsed by `strconv.ParseFloat` 
+and the unit explains the value, as in “64.88 MB/s”.
+The units reported are typically normalized so that they can be
+interpreted without considering the number of iterations.
+In the example, the CPU cost is reported per-operation and the
+throughput is reported per-second; neither is a total that
+depends on the number of iterations.
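
As an illustration of the rules in this section, here is a small sketch of a result-line parser. The type and function names (`Measurement`, `Result`, `parseResultLine`) are hypothetical, and parsing the iteration count as a 64-bit integer is an assumption; the proposal only says the field gives the number of iterations.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Measurement is one value/unit pair from a result line.
type Measurement struct {
	Value float64
	Unit  string
}

// Result is one parsed benchmark result line.
type Result struct {
	Name   string
	Iters  int64
	Values []Measurement
}

// parseResultLine reports whether line is a benchmark result line and,
// if so, returns its parsed form. Hypothetical helper for illustration.
func parseResultLine(line string) (Result, bool) {
	f := strings.Fields(line)
	// An even number of fields, at least four, with a Benchmark-prefixed name.
	if len(f) < 4 || len(f)%2 != 0 || !strings.HasPrefix(f[0], "Benchmark") {
		return Result{}, false
	}
	n, err := strconv.ParseInt(f[1], 10, 64)
	if err != nil {
		return Result{}, false
	}
	r := Result{Name: f[0], Iters: n}
	for i := 2; i < len(f); i += 2 {
		v, err := strconv.ParseFloat(f[i], 64)
		if err != nil {
			return Result{}, false
		}
		r.Values = append(r.Values, Measurement{Value: v, Unit: f[i+1]})
	}
	return r, true
}

func main() {
	line := "BenchmarkDecodeDigitsSpeed1e4-8   \t     100\t    154125 ns/op\t  64.88 MB/s"
	if r, ok := parseResultLine(line); ok {
		fmt.Println(r.Name, r.Iters)
		for _, m := range r.Values {
			fmt.Println(m.Value, m.Unit)
		}
	}
}
```

Lines that fail any of the checks, such as `PASS`, are simply reported as not being result lines, which matches the rule that all other lines are ignored.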
+
+### Value Units
+
+A value's unit string is expected to specify not only the measurement unit
+but also, as needed, a description of what is being measured.
+For example, a benchmark might report its overall execution time
+as well as cache miss times with three units “ns/op,” “L1-miss-ns/op,” and “L2-miss-ns/op.”
+
+Tooling can expect that the unit strings are identical for all runs to be compared;
+for example, a result reporting “ns/op” need not be considered comparable
+to one reporting “µs/op.”
+
+However, tooling may assume that the measurement unit is the final
+of the hyphen-separated words in the unit string and may recognize
+and rescale known measurement units.
+For example, consistently large “ns/op” or “L1-miss-ns/op”
+might be rescaled to “ms/op” or “L1-miss-ms/op” for display.
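
The final-word convention can be sketched as follows. The names `splitUnit` and `nsToMs` are hypothetical, and the sketch rescales only “ns/op” to “ms/op”; deciding when values are consistently large enough to rescale is left to the caller.

```go
package main

import (
	"fmt"
	"strings"
)

// splitUnit splits a unit string at its last hyphen into a description
// and a measurement unit. A unit with no hyphen has an empty description.
func splitUnit(unit string) (desc, measure string) {
	i := strings.LastIndex(unit, "-")
	if i < 0 {
		return "", unit
	}
	return unit[:i], unit[i+1:]
}

// nsToMs converts a value whose measurement unit is "ns/op" to "ms/op",
// keeping any description prefix. ok is false for any other unit,
// in which case the inputs are returned unchanged.
func nsToMs(value float64, unit string) (float64, string, bool) {
	desc, measure := splitUnit(unit)
	if measure != "ns/op" {
		return value, unit, false
	}
	if desc != "" {
		desc += "-"
	}
	return value / 1e6, desc + "ms/op", true
}

func main() {
	for _, unit := range []string{"ns/op", "L1-miss-ns/op", "MB/s"} {
		v, u, ok := nsToMs(13879794, unit)
		fmt.Println(v, u, ok)
	}
}
```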
+
+### Benchmark Name Configuration
+
+In the current testing package, benchmark names correspond to Go identifiers:
+each benchmark must be written as a different Go function.
+[Work targeted for Go 1.7](https://github.com/golang/proposal/blob/master/design/12166-subtests.md) will allow tests and benchmarks
+to define sub-tests and sub-benchmarks programmatically,
+in particular to vary interesting parameters both when
+testing and when benchmarking.
+That work uses a slash to separate the name of a benchmark
+collection from the description of a sub-benchmark.
+
+We propose that sub-benchmarks adopt the convention of
+choosing names that are key:value pairs;
+that slash-prefixed key:value pairs in the benchmark name are
+treated by benchmark data processors as per-benchmark 
+configuration values;
+and that for sub-benchmarks the -N suffix to describe the
+GOMAXPROCS value is expanded to /gomaxprocs:N.
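
A data processor following this convention might extract the per-benchmark configuration roughly as below. The function name `parseName` is hypothetical, and silently skipping slash-separated elements that contain no colon is an assumption; the proposal does not say how such elements should be handled.

```go
package main

import (
	"fmt"
	"strings"
)

// parseName splits a benchmark name into its base name and the
// key:value pairs carried by its slash-prefixed elements.
// Hypothetical helper for illustration.
func parseName(name string) (base string, config map[string]string) {
	parts := strings.Split(name, "/")
	base = parts[0]
	config = make(map[string]string)
	for _, p := range parts[1:] {
		// Elements without a colon are skipped (an assumption).
		if k, v, ok := strings.Cut(p, ":"); ok {
			config[k] = v
		}
	}
	return base, config
}

func main() {
	base, config := parseName("BenchmarkDecode/text:digits/level:speed/size:1e4/gomaxprocs:8")
	fmt.Println(base)
	for _, k := range []string{"text", "level", "size", "gomaxprocs"} {
		fmt.Printf("%s=%s\n", k, config[k])
	}
}
```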
+
+### Example
+
+The benchmark output given in the background section above
+is already in the format proposed here.
+That is a key feature of the proposal.
+
+However, a future run of the benchmark might add configuration lines,
+and the benchmark might be rewritten to use sub-benchmarks,
+producing this output:
+
+	commit: 7cd9055
+	commit-time: 2016-02-11T13:25:45-0500
+	goos: darwin
+	goarch: amd64
+	cpu: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
+	cpu-count: 8
+	cpu-physical-count: 4
+	os: Mac OS X 10.11.3
+	mem: 16 GB
+
+	BenchmarkDecode/text:digits/level:speed/size:1e4/gomaxprocs:8   	     100	    154125 ns/op	  64.88 MB/s	   40418 B/op	       7 allocs/op
+	BenchmarkDecode/text:digits/level:speed/size:1e5/gomaxprocs:8   	      10	   1367632 ns/op	  73.12 MB/s	   41356 B/op	      14 allocs/op
+	BenchmarkDecode/text:digits/level:speed/size:1e6/gomaxprocs:8   	       1	  13879794 ns/op	  72.05 MB/s	   52056 B/op	      94 allocs/op
+	BenchmarkDecode/text:digits/level:default/size:1e4/gomaxprocs:8 	     100	    147551 ns/op	  67.77 MB/s	   40418 B/op	       8 allocs/op
+	BenchmarkDecode/text:digits/level:default/size:1e5/gomaxprocs:8 	      10	   1197672 ns/op	  83.50 MB/s	   41508 B/op	      13 allocs/op
+	BenchmarkDecode/text:digits/level:default/size:1e6/gomaxprocs:8 	       1	  11808775 ns/op	  84.68 MB/s	   53800 B/op	      80 allocs/op
+	BenchmarkDecode/text:digits/level:best/size:1e4/gomaxprocs:8    	     100	    143348 ns/op	  69.76 MB/s	   40417 B/op	       8 allocs/op
+	BenchmarkDecode/text:digits/level:best/size:1e5/gomaxprocs:8    	      10	   1185527 ns/op	  84.35 MB/s	   41508 B/op	      13 allocs/op
+	BenchmarkDecode/text:digits/level:best/size:1e6/gomaxprocs:8    	       1	  11740304 ns/op	  85.18 MB/s	   53800 B/op	      80 allocs/op
+	BenchmarkDecode/text:twain/level:speed/size:1e4/gomaxprocs:8    	     100	    143665 ns/op	  69.61 MB/s	   40849 B/op	      15 allocs/op
+	BenchmarkDecode/text:twain/level:speed/size:1e5/gomaxprocs:8    	      10	   1390359 ns/op	  71.92 MB/s	   45700 B/op	      31 allocs/op
+	BenchmarkDecode/text:twain/level:speed/size:1e6/gomaxprocs:8    	       1	  12128469 ns/op	  82.45 MB/s	   89336 B/op	     221 allocs/op
+	BenchmarkDecode/text:twain/level:default/size:1e4/gomaxprocs:8  	     100	    141916 ns/op	  70.46 MB/s	   40849 B/op	      15 allocs/op
+	BenchmarkDecode/text:twain/level:default/size:1e5/gomaxprocs:8  	      10	   1076669 ns/op	  92.88 MB/s	   43820 B/op	      28 allocs/op
+	BenchmarkDecode/text:twain/level:default/size:1e6/gomaxprocs:8  	       1	  10106485 ns/op	  98.95 MB/s	   71096 B/op	     172 allocs/op
+	BenchmarkDecode/text:twain/level:best/size:1e4/gomaxprocs:8     	     100	    138516 ns/op	  72.19 MB/s	   40849 B/op	      15 allocs/op
+	BenchmarkDecode/text:twain/level:best/size:1e5/gomaxprocs:8     	      10	   1227964 ns/op	  81.44 MB/s	   43316 B/op	      25 allocs/op
+	BenchmarkDecode/text:twain/level:best/size:1e6/gomaxprocs:8     	       1	  10040347 ns/op	  99.60 MB/s	   72120 B/op	     173 allocs/op
+	BenchmarkEncode/text:digits/level:speed/size:1e4/gomaxprocs:8   	      30	    482808 ns/op	  20.71 MB/s
+	BenchmarkEncode/text:digits/level:speed/size:1e5/gomaxprocs:8   	       5	   2685455 ns/op	  37.24 MB/s
+	BenchmarkEncode/text:digits/level:speed/size:1e6/gomaxprocs:8   	       1	  24966055 ns/op	  40.05 MB/s
+	BenchmarkEncode/text:digits/level:default/size:1e4/gomaxprocs:8 	      20	    655592 ns/op	  15.25 MB/s
+	BenchmarkEncode/text:digits/level:default/size:1e5/gomaxprocs:8 	       1	  13000839 ns/op	   7.69 MB/s
+	BenchmarkEncode/text:digits/level:default/size:1e6/gomaxprocs:8 	       1	 136341747 ns/op	   7.33 MB/s
+	BenchmarkEncode/text:digits/level:best/size:1e4/gomaxprocs:8    	      20	    668083 ns/op	  14.97 MB/s
+	BenchmarkEncode/text:digits/level:best/size:1e5/gomaxprocs:8    	       1	  12301511 ns/op	   8.13 MB/s
+	BenchmarkEncode/text:digits/level:best/size:1e6/gomaxprocs:8    	       1	 137962041 ns/op	   7.25 MB/s
+
+Using sub-benchmarks has benefits beyond this proposal, namely that it would
+avoid the current repetitive code:
+
+	func BenchmarkDecodeDigitsSpeed1e4(b *testing.B)    { benchmarkDecode(b, digits, speed, 1e4) }
+	func BenchmarkDecodeDigitsSpeed1e5(b *testing.B)    { benchmarkDecode(b, digits, speed, 1e5) }
+	func BenchmarkDecodeDigitsSpeed1e6(b *testing.B)    { benchmarkDecode(b, digits, speed, 1e6) }
+	func BenchmarkDecodeDigitsDefault1e4(b *testing.B)  { benchmarkDecode(b, digits, default_, 1e4) }
+	func BenchmarkDecodeDigitsDefault1e5(b *testing.B)  { benchmarkDecode(b, digits, default_, 1e5) }
+	func BenchmarkDecodeDigitsDefault1e6(b *testing.B)  { benchmarkDecode(b, digits, default_, 1e6) }
+	func BenchmarkDecodeDigitsCompress1e4(b *testing.B) { benchmarkDecode(b, digits, compress, 1e4) }
+	func BenchmarkDecodeDigitsCompress1e5(b *testing.B) { benchmarkDecode(b, digits, compress, 1e5) }
+	func BenchmarkDecodeDigitsCompress1e6(b *testing.B) { benchmarkDecode(b, digits, compress, 1e6) }
+	func BenchmarkDecodeTwainSpeed1e4(b *testing.B)     { benchmarkDecode(b, twain, speed, 1e4) }
+	func BenchmarkDecodeTwainSpeed1e5(b *testing.B)     { benchmarkDecode(b, twain, speed, 1e5) }
+	func BenchmarkDecodeTwainSpeed1e6(b *testing.B)     { benchmarkDecode(b, twain, speed, 1e6) }
+	func BenchmarkDecodeTwainDefault1e4(b *testing.B)   { benchmarkDecode(b, twain, default_, 1e4) }
+	func BenchmarkDecodeTwainDefault1e5(b *testing.B)   { benchmarkDecode(b, twain, default_, 1e5) }
+	func BenchmarkDecodeTwainDefault1e6(b *testing.B)   { benchmarkDecode(b, twain, default_, 1e6) }
+	func BenchmarkDecodeTwainCompress1e4(b *testing.B)  { benchmarkDecode(b, twain, compress, 1e4) }
+	func BenchmarkDecodeTwainCompress1e5(b *testing.B)  { benchmarkDecode(b, twain, compress, 1e5) }
+	func BenchmarkDecodeTwainCompress1e6(b *testing.B)  { benchmarkDecode(b, twain, compress, 1e6) }
+
+More importantly for this proposal, using sub-benchmarks also makes the possible
+comparison axes clear: digits vs twain, speed vs default vs best, size 1e4 vs 1e5 vs 1e6.
+
+## Rationale
+
+As discussed in the background section,
+we have already developed a number of analysis programs
+that assume this proposal's format,
+as well as a number of programs that generate this format.
+Standardizing the format should encourage additional work
+on both kinds of programs.
+
+[Issue 12826](https://golang.org/issue/12826) suggests a different approach,
+namely the addition of a new `go test` option `-benchformat`, to control
+the format of benchmark output. In fact it gives the lack of standardization
+as the main justification for a new option:
+
+> Currently `go test -bench .` prints out benchmark results in a 
+> certain format, but there is no guarantee that this format will not 
+> change. Thus a tool that parses go test output may break if an 
+> incompatible change to the output format is made.
+
+Our approach is instead to guarantee that the format will not change,
+or rather that it will only change in ways allowed by this design.
+An analysis tool that parses the output specified here will not break
+in future versions of Go,
+and a tool that generates the output specified here will work
+with all such analysis tools.
+Having one agreed-upon format enables broad interoperation;
+the ability for one tool to generate arbitrarily many different formats
+does not achieve the same result.
+
+The proposed format also seems to be extensible enough to accommodate
+anticipated future work on benchmark reporting.
+
+The main known issue with the current `go test -bench` is that
+we'd like to emit finer-grained detail about runs, for linearity testing
+and more robust statistics.
+This proposal allows that by simply printing more result lines.
+
+Another known issue is that we may want to add custom outputs
+such as garbage collector statistics to certain benchmark runs.
+This proposal allows that by adding more value-unit pairs.
+
+## Compatibility
+
+Tools consuming the existing benchmark format may need trivial changes
+to ignore non-benchmark result lines or to cope with additional value-unit pairs
+in benchmark results.
+
+## Implementation
+
+The benchmark format described here is already generated by `go test -bench`
+and expected by tools like `benchcmp` and `benchstat`.
+
+The format is trivial to generate, and it is 
+straightforward but not quite trivial to parse.
+
+We anticipate that the [new x/perf subrepo](https://github.com/golang/go/issues/14304) will include a library for loading
+benchmark data from files, although the format is also simple enough that
+tools that want a different in-memory representation might reasonably
+write separate parsers.
+