Comments added

clintongormley · polyfractal · commit bce564a686d0 · 2014-05-30T13:18:59.000-04:00
diff --git a/060_Distributed_Search/15_Search_options.asciidoc b/060_Distributed_Search/15_Search_options.asciidoc
@@ -82,6 +82,7 @@ GET /_search?routing=user_1,user2
 This technique comes in useful when designing very large search systems and we
 discuss it in detail in <<scale>>.
 
+[[search-type]]
 ==== `search_type`
 
 While `query_then_fetch` is the default search type, other search types can
diff --git a/300_Aggregations/05_overview.asciidoc b/300_Aggregations/05_overview.asciidoc
@@ -1,35 +1,44 @@
 
 === Elasticsearch offers more than just search
 
+// "the following chapters"... obvious. just delete the 2nd sentence, and merge with next para
 Up until this point, the this book has been dedicated to search.  The following
 chapters deal with Aggregations, an entirely different set of functionality
 built into Elasticsearch.
 
 With search, we have a query and we wish to find a subset of documents which
-match the query.  We are looking for the proverbial needle(s) in the 
+match the query.  We are looking for the proverbial needle(s) in the
 haystack.
 
-Aggregations take a step back.  Instead of looking for individual documents, 
-we want to analyze and summarize our complete set of data.  
+// perhaps "zoom out to get an overview"?
+// something about "showing users the data that exists in your index, leading them to the right results"?
+Aggregations take a step back.  Instead of looking for individual documents,
+we want to analyze and summarize our complete set of data.
 
-- How many needles are in the haystack?  
-- What is the average length of the needles?  
+// Popular manufacturers? Unusual clumps of needles in the haystack?
+- How many needles are in the haystack?
+- What is the average length of the needles?
 - What is the median length of needle broken down by manufacturer?
 - How many needles were added to the haystack each month?
 
 Aggregations allow us to ask sophisticated questions of our data.  And yet, while
-the functionality is completely different from search, it leverages the 
+the functionality is completely different from search, it leverages the
 same data-structures.  This means aggregations execute quickly and are
 _near-realtime_, just like search.
 
+// perhaps hadoop instead of sql? reputation for slowness
+// "slice ... realtime" -> "visualize your data in realtime, allowing you to respond immediately"
 This is extremely powerful for reporting and dashboards.  Instead of performing
 nightly "rollups" of your data (_e.g. large, batch SQL joins which
 are run nightly on a cron job because they are so slow_), you can slice and dice
 your data in realtime.
 
+// Perhaps mention "not precalculated, out of date, and irrelevant"?
+// Perhaps "aggs are calculated in the context of the user's search, so you're not showing them that you have 10 4 star hotels on your site, but that you have 10 4 star hotels that *match their criteria*".
 Finally, aggregations operate alongside search requests, which means you can
 both search/filter documents _and_ perform analytics at the same time, on the
 same data, in a single request.
 
+// for aggs -> for analytics?
 Aggregations are so powerful that many companies have built large Elasticsearch
 clusters solely for aggregations.
diff --git a/300_Aggregations/10_facets.asciidoc b/300_Aggregations/10_facets.asciidoc
@@ -2,12 +2,12 @@
 === What about Facets?
 
 If you've used Elasticsearch in the past, you are probably aware of _facets_.
-You can think of Aggregations as "facets on steroids".  Everything you can do 
+You can think of Aggregations as "facets on steroids".  Everything you can do
 with facets, you can do with aggregations.
 
 But there are plenty of operations that are possible in aggregations which are
 simply impossible with facets.
 
-Facets have not been officially depreciated yet, but you can expect that to 
-happen eventually. We recommend migrating your facets over to aggregations when 
+Facets have not been officially deprecated yet, but you can expect that to
+happen eventually. We recommend migrating your facets over to aggregations when
 you get the chance, and starting all new projects with aggregations instead of facets.
diff --git a/300_Aggregations/15_concepts_buckets.asciidoc b/300_Aggregations/15_concepts_buckets.asciidoc
@@ -1,14 +1,17 @@
 
 === High-level concepts
 
+// define composable eg -- independent units of functionality which can be plugged together as needed
 Like the query DSL, aggregations have a composable syntax.  This means that there
 are only a few basic concepts to learn, but nearly limitless combinations
 of those basic components.
 
 To master aggregations, you only need to understand two main concepts:
 
-- Buckets: collections of documents meeting a criteria
-- Metrics: statistics calculated on documents in a bucket
+_Buckets_:: Collections of documents which meet a criterion.
+_Metrics_:: Statistics calculated on the documents in a bucket.
+
+// Perhaps relate buckets and metrics to SELECT count(field_1) GROUP BY field_2
 
 That's it!  Every aggregation is simply a combination of one or more buckets
 and zero or more metrics.
@@ -19,8 +22,9 @@ Let's dig into both of these concepts and see what they entail.
 
 A bucket is simply a collection of documents that meet a certain criteria.
 
-- An employee would land in either the "male" or "female" bucket. 
-- The city of Albany would land in the "New York" state bucket. 
+- An employee would land in either the "male" or "female" bucket.
+- The city of Albany would land in the "New York" state bucket.
+// Use ISO dates, not American dates ;)
 - The date "10-28-2014" would land within the "October" bucket.
 
 As aggregations are executed, the values inside each document are evaluated to
@@ -29,44 +33,48 @@ inside the bucket and the aggregation continues.
 
 Buckets can also be nested inside of other buckets, giving you a hierarchy or
 conditional partitioning scheme.  For example, "Cincinnati" would be placed inside
-the "Ohio" state bucket, and _entire_ "Ohio" bucket would be placed inside the 
+the "Ohio" state bucket, and the _entire_ "Ohio" bucket would be placed inside the
 "USA" country bucket.
 
+// Instead of "histos, ranges, unique etc", use examples? "by hour, by popular terms, by age range"
 There are a variety of different buckets in Elasticsearch, which allow you to
-partition documents in many different ways (histograms, ranges, unique terms, 
+partition documents in many different ways (histograms, ranges, unique terms,
 filters, etc).  But fundamentally they all operate on the same principle:
 partitioning documents based on a criteria.
 
 ==== Metrics
 
-Buckets allow us to partition documents into useful subsets, but ultimately what 
+// on those docs -> on the documents in each bucket?
+Buckets allow us to partition documents into useful subsets, but ultimately what
 we want is some kind of _metric_ calculated on those documents.  Bucketing is the
 means to an end - it provides a way to group documents in a way that you can
 calculate interesting metrics.
 
-Most of metrics are simple mathematical operations (min, mean, max, sum, etc)
+Most metrics are simple mathematical operations (min, mean, max, sum, etc)
 which are calculated using the document values.  In practical terms, metrics allow
-you to calculate quantities such as the average salary, or the maximum sale price, 
+you to calculate quantities such as the average salary, or the maximum sale price,
 or the 95th percentile for query latency.
 
 ==== Combining the two
 
 An aggregation is a combination of buckets and metrics.  An aggregation may have
 a single bucket, or a single metric, or one of each.  It may even have multiple
-buckets nested inside of other buckets.  
+buckets nested inside of other buckets.
 
 For example, we can partition documents by which country they belong to (a bucket),
 then calculate the average salary per country (a metric).
 
-Because buckets can be nested, we can derive a much more complex aggregation:  
+Because buckets can be nested, we can derive a much more complex aggregation:
 
+// nice
 1. Partition documents by country (bucket)
 2. Then partition each country bucket by gender (bucket)
 3. Then partition each gender bucket by age ranges (bucket)
 4. Finally, calculate the average salary for each age range (metric)
 
+// define tuple, or use "combination"
 This will give you the average salary per <country, gender, age> tuple.  All in
-one request and one pass over the data!
+one request and with one pass over the data!
 
 
 
diff --git a/300_Aggregations/20_basic_example.asciidoc b/300_Aggregations/20_basic_example.asciidoc
@@ -1,8 +1,9 @@
-
+// This section feels like you're worrying too much about explaining the syntax, rather than the point of aggs.  By this stage in the book, people should be used to the ES api, so I think we can assume more.  I'd change the emphasis here and state that intention: we want to find out what the most popular colours are.  To do that we'll use a "terms" agg, which counts up every term in the "color" field and returns the 10 most popular.
+// Step two: Add a query, to show that the aggs are calculated live on the results from the user's query.
 === Aggregation Test-drive
 
 We could spend the next few pages defining the various aggregations
-available and their syntax, but aggregations are truly best learned by example.
+and their syntax, but aggregations are truly best learned by example.
 Once you learn how to think about aggregations, and how to nest them appropriately,
 the syntax is fairly trivial.
 
@@ -38,7 +39,8 @@ Now that we have some data, let's construct our first aggregation.  A car dealer
 may want to know which color car sells the best.  This is easily accomplished
 using a simple aggregation.
 
-The syntax may look overwhelming at first, but hold on...we'll decompose the query
+// I don't think it's overwhelming, and users probably won't either... unless you mention it ;)
+The syntax may look overwhelming at first, but hold on... we'll decompose the query
 and discuss what each portion means.  First, the aggregation:
 
 [source,js]
@@ -48,42 +50,48 @@ GET /cars/transactions/_search?search_type=count <1>
     "aggs" : { <2>
         "colors" : { <3>
             "terms" : {
-              "field" : "color" <4> 
+              "field" : "color" <4>
             }
         }
     }
 }
 --------------------------------------------------
 // SENSE: 300_Aggregations/20_basic_example.json
+
+// Add the search_type=count thing as a sidebar, so it doesn't get in the way
 <1> Because we don't care about search results, we are going to use the `count`
-search_type, which will be faster
-<2> Aggregations are placed under the top-level `"aggs"` (the longer `"aggregations"` 
+<<search-type,`search_type`>>, which will be faster.
+<2> Aggregations are placed under the top-level `"aggs"` parameter (the longer `"aggregations"`
 will also work if you prefer that)
 <3> We then name the aggregation whatever we want -- "colors" in this example
 <4> Finally, we define a single bucket of type `terms`
 
+// Meh - Point here is that aggregations are executed in the context of the search results, rather than which endpoint is used.
 The first thing to notice is that aggregations are executed as a search, using the
 `/_search` endpoint.  As mentioned at the top of the chapter, aggregations are built
 from the same data structures that power search, which means they use the same
-endpoint.  Aggregations are also defined as a top-level parameter, just like using 
-`"query"` for search.  
+endpoint.  Aggregations are also defined as a top-level parameter, just like using
+`"query"` for search.
 
+// Delete this and make the point in the para above. It feels like you're scared to introduce the idea of context this early. I think it's OK.  You don't have to explain how to change the context yet, but at least make the point that there is one.
 .Can you use aggregations and queries together?
 ****
 Absolutely!  But hold that thought, we'll discuss it later in <todo>
 ****
 
+// I think it is OK to assume that naming aggs is a good idea.  Probably easier to make the point if you name it "popular_colors"
 Next we define a name for our aggregation.  This is entirely up to you...
 the response will be labeled with the name you provide so that your application
 can parse the results. You may also specify more than one aggregation per search
 request, so giving each aggregation a unique, identifiable name is important
 (we'll look at an example of this later).
 
 Next we define the aggregation itself.  For this example, we are defining
-a single `terms` bucket.  The `terms` bucket will dynamically create a new 
-bucket for every unique term it encounters.  Since we are telling it to use the 
+a single `terms` bucket.  The `terms` bucket will dynamically create a new
+bucket for every unique term it encounters.  Since we are telling it to use the
 "color" field, the `terms` bucket will dynamically create a new bucket for each color.
 
+// Trim the results here.  By this stage people have gone through 300 pages, so they should be familiar with what ES returns.  Also, they can execute the query themselves in Sense
 Let's execute that aggregation and take a look at the results:
 
 [source,js]
@@ -120,16 +128,20 @@ Let's execute that aggregation and take a look at the results:
 <1> No search hits are returned because we used the `search_type=count` param
 <2> Our "colors" aggregation is returned as part of the "aggregations" field
 <3> The key to each bucket corresponds to a unique term found in the "color" field
+
+// Perhaps: We always get back the `doc_count` metric which tells us how many documents contained this term.
+
 <4> The count of each bucket represents the number of documents with this color
 
 
 The response contains a list of buckets, each corresponding to a unique color
-(red, green, etc). Each bucket also includes a count of how many documents 
+(red, green, etc). Each bucket also includes a count of how many documents
 "fell into" that particular bucket.  For example, there are four red cars.
 
 Before we move on, there are some important yet not immediately obvious things
 to point out.
 
+// Delete the above line and make the realtime point in a para, which says that you could pipe this into a graphing library and display a dashboard showing real time trends. As soon as you sell a silver car, it'll show up in the graph.  (And no need for the last sentence)
 - The buckets were created dynamically.  Our application had no prior knowledge about
 how many colors in the index.  If you were to index a "silver" car next, a new
 "silver" bucket would automatically appear in the response.
@@ -139,7 +151,7 @@ directly into graphing libraries for near real-time dashboards
 - The aggregation is operating on all of the documents in your index at the moment.
 This can be changed, which we will talk about <here>.
 
-Voila!  Your first aggregation! 
+Voila!  Your first aggregation!
 
 
 
diff --git a/300_Aggregations/25_basic_example_expanded.asciidoc b/300_Aggregations/25_basic_example_expanded.asciidoc
@@ -1,10 +1,11 @@
 
 === Adding a metric to the mix
 
-The previous example told us how many documents were in each bucket, which is 
+The previous example told us how many documents were in each bucket, which is
 useful.  But often, our applications require more sophisticated _metrics_ about
 the documents. For example, what is the average price of cars in each bucket?
 
+// "nesting"-> need to tell Elasticsearch which metrics to calculate, and on which fields.
 To get this information, we need to start nesting metrics inside of the buckets.
 Metrics will calculate some kind of mathematical statistic based on the values
 in the documents residing within a particular bucket.
@@ -36,15 +37,16 @@ GET /cars/transactions/_search?search_type=count
 <2> We then give the metric a name: "avg_price"
 <3> And finally define it as an `avg` metric over the "price" field
 
-As you can see, we took the previous example and tacked on a new `agg` level.
+As you can see, we took the previous example and tacked on a new `aggs` level.
 This new aggregation level allows us to nest the `avg` metric inside the
 `terms` bucket.  Effectively, this means we will generate an average for each
 color.
 
-Just like the "colors" example, we need to name our metric ("avg_price") so we 
-can retrieve the values later.  Finally, we specify the metric itself (`avg`) 
+Just like the "colors" example, we need to name our metric ("avg_price") so we
+can retrieve the values later.  Finally, we specify the metric itself (`avg`)
 and what field we want the average to be calculated on (`price`).
 
+// Delete this para
 The response is, not surprisingly, nearly identical to the previous response...except
 there is now a new "avg_price" element added to each color bucket:
 
@@ -84,14 +86,15 @@ there is now a new "avg_price" element added to each color bucket:
 --------------------------------------------------
 <1> New "avg_price" element in response
 
+// Would love to have a graph under each example showing how the data can be displayed (later, i know)
 Although the response has changed minimally, the data we get out of it has grown
 substantially.  Before, we knew there were four red cars.  Now we know that the
 average price of red cars is $32,500.  This is something that you can plug directly
 into reports or graphs.
 
 === Buckets inside of buckets
 
-The true power of aggregations becomes apparent once you start playing with 
+The true power of aggregations becomes apparent once you start playing with
 different nesting schemes.  In the previous examples, we saw how you could nest
 a metric inside a bucket, which is already quite powerful.
 
@@ -132,7 +135,7 @@ GET /cars/transactions/_search?search_type=count
 each car make
 
 A few interesting things happened here.  First, you'll notice that the previous
-"avg_price" metric is left entirely intact.  Each "level" of an aggregation can 
+"avg_price" metric is left entirely intact.  Each "level" of an aggregation can
 have many metrics or buckets.  The "avg_price" metric tells us the average price
 for each car color.  This is independent of other buckets and metrics which
 are also being built.
@@ -211,7 +214,7 @@ GET /cars/transactions/_search?search_type=count
             },
             "make" : {
                 "terms" : {
-                    "field" : "make"   
+                    "field" : "make"
                 },
                 "aggs" : { <1>
                     "min_price" : { "min": { "field": "price"} }, <2>
@@ -224,6 +227,9 @@ GET /cars/transactions/_search?search_type=count
 }
 --------------------------------------------------
 // SENSE: 300_Aggregations/20_basic_example.json
+
+// Careful with the "no surprise", it makes it sound like you're bored :)
+
 <1> No surprise...we need to add another "aggs" level for nesting
 <2> Then we include a `min` metric
 <3> And a `max` metric
@@ -249,7 +255,7 @@ Which gives us the following output (again, truncated):
                            "value": 10000 <1>
                         },
                         "max_price": {
-                           "value": 20000 <2>
+                           "value": 20000 <1>
                         }
                      },
                      {
@@ -270,11 +276,12 @@ Which gives us the following output (again, truncated):
             },
 ...
 --------------------------------------------------
-<1><2> The `min` and `max` metrics that we added now appear under each "make"
+<1> The `min` and `max` metrics that we added now appear under each "make"
 
 With those two buckets, we've expanded the information derived from this query
 to include:
 
+// Nice, but "Similar analytics.." -> "etc."?
 - There are four red cars
 - The average price of a red car is $32,500
 - Three of the red cars are made by Honda, and one is a BMW
diff --git a/300_Aggregations/28_bucket_metric_list.asciidoc b/300_Aggregations/28_bucket_metric_list.asciidoc
diff --git a/300_Aggregations/30_histogram.asciidoc b/300_Aggregations/30_histogram.asciidoc
diff --git a/300_Aggregations/35_date_histogram.asciidoc b/300_Aggregations/35_date_histogram.asciidoc