Agg scope and filtering

polyfractal · polyfractal · commit 9986c1ffd510 · 2014-05-30T13:19:00.000-04:00
diff --git a/300_Aggregations.asciidoc b/300_Aggregations.asciidoc
@@ -14,4 +14,8 @@ include::300_Aggregations/28_bucket_metric_list.asciidoc[]
 
 include::300_Aggregations/30_histogram.asciidoc[]
 
-include::300_Aggregations/35_date_histogram.asciidoc[]
+include::300_Aggregations/35_date_histogram.asciidoc[]
+
+include::300_Aggregations/40_scope.asciidoc[]
+
+include::300_Aggregations/45_filtering.asciidoc[]
diff --git a/300_Aggregations/40_scope.asciidoc b/300_Aggregations/40_scope.asciidoc
@@ -0,0 +1,194 @@
+
+=== Scoping Aggregations
+
+With all of the aggregation examples given so far, you may have noticed that we
+omitted a `query` from the search request.  The entire request was
+simply an aggregation.
+
+Aggregations can be run at the same time as search requests, but you need to
+understand a new concept: _scope_.  By default, aggregations operate in the same 
+scope as the query.  Put another way, aggregations are calculated on the set of 
+documents that match your query.
+
+If we look at one of our first aggregation examples:
+
+[source,js]
+--------------------------------------------------
+GET /cars/transactions/_search?search_type=count
+{
+    "aggs" : {
+        "colors" : {
+            "terms" : {
+              "field" : "color"
+            }
+        }
+    }
+}
+--------------------------------------------------
+// SENSE: 300_Aggregations/40_scope.json
+
+...you can see that the aggregation is in isolation.  In reality, Elasticsearch
+assumes "no query specified" is equivalent to "query all documents". The above
+query is internally translated into:
+
+[source,js]
+--------------------------------------------------
+GET /cars/transactions/_search?search_type=count
+{
+    "query" : {
+        "match_all" : {}
+    },
+    "aggs" : {
+        "colors" : {
+            "terms" : {
+              "field" : "color"
+            }
+        }
+    }
+}
+--------------------------------------------------
+// SENSE: 300_Aggregations/40_scope.json
+
+The aggregation always operates in the scope of the query, so an isolated
+aggregation really operates in the scope of a `match_all` query...that is to say,
+all documents.
+
+Once armed with the knowledge of scoping, we can start to customize 
+aggregations even further.  All of our previous examples calculated statistics
+about _all_ of the data: top selling cars, average price of all cars, most sales
+per month, etc.
+
+With scope, we can ask questions like "How many colors are Ford cars
+available in?".  We do this by simply adding a query to the request (in this case
+a `match` query):
+
+[source,js]
+--------------------------------------------------
+GET /cars/transactions/_search  <1>
+{
+    "query" : {
+        "match" : {
+            "make" : "ford"
+        }
+    },
+    "aggs" : {
+        "colors" : {
+            "terms" : {
+              "field" : "color"
+            }
+        }
+    }
+}
+--------------------------------------------------
+// SENSE: 300_Aggregations/40_scope.json
+<1> We are omitting `search_type=count` so that search hits are returned too
+
+By omitting the `search_type=count` this time, we can see both the search
+results and the aggregation results:
+
+[source,js]
+--------------------------------------------------
+{
+...
+   "hits": {
+      "total": 2,
+      "max_score": 1.6931472,
+      "hits": [
+         {
+            "_source": {
+               "price": 25000,
+               "color": "blue",
+               "make": "ford",
+               "sold": "2014-02-12"
+            }
+         },
+         {
+            "_source": {
+               "price": 30000,
+               "color": "green",
+               "make": "ford",
+               "sold": "2014-05-18"
+            }
+         }
+      ]
+   },
+   "aggregations": {
+      "colors": {
+         "buckets": [
+            {
+               "key": "blue",
+               "doc_count": 1
+            },
+            {
+               "key": "green",
+               "doc_count": 1
+            }
+         ]
+      }
+   }
+}
+--------------------------------------------------
+
+
+This may seem trivial, but it is the key to advanced and powerful dashboards.
+You can transform any static dashboard into a real-time data exploration device
+by adding a search bar.  This allows the user to search for terms and see all
+of the graphs (which are powered by aggregations, and thus scoped to the query)
+update in real-time.  Try that with Hadoop!
+
+<TODO> Maybe add two screenshots of a Kibana dashboard that changes considerably
+when the search changes?
+
+
+==== Global Bucket
+
+You'll often want your aggregation to be scoped to your query.  But sometimes
+you'll want to search for some subset of data, but aggregate across _all_ of
+your data.
+
+For example, say you want to know the average price for Ford cars compared to the
+average price of _all_ cars. We can use a regular aggregation (scoped to the query) 
+to get the first piece of information.  The second piece of information can be 
+obtained by using a `global` bucket.
+
+The global bucket will contain _all_ of your documents, regardless of the query 
+scope; it bypasses the scope completely.  Because it is a bucket, you can nest
+aggregations inside of it like normal:
+
+[source,js]
+--------------------------------------------------
+GET /cars/transactions/_search?search_type=count
+{
+    "query" : {
+        "match" : {
+            "make" : "ford"
+        }
+    },
+    "aggs" : {
+        "single_avg_price": {
+            "avg" : { "field" : "price" } <1>
+        },
+        "all": {
+            "global" : {}, <2>
+            "aggs" : {
+                "avg_price": {
+                    "avg" : { "field" : "price" } <3>
+                }
+                
+            }
+        }
+    }
+}
+--------------------------------------------------
+// SENSE: 300_Aggregations/40_scope.json
+<1> This aggregation operates in the query scope (i.e. all docs matching "ford")
+<2> The `global` bucket has no parameters
+<3> This aggregation operates on all documents, regardless of the make
+
+
+The first `avg` metric is calculated from all documents that fall under the
+query scope -- all "ford" cars.  The second `avg` metric is nested under a 
+`global` bucket, which means it ignores scoping entirely and calculates on 
+all the documents.  The average returned for that aggregation represents
+the average price of all cars.
+
diff --git a/300_Aggregations/45_filtering.asciidoc b/300_Aggregations/45_filtering.asciidoc
@@ -0,0 +1,181 @@
+
+=== Filtering Aggregations
+
+A natural extension to aggregation scoping is filtering.  Because the aggregation
+operates in the context of the query scope, any filter applied to the query
+will also apply to the aggregation.
+
+==== Filtered Query
+If we want to find all cars priced at $10,000 or more and also calculate the
+average price for those cars, we can simply use a `filtered` query:
+
+[source,js]
+--------------------------------------------------
+GET /cars/transactions/_search?search_type=count
+{
+    "query" : {
+        "filtered": {
+            "filter": {
+                "range": {
+                    "price": { "gte": 10000 }
+                }
+            }
+        }
+    },
+    "aggs" : {
+        "single_avg_price": {
+            "avg" : { "field" : "price" }
+        }
+    }
+}
+--------------------------------------------------
+// SENSE: 300_Aggregations/45_filtering.json
+
+Fundamentally, using a `filtered` query is no different from using a `match`
+query like we discussed in the last section.  The query (which happens to include
+a filter) returns a certain subset of documents, and the aggregation operates
+on those documents.
+
+==== Filter bucket
+
+But what if you would like to filter just the aggregation results?  Imagine we
+are building the search page for our car dealership.  We want to display
+search results according to what the user searches for.  But we also want
+to enrich the page by including the average price of cars (matching the search)
+which were sold in the last month.
+
+We can't use simple scoping here, since there are two different criteria.  The 
+search results must match "ford", but the aggregation results must match "ford"
+AND "sold > now - 1M".
+
+To solve this problem, we can use a special bucket called `filter`.  You specify
+a filter, and when documents match the filter's criteria, they are added to the
+bucket.
+
+Here is the resulting query:
+
+[source,js]
+--------------------------------------------------
+GET /cars/transactions/_search?search_type=count
+{
+   "query":{
+      "match": {
+         "make": "ford"
+      }
+   },
+   "aggs":{
+      "recent_sales": {
+         "filter": { <1>
+            "range": {
+               "sold": {
+                  "from": "now-1M"
+               }
+            }
+         },
+         "aggs": {
+            "average_price":{
+               "avg": {
+                  "field": "price" <2>
+               }
+            }
+         }
+      }
+   }
+}
+--------------------------------------------------
+// SENSE: 300_Aggregations/45_filtering.json
+<1> Using the `filter` bucket to apply a filter in addition to the `query` scope
+<2> This `avg` metric will therefore only average docs which are both "ford" and sold in the last month
+
+Since the `filter` bucket operates like any other bucket, you are free to nest
+other buckets and metrics inside.  All nested components will "inherit" the filter.
+This allows you to filter selective portions of the aggregation as required.
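+
+As a minimal sketch of that nesting (the `colors` bucket name is our own choice
+for illustration, not something defined earlier), a `terms` bucket placed inside
+`recent_sales` would break the last month's "ford" sales down by color, with an
+average price calculated per color:
+
+[source,js]
+--------------------------------------------------
+GET /cars/transactions/_search?search_type=count
+{
+   "query":{
+      "match": {
+         "make": "ford"
+      }
+   },
+   "aggs":{
+      "recent_sales": {
+         "filter": {
+            "range": {
+               "sold": {
+                  "from": "now-1M"
+               }
+            }
+         },
+         "aggs": {
+            "colors": {
+               "terms": {
+                  "field": "color"
+               },
+               "aggs": {
+                  "average_price": {
+                     "avg": {
+                        "field": "price"
+                     }
+                  }
+               }
+            }
+         }
+      }
+   }
+}
+--------------------------------------------------
+
+Both the `colors` buckets and the `average_price` metric sit below the `filter`
+bucket, so both only see documents that passed the date range.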
+
+==== Post Filter
+
+So far, we have a way to filter both search results and aggregations (a
+`filtered` query), as well as filtering individual portions of the aggregation
+(`filter` bucket).
+
+You may be thinking to yourself "hmm...is there a way to filter _just_ the search
+results but not the aggregation?".  The answer is to use a `post_filter`.
+
+This is a top-level search request element which accepts a filter.  The filter is
+applied _after_ the query has executed (hence the "post" moniker...it runs
+_post query_ execution).  Because it operates after the query has executed,
+it does not affect the query scope...and thus does not affect the aggregations
+either.
+
+We can use this behavior to apply additional filters to our search
+criteria that don't affect things like categorical facets in your UI.  Let's 
+design another search page for our car dealer.  This page will allow the user
+to search for a car and filter by color.  Color choices are populated via an
+aggregation.
+
+[source,js]
+--------------------------------------------------
+GET /cars/transactions/_search
+{
+    "query": {
+        "match": {
+            "make": "ford"
+        }
+    },
+    "post_filter": {    <1>
+        "term" : {
+            "color" : "green"
+        }
+    },
+    "aggs" : {
+        "all_colors": {
+            "terms" : { "field" : "color" }
+        }
+    }
+}
+--------------------------------------------------
+// SENSE: 300_Aggregations/45_filtering.json
+<1> The `post_filter` element is a "top-level" element and filters just the search hits
+
+The `query` portion is finding all "ford" cars.  We are then building a list of
+colors with a `terms` aggregation.  Because aggregations operate in the query
+scope, the list of colors will correspond with the colors that Ford cars are
+painted.
+
+Finally, the `post_filter` will filter the search results to show only green
+"ford" cars.  This happens _after_ the query is executed, so the aggregations
+are unaffected.
+
+This is often important for coherent UIs.  Imagine a user clicks a category in 
+your UI (e.g. "green").  The expectation is that the search results are filtered,
+but _not_ the UI options.  If you applied a `filtered` query, the UI would
+instantly transform to show _only_ "green" as an option...not what the user wants!
+
+[WARNING]
+.Performance consideration
+====
+_Only_ use a `post_filter` if you need to differentially filter search results 
+and aggregations. Sometimes people will use `post_filter` for regular searches.
+
+Don't do this!  The nature of the `post_filter` means it runs _after_ the query,
+so any performance benefit of filtering (caches, etc) is lost completely.
+
+The `post_filter` should only be used in combination with aggregations, and only
+when you need differential filtering.
+====
+
+==== Recap
+
+Choosing the appropriate type of filtering -- search hits, aggregations or
+both -- often boils down to how you want your user interface to behave.  Choose
+the appropriate filter (or combinations) depending on how you want to display
+results to your user.
+
+ - `filtered` query: affects both search results and aggregations
+ - `filter` bucket: affects just aggregations
+ - `post_filter`: affects just search results
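+
+The three approaches can also be combined in a single request.  The following
+is an illustrative sketch only, assuming the same `cars` index and the fields
+used in the earlier examples; it simply stitches those examples together:
+
+[source,js]
+--------------------------------------------------
+GET /cars/transactions/_search
+{
+    "query" : {
+        "filtered": {
+            "query": {
+                "match": { "make": "ford" }
+            },
+            "filter": {
+                "range": { "price": { "gte": 10000 } }
+            }
+        }
+    },
+    "post_filter": {
+        "term" : { "color" : "green" }
+    },
+    "aggs" : {
+        "all_colors": {
+            "terms" : { "field" : "color" }
+        },
+        "recent_sales": {
+            "filter": {
+                "range": { "sold": { "from": "now-1M" } }
+            },
+            "aggs": {
+                "average_price": {
+                    "avg": { "field": "price" }
+                }
+            }
+        }
+    }
+}
+--------------------------------------------------
+
+In this sketch the `filtered` query scopes everything to "ford" cars priced at
+$10,000 or more.  The `post_filter` narrows only the search hits to green cars,
+so `all_colors` still lists every color in the query scope.  The `filter`
+bucket narrows only `average_price` to cars sold in the last month.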
+
+
+
+
+
+