Commit 9986c1f

committed
Agg scope and filtering
1 parent bce564a commit 9986c1f

3 files changed

+380
-1
lines changed

300_Aggregations.asciidoc

Lines changed: 5 additions & 1 deletion
@@ -14,4 +14,8 @@ include::300_Aggregations/28_bucket_metric_list.asciidoc[]
1414

1515
include::300_Aggregations/30_histogram.asciidoc[]
1616

17-
include::300_Aggregations/35_date_histogram.asciidoc[]
17+
include::300_Aggregations/35_date_histogram.asciidoc[]
18+
19+
include::300_Aggregations/40_scope.asciidoc[]
20+
21+
include::300_Aggregations/45_filtering.asciidoc[]

300_Aggregations/40_scope.asciidoc

Lines changed: 194 additions & 0 deletions
@@ -0,0 +1,194 @@
1+
2+
=== Scoping Aggregations
3+
4+
With all of the aggregation examples given so far, you may have noticed that we
5+
omitted a `query` from the search request. The entire request was
6+
simply an aggregation.
7+
8+
Aggregations can be run at the same time as search requests, but you need to
9+
understand a new concept: _scope_. By default, aggregations operate in the same
10+
scope as the query. Put another way, aggregations are calculated on the set of
11+
documents that match your query.
12+
13+
If we look at one of our first aggregation examples:
14+
15+
[source,js]
16+
--------------------------------------------------
17+
GET /cars/transactions/_search?search_type=count
18+
{
19+
"aggs" : {
20+
"colors" : {
21+
"terms" : {
22+
"field" : "color"
23+
}
24+
}
25+
}
26+
}
27+
--------------------------------------------------
28+
// SENSE: 300_Aggregations/40_scope.json
29+
30+
...you can see that the aggregation is in isolation. In reality, Elasticsearch
31+
assumes "no query specified" is equivalent to "query all documents". The above
32+
query is internally translated into:
33+
34+
[source,js]
35+
--------------------------------------------------
36+
GET /cars/transactions/_search?search_type=count
37+
{
38+
"query" : {
39+
"match_all" : {}
40+
},
41+
"aggs" : {
42+
"colors" : {
43+
"terms" : {
44+
"field" : "color"
45+
}
46+
}
47+
}
48+
}
49+
--------------------------------------------------
50+
// SENSE: 300_Aggregations/40_scope.json
51+
52+
The aggregation always operates in the scope of the query, so an isolated
53+
aggregation really operates in the scope of a `match_all` query...that is to say,
54+
all documents.
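
To make the idea of scope concrete, here is a small in-memory sketch in Python. It is not how Elasticsearch is implemented; it only mimics the rule that buckets are computed from whatever documents the query matches, with `match_all` as the default. The `search` helper and the non-Ford documents are assumptions made up for illustration.

```python
from collections import Counter

# The two Ford documents mirror the sample data used in this chapter;
# the Honda and Toyota documents are hypothetical.
CARS = [
    {"make": "ford", "color": "blue", "price": 25000, "sold": "2014-02-12"},
    {"make": "ford", "color": "green", "price": 30000, "sold": "2014-05-18"},
    {"make": "honda", "color": "red", "price": 10000, "sold": "2014-10-28"},
    {"make": "toyota", "color": "blue", "price": 15000, "sold": "2014-07-02"},
]


def match_all(doc):
    """Stand-in for the `match_all` query: every document matches."""
    return True


def search(docs, query=match_all):
    """Return (hits, color buckets). Buckets are counted over the hits only."""
    hits = [d for d in docs if query(d)]
    buckets = Counter(d["color"] for d in hits)
    return hits, dict(buckets)


# No query given: the aggregation sees every document.
_, all_colors = search(CARS)

# With a query: the aggregation only sees the matching documents.
_, ford_colors = search(CARS, lambda d: d["make"] == "ford")
```

Calling `search` without a query and with an explicit `match_all` gives the same buckets, which is the equivalence described above.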
55+
56+
Once armed with the knowledge of scoping, we can start to customize
57+
aggregations even further. All of our previous examples calculated statistics
58+
about _all_ of the data: top selling cars, average price of all cars, most sales
59+
per month, etc.
60+
61+
With scope, we can ask questions like "How many colors are Ford cars
62+
available in?". We do this by simply adding a query to the request (in this case
63+
a `match` query):
64+
65+
[source,js]
66+
--------------------------------------------------
67+
GET /cars/transactions/_search <1>
68+
{
69+
"query" : {
70+
"match" : {
71+
"make" : "ford"
72+
}
73+
},
74+
"aggs" : {
75+
"colors" : {
76+
"terms" : {
77+
"field" : "color"
78+
}
79+
}
80+
}
81+
}
82+
--------------------------------------------------
83+
// SENSE: 300_Aggregations/40_scope.json
84+
<1> We are omitting `search_type=count` so that search hits are returned too
85+
86+
By omitting the `search_type=count` this time, we can see both the search
87+
results and the aggregation results:
88+
89+
[source,js]
90+
--------------------------------------------------
91+
{
92+
...
93+
"hits": {
94+
"total": 2,
95+
"max_score": 1.6931472,
96+
"hits": [
97+
{
98+
"_source": {
99+
"price": 25000,
100+
"color": "blue",
101+
"make": "ford",
102+
"sold": "2014-02-12"
103+
}
104+
},
105+
{
106+
"_source": {
107+
"price": 30000,
108+
"color": "green",
109+
"make": "ford",
110+
"sold": "2014-05-18"
111+
}
112+
}
113+
]
114+
},
115+
"aggregations": {
116+
"colors": {
117+
"buckets": [
118+
{
119+
"key": "blue",
120+
"doc_count": 1
121+
},
122+
{
123+
"key": "green",
124+
"doc_count": 1
125+
}
126+
]
127+
}
128+
}
129+
}
130+
--------------------------------------------------
131+
132+
133+
This may seem trivial, but it is the key to advanced and powerful dashboards.
134+
You can transform any static dashboard into a real-time data exploration device
135+
by adding a search bar. This allows the user to search for terms and see all
136+
of the graphs (which are powered by aggregations, and thus scoped to the query)
137+
update in real-time. Try that with Hadoop!
138+
139+
<TODO> Maybe add two screenshots of a Kibana dashboard that changes considerably
140+
when the search changes?
141+
142+
143+
==== Global Bucket
144+
145+
You'll often want your aggregation to be scoped to your query. But sometimes
146+
you'll want to search for some subset of data, but aggregate across _all_ of
147+
your data.
148+
149+
For example, say you want to know the average price for Ford cars compared to the
150+
average price of _all_ cars. We can use a regular aggregation (scoped to the query)
151+
to get the first piece of information. The second piece of information can be
152+
obtained by using a `global` bucket.
153+
154+
The global bucket will contain _all_ of your documents, regardless of the query
155+
scope; it bypasses the scope completely. Because it is a bucket, you can nest
156+
aggregations inside of it like normal:
157+
158+
[source,js]
159+
--------------------------------------------------
160+
GET /cars/transactions/_search?search_type=count
161+
{
162+
"query" : {
163+
"match" : {
164+
"make" : "ford"
165+
}
166+
},
167+
"aggs" : {
168+
"single_avg_price": {
169+
"avg" : { "field" : "price" } <1>
170+
},
171+
"all": {
172+
"global" : {}, <2>
173+
"aggs" : {
174+
"avg_price": {
175+
"avg" : { "field" : "price" } <3>
176+
}
177+
178+
}
179+
}
180+
}
181+
}
182+
--------------------------------------------------
183+
// SENSE: 300_Aggregations/40_scope.json
184+
<1> This aggregation operates in the query scope (i.e. all docs matching "ford")
185+
<2> The `global` bucket has no parameters
186+
<3> This aggregation operates on all documents, regardless of the make
187+
188+
189+
The first `avg` metric is calculated on all documents that fall under the
190+
query scope -- all "ford" cars. The second `avg` metric is nested under a
191+
`global` bucket, which means it ignores scoping entirely and calculates on
192+
all the documents. The average returned for that aggregation represents
193+
the average price of all cars.
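
The difference between the two averages can be sketched in plain Python. This is only a model of the scoping rule, not Elasticsearch code; the `avg_price` helper and the non-Ford documents are hypothetical.

```python
from statistics import mean

# The two Ford documents mirror the sample data used in this chapter;
# the Honda and Toyota documents are hypothetical.
CARS = [
    {"make": "ford", "color": "blue", "price": 25000, "sold": "2014-02-12"},
    {"make": "ford", "color": "green", "price": 30000, "sold": "2014-05-18"},
    {"make": "honda", "color": "red", "price": 10000, "sold": "2014-10-28"},
    {"make": "toyota", "color": "blue", "price": 15000, "sold": "2014-07-02"},
]


def avg_price(docs):
    """Stand-in for the `avg` metric on the `price` field."""
    return mean(d["price"] for d in docs)


# Query scope: only the documents matching the "ford" query.
query_scope = [d for d in CARS if d["make"] == "ford"]
single_avg_price = avg_price(query_scope)

# Global bucket: ignores the query and sees every document.
global_avg_price = avg_price(CARS)
```

With this made-up data, the scoped average covers two documents while the global average covers all four, so the two numbers differ even though both come from the same request.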
194+
300_Aggregations/45_filtering.asciidoc

Lines changed: 181 additions & 0 deletions
@@ -0,0 +1,181 @@
1+
2+
=== Filtering Aggregations
3+
4+
A natural extension to aggregation scoping is filtering. Because the aggregation
5+
operates in the context of the query scope, any filter applied to the query
6+
will also apply to the aggregation.
7+
8+
==== Filtered Query
9+
If we want to find all cars over $10,000 and also calculate the average price
10+
for those cars, we can simply use a `filtered` query:
11+
12+
[source,js]
13+
--------------------------------------------------
14+
GET /cars/transactions/_search?search_type=count
15+
{
16+
"query" : {
17+
"filtered": {
18+
"filter": { "range": {
19+
"price": {
20+
"gte": 10000
21+
}
22+
} }
23+
}
24+
},
25+
"aggs" : {
26+
"single_avg_price": {
27+
"avg" : { "field" : "price" }
28+
}
29+
}
30+
}
31+
--------------------------------------------------
32+
// SENSE: 300_Aggregations/45_filtering.json
33+
34+
Fundamentally, using a `filtered` query is no different from using a `match`
35+
query like we discussed in the last section. The query (which happens to include
36+
a filter) returns a certain subset of documents, and the aggregation operates
37+
on those documents.
38+
39+
==== Filter Bucket
40+
41+
But what if you would like to filter just the aggregation results? Imagine we
42+
are building the search page for our car dealership. We want to display
43+
search results according to what the user searches for. But we also want
44+
to enrich the page by including the average price of cars (matching the search)
45+
which were sold in the last month.
46+
47+
We can't use simple scoping here, since there are two different criteria. The
48+
search results must match "ford", but the aggregation results must match "ford"
49+
AND "sold > now - 1M".
50+
51+
To solve this problem, we can use a special bucket called `filter`. You specify
52+
a filter, and when documents match the filter's criteria, they are added to the
53+
bucket.
54+
55+
Here is the resulting query:
56+
57+
[source,js]
58+
--------------------------------------------------
59+
GET /cars/transactions/_search?search_type=count
60+
{
61+
"query":{
62+
"match": {
63+
"make": "ford"
64+
}
65+
},
66+
"aggs":{
67+
"recent_sales": {
68+
"filter": { <1>
69+
"range": {
70+
"sold": {
71+
"from": "now-1M"
72+
}
73+
}
74+
},
75+
"aggs": {
76+
"average_price":{
77+
"avg": {
78+
"field": "price" <2>
79+
}
80+
}
81+
}
82+
}
83+
}
84+
}
85+
--------------------------------------------------
86+
// SENSE: 300_Aggregations/45_filtering.json
87+
<1> Using the `filter` bucket to apply a filter in addition to the `query` scope
88+
<2> This `avg` metric will therefore only average docs which are both "ford" and sold in the last month
89+
90+
Since the `filter` bucket operates like any other bucket, you are free to nest
91+
other buckets and metrics inside. All nested components will "inherit" the filter.
92+
This allows you to filter selective portions of the aggregation as required.
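
As a rough model of what the `filter` bucket does, the Python sketch below narrows the aggregation input while leaving the hits alone. It is an illustration under assumptions: the cutoff date is fixed by hand instead of evaluating `now-1M`, and the non-Ford documents are hypothetical.

```python
from datetime import date
from statistics import mean

# The two Ford documents mirror the sample data used in this chapter;
# the Honda and Toyota documents are hypothetical.
CARS = [
    {"make": "ford", "color": "blue", "price": 25000, "sold": "2014-02-12"},
    {"make": "ford", "color": "green", "price": 30000, "sold": "2014-05-18"},
    {"make": "honda", "color": "red", "price": 10000, "sold": "2014-10-28"},
    {"make": "toyota", "color": "blue", "price": 15000, "sold": "2014-07-02"},
]

# Assumed cutoff standing in for "now-1M"; a real request uses date math.
ONE_MONTH_AGO = date(2014, 5, 1)


def sold_recently(doc):
    """Stand-in for the `range` filter on the `sold` field."""
    return date.fromisoformat(doc["sold"]) >= ONE_MONTH_AGO


# Query scope: the search hits are every Ford, recent or not.
hits = [d for d in CARS if d["make"] == "ford"]

# Filter bucket: a further restriction applied only to the aggregation input.
recent_sales = [d for d in hits if sold_recently(d)]
average_price = mean(d["price"] for d in recent_sales)
```

Note that the Honda is recent enough to pass the filter but never reaches the bucket, because the bucket only sees documents already inside the query scope.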
93+
94+
==== Post Filter
95+
96+
So far, we have a way to filter both the search results and aggregations (a
97+
`filtered` query), as well as filtering individual portions of the aggregation
98+
(`filter` bucket).
99+
100+
You may be thinking to yourself "hmm...is there a way to filter _just_ the search
101+
results but not the aggregation?". The answer is to use a `post_filter`.
102+
103+
This is a top-level search request element which accepts a filter. The filter is
104+
applied _after_ the query has executed (hence the "post" moniker...it runs
105+
_post query_ execution). Because it operates after the query has executed,
106+
it does not affect the query scope...and thus does not affect the aggregations
107+
either.
108+
109+
We can use this behavior to apply additional filters to our search
110+
criteria that don't affect things like categorical facets in your UI. Let's
111+
design another search page for our car dealer. This page will allow the user
112+
to search for a car and filter by color. Color choices are populated via an
113+
aggregation.
114+
115+
[source,js]
116+
--------------------------------------------------
117+
GET /cars/transactions/_search
118+
{
119+
"query": {
120+
"match": {
121+
"make": "ford"
122+
}
123+
},
124+
"post_filter": { <1>
125+
"term" : {
126+
"color" : "green"
127+
}
128+
},
129+
"aggs" : {
130+
"all_colors": {
131+
"terms" : { "field" : "color" }
132+
}
133+
}
134+
}
135+
--------------------------------------------------
136+
// SENSE: 300_Aggregations/45_filtering.json
137+
<1> The `post_filter` element is a "top-level" element and filters just the search hits
138+
139+
The `query` portion is finding all "ford" cars. We are then building a list of
140+
colors with a `terms` aggregation. Because aggregations operate in the query
141+
scope, the list of colors will correspond with the colors that Ford cars are
142+
painted.
143+
144+
Finally, the `post_filter` will filter the search results to show only green
145+
"ford" cars. This happens _after_ the query is executed, so the aggregations
146+
are unaffected.
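
The ordering can be modeled in a few lines of Python. Again, this is a sketch of the behavior described here rather than Elasticsearch internals, and the non-Ford documents are hypothetical.

```python
from collections import Counter

# The two Ford documents mirror the sample data used in this chapter;
# the Honda and Toyota documents are hypothetical.
CARS = [
    {"make": "ford", "color": "blue", "price": 25000, "sold": "2014-02-12"},
    {"make": "ford", "color": "green", "price": 30000, "sold": "2014-05-18"},
    {"make": "honda", "color": "red", "price": 10000, "sold": "2014-10-28"},
    {"make": "toyota", "color": "blue", "price": 15000, "sold": "2014-07-02"},
]

# Step 1: the query defines the scope.
query_scope = [d for d in CARS if d["make"] == "ford"]

# Step 2: aggregations are computed on the query scope.
all_colors = dict(Counter(d["color"] for d in query_scope))

# Step 3: the post filter trims the hits only, after the aggregations are done.
hits = [d for d in query_scope if d["color"] == "green"]
```

The user sees one green Ford in the results, while the color facet still lists both blue and green.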
147+
148+
This is often important for coherent UIs. Imagine a user clicks a category in
149+
your UI (e.g. "green"). The expectation is that the search results are filtered,
150+
but _not_ the UI options. If you applied a `filtered` query, the UI would
151+
instantly transform to show _only_ "green" as an option...not what the user wants!
152+
153+
[WARNING]
154+
.Performance consideration
155+
====
156+
_Only_ use a `post_filter` if you need to differentially filter search results
157+
and aggregations. Sometimes people will use `post_filter` for regular searches.
158+
159+
Don't do this! The nature of the `post_filter` means it runs _after_ the query,
160+
so any performance benefit of filtering (caches, etc) is lost completely.
161+
162+
The `post_filter` should only be used in combination with aggregations, and only
163+
when you need differential filtering.
164+
====
165+
166+
==== Recap
167+
168+
Choosing the appropriate type of filtering -- search hits, aggregations or
169+
both -- often boils down to how you want your user interface to behave. Choose
170+
the appropriate filter (or combinations) depending on how you want to display
171+
results to your user.
172+
173+
- `filtered` query: affects both search results and aggregations
174+
- `filter` bucket: affects just aggregations
175+
- `post_filter`: affects just search results
176+
177+
178+
179+
180+
181+
