Commit bce564a

clintongormley authored and polyfractal committed
Comments added
1 parent 53f4e77 commit bce564a

9 files changed, with 110 additions and 63 deletions

060_Distributed_Search/15_Search_options.asciidoc

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -82,6 +82,7 @@ GET /_search?routing=user_1,user2
8282
This technique comes in useful when designing very large search systems and we
8383
discuss it in detail in <<scale>>.
8484

85+
[[search-type]]
8586
==== `search_type`
8687

8788
While `query_then_fetch` is the default search type, other search types can

300_Aggregations/05_overview.asciidoc

Lines changed: 15 additions & 6 deletions
@@ -1,35 +1,44 @@
11

22
=== Elasticsearch offers more than just search
33

4+
// "the following chapters"... obvious. just delete the 2nd sentence, and merge with next para
45
Up until this point, this book has been dedicated to search. The following
56
chapters deal with Aggregations, an entirely different set of functionality
67
built into Elasticsearch.
78

89
With search, we have a query and we wish to find a subset of documents which
9-
match the query. We are looking for the proverbial needle(s) in the
10+
match the query. We are looking for the proverbial needle(s) in the
1011
haystack.
1112

12-
Aggregations take a step back. Instead of looking for individual documents,
13-
we want to analyze and summarize our complete set of data.
13+
// perhaps "zoom out to get an overview"?
14+
// something about "showing users the data that exists in your index, leading them to the right results"?
15+
Aggregations take a step back. Instead of looking for individual documents,
16+
we want to analyze and summarize our complete set of data.
1417

15-
- How many needles are in the haystack?
16-
- What is the average length of the needles?
18+
// Popular manufacturers? Unusual clumps of needles in the haystack?
19+
- How many needles are in the haystack?
20+
- What is the average length of the needles?
1721
- What is the median length of needle broken down by manufacturer?
1822
- How many needles were added to the haystack each month?
1923

2024
Aggregations allow us to ask sophisticated questions of our data. And yet, while
21-
the functionality is completely different from search, it leverages the
25+
the functionality is completely different from search, it leverages the
2226
same data-structures. This means aggregations execute quickly and are
2327
_near-realtime_, just like search.
2428

29+
// perhaps hadoop instead of sql? reputation for slowness
30+
// "slice ... realtime" -> "visualize your data in realtime, allowing you to respond immediately"
2531
This is extremely powerful for reporting and dashboards. Instead of performing
2632
nightly "rollups" of your data (_e.g. large, batch SQL joins which
2733
are run nightly on a cron job because they are so slow_), you can slice and dice
2834
your data in realtime.
2935

36+
// Perhaps mention "not precalculated, out of date, and irrelevant"?
37+
// Perhaps "aggs are calculated in the context of the user's search, so you're not showing them that you have 10 4 star hotels on your site, but that you have 10 4 star hotels that *match their criteria*".
3038
Finally, aggregations operate alongside search requests, which means you can
3139
both search/filter documents _and_ perform analytics at the same time, on the
3240
same data, in a single request.
3341

42+
// for aggs -> for analytics?
3443
Aggregations are so powerful that many companies have built large Elasticsearch
3544
clusters solely for aggregations.

300_Aggregations/10_facets.asciidoc

Lines changed: 3 additions & 3 deletions
@@ -2,12 +2,12 @@
22
=== What about Facets?
33

44
If you've used Elasticsearch in the past, you are probably aware of _facets_.
5-
You can think of Aggregations as "facets on steroids". Everything you can do
5+
You can think of Aggregations as "facets on steroids". Everything you can do
66
with facets, you can do with aggregations.
77

88
But there are plenty of operations that are possible in aggregations which are
99
simply impossible with facets.
1010

11-
Facets have not been officially deprecated yet, but you can expect that to
12-
happen eventually. We recommend migrating your facets over to aggregations when
11+
Facets have not been officially deprecated yet, but you can expect that to
12+
happen eventually. We recommend migrating your facets over to aggregations when
1313
you get the chance, and starting all new projects with aggregations instead of facets.

300_Aggregations/15_concepts_buckets.asciidoc

Lines changed: 20 additions & 12 deletions
@@ -1,14 +1,17 @@
11

22
=== High-level concepts
33

4+
// define composable eg -- independent units of functionality which can be plugged together as needed
45
Like the query DSL, aggregations have a composable syntax. This means that there
56
are only a few basic concepts to learn, but nearly limitless combinations
67
of those basic components.
78

89
To master aggregations, you only need to understand two main concepts:
910

10-
- Buckets: collections of documents meeting a criteria
11-
- Metrics: statistics calculated on documents in a bucket
11+
_Buckets_:: Collections of documents which meet a criterion.
12+
_Metrics_:: Statistics calculated on the documents in a bucket.
13+
14+
// Perhaps relate buckets and metrics to SELECT count(field_1) GROUP BY field_2
1215

1316
That's it! Every aggregation is simply a combination of one or more buckets
1417
and zero or more metrics.
@@ -19,8 +22,9 @@ Let's dig into both of these concepts and see what they entail.
1922

2023
A bucket is simply a collection of documents that meet a certain criterion.
2124

22-
- An employee would land in either the "male" or "female" bucket.
23-
- The city of Albany would land in the "New York" state bucket.
25+
- An employee would land in either the "male" or "female" bucket.
26+
- The city of Albany would land in the "New York" state bucket.
27+
// Use ISO dates, not American dates ;)
2428
- The date "10-28-2014" would land within the "October" bucket.
2529

2630
As aggregations are executed, the values inside each document are evaluated to
@@ -29,44 +33,48 @@ inside the bucket and the aggregation continues.
2933

3034
Buckets can also be nested inside of other buckets, giving you a hierarchy or
3135
conditional partitioning scheme. For example, "Cincinnati" would be placed inside
32-
the "Ohio" state bucket, and _entire_ "Ohio" bucket would be placed inside the
36+
the "Ohio" state bucket, and the _entire_ "Ohio" bucket would be placed inside the
3337
"USA" country bucket.
3438

39+
// Instead of "histos, ranges, unique etc", use examples? "by hour, by popular terms, by age range"
3540
There are a variety of different buckets in Elasticsearch, which allow you to
36-
partition documents in many different ways (histograms, ranges, unique terms,
41+
partition documents in many different ways (histograms, ranges, unique terms,
3742
filters, etc). But fundamentally they all operate on the same principle:
3843
partitioning documents based on a criterion.
3944

4045
==== Metrics
4146

42-
Buckets allow us to partition documents into useful subsets, but ultimately what
47+
// on those docs -> on the documents in each bucket?
48+
Buckets allow us to partition documents into useful subsets, but ultimately what
4349
we want is some kind of _metric_ calculated on those documents. Bucketing is the
4450
means to an end - it provides a way to group documents in a way that you can
4551
calculate interesting metrics.
4652

47-
Most of metrics are simple mathematical operations (min, mean, max, sum, etc)
53+
Most metrics are simple mathematical operations (min, mean, max, sum, etc)
4854
which are calculated using the document values. In practical terms, metrics allow
49-
you to calculate quantities such as the average salary, or the maximum sale price,
55+
you to calculate quantities such as the average salary, or the maximum sale price,
5056
or the 95th percentile for query latency.
5157

5258
==== Combining the two
5359

5460
An aggregation is a combination of buckets and metrics. An aggregation may have
5561
a single bucket, or a single metric, or one of each. It may even have multiple
56-
buckets nested inside of other buckets.
62+
buckets nested inside of other buckets.
5763

5864
For example, we can partition documents by which country they belong to (a bucket),
5965
then calculate the average salary per country (a metric).
6066

61-
Because buckets can be nested, we can derive a much more complex aggregation:
67+
Because buckets can be nested, we can derive a much more complex aggregation:
6268

69+
// nice
6370
1. Partition documents by country (bucket)
6471
2. Then partition each country bucket by gender (bucket)
6572
3. Then partition each gender bucket by age ranges (bucket)
6673
4. Finally, calculate the average salary for each age range (metric)
6774

75+
// define tuple, or use "combination"
6876
This will give you the average salary per <country, gender, age> tuple. All in
69-
one request and one pass over the data!
77+
one request and with one pass over the data!
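
The four steps above can be sketched outside Elasticsearch to show how buckets and metrics relate. This is only an illustration in plain Python, not how Elasticsearch executes aggregations; the documents, field names, and age ranges are assumptions made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical employee documents; the field names are assumptions.
employees = [
    {"country": "USA", "gender": "female", "age": 34, "salary": 60000},
    {"country": "USA", "gender": "female", "age": 38, "salary": 80000},
    {"country": "USA", "gender": "male", "age": 52, "salary": 50000},
    {"country": "France", "gender": "male", "age": 27, "salary": 40000},
]


def age_range(age):
    """Map an age to a coarse range label (the ranges are arbitrary)."""
    if age < 30:
        return "under 30"
    if age < 50:
        return "30-49"
    return "50 and over"


# Buckets: group salaries by the (country, gender, age range) combination.
buckets = defaultdict(list)
for doc in employees:
    key = (doc["country"], doc["gender"], age_range(doc["age"]))
    buckets[key].append(doc["salary"])

# Metric: the average salary of the documents that landed in each bucket.
avg_salary = {key: mean(salaries) for key, salaries in buckets.items()}

for key, value in sorted(avg_salary.items()):
    print(key, value)
```

Each dictionary key plays the role of a nested bucket path, and the averaged value is the metric attached to the innermost bucket.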
7078

7179

7280

300_Aggregations/20_basic_example.asciidoc

Lines changed: 24 additions & 12 deletions
@@ -1,8 +1,9 @@
1-
1+
// This section feels like you're worrying too much about explaining the syntax, rather than the point of aggs. By this stage in the book, people should be used to the ES api, so I think we can assume more. I'd change the emphasis here and state that intention: we want to find out what the most popular colours are. To do that we'll use a "terms" agg, which counts up every term in the "color" field and returns the 10 most popular.
2+
// Step two: Add a query, to show that the aggs are calculated live on the results from the user's query.
23
=== Aggregation Test-drive
34

45
We could spend the next few pages defining the various aggregations
5-
available and their syntax, but aggregations are truly best learned by example.
6+
and their syntax, but aggregations are truly best learned by example.
67
Once you learn how to think about aggregations, and how to nest them appropriately,
78
the syntax is fairly trivial.
89

@@ -38,7 +39,8 @@ Now that we have some data, let's construct our first aggregation. A car dealer
3839
may want to know which color car sells the best. This is easily accomplished
3940
using a simple aggregation.
4041

41-
The syntax may look overwhelming at first, but hold on...we'll decompose the query
42+
// I don't think it's overwhelming, and users probably won't either... unless you mention it ;)
43+
The syntax may look overwhelming at first, but hold on... we'll decompose the query
4244
and discuss what each portion means. First, the aggregation:
4345

4446
[source,js]
@@ -48,42 +50,48 @@ GET /cars/transactions/_search?search_type=count <1>
4850
"aggs" : { <2>
4951
"colors" : { <3>
5052
"terms" : {
51-
"field" : "color" <4>
53+
"field" : "color" <4>
5254
}
5355
}
5456
}
5557
}
5658
--------------------------------------------------
5759
// SENSE: 300_Aggregations/20_basic_example.json
60+
61+
// Add the search_type=count thing as a sidebar, so it doesn't get in the way
5862
<1> Because we don't care about search results, we are going to use the `count`
59-
search_type, which will be faster
60-
<2> Aggregations are placed under the top-level `"aggs"` (the longer `"aggregations"`
63+
<<search-type,`search_type`>>, which will be faster.
64+
<2> Aggregations are placed under the top-level `"aggs"` parameter (the longer `"aggregations"`
6165
will also work if you prefer that)
6266
<3> We then name the aggregation whatever we want -- "colors" in this example
6367
<4> Finally, we define a single bucket of type `terms`
6468

69+
// Meh - Point here is that aggregations are executed in the context of the search results, rather than which endpoint is used.
6570
The first thing to notice is that aggregations are executed as a search, using the
6671
`/_search` endpoint. As mentioned at the top of the chapter, aggregations are built
6772
from the same data structures that power search, which means they use the same
68-
endpoint. Aggregations are also defined as a top-level parameter, just like using
69-
`"query"` for search.
73+
endpoint. Aggregations are also defined as a top-level parameter, just like using
74+
`"query"` for search.
7075

76+
// Delete this and make the point in the para above. It feels like you're scared to introduce the idea of context this early. I think it's OK. You don't have to explain how to change the context yet, but at least make the point that there is one.
7177
.Can you use aggregations and queries together?
7278
****
7379
Absolutely! But hold that thought, we'll discuss it later in <todo>
7480
****
7581

82+
// I think it is OK to assume that naming aggs is a good idea. Probably easier to make the point if you name it "popular_colors"
7683
Next we define a name for our aggregation. This is entirely up to you...
7784
the response will be labeled with the name you provide so that your application
7885
can parse the results. You may also specify more than one aggregation per search
7986
request, so giving each aggregation a unique, identifiable name is important
8087
(we'll look at an example of this later).
8188

8289
Next we define the aggregation itself. For this example, we are defining
83-
a single `terms` bucket. The `terms` bucket will dynamically create a new
84-
bucket for every unique term it encounters. Since we are telling it to use the
90+
a single `terms` bucket. The `terms` bucket will dynamically create a new
91+
bucket for every unique term it encounters. Since we are telling it to use the
8592
"color" field, the `terms` bucket will dynamically create a new bucket for each color.
8693

94+
// Trim the results here. By this stage people have gone through 300 pages, so they should be familiar with what ES returns. Also, they can execute the query themselves in Sense
8795
Let's execute that aggregation and take a look at the results:
8896

8997
[source,js]
@@ -120,16 +128,20 @@ Let's execute that aggregation and take a look at the results:
120128
<1> No search hits are returned because we used the `search_type=count` param
121129
<2> Our "colors" aggregation is returned as part of the "aggregations" field
122130
<3> The key to each bucket corresponds to a unique term found in the "color" field
131+
132+
// Perhaps: We always get back the `doc_count` metric which tells us how many documents contained this term.
133+
123134
<4> The count of each bucket represents the number of documents with this color
124135

125136

126137
The response contains a list of buckets, each corresponding to a unique color
127-
(red, green, etc). Each bucket also includes a count of how many documents
138+
(red, green, etc). Each bucket also includes a count of how many documents
128139
"fell into" that particular bucket. For example, there are four red cars.
129140

130141
Before we move on, there are some important yet not immediately obvious things
131142
to point out.
132143

144+
// Delete the above line and make the realtime point in a para, which says that you could pipe this into a graphing library and display a dashboard showing real time trends. As soon as you sell a silver car, it'll show up in the graph. (And no need for the last sentence)
133145
- The buckets were created dynamically. Our application had no prior knowledge about
134146
how many colors were in the index. If you were to index a "silver" car next, a new
135147
"silver" bucket would automatically appear in the response.
@@ -139,7 +151,7 @@ directly into graphing libraries for near real-time dashboards
139151
- The aggregation is operating on all of the documents in your index at the moment.
140152
This can be changed, which we will talk about <here>.
141153

142-
Voila! Your first aggregation!
154+
Voila! Your first aggregation!
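
As a rough analogy for what the `terms` bucket does with the "color" field, the sketch below counts documents per distinct value in plain Python. It is not Elasticsearch code, and the sample documents are hypothetical rather than the data indexed earlier in the chapter.

```python
from collections import Counter

# Hypothetical car sale documents; only the "color" field matters here.
sales = [
    {"color": "red"},
    {"color": "red"},
    {"color": "green"},
    {"color": "blue"},
    {"color": "red"},
]

# One bucket per distinct color, created on the fly, each with a doc count.
color_counts = Counter(doc["color"] for doc in sales)
print(color_counts.most_common())

# A previously unseen color simply becomes a new bucket.
color_counts.update(["silver"])
print(color_counts["silver"])
```

Nothing in the code lists the colors in advance, which mirrors the point that buckets are created dynamically from whatever values the documents contain.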
143155

144156

145157

300_Aggregations/25_basic_example_expanded.asciidoc

Lines changed: 16 additions & 9 deletions
@@ -1,10 +1,11 @@
11

22
=== Adding a metric to the mix
33

4-
The previous example told us how many documents were in each bucket, which is
4+
The previous example told us how many documents were in each bucket, which is
55
useful. But often, our applications require more sophisticated _metrics_ about
66
the documents. For example, what is the average price of cars in each bucket?
77

8+
// "nesting"-> need to tell Elasticsearch which metrics to calculate, and on which fields.
89
To get this information, we need to start nesting metrics inside of the buckets.
910
Metrics will calculate some kind of mathematical statistic based on the values
1011
in the documents residing within a particular bucket.
@@ -36,15 +37,16 @@ GET /cars/transactions/_search?search_type=count
3637
<2> We then give the metric a name: "avg_price"
3738
<3> And finally define it as an `avg` metric over the "price" field
3839

39-
As you can see, we took the previous example and tacked on a new `agg` level.
40+
As you can see, we took the previous example and tacked on a new `aggs` level.
4041
This new aggregation level allows us to nest the `avg` metric inside the
4142
`terms` bucket. Effectively, this means we will generate an average for each
4243
color.
4344

44-
Just like the "colors" example, we need to name our metric ("avg_price") so we
45-
can retrieve the values later. Finally, we specify the metric itself (`avg`)
45+
Just like the "colors" example, we need to name our metric ("avg_price") so we
46+
can retrieve the values later. Finally, we specify the metric itself (`avg`)
4647
and what field we want the average to be calculated on (`price`).
4748

49+
// Delete this para
4850
The response is, not surprisingly, nearly identical to the previous response...except
4951
there is now a new "avg_price" element added to each color bucket:
5052

@@ -84,14 +86,15 @@ there is now a new "avg_price" element added to each color bucket:
8486
--------------------------------------------------
8587
<1> New "avg_price" element in response
8688

89+
// Would love to have a graph under each example showing how the data can be displayed (later, i know)
8790
Although the response has changed minimally, the data we get out of it has grown
8891
substantially. Before, we knew there were four red cars. Now we know that the
8992
average price of red cars is $32,500. This is something that you can plug directly
9093
into reports or graphs.
9194

9295
=== Buckets inside of buckets
9396

94-
The true power of aggregations becomes apparent once you start playing with
97+
The true power of aggregations becomes apparent once you start playing with
9598
different nesting schemes. In the previous examples, we saw how you could nest
9699
a metric inside a bucket, which is already quite powerful.
97100

@@ -132,7 +135,7 @@ GET /cars/transactions/_search?search_type=count
132135
each car make
133136

134137
A few interesting things happened here. First, you'll notice that the previous
135-
"avg_price" metric is left entirely intact. Each "level" of an aggregation can
138+
"avg_price" metric is left entirely intact. Each "level" of an aggregation can
136139
have many metrics or buckets. The "avg_price" metric tells us the average price
137140
for each car color. This is independent of other buckets and metrics which
138141
are also being built.
@@ -211,7 +214,7 @@ GET /cars/transactions/_search?search_type=count
211214
},
212215
"make" : {
213216
"terms" : {
214-
"field" : "make"
217+
"field" : "make"
215218
},
216219
"aggs" : { <1>
217220
"min_price" : { "min": { "field": "price"} }, <2>
@@ -224,6 +227,9 @@ GET /cars/transactions/_search?search_type=count
224227
}
225228
--------------------------------------------------
226229
// SENSE: 300_Aggregations/20_basic_example.json
230+
231+
// Careful with the "no surprise", it makes it sound like you're bored :)
232+
227233
<1> No surprise...we need to add another "aggs" level for nesting
228234
<2> Then we include a `min` metric
229235
<3> And a `max` metric
@@ -249,7 +255,7 @@ Which gives us the following output (again, truncated):
249255
"value": 10000 <1>
250256
},
251257
"max_price": {
252-
"value": 20000 <2>
258+
"value": 20000 <1>
253259
}
254260
},
255261
{
@@ -270,11 +276,12 @@ Which gives us the following output (again, truncated):
270276
},
271277
...
272278
--------------------------------------------------
273-
<1><2> The `min` and `max` metrics that we added now appear under each "make"
279+
<1> The `min` and `max` metrics that we added now appear under each "make"
274280

275281
With those two buckets, we've expanded the information derived from this query
276282
to include:
277283

284+
// Nice, but "Similar analytics.." -> "etc."?
278285
- There are four red cars
279286
- The average price of a red car is $32,500
280287
- Three of the red cars are made by Honda, and one is a BMW
