
Commit 30f87c6

add how does it work section to appendix
1 parent d1c6bad commit 30f87c6

24 files changed: +1455 -302 lines changed

40-higher-order.Rmd

+3-1
@@ -1,6 +1,8 @@
1+
# (PART) Advanced {-}
2+
13
# Higher order widgets {#higher-order-widgets}
24

3-
You may have noticed when we ran the `show_widgets()` function in [working with widgets](#working-with-show_widgets) that there is a mysterious "Order" column that categorizes widgets as either "first-order" or "second-order". What's that all about? In order to answer that, let's take a look at another example.
5+
You may have noticed when we ran the `show_widgets()` function in [working with widgets](#working-with-finding-widgets) that there is a mysterious "Order" column that categorizes widgets as either "first-order" or "second-order". What's that all about? In order to answer that, let's take a look at another example.
46

57
## Children of wealth
68

35-optional-arguments.Rmd renamed to 50-optional-arguments.Rmd

-2
@@ -1,5 +1,3 @@
1-
# (PART) Advanced {-}
2-
31
```{r opt-load, echo = FALSE, message = FALSE, warning = FALSE}
42
library(discoveryengine)
53
```

99-how-does-it-work.Rmd

+315
@@ -0,0 +1,315 @@
1+
# How does it work? {#how-it-works}
2+
3+
In this section, I'll go over the basic building blocks that make the Disco Engine work. To help organize the lesson, I'll assume we're building a brand new Disco Engine from scratch. Of course, what we build will necessarily be just a simplified version of the Disco Engine, but we'll cover the most important concepts. All of the code behind the real Disco Engine is available for anyone to view [on GitHub](https://github.com/tarakc02/discoveryengine). Where possible, I'll add a link to the appropriate bit of production code and explain what's different from our simplified example.
4+
5+
## A Widget as a SQL template
6+
7+
To get data from the database, the Disco Engine converts definitions into valid SQL. The key insight here is that any first-order predicate ([what's first-order mean?](#higher-order-widgets)) can be represented in terms of a very basic SQL query, which we can turn into this all-purpose template:
8+
9+
```{sql, eval = FALSE}
10+
select distinct ID_FIELD as ID_TYPE
11+
from TABLE_NAME
12+
where FIELD_NAME in (LIST_OF_VALUES)
13+
```
14+
15+
*Note: The actual implementation allows for a broader range of queries, but this captures the basic idea. To see the actual templates used, check out [the source code in the listbuilder package](https://github.com/tarakc02/listbuilder/blob/master/R/templates.R).*
16+
17+
A number of R packages let you construct templates this way and populate them with R objects. `discoveryengine` uses the [whisker package](https://cran.r-project.org/web/packages/whisker/index.html), but for this simplified example, I'll use `getcdw::parameterize_template`:
18+
19+
```{r}
20+
library(getcdw)
21+
generate_query <- parameterize_template("
22+
select distinct ##ID_FIELD## as ##ID_TYPE##
23+
from ##TABLE_NAME##
24+
where ##FIELD_NAME## in (##LIST_OF_VALUES##)
25+
")
26+
27+
# generate_query is now a function with the same arguments as the
28+
# names highlighted in the template
29+
generate_query
30+
```
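If you don't have `getcdw` installed, the idea can still be tried out in base R. The following is only a minimal sketch: `fill_template` is a hypothetical helper written for this illustration, not the implementation of `parameterize_template`. It assumes placeholders are written as `##NAME##` and simply swaps each one for the matching value.

```{r}
# hypothetical stand-in for a templating function (NOT the getcdw code):
# replaces every ##NAME## placeholder with the value of the same name
fill_template <- function(template, values) {
    for (name in names(values)) {
        placeholder <- paste0("##", name, "##")
        # fixed = TRUE treats the placeholder as literal text, not a regex
        template <- gsub(placeholder, values[[name]], template, fixed = TRUE)
    }
    template
}

sql <- fill_template(
    "select distinct ##ID_FIELD## as ##ID_TYPE## from ##TABLE_NAME## where ##FIELD_NAME## in (##LIST_OF_VALUES##)",
    list(ID_FIELD = "entity_id",
         ID_TYPE = "entity_id",
         TABLE_NAME = "cdw.d_bio_affiliation_mv",
         FIELD_NAME = "affil_code",
         LIST_OF_VALUES = "'M2', 'MA4'")
)
cat(sql)
```

Passing the values as a named list, rather than as separate arguments, is a choice made here to keep the sketch short; the function returned by `parameterize_template` takes them as named arguments instead.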
31+
32+
Now, we can get a clunky, but functional, `has_affiliation` widget working:
33+
34+
```{r}
35+
has_affiliation <- function(affiliations) {
36+
list(ID_FIELD = "entity_id",
37+
ID_TYPE = "entity_id",
38+
TABLE_NAME = "cdw.d_bio_affiliation_mv",
39+
FIELD_NAME = "affil_code",
40+
LIST_OF_VALUES = paste0("'", affiliations, "'",
41+
collapse = ", "))
42+
}
43+
```
44+
45+
Wait, what? I wanted to create a constituency definition, but all I got was this lousy list!
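
That list is more useful than it looks. As a quick check (the definition is repeated here so the chunk runs on its own), we can inspect it and see that it holds exactly the five values the template asks for, with the affiliation codes already quoted and comma-separated:

```{r}
# same has_affiliation as above, repeated so this chunk is self-contained
has_affiliation <- function(affiliations) {
    list(ID_FIELD = "entity_id",
         ID_TYPE = "entity_id",
         TABLE_NAME = "cdw.d_bio_affiliation_mv",
         FIELD_NAME = "affil_code",
         LIST_OF_VALUES = paste0("'", affiliations, "'",
                                 collapse = ", "))
}

definition <- has_affiliation(c("M2", "MA4"))

# the names line up one-to-one with the placeholders in the template
names(definition)

# paste0() wraps each code in single quotes, collapse joins them with commas
definition$LIST_OF_VALUES
```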
46+
47+
## A template as a data structure
48+
49+
Recall that definitions don't turn into IDs until we use the `display` function. The basic job of our `display` function will be to take care of two steps:
50+
51+
1. Convert the definition (currently just a list) to SQL, and
52+
2. Send the SQL to the database, returning the appropriate data
53+
54+
So it makes sense that `has_affiliation` just makes a list. It has all of the components of our definition, so we can inspect and figure out what it's supposed to return, but it won't actually return data from the data warehouse until we build a `display` function.
55+
56+
Luckily, step 2 of the `display` function is easy, because that's exactly what the function `getcdw::get_cdw` does (if you're building for your own database, you just need a function here that can send SQL to your database and return a `data.frame`).
57+
58+
```{r}
59+
display <- function(definition) {
60+
get_cdw(to_sql(definition))
61+
}
62+
```
63+
64+
How can we implement `to_sql`? Well, that's just a matter of combining our template from `generate_query` with the data to fill it in from `has_affiliation`. The R function `do.call` lets us call any R function using a list of arguments. For example:
65+
66+
```{r}
67+
# here is the usual way we call a function:
68+
# this takes 2 samples from the integers between 1 and 100
69+
sample(x = 1:100, size = 2)
70+
71+
# but what if we've collected the arguments from some process?
72+
args <- list(x = 1:100, size = 2)
73+
74+
# sample(args) won't work. what we want is do.call, which allows us to
75+
# "feed" args as arguments to sample()
76+
do.call("sample", args)
77+
```
78+
79+
Ok, so it looks like we know how to implement `to_sql`:
80+
81+
```{r}
82+
to_sql <- function(definition) {
83+
do.call("generate_query", definition)
84+
}
85+
```
86+
87+
That's basically it! Let's test what we have:
88+
89+
```{r}
90+
is_prytanean = has_affiliation("OC6")
91+
display(is_prytanean)
92+
93+
## and we can use multiple affiliations:
94+
is_constituent = has_affiliation(c("M2", "MA4"))
95+
display(is_constituent)
96+
```
97+
98+
Ok! So we've now implemented basic widget functionality. We still have a couple of outstanding issues:
99+
100+
1. The real Disco Engine allows you to type `has_affiliation(M2, MA4)`, which is much more readable than `has_affiliation(c("M2", "MA4"))`.
101+
2. How can we combine multiple predicates to make complex definitions?
102+
103+
## Non-standard evaluation
104+
105+
The answer to our first outstanding issue is provided by the magic of [non-standard evaluation](http://adv-r.had.co.nz/Computing-on-the-language.html) (that link is to the Non-standard evaluation chapter of the excellent book *Advanced R* by Hadley Wickham, which I highly recommend). For the purposes of this simplified example, I'll use the following:
106+
107+
```{r}
108+
has_affiliation <- function(...) {
109+
affiliations <- eval(substitute(alist(...)))
110+
affiliations <- as.character(affiliations)
111+
112+
list(ID_FIELD = "entity_id",
113+
ID_TYPE = "entity_id",
114+
TABLE_NAME = "cdw.d_bio_affiliation_mv",
115+
FIELD_NAME = "affil_code",
116+
LIST_OF_VALUES = paste0("'", affiliations, "'",
117+
collapse = ", "))
118+
}
119+
```
120+
121+
Here `eval(substitute(alist(...)))` captures the symbols the user typed (note that they are symbols, not character strings) without evaluating them, and `as.character(affiliations)` converts them to a character vector. So far, so good:
122+
123+
```{r}
124+
display(
125+
has_affiliation(M2, MA4)
126+
)
127+
```
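
To see the capture step on its own, without touching the database, here is a small base-R sketch. `capture_symbols` and `wrapper` are hypothetical names used only for this illustration; the sketch assumes the arguments are plain symbols or quoted strings.

```{r}
capture_symbols <- function(...) {
    # substitute() swaps `...` for the unevaluated expressions the caller
    # typed; alist() keeps them unevaluated, so eval() returns a list of symbols
    captured <- eval(substitute(alist(...)))
    as.character(captured)
}

# bare symbols: M2 and MA4 are never looked up as variables
bare <- capture_symbols(M2, MA4)
bare

# quoted strings pass through unchanged
quoted <- capture_symbols("M2", "MA4")
quoted

# forwarding ... through another function still sees what the user typed
wrapper <- function(...) capture_symbols(...)
forwarded <- wrapper(M2, MA4)
forwarded
```

The last call matters for what comes next: it is what lets one function collect the `...` and hand it to another without losing the original symbols.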
128+
129+
*Note that the actual package is a little bit more careful about how it does things here, by using the [lazyeval package](https://cran.r-project.org/web/packages/lazyeval/index.html). To see the actual code used in the Disco Engine, check out `prep_dots` and `partial_sub` in [this discoveryengine source code file](https://github.com/tarakc02/discoveryengine/blob/master/R/helper-utils.R).*
130+
131+
## Combining simple definitions
132+
133+
Now we want to know, given two existing definitions, how do we combine them into one more complex definition? The answer is surprisingly simple, but may take a minute to wrap your head around if you haven't programmed in this way before. I'll take things step-by-step.
134+
135+
### "Atomic" vs. "Complex" definitions
136+
137+
Our first step is to distinguish between what I'll call "atomic" and "complex" definitions. So far we've been working with atomic definitions -- that is, simple definitions that do not use `%and%`, `%or%`, or `%but_not%`.
138+
139+
Since we'll want to build more widgets (in order to benefit from being able to combine them!), I'm going to write some scaffolding code that will make it easier to produce widgets:
140+
141+
```{r}
142+
# widget will be a function to make ATOMIC definitions
143+
widget <- function(..., ID_FIELD, ID_TYPE = ID_FIELD, TABLE_NAME,
144+
FIELD_NAME) {
145+
args <- eval(substitute(alist(...)))
146+
args <- as.character(args)
147+
148+
res <- list(ID_FIELD = ID_FIELD,
149+
ID_TYPE = ID_TYPE,
150+
TABLE_NAME = TABLE_NAME,
151+
FIELD_NAME = FIELD_NAME,
152+
LIST_OF_VALUES = paste0("'", args, "'",
153+
collapse = ", "))
154+
155+
# tagging atomic definitions makes it easier to keep track
156+
# here we add an "attribute" specifying that the definition is atomic
157+
# if you're unfamiliar with attributes, see ?attributes
158+
structure(res, atomic = TRUE)
159+
}
160+
```
161+
162+
Now let's re-create `has_affiliation` using our new scaffolding code, and also let's create a widget called `participated_in` for student activities and one called `on_committee` for committee participation:
163+
164+
```{r}
165+
has_affiliation <- function(...) {
166+
# notice I can just pass the ... along to the next function
167+
widget(...,
168+
ID_FIELD = "entity_id",
169+
TABLE_NAME = "cdw.d_bio_affiliation_mv",
170+
FIELD_NAME = "affil_code")
171+
}
172+
173+
participated_in <- function(...) {
174+
widget(...,
175+
ID_FIELD = "entity_id",
176+
TABLE_NAME = "cdw.d_bio_student_activity_mv",
177+
FIELD_NAME = "student_activity_code")
178+
}
179+
180+
on_committee <- function(...) {
181+
widget(...,
182+
ID_FIELD = "entity_id",
183+
TABLE_NAME = "cdw.d_bio_committee_mv",
184+
FIELD_NAME = "committee_code")
185+
}
186+
```
187+
188+
Just to make sure I haven't broken anything, I quickly inspect the SQL that is being built by these widgets:
189+
190+
```{r}
191+
# to make it easier to inspect the queries
192+
show_query <- function(definition) cat(to_sql(definition))
193+
194+
show_query( has_affiliation(MA6, OC3) )
195+
show_query( participated_in(SA1, SA2) )
196+
show_query( on_committee(AE7, ME3, ME5, AE5) )
197+
```
198+
199+
### Operations on definitions
200+
201+
We now want to implement `%and%`, `%or%`, and `%but_not%`. Luckily, these correspond, in order, to the SQL set operators `intersect`, `union`, and `minus` (`minus` is Oracle's name for what standard SQL calls `except`).
202+
203+
So here is the surprisingly simple template we need to construct complex queries:
204+
205+
```{r}
206+
generate_complex_query <- parameterize_template("
207+
(##LHS##)
208+
##operator##
209+
(##RHS##)
210+
")
211+
```
212+
213+
Here `LHS` and `RHS` (which stand for "left-hand-side" and "right-hand-side") are the SQL translations of definitions (they can be atomic or complex -- as long as we know they have been converted to SQL properly, the result of this template will also be a valid SQL query). So when we translate a complex definition to SQL, all we have to do is translate the individual components to SQL. Those components (that is, `LHS` and `RHS`) may be either atomic (in which case we already know what to do) or complex (in which case we'll just break it down again using the same logic, until we get down to atomic definitions).
214+
215+
Now to implement our operations, we once again collect the necessary information into a list, this time tagging it as not atomic:
216+
217+
```{r}
218+
operate <- function(LHS, RHS, operator) {
219+
res <- list(
220+
operator = operator,
221+
LHS = LHS,
222+
RHS = RHS
223+
)
224+
225+
# we need to make sure to tag the result as NOT atomic
226+
structure(res, atomic = FALSE)
227+
}
228+
229+
`%and%` <- function(LHS, RHS) operate(LHS, RHS, "intersect")
230+
`%or%` <- function(LHS, RHS) operate(LHS, RHS, "union")
231+
`%but_not%` <- function(LHS, RHS) operate(LHS, RHS, "minus")
232+
```
233+
234+
## Re-visiting `to_sql`
235+
236+
We now have two different SQL templates -- one for atomic definitions, and one for non-atomic definitions. Accordingly, we'll update `to_sql` so that it uses the correct template. It helps that every definition we create, whether atomic or not, has an attribute called `atomic` that is either `TRUE` or `FALSE`. We'll start by making a helper function that tells us whether a definition is atomic:
237+
238+
```{r}
239+
is_atomic <- function(definition) {
240+
# recall this attribute is always TRUE or FALSE
241+
attr(definition, "atomic")
242+
}
243+
```
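
One caveat: `attr()` returns `NULL` when the attribute is missing, so handing `is_atomic` a plain list that was not built by `widget` or `operate` would make a later `if` fail with a confusing error. A more defensive variant might look like the sketch below. `is_atomic_strict` is a hypothetical name, and this is not how the real package does it.

```{r}
# hypothetical, stricter variant of is_atomic: complains early when the
# object was never tagged, instead of silently returning NULL
is_atomic_strict <- function(definition) {
    flag <- attr(definition, "atomic")
    if (is.null(flag)) {
        stop("not a definition: the 'atomic' attribute is missing")
    }
    isTRUE(flag)
}

is_atomic_strict(structure(list(), atomic = TRUE))
is_atomic_strict(structure(list(), atomic = FALSE))

# a plain list has no tag, so this reports an error rather than NULL
untagged <- try(is_atomic_strict(list()), silent = TRUE)
class(untagged)
```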
244+
245+
Now we can update our `to_sql` function to check whether a definition is atomic or not and then populate the appropriate template:
246+
247+
```{r}
248+
to_sql <- function(definition) {
249+
# we already know what to do with atomic definitions
250+
if (is_atomic(definition)) do.call("generate_query", definition)
251+
252+
# with complex definitions, we translate the LHS and RHS to SQL,
253+
# then pass everything back to the complex query template
254+
else {
255+
translated_pieces <- list(
256+
LHS = to_sql(definition$LHS),
257+
RHS = to_sql(definition$RHS),
258+
operator = definition$operator
259+
)
260+
do.call("generate_complex_query", translated_pieces)
261+
}
262+
}
263+
```
264+
265+
If you look closely at the non-atomic part of `to_sql`, you may be surprised to find that it seems to be circularly defined! It helps to describe the process in plain language first:
266+
267+
> To convert a complex (i.e. non-atomic) definition to SQL, we first convert its constituent pieces to SQL, then combine the resulting SQL using the appropriate operator (intersect, union, or minus)
268+
269+
That process is pretty easy to understand if both pieces of a non-atomic definition are atomic, but what if, say, the left-hand-side (LHS) is also non-atomic? Well, we just do the same thing, trying to convert the constituent pieces to SQL, and so forth. This works because *we know that eventually we'll hit an atomic definition*.
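
The recursion can be watched in isolation with a stripped-down sketch that needs no packages and no database. Everything in it is a simplified stand-in: `atomic_def`, `combine`, and `sketch_sql` are hypothetical names, the table and field names are made up, and `sprintf()` takes the place of the templates.

```{r}
# simplified stand-ins for widget() and operate()
atomic_def <- function(table, field, values) {
    structure(list(table = table, field = field, values = values),
              atomic = TRUE)
}
combine <- function(LHS, RHS, operator) {
    structure(list(LHS = LHS, RHS = RHS, operator = operator),
              atomic = FALSE)
}

sketch_sql <- function(definition) {
    if (isTRUE(attr(definition, "atomic"))) {
        # base case: an atomic definition maps straight onto the basic query
        sprintf("select distinct entity_id from %s where %s in (%s)",
                definition$table, definition$field,
                paste0("'", definition$values, "'", collapse = ", "))
    } else {
        # recursive case: translate each side, then join with the operator
        sprintf("(%s) %s (%s)",
                sketch_sql(definition$LHS),
                definition$operator,
                sketch_sql(definition$RHS))
    }
}

affil     <- atomic_def("affil", "affil_code", c("MA4", "SWE"))
activity  <- atomic_def("activity", "activity_code", "UWSE")
committee <- atomic_def("committee", "committee_code", "ME3")

# the left-hand side of "minus" is itself a complex definition
nested <- combine(combine(affil, activity, "intersect"), committee, "minus")
nested_sql <- sketch_sql(nested)
cat(nested_sql)
```

Each call to `sketch_sql` on a complex definition makes two more calls on strictly smaller pieces, so the recursion has to bottom out at the atomic case.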
270+
271+
## Seeing it all in action
272+
273+
Let's create some definitions and then see how they are being converted to SQL.
274+
275+
```{r}
276+
# a simple definition
277+
has_engineering_affil = has_affiliation(MA4, SWE, URAE, DEN1)
278+
279+
# a slightly more complex definition
280+
engineering_constituency_1 =
281+
has_engineering_affil %and%
282+
participated_in(UWSE, CHES, ENWE, ENSE)
283+
284+
engineering_constituency_2 =
285+
engineering_constituency_1 %but_not%
286+
on_committee(ME3)
287+
288+
```
289+
290+
To reassure myself that the system is working as expected, I `display` the most complicated constituency, and look in CADS to verify that the resulting IDs do in fact match the definition I created:
291+
292+
```{r}
293+
display(engineering_constituency_2)
294+
```
295+
296+
Now, let's take a look at the actual SQL that is being generated behind the scenes:
297+
298+
```{r}
299+
# we've already seen the atomic definitions turned to SQL:
300+
show_query(has_engineering_affil)
301+
302+
# this is a complex definition, but both pieces are atomic:
303+
show_query(engineering_constituency_1)
304+
305+
# as the definitions get more complex, even the LHS and RHS can
306+
# be complex, but everything still can be analyzed down to atomic pieces:
307+
show_query(engineering_constituency_2)
308+
```
309+
310+
You'll notice that, especially as the definitions grow more complex, the resulting SQL may not look exactly how you'd type it yourself. But it works and is correct, and because `intersect`, `union`, and `minus` are set operations governed by [relational algebra](https://en.wikipedia.org/wiki/Relational_algebra), the database's query optimizer is free to rewrite the query into an equivalent, more efficient form before running it against our data.
311+
312+
## For further study
313+
314+
If you've made it this far, you should have a pretty solid understanding of the basic functioning of the Discovery Engine. We did not cover some core features, in particular [code lookup (aka synonym search)](#synonym-search), [higher order widgets](#higher-order-widgets), and the bots --
315+
the [brainstorm bot](#brainstorm-bot) and the [matrix bot](#matrix-bot). But armed with the knowledge you do have, you can explore [the source code](https://github.com/tarakc02/discoveryengine) and see how everything else works. Feel free to [reach out to Tarak via email](mailto:[email protected]) if you have any questions.
