| 1 | +# How does it work? {#how-it-works} |
| 2 | + |
| 3 | +In this section, I'll go over the basic building blocks that make the Disco Engine work. To help organize the lesson, I'll assume we're building a brand new Disco Engine from scratch. Of course, what we build will necessarily be just a simplified version of the Disco Engine, but we'll cover the most important concepts. All of the code behind the real Disco Engine is available for anyone to view [on GitHub](https://github.com/tarakc02/discoveryengine). Where possible, I'll add a link to the appropriate bit of production code and explain what's different from our simplified example. |
| 4 | + |
| 5 | +## A Widget as a SQL template |
| 6 | + |
| 7 | +To get data from the database, the Disco Engine converts definitions into valid SQL. The key insight here is that any first-order predicate ([what's first-order mean?](#higher-order-widgets)) can be represented in terms of a very basic SQL query, which we can turn into this all-purpose template: |
| 8 | + |
| 9 | +```{sql, eval = FALSE} |
| 10 | +select distinct ID_FIELD as ID_TYPE |
| 11 | +from TABLE_NAME |
| 12 | +where FIELD_NAME in (LIST_OF_VALUES) |
| 13 | +``` |
| 14 | + |
| 15 | +*Note: The actual implementation allows for a broader range of queries, but this captures the basic idea. To see the actual templates used, check out [the source code in the listbuilder package](https://github.com/tarakc02/listbuilder/blob/master/R/templates.R).* |
| 16 | + |
| 17 | +There are a number of R packages that let you construct templates this way and populate them with R objects. `discoveryengine` uses the [whisker package](https://cran.r-project.org/web/packages/whisker/index.html), but for this simplified example, I'll use `getcdw::parameterize_template`: |
| 18 | + |
| 19 | +```{r} |
| 20 | +library(getcdw) |
| 21 | +generate_query <- parameterize_template(" |
| 22 | +select distinct ##ID_FIELD## as ##ID_TYPE## |
| 23 | +from ##TABLE_NAME## |
| 24 | +where ##FIELD_NAME## in (##LIST_OF_VALUES##) |
| 25 | +") |
| 26 | +
|
| 27 | +# generate_query is now a function with the same arguments as the |
| 28 | +# names highlighted in the template |
| 29 | +generate_query |
| 30 | +``` |
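To make the idea of a populated template concrete, here is a minimal base-R sketch of what a template filler might do. `fill_template` is a hypothetical helper invented for illustration. It is not how `getcdw::parameterize_template` or whisker are implemented, and it only handles the `##NAME##` placeholder style used above.

```{r}
# Hypothetical stand-in for a template filler (an assumption, not getcdw's code).
# Replaces each ##NAME## placeholder with the matching entry of `values`.
fill_template <- function(template, values) {
  for (name in names(values)) {
    placeholder <- paste0("##", name, "##")
    template <- gsub(placeholder, values[[name]], template, fixed = TRUE)
  }
  template
}

template <- "select distinct ##ID_FIELD## as ##ID_TYPE##
from ##TABLE_NAME##
where ##FIELD_NAME## in (##LIST_OF_VALUES##)"

sql <- fill_template(template, list(
  ID_FIELD = "entity_id",
  ID_TYPE = "entity_id",
  TABLE_NAME = "cdw.d_bio_affiliation_mv",
  FIELD_NAME = "affil_code",
  LIST_OF_VALUES = "'M2', 'MA4'"
))
cat(sql)
```

The point is only that a template plus a named list of values is enough to produce a complete query string.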
| 31 | + |
| 32 | +Now, we can get a clunky, but functional, `has_affiliation` widget working: |
| 33 | + |
| 34 | +```{r} |
| 35 | +has_affiliation <- function(affiliations) { |
| 36 | + list(ID_FIELD = "entity_id", |
| 37 | + ID_TYPE = "entity_id", |
| 38 | + TABLE_NAME = "cdw.d_bio_affiliation_mv", |
| 39 | + FIELD_NAME = "affil_code", |
| 40 | + LIST_OF_VALUES = paste0("'", affiliations, "'", |
| 41 | + collapse = ", ")) |
| 42 | +} |
| 43 | +``` |
| 44 | + |
| 45 | +Wait, what? I wanted to create a constituency definition, but all I got was this lousy list! |
| 46 | + |
| 47 | +## A template as a data structure |
| 48 | + |
| 49 | +Recall that definitions don't turn into IDs until we use the `display` function. The basic job of our `display` function will be to take care of two steps: |
| 50 | + |
| 51 | +1. Convert the definition (currently just a list) to SQL, and |
| 52 | +2. Send the SQL to the database, returning the appropriate data |
| 53 | + |
| 54 | +So it makes sense that `has_affiliation` just makes a list. It has all of the components of our definition, so we can inspect and figure out what it's supposed to return, but it won't actually return data from the data warehouse until we build a `display` function. |
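As a small illustration of "inspecting" a definition, the sketch below builds the same kind of list by hand and looks inside it with base R. The variable name `example_definition` is made up for this example; the field values are the ones used by `has_affiliation` above.

```{r}
# A definition is just a named list, so ordinary list tools work on it.
example_definition <- list(
  ID_FIELD = "entity_id",
  ID_TYPE = "entity_id",
  TABLE_NAME = "cdw.d_bio_affiliation_mv",
  FIELD_NAME = "affil_code",
  LIST_OF_VALUES = paste0("'", c("M2", "MA4"), "'", collapse = ", ")
)

# show the structure of the whole definition
str(example_definition)

# or pull out a single component
example_definition$LIST_OF_VALUES
```

Nothing here touches the database: the list only describes the query we intend to run.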
| 55 | + |
| 56 | +Luckily, step 2 of the `display` function is easy, because that's exactly what the function `getcdw::get_cdw` does (if you're building for your own database, you just need a function here that can send SQL to your database and return a `data.frame`). |
| 57 | + |
| 58 | +```{r} |
| 59 | +display <- function(definition) { |
| 60 | + get_cdw(to_sql(definition)) |
| 61 | +} |
| 62 | +``` |
| 63 | + |
| 64 | +How can we implement `to_sql`? Well, that's just a matter of combining our template from `generate_query` with the data to fill it in from `has_affiliation`. The R function `do.call` lets us call any R function using a list of arguments. For example: |
| 65 | + |
| 66 | +```{r} |
| 67 | +# here is the usual way we call a function: |
| 68 | +# this takes 2 samples from the integers between 1 and 100 |
| 69 | +sample(x = 1:100, size = 2) |
| 70 | +
|
| 71 | +# but what if we've collected the arguments from some process? |
| 72 | +args <- list(x = 1:100, size = 2) |
| 73 | +
|
| 74 | +# sample(args) won't work. what we want is do.call, which allows us to |
| 75 | +# "feed" args as arguments to sample() |
| 76 | +do.call("sample", args) |
| 77 | +``` |
| 78 | + |
| 79 | +Ok, so it looks like we know how to implement `to_sql`: |
| 80 | + |
| 81 | +```{r} |
| 82 | +to_sql <- function(definition) { |
| 83 | + do.call("generate_query", definition) |
| 84 | +} |
| 85 | +``` |
| 86 | + |
| 87 | +That's basically it! Let's test what we have: |
| 88 | + |
| 89 | +```{r} |
| 90 | +is_prytanean <- has_affiliation("OC6") |
| 91 | +display(is_prytanean) |
| 92 | +
|
| 93 | +## and we can use multiple affiliations: |
| 94 | +is_constituent <- has_affiliation(c("M2", "MA4")) |
| 95 | +display(is_constituent) |
| 96 | +``` |
| 97 | + |
| 98 | +Ok! So we've now implemented basic widget functionality. We still have a couple of outstanding issues: |
| 99 | + |
| 100 | +1. The real Disco Engine allows you to type `has_affiliation(M2, MA4)`, which is much more readable than `has_affiliation(c("M2", "MA4"))`. |
| 101 | +2. How can we combine multiple predicates to make complex definitions? |
| 102 | + |
| 103 | +## Non-standard evaluation |
| 104 | + |
| 105 | +The answer to our first outstanding issue is provided by the magic of [non-standard evaluation](http://adv-r.had.co.nz/Computing-on-the-language.html) (that link is to the Non-standard evaluation chapter of the excellent book *Advanced R* by Hadley Wickham, which I highly recommend). For the purposes of this simplified example, I'll use the following: |
| 106 | + |
| 107 | +```{r} |
| 108 | +has_affiliation <- function(...) { |
| 109 | + affiliations <- eval(substitute(alist(...))) |
| 110 | + affiliations <- as.character(affiliations) |
| 111 | + |
| 112 | + list(ID_FIELD = "entity_id", |
| 113 | + ID_TYPE = "entity_id", |
| 114 | + TABLE_NAME = "cdw.d_bio_affiliation_mv", |
| 115 | + FIELD_NAME = "affil_code", |
| 116 | + LIST_OF_VALUES = paste0("'", affiliations, "'", |
| 117 | + collapse = ", ")) |
| 118 | +} |
| 119 | +``` |
| 120 | + |
| 121 | +Here `eval(substitute(alist(...)))` captures what the user typed as unevaluated symbols (note that these are symbols, not character strings), and `as.character(affiliations)` converts them to a character vector. So far, so good: |
| 122 | + |
| 123 | +```{r} |
| 124 | +display( |
| 125 | + has_affiliation(M2, MA4) |
| 126 | +) |
| 127 | +``` |
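To see the capture step in isolation, here is a small sketch that needs no database. `capture_symbols` is a hypothetical helper name used only for this illustration; it wraps the same two calls as the simplified `has_affiliation`.

```{r}
# Hypothetical helper: returns the text of whatever was typed as arguments.
capture_symbols <- function(...) {
  captured <- eval(substitute(alist(...)))  # list of unevaluated symbols
  as.character(captured)                    # symbols -> character vector
}

# bare codes are captured as typed, without needing quotes
capture_symbols(M2, MA4)

# the flip side: a variable is NOT looked up, its name is captured instead
code <- "M2"
capture_symbols(code)
```

The second call shows the trade-off of this simplified approach: because arguments are never evaluated, you cannot pass codes stored in a variable.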
| 128 | + |
| 129 | +*Note that the actual package is a little bit more careful about how it does things here, by using the [lazyeval package](https://cran.r-project.org/web/packages/lazyeval/index.html). To see the actual code used in the Disco Engine, check out `prep_dots` and `partial_sub` in [this discoveryengine source code file](https://github.com/tarakc02/discoveryengine/blob/master/R/helper-utils.R).* |
| 130 | + |
| 131 | +## Combining simple definitions |
| 132 | + |
| 133 | +Now we want to know, given two existing definitions, how do we combine them into one more complex definition? The answer is surprisingly simple, but may take a minute to wrap your head around if you haven't programmed in this way before. I'll take things step-by-step. |
| 134 | + |
| 135 | +### "Atomic" vs. "Complex" definitions |
| 136 | + |
| 137 | +Our first step is to distinguish between what I'll call "atomic" and "complex" definitions. So far we've been working with atomic definitions -- that is, simple definitions that do not use `%and%`, `%or%`, or `%but_not%`. |
| 138 | + |
| 139 | +Since we'll want to build more widgets (in order to benefit from being able to combine them!), I'm going to write some scaffolding code that will make it easier to produce widgets: |
| 140 | + |
| 141 | +```{r} |
| 142 | +# widget will be a function to make ATOMIC definitions |
| 143 | +widget <- function(..., ID_FIELD, ID_TYPE = ID_FIELD, TABLE_NAME, |
| 144 | + FIELD_NAME) { |
| 145 | + args <- eval(substitute(alist(...))) |
| 146 | + args <- as.character(args) |
| 147 | + |
| 148 | + res <- list(ID_FIELD = ID_FIELD, |
| 149 | + ID_TYPE = ID_TYPE, |
| 150 | + TABLE_NAME = TABLE_NAME, |
| 151 | + FIELD_NAME = FIELD_NAME, |
| 152 | + LIST_OF_VALUES = paste0("'", args, "'", |
| 153 | + collapse = ", ")) |
| 154 | + |
| 155 | + # tagging atomic definitions makes it easier to keep track |
| 156 | + # here we add an "attribute" specifying that the definition is atomic |
| 157 | + # if you're unfamiliar with attributes, see ?attributes |
| 158 | + structure(res, atomic = TRUE) |
| 159 | +} |
| 160 | +``` |
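If attributes are new to you, this short base-R sketch shows how the tag behaves. The object names are made up for the example.

```{r}
# structure() attaches an attribute without changing the list's contents
tagged <- structure(list(FIELD_NAME = "affil_code"), atomic = TRUE)

# the attribute is read back with attr()
attr(tagged, "atomic")

# the list elements are still accessed as usual
tagged$FIELD_NAME

# an untagged list has no such attribute, so attr() returns NULL
attr(list(FIELD_NAME = "affil_code"), "atomic")
```

The tag rides along with the list as metadata, which is why the rest of the code can keep treating definitions as ordinary lists.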
| 161 | + |
| 162 | +Now let's re-create `has_affiliation` using our new scaffolding code, and also let's create a widget called `participated_in` for student activities and one called `on_committee` for committee participation: |
| 163 | + |
| 164 | +```{r} |
| 165 | +has_affiliation <- function(...) { |
| 166 | + # notice I can just pass the ... along to the next function |
| 167 | + widget(..., |
| 168 | + ID_FIELD = "entity_id", |
| 169 | + TABLE_NAME = "cdw.d_bio_affiliation_mv", |
| 170 | + FIELD_NAME = "affil_code") |
| 171 | +} |
| 172 | +
|
| 173 | +participated_in <- function(...) { |
| 174 | + widget(..., |
| 175 | + ID_FIELD = "entity_id", |
| 176 | + TABLE_NAME = "cdw.d_bio_student_activity_mv", |
| 177 | + FIELD_NAME = "student_activity_code") |
| 178 | +} |
| 179 | +
|
| 180 | +on_committee <- function(...) { |
| 181 | + widget(..., |
| 182 | + ID_FIELD = "entity_id", |
| 183 | + TABLE_NAME = "cdw.d_bio_committee_mv", |
| 184 | + FIELD_NAME = "committee_code") |
| 185 | +} |
| 186 | +``` |
| 187 | + |
| 188 | +Just to make sure I haven't broken anything, I quickly inspect the SQL that is being built by these widgets: |
| 189 | + |
| 190 | +```{r} |
| 191 | +# to make it easier to inspect the queries |
| 192 | +show_query <- function(definition) cat(to_sql(definition)) |
| 193 | +
|
| 194 | +show_query( has_affiliation(MA6, OC3) ) |
| 195 | +show_query( participated_in(SA1, SA2) ) |
| 196 | +show_query( on_committee(AE7, ME3, ME5, AE5) ) |
| 197 | +``` |
| 198 | + |
| 199 | +### Operations on definitions |
| 200 | + |
| 201 | +We now want to implement `%and%`, `%or%`, and `%but_not%`. Luckily, these correspond, in order, to the SQL set operators `intersect`, `union`, and `minus` (`minus` is Oracle's spelling of what standard SQL calls `except`). |
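If the set operators are unfamiliar, base R has direct analogues that work on vectors. The sketch below uses made-up ID vectors to show what each operation keeps; in the Disco Engine the same logic runs inside the database rather than in R.

```{r}
# made-up entity IDs returned by two hypothetical definitions
affiliation_ids <- c(1, 2, 3, 4)
activity_ids <- c(3, 4, 5)

# %and%     ~ intersect: IDs present in both
intersect(affiliation_ids, activity_ids)

# %or%      ~ union: IDs present in either
union(affiliation_ids, activity_ids)

# %but_not% ~ minus: IDs in the first but not the second
setdiff(affiliation_ids, activity_ids)
```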
| 202 | + |
| 203 | +So here is the surprisingly simple template we need to construct complex queries: |
| 204 | + |
| 205 | +```{r} |
| 206 | +generate_complex_query <- parameterize_template(" |
| 207 | +(##LHS##) |
| 208 | +##operator## |
| 209 | +(##RHS##) |
| 210 | +") |
| 211 | +``` |
| 212 | + |
| 213 | +Here `LHS` and `RHS` (which stand for "left-hand-side" and "right-hand-side") are the SQL translations of definitions (they can be atomic or complex -- as long as we know they have been converted to SQL properly, the result of this template will also be a valid SQL query). So when we translate a complex definition to SQL, all we have to do is translate the individual components to SQL. Those components (that is, `LHS` and `RHS`) may be either atomic (in which case we already know what to do) or complex (in which case we'll just break it down again using the same logic, until we get down to atomic definitions). |
| 214 | + |
| 215 | +Now to implement our operations, we once again collect the necessary information into a list, this time tagging it as not atomic: |
| 216 | + |
| 217 | +```{r} |
| 218 | +operate <- function(LHS, RHS, operator) { |
| 219 | + res <- list( |
| 220 | + operator = operator, |
| 221 | + LHS = LHS, |
| 222 | + RHS = RHS |
| 223 | + ) |
| 224 | + |
| 225 | + # we need to make sure to tag the result as NOT atomic |
| 226 | + structure(res, atomic = FALSE) |
| 227 | +} |
| 228 | +
|
| 229 | +`%and%` <- function(LHS, RHS) operate(LHS, RHS, "intersect") |
| 230 | +`%or%` <- function(LHS, RHS) operate(LHS, RHS, "union") |
| 231 | +`%but_not%` <- function(LHS, RHS) operate(LHS, RHS, "minus") |
| 232 | +``` |
| 233 | + |
| 234 | +## Re-visiting `to_sql` |
| 235 | + |
| 236 | +We now have two different SQL templates -- one for atomic definitions, and one for non-atomic definitions. Accordingly, we'll update `to_sql` so that it uses the correct template. The fact that every definition we create, whether atomic or not, has an attribute called `atomic` that can be either `TRUE` or `FALSE` helps us. We'll start by making a helper function that tells us whether a definition is atomic: |
| 237 | + |
| 238 | +```{r} |
| 239 | +is_atomic <- function(definition) { |
| 240 | + # recall this attribute is always TRUE or FALSE |
| 241 | + attr(definition, "atomic") |
| 242 | +} |
| 243 | +``` |
| 244 | + |
| 245 | +Now we can update our `to_sql` function to check whether a definition is atomic or not and then populate the appropriate template: |
| 246 | + |
| 247 | +```{r} |
| 248 | +to_sql <- function(definition) { |
| 249 | + if (is_atomic(definition)) { |
| 250 | + # we already know what to do with atomic definitions |
| 251 | + do.call("generate_query", definition) |
| 252 | + } else { |
| 253 | + # with complex definitions, we translate the LHS and RHS to SQL, |
| 254 | + # then pass everything back to the complex query template |
| 255 | + translated_pieces <- list( |
| 256 | + LHS = to_sql(definition$LHS), |
| 257 | + RHS = to_sql(definition$RHS), |
| 258 | + operator = definition$operator |
| 259 | + ) |
| 260 | + do.call("generate_complex_query", translated_pieces) |
| 261 | + } |
| 262 | +} |
| 263 | +``` |
| 264 | + |
| 265 | +If you look closely at the non-atomic branch of `to_sql`, you may be surprised to find that it is defined in terms of itself: `to_sql` calls `to_sql`. This technique is called recursion, and it helps to describe the process in plain language first: |
| 266 | + |
| 267 | +> To convert a complex (i.e. non-atomic) definition to SQL, we first convert its constituent pieces to SQL, then combine the resulting SQL using the appropriate operator (intersect, union, or minus) |
| 268 | +
|
| 269 | +That process is pretty easy to understand if both pieces of a non-atomic definition are atomic, but what if, say, the left-hand-side (LHS) is also non-atomic? Well, we just do the same thing, trying to convert the constituent pieces to SQL, and so forth. This works because *we know that eventually we'll hit an atomic definition*. |
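The recursion can be tried out without a database. The sketch below is a toy version under stated assumptions: `atom`, `combine`, and `toy_to_sql` are hypothetical names invented for this illustration, and each atomic definition simply carries a ready-made SQL string instead of going through a template.

```{r}
# Toy constructors (hypothetical): an atom holds a SQL string directly,
# a combination holds two definitions and an operator.
atom <- function(sql) structure(list(sql = sql), atomic = TRUE)
combine <- function(lhs, rhs, operator) {
  structure(list(LHS = lhs, RHS = rhs, operator = operator), atomic = FALSE)
}

toy_to_sql <- function(definition) {
  # base case: atomic definitions need no further breaking down
  if (isTRUE(attr(definition, "atomic"))) {
    return(definition$sql)
  }
  # recursive case: translate each side, then join with the operator
  paste0("(", toy_to_sql(definition$LHS), ")\n",
         definition$operator, "\n",
         "(", toy_to_sql(definition$RHS), ")")
}

# the left-hand side here is itself a complex definition
nested <- combine(combine(atom("A"), atom("B"), "intersect"),
                  atom("C"), "minus")
cat(toy_to_sql(nested))
```

Each call either returns immediately (an atom) or hands strictly smaller pieces to itself, which is why the process is guaranteed to finish.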
| 270 | + |
| 271 | +## Seeing it all in action |
| 272 | + |
| 273 | +Let's create some definitions and then see how they are being converted to SQL. |
| 274 | + |
| 275 | +```{r} |
| 276 | +# a simple definition |
| 277 | +has_engineering_affil <- has_affiliation(MA4, SWE, URAE, DEN1) |
| 278 | +
|
| 279 | +# a slightly more complex definition |
| 280 | +engineering_constituency_1 <- |
| 281 | + has_engineering_affil %and% |
| 282 | + participated_in(UWSE, CHES, ENWE, ENSE) |
| 283 | +
|
| 284 | +engineering_constituency_2 <- |
| 285 | + engineering_constituency_1 %but_not% |
| 286 | + on_committee(ME3) |
| 287 | +
|
| 288 | +``` |
| 289 | + |
| 290 | +To reassure myself that the system is working as expected, I `display` the most complicated constituency, and look in CADS to verify that the resulting IDs do in fact match the definition I created: |
| 291 | + |
| 292 | +```{r} |
| 293 | +display(engineering_constituency_2) |
| 294 | +``` |
| 295 | + |
| 296 | +Now, let's take a look at the actual SQL that is being generated behind the scenes: |
| 297 | + |
| 298 | +```{r} |
| 299 | +# we've already seen the atomic definitions turned to SQL: |
| 300 | +show_query(has_engineering_affil) |
| 301 | +
|
| 302 | +# this is a complex definition, but both pieces are atomic: |
| 303 | +show_query(engineering_constituency_1) |
| 304 | +
|
| 305 | +# as the definitions get more complex, even the LHS and RHS can |
| 306 | +# be complex, but everything still can be analyzed down to atomic pieces: |
| 307 | +show_query(engineering_constituency_2) |
| 308 | +``` |
| 309 | + |
| 310 | +You'll notice that, especially as the definitions grow more complex, the resulting SQL may not look exactly how you'd type it yourself. But it does work, is correct, and, because database query optimizers rewrite queries using the equivalences of [Relational Algebra](https://en.wikipedia.org/wiki/Relational_algebra), the SQL will generally be optimized before it runs against our data. |
| 311 | + |
| 312 | +## For further study |
| 313 | + |
| 314 | +If you've made it this far, you should have a pretty solid understanding of the basic functioning of the Discovery Engine. We did not cover some core features, in particular [code lookup (aka synonym search)](#synonym-search), [higher order widgets](#higher-order-widgets), and the bots -- |
| 315 | +the [brainstorm bot](#brainstorm-bot) and the [matrix bot](#matrix-bot). But armed with the knowledge you do have, you can explore [the source code](https://github.com/tarakc02/discoveryengine) and see how everything else works. Feel free to [reach out to Tarak via email](mailto:[email protected]) if you have any questions. |