| 1 | +#+STARTUP: fold indent |
| 2 | +#+OPTIONS: tex:t toc:2 H:6 ^:{} |
| 3 | + |
| 4 | +#+TITLE: Databases and SQL for Data Scientists |
| 5 | +#+AUTHOR: Derek Devnich |
| 6 | +#+BEGIN_SRC sql |
| 7 | +#+END_SRC |
| 8 | +#+BEGIN_SRC bash |
| 9 | +#+END_SRC |
| 10 | +* COMMENT SQL interaction |
| 11 | +1. Start SQLite inferior process |
| 12 | + ~M-x sql-sqlite~ |
| 13 | + |
| 14 | +2. Set SQL dialect for syntax highlighting |
| 15 | + ~M-x sql-set-product~ |
| 16 | + ~sqlite~ |
| 17 | + |
| 18 | +* Introducing databases and SQL: Why use a database? |
| 19 | +** Performance |
| 20 | + |
| 21 | +** Correctness |
| 22 | +There are two aspects of "correctness": Enforcing consistency and eliminating ambiguity. A database enforces consistency with a combination of data types, rules (e.g., foreign keys, triggers, etc.), and atomic transactions. It eliminates ambiguity by forbidding NULLs. |
| 23 | + |
| 24 | +1. You can represent simple data in a single table |
| 25 | + [[file:images/animals.png]] |
| 26 | + |
| 27 | +2. The single table breaks down when your data is complex |
| 28 | + [[file:images/animals_blob.png]] |
| 29 | + |
| 30 | + If you use a nested representation, the individual table cells are no longer atomic. The tools you use to query, search, or analyze the data rely on the atomic structure of the table, and they break down when the cell contents are complex. |
| 31 | + |
| 32 | +3. Complex data with duplicate rows |
| 33 | + [[file:images/animals_dup.png]] |
| 34 | + |
| 35 | + - Storing redundant information has storage costs |
| 36 | + - Redundant rows violate the Don't Repeat Yourself [DRY] principle. Every copy is an opportunity to introduce errors or inconsistencies into the data. |
| 37 | + - Storing multidimensional data in a single table increases the chance that your records will have NULL fields, which will complicate future queries (more on this later) |
| 38 | + |
| 39 | +4. Solution: Normalize the data by breaking it into multiple tables |
| 40 | + [[file:images/animals_half.png]] [[file:images/sightings_half.png]] |
| 41 | + |
| 42 | + - Every row of every table contains unique information |
| 43 | + - Normalization is a continuum. We could normalize this data further, but there is a trade-off in terms of sane table management. Finding the correct trade-off is a matter of taste, judgment, and domain-specific knowledge. |
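| | + |
| | +A minimal sketch of this split in SQL. The table and column names (~animals~, ~sightings~, ~animal_id~, and so on) are hypothetical, since the real columns appear only in the images above. |
| | +#+BEGIN_SRC sql |
| | +-- Hypothetical schema: one row per animal, one row per sighting |
| | +CREATE TABLE animals ( |
| | +    animal_id INTEGER PRIMARY KEY, |
| | +    name      TEXT NOT NULL, |
| | +    species   TEXT NOT NULL |
| | +); |
| | + |
| | +CREATE TABLE sightings ( |
| | +    sighting_id INTEGER PRIMARY KEY, |
| | +    -- Each sighting points at exactly one animal instead of repeating its details |
| | +    animal_id   INTEGER NOT NULL REFERENCES animals (animal_id), |
| | +    seen_on     TEXT NOT NULL,  -- ISO 8601 date string, e.g. '2021-06-01' |
| | +    location    TEXT NOT NULL |
| | +); |
| | +#+END_SRC |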
| 44 | + |
| 45 | +** Encode Domain Knowledge |
| 46 | +[[file:images/bank_account_schema.jpg]] |
| 47 | + |
| 48 | +- Encodes shape of domain |
| 49 | +- Embeds domain rules: e.g. cannot have a customer transaction without a customer account |
| 50 | +- Rules provide additional layer of correctness in the form of constraints |
| 51 | +- Note that forbidding NULL seems much more reasonable in this context! |
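| | + |
| | +A sketch of how such rules become constraints. The tables below are hypothetical and far simpler than the schema in the image. Note that SQLite only enforces foreign keys after ~PRAGMA foreign_keys = ON~. |
| | +#+BEGIN_SRC sql |
| | +PRAGMA foreign_keys = ON;  -- SQLite does not enforce foreign keys by default |
| | + |
| | +-- Hypothetical tables for illustration |
| | +CREATE TABLE account ( |
| | +    account_id INTEGER PRIMARY KEY, |
| | +    customer   TEXT NOT NULL |
| | +); |
| | + |
| | +CREATE TABLE account_transaction ( |
| | +    transaction_id INTEGER PRIMARY KEY, |
| | +    account_id     INTEGER NOT NULL REFERENCES account (account_id), |
| | +    amount         NUMERIC NOT NULL |
| | +); |
| | + |
| | +-- Rejected with a foreign key error: there is no account 42 |
| | +INSERT INTO account_transaction (account_id, amount) VALUES (42, 10.00); |
| | +#+END_SRC |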
| 52 | + |
| 53 | +** Extensions |
| 54 | +- Functions |
| 55 | +- Data types (GIS, JSON, date/time, searchable document, currency…) |
| 56 | +- Full-text search |
| 57 | + |
| 58 | +* Accessing data with queries |
| 59 | +** Basic queries |
| 60 | +1. Select everything from a table |
| 61 | + #+BEGIN_SRC sql |
| 62 | + SELECT * |
| 63 | + FROM surveys; |
| 64 | + #+END_SRC |
| 65 | + |
| 66 | +2. Select a column |
| 67 | + #+BEGIN_SRC sql |
| 68 | + SELECT year |
| 69 | + FROM surveys; |
| 70 | + #+END_SRC |
| 71 | + |
| 72 | +3. Select multiple columns |
| 73 | + #+BEGIN_SRC sql |
| 74 | + SELECT year, month, day |
| 75 | + FROM surveys; |
| 76 | + #+END_SRC |
| 77 | + |
| 78 | +4. Limit results |
| 79 | + #+BEGIN_SRC sql |
| 80 | + SELECT * |
| 81 | + FROM surveys |
| 82 | + LIMIT 10; |
| 83 | + #+END_SRC |
| 84 | + |
| 85 | +5. Get unique values |
| 86 | + #+BEGIN_SRC sql |
| 87 | + SELECT DISTINCT species_id |
| 88 | + FROM surveys; |
| 89 | + #+END_SRC |
| 90 | + |
| 91 | + #+BEGIN_SRC sql |
| 92 | + -- Return distinct pairs |
| 93 | + SELECT DISTINCT year, species_id |
| 94 | + FROM surveys; |
| 95 | + #+END_SRC |
| 96 | + |
| 97 | +6. Calculate values |
| 98 | + #+BEGIN_SRC sql |
| 99 | + -- Convert g to kg (1000.0 forces floating-point division) |
| 100 | + SELECT plot_id, species_id, weight / 1000.0 |
| 101 | + FROM surveys; |
| 102 | + #+END_SRC |
| 103 | + |
| 104 | +7. SQL databases have functions |
| 105 | + #+BEGIN_SRC sql |
| 106 | + SELECT plot_id, species_id, ROUND(weight / 1000.0, 2) |
| 107 | + FROM surveys; |
| 108 | + #+END_SRC |
| 109 | + |
| 110 | +** Filtering |
| 111 | +1. Filter by a criterion |
| 112 | + #+BEGIN_SRC sql |
| 113 | + SELECT * |
| 114 | + FROM surveys |
| 115 | + WHERE species_id='DM'; |
| 116 | + #+END_SRC |
| 117 | + |
| 118 | + #+BEGIN_SRC sql |
| 119 | + SELECT * |
| 120 | + FROM surveys |
| 121 | + WHERE year >= 2000; |
| 122 | + #+END_SRC |
| 123 | + |
| 124 | +2. Combine criteria with booleans |
| 125 | + #+BEGIN_SRC sql |
| 126 | + SELECT * |
| 127 | + FROM surveys |
| 128 | + WHERE (year >= 2000) AND (species_id = 'DM'); |
| 129 | + #+END_SRC |
| 130 | + |
| 131 | + #+BEGIN_SRC sql |
| 132 | + SELECT * |
| 133 | + FROM surveys |
| 134 | + WHERE (species_id = 'DM') OR (species_id = 'DO') OR (species_id = 'DS'); |
| 135 | + #+END_SRC |
| 136 | + |
| 137 | +** *Challenge 1*: Large bois |
| 138 | +Get all of the individuals in Plot 1 that weighed more than 75 grams, telling us the date, species id code, and weight (in kg). |
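| | + |
| | +One possible answer, assuming the ~surveys~ columns used in the queries above and that ~weight~ is recorded in grams: |
| | +#+BEGIN_SRC sql |
| | +SELECT year, month, day, species_id, weight / 1000.0 |
| | +FROM surveys |
| | +WHERE (plot_id = 1) AND (weight > 75); |
| | +#+END_SRC |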
| 139 | + |
| 140 | +** Building complex queries |
| 141 | +Use sets ("tuples") to condense criteria. |
| 142 | +#+BEGIN_SRC sql |
| 143 | +SELECT * |
| 144 | +FROM surveys |
| 145 | +WHERE (year >= 2000) AND (species_id IN ('DM', 'DO', 'DS')); |
| 146 | +#+END_SRC |
| 147 | + |
| 148 | +** Sorting |
| 149 | +1. Sort by a column value |
| 150 | + #+BEGIN_SRC sql |
| 151 | + SELECT * |
| 152 | + FROM species |
| 153 | + ORDER BY taxa ASC; |
| 154 | + #+END_SRC |
| 155 | + |
| 156 | +2. Descending sort |
| 157 | + #+BEGIN_SRC sql |
| 158 | + SELECT * |
| 159 | + FROM species |
| 160 | + ORDER BY taxa DESC; |
| 161 | + #+END_SRC |
| 162 | + |
| 163 | +3. Nested sort |
| 164 | + #+BEGIN_SRC sql |
| 165 | + SELECT * |
| 166 | + FROM species |
| 167 | + ORDER BY genus ASC, species ASC; |
| 168 | + #+END_SRC |
| 169 | + |
| 170 | +** *Challenge 2* |
| 171 | +Write a query that returns year, species_id, and weight in kg from the surveys table, sorted with the largest weights at the top. |
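| | + |
| | +One possible answer, again assuming ~weight~ is recorded in grams. In SQLite, rows with a NULL weight sort last under ~DESC~. |
| | +#+BEGIN_SRC sql |
| | +SELECT year, species_id, weight / 1000.0 |
| | +FROM surveys |
| | +ORDER BY weight DESC; |
| | +#+END_SRC |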
| 172 | + |
| 173 | +** Order of execution |
| 174 | +Queries are pipelines |
| 175 | +[[file:images/written_vs_execution_order.png]] |
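| | + |
| | +As a rough illustration, the query below is annotated with the usual logical evaluation order of its clauses. This is a sketch of the standard conceptual order, not a transcription of the image, and the query planner is free to execute things differently as long as the result is the same. |
| | +#+BEGIN_SRC sql |
| | +-- Clauses are numbered in the order they are logically evaluated |
| | +SELECT species_id, COUNT(*)   -- 5. compute the output columns |
| | +FROM surveys                  -- 1. pick the source table |
| | +WHERE year >= 2000            -- 2. filter rows |
| | +GROUP BY species_id           -- 3. group the remaining rows |
| | +HAVING COUNT(*) > 10          -- 4. filter groups |
| | +ORDER BY COUNT(*) DESC        -- 6. sort the output |
| | +LIMIT 5;                      -- 7. truncate the output |
| | +#+END_SRC |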
| 176 | + |
| 177 | +* Aggregating and grouping data (i.e. reporting) |
| 178 | +** COUNT and GROUP BY |
| 179 | +1. The COUNT function |
| 180 | + #+BEGIN_SRC sql |
| 181 | + SELECT COUNT(*) |
| 182 | + FROM surveys; |
| 183 | + #+END_SRC |
| 184 | + |
| 185 | + #+BEGIN_SRC sql |
| 186 | + -- COUNT(weight) and AVG(weight) ignore rows where weight is NULL |
| 187 | + SELECT COUNT(weight), AVG(weight) |
| 188 | + FROM surveys; |
| 189 | + #+END_SRC |
| 190 | + |
| 191 | +2. |
| 192 | +** Ordering aggregated results |
| 193 | +** Aliases |
| 194 | +** The HAVING keyword |
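| | +A minimal sketch for this section: ~WHERE~ filters rows before they are grouped, while ~HAVING~ filters the groups afterwards. The threshold of 10 is arbitrary. |
| | +#+BEGIN_SRC sql |
| | +-- Species observed more than 10 times, most frequent first |
| | +SELECT species_id, COUNT(*) AS n |
| | +FROM surveys |
| | +GROUP BY species_id |
| | +HAVING COUNT(*) > 10 |
| | +ORDER BY n DESC; |
| | +#+END_SRC |
| | + |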
| 195 | +** Saving queries for future use |
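| | +One way to save a query is as a view. This is a sketch, and the view name ~species_counts~ is hypothetical. |
| | +#+BEGIN_SRC sql |
| | +-- Store the query under a name |
| | +CREATE VIEW species_counts AS |
| | +SELECT species_id, COUNT(*) AS n |
| | +FROM surveys |
| | +GROUP BY species_id; |
| | + |
| | +-- The view can then be queried like a table |
| | +SELECT * |
| | +FROM species_counts |
| | +WHERE n > 10; |
| | +#+END_SRC |
| | + |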
| 196 | +** NULL |
| 197 | +Go to slides, rather than extensively demo (do demo "is null") |
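| | + |
| | +A short version of the "is null" demo, assuming some rows in ~surveys~ have no recorded weight: |
| | +#+BEGIN_SRC sql |
| | +-- Rows with a missing weight |
| | +SELECT * |
| | +FROM surveys |
| | +WHERE weight IS NULL; |
| | + |
| | +-- Returns no rows: comparing anything to NULL with = yields NULL, never true |
| | +SELECT * |
| | +FROM surveys |
| | +WHERE weight = NULL; |
| | +#+END_SRC |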
| 198 | + |
| 199 | +* Combining data with joins |
| 200 | + |
| 201 | +* Data hygiene |
| 202 | +** TODO The problem with nulls |
| 203 | +Missing data and deceptive query results |
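| | + |
| | +A sketch of how NULLs can make results deceptive, assuming some rows in ~surveys~ have a NULL ~species_id~: |
| | +#+BEGIN_SRC sql |
| | +-- If any species_id is NULL, the first two counts sum to less than the third |
| | +SELECT COUNT(*) FROM surveys WHERE species_id = 'DM'; |
| | +SELECT COUNT(*) FROM surveys WHERE species_id != 'DM'; |
| | +SELECT COUNT(*) FROM surveys; |
| | + |
| | +-- Include the missing values explicitly |
| | +SELECT COUNT(*) |
| | +FROM surveys |
| | +WHERE (species_id != 'DM') OR (species_id IS NULL); |
| | +#+END_SRC |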
| 204 | + |
| 205 | +** Data integrity constraints: Keys, not null, etc |
| 206 | + |
| 207 | +** TODO Levels of Normalization |
| 208 | + |
| 209 | +* Creating and modifying data |
| 210 | +** Insert statements |
| 211 | +** Create tables |
| 212 | +** Table constraints |
| 213 | +SQLite CHECK constraints |
| 214 | +https://stackoverflow.com/questions/29476818/how-to-avoid-inserting-the-wrong-data-type-in-sqlite-tables |
| 215 | +https://www.sqlitetutorial.net/sqlite-check-constraint/ |
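| | + |
| | +A minimal sketch of a CHECK constraint along the lines of the links above. The table ~weighings~ is hypothetical. |
| | +#+BEGIN_SRC sql |
| | +-- Hypothetical table: weight must be stored as a real number and be positive |
| | +CREATE TABLE weighings ( |
| | +    id     INTEGER PRIMARY KEY, |
| | +    weight REAL NOT NULL CHECK (typeof(weight) = 'real' AND weight > 0) |
| | +); |
| | + |
| | +INSERT INTO weighings (weight) VALUES (42.5);     -- accepted |
| | +INSERT INTO weighings (weight) VALUES ('heavy');  -- rejected by the CHECK constraint |
| | +INSERT INTO weighings (weight) VALUES (-1);       -- rejected by the CHECK constraint |
| | +#+END_SRC |
| | + |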
| 216 | +** Atomic commits |
| 217 | +By default, each INSERT statement is its own transaction. But if you surround multiple INSERT statements with BEGIN...COMMIT then all the inserts are grouped into a single transaction. The time needed to commit the transaction is amortized over all the enclosed insert statements and so the time per insert statement is greatly reduced. |
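| | + |
| | +A minimal sketch of grouping inserts into one transaction. The table ~readings~ is hypothetical. |
| | +#+BEGIN_SRC sql |
| | +-- Hypothetical table for illustration |
| | +CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL); |
| | + |
| | +BEGIN TRANSACTION; |
| | +INSERT INTO readings (value) VALUES (1.5); |
| | +INSERT INTO readings (value) VALUES (2.5); |
| | +INSERT INTO readings (value) VALUES (3.5); |
| | +COMMIT;  -- all three rows are committed together |
| | +#+END_SRC |
| | + |
| | +Replacing ~COMMIT~ with ~ROLLBACK~ discards all three inserts, which is what makes the group atomic. |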
| 218 | + |
| 219 | +* (Optional) SQLite on the command line |
| 220 | +** Basic commands |
| 221 | +#+BEGIN_SRC bash |
| 222 | +sqlite3 # enter sqlite prompt |
| 223 | +.tables # show table names |
| 224 | +.schema # show table schema |
| 225 | +.help # view built-in commands |
| 226 | +.quit |
| 227 | +#+END_SRC |
| 228 | + |
| 229 | +** Getting output |
| 230 | +1. Formatted output in the terminal |
| 231 | + #+BEGIN_SRC sql |
| 232 | + .headers on |
| 233 | + .help mode |
| 234 | + .mode column |
| 235 | + #+END_SRC |
| 236 | + |
| 237 | + #+BEGIN_SRC sql |
| 238 | + select * from species where taxa == 'Rodent'; |
| 239 | + #+END_SRC |
| 240 | + |
| 241 | +2. Output to .csv file |
| 242 | + #+BEGIN_SRC bash |
| 243 | + .mode csv |
| 244 | + .output test.csv |
| 245 | + #+END_SRC |
| 246 | + |
| 247 | + #+BEGIN_SRC sql |
| 248 | + select * from species where taxa == 'Rodent'; |
| 249 | + #+END_SRC |
| 250 | + |
| 251 | + #+BEGIN_SRC bash |
| 252 | + .output stdout |
| 253 | + #+END_SRC |
| 254 | + |
| 255 | +* TODO (Optional) Database access via programming languages |
| 256 | +** R language bindings |
| 257 | +** Python language bindings |
| 258 | + |
| 259 | +* (Optional) What kind of data storage system do I need? |
| 260 | +** Non-atomic write; sequential read |
| 261 | +1. Files |
| 262 | + |
| 263 | +** Single atomic write (database-level lock); query-driven read |
| 264 | +1. SQLite |
| 265 | +2. Microsoft Access |
| 266 | + |
| 267 | +** Multiple atomic writes (row-level lock); query-driven read |
| 268 | +1. PostgreSQL: https://www.postgresql.org |
| 269 | +2. MySQL/MariaDB |
| 270 | + - https://mariadb.org |
| 271 | + - https://www.mysql.com |
| 272 | +3. Oracle |
| 273 | +4. Microsoft SQL Server |
| 274 | +5. ...etc. |
| 275 | + |
| 276 | +* (Optional) Performance tweaks and limitations |
| 277 | +** Getting the most out of your database |
| 278 | +1. Use recommended settings, not default settings |
| 279 | +2. Make judicious use of indexes |
| 280 | +3. Use the query planner (this will provide feedback for item 2) |
| 281 | +4. Cautiously de-normalize your schema |
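| | + |
| | +A sketch of items 2 and 3 in SQLite. The index name is hypothetical, and the exact wording of the plan output varies between SQLite versions. |
| | +#+BEGIN_SRC sql |
| | +-- Hypothetical index to speed up lookups that filter on species_id |
| | +CREATE INDEX IF NOT EXISTS idx_surveys_species_id ON surveys (species_id); |
| | + |
| | +-- Ask SQLite how it would run the query, without running it |
| | +EXPLAIN QUERY PLAN |
| | +SELECT * |
| | +FROM surveys |
| | +WHERE species_id = 'DM'; |
| | +#+END_SRC |
| | + |
| | +With the index in place, the plan should report a search using the index rather than a scan of the whole table. |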
| 282 | + |
| 283 | +** Where relational databases break down |
| 284 | +1. Very large data (hardware, bandwidth, and data integration problems) |
| 285 | +2. Distributed data (uncertainty about correctness) |
| 286 | + |
| 287 | +** Why are distributed systems hard? |
| 288 | +1. CAP theorem |
| 289 | + - In theory, pick any two: Consistent, Available, Partition-Tolerant |
| 290 | + - In practice, Consistent or Available in the presence of a Partition |
| 291 | + |
| 292 | +2. Levels of data consistency |
| 293 | + - https://jepsen.io/consistency |
| 294 | + - https://github.com/aphyr/distsys-class |
| 295 | + |
| 296 | +3. Fallacies of distributed computing |
| 297 | + 1. The network is reliable |
| 298 | + 2. Latency is zero |
| 299 | + 3. Bandwidth is infinite |
| 300 | + 4. The network is secure |
| 301 | + 5. Topology doesn't change |
| 302 | + 6. There is one administrator |
| 303 | + 7. Transport cost is zero |
| 304 | + 8. The network is homogeneous |
| 305 | + |
| 306 | +* *Endnotes* |
| 307 | +* Credits |
| 308 | +- Data management with SQL for ecologists: https://datacarpentry.org/sql-ecology-lesson/ |
| 309 | +- Databases and SQL: http://swcarpentry.github.io/sql-novice-survey/ (data hygiene, creating and modifying data) |
| 310 | +- Simplified bank account schema: https://soft-builder.com/bank-management-system-database-model/ |
| 311 | +- Botanical Information and Ecology Network schema: https://bien.nceas.ucsb.edu/bien/biendata/bien-3/bien-3-schema/ |
| 312 | + |
| 313 | +* References |
| 314 | +- C. J. Date, /SQL and Relational Theory/: https://learning.oreilly.com/library/view/sql-and-relational/9781491941164/ |
| 315 | +- Common database mistakes: https://stackoverflow.com/a/621891 |
| 316 | +- Fallacies of distributed computing: https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing |
| 317 | + |
| 318 | +* Data Sources |
| 319 | +- Portal Project Teaching Database: https://figshare.com/articles/dataset/Portal_Project_Teaching_Database/1314459 |
| 320 | + Specifically, portal_mammals.sqlite: https://figshare.com/ndownloader/files/11188550 |
| 321 | + |
| 322 | +* COMMENT Export to Markdown using Pandoc |
| 323 | + Do this if you want code syntax highlighting and a table of contents on GitHub. |
| 324 | +** Generate generic Markdown file |
| 325 | +#+BEGIN_SRC bash |
| 326 | +pandoc README.org -o tmp.md --wrap=preserve |
| 327 | +#+END_SRC |
| 328 | + |
| 329 | +** Edit generic Markdown file to remove illegal front matter |
| 330 | +1. Org directives |
| 331 | +2. Anything that isn't part of the document structure (e.g. TODO items) |
| 332 | + |
| 333 | +** Generate GitHub Markdown with table of contents |
| 334 | +#+BEGIN_SRC bash |
| 335 | +pandoc -f markdown --toc --toc-depth=2 --wrap=preserve -s tmp.md -o README.md |
| 336 | +#+END_SRC |
| 337 | + |
| 338 | +** Find and replace code block markers in final document (if applicable) |
| 339 | +#+BEGIN_EXAMPLE |
| 340 | +M-x qrr " {.python}" "python" |
| 341 | +M-x qrr " {.bash}" "bash" |
| 342 | +#+END_EXAMPLE |