Commit 5ccbd6f: Initial commit (0 parents)

17 files changed (+873, -0 lines)

.gitignore

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
# Generated documentation and backups

*.odt
*.docx
*.pdf
*.html
*.~lock.*
*.bak
tmp.md

# Binaries
*.xlsx

# System files
.DS_Store

# R environment variables
.Rhistory
.Rapp.history
.RData
.Ruserdata
.Renviron
/*.Rcheck/
.Rproj.user/

# Data folders
data/*.sqlite

# Archives
*.zip

README.org

Lines changed: 342 additions & 0 deletions
@@ -0,0 +1,342 @@
#+STARTUP: fold indent
#+OPTIONS: tex:t toc:2 H:6 ^:{}

#+TITLE: Databases and SQL for Data Scientists
#+AUTHOR: Derek Devnich
#+BEGIN_SRC sql
#+END_SRC
#+BEGIN_SRC bash
#+END_SRC
* COMMENT SQL interaction
1. Start SQLite inferior process
   ~M-x sql-sqlite~

2. Set SQL dialect for syntax highlighting
   ~M-x sql-set-product~
   ~sqlite~

* Introducing databases and SQL: Why use a database?
** Performance

** Correctness
There are two aspects of "correctness": enforcing consistency and eliminating ambiguity. A database enforces consistency with a combination of data types, rules (e.g., foreign keys, triggers, etc.), and atomic transactions. It eliminates ambiguity by forbidding NULLs.

1. You can represent simple data in a single table
   [[file:images/animals.png]]

2. The single table breaks down when your data is complex
   [[file:images/animals_blob.png]]

   If you use a nested representation, the individual table cells are no longer atomic. The tools for querying, searching, and analysis rely on the atomic structure of the table, and they break down when the cell contents are complex.

3. Complex data with duplicate rows
   [[file:images/animals_dup.png]]

   - Storing redundant information has storage costs
   - Redundant rows violate the Don't Repeat Yourself [DRY] principle. Every copy is an opportunity to introduce errors or inconsistencies into the data.
   - Storing multidimensional data in a single table increases the chance that your records will have NULL fields, which will complicate future queries (more on this later)

4. Solution: Normalize the data by breaking it into multiple tables
   [[file:images/animals_half.png]] [[file:images/sightings_half.png]]

   - Every row of every table contains unique information
   - Normalization is a continuum. We could normalize this data further, but there is a trade-off in terms of sane table management. Finding the correct trade-off is a matter of taste, judgment, and domain-specific knowledge.

** Encode Domain Knowledge
[[file:images/bank_account_schema.jpg]]

- Encodes the shape of the domain
- Embeds domain rules: e.g., you cannot have a customer transaction without a customer account
- Rules provide an additional layer of correctness in the form of constraints
- Note that forbidding NULL seems much more reasonable in this context!

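These constraints can be sketched in SQL. A minimal, hypothetical two-table sketch (illustrative names only, not the pictured bank schema):

#+BEGIN_SRC sql
-- Hypothetical sketch: table constraints encode the domain rules
CREATE TABLE accounts (
    account_id  INTEGER PRIMARY KEY,
    holder_name TEXT NOT NULL          -- forbidding NULL is reasonable here
);

CREATE TABLE transactions (
    transaction_id INTEGER PRIMARY KEY,
    account_id     INTEGER NOT NULL REFERENCES accounts(account_id),
    amount         REAL NOT NULL
);
#+END_SRC

The REFERENCES constraint makes it impossible to record a transaction for an account that does not exist (in SQLite, only after ~PRAGMA foreign_keys = ON;~).
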
** Extensions
- Functions
- Data types (GIS, JSON, date/time, searchable document, currency…)
- Full-text search

* Accessing data with queries
** Basic queries
1. Select everything from a table
#+BEGIN_SRC sql
SELECT *
FROM surveys;
#+END_SRC

2. Select a column
#+BEGIN_SRC sql
SELECT year
FROM surveys;
#+END_SRC

3. Select multiple columns
#+BEGIN_SRC sql
SELECT year, month, day
FROM surveys;
#+END_SRC

4. Limit results
#+BEGIN_SRC sql
SELECT *
FROM surveys
LIMIT 10;
#+END_SRC

5. Get unique values
#+BEGIN_SRC sql
SELECT DISTINCT species_id
FROM surveys;
#+END_SRC

#+BEGIN_SRC sql
-- Return distinct pairs
SELECT DISTINCT year, species_id
FROM surveys;
#+END_SRC

6. Calculate values
#+BEGIN_SRC sql
-- Convert g to kg; dividing by 1000.0 avoids integer division
SELECT plot_id, species_id, weight/1000.0
FROM surveys;
#+END_SRC

7. SQL databases have functions
#+BEGIN_SRC sql
SELECT plot_id, species_id, ROUND(weight/1000.0, 2)
FROM surveys;
#+END_SRC

** Filtering
1. Filter by a criterion
#+BEGIN_SRC sql
SELECT *
FROM surveys
WHERE species_id = 'DM';
#+END_SRC

#+BEGIN_SRC sql
SELECT *
FROM surveys
WHERE year >= 2000;
#+END_SRC

2. Combine criteria with booleans
#+BEGIN_SRC sql
SELECT *
FROM surveys
WHERE (year >= 2000) AND (species_id = 'DM');
#+END_SRC

#+BEGIN_SRC sql
SELECT *
FROM surveys
WHERE (species_id = 'DM') OR (species_id = 'DO') OR (species_id = 'DS');
#+END_SRC

** *Challenge 1*: Large bois
Get all of the individuals in Plot 1 that weighed more than 75 grams, telling us the date, species id code, and weight (in kg).

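One possible solution, assuming the surveys columns used elsewhere in these notes (weight is stored in grams, so dividing by 1000.0 gives kg):

#+BEGIN_SRC sql
SELECT year, month, day, species_id, weight/1000.0
FROM surveys
WHERE (plot_id = 1) AND (weight > 75);
#+END_SRC
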
** Building complex queries
Use sets ("tuples") to condense criteria.
#+BEGIN_SRC sql
SELECT *
FROM surveys
WHERE (year >= 2000) AND (species_id IN ('DM', 'DO', 'DS'));
#+END_SRC

** Sorting
1. Sort by a column value
#+BEGIN_SRC sql
SELECT *
FROM species
ORDER BY taxa ASC;
#+END_SRC

2. Descending sort
#+BEGIN_SRC sql
SELECT *
FROM species
ORDER BY taxa DESC;
#+END_SRC

3. Nested sort
#+BEGIN_SRC sql
SELECT *
FROM species
ORDER BY genus ASC, species ASC;
#+END_SRC

** *Challenge 2*
Write a query that returns year, species_id, and weight in kg from the surveys table, sorted with the largest weights at the top.

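One possible solution (sort on the raw weight column, display it in kg):

#+BEGIN_SRC sql
SELECT year, species_id, weight/1000.0
FROM surveys
ORDER BY weight DESC;
#+END_SRC
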
** Order of execution
Queries are pipelines.
[[file:images/written_vs_execution_order.png]]

* Aggregating and grouping data (i.e. reporting)
** COUNT and GROUP BY
1. The COUNT function
#+BEGIN_SRC sql
SELECT COUNT(*)
FROM surveys;
#+END_SRC

#+BEGIN_SRC sql
-- Aggregate functions only consider the non-NULL weights
SELECT COUNT(weight), AVG(weight)
FROM surveys;
#+END_SRC

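A sketch that previews the next few headings (grouping, ordering aggregated results, aliases, and HAVING), assuming the surveys table; the threshold of 10 is arbitrary:

#+BEGIN_SRC sql
SELECT species_id, COUNT(*) AS n, AVG(weight) AS mean_weight
FROM surveys
GROUP BY species_id
HAVING COUNT(*) > 10    -- HAVING filters groups; WHERE filters rows
ORDER BY n DESC;
#+END_SRC
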
** Ordering aggregated results
** Aliases
** The HAVING keyword
** Saving queries for future use
** NULL
Go to the slides rather than demoing extensively (but do demo "IS NULL").
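A minimal version of that demo, assuming the surveys weight column has missing values:

#+BEGIN_SRC sql
-- Rows where weight was never recorded
SELECT *
FROM surveys
WHERE weight IS NULL;
-- Note: weight = NULL never matches; NULL requires IS NULL / IS NOT NULL
#+END_SRC
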

* Combining data with joins

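A sketch of the basic join, assuming surveys.species_id matches species.species_id as in the Portal schema:

#+BEGIN_SRC sql
SELECT surveys.year, surveys.species_id, species.genus, species.species
FROM surveys
JOIN species
  ON surveys.species_id = species.species_id;
#+END_SRC
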
* Data hygiene
** TODO The problem with nulls
Missing data and deceptive query results

** Data integrity constraints: Keys, not null, etc.

** TODO Levels of Normalization

* Creating and modifying data
** Insert statements
** Create tables
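A minimal sketch of both statements, using a hypothetical cut-down plots table rather than the full Portal schema:

#+BEGIN_SRC sql
-- Create tables
CREATE TABLE plots (
    plot_id   INTEGER PRIMARY KEY,
    plot_type TEXT NOT NULL
);

-- Insert statements
INSERT INTO plots (plot_id, plot_type)
VALUES (1, 'Control');
#+END_SRC
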
** Table constraints
The SQLite CHECK constraint:
https://stackoverflow.com/questions/29476818/how-to-avoid-inserting-the-wrong-data-type-in-sqlite-tables
https://www.sqlitetutorial.net/sqlite-check-constraint/
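A sketch of a CHECK constraint in the spirit of the links above (hypothetical table and rule):

#+BEGIN_SRC sql
CREATE TABLE measurements (
    weight REAL CHECK (weight > 0)   -- rejects zero and negative weights
);
#+END_SRC
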
** Atomic commits
By default, each INSERT statement is its own transaction. But if you surround multiple INSERT statements with BEGIN...COMMIT, all of the inserts are grouped into a single transaction. The time needed to commit the transaction is amortized over all of the enclosed INSERT statements, so the time per statement is greatly reduced.
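A sketch of the pattern, assuming a hypothetical plots(plot_id, plot_type) table and made-up values:

#+BEGIN_SRC sql
BEGIN;
INSERT INTO plots (plot_id, plot_type) VALUES (2, 'Rodent exclosure');
INSERT INTO plots (plot_id, plot_type) VALUES (3, 'Control');
COMMIT;
-- One commit for both inserts instead of one commit per insert
#+END_SRC
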
* (Optional) SQLite on the command line
** Basic commands
#+BEGIN_SRC bash
sqlite3 # enter sqlite prompt
.tables # show table names
.schema # show table schema
.help   # view built-in commands
.quit
#+END_SRC

** Getting output
1. Formatted output in the terminal
#+BEGIN_SRC sql
.headers on
.help mode
.mode column
#+END_SRC

#+BEGIN_SRC sql
select * from species where taxa = 'Rodent';
#+END_SRC

2. Output to .csv file
#+BEGIN_SRC bash
.mode csv
.output test.csv
#+END_SRC

#+BEGIN_SRC sql
select * from species where taxa = 'Rodent';
#+END_SRC

#+BEGIN_SRC bash
.output stdout
#+END_SRC

* TODO (Optional) Database access via programming languages
** R language bindings
** Python language bindings

* (Optional) What kind of data storage system do I need?
** Non-atomic write; sequential read
1. Files

** Single atomic write (database-level lock); query-driven read
1. SQLite
2. Microsoft Access

** Multiple atomic writes (row-level lock); query-driven read
1. PostgreSQL: https://www.postgresql.org
2. MySQL/MariaDB
   - https://mariadb.org
   - https://www.mysql.com
3. Oracle
4. Microsoft SQL Server
5. ...etc.

* (Optional) Performance tweaks and limitations
** Getting the most out of your database
1. Use recommended settings, not default settings
2. Make judicious use of indexes
3. Use the query planner (this will provide feedback for item 2)
4. Cautiously de-normalize your schema

** Where relational databases break down
1. Very large data (hardware, bandwidth, and data integration problems)
2. Distributed data (uncertainty about correctness)

** Why are distributed systems hard?
1. CAP theorem
   - In theory, pick any two: Consistent, Available, Partition-Tolerant
   - In practice, choose Consistent or Available in the presence of a Partition

2. Levels of data consistency
   - https://jepsen.io/consistency
   - https://github.com/aphyr/distsys-class

3. Fallacies of distributed computing
   1. The network is reliable
   2. Latency is zero
   3. Bandwidth is infinite
   4. The network is secure
   5. Topology doesn't change
   6. There is one administrator
   7. Transport cost is zero
   8. The network is homogeneous

* *Endnotes*
* Credits
- Data management with SQL for ecologists: https://datacarpentry.org/sql-ecology-lesson/
- Databases and SQL: http://swcarpentry.github.io/sql-novice-survey/ (data hygiene, creating and modifying data)
- Simplified bank account schema: https://soft-builder.com/bank-management-system-database-model/
- Botanical Information and Ecology Network schema: https://bien.nceas.ucsb.edu/bien/biendata/bien-3/bien-3-schema/

* References
- C. J. Date, /SQL and Relational Theory/: https://learning.oreilly.com/library/view/sql-and-relational/9781491941164/
- Common database mistakes: https://stackoverflow.com/a/621891
- Fallacies of distributed computing: https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing

* Data Sources
- Portal Project Teaching Database: https://figshare.com/articles/dataset/Portal_Project_Teaching_Database/1314459
  Specifically, portal_mammals.sqlite: https://figshare.com/ndownloader/files/11188550

* COMMENT Export to Markdown using Pandoc
Do this if you want code syntax highlighting and a table of contents on GitHub.
** Generate generic Markdown file
#+BEGIN_SRC bash
pandoc README.org -o tmp.md --wrap=preserve
#+END_SRC

** Edit generic Markdown file to remove illegal front matter
1. Org directives
2. Anything that isn't part of the document structure (e.g. TODO items)

** Generate GitHub Markdown with table of contents
#+BEGIN_SRC bash
pandoc -f markdown --toc --toc-depth=2 --wrap=preserve -s tmp.md -o README.md
#+END_SRC

** Find and replace code block markers in final document (if applicable)
#+BEGIN_EXAMPLE
M-x qrr " {.python}" "python"
M-x qrr " {.bash}" "bash"
#+END_EXAMPLE

_config.yml

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
title: Databases and SQL for Data Scientists
description: Class notes and code examples

remote_theme: jekyll/minima
#theme: jekyll-theme-slate
