Commit 644c5b6

WIP
1 parent d8492e7 commit 644c5b6

17 files changed: +211 −229 lines

docs/development.md (+3 −3)

@@ -1,7 +1,7 @@
 # Development

-``datajudge`` development relies on [pixi](https://pixi.sh/latest/).
-In order to work on ``datajudge``, you can create a development environment as follows:
+`datajudge` development relies on [pixi](https://pixi.sh/latest/).
+In order to work on `datajudge`, you can create a development environment as follows:

 ```bash
 git clone https://github.com/Quantco/datajudge
@@ -24,7 +24,7 @@ To run integration tests against Postgres, first start a docker container with a
 ./start_postgres.sh
 ```

-In your current environment, install the ``psycopg2`` package.
+In your current environment, install the `psycopg2` package.
 After this, you may execute integration tests as follows:

 ```bash

docs/examples/company-data.md (+13 −13)

@@ -6,22 +6,22 @@ The table "companies_archive" contains three entries:

 **companies_archive**

-| id | name | num_employees |
-|----|---------|---------------|
-| 1 | QuantCo | 90 |
-| 2 | Google | 140,000 |
-| 3 | BMW | 110,000 |
+| id  | name    | num_employees |
+| --- | ------- | ------------- |
+| 1   | QuantCo | 90            |
+| 2   | Google  | 140,000       |
+| 3   | BMW     | 110,000       |

 While "companies" contains an additional entry:

 **companies**

-| id | name | num_employees |
-|----|---------|---------------|
-| 1 | QuantCo | 100 |
-| 2 | Google | 150,000 |
-| 3 | BMW | 120,000 |
-| 4 | Apple | 145,000 |
+| id  | name    | num_employees |
+| --- | ------- | ------------- |
+| 1   | QuantCo | 100           |
+| 2   | Google  | 150,000       |
+| 3   | BMW     | 120,000       |
+| 4   | Apple   | 145,000       |

 ```python
 import sqlalchemy as sa
@@ -108,7 +108,7 @@ requirements = [companies_req, companies_between_req]
 test_constraint = collect_data_tests(requirements)
 ```

-Saving this file as ``specification.py`` and running ``$ pytest specification.py``
+Saving this file as `specification.py` and running `$ pytest specification.py`
 will verify that all constraints are satisfied. The output you see in the terminal
 should be similar to this:

@@ -125,4 +125,4 @@ specification.py::test_constraint[RowSuperset::companies|companies_archive] PASS
 ==================================== 4 passed in 0.31s ====================================
 ```

-You can also use a formatted html report using the ``--html=report.html`` flag.
+You can also use a formatted html report using the `--html=report.html` flag.
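
Aside: for readers reconstructing this example, a minimal `specification.py` along the lines of the file above might look as follows. This is a sketch, not the repository's file — the connection URL, schema names, and the exact keyword arguments of `add_row_superset_constraint` are assumptions.

```python
# specification.py -- a minimal sketch; connection URL, schema names and
# exact keyword arguments are illustrative assumptions.
import sqlalchemy as sa

from datajudge import BetweenRequirement
from datajudge.pytest_integration import collect_data_tests

engine = sa.create_engine("postgresql://user:password@localhost:5432/db")

# Expect every row of "companies_archive" to also be present in "companies".
companies_between_req = BetweenRequirement.from_tables(
    db_name1="db",
    schema_name1="public",
    table_name1="companies",
    db_name2="db",
    schema_name2="public",
    table_name2="companies_archive",
)
companies_between_req.add_row_superset_constraint(
    columns1=["id", "name", "num_employees"],
    columns2=["id", "name", "num_employees"],
    constraint_name="companies",
)

requirements = [companies_between_req]
# Exposes one pytest test per constraint; run via `pytest specification.py`.
test_constraint = collect_data_tests(requirements)
```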

docs/examples/dates.md (+25 −26)

@@ -1,40 +1,39 @@
 # Dates

-This example concerns itself with expressing ``Constraint``\s against data revolving
-around dates. While date ``Constraint``\s between tables exist, we will only illustrate
-``Constraint``\s on a single table and reference values here. As a consequence, we will
-only use ``WithinRequirement``, as opposed to ``BetweenRequirement``.
+This example concerns itself with expressing `Constraint`\s against data revolving
+around dates. While date `Constraint`\s between tables exist, we will only illustrate
+`Constraint`\s on a single table and reference values here. As a consequence, we will
+only use `WithinRequirement`, as opposed to `BetweenRequirement`.

 Concretely, we will assume a table containing prices for a given product of id 1.
 Importantly, these prices are valid for a certain date range only. More precisely,
-we assume that the price for a product - identified via the ``preduct_id`` column -
-is indicated in the ``price`` column, the date from which it is valid - the date
-itself included - in ``date_from`` and the the until when it is valid - the date
-itself included - in the ``date_to`` column.
+we assume that the price for a product - identified via the `preduct_id` column -
+is indicated in the `price` column, the date from which it is valid - the date
+itself included - in `date_from` and the the until when it is valid - the date
+itself included - in the `date_to` column.

 Such a table might look as follows:

 **prices**

-| product_id | price | date_from | date_to |
-|------------|-------|-----------|---------|
-| 1 | 13.99 | 22/01/01 | 22/01/10|
-| 1 | 14.5 | 22/01/11 | 22/01/17|
-| 1 | 13.37 | 22/01/16 | 22/01/31|
+| product_id | price | date_from | date_to  |
+| ---------- | ----- | --------- | -------- |
+| 1          | 13.99 | 22/01/01  | 22/01/10 |
+| 1          | 14.5  | 22/01/11  | 22/01/17 |
+| 1          | 13.37 | 22/01/16  | 22/01/31 |

 Given this table, we would like to ensure - for the sake of illustrational purposes -
 that 6 constraints are satisfied:

-1. All values from column ``date_from`` should be in January 2022.
-2. All values from column ``date_to`` should be in January 2022.
-3. The minimum value in column ``date_from`` should be the first of January 2022.
-4. The maximum value in column ``date_to`` should be the 31st of January 2022.
-5. There is no gap between ``date_from`` and ``date_to``. In other words, every date
+1. All values from column `date_from` should be in January 2022.
+2. All values from column `date_to` should be in January 2022.
+3. The minimum value in column `date_from` should be the first of January 2022.
+4. The maximum value in column `date_to` should be the 31st of January 2022.
+5. There is no gap between `date_from` and `date_to`. In other words, every date
    of January has to be assigned to at least one row for a given product.
-6. There is no overlap between ``date_from`` and ``date_to``. In other words, every
+6. There is no overlap between `date_from` and `date_to`. In other words, every
    date of January has to be assigned to at most one row for a given product.

-
 Assuming that such a table exists in database, we can write a specification against it.

 ```python
@@ -140,17 +139,17 @@ requirements = [prices_req]
 test_constraint = collect_data_tests(requirements)
 ```

-Please note that the ``DateNoOverlap`` and ``DateNoGap`` constraints also exist
-in a slightly different form: ``DateNoOverlap2d`` and ``DateNoGap2d``.
+Please note that the `DateNoOverlap` and `DateNoGap` constraints also exist
+in a slightly different form: `DateNoOverlap2d` and `DateNoGap2d`.
 As the names suggest, these can operate in 'two date dimensions'.

 For example, let's assume a table with four date columns, representing two
 ranges in distinct dimensions, respectively:

-* ``date_from``: Date from when a price is valid
-* ``date_to``: Date until when a price is valid
-* ``date_definition_from``: Date when a price definition was inserted
-* ``date_definition_to``: Date until when a price definition was used
+- `date_from`: Date from when a price is valid
+- `date_to`: Date until when a price is valid
+- `date_definition_from`: Date when a price definition was inserted
+- `date_definition_to`: Date until when a price definition was used

 Analogously to the unidimensional scenario illustrated here, one might care
 for certain constraints in two dimensions.
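
Aside: constraints 5 and 6 from the list in this file could be expressed roughly as follows. This is a sketch only — the method and parameter names (`add_date_no_gap_constraint`, `start_column`, `end_column`, `key_columns`) are assumptions about the datajudge API, not taken from this commit.

```python
# A sketch of the no-gap / no-overlap checks; method and parameter names
# are assumptions about the datajudge API.
import sqlalchemy as sa

from datajudge import WithinRequirement

engine = sa.create_engine("postgresql://user:password@localhost:5432/db")  # assumed URL

prices_req = WithinRequirement.from_table(
    db_name="db", schema_name="public", table_name="prices"
)
# Constraint 5: every January date is covered by at least one price range.
prices_req.add_date_no_gap_constraint(
    start_column="date_from", end_column="date_to", key_columns=["product_id"]
)
# Constraint 6: no January date is covered by more than one price range.
prices_req.add_date_no_overlap_constraint(
    start_column="date_from", end_column="date_to", key_columns=["product_id"]
)
```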

docs/examples/exploration.md (+34 −35)

@@ -19,28 +19,28 @@ usually doesn't.
 In the following we will attempt to illustrate possible usages of datajudge for
 exploration by looking at three simple examples.

-These examples rely on some insight about how most datajudge ``Constraint`` s work under
-the hood. Importantly, ``Constraint`` s typically come with
+These examples rely on some insight about how most datajudge `Constraint` s work under
+the hood. Importantly, `Constraint` s typically come with

-* a ``retrieve`` method: this method fetches relevant data from database, given a
-  ``DataReference``
-* a ``get_factual_value`` method: this is typically a wrapper around ``retrieve`` for the
-  first ``DataReference`` of the given ``Requirement`` / ``Constraint``
-* a ``get_target_value`` method: this is either a wrapper around ``retrieve`` for the
-  second ``DataReference`` in the case of a ``BetweenRequirement`` or an echoing of the
-  ``Constraint`` s key reference value in the case of a ``WithinRequirement``
+- a `retrieve` method: this method fetches relevant data from database, given a
+  `DataReference`
+- a `get_factual_value` method: this is typically a wrapper around `retrieve` for the
+  first `DataReference` of the given `Requirement` / `Constraint`
+- a `get_target_value` method: this is either a wrapper around `retrieve` for the
+  second `DataReference` in the case of a `BetweenRequirement` or an echoing of the
+  `Constraint` s key reference value in the case of a `WithinRequirement`

 Moreover, as is the case when using datajudge for testing purposes, these approaches rely
 on a [sqlalchemy engine](ttps://docs.sqlalchemy.org/en/14/core/connections.html). The
 latter is the gateway to the database at hand.

 ## Example 1: Comparing numbers of rows

-Assume we have two tables in the same database called ``table1`` and ``table2``. Now we
+Assume we have two tables in the same database called `table1` and `table2`. Now we
 would like to compare their numbers of rows. Naturally, we would like to retrieve
 the respective numbers of rows before we can compare them. For this purpose we create
-a ``BetweenTableRequirement`` referring to both tables and add a ``NRowsEquality``
-``Constraint`` onto it.
+a `BetweenTableRequirement` referring to both tables and add a `NRowsEquality`
+`Constraint` onto it.

 ```python
 import sqlalchemy as sa
@@ -60,36 +60,36 @@ n_rows1 = req[0].get_factual_value(engine)
 n_rows2 = req[0].get_target_value(engine)
 ```

-Note that here, we access the first (and only) ``Constraint`` that has been added to the
-``BetweenRequirement`` by writing ``req[0]``. ``Requirements`` are are sequences of
-``Constraint`` s, after all.
+Note that here, we access the first (and only) `Constraint` that has been added to the
+`BetweenRequirement` by writing `req[0]`. `Requirements` are are sequences of
+`Constraint` s, after all.

 Once the numbers of rows are retrieved, we can compare them as we wish. For instance, we
 could compute the absolute and relative growth (or loss) of numbers of rows from
-``table1`` to ``table2``:
+`table1` to `table2`:

 ```python
 absolute_change = abs(n_rows2 - n_rows1)
 relative_change = (absolute_change) / n_rows1 if n_rows1 != 0 else None
 ```

-Importantly, many datajudge staples, such as ``Condition`` s can be used, too. We shall see
+Importantly, many datajudge staples, such as `Condition` s can be used, too. We shall see
 this in our next example.

 ## Example 2: Investigating unique values

-In this example we will suppose that there is a table called ``table`` consisting of
-several columns. Two of its columns are supposed to be called ``col_int`` and
-``col_varchar``. We are now interested in the unique values in these two columns combined.
+In this example we will suppose that there is a table called `table` consisting of
+several columns. Two of its columns are supposed to be called `col_int` and
+`col_varchar`. We are now interested in the unique values in these two columns combined.
 Put differently, we are wondering:

-> Which unique pairs of values in ``col_int`` and ``col_varchar`` have we encountered?
+> Which unique pairs of values in `col_int` and `col_varchar` have we encountered?

-To add to the mix, we will moreover only be interested in tuples in which ``col_int`` has a
+To add to the mix, we will moreover only be interested in tuples in which `col_int` has a
 value of larger than 10.

-As before, we will start off by creating a ``Requirement``. Since we are only dealing with
-a single table this time, we will create a ``WithinRequirement``.
+As before, we will start off by creating a `Requirement`. Since we are only dealing with
+a single table this time, we will create a `WithinRequirement`.

 ```python
 import sqlalchemy as sa
@@ -113,20 +113,20 @@ req.add_uniques_equality_constraint(
 uniques = req[0].get_factual_value(engine)
 ```

-If one was to investigate this ``uniques`` variable further, one could, e.g. see the
+If one was to investigate this `uniques` variable further, one could, e.g. see the
 following:

 ```python
 ([(10, 'hi10'), (11, 'hi11'), (12, 'hi12'), (13, 'hi13'), (14, 'hi14'), (15, 'hi15'), (16, 'hi16'), (17, 'hi17'), (18, 'hi18'), (19, 'hi19')], [1, 100, 12, 1, 7, 8, 1, 1, 1337, 1])
 ```

-This becomes easier to parse when inspecting the underlying ``retrieve`` method of the
-``UniquesEquality`` ``Constraint``: the first value of the tuple corresponds to the list
-of unique pairs in columns ``col_int`` and ``col_varchar``. The second value of the tuple
+This becomes easier to parse when inspecting the underlying `retrieve` method of the
+`UniquesEquality` `Constraint`: the first value of the tuple corresponds to the list
+of unique pairs in columns `col_int` and `col_varchar`. The second value of the tuple
 are the respective counts thereof.

 Moreoever, one could manually customize the underlying SQL query. In order to do so, one
-can use the fact that ``retrieve`` methods typically return an actual result or value
+can use the fact that `retrieve` methods typically return an actual result or value
 as well as the sqlalchemy selections that led to said result or value. We can use these
 selections and compile them to a standard, textual SQL query:

@@ -161,13 +161,13 @@ table. Moreover, for columns present in both tables, we'd like to learn about th
 respective types.

 In order to illustrate such an example, we will again assume that there are two tables
-called ``table1`` and ``table2``, irrespective of prior examples.
+called `table1` and `table2`, irrespective of prior examples.

-We can now create a ``BetweenRequirement`` for these two tables and use the
-``ColumnSubset`` ``Constraint``. As before, we will rely on the ``get_factual_value``
+We can now create a `BetweenRequirement` for these two tables and use the
+`ColumnSubset` `Constraint`. As before, we will rely on the `get_factual_value`
 method to retrieve the values of interest for the first table passed to the
-``BetweenRequirement`` and the ``get_target_value`` method for the second table passed
-to the ``BetweenRequirement``.
+`BetweenRequirement` and the `get_target_value` method for the second table passed
+to the `BetweenRequirement`.

 ```python
 import sqlalchemy as sa
@@ -194,7 +194,6 @@ print(f"Columns present in only table1: {set(columns1) - set(columns2)}")
 print(f"Columns present in only table2: {set(columns2) - set(columns1)}")
 ```

-
 This could, for instance result in the following printout:

 ```
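
Aside: the "compile the selections to SQL" step this file describes might be sketched as follows. The exact shape of `retrieve`'s return value and the `ref` attribute are assumptions based on the prose above, not on this commit.

```python
# A sketch of turning the selections behind a constraint into plain SQL.
# Assumes retrieve(engine, ref) returns the fetched value together with the
# list of sqlalchemy selections, in that order (an assumption).
value, selections = req[0].retrieve(engine, req[0].ref)
for selection in selections:
    # literal_binds inlines bound parameters into the emitted SQL string.
    print(selection.compile(engine, compile_kwargs={"literal_binds": True}))
```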

docs/examples/twitch.md (+15 −15)

@@ -41,18 +41,18 @@ df_v2.to_sql("twitch_v2", engine, schema="public", if_exists="replace")
 df_v1.to_sql("twitch_v1", engine, schema="public", if_exists="replace")
 ```

-Once the tables are stored in a database, we can actually write a ``datajudge``
+Once the tables are stored in a database, we can actually write a `datajudge`
 specification against them. But first, we'll have a look at what the data roughly
 looks like by investigating a random sample of four rows:

 **A sample of the data**

-| channel | watch_time | stream_time | peak_viewers | average_viewers | followers | followers_gained | views_gained | partnered | mature | language |
-|----------|------------|-------------|--------------|-----------------|-----------|------------------|--------------|-----------|--------|-----------|
-| xQcOW | 6196161750 | 215250 | 222720 | 27716 | 3246298 | 1734810 | 93036735 | True | False | English |
-| summit1g | 6091677300 | 211845 | 310998 | 25610 | 5310163 | 1374810 | 89705964 | True | False | English |
-| Gaules | 5644590915 | 515280 | 387315 | 10976 | 1767635 | 1023779 | 102611607 | True | True | Portuguese|
-| ESL_CSGO | 3970318140 | 517740 | 300575 | 7714 | 3944850 | 703986 | 106546942 | True | False | English |
+| channel  | watch_time | stream_time | peak_viewers | average_viewers | followers | followers_gained | views_gained | partnered | mature | language   |
+| -------- | ---------- | ----------- | ------------ | --------------- | --------- | ---------------- | ------------ | --------- | ------ | ---------- |
+| xQcOW    | 6196161750 | 215250      | 222720       | 27716           | 3246298   | 1734810          | 93036735     | True      | False  | English    |
+| summit1g | 6091677300 | 211845      | 310998       | 25610           | 5310163   | 1374810          | 89705964     | True      | False  | English    |
+| Gaules   | 5644590915 | 515280      | 387315       | 10976           | 1767635   | 1023779          | 102611607    | True      | True   | Portuguese |
+| ESL_CSGO | 3970318140 | 517740      | 300575       | 7714            | 3944850   | 703986           | 106546942    | True      | False  | English    |

 Note that we expect both version 1 and version 2 to follow this structure. Due to them
 being assembled at different points in time, merely their rows shows differ.
@@ -80,7 +80,7 @@ express expectations against them. In this example, we have two tables in the sa
 one table per version of the Twitch data.

 Yet, let's start with a straightforward example only using version 2. We want to use our
-domain knowledge that constrains the values of the ``language`` column only to contain letters
+domain knowledge that constrains the values of the `language` column only to contain letters
 and have a length strictly larger than 0.

 ```python
@@ -145,7 +145,7 @@ between_requirement_version.add_uniques_equality_constraint(
 Now having compared the 'same kind of data' between version 1 and version 2,
 we may as well compare 'different kind of data' within version 2, as a means of
 a sanity check. This sanity check consists of checking whether the mean
-``average_viewer`` value of mature channels should deviate at most 10% from
+`average_viewer` value of mature channels should deviate at most 10% from
 the overall mean.

 ```python
@@ -168,7 +168,7 @@ between_requirement_columns.add_numeric_mean_constraint(
 ```

 Lastly, we need to collect all of our requirements in a list and make sure
-``pytest`` can find them by calling ``collect_data_tests``.
+`pytest` can find them by calling `collect_data_tests`.

 ```python
 from datajudge.pytest_integration import collect_data_tests
@@ -268,9 +268,9 @@ to investigate what is wrong with the data, what this has been caused by and how

 Concretely, what exactly do we learn from the error messages?

-* The column ``language`` now has a row with value ``'Sw3d1zh'``. This break two of our
-  constraints. The ``VarCharRegex`` constraint compared the columns' values to a regular
-  expression. The ``UniquesEquality`` constraint expected the unique values of the
-  ``language`` column to not have changed between version 1 and version 2.
-* The mean value of ``average_viewers`` of ``mature`` channels is substantially - more
+- The column `language` now has a row with value `'Sw3d1zh'`. This break two of our
+  constraints. The `VarCharRegex` constraint compared the columns' values to a regular
+  expression. The `UniquesEquality` constraint expected the unique values of the
+  `language` column to not have changed between version 1 and version 2.
+- The mean value of `average_viewers` of `mature` channels is substantially - more
   than our 10% tolerance - lower than the global mean.
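
Aside: the letters-only check on `language` that this file describes (and that the `VarCharRegex` failure above refers to) might look roughly like this. A sketch only — the schema/table names, the regex, and the exact keyword arguments of `add_varchar_regex_constraint` are assumptions.

```python
# A sketch of the letters-only check on the language column; schema/table
# names and exact keyword arguments are assumptions.
from datajudge import WithinRequirement

language_req = WithinRequirement.from_table(
    db_name="tempdb", schema_name="public", table_name="twitch_v2"
)
# Letters only, and at least one character, per the domain knowledge above.
language_req.add_varchar_regex_constraint(
    column="language",
    regex="^[A-Za-z]+$",
)
```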
