Commit 243c7e1

Update migrations README and migration 001
1 parent 2fa84df commit 243c7e1

2 files changed: +23 -2 lines changed


migrations/001_2025_05_30_backfill_run_timestamp_column.py

Lines changed: 13 additions & 0 deletions
@@ -14,6 +14,15 @@
 b. retrieve the file creation date of the parquet file, this becomes the run_timestamp
 c. rewrite the parquet file with a new run_timestamp column
 
+Side effects:
+
+1- Loss of "Last Modified" date in S3
+
+This migration is using the original "Last Modified" date in S3 that was minted when the
+parquet file was written. It is storing that data in a `run_timestamp` column and thus
+will persist, but the actual parquet file will LOSE this "Last Modified" date when it is
+recreated.
+
 Usage:
 
 pipenv run python migrations/001_2025_05_30_backfill_run_timestamp_column.py \
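
Review note: the side effect documented in this hunk can be reproduced locally. The sketch below is only an analogy, not the migration's code. It assumes a local file's modification time stands in for the S3 "Last Modified" date and a JSON file stands in for the parquet file; the function name `rewrite_with_preserved_timestamp` and the 2025-01-01 date are hypothetical.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def rewrite_with_preserved_timestamp(path: str) -> datetime:
    """Rewrite a JSON file, storing its pre-rewrite mtime inside the file."""
    # Capture the modification time BEFORE rewriting; the rewrite replaces it.
    original = datetime.fromtimestamp(os.stat(path).st_mtime, tz=timezone.utc)
    with open(path) as f:
        records = json.load(f)
    for record in records:
        record["run_timestamp"] = original.isoformat()
    with open(path, "w") as f:
        json.dump(records, f)
    return original


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.json")
    with open(path, "w") as f:
        json.dump([{"id": 1}], f)
    # Pretend the file was originally written on 2025-01-01 (hypothetical date).
    old = datetime(2025, 1, 1, tzinfo=timezone.utc).timestamp()
    os.utime(path, (old, old))

    preserved = rewrite_with_preserved_timestamp(path)
    # The file's own mtime now reflects the rewrite, not the original write.
    new_mtime = datetime.fromtimestamp(os.stat(path).st_mtime, tz=timezone.utc)
    with open(path) as f:
        stored = json.load(f)[0]["run_timestamp"]
```

After the rewrite, the original date survives only in the stored `run_timestamp` value. The file's own modification time reflects the rewrite, which is the loss the docstring warns about.
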
@@ -158,6 +167,10 @@ def backfill_parquet_file(
 def get_s3_object_creation_date(file_path: str, filesystem: fs.FileSystem) -> datetime:
     """Get the creation date of an S3 object.
 
+    This function assumes that all datetimes coming back are coming from the same source
+    and will be formatted similarly, which means either all values are timezone aware or
+    not.
+
     Args:
         file_path: Path to the S3 object
         filesystem: PyArrow S3 filesystem instance
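
Review note: the assumption added to this docstring matters because Python refuses to order a timezone-aware datetime against a naive one. A minimal sketch of a defensive normalization follows. The helper `ensure_utc` is hypothetical and not part of the migration, and treating naive values as UTC is an assumption made here for illustration.

```python
from datetime import datetime, timezone


def ensure_utc(value: datetime) -> datetime:
    """Return a timezone-aware UTC datetime; naive input is ASSUMED to be UTC."""
    if value.tzinfo is None or value.tzinfo.utcoffset(value) is None:
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


naive = datetime(2025, 5, 30, 12, 0, 0)
aware = datetime(2025, 5, 30, 12, 0, 0, tzinfo=timezone.utc)

# Ordering a naive value against an aware one raises TypeError.
try:
    naive < aware
    mixed_comparison_failed = False
except TypeError:
    mixed_comparison_failed = True

# Once normalized, the values can be compared and sorted safely.
normalized = sorted(ensure_utc(d) for d in (aware, naive))
```

If every timestamp really does come from the same source, as the docstring assumes, this normalization changes nothing. It only guards against a mixed batch.
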

migrations/README.md

Lines changed: 10 additions & 2 deletions
@@ -1,10 +1,16 @@
 # TIMDEX Dataset Migrations
 
-This directory includes manual, bulk migrations of data and schema in the TIMDEX parquet dataset. Consider it like migrations for a SQL database, except a bit more unstructured and ad-hoc.
+This directory stores data and/or schema modifications that were made to the TIMDEX parquet dataset. Consider them like ["migrations"](https://en.wikipedia.org/wiki/Schema_migration) for a SQL database, but -- at least at the time of this writing -- considerably more informal and ad-hoc.
+
+Unless otherwise noted, it is assumed that these migrations:
+
+* were manually run by a developer, either on a local machine or some cloud operations
+* have been performed already and should not be performed again
+* do not contain a way to roll back the changes
 
 ## Structure
 
-Each migration is either a single python file, or a dedicated directory, with that follows the naming convention:
+Each migration is either a single python file, or a dedicated directory, that follows this naming convention:
 
 - `###_`: incrementing migration sequence number
 - `YYYY_MM_DD_`: approximate date of migration creation and run
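
Review note: the naming convention lends itself to a quick check. The sketch below is hypothetical and not part of the repository. The pattern `MIGRATION_NAME` and the function `parse_migration_name` are illustrative names, and the trailing descriptive component is an assumption inferred from the two example names in this README, since that part of the convention is not shown in this hunk.

```python
import re
from datetime import date

# Sequence number, date, then an assumed free-form lowercase name.
# The optional ".py" suffix covers single-file migrations.
MIGRATION_NAME = re.compile(
    r"^(?P<sequence>\d{3})_"
    r"(?P<year>\d{4})_(?P<month>\d{2})_(?P<day>\d{2})_"
    r"(?P<name>[a-z0-9_]+?)(?:\.py)?$"
)


def parse_migration_name(entry: str) -> tuple[int, date, str]:
    """Split a migration file or directory name into (sequence, date, name)."""
    match = MIGRATION_NAME.match(entry)
    if match is None:
        raise ValueError(f"not a migration name: {entry!r}")
    created = date(int(match["year"]), int(match["month"]), int(match["day"]))
    return int(match["sequence"]), created, match["name"]


single_file = parse_migration_name("001_2025_05_30_backfill_run_timestamp_column.py")
directory = parse_migration_name("002_2025_06_15_remove_errant_parquet_files")
```

Building the `date` object also rejects impossible dates such as month 13, which a purely textual pattern would accept.
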
@@ -15,6 +21,8 @@ Examples:
 - `001_2025_05_30_backfill_run_timestamp_column.py` --> single file
 - `002_2025_06_15_remove_errant_parquet_files` --> directory that contains 1+ files
 
+Files inside a migration directory like `002_2025_06_15_remove_errant_parquet_files` are _not_ expected to follow any particular format (though a `README.md` is encouraged to inform future developers how it was performed!).
+
 The entrypoint for each migration should contain a docstring at the root of the file with a structure like:
 
 ```python
