Update migrations README and migration 001

ghukill · ghukill · commit 243c7e1c3594 · 2025-06-03T13:47:14.000-04:00
diff --git a/migrations/001_2025_05_30_backfill_run_timestamp_column.py b/migrations/001_2025_05_30_backfill_run_timestamp_column.py
@@ -14,6 +14,15 @@
     b. retrieve the file creation date of the parquet file, this becomes the run_timestamp
     c. rewrite the parquet file with a new run_timestamp column
 
+Side effects:
+
+1- Loss of "Last Modified" date in S3
+
+This migration uses the original "Last Modified" date in S3 that was minted when the
+parquet file was written.  It stores that date in a `run_timestamp` column, so the value
+will persist, but the actual parquet file will LOSE this "Last Modified" date when it is
+recreated.
+
 Usage:
 
 pipenv run python migrations/001_2025_05_30_backfill_run_timestamp_column.py \
@@ -158,6 +167,10 @@ def backfill_parquet_file(
 def get_s3_object_creation_date(file_path: str, filesystem: fs.FileSystem) -> datetime:
     """Get the creation date of an S3 object.
 
+    This function assumes that all datetimes returned come from the same source and are
+    formatted consistently, meaning either all values are timezone aware or none
+    are.
+
     Args:
         file_path: Path to the S3 object
         filesystem: PyArrow S3 filesystem instance
diff --git a/migrations/README.md b/migrations/README.md
@@ -1,10 +1,16 @@
 # TIMDEX Dataset Migrations
 
-This directory includes manual, bulk migrations of data and schema in the TIMDEX parquet dataset.  Consider it like migrations for a SQL database, except a bit more unstructured and ad-hoc.
+This directory stores data and/or schema modifications that were made to the TIMDEX parquet dataset.  Consider them like ["migrations"](https://en.wikipedia.org/wiki/Schema_migration) for a SQL database, but -- at least at the time of this writing -- considerably more informal and ad-hoc.
+
+Unless otherwise noted, it is assumed that these migrations:
+
+  * were manually run by a developer, either on a local machine or through some cloud operations
+  * have already been performed and should not be performed again
+  * do not provide a way to roll back the changes
 
 ##  Structure
 
-Each migration is either a single python file, or a dedicated directory, with that follows the naming convention:
+Each migration is either a single Python file or a dedicated directory that follows this naming convention:
 
   - `###_`: incrementing migration sequence number
   - `YYYY_MM_DD_`: approximate date of migration creation and run
@@ -15,6 +21,8 @@ Examples:
   - `001_2025_05_30_backfill_run_timestamp_column.py` --> single file
   - `002_2025_06_15_remove_errant_parquet_files` --> directory that contains 1+ files
 
+Files inside a migration directory like `002_2025_06_15_remove_errant_parquet_files` are _not_ expected to follow any particular format (though a `README.md` is encouraged to inform future developers how the migration was performed!).
+
 The entrypoint for each migration should contain a docstring at the root of the file with a structure like:
 
 ```python