# timdex-dataset-api

Python library for interacting with a TIMDEX parquet dataset located remotely or in S3. This library is often abbreviated as "TDA".

## Development

- To run unit tests: `make test`
- To lint the repo: `make lint`

The library version number is set in [`timdex_dataset_api/__init__.py`](timdex_dataset_api/__init__.py), e.g.:

```python
__version__ = "2.1.0"
```

Updating the version number when making changes to the library will prompt applications that install it, when they have _their_ dependencies updated, to pick up the new version.

## Installation

This library is designed to be utilized by other projects, and can therefore be added as a dependency directly from the GitHub repository.
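A common way to depend on a library straight from GitHub is a git-based pip requirement. The sketch below is illustrative only: the repository URL and tag are assumptions, so verify them against the actual GitHub project before use.

```shell
# ASSUMED repository URL -- confirm the org/repo name before copying this.
pip install git+https://github.com/MITLibraries/timdex-dataset-api.git

# or pin to a tag or commit for reproducible installs (tag is hypothetical):
pip install "git+https://github.com/MITLibraries/timdex-dataset-api.git@v2.1.0"
```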
```python
# load the dataset, which discovers all parquet files
timdex_dataset.load()

# or, load the dataset but ensure that only current records are ever yielded
timdex_dataset.load(current_records=True)
```

All read methods for `TIMDEXDataset` accept the same group of filters, which are defined in `timdex_dataset_api.dataset.DatasetFilters`. Examples are shown below.

```python
# get batches of records, filtering to a particular run
for batch in timdex_dataset.read_batches_iter(
    source="alma",
    run_date="2025-06-01",
    run_id="abc123",
):
    ...  # do something with the pyarrow batch

# use convenience method to yield only transformed records
# NOTE: this is what TIM uses for indexing to OpenSearch for a given ETL run
for transformed_record in timdex_dataset.read_transformed_records_iter(
    source="aspace",
    run_date="2025-06-01",
    run_id="ghi789",
):
    ...  # do something with the transformed record dictionary

# load all records for a given run into a pandas dataframe
# NOTE: this can be potentially expensive memory-wise if the run is large
run_df = timdex_dataset.read_dataframe(
    source="dspace",
    run_date="2025-06-01",
    run_id="def456",
)
```
### Writing Data
At this time, the only application that writes to the ETL parquet dataset is Transmogrifier.

To write records to the dataset, you must prepare an iterator of `timdex_dataset_api.record.DatasetRecord`. Here is some pseudocode for how a dataset write can work:

```python
from timdex_dataset_api import DatasetRecord, TIMDEXDataset

# there are different ways to achieve this; you just need some kind of
# iterator (e.g. list, generator, etc.)
```
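Any iterable works for the pattern above. Here is a minimal, hedged sketch of the generator approach, using a stub record type since `DatasetRecord`'s actual fields are not shown in this README (the attribute names below are illustrative only):

```python
from dataclasses import dataclass
from typing import Iterator


# Stand-in for timdex_dataset_api.record.DatasetRecord: the real class's
# fields are defined by the library; these attribute names are illustrative.
@dataclass
class StubRecord:
    record_id: str
    source: str


def generate_records(source: str, count: int) -> Iterator[StubRecord]:
    """Lazily yield one record at a time, so a large run never sits fully in memory."""
    for i in range(count):
        yield StubRecord(record_id=f"{source}-{i}", source=source)


# a generator satisfies "some kind of iterator": a writer can consume it lazily
records = list(generate_records("alma", 3))
```

Using a generator rather than a pre-built list keeps memory flat regardless of run size, which matters for the large ETL runs this dataset is built around.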