nextml-code · samedii · Jan 3, 2025
diff --git a/.cursorrules b/.cursorrules
@@ -0,0 +1,8 @@
+- Use pydantic 2
+- Pytest
+- Use black formatting
+- Avoid methods with side effects; if they are needed, add a "_" suffix
+- Prefer pathlib over os
+- Prefer getter method names like `tasks` over `get_tasks`
+- Commands need to be run using `poetry run <command>`
+- Use simple tests with a bit of logging that we can run with `poetry run pytest -s` to check that the code works as expected
diff --git a/.github/workflows/gh-pages.yml b/.github/workflows/gh-pages.yml
@@ -0,0 +1,36 @@
+name: Deploy Documentation
+
+on:
+  push:
+    branches:
+      - master
+  workflow_dispatch:
+
+permissions:
+  contents: write
+
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.10'
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install poetry
+          poetry install
+
+      - name: Build documentation
+        run: poetry run mkdocs build
+
+      - name: Deploy to GitHub Pages
+        uses: peaceiris/actions-gh-pages@v3
+        with:
+          github_token: ${{ secrets.GITHUB_TOKEN }}
+          publish_dir: ./site 
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
@@ -9,7 +9,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: [3.8]
+        python-version: [3.9]
     steps:
       - uses: actions/checkout@v2
       - name: Set up Python ${{ matrix.python-version }}

diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -7,7 +7,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: [3.7, 3.8, 3.9]
+        python-version: [3.9, "3.10"]
     steps:
       - uses: actions/checkout@v2
       - name: Set up Python ${{ matrix.python-version }}
@@ -41,34 +41,3 @@ jobs:
       - name: Build wheels
         run: |
           poetry build
-
-  build-docs:
-    runs-on: ubuntu-latest
-    strategy:
-      matrix:
-        python-version: [3.8]
-
-    steps:
-      - uses: actions/checkout@v2
-      - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v2
-        with:
-          python-version: ${{ matrix.python-version }}
-
-      - name: Cache pip
-        uses: actions/cache@v2
-        with:
-          path: ~/.cache/pip
-          key: ${{ runner.os }}-pip-${{ hashFiles('docs/source/requirements.txt') }}-${ GITHUB_REF }
-          restore-keys: |
-            ${{ runner.os }}-pip-
-            ${{ runner.os }}-
-
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          pip install -r docs/source/requirements.txt
-
-      - name: Build html
-        run: |
-          (cd docs && make html)
diff --git a/.gitignore b/.gitignore
@@ -7,6 +7,7 @@ dist
 .eggs/
 build/
 *.pyc
+site/
 
 AUTHORS
 ChangeLog
diff --git a/.readthedocs.yml b/.readthedocs.yml
diff --git a/README.md b/README.md
@@ -0,0 +1,124 @@
+# Pytorch Datastream
+
+[![PyPI version](https://badge.fury.io/py/pytorch-datastream.svg)](https://badge.fury.io/py/pytorch-datastream)
+[![Python versions](https://img.shields.io/pypi/pyversions/pytorch-datastream.svg)](https://pypi.python.org/pypi/pytorch-datastream)
+[![Documentation Status](https://github.com/nextml-code/pytorch-datastream/actions/workflows/deploy-docs.yml/badge.svg)](https://nextml-code.github.io/pytorch-datastream)
+[![License](https://img.shields.io/pypi/l/pytorch-datastream.svg)](https://pypi.python.org/pypi/pytorch-datastream)
+
+This is a simple library for creating readable dataset pipelines and reusing best practices for issues such as imbalanced datasets. There are just two components to keep track of: `Dataset` and `Datastream`.
+
+`Dataset` is a simple mapping between an index and an example. It provides pipelining of functions in a readable syntax originally adapted from TensorFlow 2's `tf.data.Dataset`.
+
+`Datastream` combines a `Dataset` and a sampler into a stream of examples. It provides a simple solution to oversampling / stratification, weighted sampling, and finally converting to a `torch.utils.data.DataLoader`.
+
+## Install
+
+```bash
+poetry add pytorch-datastream
+```
+
+Or, for the old-timers:
+
+```bash
+pip install pytorch-datastream
+```
+
+## Usage
+
+The list below showcases functions that are useful in most standard and non-standard cases. It is not exhaustive. See the [documentation](https://nextml-code.github.io/pytorch-datastream) for a fuller API reference and more usage examples.
+
+```python
+Dataset.from_subscriptable
+Dataset.from_dataframe
+Dataset
+    .map
+    .subset
+    .split
+    .cache
+    .with_columns
+
+Datastream.merge
+Datastream.zip
+Datastream
+    .map
+    .data_loader
+    .zip_index
+    .update_weights_
+    .update_example_weight_
+    .weight
+    .state_dict
+    .load_state_dict
+```
+
+### Simple image dataset example
+
+Here's a basic example of loading images from a directory:
+
+```python
+from datastream import Dataset
+from pathlib import Path
+from PIL import Image
+
+# Assuming images are in a directory structure like:
+# images/
+#   class1/
+#     image1.jpg
+#     image2.jpg
+#   class2/
+#     image3.jpg
+#     image4.jpg
+
+image_dir = Path("images")
+image_paths = list(image_dir.glob("**/*.jpg"))
+
+dataset = (
+    Dataset.from_paths(
+        image_paths,
+        pattern=r".*/(?P<class_name>\w+)/(?P<image_name>\w+).jpg"
+    )
+    .map(lambda row: dict(
+        image=Image.open(row["path"]),
+        class_name=row["class_name"],
+        image_name=row["image_name"],
+    ))
+)
+
+# Access an item from the dataset
+
+first_item = dataset[0]
+print(f"Class: {first_item['class_name']}, Image name: {first_item['image_name']}")
+```
+
+### Merge / stratify / oversample datastreams
+
+Each fruit datastream below repeatedly yields the string of its fruit type.
+
+```python
+>>> datastream = Datastream.merge([
+...     (apple_datastream, 2),
+...     (pear_datastream, 1),
+...     (banana_datastream, 1),
+... ])
+>>> next(iter(datastream.data_loader(batch_size=8)))
+['apple', 'apple', 'pear', 'banana', 'apple', 'apple', 'pear', 'banana']
+```
+
+### Zip independently sampled datastreams
+
+Each fruit datastream below repeatedly yields the string of its fruit type.
+
+```python
+>>> datastream = Datastream.zip([
+...     apple_datastream,
+...     Datastream.merge([pear_datastream, banana_datastream]),
+... ])
+>>> next(iter(datastream.data_loader(batch_size=4)))
+[('apple', 'pear'), ('apple', 'banana'), ('apple', 'pear'), ('apple', 'banana')]
+```
+
+### More usage examples
+
+See the [documentation](https://nextml-code.github.io/pytorch-datastream) for more usage examples.
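
Note on the merge example in the README diff: the `(datastream, n)` pairs appear to set how many examples are drawn from each stream per round, which would explain the 2:1:1 apple/pear/banana pattern in the sample output. Below is a minimal standard-library sketch of that interleaving, under that assumption. It is not pytorch-datastream's implementation: `merge_by_ratio` and the `cycle`-based fruit streams are hypothetical stand-ins, and the library pairs datasets with samplers, which this sketch does not model.

```python
from itertools import cycle, islice
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def merge_by_ratio(streams: List[Tuple[Iterable[T], int]]) -> Iterator[T]:
    """Hypothetical helper: take `count` items from each stream in turn, forever.

    Assumes every input stream is infinite, as the fruit streams below are.
    """
    iterators = [(iter(stream), count) for stream, count in streams]
    while True:
        for iterator, count in iterators:
            # islice takes at most `count` items from this stream per round.
            yield from islice(iterator, count)


# Stand-ins for the README's fruit datastreams: each repeats one string forever.
apple = cycle(["apple"])
pear = cycle(["pear"])
banana = cycle(["banana"])

merged = merge_by_ratio([(apple, 2), (pear, 1), (banana, 1)])

# One "batch" of 8 examples, mirroring data_loader(batch_size=8) in the README.
batch = list(islice(merged, 8))
print(batch)
```

The batch comes out in the same order as the README's sample output, which is consistent with reading the integer as a per-round count rather than a probability.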