doc: migrate to mkdocs #98

Merged 5 commits on Jan 3, 2025
8 changes: 8 additions & 0 deletions .cursorrules
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
- Use pydantic 2
- Pytest
- Use black formatting
- Avoid methods with side effects; if one is needed, add a "_" suffix to its name
- Prefer pathlib over os
- Prefer getter method names like `tasks` over `get_tasks`
- Commands need to be run using `poetry run <command>`
- Use simple tests with a bit of logging that we can run with `poetry run pytest -s` to check that the code works as expected
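
As a rough illustration of how the naming rules above might look in practice, here is a minimal sketch. `TaskList`, `with_task`, and `save_` are hypothetical names invented for this example, and it uses a stdlib dataclass instead of pydantic 2 only to stay dependency-free.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Tuple


@dataclass(frozen=True)
class TaskList:
    """Hypothetical container, used only to illustrate the naming rules."""

    path: Path  # pathlib is preferred over os.path
    items: Tuple[str, ...] = ()

    def tasks(self) -> Tuple[str, ...]:
        # Getter is named `tasks`, not `get_tasks`.
        return self.items

    def with_task(self, task: str) -> "TaskList":
        # No side effects: returns a new object instead of mutating this one.
        return TaskList(self.path, self.items + (task,))

    def save_(self) -> None:
        # The trailing underscore flags the side effect (writing to disk).
        self.path.write_text("\n".join(self.items))
```

A simple test for such a class could then be run with `poetry run pytest -s`, as the last rule suggests.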
36 changes: 36 additions & 0 deletions .github/workflows/gh-pages.yml
@@ -0,0 +1,36 @@
name: Deploy Documentation

on:
  push:
    branches:
      - master
  workflow_dispatch:

permissions:
  contents: write

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install poetry
          poetry install

      - name: Build documentation
        run: poetry run mkdocs build

      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./site
2 changes: 1 addition & 1 deletion .github/workflows/publish.yml
@@ -9,7 +9,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: [3.8]
+        python-version: [3.9]
     steps:
       - uses: actions/checkout@v2
       - name: Set up Python ${{ matrix.python-version }}
33 changes: 1 addition & 32 deletions .github/workflows/test.yml
@@ -7,7 +7,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: [3.7, 3.8, 3.9]
+        python-version: [3.9, "3.10"]
     steps:
       - uses: actions/checkout@v2
       - name: Set up Python ${{ matrix.python-version }}
@@ -41,34 +41,3 @@ jobs:
       - name: Build wheels
         run: |
           poetry build
-
-  build-docs:
-    runs-on: ubuntu-latest
-    strategy:
-      matrix:
-        python-version: [3.8]
-
-    steps:
-      - uses: actions/checkout@v2
-      - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v2
-        with:
-          python-version: ${{ matrix.python-version }}
-
-      - name: Cache pip
-        uses: actions/cache@v2
-        with:
-          path: ~/.cache/pip
-          key: ${{ runner.os }}-pip-${{ hashFiles('docs/source/requirements.txt') }}-${ GITHUB_REF }
-          restore-keys: |
-            ${{ runner.os }}-pip-
-            ${{ runner.os }}-
-
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          pip install -r docs/source/requirements.txt
-
-      - name: Build html
-        run: |
-          (cd docs && make html)
1 change: 1 addition & 0 deletions .gitignore
@@ -7,6 +7,7 @@ dist
 .eggs/
 build/
 *.pyc
+site/

 AUTHORS
 ChangeLog
30 changes: 0 additions & 30 deletions .readthedocs.yml

This file was deleted.

124 changes: 124 additions & 0 deletions README.md
@@ -0,0 +1,124 @@
# Pytorch Datastream

[![PyPI version](https://badge.fury.io/py/pytorch-datastream.svg)](https://badge.fury.io/py/pytorch-datastream)
[![Python versions](https://img.shields.io/pypi/pyversions/pytorch-datastream.svg)](https://pypi.python.org/pypi/pytorch-datastream)
[![Documentation Status](https://github.com/nextml-code/pytorch-datastream/actions/workflows/deploy-docs.yml/badge.svg)](https://nextml-code.github.io/pytorch-datastream)
[![License](https://img.shields.io/pypi/l/pytorch-datastream.svg)](https://pypi.python.org/pypi/pytorch-datastream)

This is a simple library for creating readable dataset pipelines and reusing best practices for issues such as imbalanced datasets. There are just two components to keep track of: `Dataset` and `Datastream`.

`Dataset` is a simple mapping between an index and an example. It provides pipelining of functions in a readable syntax originally adapted from TensorFlow 2's `tf.data.Dataset`.

`Datastream` combines a `Dataset` and a sampler into a stream of examples. It provides a simple solution to oversampling / stratification, weighted sampling, and finally converting to a `torch.utils.data.DataLoader`.
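
To make the two ideas concrete, here is a conceptual sketch in plain Python. It is not the library's implementation: `MiniDataset` and `weighted_stream` are hypothetical names invented for this illustration, and the real `Dataset` and `Datastream` classes offer far more.

```python
import random
from typing import Any, Callable, Iterator, Sequence, Tuple


class MiniDataset:
    """Toy stand-in for the dataset idea: a mapping from index to example."""

    def __init__(self, source: Sequence[Any], functions: Tuple[Callable, ...] = ()):
        self.source = source
        self.functions = functions

    def __len__(self) -> int:
        return len(self.source)

    def __getitem__(self, index: int) -> Any:
        # Apply the pipelined functions lazily, in the order they were added.
        example = self.source[index]
        for function in self.functions:
            example = function(example)
        return example

    def map(self, function: Callable) -> "MiniDataset":
        # Returns a new dataset, so pipelines can be chained without mutation.
        return MiniDataset(self.source, self.functions + (function,))


def weighted_stream(
    dataset: MiniDataset, weights: Sequence[float], rng: random.Random
) -> Iterator[Any]:
    """Toy stand-in for the datastream idea: a dataset plus a weighted sampler."""
    indices = range(len(dataset))
    while True:
        (index,) = rng.choices(indices, weights=weights, k=1)
        yield dataset[index]
```

Raising the weight of a rare example makes it appear more often in the stream, which is the basic mechanism behind oversampling.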

## Install

```bash
poetry add pytorch-datastream
```

Or, for the old-timers:

```bash
pip install pytorch-datastream
```

## Usage

The list below showcases functions that are useful in most standard and non-standard cases. It is not exhaustive. See the [documentation](https://nextml-code.github.io/pytorch-datastream) for a fuller overview of the API and its usage.

```python
Dataset.from_subscriptable
Dataset.from_dataframe
Dataset
    .map
    .subset
    .split
    .cache
    .with_columns

Datastream.merge
Datastream.zip
Datastream
    .map
    .data_loader
    .zip_index
    .update_weights_
    .update_example_weight_
    .weight
    .state_dict
    .load_state_dict
```

### Simple image dataset example

Here's a basic example of loading images from a directory:

```python
from datastream import Dataset
from pathlib import Path
from PIL import Image

# Assuming images are in a directory structure like:
# images/
#     class1/
#         image1.jpg
#         image2.jpg
#     class2/
#         image3.jpg
#         image4.jpg

image_dir = Path("images")
image_paths = list(image_dir.glob("*/*.jpg"))

dataset = (
    Dataset.from_paths(
        image_paths,
        # Named groups become fields on each row; the dot before "jpg" is escaped.
        pattern=r".*/(?P<class_name>\w+)/(?P<image_name>\w+)\.jpg",
    )
    .map(lambda row: dict(
        image=Image.open(row["path"]),
        class_name=row["class_name"],
        image_name=row["image_name"],
    ))
)

# Access an item from the dataset
first_item = dataset[0]
print(f"Class: {first_item['class_name']}, Image name: {first_item['image_name']}")
```
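
To see what the `pattern` argument extracts, the same regular expression can be tried with the standard library alone. This is only a sketch of the named-group matching, not of how `Dataset.from_paths` is implemented; `parse_image_path` is a hypothetical helper.

```python
import re

# Same pattern as in the example above, with the dot before "jpg" escaped.
PATTERN = re.compile(r".*/(?P<class_name>\w+)/(?P<image_name>\w+)\.jpg")


def parse_image_path(path: str) -> dict:
    """Return the path together with the fields captured by the named groups."""
    match = PATTERN.match(path)
    if match is None:
        raise ValueError(f"Path does not match the pattern: {path}")
    return dict(path=path, **match.groupdict())
```

Note that the leading `.*/` requires at least one parent directory before the class folder, so `images/class1/image1.jpg` matches while `class1/image1.jpg` does not.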

### Merge / stratify / oversample datastreams

Each fruit datastream below repeatedly yields the string of its fruit type.

```python
>>> datastream = Datastream.merge([
...     (apple_datastream, 2),
...     (pear_datastream, 1),
...     (banana_datastream, 1),
... ])
>>> next(iter(datastream.data_loader(batch_size=8)))
['apple', 'apple', 'pear', 'banana', 'apple', 'apple', 'pear', 'banana']
```

### Zip independently sampled datastreams

Each fruit datastream below repeatedly yields the string of its fruit type.

```python
>>> datastream = Datastream.zip([
...     apple_datastream,
...     Datastream.merge([pear_datastream, banana_datastream]),
... ])
>>> next(iter(datastream.data_loader(batch_size=4)))
[('apple', 'pear'), ('apple', 'banana'), ('apple', 'pear'), ('apple', 'banana')]
```

### More usage examples

See the [documentation](https://nextml-code.github.io/pytorch-datastream) for more usage examples.
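
As a rough mental model of the merge weights used in the merge example earlier in this README, the sketch below interleaves plain Python iterators, drawing a fixed number of items from each in turn. It is an illustration under that assumption, not the library's code; `interleave` is a hypothetical name, and the input streams are assumed to be infinite.

```python
from itertools import islice, repeat
from typing import Any, Iterable, Iterator, Sequence, Tuple


def interleave(streams_and_counts: Sequence[Tuple[Iterable[Any], int]]) -> Iterator[Any]:
    """Cycle through the streams, taking `count` items from each per round."""
    iterators = [(iter(stream), count) for stream, count in streams_and_counts]
    while True:
        for iterator, count in iterators:
            for _ in range(count):
                # Assumes every stream is infinite, like the fruit datastreams.
                yield next(iterator)


merged = interleave([
    (repeat("apple"), 2),
    (repeat("pear"), 1),
    (repeat("banana"), 1),
])
first_batch = list(islice(merged, 8))
```

With these counts, `first_batch` follows the same apple, apple, pear, banana pattern as the merged data loader output shown in that example.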