Lab0 is a lightweight data preprocessing toolbox written in Python.
It provides a small library of functions for cleaning and transforming data (numeric and text), plus a command-line interface (CLI) built with Click so you can run common preprocessing tasks from the terminal.
This README explains what the repository contains, how to use the library programmatically, and how to run and test the CLI.
- src/preprocessing.py: core logic, with functions for missing-value handling, deduplication, numeric transforms (normalization, z-score, clipping, log), text processing, and list utilities (flatten, shuffle).
- src/cli.py: Click-based command-line interface exposing the most useful preprocessing functions in grouped commands: clean, numeric, text, struct.
- tests/: pytest test suites (unit + CLI integration).
- pyproject.toml: project configuration and development dependencies.
- Missing value handling:
  - remove or fill missing values (handles None, '', NaN)
- Deduplication:
  - remove duplicates preserving first-appearance order
- Numerical transformations:
  - min-max normalization, z-score standardization
  - clipping to a given range
  - convert string lists to integers (ignores non-integers)
  - natural logarithm scaling (positive values only)
- Text processing:
  - tokenization (lowercase, keep alphanumeric and spaces)
  - remove punctuation (preserve case)
  - remove stop-words
- List utilities:
  - flatten nested lists
  - reproducible shuffle with seed
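
To make the cleaning and numeric features concrete, here is a minimal standard-library sketch of how missing-value removal, order-preserving deduplication, min-max normalization, and clipping could behave. The helper names (is_missing, drop_missing, unique_in_order, minmax, clip) are hypothetical and are not the functions exported by src/preprocessing.py, whose implementation and return types may differ.

```python
import math


def is_missing(value):
    """Treat None, the empty string, and NaN as missing."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def drop_missing(values):
    """Return the values that are not missing, in their original order."""
    return [v for v in values if not is_missing(v)]


def unique_in_order(values):
    """Remove duplicates, keeping the first appearance of each value."""
    # dict keys keep insertion order (Python 3.7+); values must be hashable.
    return list(dict.fromkeys(values))


def minmax(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: map everything to new_min. This is an assumption;
        # the library may handle this case differently.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def clip(values, low, high):
    """Limit every value to the closed range [low, high]."""
    return [min(max(v, low), high) for v in values]


print(drop_missing([1, None, 2, "", 3, float("nan"), 4]))  # [1, 2, 3, 4]
print(unique_in_order([1, 2, 2, 3, 3]))                    # [1, 2, 3]
print(minmax([1, 2, 3, 4, 5]))                             # [0.0, 0.25, 0.5, 0.75, 1.0]
print(clip([1, 5, 10, 15, 20], 5, 15))                     # [5, 5, 10, 15, 15]
```

The sketch returns plain lists to stay dependency-free; code that relies on the real library should check the return type of each function rather than assume lists.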
.
├── LICENSE.txt
├── main.py
├── pyproject.toml
├── README.md
├── src
│ ├── cli.py
│ ├── __init__.py
│ └── preprocessing.py
├── tests
│ ├── test_cli.py
│ └── test_logic.py
├── tree.txt
└── uv.lock
3 directories, 11 files
- Install uv:
    curl -LsSf https://astral.sh/uv/install.sh | sh
- Clone the repository:
    git clone https://github.com/NeoLafuente/Lab0.git
    cd Lab0
- Create and activate a virtual environment:
    uv sync
    source .venv/bin/activate

Import functions from src.preprocessing to use them inside your Python scripts or notebooks.
Example:
from src.preprocessing import normalize_minmax, remove_missing_values
data = [1, None, 2, '', 3, float('nan'), 4]
clean = remove_missing_values(data) # array([1,2,3,4])
normalized = normalize_minmax([1,2,3,4,5]) # array([0.,0.25,0.5,0.75,1.])

Text example:
from src.preprocessing import tokenize_alphanumeric, remove_stopwords
text = "The quick brown fox jumps over the lazy dog"
tok = tokenize_alphanumeric(text) # 'the quick brown fox jumps over the lazy dog'
filtered = remove_stopwords(tok, {'the', 'over'})
# 'quick brown fox jumps lazy dog'

The CLI is implemented in src/cli.py. Run it as a module with Python:
python -m src.cli --help
Or call specific groups and commands. The CLI has 4 groups: clean, numeric, text, struct.
General form:
- Commands accept a positional comma-separated list of values (legacy), or the explicit --values (-v) option (preferred when an input element starts with -).
- Many commands have options (for example --min/--max for normalization/clip).
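
As an illustration of the comma-separated input format, the sketch below shows one way a string such as "1,2,,3,None,4" could be turned into a Python list before preprocessing. parse_values is a hypothetical helper written for this README; the actual parsing in src/cli.py may differ, in particular in which tokens it treats as missing.

```python
def parse_values(raw):
    """Split a comma-separated string into numbers, strings, and None.

    Assumptions (not taken from src/cli.py): empty tokens and the literal
    "None" become None; tokens that parse as int or float become numbers;
    everything else stays a string.
    """
    parsed = []
    for token in raw.split(","):
        token = token.strip()
        if token == "" or token == "None":
            parsed.append(None)
            continue
        try:
            parsed.append(int(token))
        except ValueError:
            try:
                parsed.append(float(token))
            except ValueError:
                # Not numeric: keep the raw text.
                parsed.append(token)
    return parsed


print(parse_values("1,2,,3,None,4"))  # [1, 2, None, 3, None, 4]
print(parse_values("-5,-3,0"))        # [-5, -3, 0]
print(parse_values("1,2,abc,4.5"))    # [1, 2, 'abc', 4.5]
```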
Examples
- Clean
  - Remove missing values:
      python -m src.cli clean remove-missing "1,2,,3,None,4"
    Or (preferred if a value begins with -):
      python -m src.cli clean remove-missing --values "-1,-2,,3"
  - Fill missing values:
      python -m src.cli clean fill-missing "1,2,,3" --fill-value 99
- Numeric
  - Normalize:
      python -m src.cli numeric normalize "1,2,3,4,5"
      # or explicitly:
      python -m src.cli numeric normalize --values "-5,-3,0" --min -1 --max 1
  - Standardize:
      python -m src.cli numeric standardize "1,2,3,4,5"
  - Clip:
      python -m src.cli numeric clip "1,5,10,15,20" --min 5 --max 15
  - Convert to integers:
      python -m src.cli numeric to-integers "1,2,abc,4"
  - Logarithmic:
      python -m src.cli numeric logarithmic "1,10,100"
- Text
  - Tokenize (lowercase, alphanumeric only):
      python -m src.cli text tokenize "Hello, World! 123"
  - Remove punctuation (preserve case):
      python -m src.cli text remove-punctuation "user@email.com"
  - Remove stop-words (customizable):
      python -m src.cli text remove-stopwords "the quick brown fox" --stopwords "the,a,is"
- Struct
  - Shuffle:
      python -m src.cli struct shuffle "1,2,3,4,5" --seed 42
  - Flatten (expects JSON formatted list-of-lists):
      python -m src.cli struct flatten '[[1,2],[3,4]]'
  - Unique:
      python -m src.cli struct unique "1,2,2,3,3"
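
The text and struct commands above correspond to simple operations that can be approximated with the standard library. The sketch below is an assumption about their behavior based on the descriptions in this README, using hypothetical helper names; it is not the code in src/preprocessing.py.

```python
import random
import re
import string


def tokenize(text):
    """Lowercase the text and keep only ASCII letters, digits, and spaces."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower())


def strip_punctuation(text):
    """Delete ASCII punctuation characters while preserving case."""
    return text.translate(str.maketrans("", "", string.punctuation))


def drop_stopwords(text, stopwords):
    """Remove whitespace-separated words that appear in stopwords."""
    return " ".join(w for w in text.split() if w not in stopwords)


def flatten(nested):
    """Flatten one level of nesting: [[1, 2], [3, 4]] -> [1, 2, 3, 4]."""
    return [item for sub in nested for item in sub]


def seeded_shuffle(values, seed):
    """Return a shuffled copy; the same seed always gives the same order."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled


print(tokenize("Hello, World! 123"))        # hello world 123
print(strip_punctuation("user@email.com"))  # useremailcom
print(drop_stopwords("the quick brown fox", {"the", "a", "is"}))  # quick brown fox
print(flatten([[1, 2], [3, 4]]))            # [1, 2, 3, 4]

# Reproducibility check: two runs with the same seed agree.
first = seeded_shuffle([1, 2, 3, 4, 5], seed=42)
second = seeded_shuffle([1, 2, 3, 4, 5], seed=42)
print(first == second)                      # True
```

Using a private random.Random(seed) instance, rather than seeding the global generator, keeps the shuffle reproducible without affecting other code that draws random numbers.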
Notes about negative numbers
- If a comma-separated list starts with a minus sign (e.g. -5,-3,0), Click might parse the token as an option. The CLI supports --values/-v to pass such lists safely (see examples above). You can still pass a positional argument for positive-only lists.
Unit and integration tests are written with pytest and Click's testing utilities.
Run tests:
# Run the full test suite
uv run python -m pytest -v
# With coverage reporting (requires pytest-cov)
uv run python -m pytest -v --cov=src

- Tests live in the tests/ folder.
- The tests include unit tests for each function and integration tests invoking the CLI via Click's CliRunner.
- Some tests deliberately use --values for lists that start with - to avoid Click option parsing issues.
Contributions are welcome. Typical workflow:
- Fork the repository.
- Create a topic branch:
    git checkout -b feat/my-change
- Run tests locally and add tests for your change.
- Open a pull request describing your change.
Please follow the existing style and add unit tests for new logic.
This project is licensed under the MIT License; see the LICENSE.txt file for details.