Lab0

Lab0 is a lightweight data preprocessing toolbox written in Python.
It provides a small library of functions for cleaning and transforming data (numeric and text), plus a command-line interface (CLI) built with Click so you can run common preprocessing tasks from the terminal.

This README explains what the repository contains, how to use the library programmatically, and how to run and test the CLI.


Contents

  • src/preprocessing.py — core logic: functions for missing-value handling, deduplication, numeric transforms (normalization, z-score, clipping, log), text processing, and list utilities (flatten, shuffle).
  • src/cli.py — Click-based command line interface exposing the most useful preprocessing functions in grouped commands: clean, numeric, text, struct.
  • tests/ — pytest test suites (unit + CLI integration).
  • pyproject.toml — project configuration and development environment libraries.

Features

  • Missing value handling:
    • remove or fill missing values (handles None, '', NaN)
  • Deduplication:
    • remove duplicates preserving first-appearance order
  • Numerical transformations:
    • min-max normalization, z-score standardization
    • clipping to a given range
    • convert string lists to integers (ignores non-integers)
    • natural logarithm scaling (positive values only)
  • Text processing:
    • tokenization (lowercase, keep alphanumeric and spaces)
    • remove punctuation (preserve case)
    • remove stop-words
  • List utilities:
    • flatten nested lists
    • reproducible shuffle with seed
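The missing-value and deduplication behaviour listed above can be sketched in plain Python. This is an illustration only, not the code in src/preprocessing.py: the helper names `is_missing`, `drop_missing`, and `unique_in_order` are hypothetical, and the real functions may return NumPy arrays rather than lists.

```python
import math


def is_missing(value):
    """Return True for None, the empty string, or a float NaN."""
    if value is None or value == "":
        return True
    # NaN is the only float that needs a dedicated check
    return isinstance(value, float) and math.isnan(value)


def drop_missing(values):
    """Keep only the entries that are not considered missing."""
    return [v for v in values if not is_missing(v)]


def unique_in_order(values):
    """Remove duplicates while keeping the first appearance of each value."""
    # dict keys preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(values))


print(drop_missing([1, None, 2, "", 3, float("nan"), 4]))  # [1, 2, 3, 4]
print(unique_in_order([1, 2, 2, 3, 3]))                    # [1, 2, 3]
```

Using `dict.fromkeys` keeps the first-appearance order without an explicit loop, but it requires hashable values.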

Project Organization

.
├── LICENSE.txt
├── main.py
├── pyproject.toml
├── README.md
├── src
│   ├── cli.py
│   ├── __init__.py
│   └── preprocessing.py
├── tests
│   ├── test_cli.py
│   └── test_logic.py
├── tree.txt
└── uv.lock

3 directories, 11 files

Installation

  1. Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Clone the repository:
git clone https://github.com/NeoLafuente/Lab0.git
cd Lab0
  3. Create the virtual environment and install dependencies with uv sync, then activate it:
uv sync
source .venv/bin/activate

Using the Library (Python API)

Import functions from src.preprocessing to use them inside your Python scripts or notebooks.

Example:

from src.preprocessing import normalize_minmax, remove_missing_values

data = [1, None, 2, '', 3, float('nan'), 4]
clean = remove_missing_values(data)          # array([1,2,3,4])
normalized = normalize_minmax([1,2,3,4,5])   # array([0.,0.25,0.5,0.75,1.])
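For readers who want to see what these numeric transforms compute, here is a minimal standard-library sketch of min-max normalization and z-score standardization. The names `minmax` and `zscore` are hypothetical, and two choices are assumptions of this sketch rather than facts about Lab0: that `new_min` / `new_max` define the target range, and that the z-score uses the population standard deviation.

```python
from statistics import fmean, pstdev


def minmax(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound (sketch choice)
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def zscore(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # assumption: population (not sample) standard deviation
    if std == 0:
        # No spread in the data: return zeros instead of dividing by zero
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


print(minmax([1, 2, 3, 4, 5]))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(zscore([1, 2, 3, 4, 5]))
```

Both functions guard the degenerate case where every input is equal, which would otherwise divide by zero.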

Text example:

from src.preprocessing import tokenize_alphanumeric, remove_stopwords

text = "The quick brown fox jumps over the lazy dog"
tok = tokenize_alphanumeric(text)  # 'the quick brown fox jumps over the lazy dog'
filtered = remove_stopwords(tok, {'the', 'over'})
# 'quick brown fox jumps lazy dog'
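The text helpers can be approximated with the standard library as follows. This is a sketch under assumptions, not the library's implementation: `tokenize` and `strip_stopwords` are hypothetical names, and details such as whitespace handling may differ in src/preprocessing.py.

```python
import re


def tokenize(text):
    """Lowercase the text and drop everything except a-z, 0-9 and whitespace."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower())


def strip_stopwords(text, stopwords):
    """Remove whole words that appear in the stopwords set."""
    # Splitting and re-joining also collapses repeated whitespace (sketch choice)
    return " ".join(word for word in text.split() if word not in stopwords)


tok = tokenize("The quick brown fox jumps over the lazy dog")
print(strip_stopwords(tok, {"the", "over"}))  # quick brown fox jumps lazy dog
print(tokenize("Hello, World! 123"))          # hello world 123
```

Lowercasing before filtering means the stop-word set only needs lowercase entries.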

CLI Usage

The CLI is implemented in src/cli.py. Run it as a Python module:

python -m src.cli --help

Or call specific groups and commands. The CLI has 4 groups: clean, numeric, text, struct.

General form:

  • Commands accept the input either as a positional comma-separated list of values (legacy) or through the explicit --values (-v) option, which is preferred when an input element starts with -.
  • Many commands have options (for example --min / --max for normalization/clip).
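To make the comma-separated input format concrete, the sketch below shows one way such a string could be turned into Python values. How src/cli.py actually parses its input is not documented here, so the helper `parse_values` and its rules are assumptions: empty tokens and the literal `None` become `None`, and numeric tokens are converted.

```python
def parse_values(raw):
    """Split a comma-separated string into Python values (illustrative only)."""
    parsed = []
    for token in raw.split(","):
        token = token.strip()
        if token == "" or token == "None":
            # Treat empty slots and the literal "None" as missing
            parsed.append(None)
            continue
        try:
            parsed.append(int(token))
        except ValueError:
            try:
                parsed.append(float(token))
            except ValueError:
                # Leave anything non-numeric as a plain string
                parsed.append(token)
    return parsed


print(parse_values("1,2,,3,None,4"))  # [1, 2, None, 3, None, 4]
print(parse_values("-5,-3,0"))        # [-5, -3, 0]
```

A parser like this keeps missing entries as `None`, so a later cleaning step can decide whether to drop or fill them.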

Examples

  • Clean

    • Remove missing values:
      python -m src.cli clean remove-missing "1,2,,3,None,4"
      Or (preferred if a value begins with -):
      python -m src.cli clean remove-missing --values "-1,-2,,3"
    • Fill missing values:
      python -m src.cli clean fill-missing "1,2,,3" --fill-value 99
  • Numeric

    • Normalize:
      python -m src.cli numeric normalize "1,2,3,4,5"
      # or explicitly:
      python -m src.cli numeric normalize --values "-5,-3,0" --min -1 --max 1
    • Standardize:
      python -m src.cli numeric standardize "1,2,3,4,5"
    • Clip:
      python -m src.cli numeric clip "1,5,10,15,20" --min 5 --max 15
    • Convert to integers:
      python -m src.cli numeric to-integers "1,2,abc,4"
    • Logarithmic:
      python -m src.cli numeric logarithmic "1,10,100"
  • Text

    • Tokenize (lowercase, alphanumeric only):
      python -m src.cli text tokenize "Hello, World! 123"
    • Remove punctuation (preserve case):
      python -m src.cli text remove-punctuation "user@email.com"
    • Remove stop-words (customizable):
      python -m src.cli text remove-stopwords "the quick brown fox" --stopwords "the,a,is"
  • Struct

    • Shuffle:
      python -m src.cli struct shuffle "1,2,3,4,5" --seed 42
    • Flatten (expects JSON formatted list-of-lists):
      python -m src.cli struct flatten '[[1,2],[3,4]]'
    • Unique:
      python -m src.cli struct unique "1,2,2,3,3"
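The struct commands map onto simple list operations. The following sketch shows a recursive flatten and a seeded shuffle using only the standard library. The function names `flatten` and `shuffled` are hypothetical, and the real implementation may use a different random generator, so the same seed will not necessarily give the same order as Lab0.

```python
import json
import random


def flatten(nested):
    """Recursively flatten arbitrarily nested lists into one flat list."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


def shuffled(values, seed=None):
    """Return a shuffled copy; the same seed always yields the same order."""
    rng = random.Random(seed)  # private generator, global random state untouched
    copy = list(values)
    rng.shuffle(copy)
    return copy


# The CLI takes the nested list as JSON, so decode it first
print(flatten(json.loads("[[1,2],[3,4]]")))  # [1, 2, 3, 4]
print(shuffled([1, 2, 3, 4, 5], seed=42) == shuffled([1, 2, 3, 4, 5], seed=42))  # True
```

Using a dedicated `random.Random(seed)` instance keeps the shuffle reproducible without reseeding the module-level generator that other code may rely on.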

Notes about negative numbers

  • If a comma-separated list starts with a minus sign (e.g. -5,-3,0), Click may parse the token as an option instead of an argument. Pass such lists with --values / -v (see the examples above). Positional arguments still work for positive-only lists.

Running Tests

Unit and integration tests are written with pytest and Click's testing utilities.

Run tests:

# Run the full test suite
uv run python -m pytest -v

# With coverage reporting (requires pytest-cov)
uv run python -m pytest -v --cov=src

  • Tests live in the tests/ folder.
  • The tests include unit tests for each function and integration tests invoking the CLI via Click's CliRunner.
  • Some tests deliberately use --values for lists that start with - to avoid Click option parsing issues.

Contributing

Contributions are welcome. Typical workflow:

  1. Fork the repository.
  2. Create a topic branch: git checkout -b feat/my-change
  3. Run tests locally and add tests for your change.
  4. Open a pull request describing your change.

Please follow the existing style and add unit tests for new logic.


License

This project is licensed under the MIT License. See LICENSE.txt for details.
