This repository hosts a custom syllable-level tokenizer for Swahili text, built around a predefined syllabic vocabulary. The tokenizer is compatible with the Hugging Face transformers library, making it easy to integrate into NLP pipelines and models.
- Syllabic Tokenization: Splits Swahili text into syllables drawn from a predefined syllabic vocabulary.
- Byte Fallback: Falls back to UTF-8 bytes for out-of-vocabulary tokens (see the sketch after this list).
- Customizable: Easily extendable and adaptable to specific NLP tasks.
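The first two features can be seen end to end in a minimal sketch. It assumes SilabiTokenizer follows the standard Hugging Face tokenizer interface (`tokenize`, `__call__`, `decode`) and that the default constructor loads the bundled syllabic vocabulary; the syllable splits shown in the comments are illustrative assumptions, not guaranteed output.

```python
from hf_tokenizer import SilabiTokenizer

tokenizer = SilabiTokenizer()

# In-vocabulary Swahili words are expected to split along syllable
# boundaries, e.g. "habari" -> something like ["ha", "ba", "ri"]
# (illustrative; actual tokens depend on the shipped vocabulary).
print(tokenizer.tokenize("habari yako"))

# Text containing characters outside the syllabic vocabulary should be
# covered via UTF-8 byte fallback rather than an unknown token, so
# encoding and decoding round-trips the original string.
ids = tokenizer("résumé 😊")["input_ids"]
print(tokenizer.decode(ids))
```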
```bash
pip install -r requirements.txt
```
```python
from hf_tokenizer import SilabiTokenizer

# Initialize the tokenizer
tokenizer = SilabiTokenizer()

# Encode a sample text
encoded_input = tokenizer("Hii ni mfano wa maandishi.")

# Decode the token ids back to text
decoded_text = tokenizer.decode(encoded_input["input_ids"])

print("Encoded Input:", encoded_input)
print("Decoded Text:", decoded_text)
```
Contributions and suggestions are welcome! Feel free to open an issue or submit a pull request.