ParsiPy: NLP Toolkit for Historical Persian Texts in Python



Overview

ParsiPy is an NLP toolkit designed for analyzing historical Persian texts, including languages like Parsig (Pahlavi). It provides essential modules such as lemmatization, POS tagging, tokenization, and phoneme-to-grapheme conversion, making it a valuable resource for researchers working with low-resource languages. Beyond its practical applications, ParsiPy serves as a model for developing NLP tools tailored to linguistically rich yet underrepresented languages.


Installation

PyPI

`pip install parsipy`

Source code

Clone the repository and run `pip install .` from the project root.

Usage

To use ParsiPy's modules for analyzing texts in the Pahlavi language, you need to input your text in phonetic form.
To simplify the process, we have developed a pipeline module that works as follows.

Pipeline

In the following example, we use a passage from an ancient Parsig text containing advice for the people of that time. A rough English translation is: "Forget what is gone and do not worry about what has not yet come." [1]

You can easily apply tokenization, lemmatization, POS tagging, and phoneme-to-grapheme conversion to this text using the following code:

>>> from parsipy import pipeline, Task
>>> result = pipeline(sentence='ān uzīd frāmōš kun ud ān nē mad ēstēd rāy tēmār bēš ma bar',
...                   tasks=[Task.TOKENIZER, Task.LEMMA, Task.POS, Task.P2T])

The result is a dictionary containing the outputs of all requested tasks:

{
    "tokenizer": [
        {"id": 0, "text": "ān"},
        {"id": 1, "text": "uzīd"},
        {"id": 2, "text": "frāmōš"},
        {"id": 3, "text": "kun"},
        {"id": 4, "text": "ud"},
        {"id": 5, "text": "ān"},
        {"id": 6, "text": ""},
        {"id": 7, "text": "mad"},
        {"id": 8, "text": "ēstēd"},
        {"id": 9, "text": "rāy"},
        {"id": 10, "text": "tēmār"},
        {"id": 11, "text": "bēš"},
        {"id": 12, "text": "ma"},
        {"id": 13, "text": "bar"}
    ],
    "lemma": [
        {"stem": "ān", "text": "ān"},
        {"stem": "uzīd", "text": "uzīd"},
        {"stem": "frāmōš", "text": "frāmōš"},
        {"stem": "kun", "text": "kun"},
        {"stem": "ud", "text": "ud"},
        {"stem": "ān", "text": "ān"},
        {"stem": "", "text": ""},
        {"stem": "mad", "text": "mad"},
        {"stem": "ēst", "text": "ēstēd"},
        {"stem": "rāy", "text": "rāy"},
        {"stem": "tēmār", "text": "tēmār"},
        {"stem": "bēš", "text": "bēš"},
        {"stem": "ma", "text": "ma"},
        {"stem": "bar", "text": "bar"}
    ],
    "POS": [
        {"POS": "DET", "text": "ān"},
        {"POS": "N", "text": "uzīd"},
        {"POS": "N", "text": "frāmōš"},
        {"POS": "V", "text": "kun"},
        {"POS": "CONJ", "text": "ud"},
        {"POS": "DET", "text": "ān"},
        {"POS": "ADV", "text": ""},
        {"POS": "V", "text": "mad"},
        {"POS": "V", "text": "ēstēd"},
        {"POS": "POST", "text": "rāy"},
        {"POS": "N", "text": "tēmār"},
        {"POS": "N", "text": "bēš"},
        {"POS": "ADV", "text": "ma"},
        {"POS": "N", "text": "bar"}
    ],
    "P2T": [
        {"text": "ān", "transliteration": "ZK"},
        {"text": "uzīd", "transliteration": "ʾwcyt"},
        {"text": "frāmōš", "transliteration": "plʾmwš"},
        {"text": "kun", "transliteration": "OḆYDWNt͟y"},
        {"text": "ud", "transliteration": "W"},
        {"text": "ān", "transliteration": "ZK"},
        {"text": "", "transliteration": "LA"},
        {"text": "mad", "transliteration": "mt"},
        {"text": "ēstēd", "transliteration": "YKOYMWyt'"},
        {"text": "rāy", "transliteration": "lʾd"},
        {"text": "tēmār", "transliteration": "tymʾl"},
        {"text": "bēš", "transliteration": "byš"},
        {"text": "ma", "transliteration": "AL"},
        {"text": "bar", "transliteration": "YḆLWN"}
    ]
}

Below is a brief explanation of each task:

Tokenization

This module splits a sentence into individual tokens, making it easier to process each word separately. Tokenization is a crucial first step for many NLP tasks.
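The tokenizer returns a list of `{"id", "text"}` entries, as shown in the pipeline output above. As a rough illustration of that output shape (a whitespace-splitting sketch, not ParsiPy's actual implementation):

```python
# Minimal sketch: split on whitespace and mirror the {"id", "text"}
# entry shape shown in the pipeline output above.
def tokenize(sentence):
    return [{"id": i, "text": token} for i, token in enumerate(sentence.split())]

tokens = tokenize("ān uzīd frāmōš kun")
# Each token carries its position and surface form, e.g.
# {'id': 0, 'text': 'ān'}, {'id': 1, 'text': 'uzīd'}, ...
```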

Lemmatization

Lemmatization reduces words to their base or root forms, removing prefixes and suffixes. This is useful for standardizing different word variations.
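In the example output above, most words are already in base form; only `ēstēd` reduces to the stem `ēst`. A lexicon-lookup sketch of that behavior (hypothetical, not ParsiPy's internals):

```python
# Hypothetical lexicon-based lemmatizer: look up known stems and fall
# back to the surface form, matching the {"stem", "text"} entries above.
STEMS = {"ēstēd": "ēst"}  # tiny excerpt drawn from the example output

def lemmatize(tokens):
    return [{"stem": STEMS.get(t, t), "text": t} for t in tokens]

lemmas = lemmatize(["kun", "ēstēd"])
```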

POS

This module assigns a part-of-speech (POS) tag to each word in a sentence based on its grammatical role. The output provides essential grammatical information for further text analysis.
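The tag set in the example above includes DET, N, V, CONJ, ADV, and POST (postposition). Once you have the pipeline result, filtering by tag is a simple iteration; here a small excerpt of the `"POS"` output above is used as literal data:

```python
# Excerpt of the "POS" output shown above, used as plain data.
pos_output = [
    {"POS": "DET", "text": "ān"},
    {"POS": "N", "text": "uzīd"},
    {"POS": "V", "text": "kun"},
]

# Collect all verbs from the tagged sentence.
verbs = [entry["text"] for entry in pos_output if entry["POS"] == "V"]
```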

P2T

Since there is no widely accepted Unicode representation of the original Pahlavi script, digital texts are often written in phonetic form. This module maps the phonetic representation to its transliteration, an intermediate form between the phonetic form and the original script. We also provide a tool for converting the transliteration into the original script.
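Using the `P2T` entries from the example above as plain data, you can assemble the transliterated line token by token:

```python
# Excerpt of the "P2T" output shown above, used as literal data.
p2t_output = [
    {"text": "ān", "transliteration": "ZK"},
    {"text": "uzīd", "transliteration": "ʾwcyt"},
    {"text": "frāmōš", "transliteration": "plʾmwš"},
]

# Join the per-token transliterations into a single line.
line = " ".join(entry["transliteration"] for entry in p2t_output)
```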

To convert the transliteration to the Parsig font, you can use this exe file and font on Windows.

Issues & bug reports

Just file an issue and describe it; we'll check it as soon as possible. Alternatively, send an email to [email protected].

  • Please complete the issue template

References

1- Goshtasb, Farzaneh, and Hajipour. "Description and Explanation of the Nature of the Justice of Khosrow Anushirvan in Persian Texts and the Search for Its Background in Middle Persian Texts." Journal of the Iranian History Association (Cultural History Studies Quarterly) 14.53 (2022): 101-125.

Show your support

Star this repo

Give a ⭐️ if this project helped you!

Donate to our project

If you like our project (and we hope you do), please consider supporting us. ParsiPy is not, and never will be, a for-profit project; donations simply help us keep doing what we do. ;-)

ParsiPy Donation