The-Indus-Project

India, known for its rich civilization, stands as a testament to humanity's cultural tapestry, thriving amidst diverse languages and traditions. Its mosaic of tongues, encompassing over 1,600 languages, reflects the linguistic diversity that flourishes across its vast landscapes. From Hindi to Tamil, Bengali to Gujarati, each language weaves a narrative of its own, adding hues to India's social fabric. This linguistic tapestry fosters a deep sense of unity, embracing the multiplicity of voices and celebrating the heritage that defines India's culture. In the first phase, we aim to cover 40 Hindi dialects in the Indus project. More dialects will be added subsequently!

Toolkit for Indus project

IndusNLPToolkit.py is an add-on built on top of Indian NLTK that covers the NLP work needed for the Indus project. The repository is free to use to tokenize words in 40+ Hindi dialects, clean English words, etc. A test file is also provided to test the code.

The Data/Stopwords directory contains the stopword lists. Currently there are stopwords for Hindi and its dialects, such as Maithili. More are in progress and will be added soon.
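As an illustration of how a stopword list is typically applied, here is a minimal sketch in plain Python. The `SAMPLE_STOPWORDS` set and the `remove_stopwords` helper are hypothetical and are not part of the toolkit; the real lists live in Data/Stopwords.

```python
# Hypothetical stopword set for illustration only;
# the actual lists are stored in Data/Stopwords.
SAMPLE_STOPWORDS = {"के", "में", "है", "को", "और"}


def remove_stopwords(tokens, stopwords):
    """Return the tokens that do not appear in the stopword collection."""
    stopword_set = set(stopwords)  # set lookup keeps filtering fast
    return [tok for tok in tokens if tok not in stopword_set]


# Whitespace splitting stands in for a real tokenizer here.
tokens = "संसद के विशेष सत्र में बिल पेश है".split()
filtered = remove_stopwords(tokens, SAMPLE_STOPWORDS)
print(filtered)
```

In practice the second argument could be the result of `find_stopwords('hi')`, assuming it returns an iterable of strings; this README does not document its return type.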

General usage

python test_indus_toolkit.py

In test_indus_toolkit.py:

import os
import string
import numpy as np
from IndusNLPToolkit import Toolkit

ip = Toolkit()
print(ip.pos_tags("हाय मेरे कोल 10000 स्टिकर न"))

print(ip.clean_text("हाय मेरे कोल 10000 स्टिकर न"))

etc...

Usage of IndusNLPToolkit.py

# Choose your text
text = "संसद के विशेष सत्र (Parliament Special Session) के बीच कल यानी, सोमवार, 18 सितंबर को प्रधानमंत्री नरेंद्र मोदी (PM Narendra Modi) की अध्यक्षता में हुई केंद्रीय कैबिनेट बैठक हुई जिसमें महिला आरक्षण बिल (Women's Reservation Bill) मंजूरी दे दी गई है. सूत्रों के हिसाब से यह खबर सामने आ रही है. मोदी कैबिनेट की बैठक में लोकसभा और विधानसभाओं जैसी निर्वाचित संस्थाओं में 33 फीसदी महिला आरक्षण (Women Quota Bill 2023) पर मुहर लग गई है. मीडिया रिपोर्ट्स के अनुसार, महिला आरक्षण बिल को आज यानी मंगलवार को लोकसभा में नए संसद भवन (New Parliament Building) में पेश किया जाएगा."

If you wish to read the text from a data file instead, use the following:

with open('Data/corpora/hin_mixed_2019_10K-sentences.txt', 'r', encoding='utf-8') as f:
    text = f.read()
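Backslashes inside a normal Python string can be read as escape sequences, so a Windows-style path is fragile. The sketch below shows one portable alternative using `pathlib`; the `load_corpus` helper is hypothetical and not part of the toolkit, and the path is the one from the example above.

```python
from pathlib import Path
from typing import Optional


def load_corpus(path: Path) -> Optional[str]:
    """Read a UTF-8 text file, or return None if the file does not exist."""
    if not path.is_file():
        return None
    return path.read_text(encoding="utf-8")


# Building the path from parts avoids backslash escapes and works on any OS.
corpus_path = Path("Data") / "corpora" / "hin_mixed_2019_10K-sentences.txt"
corpus_text = load_corpus(corpus_path)
if corpus_text is None:
    print(f"Corpus not found at {corpus_path}")
```

Returning `None` for a missing file is a design choice for this sketch; raising `FileNotFoundError` would be equally reasonable.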

Create an instance of the class

tk = Toolkit()

Using the various functions

print(tk.put_purna_viram(text))

print(tk.clean_text(text))

print(tk.word_tokenize(text))

print(tk.sent_tokenize(text))

print(tk.pos_tags(text))

print(tk.find_stopwords('hi'))

print(tk.generate_stemmed_text(text))

print("--------------------")
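This README does not describe how `put_purna_viram` works internally. As a rough illustration of the idea its name suggests (ending Hindi sentences with the purna viram "।" rather than a Latin full stop), here is a minimal standalone sketch. The `replace_full_stops` function and its regular expression are assumptions made for illustration, not the toolkit's implementation.

```python
import re

# Assumption: a full stop ends a Hindi sentence when it directly follows a
# Devanagari character and is followed by whitespace or the end of the text.
_FULL_STOP = re.compile(r"(?<=[\u0900-\u097F])\.(?=\s|$)")


def replace_full_stops(text: str) -> str:
    """Replace sentence-final '.' after Devanagari text with the purna viram."""
    return _FULL_STOP.sub("।", text)


sample = "सूत्रों के हिसाब से यह खबर सामने आ रही है. कीमत 3.5 रुपये है."
converted = replace_full_stops(sample)
print(converted)
```

Decimal points such as the one in "3.5" are left alone because the preceding character is an ASCII digit, not a Devanagari character.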