
fauzailk/The-Indus-Project

 
 


The-Indus-Project

India, known for its rich civilization, thrives amidst diverse languages and traditions. Its mosaic of tongues, encompassing over 1,600 languages, reflects the linguistic diversity that flourishes across its vast landscapes. From Hindi to Tamil and Bengali to Gujarati, each language carries a narrative of its own and adds to India's social fabric. This multiplicity of voices fosters a deep sense of unity and defines India's cultural heritage. In the first phase, we aim to cover 40 Hindi dialects in the Indus project. More dialects will be added subsequently!

Toolkit for Indus project

IndusNLPToolkit.py is an addendum on top of Indian NLTK that provides the NLP functions needed for the Indus project. The repository is free to use, for example to tokenize words in 40+ Hindi dialects or to clean English words. A test file is also provided to test the code.

The Data/Stopwords folder holds the stopwords needed. Currently there are stopwords for Hindi and its dialects, such as Maithili. More are in progress and will be added soon.
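To illustrate what a stopword list is used for, here is a minimal sketch of stopword filtering in plain Python. The small stopword set and the `remove_stopwords` helper are hypothetical and only for illustration; they are not the contents of Data/Stopwords or a function of the toolkit.

```python
# Illustrative only: a tiny hand-picked set, NOT the contents of Data/Stopwords.
HINDI_STOPWORDS = {"के", "में", "की", "है", "को", "और"}


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword set, preserving order."""
    return [tok for tok in tokens if tok not in stopwords]


# Whitespace splitting stands in for a real tokenizer in this sketch.
tokens = "महिला आरक्षण बिल को लोकसभा में पेश किया जाएगा".split()
filtered = remove_stopwords(tokens, HINDI_STOPWORDS)
print(filtered)
```

In practice the stopword set would come from the toolkit (see `find_stopwords` below) rather than being written by hand.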

General usage

python test_indus_toolkit.py

In test_indus_toolkit.py:

import os
import string
import numpy as np
from IndusNLPToolkit import Toolkit

ip = Toolkit()
print(ip.pos_tags("हाय मेरे कोल 10000 स्टिकर न"))

print(ip.clean_text("हाय मेरे कोल 10000 स्टिकर न"))

etc...

Usage of IndusNLPToolkit.py

# Choose your text
text = "संसद के विशेष सत्र (Parliament Special Session) के बीच कल यानी, सोमवार, 18 सितंबर को प्रधानमंत्री नरेंद्र मोदी (PM Narendra Modi) की अध्यक्षता में हुई केंद्रीय कैबिनेट बैठक हुई जिसमें महिला आरक्षण बिल (Women's Reservation Bill) मंजूरी दे दी गई है. सूत्रों के हिसाब से यह खबर सामने आ रही है. मोदी कैबिनेट की बैठक में लोकसभा और विधानसभाओं जैसी निर्वाचित संस्थाओं में 33 फीसदी महिला आरक्षण (Women Quota Bill 2023) पर मुहर लग गई है. मीडिया रिपोर्ट्स के अनुसार, महिला आरक्षण बिल को आज यानी मंगलवार को लोकसभा में नए संसद भवन (New Parliament Building) में पेश किया जाएगा."

If you wish to read the text from a data file instead, use the following:

# Forward slashes avoid invalid escape sequences such as "\c" and work on all platforms
with open('Data/corpora/hin_mixed_2019_10K-sentences.txt', 'r', encoding='utf-8') as f:
    text = f.read()
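Because path separators differ between operating systems, one portable option is to build the path with `pathlib`. The following is a minimal sketch, assuming the corpus file sits at the location given above relative to the working directory. `read_corpus` is a hypothetical helper and not part of the toolkit.

```python
from pathlib import Path


def read_corpus(path):
    """Read a UTF-8 text file and return its contents, or None if it is missing."""
    path = Path(path)
    if not path.is_file():
        return None
    return path.read_text(encoding="utf-8")


# Path components taken from this README; adjust if your checkout differs.
corpus_path = Path("Data") / "corpora" / "hin_mixed_2019_10K-sentences.txt"
text = read_corpus(corpus_path)
print("loaded" if text is not None else f"not found: {corpus_path}")
```

Returning `None` for a missing file keeps the sketch from raising when it is run outside the repository; a real script may prefer to let the exception surface.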

Make an instance of the class

tk = Toolkit()

Usage of various functions

print(tk.put_purna_viram(text))

print(tk.clean_text(text))

print(tk.word_tokenize(text))

print(tk.sent_tokenize(text))

print(tk.pos_tags(text))

print(tk.find_stopwords('hi'))

print(tk.generate_stemmed_text(text))

print("--------------------")
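For readers unfamiliar with Hindi sentence boundaries, the sketch below shows a naive regex-based splitter that breaks text after a purna viram (।), question mark, or exclamation mark. It is only an illustration of the idea and is not how `sent_tokenize` is implemented in the toolkit. The `split_sentences` helper and the sample text are hypothetical.

```python
import re

# Split on whitespace that follows a purna viram (।), "?" or "!".
_SENTENCE_END = re.compile(r"(?<=[।?!])\s+")


def split_sentences(text):
    """Naive sentence splitter for Hindi text; keeps the end mark on each sentence."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


sample = "यह पहला वाक्य है। यह दूसरा वाक्य है। क्या यह तीसरा है?"
for sentence in split_sentences(sample):
    print(sentence)
```

A splitter like this ignores abbreviations and text that ends sentences with a Latin full stop, which is one reason to prefer the toolkit's own functions on real data.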
