The-Indus-Project

India, known for its rich civilization, stands as a testament to humanity's cultural tapestry, thriving amidst diverse languages and traditions. Its mosaic of tongues, encompassing over 1,600 languages, reflects the linguistic diversity that flourishes across its vast landscapes. From Hindi to Tamil, Bengali to Gujarati, each language weaves a narrative of its own, adding hues to India's social fabric. This linguistic tapestry fosters a deep sense of unity, embracing the multiplicity of voices and celebrating the heritage that defines India's extraordinary culture. In the first phase, the Indus project aims to cover 40 Hindi dialects. More dialects will be added subsequently.

Toolkit for Indus project

IndusNLPToolkit.py is an add-on on top of Indian NLTK that covers some of the NLP work needed for the Indus project. The repository is free to use, for example to tokenize words in 40+ Hindi dialects or to clean English words. A test file is also provided to test the code.
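
As a rough illustration of what cleaning English words can mean for Hindi text, the sketch below drops Latin-script tokens and collapses the leftover whitespace. It is a standalone example, not the toolkit's implementation: the helper name strip_latin_words and the regular expression are assumptions made for illustration.

```python
import re

# Hypothetical helper, NOT part of IndusNLPToolkit.
# Matches runs of Latin letters, optionally with an apostrophe part ("Women's").
LATIN_WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def strip_latin_words(text: str) -> str:
    """Remove Latin-script words and collapse the whitespace left behind."""
    without_latin = LATIN_WORD.sub(" ", text)
    return re.sub(r"\s+", " ", without_latin).strip()


sample = "महिला आरक्षण बिल Women Quota Bill पर मुहर लग गई है"
print(strip_latin_words(sample))
```

Digits and punctuation are left alone here; whether the toolkit's clean_text also removes them is not stated in this README.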

The Data/Stopwords folder holds the stopword lists. Currently there are stop words for Hindi and its dialects, such as Maithili. More are in progress and will be added soon.
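
To show how a stopword list is typically applied, here is a minimal sketch that filters tokens against a small set. The six-word set and the function name remove_stopwords are illustrative assumptions; the actual lists live in Data/Stopwords, and the toolkit's own tokenizer may split text differently.

```python
# Illustrative stop list only; the real lists are in the Data/Stopwords folder.
HINDI_STOPWORDS = {"के", "में", "है", "की", "और", "को"}


def remove_stopwords(text: str, stopwords: set) -> list:
    """Return the whitespace-separated tokens of text that are not stopwords."""
    # Splitting on whitespace is a simplification made for this sketch.
    return [tok for tok in text.split() if tok not in stopwords]


sentence = "लोकसभा में नए संसद भवन में पेश किया जाएगा"
print(remove_stopwords(sentence, HINDI_STOPWORDS))
```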

General usage

python test_indus_toolkit.py

In test_indus_toolkit.py:

import os
import string
import numpy as np
from IndusNLPToolkit import Toolkit

ip = Toolkit()
print(ip.pos_tags("हाय मेरे कोल 10000 स्टिकर न"))

print(ip.clean_text("हाय मेरे कोल 10000 स्टिकर न"))

etc...

Usage of IndusNLPToolkit.py

# Choose your text
text = "संसद के विशेष सत्र (Parliament Special Session) के बीच कल यानी, सोमवार, 18 सितंबर को प्रधानमंत्री नरेंद्र मोदी (PM Narendra Modi) की अध्यक्षता में हुई केंद्रीय कैबिनेट बैठक हुई जिसमें महिला आरक्षण बिल (Women's Reservation Bill) मंजूरी दे दी गई है. सूत्रों के हिसाब से यह खबर सामने आ रही है. मोदी कैबिनेट की बैठक में लोकसभा और विधानसभाओं जैसी निर्वाचित संस्थाओं में 33 फीसदी महिला आरक्षण (Women Quota Bill 2023) पर मुहर लग गई है. मीडिया रिपोर्ट्स के अनुसार, महिला आरक्षण बिल को आज यानी मंगलवार को लोकसभा में नए संसद भवन (New Parliament Building) में पेश किया जाएगा."

If you wish to open a data file, use the following

with open('Data/corpora/hin_mixed_2019_10K-sentences.txt', 'r', encoding='utf-8') as f:
    text = f.read()  # assumed body: read the whole file into text
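
If the corpus is processed sentence by sentence, a path built with pathlib avoids backslash issues and works the same on Windows, Linux and macOS. The sketch below assumes one sentence per line in the corpus file, which this README does not confirm; read_sentences is a hypothetical helper, not part of the toolkit.

```python
from pathlib import Path


def read_sentences(path: Path) -> list:
    """Return the non-empty lines of a UTF-8 text file, stripped of whitespace."""
    # One sentence per line is an assumption about the corpus layout.
    with path.open("r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Joining parts with "/" keeps the path portable across operating systems.
corpus_path = Path("Data") / "corpora" / "hin_mixed_2019_10K-sentences.txt"
if corpus_path.exists():
    sentences = read_sentences(corpus_path)
    print(len(sentences))
```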

Make an instance of the class

tk = Toolkit()

Use the various functions

print(tk.put_purna_viram(text))

print(tk.clean_text(text))

print(tk.word_tokenize(text))

print(tk.sent_tokenize(text))

print(tk.pos_tags(text))

print(tk.find_stopwords('hi'))

print(tk.generate_stemmed_text(text))

print("--------------------")
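
The README does not describe what each function returns. Going only by the names, put_purna_viram presumably replaces sentence-ending full stops with the Devanagari danda (।), and sent_tokenize splits text into sentences. The standalone sketch below shows one way such steps could work; both function names ending in _sketch are hypothetical and the behaviour is an assumption, not the toolkit's code.

```python
import re

DANDA = "\u0964"  # Devanagari danda, the purna viram


def put_purna_viram_sketch(text: str) -> str:
    """Replace a full stop followed by whitespace or end of text with a danda."""
    # The lookahead leaves decimals such as 3.5 untouched.
    return re.sub(r"\.(?=\s|$)", DANDA, text)


def sent_tokenize_sketch(text: str) -> list:
    """Split text after a danda, question mark or exclamation mark."""
    parts = re.split(r"(?<=[\u0964?!])\s+", text.strip())
    return [p for p in parts if p]


text = "मंजूरी दे दी गई है. यह खबर सामने आ रही है."
fixed = put_purna_viram_sketch(text)
print(fixed)
print(sent_tokenize_sketch(fixed))
```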
