India is home to remarkable linguistic diversity: more than 1,600 languages are spoken across the country. From Hindi to Tamil and Bengali to Gujarati, each language carries its own narrative and adds to India's social fabric and cultural heritage. In the first phase of the Indus project, we aim to cover 40 Hindi dialects. More dialects will be added subsequently.
Indus NLP Toolkit.py is an addendum on top of Indian NLTK that provides the NLP utilities needed for the Indus project. The repository is free to use, for example to tokenize words in 40+ Hindi dialects and to clean English words from text. A test file is also included for testing the code.
The Data/Stopwords directory contains the required stopword lists. Currently there are stopwords for Hindi and its dialects, such as Maithili. More are in progress and will be added soon.
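To illustrate how a stopword list is typically applied, here is a minimal sketch in plain Python. It does not use the toolkit's own API: the helper names `load_stopwords` and `remove_stopwords` are hypothetical, the one-word-per-line UTF-8 file format is an assumption about the files in Data/Stopwords, and the three sample stopwords are chosen only for the demonstration.

```python
from pathlib import Path


def load_stopwords(path):
    """Read stopwords from a UTF-8 text file.

    Assumption: the file holds one stopword per line. Blank lines are skipped.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword set, in original order."""
    return [token for token in tokens if token not in stopwords]


# Small inline sample so the sketch runs without any data files.
sample_stopwords = {"के", "में", "है"}
tokens = "संसद के नए भवन में बैठक है".split()
filtered = remove_stopwords(tokens, sample_stopwords)
print(filtered)
```

With a real list, `load_stopwords` would be pointed at one of the stopword files, and the resulting set passed to `remove_stopwords` after tokenization.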
General usage
python test_indus_toolkit.py
In test_indus_toolkit.py:
import os
import string
import numpy as np
from IndusNLPToolkit import Toolkit

ip = Toolkit()
print(ip.pos_tags("हाय मेरे कोल 10000 स्टिकर न"))
print(ip.clean_text("हाय मेरे कोल 10000 स्टिकर न"))
etc...
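For readers who want a sense of what cleaning English words out of Hindi text involves, the following is a rough standalone sketch. It is not the toolkit's `clean_text` implementation, whose behavior is not documented here. The function name `strip_english` and the regular expressions are assumptions made for illustration: Latin-script words are removed, parentheses left empty by that removal are dropped, and whitespace is collapsed. Digits and Devanagari text are left untouched.

```python
import re

# A run of Latin letters, optionally containing apostrophes or hyphens.
LATIN_RUN = re.compile(r"[A-Za-z][A-Za-z'-]*")
# Parentheses that contain only whitespace.
EMPTY_PARENS = re.compile(r"\(\s*\)")


def strip_english(text):
    """Remove Latin-script words and tidy up what is left behind."""
    no_latin = LATIN_RUN.sub(" ", text)
    no_parens = EMPTY_PARENS.sub(" ", no_latin)
    # Collapse the extra spaces created by the substitutions above.
    return re.sub(r"\s+", " ", no_parens).strip()


sample = "महिला आरक्षण बिल (Women's Reservation Bill) मंजूरी दे दी गई है"
cleaned = strip_english(sample)
print(cleaned)
```

Because digits are kept, a parenthetical such as "(Women Quota Bill 2023)" would be reduced to "( 2023)" rather than removed. Whether that is desirable depends on the downstream task.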
# Choose your text
text = "संसद के विशेष सत्र (Parliament Special Session) के बीच कल यानी, सोमवार, 18 सितंबर को प्रधानमंत्री नरेंद्र मोदी (PM Narendra Modi) की अध्यक्षता में हुई केंद्रीय कैबिनेट बैठक हुई जिसमें महिला आरक्षण बिल (Women's Reservation Bill) मंजूरी दे दी गई है. सूत्रों के हिसाब से यह खबर सामने आ रही है. मोदी कैबिनेट की बैठक में लोकसभा और विधानसभाओं जैसी निर्वाचित संस्थाओं में 33 फीसदी महिला आरक्षण (Women Quota Bill 2023) पर मुहर लग गई है. मीडिया रिपोर्ट्स के अनुसार, महिला आरक्षण बिल को आज यानी मंगलवार को लोकसभा में नए संसद भवन (New Parliament Building) में पेश किया जाएगा."
###text = f.read()
tk = Toolkit()