GitHub - gaaniruddha/FIT5196-A1: This repository contains assignment #1, completed as part of "FIT5196 Data Wrangling", taught at Monash Uni in S2 2020.

Data Wrangling A1: Parsing text files & text pre-processing

Assignment_Specifications.pdf: Assignment specifications
Task1_Parsing_Text_Files.ipynb/pdf: task 1 code in Python, documentation in Markdown
Task2_Text_PreProcessing.ipynb/pdf: task 2 code in Python, documentation in Markdown
30945305.xlsx: Input for task 2.

Tasks completed:

Task 1:
- Extracting data from semi-structured text files, transforming it into an XML format as per the specifications
- Designing efficient regular expressions to extract data into an XML file
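
As a rough illustration of the Task 1 idea, the sketch below parses a small semi-structured text with one compiled regular expression and writes the captured fields into XML. The input layout (`ID:`, `Title:`, `Author:` lines), the element names, and the function `records_to_xml` are all hypothetical; the actual format is defined in Assignment_Specifications.pdf, and this is not the notebook's code.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical semi-structured input; the real assignment format is not shown here.
RAW = """\
ID: 101
Title: Data wrangling basics
Author: A. Example

ID: 102
Title: Regex & XML
Author: B. Example
"""

# One compiled pattern with named groups captures a whole record per match,
# so the text is scanned once instead of once per field.
RECORD_RE = re.compile(
    r"^ID:\s*(?P<id>\d+)\s*\n"
    r"Title:\s*(?P<title>.+?)\s*\n"
    r"Author:\s*(?P<author>.+?)\s*$",
    re.MULTILINE,
)


def records_to_xml(text: str) -> str:
    """Build an XML string with one <record> element per regex match."""
    root = ET.Element("records")
    for m in RECORD_RE.finditer(text):
        rec = ET.SubElement(root, "record", id=m.group("id"))
        ET.SubElement(rec, "title").text = m.group("title")
        ET.SubElement(rec, "author").text = m.group("author")
    # ElementTree escapes special characters such as "&" when serialising.
    return ET.tostring(root, encoding="unicode")


xml_output = records_to_xml(RAW)
print(xml_output)
```

Building the tree with `xml.etree.ElementTree` rather than concatenating strings leaves character escaping to the library, which matters when the extracted text contains `&` or `<`.
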
Task 2:
- Preprocessing a set of tweets in Python and converting them into numerical representations (suitable as input to recommender systems or information-retrieval algorithms), as per the assignment specifications
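
A minimal sketch of what "converting tweets into numerical representations" can look like is given below. It uses only the Python standard library as a stand-in for the nltk and sklearn tooling listed in this README, so it is not the notebook's implementation. The sample tweets, the stopword list, the choice to drop URLs, mentions and hashtags, and the helper names `tokenize` and `count_vectors` are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical tweets; the real input for task 2 is the 30945305.xlsx file.
TWEETS = [
    "Loving the new phone! http://example.com #happy",
    "The phone battery is bad, the screen is great @maker",
]

# Tiny illustrative stopword list (an assumption, not the assignment's list).
STOPWORDS = {"the", "is", "a", "an", "and"}

# Lowercase alphabetic tokens, optionally with an internal apostrophe.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(tweet):
    """Lowercase a tweet, strip URLs/mentions/hashtags, drop stopwords."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)       # drop mentions and hashtags
    return [t for t in TOKEN_RE.findall(text) if t not in STOPWORDS]


def count_vectors(tweets):
    """Return a sorted vocabulary and one term-count vector per tweet."""
    tokenized = [tokenize(t) for t in tweets]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


vocab, vectors = count_vectors(TWEETS)
print(vocab)
for vec in vectors:
    print(vec)
```

Each tweet becomes a fixed-length vector of term counts over a shared vocabulary, which is the general shape of input that count-based vectorizers feed to downstream algorithms.
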

Libraries used: re, langid, od, nltk, nltk.collocations, nltk.tokenize, nltk.stem, nltk.probability, itertools, sklearn.feature_extraction.text

