Data Wrangling A1: Parsing text files & text pre-processing
- Assignment_Specifications.pdf: Assignment specifications
- Task1_Parsing_Text_Files.ipynb/pdf: task 1 code in Python, documentation in Markdown
- Task2_Text_PreProcessing.ipynb/pdf: task 2 code in Python, documentation in Markdown
- 30945305.xlsx: Input for task 2.
Tasks completed:
Task 1:
- Extracting data from semi-structured text files, transforming it into an XML format as per the specifications
- Designing efficient regular expressions to extract data into an XML file
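
As a rough illustration of the Task 1 approach, the sketch below pulls records out of a semi-structured text block with one compiled regular expression and writes them into an XML tree. The record layout (`ID`, `Title`, `Date`), the tag names, and the `records_to_xml` helper are assumptions made for this example; the actual input format and XML schema are the ones defined in the assignment specifications.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical record layout, invented for this example only.
SAMPLE = """\
ID: 101
Title: First record
Date: 2020-01-15

ID: 102
Title: Second record
Date: 2020-02-03
"""

# A single compiled pattern with named groups captures one whole record per match,
# so the text is scanned once instead of once per field.
RECORD_RE = re.compile(
    r"^ID:\s*(?P<id>\d+)\s*\n"
    r"Title:\s*(?P<title>.+?)\s*\n"
    r"Date:\s*(?P<date>\d{4}-\d{2}-\d{2})\s*$",
    re.MULTILINE,
)


def records_to_xml(text):
    """Build an XML tree with one <record> element per matched record."""
    root = ET.Element("records")
    for match in RECORD_RE.finditer(text):
        record = ET.SubElement(root, "record", id=match.group("id"))
        ET.SubElement(record, "title").text = match.group("title")
        ET.SubElement(record, "date").text = match.group("date")
    return root


xml_root = records_to_xml(SAMPLE)
xml_string = ET.tostring(xml_root, encoding="unicode")
print(xml_string)
```

Named groups keep the mapping from pattern to XML element readable, and `ElementTree` takes care of escaping characters such as `&` or `<` that would otherwise break hand-built XML strings.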
Task 2:
- Python code to preprocess a set of tweets and convert them into numerical representations (suitable as input to recommender systems or information-retrieval algorithms), following the assignment specifications.
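
The general idea behind Task 2 can be sketched as a small pipeline: tokenize each tweet, drop stopwords, build a vocabulary, and turn each tweet into a vector of token counts. This sketch deliberately uses only the standard library so it runs anywhere; it is a stand-in for, not a copy of, the nltk and `sklearn.feature_extraction.text` code in the notebook. The sample tweets, the stopword list, and all function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Made-up tweets for illustration; not taken from the assignment data.
TWEETS = [
    "Loving the new phone! #happy",
    "The new phone battery is bad :(",
    "Happy with the battery, happy with the phone",
]

# Tiny illustrative stopword list (an assumption, not the assignment's list).
STOPWORDS = {"the", "is", "with", "a", "an", "of"}

# Alphabetic tokens, optionally with an internal apostrophe (e.g. "don't").
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(tweet):
    """Lowercase the tweet, keep alphabetic tokens, and drop stopwords."""
    return [t for t in TOKEN_RE.findall(tweet.lower()) if t not in STOPWORDS]


def build_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({t for doc in docs for t in doc})
    return {token: index for index, token in enumerate(tokens)}


def count_vector(doc, vocab):
    """Return the token counts of one document, ordered by vocabulary index."""
    counts = Counter(doc)
    return [counts.get(token, 0) for token in vocab]


docs = [tokenize(tweet) for tweet in TWEETS]
vocab = build_vocabulary(docs)
vectors = [count_vector(doc, vocab) for doc in docs]

print(list(vocab))
for vector in vectors:
    print(vector)
```

Each row of `vectors` is a bag-of-words representation of one tweet. Steps such as language filtering, stemming, or bigram detection would slot in between `tokenize` and `build_vocabulary`.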
Libraries used: re, langid, os, nltk, nltk.collocations, nltk.tokenize, nltk.stem, nltk.probability, itertools, sklearn.feature_extraction.text