This repository contains Assignment #1, completed as part of "FIT5196 Data Wrangling", taught at Monash University in Semester 2, 2020.

gaaniruddha/FIT5196-A1

Data Wrangling A1: Parsing text files & text pre-processing

  • Assignment_Specifications.pdf: Assignment specifications
  • Task1_Parsing_Text_Files.ipynb/pdf: task 1 code in Python, documentation in Markdown
  • Task2_Text_PreProcessing.ipynb/pdf: task 2 code in Python, documentation in Markdown
  • 30945305.xlsx: Input for task 2.

Tasks completed:

  1. Task 1:

    • Extracting data from semi-structured text files, transforming it into an XML format as per the specifications
    • Designing efficient regular expressions to extract data into an XML file
  2. Task 2:

    • Preprocessing a set of tweets in Python and converting them into numerical representations (suitable as input to recommender systems or information-retrieval algorithms), as per the assignment specifications
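As a rough illustration of the Task 1 idea (regular expressions pulling fields out of semi-structured text and writing them out as XML), here is a minimal sketch. The input layout (`key: value` lines, records separated by blank lines), the sample data, and the element names `data` and `record` are all assumptions made for this example; the actual format is defined in Assignment_Specifications.pdf and is not reproduced here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample input: "key: value" lines, blank line between records.
SAMPLE = """\
id: 101
text: Hello world
date: 2020-08-01

id: 102
text: Data wrangling & regex
date: 2020-08-02
"""

# One field per line: a word-like key, a colon, then the rest of the line.
FIELD_RE = re.compile(r"^(?P<key>\w+):\s*(?P<value>.*)$", re.MULTILINE)


def records_to_xml(raw: str) -> ET.Element:
    """Turn blank-line-separated records into a <data> tree of <record> nodes."""
    root = ET.Element("data")  # assumed root element name
    for chunk in re.split(r"\n\s*\n", raw.strip()):
        record = ET.SubElement(root, "record")  # assumed record element name
        for match in FIELD_RE.finditer(chunk):
            child = ET.SubElement(record, match.group("key"))
            child.text = match.group("value").strip()
    return root


root = records_to_xml(SAMPLE)
# ElementTree escapes special characters such as "&" when serializing.
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Compiling the pattern once and using named groups keeps the extraction step readable; building the tree with `xml.etree.ElementTree` rather than string concatenation means escaping is handled by the library.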

Libraries used: re, langid, os, nltk, nltk.collocations, nltk.tokenize, nltk.stem, nltk.probability, itertools, sklearn.feature_extraction.text
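To give a feel for what "converting tweets into numerical representations" in Task 2 can look like, the sketch below builds simple count vectors using only the standard library (`re`, `itertools`, `collections`). It is a dependency-free stand-in, not the notebook's code: the sample tweets, the stopword set, and the token pattern are assumptions for illustration, and the notebook itself lists nltk and sklearn.feature_extraction.text for this kind of work.

```python
import re
from collections import Counter
from itertools import chain

# Hypothetical sample tweets and stopword list, for illustration only.
TWEETS = [
    "Loving the new data wrangling unit!",
    "Data wrangling is mostly cleaning data",
]
STOPWORDS = {"the", "is", "a", "an", "of"}

# Lowercase alphabetic tokens, optionally joined once by a hyphen or apostrophe.
TOKEN_RE = re.compile(r"[a-z]+(?:[-'][a-z]+)?")


def tokenize(text):
    """Lowercase the text, extract tokens, and drop stopwords."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOPWORDS]


def count_vectors(docs):
    """Return a sorted vocabulary and one term-count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted(set(chain.from_iterable(tokenized)))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


vocab, vectors = count_vectors(TWEETS)
print(vocab)
for vec in vectors:
    print(vec)
```

Each vector has one entry per vocabulary term, so every tweet maps to a fixed-length row that downstream algorithms can consume.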
