Christian Adriano, M.Sc., PMP edited this page Dec 24, 2018 · 4 revisions

Data wrangling of experiment data on fault localization microtasks

About the project

This Python project cleans, formats, and consolidates the data from two large-scale experiments.

How the data was generated

The experiments comprised locating the root cause of software failures in 18 popular open source software projects. The participants were programmers recruited on Mechanical Turk. Each programmer performed microtasks, each consisting of a question about the relationship between a program statement and a software failure.

The data

Each line in the data set corresponds to the outcome of a small task (a microtask) performed by a crowd of programmers. The data files will be available on a public repository (soon!). Meanwhile, you can look at the paper draft available here.

The data wrangling steps

The two experiments produced files in two formats (different separators) and with different content, i.e., different fields. Therefore, I needed to support this diversity while reusing a single workflow to consolidate the microtasks and the anonymized participant data.
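To illustrate the format problem, here is a minimal sketch of reading delimiter-separated files whose separators and fields differ. The sample data, field names, and the `read_rows` helper are hypothetical, not taken from the project:

```python
import csv
import io

# Hypothetical samples of the two file formats: different separators
# and a different set of fields per experiment.
EXPERIMENT_1 = "workerId;answer;duration\nw1;YES;42\n"
EXPERIMENT_2 = "workerId,answer,duration,sessionId\nw2,NO,17,s9\n"

def read_rows(text, separator):
    """Parse delimiter-separated text into a list of dicts keyed by header."""
    reader = csv.DictReader(io.StringIO(text), delimiter=separator)
    return [dict(row) for row in reader]

rows_1 = read_rows(EXPERIMENT_1, ";")  # semicolon-separated file
rows_2 = read_rows(EXPERIMENT_2, ",")  # comma-separated file with extra field
```

Once both files are normalized into lists of dicts, a single downstream workflow can consolidate them regardless of their original separator.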

My choice was to use two abstractions: one to deal with the different formats (parsing) and the other to deal with the different content. These two concerns are implemented respectively in the classes MergeFile and Parser (Figure-1). I designed their dependencies so that I could vary the merging and parsing strategies, which was possible by adopting the Strategy design pattern.
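The page names the classes MergeFile and Parser but not their internals, so the following is only a sketch of how the Strategy pattern could separate the two concerns; the concrete parser subclasses and method names are assumptions:

```python
class Parser:
    """Strategy interface: turn one raw line into a list of field values."""
    def parse(self, line):
        raise NotImplementedError

class SemicolonParser(Parser):
    """Hypothetical strategy for semicolon-separated files."""
    def parse(self, line):
        return line.strip().split(";")

class CommaParser(Parser):
    """Hypothetical strategy for comma-separated files."""
    def parse(self, line):
        return line.strip().split(",")

class MergeFile:
    """Consolidates lines from a source using an injected parsing strategy."""
    def __init__(self, parser):
        self.parser = parser  # strategy chosen at run time, not hard-coded

    def merge(self, lines):
        return [self.parser.parse(line) for line in lines]

# Swapping strategies without changing MergeFile:
merged = MergeFile(SemicolonParser()).merge(["w1;YES;42"])
```

The design choice is that MergeFile never inspects separators itself; it depends only on the Parser interface, so supporting a new file format means adding a new Parser subclass rather than modifying the merging workflow.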

Class diagram for the data parsing and merging solution