Home
This Python project cleans, formats, and consolidates the data from two large-scale experiments.
The experiments consisted of locating the root cause of software failures in 18 popular open source software projects. The experiment participants were programmers recruited on Mechanical Turk. They performed microtasks, each consisting of a question about the relationship between a program statement and a software failure.
Each line in the data set corresponds to the outcome of one small task (a microtask) performed by the crowd of programmers. The data files will be available in a public repository (soon!). Meanwhile, you can look at the paper draft available here.
The data files come in two formats that use different separators. Moreover, the two experiments produced files with different content, i.e., different fields. I therefore needed to support this diversity while reusing a single workflow to consolidate the microtask data and the anonymized participant data.
I chose to use two abstractions: one to deal with the different formats (parsing) and another to deal with the different content (merging). These two concerns are implemented in the classes Parser and MergeFile, respectively (Figure 1). I designed their dependencies so that I could vary the parsing and merging strategies independently, which was possible by adopting the Strategy design pattern.
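The sketch below illustrates how two such strategies could fit together. It is not the project's actual code: only the class names Parser and MergeFile come from the description above. The concrete subclasses, the separators ("," and ";"), the field names, and the assumption that a MergeFile receives its Parser through the constructor are all hypothetical and chosen for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List


class Parser(ABC):
    """Strategy for the file format: how one raw line is split into values."""

    @abstractmethod
    def parse(self, line: str) -> List[str]:
        ...


class SeparatorParser(Parser):
    """Hypothetical concrete parser that splits a line on a fixed separator."""

    def __init__(self, separator: str) -> None:
        self.separator = separator

    def parse(self, line: str) -> List[str]:
        return [value.strip() for value in line.rstrip("\n").split(self.separator)]


class MergeFile(ABC):
    """Strategy for the file content: which fields an experiment recorded."""

    def __init__(self, parser: Parser) -> None:
        # Assumption: the parsing strategy is injected, so format and
        # content can be varied independently of each other.
        self.parser = parser

    @abstractmethod
    def field_names(self) -> List[str]:
        ...

    def consolidate(self, lines: Iterable[str]) -> List[Dict[str, str]]:
        """Shared workflow: parse each non-empty line and label its values."""
        names = self.field_names()
        return [
            dict(zip(names, self.parser.parse(line)))
            for line in lines
            if line.strip()
        ]


class ExperimentOneMerge(MergeFile):
    """Hypothetical field layout for the first experiment."""

    def field_names(self) -> List[str]:
        return ["worker_id", "microtask_id", "answer"]


class ExperimentTwoMerge(MergeFile):
    """Hypothetical field layout for the second experiment (one extra field)."""

    def field_names(self) -> List[str]:
        return ["worker_id", "microtask_id", "answer", "confidence"]


# Usage: the same consolidate() workflow handles both formats and contents.
first = ExperimentOneMerge(SeparatorParser(","))
second = ExperimentTwoMerge(SeparatorParser(";"))
records = first.consolidate(["w1,t1,yes"]) + second.consolidate(["w2;t7;no;4"])
```

With this arrangement, supporting a new separator means adding or configuring a parser, and supporting a new set of fields means adding a MergeFile subclass. The consolidation workflow itself stays unchanged.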