Introducing data cleaning as a potential intermediate step in ETL process #120
Replies: 2 comments
-
For transparency, if we were to do this, it would need to be scoped for a future grant, but it could be part of our 2026 scope of work. At the moment, we are handling data cleanup manually, and we are already learning about the user requirements.
-
These two tools can help in understanding the issue.
-
This thread documents a problem emerging from ongoing programmatic support, along with a potential solution, for discussion.
The problem
In our current ETL approach for Guardian Connector, data is fetched directly from an upstream source, such as an API, or soon, individual file uploads. After applying basic transformations like reformatting column names, the data is then written as-is to the database.
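For context, the current approach amounts to something like the following minimal sketch (pandas and SQLAlchemy, with hypothetical function and table names rather than the actual Guardian Connector scripts):

```python
import pandas as pd
from sqlalchemy import create_engine


def basic_transform(df: pd.DataFrame) -> pd.DataFrame:
    """Apply only light-touch changes, e.g. normalizing column names."""
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df


def run_etl(fetch_records, db_url: str, table_name: str) -> None:
    """Fetch upstream records, reformat column names, and write them as-is."""
    df = pd.DataFrame(fetch_records())  # e.g. rows returned by an API client
    df = basic_transform(df)
    engine = create_engine(db_url)
    df.to_sql(table_name, engine, if_exists="replace", index=False)
```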
This works well for projects with clean and consistent data. However, many projects require additional data cleaning to be usable. Common needs include the following scenarios.
Example Scenarios
KoboToolbox/ODK Forms
When forms are revised mid-project, a field like what_is_your_age may appear under different groupings in different versions, leading to columns like what_is_your_age and demography/what_is_your_age.
These values should ideally be merged during cleanup.
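As a rough illustration of that merge, here is a minimal pandas sketch; it is not part of the current scripts, and the helper name is hypothetical:

```python
import pandas as pd


def merge_versioned_columns(df: pd.DataFrame, target: str, variants: list[str]) -> pd.DataFrame:
    """Coalesce columns that hold the same field under different form versions.

    For each row, take the first non-null value across the variant columns
    (e.g. "what_is_your_age" and "demography/what_is_your_age"), keep it under
    a single target column, and drop the redundant variants.
    """
    present = [c for c in variants if c in df.columns]
    if not present:
        return df
    df[target] = df[present].bfill(axis=1).iloc[:, 0]
    return df.drop(columns=[c for c in present if c != target])


# Hypothetical usage:
# df = merge_versioned_columns(
#     df, "what_is_your_age",
#     ["what_is_your_age", "demography/what_is_your_age"],
# )
```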
Mapping Projects using a tool like (Co)Mapeo
Projects that involve participatory mapping often include early "practice" points or redundant submissions. These need to be filtered or merged before meaningful analysis.
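A hedged sketch of what that filtering could look like, assuming the exported observations carry a timestamp and a free-text notes field (assumptions for illustration, not a documented (Co)Mapeo schema):

```python
import pandas as pd


def drop_practice_points(df: pd.DataFrame,
                         project_start: str,
                         notes_col: str = "notes",
                         created_col: str = "created_at") -> pd.DataFrame:
    """Remove likely 'practice' submissions made before the project start,
    or whose notes are flagged as test data, and collapse exact duplicates."""
    df = df.copy()
    df[created_col] = pd.to_datetime(df[created_col], errors="coerce")
    before_start = df[created_col] < pd.Timestamp(project_start)
    flagged_test = df[notes_col].fillna("").str.contains("test|practice", case=False)
    cleaned = df[~(before_start | flagged_test)]
    return cleaned.drop_duplicates()
```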
How our current scripts miss the mark
While our users appreciate seeing their raw data visualized in Superset or GC Explorer, they frequently want to clean up this data. Doing so on the front end requires a convoluted process of creating virtual tables with SQL; otherwise, users need to manually export data from Guardian Connector, clean it up, and upload it again (and currently, we have to do this for them).
What could be done?
To address these needs, we could introduce an optional data cleaning step in the ETL process. This could be achieved in a few ways. Let's illustrate how each could work:
Status Quo (Current ETL Flow)
Possibility 1: Manual Cleanup and One-Time Upload
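In this flow, a one-off script (hypothetical names below, not existing tooling) exports the raw table for cleanup and pushes the cleaned file back once; there is no ongoing sync afterwards:

```python
import pandas as pd
from sqlalchemy import create_engine


def export_for_cleanup(db_url: str, table: str, path: str) -> None:
    """Dump the raw table to a CSV that can be cleaned in a spreadsheet or notebook."""
    engine = create_engine(db_url)
    pd.read_sql_table(table, engine).to_csv(path, index=False)


def upload_cleaned(db_url: str, table: str, path: str) -> None:
    """One-time upload of the cleaned CSV, replacing the raw table."""
    engine = create_engine(db_url)
    pd.read_csv(path).to_sql(table, engine, if_exists="replace", index=False)
```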
Possibility 2: Automated ETL with Optional Cleanup Phase
This approach retains automated sync but inserts an optional cleaning layer using external tools.
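One way to picture the cleaning layer (a sketch only; the hook name and signature are assumptions, not an agreed design) is a cleanup callable that the sync job applies between transform and load, falling back to the current behavior when a project has no cleaning rules configured:

```python
from typing import Callable, Optional

import pandas as pd
from sqlalchemy import create_engine

CleanupFn = Callable[[pd.DataFrame], pd.DataFrame]


def run_etl_with_cleanup(fetch_records,
                         db_url: str,
                         table: str,
                         cleanup: Optional[CleanupFn] = None) -> None:
    """Automated sync that keeps the existing flow, but applies an optional
    per-project cleaning step between transform and load."""
    df = pd.DataFrame(fetch_records())
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]  # existing basic transform
    if cleanup is not None:
        # Project-specific rules, e.g. merging versioned Kobo columns
        # or dropping practice mapping points.
        df = cleanup(df)
    engine = create_engine(db_url)
    df.to_sql(table, engine, if_exists="replace", index=False)
```

The per-project rules themselves could be the kind of helpers sketched in the scenarios above, kept outside the core sync so that projects with clean data are unaffected.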