Introducing data cleaning as a potential intermediate step in ETL process #120
Replies: 2 comments
-
For transparency, if we were to do this, it would need to be scoped for a future grant, but it could be part of our 2026 scope of work. At the moment, we are handling data cleanup manually, and we are already learning about the user requirements.
-
These two tools can help in understanding the issue.
-
This thread documents a problem emerging from ongoing programmatic support, along with a potential solution, for discussion.
The problem
In our current ETL approach for Guardian Connector, data is fetched directly from an upstream source, such as an API, or soon, individual file uploads. After applying basic transformations like reformatting column names, the data is then written as-is to the database.
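For context, the current approach amounts to something like the following minimal sketch (pandas and SQLAlchemy, with hypothetical function and table names rather than the actual Guardian Connector scripts):

```python
import pandas as pd
from sqlalchemy import create_engine


def basic_transform(df: pd.DataFrame) -> pd.DataFrame:
    """Apply only light-touch changes, e.g. normalizing column names."""
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df


def run_etl(fetch_records, db_url: str, table_name: str) -> None:
    """Fetch upstream records, reformat column names, and write them as-is."""
    df = pd.DataFrame(fetch_records())  # e.g. rows returned by an API client
    df = basic_transform(df)
    engine = create_engine(db_url)
    df.to_sql(table_name, engine, if_exists="replace", index=False)
```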
This works well for projects with clean and consistent data. However, many projects require additional data cleaning to be usable. Common needs include the following scenarios.
Example Scenarios
KoboToolbox/ODK Forms
When forms are revised mid-project, a field like what_is_your_age may appear under different groupings in different versions, leading to columns like what_is_your_age and demography/what_is_your_age.
These values should ideally be merged during cleanup.
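As a rough illustration of that merge, here is a minimal pandas sketch; it is not part of the current scripts, and the helper name is hypothetical:

```python
import pandas as pd


def merge_versioned_columns(df: pd.DataFrame, target: str, variants: list[str]) -> pd.DataFrame:
    """Coalesce columns that hold the same field under different form versions.

    For each row, take the first non-null value across the variant columns
    (e.g. "what_is_your_age" and "demography/what_is_your_age"), keep it under
    a single target column, and drop the redundant variants.
    """
    present = [c for c in variants if c in df.columns]
    if not present:
        return df
    df[target] = df[present].bfill(axis=1).iloc[:, 0]
    return df.drop(columns=[c for c in present if c != target])


# Hypothetical usage:
# df = merge_versioned_columns(
#     df, "what_is_your_age",
#     ["what_is_your_age", "demography/what_is_your_age"],
# )
```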
Mapping Projects using a tool like (Co)Mapeo
Projects that involve participatory mapping often include early "practice" points or redundant submissions. These need to be filtered or merged before meaningful analysis.
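A hedged sketch of what that filtering could look like, assuming the exported observations carry a timestamp and a free-text notes field (assumptions for illustration, not a documented (Co)Mapeo schema):

```python
import pandas as pd


def drop_practice_points(df: pd.DataFrame,
                         project_start: str,
                         notes_col: str = "notes",
                         created_col: str = "created_at") -> pd.DataFrame:
    """Remove likely 'practice' submissions made before the project start,
    or whose notes are flagged as test data, and collapse exact duplicates."""
    df = df.copy()
    df[created_col] = pd.to_datetime(df[created_col], errors="coerce")
    before_start = df[created_col] < pd.Timestamp(project_start)
    flagged_test = df[notes_col].fillna("").str.contains("test|practice", case=False)
    cleaned = df[~(before_start | flagged_test)]
    return cleaned.drop_duplicates()
```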
How our current scripts miss the mark
While our users appreciate seeing their raw data visualized in Superset or GC Explorer, they frequently want to clean up this data. Doing so on the front end requires a convoluted process of creating virtual tables with SQL; otherwise, users need to manually export data from Guardian Connector, clean it up, and upload it again (and currently, we have to do this for them).
What could be done?
To address these needs, we could introduce an optional data cleaning step in the ETL process. This could be achieved in a few ways. Let's illustrate how each could work:
Status Quo (Current ETL Flow)
Possibility 1: Manual Cleanup and One-Time Upload
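In this flow, a one-off script (hypothetical names below, not existing tooling) exports the raw table for cleanup and pushes the cleaned file back once; there is no ongoing sync afterwards:

```python
import pandas as pd
from sqlalchemy import create_engine


def export_for_cleanup(db_url: str, table: str, path: str) -> None:
    """Dump the raw table to a CSV that can be cleaned in a spreadsheet or notebook."""
    engine = create_engine(db_url)
    pd.read_sql_table(table, engine).to_csv(path, index=False)


def upload_cleaned(db_url: str, table: str, path: str) -> None:
    """One-time upload of the cleaned CSV, replacing the raw table."""
    engine = create_engine(db_url)
    pd.read_csv(path).to_sql(table, engine, if_exists="replace", index=False)
```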
Possibility 2: Automated ETL with Optional Cleanup Phase
This approach retains automated sync but inserts an optional cleaning layer using external tools.
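One way to picture the cleaning layer (a sketch only; the hook name and signature are assumptions, not an agreed design) is a cleanup callable that the sync job applies between transform and load, falling back to the current behavior when a project has no cleaning rules configured:

```python
from typing import Callable, Optional

import pandas as pd
from sqlalchemy import create_engine

CleanupFn = Callable[[pd.DataFrame], pd.DataFrame]


def run_etl_with_cleanup(fetch_records,
                         db_url: str,
                         table: str,
                         cleanup: Optional[CleanupFn] = None) -> None:
    """Automated sync that keeps the existing flow, but applies an optional
    per-project cleaning step between transform and load."""
    df = pd.DataFrame(fetch_records())
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]  # existing basic transform
    if cleanup is not None:
        # Project-specific rules, e.g. merging versioned Kobo columns
        # or dropping practice mapping points.
        df = cleanup(df)
    engine = create_engine(db_url)
    df.to_sql(table, engine, if_exists="replace", index=False)
```

The per-project rules themselves could be the kind of helpers sketched in the scenarios above, kept outside the core sync so that projects with clean data are unaffected.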