Add a "Dataset Importer" Windmill app #137


Open

wants to merge 51 commits into base: main

Commits (51)
b588d70
Add stepper and db table check
rudokemper Jul 31, 2025
04a9ebd
Better state management for step 1
rudokemper Jul 31, 2025
a102766
Reactive design for step 2
rudokemper Jul 31, 2025
deda729
Save either base64 encoded or string to temp
rudokemper Jul 31, 2025
5e17c00
Correct behavior for saving converted file to temp and setting state
rudokemper Jul 31, 2025
f241854
Exclude XLS and other similar files from unzipping
rudokemper Aug 1, 2025
cb90fcb
Ensure actual XLS fixture is converted well
rudokemper Aug 1, 2025
4de6f34
Yes, we DO need to import openpyxl
rudokemper Aug 1, 2025
a063ec0
Only allow xls and xlsx for now
rudokemper Aug 1, 2025
9c3c1dd
Finishing touches to step 2 scripts
rudokemper Aug 1, 2025
546beff
Add step 4 for finalization (all but script)
rudokemper Aug 1, 2025
6646445
Rename to dataset importer
rudokemper Aug 4, 2025
06ca627
Better state management for tracked vars
rudokemper Aug 4, 2025
cd2508e
Better script names
rudokemper Aug 4, 2025
f8feaea
Remove brackets from script names to prevent errors
rudokemper Aug 4, 2025
59235ee
Add force SQL name function
rudokemper Aug 4, 2025
1d344cb
Pass dataset and valid SQL version; document TODOs
rudokemper Aug 4, 2025
ad3b987
Return file format when converting
rudokemper Aug 4, 2025
ad8ceac
Adjust data conversion tests
rudokemper Aug 4, 2025
c1828e1
Add test for force valid sql name
rudokemper Aug 4, 2025
9bf3a2d
Bump pyodk and dep versions
rudokemper Aug 4, 2025
8e38076
App can work with ODK transformation
rudokemper Aug 4, 2025
3625f49
Finalize flow - working with transformations
rudokemper Aug 4, 2025
cb81896
Refactor scripts and add docstrings
rudokemper Aug 4, 2025
cd19657
Resolve database name
rudokemper Aug 4, 2025
188895c
Adapt step naming convention for file eval upload
rudokemper Aug 4, 2025
8b8cdce
Remove GC File Uploader (Locus Map)
rudokemper Aug 4, 2025
97fb451
Clarify TODOs in comments
rudokemper Aug 4, 2025
2989959
Add data source column even if there are no transformations
rudokemper Aug 5, 2025
769023d
Add CSV to Postgres script
rudokemper Aug 5, 2025
1e42260
Additional refactor: move testable code to common_logic
rudokemper Aug 5, 2025
cbf262a
Small test changes
rudokemper Aug 5, 2025
0e4bb6e
Add feature id in kml and gpx conversion
rudokemper Aug 5, 2025
3cfaa01
Add README
rudokemper Aug 5, 2025
98937b1
Add note to root readme
rudokemper Aug 5, 2025
7420411
UI improvements
rudokemper Aug 5, 2025
124eab3
Add additional TODOs for completion
rudokemper Aug 5, 2025
d670148
Add additional TODOs
rudokemper Aug 6, 2025
6443407
Get CoMapeo working (sort of)
rudokemper Aug 6, 2025
fa3765a
Improved error returning in result messages
rudokemper Aug 7, 2025
12fcd4e
Merge branch 'main' of github.com:conservationmetrics/gc-scripts-hub …
rudokemper Aug 12, 2025
2130c4e
Bugfix: error in upload process shows correctly
rudokemper Aug 12, 2025
e5d039c
Sniff geojson files in .json extension
rudokemper Aug 12, 2025
1f0f91d
Deterministic uuid creation for missing geojson ids
rudokemper Aug 12, 2025
fd27248
Merge branch 'main' of github.com:conservationmetrics/gc-scripts-hub …
rudokemper Aug 14, 2025
9bbfbfd
Merge branch 'main' of github.com:conservationmetrics/gc-scripts-hub …
rudokemper Aug 14, 2025
fdcbe33
Merge branch 'main' of github.com:conservationmetrics/gc-scripts-hub …
rudokemper Aug 20, 2025
ef5e190
Improved UI behavior
rudokemper Aug 20, 2025
80150f8
Merge remote-tracking branch 'origin/main' into windmill-app-gc-file-…
rudokemper Aug 20, 2025
0a70989
Merge remote-tracking branch 'origin/main' into windmill-app-gc-file-…
rudokemper Aug 20, 2025
8891327
Add TODO about more robust dataset name case handling
rudokemper Aug 20, 2025
README.md (1 addition, 0 deletions)

@@ -19,6 +19,7 @@ Some of the tools available in the Guardian Connector Scripts Hub are:
* A flow to download and store GeoJSON and GeoTIFF change detection alerts, post these to a CoMapeo Archive Server API, and send a message to WhatsApp recipients via Twilio.
* Scripts to export data from a database into a specific format (e.g., GeoJSON).
* An app to import and transform datasets from a variety of file formats and sources into a PostgreSQL database.

![Available scripts, flows, and apps in gc-scripts-hub](gc-scripts-hub.jpg)
_A Windmill Workspace populated with some of the tools in this repository._
@@ -0,0 +1,16 @@
const { currentStepIndex, lastAction } = formStepper;

// Step 1: Dataset name must be valid
if (currentStepIndex === 0 && !state.datasetAvailable) {
throw new Error("Please enter a valid dataset name to proceed.");
}

// Step 2: File must be uploaded
if (currentStepIndex === 1 && lastAction === "next" && !state.uploadSuccess) {
throw new Error("Please upload your file to proceed.");
}

// Step 4: Can't reuse same session
if (currentStepIndex === 3 && state.finalizeSuccess) {
throw new Error("Please refresh the page to upload another file.");
}
@@ -0,0 +1,2 @@
# py313
psycopg2-binary==2.9.10
@@ -0,0 +1,14 @@
from f.common_logic.db_operations import check_if_table_exists, conninfo, postgresql
from f.common_logic.identifier_utils import normalize_identifier


def main(db: postgresql, dataset_name: str):
valid_sql_name = normalize_identifier(dataset_name)

table_exists = check_if_table_exists(conninfo(db), valid_sql_name)

return {
"tableExists": table_exists,
"datasetName": dataset_name,
"validSqlName": valid_sql_name,
}
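The `normalize_identifier` helper used above lives in `f.common_logic.identifier_utils` and isn't shown in this diff. A minimal standalone sketch of what such a normalizer might do (a hypothetical stand-in, not the actual helper):

```python
import re

def normalize_identifier_sketch(name: str) -> str:
    # Hypothetical stand-in for normalize_identifier: lowercase, collapse runs of
    # non-alphanumeric characters to underscores, and guard against a leading
    # digit (PostgreSQL identifiers may not start with one).
    ident = re.sub(r"[^a-z0-9_]+", "_", name.strip().lower()).strip("_")
    if ident and ident[0].isdigit():
        ident = f"_{ident}"
    return ident or "_unnamed"
```

With a normalizer along these lines, a user-facing name like `"My Dataset (2025)"` would map to the table name `my_dataset_2025`, which is what the availability check then looks up.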
@@ -0,0 +1,8 @@
switch (state.datasetAvailable) {
case true:
return `✅ Dataset name is available! The database table name will be "${state.validSqlName}".`;
case false:
return "⚠️ Dataset name is already in use.";
default:
return "";
}
@@ -0,0 +1 @@
return state.fileNameOriginal
@@ -1 +1,2 @@
state.uploadSuccess = false;
state.uploadButtonEnabled = Boolean(selectFile?.result);
@@ -0,0 +1,3 @@
// if (selectFile.result && uploadFile.result && !uploadFile.result.error) {
// state.uploadSuccess = true;
// }
@@ -0,0 +1,7 @@
if (state.uploadButtonEnabled && state.uploadSuccess) {
return "✅ File successfully uploaded to temporary storage! Please proceed to the next step to finish writing the data to the warehouse."
} else if (state.uploadButtonEnabled && !state.uploadSuccess && state.uploadErrorMessage) {
return `❌ ${state.uploadErrorMessage}`
} else {
return ""
}
@@ -0,0 +1,16 @@
# py313
attrs==25.3.0
certifi==2025.8.3
click==8.2.1
click-plugins==1.1.1.2
cligj==0.7.2
et-xmlfile==2.0.0
filetype==1.2.0
fiona==1.10.1
numpy==2.3.2
openpyxl==3.1.5
pandas==2.3.1
python-dateutil==2.9.0.post0
pytz==2025.2
six==1.17.0
tzdata==2025.2
@@ -0,0 +1,85 @@
import csv
import json
import logging
from io import StringIO
from pathlib import Path

from f.common_logic.data_conversion import convert_data, detect_structured_data_type
from f.common_logic.file_operations import save_uploaded_file_to_temp

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def main(uploaded_file, dataset_name):
"""
Process uploaded file and convert to standardized format.

Takes an uploaded file, detects its format, and converts it to either CSV or GeoJSON
depending on the data type. Saves both original and converted files to a dataset-specific
temporary directory for further processing.

Parameters
----------
uploaded_file : object or list
File object or list containing uploaded file data.
dataset_name : str
Name of the dataset, used for creating temp directory paths.

Returns
-------
tuple[bool, str | None, str | None, str | None]
A tuple containing (success, error_message, output_filename, output_format):
- success : bool
True if processing completed successfully, False if an error occurred.
- error_message : str or None
Error message if success is False, None if success is True.
- output_filename : str or None
Name of the converted file with '_parsed' suffix if successful, None if failed.
- output_format : str or None
Format of converted file ('csv' or 'geojson') if successful, None if failed.
"""
try:
logger.info(f"Starting file upload and conversion for dataset: {dataset_name}")

temp_dir = Path(f"/persistent-storage/tmp/{dataset_name}")
temp_dir.mkdir(parents=True, exist_ok=True)
logger.info(f"Created dataset temp directory: {temp_dir}")

saved_input = save_uploaded_file_to_temp(uploaded_file, tmp_dir=str(temp_dir))
input_path = saved_input["file_paths"][0]
logger.info(f"Saved original file to: {input_path}")

file_format = detect_structured_data_type(input_path)
logger.info(f"Detected file format: {file_format}")

converted_data, output_format = convert_data(input_path, file_format)
logger.info(f"Converted to format: {output_format}")

output_filename = f"{Path(input_path).stem}_parsed.{output_format}"

if output_format == "csv":
output = StringIO()
writer = csv.writer(output)
writer.writerows(converted_data)
csv_data = output.getvalue()

file_to_save = [{"name": output_filename, "data": csv_data}]
else: # geojson
file_to_save = [
{"name": output_filename, "data": json.dumps(converted_data)}
]

saved_output = save_uploaded_file_to_temp(
file_to_save, is_base64=False, tmp_dir=str(temp_dir)
)
output_path = saved_output["file_paths"][0]
logger.info(f"Saved parsed file to: {output_path}")

# Return success
return True, None, output_filename, output_format

except Exception as e:
error_msg = f"Error during file upload and conversion: {e}"
logger.error(error_msg)
return False, error_msg, None, None
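The CSV branch of the script above serializes the converted rows by pointing `csv.writer` at an in-memory `StringIO` buffer rather than a file. The pattern in isolation, with illustrative sample rows (header first, as `convert_data` is assumed to produce for tabular input):

```python
import csv
from io import StringIO

# Sample rows standing in for the output of convert_data on tabular input.
rows = [["id", "name"], ["1", "river"], ["2", "forest"]]

buf = StringIO()
csv.writer(buf).writerows(rows)
csv_data = buf.getvalue()  # single string, ready to hand to the temp-file saver
```

Writing to `StringIO` avoids a second temp file for the intermediate representation; the resulting string is passed to `save_uploaded_file_to_temp` with `is_base64=False`, just as in the script.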
@@ -0,0 +1,5 @@
if (dataSourceToggle.result && state.dataSource) {
return state.dataSource
} else {
return "None selected"
}
@@ -0,0 +1,5 @@
if (dataSourceToggle.result) {
state.dataSource = dataSources.result;
} else {
state.dataSource = undefined;
}
@@ -0,0 +1,13 @@
# py313
annotated-types==0.7.0
certifi==2025.8.3
charset-normalizer==3.4.2
idna==3.10
psycopg2-binary==2.9.10
pydantic==2.9.2
pydantic-core==2.23.4
pyodk==1.2.1
requests==2.32.3
toml==0.10.2
typing-extensions==4.14.1
urllib3==2.5.0