data utils #127

jpopesculian · 2025-04-11T15:41:44Z

Implements utilities for inferring schemas and converting CSV and JSON files to Parquet files.

Copilot

Copilot reviewed 49 out of 53 changed files in this pull request and generated no comments.

Files not reviewed (4)

data-utils/.gitignore: Language not supported
data-utils/Makefile: Language not supported
data-utils/tests/data/csv/custom_null_test.csv: Language not supported
data-utils/tests/data/csv/decimal_test.csv: Language not supported

Angel-Dijoux

This is just amazing; I'm impressed, to be honest. I have just a few suggestions and comments!

Angel-Dijoux · 2025-04-24T20:20:56Z

data-utils/Makefile

+
+.PHONY: wasm
+wasm:
+	RUSTFLAGS="${RUSTFLAGS}" wasm-pack build --target web --out-name index  ${BUILDFLAGS} --features default-wasm


Should we use weak references here? See https://rustwasm.github.io/docs/wasm-bindgen/reference/weak-references.html

Ooh, I didn't know about the finalization registry. Yeah, that's definitely something we'll want for production builds. I'd like to use a zero-copy approach for the actual blob download when doing Parquet conversions. Hopefully that won't cause de-allocation problems, since the JS side keeps a reference to the internal wasm memory, but I'll get to that when I get to that.

jpopesculian · 2025-05-16T16:08:39Z

Added an example repo https://github.com/aqora-io/data-utils-playground

volgar1x · 2025-05-20T18:33:44Z

Hey @jpopesculian 👋 I haven’t reviewed yet sorry, but what are the next steps for this PR considering the dataset feature over at the platform?

jpopesculian · 2025-05-20T18:40:36Z

For this PR, it's just about integrating it into the frontend and using it to convert user files into Parquet before uploading.

volgar1x · 2025-05-21T10:18:54Z

I have test failures with `cargo test -p aqora-data-utils`:

failures:
    csv::test::basic_example
    csv::test::example_no_headers
    csv::test::null_test
    json::test::basic_json
    json::test::basic_jsonl
    json::test::basic_lists_json
    json::test::basic_lists_jsonl
    json::test::basic_lists_no_headers_json
    json::test::basic_lists_no_headers_jsonl
    json::test::convert_basic_json

volgar1x · 2025-05-21T11:11:58Z

Should the commands `aqora data infer` and `aqora data convert` stay in the final version of the PR? How should users invoke them?

I was under the impression the user journey would look like this:

1. Create a dataset with `aqora data init {DATASET_NAME} {FOLDER}`.
2. Register dataset files with `aqora data add {FILENAME}`. This is where we infer a schema, then write it to pyproject.toml.
3. Tweak the schema in pyproject.toml.
4. Get info using `aqora dataset`:
   - Currently inferred schema
   - Row count
   - Total size (although this may be misleading, since the converted Parquet files usually have a smaller footprint than the CSV/JSON sources)
5. Then run `aqora upload`. This is where we convert the files and upload the Parquet to the platform.
6. Or instead run `aqora upload --dry-run` to convert the files without uploading them.
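To make the "infer a schema" step concrete, here is a minimal sketch of per-column type inference over string cells. It is not the code from this PR: `ColumnType`, `infer_cell`, `widen`, and `infer_schema` are hypothetical names, and the widening order (null, then integer, then float, then text) is an assumption chosen only for illustration.

```rust
/// Hypothetical column types, ordered from most to least specific.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColumnType {
    Null,
    Int,
    Float,
    Text,
}

/// Guess the narrowest type that can represent a single cell.
/// Note: Rust's `f64` parser also accepts strings such as "inf" and "NaN".
fn infer_cell(cell: &str) -> ColumnType {
    let cell = cell.trim();
    if cell.is_empty() {
        ColumnType::Null
    } else if cell.parse::<i64>().is_ok() {
        ColumnType::Int
    } else if cell.parse::<f64>().is_ok() {
        ColumnType::Float
    } else {
        ColumnType::Text
    }
}

/// Combine two guesses into the narrowest type that covers both.
fn widen(a: ColumnType, b: ColumnType) -> ColumnType {
    use ColumnType::*;
    match (a, b) {
        (Null, t) | (t, Null) => t,
        (Text, _) | (_, Text) => Text,
        (Float, _) | (_, Float) => Float,
        (Int, Int) => Int,
    }
}

/// Infer one type per column; short rows simply contribute nothing
/// to the columns they are missing.
fn infer_schema(rows: &[Vec<&str>]) -> Vec<ColumnType> {
    let width = rows.iter().map(Vec::len).max().unwrap_or(0);
    let mut schema = vec![ColumnType::Null; width];
    for row in rows {
        for (slot, cell) in schema.iter_mut().zip(row) {
            *slot = widen(*slot, infer_cell(cell));
        }
    }
    schema
}
```

In this sketch a column that only ever sees empty cells stays `Null`, which is one reason a user might want to tweak the inferred schema by hand before uploading.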

Angel-Dijoux

I've read through it once, and so far the code looks very good 🎉! I just ran into 2 issues:
- I got the same result as @volgar1x when running the tests
- ~~I cannot compile on my Mac; when I tried, I ran into this error: rust-lang/rust#57349 for the arrow-0.2.3 crate~~

Angel-Dijoux · 2025-05-22T07:06:19Z

data-utils/Makefile

+.PHONY: wasm-build
+wasm-build:
+	RUSTFLAGS="${RUSTFLAGS}" wasm-pack build --target web --out-name index --weak-refs --reference-types \
+		${BUILDFLAGS} --no-default-features --features default-wasm
+


What do you think about using this in cli/.github/workflows/cd.yaml (line 188 in 0c93130)?

wasm-pack build --target web --out-name index --release --weak-refs --reference-types --no-default-features --features default-wasm

Angel-Dijoux · 2025-05-22T07:24:26Z

data-utils/src/value.rs

+
+pub use ron::{Map, Value};
+
+const NAIVE_DATE_TIME_FMT: StrftimeItems<'static> = StrftimeItems::new("%Y-%m-%dT%H:%M:%S");


What is the difference between this and `const NAIVE_DATE_TIME_FMT: &str = "%Y-%m-%dT%H:%M:%S";` followed later by `dt.format(NAIVE_DATE_TIME_FMT)`?

Angel-Dijoux · 2025-05-22T07:32:27Z

data-utils/src/wasm/blob.rs

+    if rem != 0 {
+        index -= 1;
+        out[index].write(vec.split_off(vec.len() - rem).into_boxed_slice());
+        vec.shrink_to_fit();


What does `shrink_to_fit` do here?
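For context on the question: `Vec::split_off` leaves the original vector's capacity unchanged, so after the tail has been moved out the vector still owns memory it no longer needs. `Vec::shrink_to_fit` asks the allocator to drop that excess capacity. The sketch below shows the pattern with the standard library only. `split_into_chunks` is a hypothetical stand-in, not the function from `blob.rs`, and the tail-first chunking is an assumption based on the snippet above.

```rust
/// Hypothetical helper: split `vec` into boxed chunks of at most
/// `chunk_size` bytes, with any shorter remainder chunk placed last.
fn split_into_chunks(mut vec: Vec<u8>, chunk_size: usize) -> Vec<Box<[u8]>> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let mut chunks = Vec::with_capacity((vec.len() + chunk_size - 1) / chunk_size);

    // Peel chunks off the tail so each `split_off` only copies the
    // bytes it removes, never the bytes that stay behind.
    let rem = vec.len() % chunk_size;
    if rem != 0 {
        chunks.push(vec.split_off(vec.len() - rem).into_boxed_slice());
        // `split_off` keeps the old capacity; release the unused tail.
        // Whether this shrinks in place or reallocates is up to the allocator.
        vec.shrink_to_fit();
    }
    while !vec.is_empty() {
        chunks.push(vec.split_off(vec.len() - chunk_size).into_boxed_slice());
        vec.shrink_to_fit();
    }

    // Chunks were collected back to front; restore the original order.
    chunks.reverse();
    chunks
}
```

Without the `shrink_to_fit` calls the result would be the same, but the source vector would hold on to its full original allocation until it is dropped. With a large input that can mean close to double the peak memory while the chunks are being built.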

Angel-Dijoux · 2025-05-22T07:33:58Z

data-utils/src/wasm/blob.rs

+            .map(|buf| vec_to_blob(buf.into_inner(), 65_536, &options))
+            .collect()
+    }
+    // }


oh a little piece of an old story 😆

jpopesculian force-pushed the feat-data-utils branch from 97ee0a8 to f7491f0 Compare April 11, 2025 15:42

jpopesculian requested review from Copilot and volgar1x April 11, 2025 15:42

Copilot AI reviewed Apr 11, 2025

jpopesculian force-pushed the feat-data-utils branch 3 times, most recently from 643d3c3 to 9738195 Compare April 11, 2025 22:18

Angel-Dijoux force-pushed the feat-data-utils branch 5 times, most recently from 1403af3 to 9738195 Compare April 18, 2025 09:12

jpopesculian force-pushed the feat-data-utils branch 4 times, most recently from 6f98011 to 63671aa Compare April 22, 2025 15:37

DarkSylver approved these changes Apr 22, 2025

jpopesculian force-pushed the feat-data-utils branch 2 times, most recently from 6de7954 to d6a7d64 Compare April 22, 2025 16:38

Angel-Dijoux reviewed Apr 25, 2025

jpopesculian force-pushed the feat-data-utils branch 10 times, most recently from a765a12 to abdbca6 Compare April 28, 2025 16:12

jpopesculian force-pushed the feat-data-utils branch 19 times, most recently from f5f7ede to 12ff923 Compare May 16, 2025 13:57

jpopesculian force-pushed the feat-data-utils branch 2 times, most recently from 207ca58 to 0c93130 Compare May 19, 2025 16:06

Angel-Dijoux requested changes May 22, 2025

jpopesculian force-pushed the feat-data-utils branch from 0c93130 to d302d3b Compare May 23, 2025 13:49

feat: data utils

db9cc76

implements utilities for inferring schemas and converting CSVs and JSONs to parquet files

jpopesculian force-pushed the feat-data-utils branch from d302d3b to db9cc76 Compare June 3, 2025 15:59

