data utils #127

Open: jpopesculian wants to merge 1 commit into main from feat-data-utils

Contributor

@jpopesculian jpopesculian commented Apr 11, 2025

Implements utilities for inferring schemas and converting CSV and JSON files to Parquet.

@Copilot Copilot AI left a comment

Copilot reviewed 49 out of 53 changed files in this pull request and generated no comments.

Files not reviewed (4)
  • data-utils/.gitignore: Language not supported
  • data-utils/Makefile: Language not supported
  • data-utils/tests/data/csv/custom_null_test.csv: Language not supported
  • data-utils/tests/data/csv/decimal_test.csv: Language not supported

@jpopesculian jpopesculian force-pushed the feat-data-utils branch 3 times, most recently from 643d3c3 to 9738195 Compare April 11, 2025 22:18
@Angel-Dijoux Angel-Dijoux force-pushed the feat-data-utils branch 5 times, most recently from 1403af3 to 9738195 Compare April 18, 2025 09:12
@jpopesculian jpopesculian force-pushed the feat-data-utils branch 4 times, most recently from 6f98011 to 63671aa Compare April 22, 2025 15:37
@jpopesculian jpopesculian force-pushed the feat-data-utils branch 2 times, most recently from 6de7954 to d6a7d64 Compare April 22, 2025 16:38
Contributor

@Angel-Dijoux Angel-Dijoux left a comment

This is just amazing; I'm impressed, to be honest. I have just a few suggestions and comments!


.PHONY: wasm
wasm:
	RUSTFLAGS="${RUSTFLAGS}" wasm-pack build --target web --out-name index ${BUILDFLAGS} --features default-wasm
Contributor

Contributor Author

ooh, i didn't know about the finalization registry. yeah, that's definitely something we'll want for production builds. i'd like to use a zero-copy approach for the actual blob download when doing parquet conversions. hopefully that won't cause de-allocation problems, since the JS side keeps a reference to the internal wasm memory, but i'll get to that when i get to that

@jpopesculian jpopesculian force-pushed the feat-data-utils branch 10 times, most recently from a765a12 to abdbca6 Compare April 28, 2025 16:12
@jpopesculian jpopesculian force-pushed the feat-data-utils branch 19 times, most recently from f5f7ede to 12ff923 Compare May 16, 2025 13:57
@jpopesculian
Contributor Author

Added an example repo https://github.com/aqora-io/data-utils-playground

@jpopesculian jpopesculian force-pushed the feat-data-utils branch 2 times, most recently from 207ca58 to 0c93130 Compare May 19, 2025 16:06
Contributor

Hey @jpopesculian 👋 Sorry, I haven’t reviewed this yet. What are the next steps for this PR, given the dataset feature over on the platform?

Contributor Author

For this PR, it's just about integrating it into the frontend and using it to convert user files to parquet before uploading.

@volgar1x
Contributor

I have test failures with cargo test -p aqora-data-utils:

failures:
    csv::test::basic_example
    csv::test::example_no_headers
    csv::test::null_test
    json::test::basic_json
    json::test::basic_jsonl
    json::test::basic_lists_json
    json::test::basic_lists_jsonl
    json::test::basic_lists_no_headers_json
    json::test::basic_lists_no_headers_jsonl
    json::test::convert_basic_json

@volgar1x
Contributor

Should the commands aqora data infer and aqora data convert stay in the final version of the PR? How should they be used by the user?

I was under the impression the user journey would look like this:

  1. Create a dataset with aqora data init {DATASET_NAME} {FOLDER}.
  2. Register dataset files with aqora data add {FILENAME}.
    This is where we infer a schema and write it to pyproject.toml.
  3. Tweak the schema in pyproject.toml.
  4. Run aqora dataset to get:
    • the currently inferred schema,
    • the row count,
    • the total size (although this may be misleading, since the converted parquet files usually have a smaller footprint than the CSV/JSON sources).
  5. Then run aqora upload.
    This is where we convert the files and upload the parquet to the platform.
  6. Or run aqora upload --dry-run instead to convert the files without uploading them.
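
To make the "infer a schema" step in this journey concrete, here is a toy sketch of per-column type inference over CSV text. It is not the aqora-data-utils implementation: the names ColumnType, infer_cell, widen and infer_schema are hypothetical, the comma splitting ignores quoting, and only four types are modelled. It only illustrates the idea of widening a column's type as more rows are seen.

```rust
/// Column types this sketch can infer (illustrative, not the crate's types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColumnType {
    Null,
    Int64,
    Float64,
    Utf8,
}

/// Guess the narrowest type that can hold a single cell.
/// Note: Rust's f64 parser also accepts strings such as "inf" and "NaN".
fn infer_cell(cell: &str) -> ColumnType {
    let cell = cell.trim();
    if cell.is_empty() {
        ColumnType::Null
    } else if cell.parse::<i64>().is_ok() {
        ColumnType::Int64
    } else if cell.parse::<f64>().is_ok() {
        ColumnType::Float64
    } else {
        ColumnType::Utf8
    }
}

/// Merge two guesses into the narrowest type that covers both.
fn widen(a: ColumnType, b: ColumnType) -> ColumnType {
    use ColumnType::*;
    match (a, b) {
        (Null, t) | (t, Null) => t,
        (Int64, Int64) => Int64,
        (Int64, Float64) | (Float64, Int64) | (Float64, Float64) => Float64,
        _ => Utf8,
    }
}

/// Infer (header, type) pairs from CSV text whose first line is the header.
/// Assumption: fields never contain quoted commas or embedded newlines.
fn infer_schema(csv: &str) -> Vec<(String, ColumnType)> {
    let mut lines = csv.lines();
    let headers: Vec<String> = match lines.next() {
        Some(header) => header.split(',').map(|s| s.trim().to_string()).collect(),
        None => return Vec::new(),
    };
    let mut types = vec![ColumnType::Null; headers.len()];
    for line in lines {
        for (slot, cell) in types.iter_mut().zip(line.split(',')) {
            *slot = widen(*slot, infer_cell(cell));
        }
    }
    headers.into_iter().zip(types).collect()
}

fn main() {
    let csv = "id,price,name\n1,2.5,apple\n2,3,\n";
    for (name, ty) in infer_schema(csv) {
        println!("{name}: {ty:?}");
    }
}
```

An empty cell counts as Null and never narrows a column, which is why a schema produced this way would still benefit from the manual tweaking in step 3.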

Contributor

@Angel-Dijoux Angel-Dijoux left a comment

I've read through it once, and so far the code looks very good 🎉! I ran into just 2 issues:
- I got the same test failures as @volgar1x
- I cannot compile on my Mac: I hit rust-lang/rust#57349 in the arrow-0.2.3 crate

Comment on lines +20 to +24
.PHONY: wasm-build
wasm-build:
	RUSTFLAGS="${RUSTFLAGS}" wasm-pack build --target web --out-name index --weak-refs --reference-types \
		${BUILDFLAGS} --no-default-features --features default-wasm

Contributor

what do you think about using this in

wasm-pack build --target web --out-name index --release --weak-refs --reference-types --no-default-features --features default-wasm
?


pub use ron::{Map, Value};

const NAIVE_DATE_TIME_FMT: StrftimeItems<'static> = StrftimeItems::new("%Y-%m-%dT%H:%M:%S");
Contributor

What is the difference between this and const NAIVE_DATE_TIME_FMT: &str = "%Y-%m-%dT%H:%M:%S"; followed later by dt.format(NAIVE_DATE_TIME_FMT)?

if rem != 0 {
    index -= 1;
    out[index].write(vec.split_off(vec.len() - rem).into_boxed_slice());
    vec.shrink_to_fit();
Contributor

What does this function shrink_to_fit do?
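
For context on the question: Vec::split_off moves the tail into a new vector but leaves the original's capacity unchanged, and Vec::shrink_to_fit asks the allocator to release that unused capacity (some excess may still remain). The sketch below shows the pattern in isolation. split_into_chunks is a hypothetical helper loosely modelled on the snippet above, not the PR's actual function, and it returns plain boxed slices rather than blobs.

```rust
/// Split `vec` into `chunk_size`-byte chunks, peeling the short remainder
/// off the end first (hypothetical helper, not the function under review).
fn split_into_chunks(mut vec: Vec<u8>, chunk_size: usize) -> Vec<Box<[u8]>> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let rem = vec.len() % chunk_size;
    let mut chunks = Vec::new();
    if rem != 0 {
        // split_off moves the last `rem` bytes into a new Vec;
        // `vec` keeps its previous capacity.
        chunks.push(vec.split_off(vec.len() - rem).into_boxed_slice());
        // Ask the allocator to drop the capacity `vec` no longer needs.
        vec.shrink_to_fit();
    }
    // `vec.len()` is now a multiple of `chunk_size`; peel full chunks from the end.
    while !vec.is_empty() {
        chunks.push(vec.split_off(vec.len() - chunk_size).into_boxed_slice());
    }
    // Chunks were collected back to front, so restore the original order.
    chunks.reverse();
    chunks
}

fn main() {
    let mut vec: Vec<u8> = Vec::with_capacity(1024);
    vec.extend(0..10);
    let tail = vec.split_off(8);
    println!("tail len: {}, capacity before shrink: {}", tail.len(), vec.capacity());
    vec.shrink_to_fit();
    println!("capacity after shrink: {}", vec.capacity());

    let chunks = split_into_chunks((0u8..10).collect(), 4);
    println!("chunk sizes: {:?}", chunks.iter().map(|c| c.len()).collect::<Vec<_>>());
}
```

Whether the call pays off depends on how long the vector lives afterwards: for a buffer that is about to be dropped or fully consumed, it may only add a reallocation.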

.map(|buf| vec_to_blob(buf.into_inner(), 65_536, &options))
.collect()
}
// }
Contributor

oh a little piece of an old story 😆
