
Feature Request: Add dimensionality to queries #53

Open
ItsMeBrianD opened this issue Jul 11, 2024 · 2 comments
@ItsMeBrianD (Member)

DuckDB has a concept of Hive Partitioning that we want to mimic for large datasets (e.g. player stats) to prevent massive queries from being run.

This will come in two parts:

Defining Dimensions

I propose that we use frontmatter in the SQL files to define dimensionality, where each dimension is defined by a query whose results enumerate that dimension's values.

Single Dimension Query

set.sql

---
    dimensions:
        season: SELECT someColumn as dimension FROM someTable
---
SELECT * 
FROM someTable
WHERE someColumn = ${season}

This would result in a folder structure 1 level deep:

set/
    season=1/
        data.parquet
        data.csv
        data.json
    season=2/
        data.parquet
        data.csv
        data.json
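A minimal sketch of how this could work, assuming the frontmatter stays as simple one-line `name: query` entries (all helper names here are illustrative, not an existing implementation):

```python
# Sketch only: parse the proposed SQL frontmatter and expand the body
# query for one dimension value. All names are illustrative.
import re
from string import Template

FRONTMATTER = re.compile(r"^---\n(.*?)\n---\n(.*)$", re.DOTALL)

def parse_dimensions(sql_text):
    """Split a set.sql-style file into ({dimension: query}, body_sql)."""
    m = FRONTMATTER.match(sql_text)
    if not m:
        return {}, sql_text
    header, body = m.groups()
    dims = {}
    for line in header.splitlines():
        line = line.strip()
        if not line or line == "dimensions:":
            continue
        name, _, query = line.partition(":")
        dims[name.strip()] = query.strip()
    return dims, body.strip()

def expand(body, values):
    """Substitute ${dimension} placeholders for one value combination."""
    return Template(body).substitute(values)
```

In a real build, each dimension's query would first be run through DuckDB to enumerate its values; here `expand` just performs the `${season}` substitution for one of them.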
Multi Dimension Query

set2.sql

---
    dimensions:
        season: SELECT someColumn as dimension FROM someTable
        match: SELECT someColumn as dimension FROM someMatchTable
---
SELECT * 
FROM someTable
WHERE someColumn = ${season} 
  AND someOtherColumn = ${match}

This would result in a folder structure 2 levels deep:

set2/
    season=1/
        match=1/
            data.parquet
            data.csv
            data.json
        match=2/
            data.parquet
            data.csv
            data.json
    season=2/
        match=1/
            data.parquet
            data.csv
            data.json
        match=2/
            data.parquet
            data.csv
            data.json
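The nesting generalizes: the partition directories are just the cross product of the dimension values, joined in `name=value` form. A stdlib-only sketch (the helper name is mine):

```python
# Sketch only: enumerate Hive-style partition directories for a set,
# given each dimension's values in partition order.
from itertools import product
from pathlib import PurePosixPath

def partition_dirs(root, dim_values):
    """dim_values maps dimension name -> iterable of values."""
    names = list(dim_values)
    for combo in product(*dim_values.values()):
        parts = [f"{name}={value}" for name, value in zip(names, combo)]
        yield PurePosixPath(root, *parts)
```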

Proper output

SQL Snippets

Currently on our dataset pages we have a SQL snippet that creates a view, and download buttons for the files;
the SQL snippet will need to be updated to properly load all the files:

CREATE VIEW set AS (
    SELECT *
    FROM read_parquet([
        'https://example.com/data/set/season=1/data.parquet',
        'https://example.com/data/set/season=2/data.parquet',
        'https://example.com/data/set/season=3/data.parquet' -- etc...
    ], hive_partitioning = true)
);
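Since plain HTTP servers can't be globbed, the site generator has to render that file list itself. A sketch of producing the snippet from the partition URLs known at build time (the function name is hypothetical):

```python
# Sketch only: render the dataset page's CREATE VIEW snippet from the
# list of partition URLs produced during the build.
def view_sql(name, urls):
    file_list = ",\n        ".join(f"'{u}'" for u in urls)
    return (
        f"CREATE VIEW {name} AS (\n"
        "    SELECT *\n"
        "    FROM read_parquet([\n"
        f"        {file_list}\n"
        "    ], hive_partitioning = true)\n"
        ");"
    )
```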

Batch Download

We should also consider creating a .tar.gz file for each format for the entire set that reflects the file structure,
for example the parquet download button would reference a set.parquet.tar.gz file with the contents:

season=1/
    data.parquet
season=2/
    data.parquet

This allows users to still download the entire dataset for local analysis.
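A sketch of producing such an archive with the stdlib `tarfile` module (per-format bundling; the helper name and layout assumptions are mine):

```python
# Sketch only: bundle every data.<fmt> file under a set's directory into
# set.<fmt>.tar.gz, preserving the season=N/ partition layout.
import tarfile
from pathlib import Path

def bundle(set_dir, fmt="parquet"):
    set_dir = Path(set_dir)
    archive = set_dir.with_suffix(f".{fmt}.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        for f in sorted(set_dir.rglob(f"data.{fmt}")):
            # store paths relative to the set, e.g. "season=1/data.parquet"
            tar.add(f, arcname=str(f.relative_to(set_dir)))
    return archive
```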

Manifest

It may also be helpful to produce a manifest of URLs for a partitioned set, so that non-duckdb programs can easily reference all of the files. The structure of this is TBD, and it is not required as part of the first version.
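The structure is TBD as noted, but one possible shape, sketched with assumed field names:

```python
# Sketch only: one possible manifest shape (the real structure is TBD).
import json

def manifest(base_url, set_name, partitions, formats=("parquet", "csv", "json")):
    """partitions: relative partition dirs like "season=1/match=1"."""
    return json.dumps(
        {
            "set": set_name,
            "files": [
                f"{base_url}/{set_name}/{part}/data.{fmt}"
                for part in partitions
                for fmt in formats
            ],
        },
        indent=2,
    )
```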

@ItsMeBrianD (Member, Author)

As another thought, it might make more sense to define dimensions in the sibling markdown file instead of trying to handle frontmatter in the SQL itself.

@ItsMeBrianD (Member, Author)

https://gist.github.com/ItsMeBrianD/6a0d9fbf42856a67958ae6e21b60224d

This file works as described above, and just needs to be split into Prefect-style tasks.

One challenge here is building the website at the end: if we want to address #45, then we need
a way of defining which queries are "supposed" to exist, along with which queries "actually" exist
(e.g. figuring out which pages to build while accounting for deleted / failed queries).

Not sure what the best approach to the above is, unless we want to have some sort of "broken" page that is
put up for any queries that failed (?). This would mean we can build the navbar before firing off the queries, and then each query flow is responsible for building its own page (succeed or fail).
