-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: reads using global ctx #982
Conversation
a6fd8e4
to
f7af294
Compare
python/datafusion/io.py
Outdated
|
||
from datafusion.dataframe import DataFrame | ||
from datafusion.expr import Expr | ||
import pyarrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note: it would be great to use ruff (https://stackoverflow.com/a/77876298) or isort to deterministically and programmatically sort python imports, and validate that in CI. I think isort/ruff would have a newline here between the third-party and first-party imports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there a pre-commit config for ruff linter and formatter
datafusion-python/.pre-commit-config.yaml
Lines 23 to 30 in 79c22d6
- repo: https://github.com/astral-sh/ruff-pre-commit | |
# Ruff version. | |
rev: v0.3.0 | |
hooks: | |
# Run the linter. | |
- id: ruff | |
# Run the formatter. | |
- id: ruff-format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the SO answer above explains, import sorting isn't currently part of the default ruff-format
behavior. We'd need to opt-in by adding an I
element here:
datafusion-python/pyproject.toml
Line 66 in 79c22d6
select = ["E4", "E7", "E9", "F", "D", "W"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not opposed to this addition, but there is a potential source of confusion that we can mitigate with documentation. If a new user creates a session context themself and registers functions, and then creates a dataframe using this method, the functions they registered will not be available. I think it could lead to a fair amount of confusion.
I think this is easily mitigated by adding documentation to these functions that describes that it uses a default global session context and if the user needs a custom context they need to use the functions .
ba61c8f
to
b166bdd
Compare
I added a few lines to the documentation, rebased, and applied updated ruff formatting. |
Key Points:
Details:
After replacing all
Just for reference, these are the scripts I used to generate and test the functions:
```
####test.py
from datafusion import SessionContext
Create a new sessionctx = SessionContext() Read different file formatsdf1 = ctx.read_csv("data.csv") # Accepts str or Path print(df1)
####create.py - to create the data files Sample data as a dictionarydata = { Create DataFramedf = pd.DataFrame(data) Save as Parquetdf.to_parquet('data.parquet') Save as JSON (line-delimited)with open('data.json', 'w') as f: Save as Avroschema = { records = df.to_dict('records')
|
There needs to be an initial import There are ruff rules to check this and we should turn them on |
Thanks, Kyle. More generally I’ll see about the impact of turning on all the rules and then removing a few specifically as needed |
Which issue does this PR close?