
feat: reads using global ctx #982

Merged: 2 commits into apache:main on Mar 8, 2025
Conversation

ion-elgreco (Contributor):

Which issue does this PR close?

@ion-elgreco ion-elgreco force-pushed the feat/global_read branch 2 times, most recently from a6fd8e4 to f7af294 Compare December 28, 2024 11:11

from datafusion.dataframe import DataFrame
from datafusion.expr import Expr
import pyarrow
Contributor:

Side note: it would be great to use ruff (https://stackoverflow.com/a/77876298) or isort to deterministically and programmatically sort python imports, and validate that in CI. I think isort/ruff would have a newline here between the third-party and first-party imports.

Contributor:

There is a pre-commit config for the ruff linter and formatter:

- repo: https://github.com/astral-sh/ruff-pre-commit
  # Ruff version.
  rev: v0.3.0
  hooks:
    # Run the linter.
    - id: ruff
    # Run the formatter.
    - id: ruff-format

@kylebarron (Contributor), Jan 7, 2025:

As the SO answer above explains, import sorting isn't currently part of the default ruff-format behavior. We'd need to opt in by adding an `I` element here:

select = ["E4", "E7", "E9", "F", "D", "W"]
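For instance, a sketch of what that opt-in might look like in `pyproject.toml` (the section name `[tool.ruff.lint]` is an assumption based on current ruff conventions; only the trailing `"I"` is the addition being discussed):

```toml
[tool.ruff.lint]
# "I" enables isort-style import-sorting rules on top of the existing selection.
select = ["E4", "E7", "E9", "F", "D", "W", "I"]
```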

@timsaucer (Contributor) left a review comment:

I'm not opposed to this addition, but there is a potential source of confusion that we can mitigate with documentation. If a new user creates a session context themselves, registers functions on it, and then creates a dataframe using this method, the functions they registered will not be available. I think that could lead to a fair amount of confusion.

I think this is easily mitigated by adding documentation to these functions explaining that they use a default global session context, and that users who need a custom context should use the corresponding methods on their own SessionContext.

@timsaucer:

I added a few lines to the documentation, rebased, and applied updated ruff formatting.

@timsaucer timsaucer merged commit acd7040 into apache:main Mar 8, 2025
15 checks passed
@Spaarsh commented Mar 8, 2025:

Key Points:

  1. The | operator does not support type hints on Python < 3.10, so anyone pulling main post-merge on an older Python will not be able to use SessionContext at all.
  2. global_ctx is already exposed to Python.

Details:
The | operator used in all the read_* functions supports type hinting only on Python >= 3.10, where the syntax was introduced (PEP 604). To even import SessionContext, I had to replace all | operations with Union. Until then, I was getting this error:

$ python3
Python 3.9.7 (default, Oct 18 2021, 02:25:46) 
[Clang 13.0.0 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from datafusion import SessionContext
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/df-py/datafusion-python/python/datafusion/__init__.py", line 48, in <module>
    from .io import read_avro, read_csv, read_json, read_parquet
  File "/home/user/df-py/datafusion-python/python/datafusion/io.py", line 31, in <module>
    path: str | pathlib.Path,
TypeError: unsupported operand type(s) for |: 'type' and 'type'

After replacing all | operations with Union, everything works. global_ctx is already being exposed to Python, unless I have misunderstood something — I pulled the branch and tested it, and it works.
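The pre-3.10-compatible form of the annotation looks like this (a minimal sketch; `read_csv` here is only an illustrative signature, not the full datafusion one):

```python
from typing import Union
import pathlib

def read_csv(path: Union[str, pathlib.Path]) -> str:
    """Accepts either a str or a pathlib.Path on any supported Python."""
    return str(path)

print(read_csv(pathlib.Path("data.csv")))  # data.csv
```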

$ python3 test.py 
DataFrame()
+----+----------+-----+---------+------------+-----------+
| id | name     | age | salary  | start_date | is_active |
+----+----------+-----+---------+------------+-----------+
| 1  | John Doe | 32  | 75000.5 | 2020-01-15 | true      |
+----+----------+-----+---------+------------+-----------+
DataFrame()
+----+----------+-----+---------+------------+-----------+
| id | name     | age | salary  | start_date | is_active |
+----+----------+-----+---------+------------+-----------+
| 1  | John Doe | 32  | 75000.5 | 2020-01-15 | true      |
+----+----------+-----+---------+------------+-----------+
DataFrame()
+-----+----+-----------+----------+---------+------------+
| age | id | is_active | name     | salary  | start_date |
+-----+----+-----------+----------+---------+------------+
| 32  | 1  | true      | John Doe | 75000.5 | 2020-01-15 |
+-----+----+-----------+----------+---------+------------+
DataFrame()
+----+----------+-----+---------+------------+-----------+
| id | name     | age | salary  | start_date | is_active |
+----+----------+-----+---------+------------+-----------+
| 1  | John Doe | 32  | 75000.5 | 2020-01-15 | true      |
+----+----------+-----+---------+------------+-----------+

For reference, these are the scripts I used to generate the data files and test the read functions:

```python
# test.py
from datafusion import SessionContext

# Create a new session
ctx = SessionContext()

# Read different file formats
df1 = ctx.read_csv("data.csv")  # Accepts str or Path
df2 = ctx.read_parquet("data.parquet")
df3 = ctx.read_json("data.json")
df4 = ctx.read_avro("data.avro")

print(df1)
print(df2)
print(df3)
print(df4)
```

```python
# create.py - to create the data files
import json

import fastavro
import pandas as pd

# Sample data as a dictionary
data = {
    'id': [1],
    'name': ['John Doe'],
    'age': [32],
    'salary': [75000.50],
    'start_date': ['2020-01-15'],
    'is_active': [True],
}

# Create DataFrame
df = pd.DataFrame(data)

# Save as CSV (test.py reads data.csv as well)
df.to_csv('data.csv', index=False)

# Save as Parquet (uses pyarrow as the engine)
df.to_parquet('data.parquet')

# Save as JSON (line-delimited)
with open('data.json', 'w') as f:
    for _, row in df.iterrows():
        json.dump(row.to_dict(), f)
        f.write('\n')

# Save as Avro
schema = {
    'name': 'Employee',
    'type': 'record',
    'fields': [
        {'name': 'id', 'type': 'int'},
        {'name': 'name', 'type': 'string'},
        {'name': 'age', 'type': 'int'},
        {'name': 'salary', 'type': 'double'},
        {'name': 'start_date', 'type': 'string'},
        {'name': 'is_active', 'type': 'boolean'},
    ],
}

records = df.to_dict('records')
with open('data.avro', 'wb') as f:
    fastavro.writer(f, schema, records)
```

@kylebarron:

There needs to be an initial `from __future__ import annotations` on the first line to be able to use the | typing syntax.
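A minimal sketch of that fix (the `read_csv` signature is illustrative, not the actual datafusion one):

```python
# With this future import, annotations are stored as strings and never
# evaluated at runtime, so `str | pathlib.Path` parses even on Python 3.7+.
from __future__ import annotations

import pathlib

def read_csv(path: str | pathlib.Path) -> str:
    """Illustrative reader signature using PEP 604 union syntax."""
    return str(path)

print(read_csv("data.csv"))  # data.csv
```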

There are ruff rules to check this, and we should turn them on.

@timsaucer
Copy link
Contributor

Thanks, Kyle. More generally, I'll look into the impact of turning on all the rules and then removing a few specifically as needed.

Successfully merging this pull request may close these issues.

Make all read methods available on DataFusion module
5 participants