
feat: reads using global ctx #982

Merged: 2 commits into apache:main on Mar 8, 2025
Conversation

ion-elgreco (Contributor):

Which issue does this PR close?

@ion-elgreco ion-elgreco force-pushed the feat/global_read branch 2 times, most recently from a6fd8e4 to f7af294 Compare December 28, 2024 11:11

from datafusion.dataframe import DataFrame
from datafusion.expr import Expr
import pyarrow
Contributor:

Side note: it would be great to use ruff (https://stackoverflow.com/a/77876298) or isort to deterministically and programmatically sort python imports, and validate that in CI. I think isort/ruff would have a newline here between the third-party and first-party imports.

Contributor:

There is a pre-commit config for the ruff linter and formatter:

- repo: https://github.com/astral-sh/ruff-pre-commit
  # Ruff version.
  rev: v0.3.0
  hooks:
    # Run the linter.
    - id: ruff
    # Run the formatter.
    - id: ruff-format

@kylebarron (Contributor), Jan 7, 2025:

As the SO answer above explains, import sorting isn't currently part of the default ruff-format behavior. We'd need to opt in by adding an `I` element here:

select = ["E4", "E7", "E9", "F", "D", "W"]
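For instance, a sketch of what that opt-in might look like in `pyproject.toml` (the section name `[tool.ruff.lint]` is an assumption based on current ruff conventions; only the trailing `"I"` is the addition being discussed):

```toml
[tool.ruff.lint]
# "I" enables isort-style import-sorting rules on top of the existing selection.
select = ["E4", "E7", "E9", "F", "D", "W", "I"]
```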

@timsaucer (Contributor) left a review comment:

I'm not opposed to this addition, but there is a potential source of confusion that we can mitigate with documentation. If a new user creates a session context themselves, registers functions on it, and then creates a dataframe using this method, the functions they registered will not be available. I think that could lead to a fair amount of confusion.

I think this is easily mitigated by adding documentation to these functions explaining that they use a default global session context, and that users who need a custom context should use the corresponding methods on their own SessionContext.

@timsaucer:

I added a few lines to the documentation, rebased, and applied updated ruff formatting.

@timsaucer timsaucer merged commit acd7040 into apache:main Mar 8, 2025
15 checks passed
@Spaarsh commented Mar 8, 2025:

Key Points:

  1. The | operator does not support type hints on Python < 3.10, so anyone pulling main post-merge on an older Python will not be able to use SessionContext at all.
  2. global_ctx is already exposed to Python.

Details:
The | operator used in all the read_* functions supports type hinting only on Python >= 3.10, where the syntax was introduced (PEP 604). To even import SessionContext, I had to replace all | operations with Union. Until then, I was getting this error:

$ python3
Python 3.9.7 (default, Oct 18 2021, 02:25:46) 
[Clang 13.0.0 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from datafusion import SessionContext
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/df-py/datafusion-python/python/datafusion/__init__.py", line 48, in <module>
    from .io import read_avro, read_csv, read_json, read_parquet
  File "/home/user/df-py/datafusion-python/python/datafusion/io.py", line 31, in <module>
    path: str | pathlib.Path,
TypeError: unsupported operand type(s) for |: 'type' and 'type'

After replacing all | operations with Union, everything works. global_ctx is already being exposed to Python, unless I have misunderstood something — I pulled the branch and tested it, and it works.
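The pre-3.10-compatible form of the annotation looks like this (a minimal sketch; `read_csv` here is only an illustrative signature, not the full datafusion one):

```python
from typing import Union
import pathlib

def read_csv(path: Union[str, pathlib.Path]) -> str:
    """Accepts either a str or a pathlib.Path on any supported Python."""
    return str(path)

print(read_csv(pathlib.Path("data.csv")))  # data.csv
```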

$ python3 test.py 
DataFrame()
+----+----------+-----+---------+------------+-----------+
| id | name     | age | salary  | start_date | is_active |
+----+----------+-----+---------+------------+-----------+
| 1  | John Doe | 32  | 75000.5 | 2020-01-15 | true      |
+----+----------+-----+---------+------------+-----------+
DataFrame()
+----+----------+-----+---------+------------+-----------+
| id | name     | age | salary  | start_date | is_active |
+----+----------+-----+---------+------------+-----------+
| 1  | John Doe | 32  | 75000.5 | 2020-01-15 | true      |
+----+----------+-----+---------+------------+-----------+
DataFrame()
+-----+----+-----------+----------+---------+------------+
| age | id | is_active | name     | salary  | start_date |
+-----+----+-----------+----------+---------+------------+
| 32  | 1  | true      | John Doe | 75000.5 | 2020-01-15 |
+-----+----+-----------+----------+---------+------------+
DataFrame()
+----+----------+-----+---------+------------+-----------+
| id | name     | age | salary  | start_date | is_active |
+----+----------+-----+---------+------------+-----------+
| 1  | John Doe | 32  | 75000.5 | 2020-01-15 | true      |
+----+----------+-----+---------+------------+-----------+

For reference, these are the scripts I used to generate the data files and test the read functions:

```python
# test.py
from datafusion import SessionContext

# Create a new session
ctx = SessionContext()

# Read different file formats
df1 = ctx.read_csv("data.csv")  # Accepts str or Path
df2 = ctx.read_parquet("data.parquet")
df3 = ctx.read_json("data.json")
df4 = ctx.read_avro("data.avro")

print(df1)
print(df2)
print(df3)
print(df4)
```

```python
# create.py - to create the data files
import json

import fastavro
import pandas as pd

# Sample data as a dictionary
data = {
    'id': [1],
    'name': ['John Doe'],
    'age': [32],
    'salary': [75000.50],
    'start_date': ['2020-01-15'],
    'is_active': [True],
}

# Create DataFrame
df = pd.DataFrame(data)

# Save as CSV (test.py reads data.csv as well)
df.to_csv('data.csv', index=False)

# Save as Parquet (uses pyarrow as the engine)
df.to_parquet('data.parquet')

# Save as JSON (line-delimited)
with open('data.json', 'w') as f:
    for _, row in df.iterrows():
        json.dump(row.to_dict(), f)
        f.write('\n')

# Save as Avro
schema = {
    'name': 'Employee',
    'type': 'record',
    'fields': [
        {'name': 'id', 'type': 'int'},
        {'name': 'name', 'type': 'string'},
        {'name': 'age', 'type': 'int'},
        {'name': 'salary', 'type': 'double'},
        {'name': 'start_date', 'type': 'string'},
        {'name': 'is_active', 'type': 'boolean'},
    ],
}

records = df.to_dict('records')
with open('data.avro', 'wb') as f:
    fastavro.writer(f, schema, records)
```

@kylebarron:

There needs to be an initial `from __future__ import annotations` on the first line to be able to use the | typing syntax.
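A minimal sketch of that fix (the `read_csv` signature is illustrative, not the actual datafusion one):

```python
# With this future import, annotations are stored as strings and never
# evaluated at runtime, so `str | pathlib.Path` parses even on Python 3.7+.
from __future__ import annotations

import pathlib

def read_csv(path: str | pathlib.Path) -> str:
    """Illustrative reader signature using PEP 604 union syntax."""
    return str(path)

print(read_csv("data.csv"))  # data.csv
```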

There are ruff rules to check this, and we should turn them on.

@timsaucer
Copy link
Contributor

Thanks, Kyle. More generally, I'll look into the impact of turning on all the rules and then removing a few specifically as needed.

Successfully merging this pull request may close these issues.

Make all read methods available on DataFusion module
5 participants