Add fenic-datasets integration #1936
Merged
5 commits
be8c987 Add fenic integration documentation (cpard)
ddd17c8 minor fixes (cpard)
24b351d Merge branch 'huggingface:main' into add-fenic-integration (cpard)
3fd5be9 Update fenic session examples with minimal configs (cpard)
ee3ad5e docs: clarify schema merging note (cpard)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,237 @@ | ||
# fenic | ||
|
||
[fenic](https://github.com/typedef-ai/fenic) is a PySpark-inspired DataFrame framework designed for building production AI and agentic applications. fenic provides support for reading datasets directly from the Hugging Face Hub. | ||
|
||
<div class="flex justify-center"> | ||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/fenic_hf.png"/> | ||
</div> | ||
|
||
## Getting Started | ||
|
||
To get started, pip install `fenic`: | ||
|
||
```bash | ||
pip install fenic | ||
``` | ||
|
||
### Create a Session | ||
|
||
Instantiate a fenic session with the default configuration (sufficient for reading datasets and other non-semantic operations): | ||
|
||
```python | ||
import fenic as fc | ||
|
||
session = fc.Session.get_or_create(fc.SessionConfig()) | ||
``` | ||
|
||
## Overview | ||
|
||
fenic is an opinionated data processing framework that combines: | ||
- **DataFrame API**: PySpark-inspired operations for familiar data manipulation | ||
- **Semantic Operations**: Built-in AI/LLM operations including semantic functions, embeddings, and clustering | ||
- **Model Integration**: Native support for AI providers (Anthropic, OpenAI, Cohere, Google) | ||
- **Query Optimization**: Automatic optimization through logical plan transformations | ||
|
||
## Read from Hugging Face Hub | ||
|
||
fenic can read datasets directly from the Hugging Face Hub using the `hf://` protocol. This functionality is built into fenic's DataFrameReader interface. | ||
|
||
### Supported Formats | ||
|
||
fenic supports reading the following formats from Hugging Face: | ||
- **Parquet files** (`.parquet`) | ||
|
||
- **CSV files** (`.csv`) | ||
|
||
### Reading Datasets | ||
|
||
To read a dataset from the Hugging Face Hub: | ||
|
||
```python | ||
import fenic as fc | ||
|
||
session = fc.Session.get_or_create(fc.SessionConfig()) | ||
|
||
# Read a CSV file from a public dataset | ||
df = session.read.csv("hf://datasets/datasets-examples/doc-formats-csv-1/data.csv") | ||
|
||
# Read Parquet files using glob patterns | ||
df = session.read.parquet("hf://datasets/cais/mmlu/astronomy/*.parquet") | ||
|
||
# Read from a specific dataset revision | ||
df = session.read.parquet("hf://datasets/datasets-examples/doc-formats-csv-1@~parquet/**/*.parquet") | ||
``` | ||
|
||
### Reading with Schema Management | ||
|
||
|
||
```python | ||
# Read multiple CSV files with schema merging | ||
df = session.read.csv("hf://datasets/username/dataset_name/*.csv", merge_schemas=True) | ||
|
||
# Read multiple Parquet files with schema merging | ||
df = session.read.parquet("hf://datasets/username/dataset_name/*.parquet", merge_schemas=True) | ||
``` | ||
|
||
> **Note:** In fenic, a schema is the set of column names and their data types. When you enable `merge_schemas`, fenic tries to reconcile differences across files by filling missing columns with nulls and widening types where it can. Some layouts still cannot be merged—consult the fenic docs for [CSV schema merging limitations](https://docs.fenic.ai/latest/reference/fenic/?h=parquet#fenic.DataFrameReader.csv) and [Parquet schema merging limitations](https://docs.fenic.ai/latest/reference/fenic/?h=parquet#fenic.DataFrameReader.parquet). | ||
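
The merging behavior described in the note can be illustrated in plain Python. This is a sketch of the idea only, not fenic's implementation: `merge_schemas_sketch`, `align_row`, the type names, and the `WIDENING_ORDER` list are all hypothetical and chosen for illustration.

```python
# Illustrative only: not fenic's implementation of merge_schemas.
# The type names and widening order below are assumptions for this sketch.
WIDENING_ORDER = ["int", "float", "string"]


def merge_schemas_sketch(schemas):
    """Union the columns of several {column: type} schemas, widening types."""
    merged = {}
    for schema in schemas:
        for column, dtype in schema.items():
            if column not in merged:
                merged[column] = dtype
            else:
                # Keep whichever type sits later in the widening order.
                merged[column] = max(merged[column], dtype, key=WIDENING_ORDER.index)
    return merged


def align_row(row, merged_schema):
    """Fill columns that a file did not provide with None (null)."""
    return {column: row.get(column) for column in merged_schema}


file_a = {"id": "int", "score": "int"}
file_b = {"id": "int", "score": "float", "comment": "string"}

merged = merge_schemas_sketch([file_a, file_b])
print(merged)
print(align_row({"id": 1, "score": 3}, merged))
```

Rows from `file_a` gain a null `comment` column, and `score` is widened to the broader of the two types. Layouts whose types have no common wider type are the kind of case the linked limitations cover.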
|
||
### Authentication | ||
|
||
To read private datasets, you need to set your Hugging Face token as an environment variable: | ||
|
||
```shell | ||
export HF_TOKEN="your_hugging_face_token_here" | ||
``` | ||
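
If a read of a private dataset fails, it can help to confirm that the token is visible to the Python process. The helper below is a hypothetical convenience function, not part of fenic; it only inspects the environment and does not validate the token against the Hub.

```python
import os


def hf_token_is_set(env=None):
    """Return True if HF_TOKEN is present and non-empty in the environment."""
    env = os.environ if env is None else env
    return bool(env.get("HF_TOKEN", "").strip())


if not hf_token_is_set():
    print("HF_TOKEN is not set; reads of private datasets are expected to fail.")
```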
|
||
### Path Format | ||
|
||
The Hugging Face path format in fenic follows this structure: | ||
``` | ||
hf://{repo_type}/{repo_id}/{path_to_file} | ||
``` | ||
|
||
You can also specify dataset revisions or versions: | ||
``` | ||
hf://{repo_type}/{repo_id}@{revision}/{path_to_file} | ||
``` | ||
|
||
Features: | ||
- Supports glob patterns (`*`, `**`) | ||
- Dataset revisions/versions using `@` notation: | ||
  - Specific commit: `@d50d8923b5934dc8e74b66e6e4b0e2cd85e9142e` | ||
  - Branch: `@refs/convert/parquet` | ||
  - Branch alias: `@~parquet` | ||
- Requires `HF_TOKEN` environment variable for private datasets | ||
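
The path structure above can be assembled programmatically. The following is a minimal sketch; `build_hf_path` is a hypothetical helper written for this illustration and is not a fenic API.

```python
def build_hf_path(repo_type, repo_id, path, revision=None):
    """Assemble an hf:// path from its parts (hypothetical helper, not a fenic API)."""
    # The revision, when given, is attached to the repo id with "@".
    repo = f"{repo_id}@{revision}" if revision else repo_id
    return f"hf://{repo_type}/{repo}/{path.lstrip('/')}"


# No revision specified.
print(build_hf_path("datasets", "cais/mmlu", "astronomy/*.parquet"))

# Branch alias from the list above, combined with a recursive glob.
print(build_hf_path(
    "datasets",
    "datasets-examples/doc-formats-csv-1",
    "**/*.parquet",
    revision="~parquet",
))
```

The two printed paths match the ones used in the reading examples earlier in this page, so the result can be passed directly to `session.read.parquet`.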
|
||
### Mixing Data Sources | ||
|
||
fenic allows you to combine multiple data sources in a single read operation, including mixing different protocols: | ||
|
||
```python | ||
# Mix HF and local files in one read call | ||
df = session.read.parquet([ | ||
    "hf://datasets/cais/mmlu/astronomy/*.parquet", | ||
    "file:///local/data/*.parquet", | ||
    "./relative/path/data.parquet" | ||
]) | ||
``` | ||
|
||
This flexibility allows you to seamlessly combine data from Hugging Face Hub and local files in your data processing pipeline. | ||
|
||
## Processing Data from Hugging Face | ||
|
||
Once a dataset is loaded from Hugging Face, you can use fenic's full DataFrame API: | ||
|
||
### Basic DataFrame Operations | ||
|
||
```python | ||
import fenic as fc | ||
|
||
session = fc.Session.get_or_create(fc.SessionConfig()) | ||
|
||
# Load IMDB dataset from Hugging Face | ||
df = session.read.parquet("hf://datasets/imdb/plain_text/train-*.parquet") | ||
|
||
# Filter and select | ||
positive_reviews = df.filter(fc.col("label") == 1).select("text", "label") | ||
|
||
# Group by and aggregate | ||
label_counts = df.group_by("label").agg( | ||
    fc.count("*").alias("count") | ||
) | ||
``` | ||
|
||
### AI-Powered Operations | ||
|
||
To use semantic and embedding operations, configure language and embedding models in your `SessionConfig`. With the models configured, you can compute embeddings and run semantic functions such as sentiment analysis: | ||
|
||
```python | ||
import fenic as fc | ||
|
||
# Requires OPENAI_API_KEY to be set for language and embedding calls | ||
|
||
session = fc.Session.get_or_create( | ||
    fc.SessionConfig( | ||
        semantic=fc.SemanticConfig( | ||
            language_models={ | ||
                "gpt-4o-mini": fc.OpenAILanguageModel( | ||
                    model_name="gpt-4o-mini", | ||
                    rpm=60, | ||
                    tpm=60000, | ||
                ) | ||
            }, | ||
            embedding_models={ | ||
                "text-embedding-3-small": fc.OpenAIEmbeddingModel( | ||
                    model_name="text-embedding-3-small", | ||
                    rpm=60, | ||
                    tpm=60000, | ||
                ) | ||
            }, | ||
        ) | ||
    ) | ||
) | ||
|
||
# Load a text dataset from Hugging Face | ||
df = session.read.parquet("hf://datasets/imdb/plain_text/train-00000-of-00001.parquet") | ||
|
||
# Add embeddings to text columns | ||
df_with_embeddings = df.select( | ||
    "*", | ||
    fc.semantic.embed(fc.col("text")).alias("embedding") | ||
) | ||
|
||
# Apply semantic functions for sentiment analysis | ||
df_analyzed = df_with_embeddings.select( | ||
    "*", | ||
    fc.semantic.analyze_sentiment( | ||
        fc.col("text"), | ||
        model_alias="gpt-4o-mini",  # Optional: specify model | ||
    ).alias("sentiment") | ||
) | ||
``` | ||
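
The `rpm` and `tpm` values in the configuration above limit how fast a semantic operation can run over a dataset. Assuming they denote requests per minute and tokens per minute, and assuming one request per row, a rough lower bound on run time can be estimated as follows. The function name and the per-row token count are hypothetical; actual throughput depends on batching, retries, and provider behavior.

```python
def estimate_minutes(num_rows, tokens_per_row, rpm, tpm):
    """Lower-bound run time in minutes, assuming one request per row."""
    minutes_by_requests = num_rows / rpm
    minutes_by_tokens = (num_rows * tokens_per_row) / tpm
    # Whichever limit takes longer to satisfy sets the pace.
    return max(minutes_by_requests, minutes_by_tokens)


# Hypothetical workload: 25,000 rows at roughly 300 tokens each.
print(estimate_minutes(25_000, 300, rpm=60, tpm=60_000))
```

With these example numbers the request limit dominates, so raising `tpm` alone would not shorten the run. Sampling or filtering the DataFrame before applying semantic functions is one way to keep experiments short.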
|
||
## Example: Analyzing MMLU Dataset | ||
|
||
```python | ||
import fenic as fc | ||
|
||
# Requires OPENAI_API_KEY to be set for semantic calls | ||
session = fc.Session.get_or_create( | ||
    fc.SessionConfig( | ||
        semantic=fc.SemanticConfig( | ||
            language_models={ | ||
                "gpt-4o-mini": fc.OpenAILanguageModel( | ||
                    model_name="gpt-4o-mini", | ||
                    rpm=60, | ||
                    tpm=60000, | ||
                ) | ||
            }, | ||
        ) | ||
    ) | ||
) | ||
|
||
# Load MMLU astronomy subset from Hugging Face | ||
df = session.read.parquet("hf://datasets/cais/mmlu/astronomy/*.parquet") | ||
|
||
# Process the data | ||
processed_df = (df | ||
    # Filter for specific criteria | ||
    .filter(fc.col("subject") == "astronomy") | ||
    # Select relevant columns | ||
    .select("question", "choices", "answer") | ||
    # Add difficulty analysis using semantic.map | ||
    .select( | ||
        "*", | ||
        fc.semantic.map( | ||
            "Rate the difficulty of this question from 1-5: {{question}}", | ||
            question=fc.col("question"), | ||
            model_alias="gpt-4o-mini"  # Optional: specify model | ||
        ).alias("difficulty") | ||
    ) | ||
) | ||
|
||
# Show results | ||
processed_df.show() | ||
``` | ||
|
||
## Resources | ||
|
||
- [fenic GitHub Repository](https://github.com/typedef-ai/fenic) | ||
- [fenic Documentation](https://docs.fenic.ai/latest/) |