
Rjurney/v0.1 #15

Open
rjurney wants to merge 2 commits into main from rjurney/v0.1

Conversation

@rjurney
Contributor

@rjurney rjurney commented Mar 8, 2026

No description provided.

rjurney added 2 commits March 7, 2026 19:24
Gemini 2.0 Flash only for all ER pipeline operations. Gemini 2.5 Pro
allowed only for validation data generation with < 2K API calls.
Do not reuse Abzu's BAML-generated types. Build fresh domain-agnostic
Pydantic classes for DSPy signatures. Add auto-generation of entity
types from PySpark DataFrame schemas via type_generator module. Add
Spark-to-Python type mapping and DatasetProfile-driven field descriptions.
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refines the data modeling and integration capabilities within the SERF entity resolution system. It introduces a more flexible and user-friendly approach to defining entity types by moving away from legacy BAML-generated types towards fresh, domain-agnostic Pydantic models. A key enhancement is the ability to automatically generate these Pydantic entity types directly from Spark DataFrame schemas, streamlining data ingestion. Additionally, the changes establish clear budgetary and operational guidelines for leveraging large language models, ensuring efficient and cost-effective use of resources.

Highlights

  • Enhanced SparkDantic Schema Bridge: The SparkDantic schema bridge now supports bidirectional functionality, allowing Pydantic-to-Spark for writing and Spark-to-Pydantic for auto-generating entity types from input DataFrames.
  • New Domain-Agnostic Pydantic Type Strategy: SERF will no longer reuse Abzu's BAML-generated types. Instead, it will build fresh, domain-agnostic Pydantic classes specifically designed for DSPy signatures, preserving only proven ER metadata patterns.
  • Automated Entity Type Generation from DataFrames: A significant new feature allows SERF to automatically infer and generate Pydantic entity types from PySpark DataFrame schemas, simplifying the user experience for data ingestion.
  • LLM Budget and Model Usage Guidelines: New guidelines have been established for LLM usage, including a hard budget of $100 for overnight builds, specifying Gemini 2.0 Flash for pipeline operations and Gemini 2.5 Pro for validation data generation, with strict limits on API calls and model choices.
  • Updated Core Type System Implementation Plan: The implementation plan for the core type system has been revised to reflect the new strategy of rewriting Pydantic types from scratch and includes the creation of a dedicated module for auto-generating types from Spark schemas.
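The Spark-to-Pydantic direction described above can be sketched in a few lines. This is a hypothetical stand-in for the planned `type_generator` module: the function name, the `(name, spark_type, nullable)` tuple shape, and the type mapping are illustrative assumptions, not SERF's actual API. A real implementation would walk a PySpark `StructType` from `DataFrame.schema` instead of plain tuples.

```python
from typing import Optional
from pydantic import create_model

# Illustrative subset of a Spark-to-Python type mapping (assumed, not SERF's).
SPARK_TO_PYTHON = {
    "string": str,
    "integer": int,
    "long": int,
    "double": float,
    "boolean": bool,
}


def entity_type_from_schema(name, fields):
    """Build a Pydantic entity model from (field_name, spark_type, nullable) tuples.

    Stand-in for a real entity_type_from_spark_schema that would walk a
    PySpark StructType; the signature here is a guess for illustration.
    """
    definitions = {}
    for field_name, spark_type, nullable in fields:
        py_type = SPARK_TO_PYTHON.get(spark_type, str)
        if nullable:
            # Nullable Spark fields become Optional with a None default.
            definitions[field_name] = (Optional[py_type], None)
        else:
            # Non-nullable fields are required (Pydantic's `...` sentinel).
            definitions[field_name] = (py_type, ...)
    return create_model(name, **definitions)


# Example: a two-field schema yields a usable Pydantic model.
Company = entity_type_from_schema(
    "Company",
    [("name", "string", False), ("employees", "long", True)],
)
record = Company(name="Acme", employees=42)
print(record.name, record.employees)  # prints: Acme 42
```

Generating the model dynamically with `create_model` keeps the types domain-agnostic: each input DataFrame produces its own entity class, which can then be wired into DSPy signatures without hand-written per-domain code.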
Changelog
  • docs/SERF_LONG_SHOT_PLAN.md
    • Updated the description of the SparkDantic schema bridge to clarify its bidirectional capabilities.
    • Revised the strategy for evolving SERF's type system, emphasizing the creation of new, domain-agnostic Pydantic types for DSPy.
    • Added a new section detailing the auto-generation of entity types from PySpark DataFrames, including a function signature and type mapping.
    • Introduced a new section outlining budget constraints and model usage policies for LLM APIs (Gemini Flash/Pro) during overnight builds.
    • Modified the implementation plan for 'Step 2: Core Type System' to reflect the new type generation approach and increased time estimate.


@gemini-code-assist (bot) left a comment


Code Review

This pull request significantly enhances the SERF build plan by introducing a new feature for auto-generating Pydantic entity types from DataFrame schemas. The changes are well-integrated throughout the document, adding details to the data model, project structure, and implementation plan. A new section on budget constraints for the overnight build has also been added, which provides clear guidelines on API usage. My feedback includes a suggestion to improve the consistency of model names mentioned in the document.

Note: Security Review has been skipped due to the limited scope of the PR.


1. **Use Gemini 2.0 Flash exclusively** for all ER pipeline operations (blocking analysis, matching, merging, edge resolution). At $0.10/$0.40 per 1M input/output tokens, this allows ~160M+ input tokens -- more than enough for iterative ER across all three benchmark datasets.

2. **Gemini 2.5 Pro is allowed ONLY for generating validation data** -- high-quality labeled match/non-match pairs and few-shot examples that will be used to evaluate and optimize the pipeline. Limit Gemini 2.5 Pro to **fewer than 2,000 API calls** total. At ~2,500 tokens per call with $1.25/$10.00 per 1M input/output tokens, 2K calls costs roughly $50 -- leaving ample headroom for Flash usage.
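The "roughly $50" figure checks out under a worst-case assumption. The arithmetic below bills all ~2,500 tokens per call at the $10/1M output rate (the most expensive case); real calls split input and output tokens, so actual spend would be lower.

```python
# Worst-case cost check for the 2K-call Gemini 2.5 Pro cap described above.
# Assumes every token is billed at the output rate -- an upper bound.
CALLS = 2_000
TOKENS_PER_CALL = 2_500
OUTPUT_RATE_PER_M = 10.00  # USD per 1M output tokens ($1.25/1M input is cheaper)

worst_case = CALLS * TOKENS_PER_CALL / 1_000_000 * OUTPUT_RATE_PER_M
print(f"${worst_case:.2f}")  # prints: $50.00
```

Capping Pro at $50 worst-case leaves at least half of the $100 overnight budget for Gemini 2.0 Flash pipeline operations.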


Severity: medium

There appears to be an inconsistency in the model names mentioned. This section specifies using Gemini 2.5 Pro for generating validation data, while the 'Core Technologies' table in section 4.1 lists Gemini 2.5 Flash Lite as the lightweight model. For clarity and consistency in the plan, it would be beneficial to harmonize the model names used throughout the document or clarify if different models are intended for different purposes.

@rjurney
Contributor Author

rjurney commented Mar 8, 2026

/gemini review
@gemini review

@gemini-code-assist

Thank you for requesting a review, @rjurney! I've thoroughly reviewed the updates to docs/SERF_LONG_SHOT_PLAN.md.

Overall, this pull request introduces a very well-thought-out and comprehensive plan for significant enhancements to the SERF entity resolution system. The changes demonstrate a clear vision for improving flexibility, maintainability, and cost-efficiency.

Here are some key strengths I've identified in this plan:

  • Strategic Shift to Domain-Agnostic Pydantic Types: The decision to move away from BAML-generated types and build fresh, domain-agnostic Pydantic models from scratch is excellent. This will greatly improve the system's flexibility and maintainability, allowing SERF to be applied to a wider range of entity resolution problems without being tied to a specific domain.
  • Automated Entity Type Generation: The detailed plan for entity_type_from_spark_schema to automatically infer and generate Pydantic entity types from PySpark DataFrame schemas is a significant user experience improvement. This feature will streamline data ingestion and reduce the manual effort required for defining types.
  • Robust LLM Budget and Usage Guidelines: The introduction of a hard budget for Gemini API spend, along with clear guidelines for model selection (Gemini 2.0 Flash for pipeline operations, Gemini 2.5 Pro for validation data generation), token usage tracking, and model restrictions, is a critical and highly commendable aspect. This proactive approach to cost management for LLM usage is crucial for operational efficiency.
  • Clear Implementation Steps: The updated 'Step 2: Core Type System' in the implementation plan provides precise instructions, including the rewrite of types.py and the creation of type_generator.py with associated unit tests. This level of detail will be invaluable for guiding the development process.

These updates represent a strong architectural direction for SERF, addressing key areas of improvement and setting a solid foundation for future development. Great work on this detailed planning!

