Conversation
Gemini 2.0 Flash only for all ER pipeline operations. Gemini 2.5 Pro allowed only for validation data generation with < 2K API calls.
Do not reuse Abzu's BAML-generated types. Build fresh domain-agnostic Pydantic classes for DSPy signatures. Add auto-generation of entity types from PySpark DataFrame schemas via type_generator module. Add Spark-to-Python type mapping and DatasetProfile-driven field descriptions.
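The planned type_generator behavior can be sketched as follows. This is a hypothetical illustration, not the actual SERF API: the function name `generate_entity_type`, the tuple-based schema stand-in (used here in place of a real PySpark `StructType` so the sketch stays self-contained), and the partial Spark-to-Python mapping are all assumptions.

```python
from typing import Optional

from pydantic import Field, create_model

# Partial Spark-to-Python type mapping, keyed by Spark simpleString names.
# (Assumed subset; a real type_generator would cover the full Spark type system.)
_SPARK_TO_PYTHON = {
    "string": str,
    "int": int,
    "integer": int,
    "bigint": int,
    "long": int,
    "double": float,
    "float": float,
    "boolean": bool,
}


def generate_entity_type(name, schema, descriptions=None):
    """Build a domain-agnostic Pydantic entity type from a Spark-like schema.

    `schema` is a list of (field_name, spark_type_name, nullable) tuples,
    standing in for StructType.fields; `descriptions` plays the role of
    DatasetProfile-driven field descriptions.
    """
    descriptions = descriptions or {}
    fields = {}
    for fname, spark_type, nullable in schema:
        py_type = _SPARK_TO_PYTHON.get(spark_type, str)
        if nullable:
            py_type = Optional[py_type]
        fields[fname] = (
            py_type,
            Field(
                default=None if nullable else ...,
                description=descriptions.get(fname, ""),
            ),
        )
    return create_model(name, **fields)


# Example: an entity type derived from a two-column DataFrame schema.
Company = generate_entity_type(
    "Company",
    [("name", "string", False), ("employee_count", "bigint", True)],
    {"name": "Legal company name"},
)
```

The generated class behaves like any hand-written Pydantic model, so it can be dropped into DSPy signatures directly.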
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request updates the long-term plan document for the SERF framework, incorporating architectural decisions about its type system and external API usage. The changes detail a shift toward a more flexible, domain-agnostic Pydantic type system with automated generation capabilities, and establish clear budget constraints for Gemini API consumption.
Code Review
This pull request updates the SERF_LONG_SHOT_PLAN.md document to refine the implementation plan. Key changes include clarifying the move from BAML to fresh DSPy Pydantic types, introducing a feature to auto-generate entity types from DataFrames, and adding a detailed section on budget constraints for Gemini API usage. The changes make the plan more specific and actionable. My review focuses on a potential issue in the new budget constraint section, where non-standard Gemini model names are used. I've suggested correcting these to align with publicly available models to prevent confusion during implementation. Overall, the updates are a significant improvement to the plan.
Note: Security Review has been skipped due to the limited scope of the PR.
1. **Use Gemini 2.0 Flash exclusively** for all ER pipeline operations (blocking analysis, matching, merging, edge resolution). At $0.10/$0.40 per 1M input/output tokens, this allows ~160M+ input tokens -- more than enough for iterative ER across all three benchmark datasets.
2. **Gemini 2.5 Pro is allowed ONLY for generating validation data** -- high-quality labeled match/non-match pairs and few-shot examples that will be used to evaluate and optimize the pipeline. Limit Gemini 2.5 Pro to **fewer than 2,000 API calls** total. At ~2,500 tokens per call with $1.25/$10.00 per 1M input/output tokens, 2K calls cost roughly $50 -- leaving ample headroom for Flash usage.
3. **Never use Claude, GPT-4o, or any non-Gemini model** for pipeline operations during the build. The DSPy signatures and pipeline code should be model-agnostic, but all actual LLM calls during this build session must go through Gemini.
4. **Track token usage** by logging input/output token counts from API responses. If cumulative spend approaches $80, stop making Gemini 2.5 Pro calls and finish remaining work with Flash only.

| Use Case                       | Model            | Max Calls                 | Est. Cost  |
| ------------------------------ | ---------------- | ------------------------- | ---------- |
| ER pipeline (match/merge/edge) | Gemini 2.0 Flash | Unlimited (within budget) | ~$10-30    |
| Validation data generation     | Gemini 2.5 Pro   | < 2,000                   | ~$50       |
| **Total**                      |                  |                           | **< $100** |
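The tracking rule in point 4 can be sketched as a small accumulator over the per-call token counts. This is a hedged illustration, not project code: `BudgetTracker` and its method names are hypothetical, and the per-token rates are the ones assumed in the plan above.

```python
# $ per 1M input/output tokens, as assumed in the plan's budget section.
PRICES = {
    "gemini-2.0-flash": (0.10, 0.40),
    "gemini-2.5-pro": (1.25, 10.00),
}


class BudgetTracker:
    """Hypothetical helper: accumulate spend from API usage metadata and
    gate Gemini 2.5 Pro calls once the budget rules kick in."""

    def __init__(self, pro_cutoff_usd: float = 80.0, pro_call_cap: int = 2000):
        self.spend = 0.0
        self.pro_calls = 0
        self.pro_cutoff_usd = pro_cutoff_usd
        self.pro_call_cap = pro_call_cap

    def record(self, model: str, input_tokens: int, output_tokens: int) -> None:
        """Log one call's token counts (taken from the API response)."""
        in_rate, out_rate = PRICES[model]
        self.spend += input_tokens * in_rate / 1e6 + output_tokens * out_rate / 1e6
        if model == "gemini-2.5-pro":
            self.pro_calls += 1

    def pro_allowed(self) -> bool:
        """False once spend nears $80 or the 2,000-call Pro cap is hit;
        remaining work then finishes on Flash only."""
        return self.spend < self.pro_cutoff_usd and self.pro_calls < self.pro_call_cap
```

A pipeline wrapper would call `record(...)` after every response and check `pro_allowed()` before dispatching a validation-data request to Pro.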
The model names "Gemini 2.0 Flash" and "Gemini 2.5 Pro" used in this section do not seem to correspond to publicly available Google models. This is likely a typo for "Gemini 1.5 Flash" and "Gemini 1.5 Pro", respectively. Using incorrect model names could cause errors during implementation when the agent attempts to make API calls.
Additionally, the pricing mentioned for these models seems to be custom or not aligned with current public pricing. For the budget constraints to be effective, it's crucial to use the correct model names and their corresponding up-to-date pricing.
Please consider updating the model names and verifying the pricing information throughout this section.
Suggested change:

1. **Use Gemini 1.5 Flash exclusively** for all ER pipeline operations (blocking analysis, matching, merging, edge resolution). At $0.10/$0.40 per 1M input/output tokens, this allows ~160M+ input tokens -- more than enough for iterative ER across all three benchmark datasets.
2. **Gemini 1.5 Pro is allowed ONLY for generating validation data** -- high-quality labeled match/non-match pairs and few-shot examples that will be used to evaluate and optimize the pipeline. Limit Gemini 1.5 Pro to **fewer than 2,000 API calls** total. At ~2,500 tokens per call with $1.25/$10.00 per 1M input/output tokens, 2K calls cost roughly $50 -- leaving ample headroom for Flash usage.
3. **Never use Claude, GPT-4o, or any non-Gemini model** for pipeline operations during the build. The DSPy signatures and pipeline code should be model-agnostic, but all actual LLM calls during this build session must go through Gemini.
4. **Track token usage** by logging input/output token counts from API responses. If cumulative spend approaches $80, stop making Gemini 1.5 Pro calls and finish remaining work with Flash only.

| Use Case                       | Model            | Max Calls                 | Est. Cost  |
| ------------------------------ | ---------------- | ------------------------- | ---------- |
| ER pipeline (match/merge/edge) | Gemini 1.5 Flash | Unlimited (within budget) | ~$10-30    |
| Validation data generation     | Gemini 1.5 Pro   | < 2,000                   | ~$50       |
| **Total**                      |                  |                           | **< $100** |
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Update branch to origin/rjurney/v0.1 to incorporate latest changes for BAML and Gemini API limitations. This update ensures the research phase and subsequent implementation are based on the most current codebase, as requested by the user.