Skip flattening for mutations across imports#480
Skip flattening for mutations across imports#480vish-cs wants to merge 1 commit intodatacommonsorg:masterfrom
Conversation
Summary of ChangesHello @vish-cs, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refactors the graph ingestion pipeline by changing how mutations are handled. Instead of aggregating all mutations from various imports and then flattening them for a single write operation, the pipeline now processes and writes mutations for each import individually. This change streamlines the data flow, improves the correctness of dependency handling between delete and write operations, and enhances the clarity of the ingestion process by isolating mutation handling per import. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request refactors the ingestion pipeline to process each import individually, which is a good architectural improvement for scalability. However, the current implementation introduces a few issues. I've identified a typo that will cause a compilation failure, a null pointer exception when deletes are skipped due to incorrect initialization of wait signals, and a potential data consistency issue where node writes no longer wait for deletions. My review includes specific code suggestions to address these critical and high-severity problems.
pipeline/ingestion/src/main/java/org/datacommons/ingestion/pipeline/GraphIngestionPipeline.java
Outdated
Show resolved
Hide resolved
pipeline/ingestion/src/main/java/org/datacommons/ingestion/pipeline/GraphIngestionPipeline.java
Outdated
Show resolved
Hide resolved
pipeline/ingestion/src/main/java/org/datacommons/ingestion/pipeline/GraphIngestionPipeline.java
Outdated
Show resolved
Hide resolved
|
|
||
| /** | ||
| * Helper method to flatten and write mutations to Spanner, optionally waiting on a signal. | ||
| * Helper method to flatten and write mutations to Spanner. |
There was a problem hiding this comment.
nit: maybe update since it isn't flattening anymore?
We currently flatten mutations across all imports and write to spanner. This PR processes each import independently to avoid any cross-dependency.
Updated wait signals for spanner writes to perform node writes without waiting for edge/obs deletions to complete.
Added import name to all the dataflow steps for easier debugging.
Also, remove edge generation for dcid property.