We have a colleague who wants to process papers from the materials science domain and extract "synthesis recipes". These recipes are passages that describe fairly complicated procedures for creating a novel chemical. Extracting them is useful because the recipes are extremely intricate and hard to discover. If we can build a model that suggests high-quality recipes for novel targets, it would be a big step forward.
The near-term goal is simply to extract these recipes from existing papers. So we want to populate a schema that looks like this:
(PaperIdentifier, TargetChemical, RecipeText)
After that works, we can populate a structured form of the recipe description. But just getting the raw text first would be helpful.
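As a rough illustration, the three-field schema above could be written as a plain Python record. This is only a sketch: the class name `SynthesisRecipe`, the snake_case field names, and the example values are assumptions made for illustration, not names taken from PZ or from the annotated data.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SynthesisRecipe:
    """One extracted recipe: (PaperIdentifier, TargetChemical, RecipeText)."""

    paper_identifier: str  # hypothetical: e.g. a DOI or a filename
    target_chemical: str   # the compound the recipe produces
    recipe_text: str       # raw passage copied from the paper

    def as_tuple(self) -> tuple[str, str, str]:
        # Same field order as the schema in the issue text.
        return (self.paper_identifier, self.target_chemical, self.recipe_text)


# Made-up example values, for illustration only.
example = SynthesisRecipe(
    paper_identifier="paper-001",
    target_chemical="LiFePO4",
    recipe_text="Precursors were mixed and calcined in air.",
)
print(asdict(example))
```

Keeping `recipe_text` as an unparsed string matches the near-term goal; the structured form can replace or extend that field later.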
I have some annotated data we can use, though it's not shareable via Git so please don't commit it here.
Doing a basic but good job here involves:
- Writing code that extracts content from PDFs
- Evaluating the accuracy of the initial extraction task
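The two steps above might look roughly like the sketch below. It assumes the `pypdf` library for text extraction (any PDF library would do) and assumes token-level F1 between the extracted and annotated recipe text as the accuracy measure; the issue does not specify a metric, so treat that choice as a placeholder. The function names and the `(paper_id, target_chemical)` key layout are hypothetical.

```python
import re
from collections import Counter


def extract_pdf_text(path: str) -> str:
    """Return the concatenated text of every page in a PDF."""
    # Assumption: pypdf is installed. Imported lazily so the scoring
    # helpers below can be used without it.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # extract_text() may yield nothing for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def tokenize(text: str) -> list[str]:
    # Lowercase alphanumeric tokens; crude, but enough for a first pass.
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(predicted: str, gold: str) -> float:
    """Token-overlap F1 between an extracted recipe and its annotation."""
    pred, ref = Counter(tokenize(predicted)), Counter(tokenize(gold))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions: dict, gold: dict) -> float:
    """Average F1 over annotated recipes; a missing prediction scores 0.

    Both dicts map (paper_id, target_chemical) -> recipe text.
    """
    if not gold:
        return 0.0
    scores = [token_f1(predictions.get(key, ""), text) for key, text in gold.items()]
    return sum(scores) / len(scores)
```

Scoring against the annotated keys rather than the predicted ones means a recipe the extractor misses entirely counts against it, which is probably what we want for a first accuracy number.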
After this basic version works, we want to evaluate runtime performance and maybe consider:
- Using the PZ RAG operators like retrieve, though whether this is worthwhile depends on the runtime of the basic version
- Building a RAG index over the input papers
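To make the indexing idea concrete, here is a small standard-library sketch of an index over paper chunks. It uses TF-IDF with cosine similarity as a stand-in for whatever embedding-based index we end up with, and it is not the PZ retrieve operator. The class name `TfidfIndex`, the `(paper_id, text)` chunk format, and the sample chunks are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class TfidfIndex:
    """In-memory TF-IDF index over (paper_id, chunk_text) pairs."""

    def __init__(self, chunks):
        self.chunks = list(chunks)
        counts = [Counter(tokenize(text)) for _, text in self.chunks]
        # Document frequency: number of chunks containing each term.
        df = Counter(term for c in counts for term in c)
        n = len(self.chunks)
        # Smoothed IDF so that no term gets a zero weight.
        self.idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
        self.vectors = [self._weight(c) for c in counts]

    def _weight(self, counts: Counter) -> dict:
        # Unit-length TF-IDF vector; terms unseen at index time are dropped.
        vec = {t: tf * self.idf[t] for t, tf in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def retrieve(self, query: str, k: int = 3):
        """Return up to k (score, paper_id, text) hits, best first."""
        q = self._weight(Counter(tokenize(query)))
        scored = []
        for (paper_id, text), vec in zip(self.chunks, self.vectors):
            score = sum(w * vec.get(t, 0.0) for t, w in q.items())
            if score > 0:
                scored.append((score, paper_id, text))
        scored.sort(key=lambda hit: hit[0], reverse=True)
        return scored[:k]


# Made-up chunks, for illustration only.
index = TfidfIndex([
    ("p1", "The powder was calcined at 900 C for 12 h"),
    ("p2", "We thank the funding agency for support"),
    ("p3", "Precursors were mixed and calcined in air"),
])
for score, paper_id, text in index.retrieve("calcined powder", k=2):
    print(paper_id, round(score, 3), text)
```

The point of retrieving first is to hand the extractor only the chunks likely to contain a recipe, which is where any runtime savings over the basic version would come from.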
Finally, we would pursue the structured form of the recipe. I can share the domain experts' proposal for this structure.