Palimpzest test case: processing materials science papers for crystal recipes #123

Open
mikecafarella opened this issue Feb 12, 2025 · 1 comment

@mikecafarella
Collaborator

We have a colleague who wants to process materials science papers and extract "synthesis recipes". These recipes are passages that describe fairly complicated procedures for creating a novel chemical. This is useful because the recipes are extremely intricate and hard to discover. If we can build a model that suggests high-quality recipes for novel targets, it would be a big step forward.

The near-term goal is simply to extract these recipes from existing papers. So we want to populate a schema that looks like this:
(PaperIdentifier, TargetChemical, RecipeText)
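For discussion, here is a minimal sketch of how that three-field schema might be held in plain Python. The class name `RecipeRecord`, the helper `to_row`, and the placeholder values are all hypothetical; this is not a Palimpzest type, and the issue does not say what form the paper identifier takes (DOI, filename, or something else).

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RecipeRecord:
    """One extracted recipe. Hypothetical container, not a Palimpzest class."""
    paper_identifier: str  # assumption: any stable string ID for the paper
    target_chemical: str   # the chemical the recipe produces
    recipe_text: str       # the raw passage, copied verbatim from the paper


def to_row(record: RecipeRecord) -> tuple:
    """Flatten a record into (PaperIdentifier, TargetChemical, RecipeText)."""
    return (record.paper_identifier, record.target_chemical, record.recipe_text)


# Placeholder values for illustration only; not taken from any real paper.
example = RecipeRecord("paper-001", "ExampleOxide", "Placeholder recipe text.")
row = to_row(example)
as_mapping = asdict(example)  # dict form, convenient for JSON or CSV output
```

Keeping the raw text as its own field means the structured form of the recipe can later be added as extra fields without changing the extraction step.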

After that works, we can populate a structured form of the recipe description. But just getting the raw text first would be helpful.

I have some annotated data we can use, though it's not shareable via Git so please don't commit it here.

Doing a basic but good job here involves:

  1. Writing code that extracts content from PDFs
  2. Evaluating the accuracy of the initial task
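One possible way to score the evaluation step is sketched below, assuming both the extracted output and the annotated data can be expressed as (paper ID, target chemical, recipe text) tuples. The function names, the fuzzy text comparison via `difflib`, and the 0.8 threshold are all assumptions made for illustration; the issue does not specify a metric or the format of the annotations.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return " ".join(text.lower().split())


def text_similarity(a, b):
    """Similarity in [0, 1] between two passages after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def evaluate(predicted, gold, threshold=0.8):
    """Precision and recall of predicted tuples against annotated tuples.

    A prediction counts as correct when its (paper ID, target chemical) key
    exists in the annotations and the recipe text is similar enough.
    The threshold is an arbitrary illustrative choice.
    """
    gold_by_key = {(p, normalize(t)): r for p, t, r in gold}
    matched = set()
    for paper_id, target, recipe in predicted:
        key = (paper_id, normalize(target))
        if key in gold_by_key and key not in matched:
            if text_similarity(recipe, gold_by_key[key]) >= threshold:
                matched.add(key)
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(gold_by_key) if gold_by_key else 0.0
    return {"precision": precision, "recall": recall}


# Toy inputs for illustration only; not real annotations.
toy_gold = [("p1", "A", "mix x and y"), ("p2", "B", "heat z")]
toy_pred = [("p1", "a", "Mix x  and y"), ("p2", "B", "totally different words here")]
toy_scores = evaluate(toy_pred, toy_gold)
```

Fuzzy matching is used here because text extracted from PDFs rarely matches an annotation character for character. An exact-match or span-overlap metric might suit the annotated data better once its format is known.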

After this basic version works, we want to evaluate runtime performance and maybe consider:

  1. Using the PZ RAG operators like retrieve, though this depends on the runtime of the basic version
  2. Building a RAG index over the input papers
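To make the indexing idea concrete, here is a rough standard-library sketch of a retrieval index over paper passages. It is not the PZ retrieve operator and does not use any Palimpzest API. It also swaps in lexical TF-IDF scoring with cosine similarity where a real RAG setup would more likely use embeddings. `PassageIndex` and `tokenize` are hypothetical names, and the passages are made-up placeholders.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on any non-alphanumeric character."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


class PassageIndex:
    """Tiny in-memory TF-IDF index (illustrative, not a Palimpzest class)."""

    def __init__(self, passages):
        self.passages = list(passages)
        self.term_counts = [Counter(tokenize(p)) for p in self.passages]
        doc_freq = Counter()
        for counts in self.term_counts:
            doc_freq.update(counts.keys())
        n = len(self.passages)
        # Smoothed inverse document frequency.
        self.idf = {t: math.log((1 + n) / (1 + df)) + 1.0
                    for t, df in doc_freq.items()}

    def _vector(self, counts):
        # Terms never seen at index time are dropped.
        return {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}

    def search(self, query, k=3):
        """Return up to k (passage, cosine score) pairs, best first."""
        q = self._vector(Counter(tokenize(query)))
        q_norm = math.sqrt(sum(v * v for v in q.values()))
        scored = []
        for i, counts in enumerate(self.term_counts):
            d = self._vector(counts)
            d_norm = math.sqrt(sum(v * v for v in d.values()))
            if q_norm == 0 or d_norm == 0:
                continue
            dot = sum(w * d.get(t, 0.0) for t, w in q.items())
            if dot > 0:
                scored.append((dot / (q_norm * d_norm), i))
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [(self.passages[i], score) for score, i in scored[:k]]


# Placeholder passages for illustration only.
toy_passages = [
    "The powder was heated at 500 C for two hours.",
    "We thank the funding agency for support.",
    "Related work discusses band structure.",
]
toy_index = PassageIndex(toy_passages)
top_hits = toy_index.search("heated powder", k=1)
```

The point of an index like this would be to send only the passages most likely to contain a recipe to the extraction step, which is where any runtime savings over the basic version would come from.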

Finally, we would pursue the structured form of the recipe. I can share the domain experts' proposal for this structure.

@mdr223
Collaborator

mdr223 commented Feb 26, 2025

Next step: try to propose a structured schema for the info we want to extract from the highlighted text


3 participants