Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add batch search by variations #361

Merged
merged 6 commits into from
Jun 18, 2024
Merged

Conversation

jsstevenson
Copy link
Member

@jsstevenson jsstevenson commented Jun 13, 2024

close #290

Current use case is variations only (e.g. in the context of a VCF), but the objects/API are structured in such a way to (relatively) smoothly accommodate other kinds of terms*.

Very naively implemented. Tune-ups in #311 should include fetching everything in a single query rather than iteratively getting every study once study IDs are acquired. I also wonder if there's anything else we can do cache-wise to optimize for those cases where a user is performing repeated lookups on things that are probably pretty closely located in the genome, such as VCFs (seqrepo already uses a big LRU cache for sequence lookups so that might be all that's necessary)

The ability to submit multiple terms for the same kind of entity raises questions about transparent management of redundancy, failed lookups, etc. I added an extra field to supply the normalized ID, if available, for each term, so that the client can tell if terms are normalizing to the same ID or if they fail to normalize at all:

  "query": {
    "variations": [
      {
        "term": "EGFR L858R",
        "normalized_id": "ga4gh:VA.S41CcMJT2bcd8R4-qXZWH1PoHWNtG2PZ"
      },
      {
        "term": "matthew cannon 2"
      },
      {
        "term": "ga4gh:VA.S41CcMJT2bcd8R4-qXZWH1PoHWNtG2PZ",
        "normalized_id": "ga4gh:VA.S41CcMJT2bcd8R4-qXZWH1PoHWNtG2PZ"
      }
    ]
  }

This also makes it possible to understand which studies correspond to each search term.

  • Caveat about including other kinds of entities: I can understand why we'd want it, but it potentially makes search very weird. Say you search for drug A, drug B, variation A, and variation B. Do you just include any study that includes any of those terms? (I don't know what the use case for that is, but I think it's the most intuitive result). Or do you just include studies that include variation A + drug A or drug B, or variation B + drug A or drug B? (actually a pretty reasonable thing to search for, but a bit counterintuitive to frame).
  • A note about response structure. As a first pass, I made it consistent with get_search_studies -- response.studies is just a list of studies. If you wanted to figure out which ones came from a given search term, you could take the normalized ID from response.queries and then filter through response.studies based on the value in study.variant.definingContext.id (for ProteinSequenceConsequences, or something else for CategoricalVariations). Alternatively, you could group response.studies into a dict where the key is the normalized ID and the value is the list of studies for that ID. However, that becomes VERY messy if you do want to support searching by multiple entity types (what would the key be, a concatenated string or something?).

@jsstevenson jsstevenson added enhancement New feature or request priority:medium Medium priority labels Jun 14, 2024
@jsstevenson jsstevenson marked this pull request as ready for review June 17, 2024 18:54
@jsstevenson jsstevenson requested a review from korikuzma June 17, 2024 19:04
Comment on lines +116 to +117
variations: list[str] | None = Query( # noqa: B008
None, description=_batch_search_studies_descr["arg_variations"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why we can't do below?

Suggested change
variations: list[str] | None = Query( # noqa: B008
None, description=_batch_search_studies_descr["arg_variations"]
variations: list[str] = Query( # noqa: B008
[], description=_batch_search_studies_descr["arg_variations"]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the FastAPI Query method shield from the usual mutable defaults problem?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, apparently this is the "old way", anyway

Comment on lines +725 to +726
Because this method could be expanded to include other kinds of search terms,
``variations`` is optionally nullable.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above. Can we have it default to empty list?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely can't use mutable defaults here, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if it's documented well that variations will never be mutated then we could but may not follow best practices. We can leave as is

src/metakb/query.py Outdated Show resolved Hide resolved
@jsstevenson jsstevenson merged commit 9e4c43b into staging Jun 18, 2024
11 checks passed
@jsstevenson jsstevenson deleted the batch-get-variation branch June 18, 2024 17:10
korikuzma added a commit that referenced this pull request Jun 24, 2024
* PR suggestions were applied, but forgot to update tests
korikuzma added a commit that referenced this pull request Jun 24, 2024
…368)

* PR suggestions were applied, but forgot to update tests
@jsstevenson jsstevenson mentioned this pull request Jan 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request priority:medium Medium priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants