Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add site cultivars, treatments metadata to plot metadata #573

Open
tcnichol opened this issue May 9, 2019 · 20 comments
Open

add site cultivars, treatments metadata to plot metadata #573

tcnichol opened this issue May 9, 2019 · 20 comments
Assignees

Comments

@tcnichol
Copy link

tcnichol commented May 9, 2019

Description

Add site cultivar and site treatments data to sites metadata.

The change would be made here

https://github.com/terraref/terrautils/blob/master/terrautils/lemnatec.py#L118

This may be possible to accomplish using brapi endpoints, or else will require modification of betydb.

this is the branch i am working on for this issue:

https://github.com/terraref/terrautils/tree/add-cultivars-site-metadata

@tcnichol tcnichol self-assigned this May 9, 2019
@tcnichol tcnichol added this to the TERRA Sprint - April 2019 milestone May 9, 2019
@tcnichol
Copy link
Author

tcnichol commented May 9, 2019

@max-zilla I had a few questions about this one.

At the point where we want to add the sites_cultivars and sites_treatments, we have a site_id but not an experiment id. If I had the experiment_id, there should be endpoints in brapi that will supply what we need.

So I see two ways of handling this.

The first would be that in terrautils, the experiments json is cached. I can add a method that takes a site_id and returns and experiment or experiment_id. That would require parsing that json for every site, which is the drawback.

The second way would be that when the sites json is created, the method iterates over each experiment, and then builds an object for each site. We could just put the experiment_id in with the sites. It's just adding a single id and not a full object.

Let me know if either of those are acceptable.

@dlebauer
Copy link
Member

Would it help to have a view pre populated in the database?

@max-zilla
Copy link
Contributor

The second option would probably be better, each extractor creates the cache on startup if it needs BETY stuff (we don't have a shared cache file like traitvis) so inserting the experiment into the site object and saving potentially many scans of the JSON later sounds more efficient.

@tcnichol
Copy link
Author

K. I will implement the second then.

@tcnichol
Copy link
Author

I have now changed the get_sites method in betydb in the following ways:

Each entry for a site now contains the corresponding experiment id.
It also contains the cultivar information.

The cultivar information is obtained by the following:

Using the brapi v1/study/{studyDbId}/layouts endpoint, a dictionary is created that maps site ids to germplasm ids, and gets the corresponding information from the /germplasm endpoint in brapi.

I think that this can be further optimized by using the stored BETYDB_CULTIVARS instead of making a call to the brapi /germplasm enpoint. Will check into that next, but this means that the cultivar info can be added in lemnatec.py without making any calls there.

@tcnichol
Copy link
Author

Both the cultivar and treatments information should now be contained in the sites retrieved from betydb. Some sites do not have cultivar or treatment data. Currently the value for those keys is 'no info available' but if there is a better value for those, I can change that.

If someone can take a look, I think this one is pretty close to done.

@dlebauer
Copy link
Member

@tcnichol I can take a look if you can point to some of the records that are missing information.

@tcnichol
Copy link
Author

Here is one
https://brapi.workbench.terraref.org/brapi/v1/studies/6000000014/layouts
https://brapi.workbench.terraref.org/brapi/v1/observationunits?studyDbId=6000000014

in layouts, there are entries for these 2 plots MAC Field Scanner Season 4 Range 6 Column 10 W/E
and there is a corresponding entry in 'treatments' (but without the W/E)

However, this plot

MAC Field Scanner Season 4 Range 48 Column 11 W

doesn't have any corresponding entries in the /observationunits endpoint.

Since all the entries in observationunits have a corresponding entry in /layouts, but not all entries in /layouts have a corresponding entry in observationunits, are these control plots? What would be the best value to put under 'treatments' if there is no entry for the plot in observationunits?

@dlebauer
Copy link
Member

dlebauer commented May 15, 2019

it sounds like there are two things to address:

  1. there are missing records in treatments_sites that link the W/E plots to treatments. Is that the case?
  2. it isn't clear why MAC Field Scanner Season 4 Range 48 Column 11 W wouldn't have an observationunits endpoing, - since it is a plot, it is by definition an observation unit

Other thoughts:

  • E/W subplots should be given the value of "observationLevel": "subplot"
  • in looking at the /observationunits endpoint, it is not clear why there is a "location_abbreviation" field, isn't this redundant with ObservationUnitName? I don't think I can find location_abbreviation in the BrAPI specification.

@tcnichol
Copy link
Author

I will check into 1 tomorrow.

I can create an issue for changing location_abbrevation to observationUnitName in brapi as well if that does not match the spec.

@tcnichol
Copy link
Author

@dlebauer

I synced my local database with bety and did not see a sites_treatments table, but I was able to create something comparable with this sql.

create table experiments_sites_treatments as
select experiments_sites.site_id, experiments_sites.experiment_id, experiments_treatments.treatment_id
from experiments_sites
INNER JOIN experiments_treatments on experiments_treatments.experiment_id=experiments_sites.experiment_id

the next step was running this query

select experiments_sites_treatments.site_id, experiments_sites_treatments.experiment_id, experiments_sites_treatments.treatment_id,
treatments.id, treatments.name, treatments.definition
from experiments_sites_treatments, treatments
where experiments_sites_treatments.treatment_id = treatments.id
and experiments_sites_treatments.site_id= 6000007122;

Here are the results

siteandtreatments

So it looks like bety does have a treatment with a name/definition associated with this plot. Could this mean that there is something up with the brapi endpoint or the query it uses? I will check it and compare what it does with what I get here.

@tcnichol
Copy link
Author

i posted in the brapi channel about pagination.

@tcnichol
Copy link
Author

i think i figured out the issue with pagination, testing a fix in brapi locally with this.

@tcnichol
Copy link
Author

terraref/brapi#29

This pull request fixes the issue in brapi. With pagination working correctly, there should be an entry found for every plot.

@tcnichol
Copy link
Author

@robkooper thanks for merging that one in.

Since @dlebauer pointed out a problem with the results returned by observationunits, I have created this issue.

terraref/brapi#30

Will fix that endpoint and then fix the branch associated with this in terrautils to use the correct keys rather than 'location_abbrevation.'

@tcnichol
Copy link
Author

terraref/brapi#31

This pull request fixes location_abbrevation, and I have changed terrautils to take this change into account for the branch associated with this issue.

@tcnichol
Copy link
Author

sample metadata as would be generated by

terrautils.lemnatec._get_sites

{"6000007362": {"sitename": "MAC Field Scanner Season 4 Range 14 Column 2 E", "experiment_id": 6000000014, "treatments": {"definition": "Pre-plant Nitrogen fertilizer incorporated in the field", "id": "6000000020", "experiment_id": "6000000014"}, "cultivar": {"germplasmName": "PI330182", "species": "bicolor", "genus": "Sorghum", "germplasmDbId": "6000000819"}, "url": ""}, "notes": "sitename is the plot that contains the image centroid", "6000007408": {"sitename": "MAC Field Scanner Season 4 Range 15 Column 9 E", "experiment_id": 6000000014, "treatments": {"definition": "Pre-plant Nitrogen fertilizer incorporated in the field", "id": "6000000020", "experiment_id": "6000000014"}, "cultivar": {"germplasmName": "PI329301", "species": "bicolor", "genus": "Sorghum", "germplasmDbId": "6000000558"}, "url": ""}, "6000007592": {"sitename": "MAC Field Scanner Season 4 Range 21 Column 5 E", "experiment_id": 6000000014, "treatments": {"definition": "Pre-plant Nitrogen fertilizer incorporated in the field", "id": "6000000020", "experiment_id": "6000000014"}, "cultivar": {"germplasmName": "PI563332", "species": "bicolor", "genus": "Sorghum", "germplasmDbId": "6000000903"}, "url": ""}, "6000007726": {"sitename": "MAC Field Scanner Season 4 Range 25 Column 8 E", "experiment_id": 6000000014, "treatments": "no info available", "cultivar": "no info available", "url": ""}}

@tcnichol
Copy link
Author

(just a few keys, I didn't want to post something huge.)

@tcnichol
Copy link
Author

pull request open

terraref/terrautils#71

@max-zilla
Copy link
Contributor

The PR is ready for review but the get_sites process runs extremely slowly. I will review with an eye toward possible caching or other improvements to make this viable in terrautils metadata cleaning pipeline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants