Adding a field without data breaks larger occurrences download #930

Open
mjwestgate opened this issue Oct 17, 2024 · 6 comments · Fixed by #942

@mjwestgate

This is based on an issue identified using galah here. Basically, when we select a field in our occurrence download, for a query where no records have data in that field, the whole download fails. I've put @daxkellie's summary of the problem below.

To walk through the problem, the following query asks for counts of Acacia aneura grouped by scientificName:

https://api.ala.org.au/occurrences/occurrences/facets?fq=%28year%3A%222002%22%29AND%28lsid%3A%22https%3A%2F%2Fid.biodiversity.org.au%2Fnode%2Fapni%2F6707550%22%29&qualityProfile=ALA&facets=scientificName&fsort=count&flimit=10000

It returns this:
[{"fieldName":"scientificName","fieldResult":[{"label":"Acacia aneura","i18nCode":"scientificName.Acacia aneura","count":80,"fq":"scientificName:\"Acacia aneura\""},{"label":"Acacia aneura var. major","i18nCode":"scientificName.Acacia aneura var. major","count":6,"fq":"scientificName:\"Acacia aneura var. major\""},{"label":"Acacia aneura var. aneura","i18nCode":"scientificName.Acacia aneura var. aneura","count":1,"fq":"scientificName:\"Acacia aneura var. aneura\""}],"count":3}]
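As a rough sketch of how the counts in this response can be read programmatically, the Python snippet below parses a trimmed copy of the JSON above (the i18nCode and fq keys are dropped for brevity). The helper name facet_counts is made up for illustration and is not part of galah or the ALA API.

```python
import json

# Trimmed copy of the facet response shown above.
raw = """[{"fieldName":"scientificName","fieldResult":[
  {"label":"Acacia aneura","count":80},
  {"label":"Acacia aneura var. major","count":6},
  {"label":"Acacia aneura var. aneura","count":1}],"count":3}]"""


def facet_counts(body: str) -> dict:
    """Map each facet label to its record count."""
    facets = json.loads(body)
    return {
        row["label"]: row["count"]
        for facet in facets
        for row in facet["fieldResult"]
    }


counts = facet_counts(raw)
total = sum(counts.values())
print(total)  # 87
```

The three counts sum to 87 records. If a response body is an empty JSON list, this helper returns an empty dict.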

Which is great. Changing facets to location returns no records, suggesting that this field is empty:

https://api.ala.org.au/occurrences/occurrences/facets?fq=%28year%3A%222002%22%29AND%28lsid%3A%22https%3A%2F%2Fid.biodiversity.org.au%2Fnode%2Fapni%2F6707550%22%29&qualityProfile=ALA&facets=location&fsort=count&flimit=10000

Again, fine. We then format the request as an occurrence download, requesting several fields including location:

https://biocache-ws.ala.org.au/ws/occurrences/offline/download?fq=%28year%3A%222002%22%29AND%28lsid%3A%22https%3A%2F%2Fid.biodiversity.org.au%2Fnode%2Fapni%2F6707550%22%29&qualityProfile=ALA&fields=recordID%2CscientificName%2CvernacularName%2Ckingdom%2CeventDate%2CsamplingProtocol%2CindividualCount%2CrecordedBy%2Clocation&qa=none&facet=false&emailNotify=false&sourceTypeId=2004&reasonTypeId=4&email=martinjwestgate%40gmail.com&dwcHeaders=true

This runs, stating we expect to receive 87 records:

{"status":"inQueue","totalRecords":87,"queueSize":1,"statusUrl":"https://biocache-ws.ala.org.au/ws/occurrences/offline/status/bb0481b1-8b95-33af-9a6d-16c7aa24f0f1-1729125651839","cancelUrl":"https://biocache-ws.ala.org.au/ws/occurrences/offline/cancel/bb0481b1-8b95-33af-9a6d-16c7aa24f0f1-1729125651839","searchUrl":"https://biocache.ala.org.au/occurrences/search?&q=*%3A*&fq=%28year%3A%222002%22%29AND%28lsid%3A%22https%3A%2F%2Fid.biodiversity.org.au%2Fnode%2Fapni%2F6707550%22%29&disableAllQualityFilters=true&fq=-basisOfRecord%3A%22FOSSIL_SPECIMEN%22+AND+-%28basisOfRecord%3A%22MATERIAL_SAMPLE%22+AND+contentTypes%3A%22Environmental+DNA%22%29&fq=-%28duplicate_status%3A%22ASSOCIATED%22+AND+duplicateType%3A%22DIFFERENT_DATASET%22%29&fq=-assertions%3ATAXON_MATCH_NONE+AND+-assertions%3AINVALID_SCIENTIFIC_NAME+AND+-assertions%3ATAXON_HOMONYM+AND+-assertions%3AUNKNOWN_KINGDOM+AND+-assertions%3ATAXON_SCOPE_MISMATCH&fq=-occurrenceStatus%3AABSENT&fq=-identificationVerificationStatus%3A%22needs_id%22&fq=-userAssertions%3A50001+AND+-userAssertions%3A50005&fq=-year%3A%5B*+TO+1700%5D&fq=-establishmentMeans%3A%22MANAGED%22+AND+-decimalLatitude%3A0+AND+-decimalLongitude%3A0+AND+-assertions%3A%22PRESUMED_SWAPPED_COORDINATE%22+AND+-assertions%3A%22COORDINATES_CENTRE_OF_STATEPROVINCE%22+AND+-assertions%3A%22COORDINATES_CENTRE_OF_COUNTRY%22+AND+-assertions%3A%22PRESUMED_NEGATED_LATITUDE%22+AND+-assertions%3A%22PRESUMED_NEGATED_LONGITUDE%22&fq=-outlierLayerCount%3A%5B3+TO+*%5D&fq=-spatiallyValid%3A%22false%22&fq=-coordinateUncertaintyInMeters%3A%5B10001+TO+*%5D"}
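For reference, the download request above can be assembled from its parts rather than written by hand. This is a minimal Python sketch using only the standard library. The function name build_download_url and the placeholder email address are assumptions; the parameter names and values are copied from the request URL in this issue.

```python
from urllib.parse import urlencode

BASE = "https://biocache-ws.ala.org.au/ws/occurrences/offline/download"

# Filter query and field list taken from the request in this issue.
FQ = '(year:"2002")AND(lsid:"https://id.biodiversity.org.au/node/apni/6707550")'
FIELDS = [
    "recordID", "scientificName", "vernacularName", "kingdom", "eventDate",
    "samplingProtocol", "individualCount", "recordedBy", "location",
]


def build_download_url(fq, fields, email):
    """Percent-encode the query parameters and append them to the base URL."""
    params = {
        "fq": fq,
        "qualityProfile": "ALA",
        "fields": ",".join(fields),
        "qa": "none",
        "facet": "false",
        "emailNotify": "false",
        "sourceTypeId": "2004",
        "reasonTypeId": "4",
        "email": email,
        "dwcHeaders": "true",
    }
    return BASE + "?" + urlencode(params)


# Placeholder address, not a real account.
url = build_download_url(FQ, FIELDS, "user@example.org")
print(url)
```

Keeping the field list as a Python list makes it easy to drop a field such as location before the request is sent.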

Finally, the resulting Zip file (https://biocache.ala.org.au/biocache-download/bb0481b1-8b95-33af-9a6d-16c7aa24f0f1/1729125651839/data.zip) has no data in it. Instead, we would expect all the requested fields to be downloaded, with only NAs in the location column.
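To make the expected result concrete, here is a small Python sketch of a writer that keeps every requested column and leaves a cell empty when a record has no value for it. It is an illustration only, not how biocache-service builds its downloads; write_rows and the two example records (including their recordID values) are made up.

```python
import csv
import io


def write_rows(records, fields):
    """Write records as CSV, leaving cells empty for missing fields."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=fields,
        restval="",             # value used when a record lacks a field
        extrasaction="ignore",  # skip keys that were not requested
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


fields = ["recordID", "scientificName", "location"]
# Hypothetical records with no location value.
records = [
    {"recordID": "example-1", "scientificName": "Acacia aneura"},
    {"recordID": "example-2", "scientificName": "Acacia aneura var. major"},
]
out = write_rows(records, fields)
print(out)
```

Both rows are written, and the location column is present but empty.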

@kylie-m

kylie-m commented Oct 17, 2024

Relates to support ticket: https://support.ehelp.edu.au/a/tickets/209037

@adam-collins
Contributor

The contents of the location field are the same as the lat_long field. This is signified by sourceFields in the index/fields service. This location field is not intended for use with the download service. The service is likely misleading because we include the dataType name (instead of class) and indicate that it is stored=true. There is also an intentional lack of other information on the record such as description, downloadDescription, info, class(s), dwcTerm.

The problem that needs fixing is with the biocache-service index/fields service. It is currently exposing fields that are intended for use in search only (not facets, not downloads) but that still report stored=true because that is required for other reasons.

I think fields with the following dataTypes should be removed from index/fields, as using them requires knowledge of SOLR queries: geohash, packedQuad, quad, location.

The intention is to keep other search only fields in the index/fields response.

There is no intention to include virtual search fields in index/fields.
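The proposed removal can be sketched as a simple filter over index/fields entries. The Python below is an assumption-laden illustration, not the change made in biocache-service: drop_skipped_types is a hypothetical name, the entries are invented, and the keys name, dataType and stored are assumed to match the index/fields response.

```python
# dataTypes named in the comment above as needing SOLR knowledge.
SKIP_DATA_TYPES = frozenset({"geohash", "packedQuad", "quad", "location"})


def drop_skipped_types(index_fields, skip=SKIP_DATA_TYPES):
    """Return the entries whose dataType is not in the skip set."""
    return [entry for entry in index_fields if entry.get("dataType") not in skip]


# Hypothetical index/fields entries; real entries carry more keys.
example = [
    {"name": "scientificName", "dataType": "string", "stored": True},
    {"name": "location", "dataType": "location", "stored": True},
    {"name": "quad", "dataType": "quad", "stored": True},
]
kept = [entry["name"] for entry in drop_skipped_types(example)]
print(kept)  # ['scientificName']
```

Passing a different skip set changes which dataTypes are hidden without touching the filter itself.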

@mjwestgate
Copy link
Author

OK thanks @adam-collins, that makes sense. It also tallies with our workflows; we only allow users to query fields that are listed in index/fields, so if they aren't in there, the query will get stopped by galah at an earlier stage.
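The kind of early check described here might look roughly like the following. galah itself is written in R, so this Python version is only a sketch of the idea; check_fields and the sample entries are hypothetical.

```python
def check_fields(requested, index_fields):
    """Raise if any requested field is not listed in index/fields."""
    known = {entry["name"] for entry in index_fields}
    unknown = [name for name in requested if name not in known]
    if unknown:
        raise ValueError("fields not listed in index/fields: " + ", ".join(unknown))
    return list(requested)


# Hypothetical index/fields listing that no longer includes location.
listed = [{"name": "recordID"}, {"name": "scientificName"}]
ok = check_fields(["recordID", "scientificName"], listed)
print(ok)  # ['recordID', 'scientificName']
```

With this listing, a request that includes location raises a ValueError before any download is queued.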

While we're doing that it might make sense to have a spring clean of other content too. The first three fields listed are _nest_parent_, _nest_path_ and _root_, for example, which doesn't seem right either.

@adam-collins
Contributor

After the cleanup, index/fields will contain no internal-use fields and no fields whose data types are deemed too complicated to use. It will include:

  • fields that can be used everywhere (most fields)
  • fields that can only be used for search queries (case insensitive text searching, etc)

To differentiate between the two:

  • stored: true can be downloaded and faceted
  • stored: false cannot be downloaded or faceted
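A client could apply the stored rule above along these lines. This is a hedged Python sketch: split_by_stored and the field name exampleSearchOnlyField are invented, and the stored key is assumed to be a boolean on each index/fields entry.

```python
def split_by_stored(index_fields):
    """Separate fields usable in downloads and facets from search-only fields."""
    usable = [e["name"] for e in index_fields if e.get("stored", False)]
    search_only = [e["name"] for e in index_fields if not e.get("stored", False)]
    return usable, search_only


# Hypothetical entries; exampleSearchOnlyField is not a real field name.
entries = [
    {"name": "scientificName", "stored": True},
    {"name": "exampleSearchOnlyField", "stored": False},
]
usable, search_only = split_by_stored(entries)
print(usable)       # ['scientificName']
print(search_only)  # ['exampleSearchOnlyField']
```

Entries with no stored key are treated as search-only here, which is the more cautious default for a download client.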

@adam-collins adam-collins added this to the 3.6.0 milestone Jan 29, 2025
@nickdos nickdos linked a pull request Jan 30, 2025 that will close this issue
nickdos added a commit that referenced this issue Jan 30, 2025
nickdos added a commit that referenced this issue Jan 30, 2025
nickdos added a commit that referenced this issue Jan 30, 2025
* #930 Filter unwanted fields from index/fields
* #930 Braces in wrong spot
* #930 revert a autocomplete #fail
* #930 skipFieldTypes overridable with config
@nickdos
Contributor

nickdos commented Jan 31, 2025

Hi @mjwestgate - the changes for this issue have been deployed to the test server. Could you please take a look when you get the chance and confirm it meets your expectations for this issue? Thanks N

@nickdos nickdos reopened this Jan 31, 2025
mjwestgate added a commit to AtlasOfLivingAustralia/galah-R that referenced this issue Feb 3, 2025
@mjwestgate
Author

Hi @nickdos! I've had a look at the new version of index/fields linked above; it loads fine in galah and looks heaps better. I've added a filter within galah to remove entries from index/fields with stored = FALSE as per @adam-collins's suggestion above. We'll push these changes this week with galah 2.1.1, but the new biocache service release can happen any time. Thanks for your help!
