Issue when first <field/> does not have an 'index' attribute #17

Closed
nielsklazenga opened this issue Jan 9, 2025 · 1 comment

@nielsklazenga

Some data sets we get through the GBIF repatriation do not pass pre-ingestion and give the following error message:

...
2025-01-09 03:07:53,799 - Dwca - INFO - Reading from /data/dmgt/dr22959/temp/input/dr22959.zip
Traceback (most recent call last):
  File "/data/dmgt/preingestion/preingest.py", line 81, in <module>
    main()
  File "/data/dmgt/preingestion/preingest.py", line 76, in main
    return PreingestDR.preingest_dr(dr=uid, output_dwca_path=output_dwca_path,
  File "/data/dmgt/preingestion/preingest_dr.py", line 144, in preingest_dr
    Validate(dr=dr, file=dwca).validate_dwca(dwca_content_keys_to_check={'event': ['eventID'], 'occurrence': connection['termsForUniqueKey']})
  File "/data/dmgt/preingestion/preingest_validate.py", line 24, in validate_dwca
    if DwcaHandler.validate_dwca(dwca_file=self.file, error_file=self.error_file,
  File "/home/hadoop/.local/lib/python3.9/site-packages/dwcahandler/dwca/dwca_factory.py", line 92, in validate_dwca
    return Dwca(dwca_file_loc=dwca_file).validate_dwca(keys_lookup, error_file)
  File "/home/hadoop/.local/lib/python3.9/site-packages/dwcahandler/dwca/base_dwca.py", line 174, in validate_dwca
    self.extract_dwca()
  File "/home/hadoop/.local/lib/python3.9/site-packages/dwcahandler/dwca/core_dwca.py", line 193, in extract_dwca
    self.meta_content.read_meta_file(meta_xml)
  File "/home/hadoop/.local/lib/python3.9/site-packages/dwcahandler/dwca/dwca_meta.py", line 197, in read_meta_file
    self.meta_elements = [self.__extract_meta_info(ns, node_elm, CoreOrExtType.CORE)]
  File "/home/hadoop/.local/lib/python3.9/site-packages/dwcahandler/dwca/dwca_meta.py", line 160, in __extract_meta_info
    if fields[0].attrib['index'] != '0':
KeyError: 'index'
Command exiting with ret '1'

The error can be tracked down to the extraction of metadata from the meta.xml file in the DwcaHandler, https://github.com/AtlasOfLivingAustralia/dwcahandler/blob/develop/src/dwcahandler/dwca/dwca_meta.py/#L160:

...
        if fields[0].attrib['index'] != '0':
...

The fields list is created in https://github.com/AtlasOfLivingAustralia/dwcahandler/blob/develop/src/dwcahandler/dwca/dwca_meta.py/#L146:

...
        fields = node_elm.findall(f'{ns}field')
...

The meta.xml file of the DwCA that failed starts with:

<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="metadata.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files>
      <location>occurrence.txt</location>
    </files>
    <id index="0" />
    <field default="WGS84" term="http://rs.tdwg.org/dwc/terms/geodeticDatum"/>
    <field index="0" term="http://rs.gbif.org/terms/1.0/gbifID"/>
    <field index="1" term="http://purl.org/dc/terms/accessRights"/>
...

So the first <field/>, i.e. fields[0], indeed does not have an 'index' attribute.
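The failure can be reproduced outside the pipeline with a few lines of standard-library Python. This is only a sketch: the XML is a trimmed copy of the snippet above (closing tags added so it parses), and the value of `ns` is an assumption about what dwca_meta.py passes in, based on the archive's default namespace.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the failing meta.xml; closing tags added so it parses.
META_XML = r"""<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="metadata.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files>
      <location>occurrence.txt</location>
    </files>
    <id index="0" />
    <field default="WGS84" term="http://rs.tdwg.org/dwc/terms/geodeticDatum"/>
    <field index="0" term="http://rs.gbif.org/terms/1.0/gbifID"/>
    <field index="1" term="http://purl.org/dc/terms/accessRights"/>
  </core>
</archive>"""

# Assumption: `ns` in dwca_meta.py is the namespace in ElementTree's {uri} form.
ns = "{http://rs.tdwg.org/dwc/text/}"

core = ET.fromstring(META_XML).find(f"{ns}core")
fields = core.findall(f"{ns}field")

# The first <field/> only carries 'default' and 'term'.
first_attribs = dict(fields[0].attrib)

# Same lookup as the failing line; record whether it raises KeyError.
try:
    fields[0].attrib["index"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(first_attribs, raised_key_error)
```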

Default fields that are not associated with columns in the CSV are part of the DwC-A specification, so we should be able to deal with them, or at least they should not break pre-ingestion.

The easy solution would be to include only <field/> elements with an 'index' attribute in the fields list, but this could also be taken as an opportunity to deal with default values that are provided in the DwCA meta.xml (maybe later).
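A minimal sketch of that idea, not the actual dwcahandler code: `split_fields` is a hypothetical helper, and the namespace constant is an assumption about what `ns` holds in dwca_meta.py. It keeps only fields with an 'index' for the column mapping and collects the default-only fields separately, so they could be applied later.

```python
import xml.etree.ElementTree as ET

# Assumption: the Darwin Core text namespace in ElementTree's {uri} form.
NS = "{http://rs.tdwg.org/dwc/text/}"


def split_fields(node_elm, ns=NS):
    """Hypothetical helper: separate column-mapped fields from default-only ones.

    Returns (indexed, defaults), where indexed is a list of (index, term)
    tuples sorted by column index and defaults maps term -> default value.
    """
    indexed = []
    defaults = {}
    for field in node_elm.findall(f"{ns}field"):
        index = field.attrib.get("index")  # .get() avoids the KeyError
        if index is None:
            defaults[field.attrib["term"]] = field.attrib.get("default")
        else:
            indexed.append((int(index), field.attrib["term"]))
    indexed.sort()
    return indexed, defaults


# Trimmed version of the <core> element from the failing archive.
SAMPLE = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <id index="0" />
    <field default="WGS84" term="http://rs.tdwg.org/dwc/terms/geodeticDatum"/>
    <field index="0" term="http://rs.gbif.org/terms/1.0/gbifID"/>
    <field index="1" term="http://purl.org/dc/terms/accessRights"/>
  </core>
</archive>"""

core = ET.fromstring(SAMPLE).find(f"{NS}core")
indexed, defaults = split_fields(core)
print(indexed)
print(defaults)
```

With the indexed list separated out, a check like the one on line 160 could look at the first indexed field rather than at fields[0].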

@patkyn
Collaborator

patkyn commented Jan 24, 2025

Thanks @nielsklazenga. I've fixed the bug so that the meta.xml format used by this dataset is handled. I've also added test cases for DwCA extraction and validation, including one for the format used by this dataset: https://github.com/AtlasOfLivingAustralia/dwcahandler/blob/feature/v0.4.0/tests/input_files/dwca/dwca-sample2/meta.xml
Let me know if there is any other format that is not covered by the samples in https://github.com/AtlasOfLivingAustralia/dwcahandler/tree/feature/v0.4.0/tests/input_files/dwca, which are the input files for the validation unit tests.

@patkyn patkyn mentioned this issue Jan 28, 2025
patkyn added a commit that referenced this issue Jan 29, 2025
* #17 - Fix for metadata fields and add validate dwca test cases

* #17 - Remove commented code

* #17 - Validate unit test

* Update macOS version for test

* AtlasOfLivingAustralia/preingestion#272 - Strip column header spaces in csv files. Add test cases to test the core and ext dataframe

* increase test version
@patkyn patkyn closed this as completed Feb 6, 2025