
Releases: cedadev/padocc

Version 1.3.1: Bug Fixes

13 Feb 10:09
c4c1a51

Release Notes for version 1.3.1

Bug Fixes

  • Fixed a bug when using an index to identify a project from the CLI.
  • Fixed a bug when updating the cloud format to CFA (#55).
  • Fixed the CFA scan order issue; the CFA scan is now performed as the base (#56).
  • Fixed a bug in the validator check where dimension differences prevented detection of data differences, interfering with identical_dims (#57).

Version 1.3: Project and Group Operations

05 Feb 16:33
8a72745

Release Notes for pre-release 1.3a

Module Restructuring

  • Restructured module sections into three components:

    • Core: Central components
    • Operations: Relating specifically to group operations.
    • Phases: Classes for specific phases within the pipeline (scan, compute etc.)
    • Added a Tests module for automated testing.
  • Other details

    • The entire module is now referred to as padocc throughout.
    • padocc now uses Poetry for dependency management, consistent with other packages in the CEDA package landscape.

Filehandlers

  • Generic properties and methods of all filehandlers:

    • move_file: Specify a new path/name for the attached file.
    • remove_file: Delete the file from the filesystem.
    • file: Name of the attached file.
    • filename: Full path to the attached file.
  • Magic Methods of all filehandlers - implementations vary.

    • __contains__: Check if a value exists within the filehandler contents.
    • __str__: String representation of the specific instance.
    • __repr__: Programmatic representation of the specific instance.
    • __len__: Length of contents within the filehandler (where appropriate)
    • __iter__: Allows generation of iterable content of filehandler (where appropriate)
    • __getitem__: Filehandlers are indexable.
  • Other Methods

    • Append: ListFileHandler operates like a list, with an append() function.
    • Set/Get: Set or get the value attributed to a filehandler.
    • Close: Save the content of the filehandler to the filesystem.
  • Methods for Special Filehandlers

    • add_download_link: Specific to the Kerchunk JSON Filehandler; resets the path for all chunks to use the CEDA DAP connection.
    • add_kerchunk_history: Adds parameters to the Kerchunk content's history attribute to record when it was generated, etc.
    • clear: Clears the Store-type filehandlers of all internal files.
    • open (beta): Open a cloud-format (Store or JSON) Filehandler as a dataset.
  • Other Features

    • Conf: JSON Filehandlers accept a special property: a default dictionary containing template values. These are applied before any content is saved, with new values overriding the template (a sketch follows this list).
    • Documentation: Added the documentation page for filehandlers and subcomponents.
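
The behaviour above can be summarised with a small sketch. This is illustrative only and not the padocc API: the class name (JSONFileHandler), constructor arguments and file paths are assumptions; only the behaviours described in the notes (set/get, close saving to the filesystem, and new values overriding the conf template) are taken from this list.

```python
import json
import os


class JSONFileHandler:
    """Illustrative JSON filehandler: set/get content, apply a conf template, save on close."""

    def __init__(self, dirpath, file, conf=None):
        self.dirpath = dirpath
        self.file = file           # name of the attached file
        self._conf = conf or {}    # template values applied before saving
        self._value = {}

    @property
    def filename(self):
        # Full path to the attached file.
        return os.path.join(self.dirpath, self.file)

    def set(self, value):
        self._value = value

    def get(self):
        return self._value

    def __contains__(self, key):
        return key in self._value

    def close(self):
        # New values override the conf template, then content is written to the filesystem.
        with open(self.filename, 'w') as f:
            json.dump({**self._conf, **self._value}, f)


# Usage: template values are filled in unless explicitly overridden.
fh = JSONFileHandler('/tmp', 'detail.json', conf={'version': '1.0', 'cloud_format': 'kerchunk'})
fh.set({'cloud_format': 'zarr'})
fh.close()   # writes {"version": "1.0", "cloud_format": "zarr"}
```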

Project Operator

Documentation is provided for the project operator; some of the key features are detailed here, with a short usage sketch after the lists below.

  • Key Methods

    • info: Obtain information about the specific operator instance.
    • help: Get help with public methods
    • run: Can only be used with a Phase Operator (see below) for running a phased operation. All errors are handled and logged when using this function.
    • increment_versions: Increment major or minor versions
  • Properties

    • dir: The directory on the filesystem where all project files can be found
    • groupdir: The path to the group directory on the filesystem.
    • cfa_path: The path to a cfa dataset for this project.
    • outproduct: The complete filename of the output product
    • outpath: The combined path to the output product.
    • revision: The full revision describing the product version, e.g. 'kr1.0'.
    • version_no: The version number (the second part of the revision, e.g. '1.0') with a major and minor component.
    • cloud_format [Editable]: The cloud format being used in current workflows.
    • file_type [Editable]: The file type being used in current workflows.
    • source_format: The source format detected for this project.
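
As a quick orientation, a hedged usage sketch is given below. The method and property names come from the lists above, but the import path, constructor arguments and example values are assumptions for illustration rather than the documented API.

```python
from padocc import ProjectOperator   # assumed import path

proj = ProjectOperator('my_project', workdir='/path/to/workdir')   # assumed constructor

proj.info()                     # information about this operator instance
print(proj.revision)            # full revision, e.g. 'kr1.0'
print(proj.version_no)          # major/minor version number, e.g. '1.0'
print(proj.source_format)       # source format detected for this project

proj.cloud_format = 'zarr'      # editable: cloud format for current workflows
proj.file_type = 'zstore'       # editable: file type for current workflows (value illustrative)
```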

Group Operator

The group operator enables the application of phased operators across a group of datasets at once, either by parallel deployment or serial handling.

  • Key Methods

    • info: Obtain useful information about the group
    • help: Find helpful user functions for the group.
    • merge/unmerge (beta): Allows the merging and unmerging of groups of datasets.
    • run: Perform an operation on a subset or the full set of the datasets within the group (see the sketch after these lists).
    • create_sbatch: Create a job deployment to SLURM for group processing.
    • init_from_file [Mixin]: Initialise a group from parameters in a CSV file.
    • add/remove_project [Mixin]: Add or remove a specific project from a group (Not implemented in pre-release)
    • get_project [Mixin]: Enables retrieval of a specific project, also accomplished by indexing the group which utilises this function.
  • Still in development:

    • summary_data: Summarise the data across all projects in the group, including status within the pipeline.
    • progress: Obtain pipeline specific information about the projects in the group.
    • create_allocations: Assemble allocations using binpacking for parallel deployment.
  • Magic Methods

    • __str__: String representation of the group
    • __repr__: Programmatic representation of the group.
    • __getitem__: Group is integer indexable to obtain a specific dataset as a ProjectOperator.
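
A hedged sketch of typical group usage follows. Importing GroupOperation from padocc is mentioned in these notes, while the constructor arguments, CSV path and phase name are illustrative assumptions.

```python
from padocc import GroupOperation

group = GroupOperation('my_group', workdir='/path/to/workdir')   # assumed arguments

group.init_from_file('inputs/my_group.csv')   # initialise the group from a CSV file
group.info()                                  # useful information about the group

# run() applies a phased operation to a subset or the full set of datasets,
# with errors handled and logged per project.
group.run('scan')

proj = group[0]     # integer indexing returns a specific dataset as a ProjectOperator
print(proj)
```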

Phased Operators

The phased operators can be used to operate on a specific project individually, although
it is recommended to use the GroupOperator.run() method instead, as this includes all error
handling and logging via the project operator (see the sketch after the list below). Specifics of the phased operators are described below:

  • Scan Operator: Scan a subset of the source files in a specific project and generate some extrapolated results for the whole dataset, including number of chunks, data size and volumes etc.
  • Compute Operator: Inherited by DatasetProcessors within the compute module (Kerchunk, Zarr, CFA etc.) and enables the computation step. The scan operator uses the compute processors with a file limiter, to operate on a small subset for scanning purposes.
  • Validate Operator: Perform dataset validation using the CFA dataset generated as part of the pipeline. If a CFA dataset is the only dataset generated, this step is currently not utilised.
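
As a rough sketch of the recommended pattern (the phase names and constructor arguments are assumptions), each phase can be driven through the group so that errors are handled and logged per project:

```python
from padocc import GroupOperation

group = GroupOperation('my_group', workdir='/path/to/workdir')   # assumed arguments

for phase in ('scan', 'compute', 'validate'):
    # Errors raised by the phase operators are caught and logged per project.
    group.run(phase)
```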

Future improvements:

  • Reorganisation of the compute operator, as the current inheritance system is overly complicated.
  • Addition of an ingest operator for ingestion into STAC catalogs and the CEDA Archive (CEDA only).

Release Notes for pre-release 1.3.0b

General

  • Minor changes to project Readme.
  • Added Airflow example workflow file in documentation.
  • Added zarr version limit to pyproject.
  • Added specific license version to pyproject, instead of broken license link.

Tests

  • Tests now import the GroupOperation directly from padocc.
  • Added Zarr tests, which run successfully in GitLab workflows.

Phases

Compute

  • Moved the definition of is_trial to after the super() call in ComputeOperation: this allows is_trial to be set as a default for all project processes, then overridden when the scan operation creates a compute operation.
  • Changed the behaviour of the setup_cache method for DirectoryOperation objects.
  • Added a default converter type; users can also specify a type when running compute or scan. In the case of scanning, the ctype carries through to the compute element.
  • Added create_mode to create_refs(), which reorganises logger messages. Loading each cache file is always attempted, but additional messages now display when attempts fail in a row: the unsuccessful-loading message is shown only once until an attempt succeeds, and is shown again if subsequent attempts fail.
  • Concatenated shape checks into a single perform_shape_checks function.
  • Removed create_kstore/create_kfile calls, now dealt with by filehandlers.
  • Repaired the ZarrDS object, which is now in line with other padocc practices in its use of filehandlers and project properties.
  • Repaired combine_dataset function to fix issue with data vars.
  • Added store overwrite checks, with appropriate help function.

Scan

  • Added the ctype option as described above, allowing users to select a starting ctype to try for conversion.

Validate

  • Changed format_slice function so it adds brackets to coordinate values.
  • Added read-only data_report and metadata_report properties.
  • Added various type hints.
  • Added general validation error which now points at the data report.
  • Removed opening sequences for datasets, now dealt with by filehandlers.

Operations -> Groups

  • Renamed Operations to Groups.
  • Replaced blacklist with faultlist. All behaviours remain the same.
  • Removed new_inputfile property.
  • Made configure_subset a private method; use repeat_by_status instead, which also allows repetition by phase.
  • Added deletion function.
  • Added slurm directories creation (from DirectoryOperation).
  • Added docstrings to Mixins.
  • Expanded functions (illustrated in the sketch after this list):
    • Add project (add_project)
    • Remove project (remove_project)
    • Transfer project (transfer_project)
    • Repeat by status (repeat_by_status)
    • Remove by status (remove_by_status)
    • Merge subsets (merge_subsets)
    • Summarise data (summarise_data)
    • Summarise status (summarise_status)
  • Removed or migrated several older functions (mostly private).
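
A hedged sketch of how these expanded functions might be combined is shown below. The method names appear in this list, but the constructor arguments, status and phase values, and the second group are illustrative assumptions.

```python
from padocc import GroupOperation

group = GroupOperation('my_group', workdir='/path/to/workdir')     # assumed arguments

group.summarise_status()                           # pipeline status of each project
group.repeat_by_status('Error', phase='compute')   # re-queue projects that failed a phase
group.remove_by_status('Complete')                 # drop finished projects from the group

archive = GroupOperation('archive_group', workdir='/path/to/workdir')
group.transfer_project('proj_a', archive)          # move a single project between groups
```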

Core

  • Added mixins module, expanded from single script.
  • Added a dataset handler mixin: filehandlers are exposed as properties depending on specific parameters of the project.
  • Added directory mixin
  • Added properties mixin for project.
  • Added status mixin for project.
  • Added docstrings, with explanation of which mixins can be applied where.
  • Removed old definitions for creating new cloud files, now dealt with by dataset handler and filehandlers.
  • Added delete project function.
  • Added default overrides for cloud_type and file_type to apply to all projects.
  • Sorted filehandler kwargs; these are now part of the logging operation.
  • Removed the base mixins script.
  • Readded the get_attribute method from the command line.
  • Fixed issue with the CLI script argument input_file, now always specified as input.

Pre-release 2: New features and functionality

20 Jan 16:47
5aed7b8

Release notes for this pre-release are included in the Version 1.3 notes above.

Pre-release: v1.3.0a

08 Jan 11:08
Pre-release

Release notes for this pre-release are included in the Version 1.3 notes above.

Version 1.2: Allocations and Banding

12 Apr 13:24
e19c785
Pre-release

Updates for version 1.2:

  • Pipeline to Aggregate Data for Optimised Cloud Capabilities (padocc) - New official name for the pipeline.

Assessor (addition 1.2.a)

  • Two new modes added! (match and status_log)
  • Added new display options! (allocations and bands now displayed)
  • Bug fixes:
    • merge_old_new - issue with indexing different types of lists. (1.2.a1)
    • cleanup - now able to delete allocation directories. (1.2.a2)
    • progress - now able to match multiple error types. (1.2.a3)

Allocation (addition 1.2.b)

  • Added allocations for compute processes, with estimations using binpacking (requires a specific flag to enable); a sketch of the idea follows this list.
  • Added general purpose bands for rerunning different processes - will look at past runs and add time for failed jobs.
    • Uses default values for time for each phase, unless --allow-band-increase is enabled in which case previous runs are considered.
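
To illustrate the idea only (this is not the padocc implementation; the names and numbers are made up), allocations can be assembled by packing projects with estimated runtimes into array jobs that each stay under a time budget:

```python
# Greedy first-fit-decreasing bin packing of projects into allocations.
def make_allocations(estimates, budget):
    bins = []   # list of (used_time, project codes)
    for code, cost in sorted(estimates.items(), key=lambda kv: kv[1], reverse=True):
        for i, (used, members) in enumerate(bins):
            if used + cost <= budget:
                bins[i] = (used + cost, members + [code])
                break
        else:
            bins.append((cost, [code]))
    return [members for _, members in bins]


estimates = {'proj_a': 40.0, 'proj_b': 25.0, 'proj_c': 20.0, 'proj_d': 10.0}
print(make_allocations(estimates, budget=45.0))
# [['proj_a'], ['proj_b', 'proj_c'], ['proj_d']]
```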

Documentation (addition 1.2.c)

  • Added developer's guide for adding new features!
  • Updated flag listings for all tool scripts.

Group Run (addition 1.2.d)

  • Added default times for different phases
  • Added a deployment function for multiple arrays from within a single call! Allocations and bands can now be deployed (there is currently no limit to the number of array jobs that can be deployed simultaneously).
  • Added pre-deployment input requirement to check deployments are as expected.
  • Minor bug fixes
    • Verbose flag now carries over to subprocesses (1.2.d1)

Compute (addition 1.2.e)

  • Added Zarr processor!
  • Rearranged all processors with new names and class inheritance.
  • Added ProjectProcessor parent class from which KerchunkDSProcessor now inherits!
  • New functions for checking variable shapes and determining behaviour, which helps optimise processes within the pipeline.

Errors (addition 1.2.f)

  • Added NaNComparisonError - for consistent issues with comparing arrays (1.2.f1)
  • Added RemoteProtocolError - if the remote protocol cannot be handled properly (1.2.f2)
  • Added SourceNotFoundError - for resources that failed to open (1.2.f3)
  • Added ArchiveConnectionError - catches fsspec ReferenceNotReachable for multiple tries (1.2.f4)
  • Added KerchunkDecodeError - issue opening Kerchunk file (normally time decode related) (1.2.f5)
  • Added FullsetRequiredError - raised instead of risking a timeout in validation (1.2.f6); a sketch of the error-class pattern follows this list.
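
The sketch below shows the general custom-error pattern. The error names appear in this list, but the base class, messages and attributes are illustrative assumptions rather than the padocc definitions.

```python
class KerchunkException(Exception):
    """Illustrative base class carrying the project code and phase for logging."""
    message = "Pipeline error"

    def __init__(self, proj_code=None, phase=None):
        self.proj_code = proj_code
        self.phase = phase
        super().__init__(self.message)


class KerchunkDecodeError(KerchunkException):
    message = "Decode error when opening Kerchunk file (commonly time-decode related)"


class FullsetRequiredError(KerchunkException):
    message = "Validation subset insufficient; the full fileset is required to avoid a timeout"


try:
    raise KerchunkDecodeError(proj_code='proj_a', phase='validate')
except KerchunkException as err:
    print(f"{type(err).__name__} [{err.proj_code}/{err.phase}]: {err}")
```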

Index_Cat (addition 1.2.g)

  • Initialised script for later use pushing Kerchunk records to an index

Ingest (addition 1.2.h)

  • Initialised script with some basic functions to use when ingesting data to the CEDA archive, also checks download links have been added properly.

Init (addition 1.2.i)

  • Updated docstrings

Logs (addition 1.2.j)

  • Added log_status fetch function
  • Updated init_logger to ensure filehandler exists.

Scan (addition 1.2.k)

  • Removed unused function eval_sizes
  • Altered scan setup to use instances of processor classes.
  • Added new detail-cfg attributes!
  • Added override_type for specifying Zarr as an output type.

Utils (addition 1.2.l)

  • Added new switches to the BypassSwitch class for fast-tracking and for skipping link addition.
  • Reconfigured remote protocol option in open_kerchunk.
  • Added function specifically to get the blacklist.
  • Added get/set_last_run routines for band increases if jobs time out.
  • Added find_divisor and find_closest routines for use in allocations.

Validate (addition 1.2.m)

  • Integration of new errors
  • Multiple tries of fetching Kerchunk/Xarray data with different options if required.
  • Added array flattening at the point of checking NaN values. The flattened arrays are then used throughout, and combined with the new error codes for unreachable chunks this means that once the data has been fetched successfully it can be kept and reused for all tests (see the sketch below).
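
A minimal sketch of that flatten-once approach (illustrative, not the padocc validator): the NaN masks are computed on flattened copies, which are then reused for the value comparison.

```python
import numpy as np


def compare_flattened(source, cloud):
    src, cld = source.ravel(), cloud.ravel()        # flatten once, reuse for all tests
    src_nan, cld_nan = np.isnan(src), np.isnan(cld)
    if not np.array_equal(src_nan, cld_nan):
        raise ValueError("NaN masks differ between source and cloud data")
    # Compare only the non-NaN values, which both arrays now agree on.
    return bool(np.allclose(src[~src_nan], cld[~cld_nan]))


a = np.array([[1.0, np.nan], [3.0, 4.0]])
print(compare_flattened(a, a.copy()))   # True
```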

Notebooks (addition 1.2.n)

  • Renamed simple scan notebook.
  • Initialised pipeline test notebook.

Single Run (addition 1.2.o)

  • Reconfigured how allocations/bands/subsets work for single/multiple processes.
  • Reconfigured logger creation when dealing with multiple processes in a single job.
  • Added override_type flag for compute phase.

Version 1.1: PPC Error Tracking

08 Mar 15:58
80232c3
Pre-release

New Software Features:

  • Per Project Code (PPC) Error tracking
  • Individual log files for each dataset, updated automatically with filehandler updates built into the pipeline - for upcoming job allocation improvements.
  • Scanning improvements and identification of types of dimensions.
  • Support for virtual dimension additions (file_number)
  • BypassSwitch option and default changes.

New Documentation:

  • Documentation Updates
  • Example CCI Water Vapour files and tutorial
  • Kerchunk Powerpoints

v1.0.2

16 Feb 12:27
d1f5a07
Pre-release

Version 1.0.2

  • Major documentation overhaul
  • Added BypassSwitch for better control of switch options
  • Added features to Assessor:
  1. Blacklist and proj_code list concatenation with existing files.
  2. Slurm error type recognition and better labelling.

Alpha 1.0.1

13 Feb 16:22
Pre-release

Version 1.0.1

  • errors.py - contains all custom error classes for Kerchunk pipeline
  • logs.py - contains logger content and other utils
  • setup.py - installs pipeline scripts into environment
  • validate.py - thorough validation added (try subset, then try full fileset)

Bug Fixes:

  • DAP link in cached files: Pipeline now writes cache files BEFORE concatenation to ensure no linkage.
  • Dimension loading issue: Known issue with remote_protocol set - greater distinction between pre-concatenation and post-concatenation Kerchunk files.

Features added:

  • Memory flag: Assign specific memory to parallel job arrays
  • In-place validation: Pipeline allows rechecking of 'complete' datasets.
  • Multi-dimension concatenation support
  • Identical_dims identification within compute phase.
  • Added custom error classes for edge cases.

Alpha 1.0

17 Jan 11:08
b76eb7d
Pre-release

First release (alpha) of the CEDA Kerchunk Pipeline. Includes Init, Scan, Compute and Validate phases, but no Catalog/Ingest control solution. Kerchunk files created by the pipeline are fully verified, so results obtained via the Kerchunk method match those from the original NetCDF files.