
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## vX.X.X

### Changed

- fix typo in the install dependency `distributed` #20
- add missing `psutil` requirement #21

## All changes

### Added

- add support for parallel processing using `dask.distributed`, with command-line flags `--dask-distributed-local-core-fraction` and `--dask-distributed-local-memory-fraction` to control the fraction of the local machine's cores and memory to use. #16
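What these fraction flags control can be sketched as follows. This is a minimal stdlib-only illustration (the helper name `local_cluster_resources` and the hard-coded 16 GB total are assumptions for the example, not part of mllam-data-prep's API): turning a core fraction and a memory fraction into concrete resource limits for a local cluster.

```python
import os


def local_cluster_resources(core_fraction: float, memory_fraction: float,
                            total_memory_gb: float) -> tuple[int, float]:
    """Translate resource fractions into concrete limits for a local cluster.

    Illustrates the idea behind flags like
    --dask-distributed-local-core-fraction and
    --dask-distributed-local-memory-fraction: the caller picks what share
    of the machine's cores and memory the distributed scheduler may use.
    """
    # at least one core, even for very small fractions
    n_cores = max(1, int((os.cpu_count() or 1) * core_fraction))
    memory_gb = total_memory_gb * memory_fraction
    return n_cores, memory_gb


# e.g. use half the cores and a quarter of an (assumed) 16 GB of RAM
n_cores, mem_gb = local_cluster_resources(0.5, 0.25, total_memory_gb=16.0)
print(n_cores, mem_gb)
```

Leaving a fraction of the machine free is the point of exposing these as fractions rather than absolute counts: the same config can run on machines of different sizes without starving other processes.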

## All changes

### Added

- add support for creating dataset splits (e.g. train, validation, test) through the `output.splitting` section in the config file, and support for optionally computing statistics for a given split (with `output.splitting.splits.{split_name}.compute_statistics`). #28
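A sketch of what such a splitting section might look like. Only `output.splitting` and `output.splitting.splits.{split_name}.compute_statistics` are named in the entry above; the split dimension, date ranges, and the `start`/`end`/`ops`/`dims` keys are illustrative assumptions, so check the project's example configs for the exact schema.

```yaml
output:
  splitting:
    dim: time              # assumed: dimension along which splits are taken
    splits:
      train:
        start: 1990-09-03T00:00   # illustrative date range
        end: 1990-09-06T00:00
        compute_statistics:       # optionally compute stats for this split
          ops: [mean, std]        # assumed key names
          dims: [grid_index, time]
      test:
        start: 1990-09-06T00:00
        end: 1990-09-09T00:00
```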

- include `units` and `long_name` attributes for all stacked variables as `{output_variable}_units` and `{output_variable}_long_name`. #11

- include the version of mllam-data-prep in the output. #12

### Changed

- split dataset creation and storage to zarr into the separate functions `mllam_data_prep.create_dataset(...)` and `mllam_data_prep.create_dataset_zarr(...)` respectively. #7

- changes to the spec from v0.1.0:

  - the `architecture` section has been renamed to `output` to make it clearer that this section defines the properties of the output of mllam-data-prep
  - `sampling_dim` has been removed from the `output` (previously `architecture`) section of the spec, since it is not needed to create the training data
  - the variables (and their dimensions) of the output definition have been renamed from `architecture.input_variables` to `output.variables`
  - the coordinate value ranges for the dimensions of the output (i.e. what the architecture expects as input) have been renamed from `architecture.input_ranges` to `output.coord_ranges` to make their use clearer
  - selection on variable coordinate values is now set with `inputs.{dataset_name}.variables.{variable_name}.values` rather than `inputs.{dataset_name}.variables.{variable_name}.sel`
  - when the dimension-mapping method `stack_variables_by_var_name` is used, the formatting string for the new variable is now called `name_format` rather than `name`
  - when dimension-mapping is done by simply renaming a dimension, this must now be configured by providing the named method (`rename`) explicitly through the `method` key, i.e. rather than `{to_dim}: {from_dim}` it is now `{to_dim}: {method: rename, dim: {from_dim}}`, to match the signature of the other dimension-mapping methods
  - the `inputs.{dataset_name}.name` attribute has been removed, since it is superfluous given the `{dataset_name}` key
- relax the minimum Python version requirement to `>3.8` to simplify downstream usage. #13
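The `rename` spec change above can be sketched as a before/after config fragment. The dataset and dimension names (`my_dataset`, `time`, `analysis_time`) are purely illustrative:

```yaml
inputs:
  my_dataset:           # hypothetical dataset key
    dim_mapping:
      # v0.1.0 spec: a rename was implicit in the key/value pair, i.e.
      #   time: analysis_time
      # new spec: the method must be named explicitly via `method`
      time:
        method: rename
        dim: analysis_time
```

Making `rename` an explicit named method keeps its config shape consistent with the other dimension-mapping methods (such as `stack_variables_by_var_name`), so every mapping is a `method` plus its arguments.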

First tagged release of mllam-data-prep, which includes functionality to declaratively describe (in a yaml-config file) how the variables and coordinates of a set of zarr-based source datasets are mapped to a new set of variables with new coordinates, combined into a single training dataset, and written to a new zarr dataset. This explicit mapping gives the flexibility to target different model architectures (which may require different inputs with different shapes).