Releases: openpipelines-bio/openpipeline
OpenPipelines.bio v2.0.0-rc.2
OpenPipelines.bio v2.0.0-rc.1
BREAKING CHANGES
- Added cell multiplexing support to the `from_cellranger_multi_to_h5mu` component and the `cellranger_multi` workflow. For the `from_cellranger_multi_to_h5mu` component, the `output` argument now requires a value containing a wildcard character `*`, which will be replaced by the sample ID to form the final output file names. Additionally, a `sample_csv` argument is added to the `from_cellranger_multi_to_h5mu` component which describes the sample name per output file. No change is required for the `output_h5mu` argument from the `cellranger_multi` workflow; the workflow will emit multiple events in case of a multiplexed run, one for each sample. The IDs of the events (and default output file names) are set by `--sample_ids` (in case of cell multiplexing), or (as before) by the user-provided `id` for the input (PR #803 and PR #902).
- `demux/bcl_convert`: update BCL convert from 3.10 to 4.2 (PR #774).
- `demux/cellranger_mkfastq`, `mapping/cellranger_count`, `mapping/cellranger_multi` and `reference/build_cellranger_reference`: update cellranger to `8.0.1` (PR #774 and PR #811).
- Removed `--disable_library_compatibility_check` in favour of `--check_library_compatibility` for the `mapping/cellranger_multi` component and the `ingestion/cellranger_multi` workflow (PR #818).
- `lianapy`: bumped version to `1.3.0` (PR #827 and PR #862). Additionally, `groupby` is now a required argument.
- `concat`: this component was deprecated and has now been removed, use `concatenate_h5mu` instead (PR #796).
- The `workflows` folder in the root of the project no longer contains symbolic links to the built workflows in `target`.
  Using any workflow that was previously linked in this directory will now result in an error indicating
  the location of the workflow to be used instead (PR #796).
- `XGBoost`: bump version to `2.0.3` (PR #646).
- Several components: update anndata to `0.11.1` and mudata to `0.3.1` (PR #645 and PR #901), and scanpy to `1.10.4` (PR #901).
- `filter/filter_with_hvg`: this component was deprecated and has now been removed. Use `feature_annotation/highly_variable_features_scanpy` instead (PR #843).
- `dataflow/concat`: this component was deprecated and has now been removed. Use `dataflow/concatenate_h5mu` instead (PR #857).
- `convert/from_h5mu_to_seurat`: bump seurat to latest version (PR #850).
- `workflows/ingestion/bd_rhapsody`: Upgrade BD Rhapsody 1.x to 2.x, thereby changing the interface of the workflow (PR #846).
- `mapping/bd_rhapsody`: Upgrade BD Rhapsody 1.x to 2.x, thereby changing the interface of the workflow (PR #846).
- `reference/make_bdrhap_reference`: Upgrade BD Rhapsody 1.x to 2.x, thereby changing the interface of the workflow (PR #846).
- `reference/build_star_reference`: Rename `mapping/star_build_reference` to `reference/build_star_reference` (PR #846).
- `reference/cellranger_mkgtf`: Rename `reference/mkgtf` to `reference/cellranger_mkgtf` (PR #846).
- `labels_transfer/xgboost`: Align interface with new annotation workflow
  - Store label probabilities instead of uncertainties
  - Take `.h5mu` format as an input instead of `.h5ad`
- `reference/build_cellranger_arc_reference`: a default value of "output" is now specified for the argument `--genome`, in line with the `reference/build_cellranger_reference` component. Additionally, providing a value for `--organism` is no longer required and its default value of `Homo Sapiens` has been removed (PR #864).
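
The wildcard behaviour described for the `output` argument of `from_cellranger_multi_to_h5mu` can be pictured with a small sketch. This is only an illustration of the naming rule stated above (the `*` is replaced by the sample ID), not the component's implementation; the function name `expand_output_template` and the sample IDs are hypothetical.

```python
def expand_output_template(template: str, sample_ids: list[str]) -> dict[str, str]:
    """Map each sample ID to a file name by replacing `*` in the template.

    Hypothetical helper: it only mirrors the rule from the release notes.
    """
    if "*" not in template:
        # The release notes state that the value must contain a wildcard.
        raise ValueError("output template must contain a '*' wildcard")
    return {sample_id: template.replace("*", sample_id) for sample_id in sample_ids}


# Illustrative sample IDs, not taken from the release notes.
names = expand_output_template("output/*.h5mu", ["sample_a", "sample_b"])
for sample_id, file_name in names.items():
    print(sample_id, "->", file_name)
```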
MAJOR CHANGES
- Bump popv to `0.4.2` (PR #901)
NEW FUNCTIONALITY
- Added `demux/cellranger_atac_mkfastq` component: demultiplex raw sequencing data for ATAC experiments (PR #726).
- `process_samples`, `process_batches` and `rna_multisample` workflows: added functionality to scale the log-normalized
  gene expression data to unit variance and zero mean. The scaled data will be output to a different layer and the
  representation with reduced dimensions will be created and stored in addition to the non-scaled data (PR #733).
- `transform/scaling`: add `--input_layer` and `--output_layer` arguments (PR #733).
- CI: added checking of mudata contents for multiple workflows (PR #783).
- Added multiple arguments to the `cellranger_multi` workflow in order to maintain feature parity with the `mapping/cellranger_multi` component (PR #803).
- `convert/from_cellranger_to_h5mu`: add support for antigen analysis.
- Added `reference/build_cellranger_reference` component: build reference file compatible with ATAC and ATAC+GEX experiments (PR #726).
- `demux/bcl_convert`: add support for no lane splitting (PR #804).
- `reference/cellranger_mkgtf` component: Added cellranger mkgtf as a standalone component (PR #771).
- `scgpt/cross_check_genes` component: Added a gene-model cross check component for scGPT (PR #758).
- `scgpt/embedding` component: Added scGPT embedding component (PR #761).
- `scgpt/tokenize_pad` component: Added scGPT padding and tokenization component (PR #754).
- `scgpt/binning` component: Added a scGPT pre-processing binning component (PR #765).
- `workflows/integration/scgpt_leiden` workflow with scGPT integration followed by Leiden clustering (PR #794).
- `scgpt/cell_type_annotation` component: Added scGPT cell type annotation component (PR #798).
- `resources_test_scripts/scGPT.sh`: Added script to include scGPT test resources (PR #800).
- `transform/clr` component: Added the option to set the `axis` along which to apply CLR. Possible to override
  on workflow level as well (PR #767).
- `annotate/celltypist` component: Added a CellTypist annotation component (PR #825).
- `dataflow/split_h5mu` component: Added a component to split a single h5mu file into multiple h5mu files based on the values of an .obs column (PR #824).
- `workflows/test_workflows/ingestion` components & `workflows/ingestion`: Added standalone components for integration testing of ingestion workflows (PR #801).
- `workflows/ingestion/make_reference`: Add additional arguments passed through to the STAR and BD Rhapsody reference components (PR #846).
- `annotate/random_forest_annotation` component: Added a random forest cell type annotation component (PR #848).
- `dataflow/concatenate_h5mu`: data from `.uns`, both originating from the global and per-modality slots, is now retained in the final concatenated output object. Additionally, added the `uns_merge_mode` argument in order to tune the behavior when conflicting keys are detected across samples (PR #859).
- `dimred/densmap` component: Added a densMAP dimensionality reduction component (PR #748).
- `annotate/scanvi` component: Added a component to annotate cells using scANVI (PR #833).
- `transform/bpcells_regress_out` component: Added a component to regress out effects of confounding variables in the count matrix using BPCells (PR #863).
- `transform/regress_out`: Allow providing 'input' and 'output' layers for scanpy regress_out functionality (PR #863).
- `workflows/ingestion/make_reference`: add possibility to build CellRanger ARC references. Added `--motifs_file`, `--non_nuclear_contigs` and `--output_cellranger_arc` arguments (PR #864).
- Test resources (`reference_gencodev41_chr1`): switch reference genome for CellRanger to ARC variant (PR #864).
- Added `transform/tfidf` component: normalize ATAC data with TF-IDF (PR #870).
- Added `dimred/lsi` component (PR #552).
- `metadata/duplicate_obs` component: Added a component to make a copy from one .obs field or index to another .obs field within the same MuData object (PR #874, PR #899).
- `annotate/onclass` component: Added a component to annotate cell types using OnClass (PR #844).
- `annotate/svm` component: Added a component to annotate cell types using support vector machine (SVM) (PR #845).
- `metadata/duplicate_var` component: Added a component to make a copy from one .var field or index to another .var field within the same MuData object (PR #877, PR #899).
- `filter/subset_obsp` component: Added a component to subset an .obsp matrix by column based on the value of an .obs field. The resulting subset is moved to an .obsm field (PR #888).
- `labels_transfer/knn` component: Enable using additional distance functions for KNN classification (PR #830) and allow performing KNN classification based on a pre-calculated neighborhood graph (PR #890).
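
The scaling step mentioned above for `process_samples`, `process_batches` and `rna_multisample` (zero mean and unit variance per gene, written to a separate layer) can be sketched in plain Python. This is a minimal illustration under stated assumptions: the layer names `log_normalized` and `scaled` are hypothetical, the population standard deviation is used, and the actual workflows may differ in details such as the variance estimator or value clipping.

```python
from statistics import fmean, pstdev


def scale_columns(matrix: list[list[float]]) -> list[list[float]]:
    """Scale each column (gene) to zero mean and unit variance.

    Assumption: population standard deviation; constant columns become 0.0.
    """
    scaled_columns = []
    for column in zip(*matrix):
        mean = fmean(column)
        std = pstdev(column)
        if std == 0:
            scaled_columns.append([0.0] * len(column))
        else:
            scaled_columns.append([(value - mean) / std for value in column])
    # Transpose back to rows (cells) x columns (genes).
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical layer names; rows are cells, columns are genes.
layers = {"log_normalized": [[1.0, 2.0], [3.0, 2.0]]}
# The scaled values go to a new layer, so the input layer stays untouched.
layers["scaled"] = scale_columns(layers["log_normalized"])
print(layers["scaled"])
```

Keeping the scaled values in their own layer mirrors the note above that the non-scaled data remains available alongside the scaled representation.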
MINOR CHANGES
- Several components: bump python version (PR #901).
- `resources_test_scripts/cellranger_atac_tiny_bcl.sh` script: generate counts from fastq files using CellRanger atac count (PR #726).
- `cellbender_remove_background_v0_2`: update base image to `nvcr.io/nvidia/pytorch:23.12-py3` (PR #646).
- Bump scvelo to `0.3.2` (PR #828).
- Pin `numpy<2` for several components (PR #815).
- Added `resources_test_scripts/cellranger_atac_tiny_bcl.sh` script: download tiny bcl file with an ATAC experiment, download a motifs file, demultiplex bcl files to reads in fastq format (PR #726).
- `mapping/cellranger_multi` component now outputs logs on failure of the `cellranger multi` process (PR #766).
- Bump `viash-actions` to `v6` (PR #821).
- `reference/make_reference`: Do not try to extract genome fasta and transcriptome gtf if they are not gzipped (PR #856).
- Changes related to syncing the test resources (PR #867):
  - Add `.info.test_resources` to `_viash.yaml` to specify where test resources need to be synced from.
  - `download/sync_test_resources`: Use `.inf...
OpenPipelines.bio v1.0.3
OpenPipelines.bio v0.12.7
OpenPipelines.bio v1.0.2
BUG FIXES
`dataflow/concatenate_h5mu`: fix writing out multidimensional annotation dataframes (e.g. `.varm`) that had their
data dtype (`dtype`) changed as a result of adding more observations after concatenation, causing `TypeError`.
One notable example of this happening is when one of the samples does not have a multimodal annotation dataframe
which is present in another sample, causing the values to be filled with `NA` (PR #842, backported from PR #837).
OpenPipelines.bio v1.0.1
OpenPipelines.bio v1.0.0
BREAKING CHANGES
- `query/cellxgene_census`: Refactored the interface, documentation and internal workings of this component (PR #621).
  - Renamed arguments to align with standard OpenPipelines notations and cellxgene census API:
    - `--input_database` became `--input_uri`
    - `--cellxgene_release` became `--census_version`
    - `--cell_query` became `--obs_value_filter`
    - `--cells_filter_columns` became `--cell_filter_grouping`
    - `--min_cells_filter_columns` became `--cell_filter_minimum_count`
    - `--modality` became `--output_modality`
  - Removed `--dataset_id` since it was no longer being used.
  - Added `--add_dataset_meta` to add metadata to the output MuData object.
  - Documentation of the component and its arguments was improved.
- Docker image names now use `/` instead of `_` between the name of the component and the namespace (PR #712).
- Change separator for arguments with multiple inputs from `:` to `;` (PR #700 and #707). Now, all arguments with `multiple: true` will use `;` as the separator.
  This change was made to be able to deal with file paths that contain `:`, e.g. `s3://my-bucket/my:file.txt`. Furthermore, the `;` separator will become
  the default separator for all arguments with `multiple: true` in Viash >= 0.9.0.
- This project now uses viash version 0.8.4 to build components and workflows. Changes related to this version update should
  be mostly backwards compatible with respect to the results and execution of the pipelines. From a development perspective,
  drastic updates have been made to the development workflow.

  Development related changes:
  - Bump viash version to 0.8.4 (PR #598, PR #638, #697 and #706) in the project configuration.
  - All pipelines no longer use the anonymous workflow. Instead, these workflows were given
    a name which was added to the viash config as the entrypoint to the pipeline (PR #598).
  - Removed the `workflows` folder and moved its contents to new locations:
    1. The `resources_test_scripts` folder now resides in the root of the project (PR #605).
    2. All workflows have been moved to the `src/workflows` folder (PR #605).
       This implies that workflows must now be built using `viash (ns) build`, just like with components.
    3. Adjust GitHub Actions to account for new workflow paths (PR #605).
    4. In order to be backwards compatible, the `workflows` folder now contains symbolic
       links to the built workflows in `target`. This is not a problem when using the repository for pipeline
       execution. However, if a developer wishes to contribute to the project, symlink support should be enabled
       in git using `git config core.symlinks true`. Alternatively, use
       `git clone -c core.symlinks=true git@github.com:openpipelines-bio/openpipeline.git` when cloning the
       repository. This avoids the symlinks being resolved (PR #628).

       4bis. With PR #668, the workflows have been renamed. This does not hamper the backwards compatibility
       of the symlinks that have been described in 4, because they still use the original location
       which includes the original name.
       * `multiomics/rna_singlesample` has been renamed to `rna/process_single_sample`,
       * `multiomics/rna_multisample` has been renamed to `rna/rna_multisample`,
       * `multiomics/prot_multisample` became `prot/prot_multisample`,
       * `multiomics/prot_singlesample` became `prot/prot_singlesample`,
       * `multiomics/full_pipeline` was moved to `multiomics/process_samples`,
       * `multiomics/multisample` has been renamed to `multiomics/process_batches`,
       * `multiomics/integration/initialize_integration` changed to `multiomics/dimensionality_reduction`,
       * finally, all workflows at `multiomics/integration/*` were moved to `integration/*`
    5. Removed the `workflows/utils` folder. Functionality that was provided by the `DataflowHelper`
       and `WorkflowHelper` is now being provided by viash when the workflow is being built (PR #605).
  End-user facing changes:
  - The `concat` component has been deprecated and will be removed in a future release.
    Its functionality has been copied to the `concatenate_h5mu` component because the name is in
    conflict with the `concat` operator from nextflow (PR #598).
  - `prot_singlesample`, `rna_singlesample`, `prot_multisample` and `rna_multisample`: QC statistics
    are now only calculated once where needed. This means that the mitochondrial gene detection is
    performed in the `rna_singlesample` pipeline and the other count based statistics are calculated
    during the `prot_multisample` and `rna_multisample` pipelines. In both cases, the `qc` pipeline
    is being used, but only parts of that workflow are activated by parametrization. Previously
    the count based statistics were calculated in both the `singlesample` and `multisample` pipelines,
    with the results from the multisample pipelines overwriting the previous results. What is breaking here
    is that the QC statistics are no longer added to the results of the singlesample workflows.
    This is not an issue when using the `full_pipeline` because in this case the singlesample and
    multisample workflows are executed in tandem. If you wish to execute the singlesample workflows
    separately and still include count based statistics, please run the `qc` pipeline
    on the result of the singlesample workflow (PR #604).
  - `filter/filter_with_hvg` has been renamed to `feature_annotation/highly_variable_features_scanpy`, along with the following changes (PR #667).
    - `--do_filter` was removed
    - `--n_top_genes` has been renamed to `--n_top_features`
  - `full_pipeline`, `multisample` and `rna_multisample`: Renamed arguments (PR #667).
    - `--filter_with_hvg_var_output` became `--highly_variable_features_var_output`
    - `--filter_with_hvg_obs_batch_key` became `--highly_variable_features_obs_batch_key`
  - `rna_multisample`: Renamed arguments (PR #667).
    - `--filter_with_hvg_n_top_genes` became `--highly_variable_features_n_top_features`
    - `--filter_with_hvg_flavor` became `--highly_variable_features_flavor`
- Renamed `obsm_metrics` to `uns_metrics` for the `cellranger_mapping` workflow because the cellranger metrics are stored in `.uns` and not `.obsm` (PR #610).
- `mapping/cellranger_mkfastq`: update from cellranger `6.0.2` to `7.0.1` (PR #675)
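
The reason for changing the separator of `multiple: true` arguments from `:` to `;` (see the breaking changes above) is easiest to see on the example path from the notes. The sketch below only illustrates the string splitting; `split_multiple` is a hypothetical helper and not part of Viash or OpenPipelines, and the second path is made up for the example.

```python
def split_multiple(value: str, separator: str = ";") -> list[str]:
    """Split a multi-value argument string, dropping empty items.

    Hypothetical helper used only to compare the two separators.
    """
    return [item for item in value.split(separator) if item]


# First path is the example from the release notes; the second is illustrative.
raw = "s3://my-bucket/my:file.txt;s3://my-bucket/other.txt"

new_style = split_multiple(raw)       # `;` keeps each path intact
old_style = split_multiple(raw, ":")  # `:` cuts every path at its colons

print(new_style)
print(old_style)
```

With `;` the two paths come back unchanged, while splitting on `:` breaks each path at the colon of the `s3://` scheme and at the colon in the file name.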
BUG FIXES
- `mapping/cellranger_multi`: Fix the regex for the fastq input files to allow dropping the lane from the input file names (e.g. `_L001`) (PR #778).
- `workflows/rna/rna_singlesample`: Fix passing the arguments `top_n_vars` and `obs_name_mitochondrial_fraction` to the `qc` subworkflow (PR #779).
- `rna_singlesample`: fixed a bug where selecting the column for the filtering with mitochondrial fractions
  using `obs_name_mitochondrial_fraction` was done with the wrong column name, causing `ValueError` (PR #743).
- Fix publishing in `process_samples` and `process_batches` (PR #759).
- Cellranger multi: Fix using a relative input path for `--vdj_inner_enrichment_primers` (PR #717)
- `dataflow/split_modalities`: remove unused `compression` argument. Use `output_compression` instead (PR #714).
- `metadata/grep_annotation_column`: fix calculating fraction when an input observation has no counts, which caused
  the result to be out of bounds.
- Fix `--output` argument not working for several workflows (PR #740).
- `transform/log1p`: fix `--input_layer` argument not functioning (PR #678).
- `dataflow/concat` and `dataflow/concatenate_h5mu`: Fix an issue where using `--mode move` on samples with non-overlapping features would cause `var_names` to become unaligned to the data (PR #653).
- `filter/filter_with_scrublet`: (Testing) Fix duplicate test function names (PR #641).
- `dataflow/concatenate_h5mu` and `dataflow/concat`: Fix `TypeError` when using mode 'move' and a column with conflicting metadata does not exist across all samples (PR #631).
- `dataflow/concatenate_h5mu` and `dataflow/concat`: Fix an issue where joining columns with different datatypes caused `TypeError` (PR #619).
- `qc/calculate_qc_metrics`: Resolved an issue where statistics based on the input columns selected with `--var_qc_metrics` were incorrect when these input columns were encoded in `pd.BooleanDtype()` (PR #685).
- `move_obsm_to_obs`: fix setting output columns when they already exist (PR #690).
- `workflows/dimensionality_reduction` workflow: nearest neighbour calculations no longer recalculate the PCA when `obm_input` `--obsm_pca` is not set to `X_pca`.
- `feature_annotation/highly_variable_scanpy`: fix .X being used to remove observations with 0 counts when `--layer` has been specified.
- `filter/filter_with_counts`: fix `--layer` argument not being used.
- `transform/normalize_total`: fix incorrect layer being written to the output when the input layer is not `.X`.
- `src/workflows/qc`: fix input layer not being taken into account when calculating the fraction of mitochondrial genes (always used .X).
- `convert/from_cellranger_multi_to_h5mu`: fix metric values not represented as percentages being divided by 100 (#704).
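
To make the `mapping/cellranger_multi` fix above more concrete (accepting fastq file names with or without the lane part such as `_L001`), here is a sketch of a pattern with an optional lane group. This is an assumption-laden illustration, not the regex used by the component: it assumes the common `<sample>_S<n>[_L<lane>]_<read>_001.fastq.gz` naming layout, and the file names are made up.

```python
import re

# Assumed naming layout; the lane group `_L001` is optional.
FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_number>\d+)"
    r"(?:_L(?P<lane>\d{3}))?"
    r"_(?P<read>[RI][12])_001\.fastq(?:\.gz)?$"
)


def parse_fastq_name(file_name: str) -> dict:
    """Return the parts of a fastq file name, or raise if it does not match."""
    match = FASTQ_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"unexpected fastq file name: {file_name}")
    return match.groupdict()


with_lane = parse_fastq_name("pbmc_S1_L001_R1_001.fastq.gz")
without_lane = parse_fastq_name("pbmc_S1_R2_001.fastq.gz")
print(with_lane)
print(without_lane)
```

When the lane is absent, the `lane` group is simply `None`, so downstream code can treat both layouts with the same parser.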
NEW FUNCTIONALITY
- `dimred/tsne` component: Added a tSNE dimensionality reduction component (PR #742).
- `multisample` pipeline: This workflow now works when provided multiple unimodal files or multiple multimodal files, in addition to the previously supported single multimodal file (PR #606). The modalities are processed independently from each other:
  - As before, a single multimodal file is split into several unimodal MuData objects, e...
OpenPipelines.bio v1.0.0-rc6
BUG FIXES
`dataflow/concatenate_h5mu`: fix regression bug where observations are no longer linked to the correct metadata
after concatenation (PR #807)
OpenPipelines.bio v1.0.0-rc5
BUG FIXES
`cluster/leiden`: prevent leiden component from hanging when a child process is killed (e.g. when there is not enough memory available) (PR #805).
OpenPipelines.bio v1.0.0-rc4
BREAKING CHANGES
`query/cellxgene_census`: Refactored the interface, documentation and internal workings of this component (PR #621).
- Renamed arguments to align with standard OpenPipelines notations and cellxgene census API:
  - `--input_database` became `--input_uri`
  - `--cellxgene_release` became `--census_version`
  - `--cell_query` became `--obs_value_filter`
  - `--cells_filter_columns` became `--cell_filter_grouping`
  - `--min_cells_filter_columns` became `--cell_filter_minimum_count`
  - `--modality` became `--output_modality`
- Removed `--dataset_id` since it was no longer being used.
- Added `--add_dataset_meta` to add metadata to the output MuData object.
- Documentation of the component and its arguments was improved.
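
For users updating existing `query/cellxgene_census` invocations, the renames listed above can be captured as a simple lookup table. The sketch below is hypothetical tooling, not something shipped with OpenPipelines: `migrate_args` is an invented name, and it assumes arguments are passed as separate `--flag value` tokens and that the removed `--dataset_id` took exactly one value.

```python
# Renames and removal taken from the release notes above.
RENAMED = {
    "--input_database": "--input_uri",
    "--cellxgene_release": "--census_version",
    "--cell_query": "--obs_value_filter",
    "--cells_filter_columns": "--cell_filter_grouping",
    "--min_cells_filter_columns": "--cell_filter_minimum_count",
    "--modality": "--output_modality",
}
REMOVED = {"--dataset_id"}


def migrate_args(argv: list[str]) -> list[str]:
    """Rewrite old-style flags; drop removed flags together with their value.

    Assumption: every flag is followed by exactly one value token.
    """
    migrated = []
    skip_value = False
    for token in argv:
        if skip_value:
            skip_value = False
            continue
        if token in REMOVED:
            skip_value = True
            continue
        migrated.append(RENAMED.get(token, token))
    return migrated


# Illustrative argument values, not taken from the release notes.
old = ["--input_database", "CellxGene", "--dataset_id", "abc", "--modality", "rna"]
new = migrate_args(old)
print(new)
```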