Releases: tschuelia/Pandora
v2.0.1
v2.0.0
Major changes
MDS implementation
One of our beta testers discovered issues with the MDS implementation in Pandora. Until now we used the scikit-learn MDS implementation, whose solver is best suited for non-metric MDS (and mostly for small matrices). However, scikit-learn uses the same solver for metric MDS as well, independent of the size of the data. This leads to unexpected results, and sometimes the solver does not find a solution at all, resulting in circular MDS embeddings. These are known issues in scikit-learn (scikit-learn/scikit-learn#18933, scikit-learn/scikit-learn#16846, scikit-learn/scikit-learn#11381, scikit-learn/scikit-learn#15272), and a PR implementing an alternative standard SVD solver for MDS has remained unmerged for about 1.5 years (scikit-learn/scikit-learn#22330).
To prevent these issues in Pandora, we switched to the PCoA implementation in scikit-allel (which implements a standard SVD solver for metric MDS). This resulted in the following additional changes:
- The MDS and PCA classes in embedding.py are now unified into one class called Embedding.
- MDS results no longer have the stress attribute. Instead, they provide the explained variance per dimension, similar to PCA results (scikit-allel's PCoA provides explained variance ratios rather than a stress factor, which is more informative anyway).
- All MDS plots now show the explained variance per dimension, similar to PCA plots (instead of the stress).
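To illustrate why a PCoA-style solver yields explained variance ratios instead of a stress value, here is a minimal NumPy sketch of classical (metric) MDS via eigendecomposition of the double-centered squared distance matrix. This is neither Pandora's nor scikit-allel's code; the function name `classical_mds` and the toy data are made up for this example.

```python
import numpy as np


def classical_mds(distances, n_components=2):
    """Classical MDS / PCoA sketch (hypothetical helper, not Pandora's API)."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # The Gram matrix is symmetric, so eigh applies; sort eigenvalues descending.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Clip small negative eigenvalues caused by numerical noise.
    positive = np.clip(eigvals, 0.0, None)
    coords = eigvecs[:, :n_components] * np.sqrt(positive[:n_components])
    explained_ratio = positive[:n_components] / positive.sum()
    return coords, explained_ratio


# Toy example: four points on a 3 x 4 rectangle.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
toy_distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords, explained_ratio = classical_mds(toy_distances, n_components=2)
```

Because the solution comes directly from an eigendecomposition, there is no iterative optimization that could fail to converge, and each dimension comes with its share of the total variance.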
CLI flag + variable naming
We renamed bootstrap_convergence_confidence_level to bootstrap_convergence_tolerance to follow the terminology of our paper.
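If you keep Pandora settings in a dictionary (for example, loaded from a config file), the rename could be handled with a small migration helper along the following lines. This is only a sketch: `migrate_config` is a hypothetical helper, not part of Pandora, and it assumes the value keeps its meaning under the new name.

```python
OLD_KEY = "bootstrap_convergence_confidence_level"
NEW_KEY = "bootstrap_convergence_tolerance"


def migrate_config(config: dict) -> dict:
    """Return a copy of config that uses the new key name (hypothetical helper)."""
    migrated = dict(config)
    if OLD_KEY in migrated:
        old_value = migrated.pop(OLD_KEY)
        # Do not overwrite a value that is already stored under the new name.
        migrated.setdefault(NEW_KEY, old_value)
    return migrated


old_config = {OLD_KEY: 0.05, "n_bootstraps": 100}
new_config = migrate_config(old_config)
```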
Minor changes
We improved the implementation of missing_corrected_hamming_distance, resulting in a 100x speedup.
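The release notes do not say how the speedup was achieved. As a general illustration, a Hamming-type distance that skips missing entries can be computed with vectorized NumPy operations instead of a Python loop. The sketch below assumes the distance is the fraction of mismatches among positions observed in both samples; that definition, the function name, and the sentinel value are assumptions, not Pandora's implementation.

```python
import numpy as np


def hamming_ignoring_missing(a, b, missing_value):
    """Fraction of mismatches among positions observed in both samples.

    Hypothetical helper for illustration only.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    # Only positions where neither sample is missing are compared.
    observed = (a != missing_value) & (b != missing_value)
    n_observed = np.count_nonzero(observed)
    if n_observed == 0:
        return float("nan")
    mismatches = np.count_nonzero((a != b) & observed)
    return mismatches / n_observed


sample_a = np.array([0, 1, 2, 255, 1], dtype=np.uint8)
sample_b = np.array([0, 2, 2, 1, 255], dtype=np.uint8)
distance = hamming_ignoring_missing(sample_a, sample_b, missing_value=255)
```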
Bug fixes
We fixed a bug that caused the FST distance matrix in the EigenDataset to be recomputed regardless of the redo flag.
Contributions
Thanks Lucas for testing Pandora and reporting all issues 🙂
v1.0.8
Bug Fixes:
- Fix an issue checking for string occurrences in Pandas Series
- Set the random state for the scikit-learn MDS computation for reproducible results
Improvements:
- New documentation site: a Jupyter notebook with an example of a more thorough inspection of Pandora results
- Set the default smartpca path to 'smartpca' in EigenDataset::run_pca
New Feature:
HTML export of all plots when plot_results: true is set in the Pandora config file. The HTML exports can be opened in any browser and provide interactive exploration of the plots, leveraging the full power of Plotly 🙂
v1.0.7
v1.0.6
v1.0.5
Changes:
- We changed the dtype of the input matrix of the NumpyDataset to uint8 instead of float64. The idea here is that genotype data usually only comprises four values: 0, 1, 2, and np.nan. Setting the dtype to the standard numpy float64 results in a huge memory overhead during bootstrapping. So instead, we changed the default type to uint8, but allow the user to change the type in case the input data requires another data type, for example because it comprises more than four values that do not fit the default uint8 type.
  Note that in case of missing data and a non-float dtype, the missing value will not be np.nan. For more details see the documentation.
- For easier use of the Pandora library, we provide more default settings for multiple bootstrap and PCA computation methods:
  - bootstrap_and_embed_multiple and bootstrap_and_embed_multiple_numpy:
    - embedding = EmbeddingAlgorithm.PCA
    - n_components = 10
    - n_bootstraps = 100
    - smartpca = "smartpca"
    - result_dir = same directory as the input data
  - bootstrap_and_embed_multiple_numpy:
    - embedding = EmbeddingAlgorithm.PCA
    - n_components = 10
    - n_bootstraps = 100
  - EigenDataset.bootstrap and NumpyDataset.bootstrap: seed = None
  - EigenDataset.run_pca: smartpca = "smartpca"
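The memory argument for uint8 and the effect of an explicit seed can be illustrated with plain NumPy. The sketch below does not use the Pandora API. The sentinel value 255, the helper `bootstrap_snps`, and the layout (samples as rows, SNPs as columns, resampled with replacement) are assumptions made for this example.

```python
import numpy as np

MISSING = 255  # hypothetical sentinel: uint8 cannot represent np.nan

rng = np.random.default_rng(0)
# Toy genotype matrix: 100 samples x 1000 SNPs with values 0, 1, 2.
genotypes = rng.integers(0, 3, size=(100, 1000)).astype(np.uint8)
genotypes[0, 0] = MISSING

# The same data stored as float64 needs eight times the memory.
as_float = genotypes.astype(np.float64)
print(genotypes.nbytes, as_float.nbytes)


def bootstrap_snps(matrix, seed=None):
    """Resample SNP columns with replacement (hypothetical helper)."""
    local_rng = np.random.default_rng(seed)
    n_snps = matrix.shape[1]
    columns = local_rng.integers(0, n_snps, size=n_snps)
    return matrix[:, columns]


# With a fixed seed, two calls draw identical replicates; with seed=None
# each call draws a fresh, non-reproducible replicate.
replicate_a = bootstrap_snps(genotypes, seed=42)
replicate_b = bootstrap_snps(genotypes, seed=42)
```

Each bootstrap replicate is a full copy of the matrix, so the per-element size multiplies across all replicates held in memory.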
Bug Fixes:
- When computing the FST matrix for the MDS analysis of an EigenDataset, smartpca ignores all samples with population Ignore. So far, this caused a failure in the MDS computation because Pandora was not aware of it. We have now adopted the Ignore logic from smartpca and remove samples with the Ignore population.
- Dashes in population names caused an issue when computing the FST distance matrix for an EigenDataset with smartpca.
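The idea behind the Ignore fix can be sketched as filtering sample metadata before it is matched against the FST matrix. This is an illustration only: `drop_ignored` is a hypothetical helper, and the exact, case-sensitive match on the label is an assumption.

```python
def drop_ignored(sample_ids, populations, ignore_label="Ignore"):
    """Remove samples whose population equals ignore_label (hypothetical helper)."""
    kept = [
        (sample, population)
        for sample, population in zip(sample_ids, populations)
        if population != ignore_label
    ]
    kept_ids = [sample for sample, _ in kept]
    kept_populations = [population for _, population in kept]
    return kept_ids, kept_populations


ids, pops = drop_ignored(
    ["s1", "s2", "s3", "s4"],
    ["PopA", "Ignore", "PopB", "PopA"],
)
```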
v1.0.4
- A race condition in the Python multiprocessing library sometimes caused unexpected AttributeErrors during the bootstrap process creation. The bug is fixed in Python 3.12; we added the fix as a backport for older Python versions.
- We replaced np.empty with np.zeros in the hamming distance computation.
- We changed the np.sum invocations with generators to explicit lists (np.sum from generators is deprecated).
- Documentation updates
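The two NumPy changes can be illustrated with a small pairwise distance computation. np.empty returns uninitialized memory, so any entry that is never written (here, the diagonal) would hold arbitrary values, whereas np.zeros guarantees zeros. The data and loop below are a made-up example, not Pandora's code.

```python
import numpy as np

genotypes = np.array([[0, 1, 2], [0, 1, 1], [2, 1, 0]])
n_samples = genotypes.shape[0]

# np.zeros instead of np.empty: the diagonal is never assigned below,
# so it must start out as zero rather than as uninitialized memory.
distances = np.zeros((n_samples, n_samples))

for i in range(n_samples):
    for j in range(i + 1, n_samples):
        # Pass an explicit list to np.sum instead of a generator expression.
        mismatches = np.sum([a != b for a, b in zip(genotypes[i], genotypes[j])])
        distances[i, j] = distances[j, i] = mismatches
```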
v1.0.3
Removed test_config.py and the manual setting of the smartpca and convertf paths.
The update of bioconda's eigensoft recipe enables eigensoft installation on osx-arm64 as well, so no explicit path setting should be necessary (bioconda/bioconda-recipes#44082). This also enables me to run the tests in the conda-forge recipe.