(Semi-)automatic conversion of nullable columns to the appropriate pandas arrays #1068

@grst

Since #504, AnnData supports nullable int and bool columns in obs. Support for strings is planned in #679.

However, this only works if the nullable columns are stored as the appropriate pandas extension array type.

For instance, the following code

import anndata
import numpy as np
import pandas as pd

adata = anndata.AnnData(
    X=None,
    obs=pd.DataFrame().assign(
        test_int=np.array([1, 2, None, 3]),
        test_bool=[True, False, None, False],
    ),
)
adata.write_h5ad("test.h5ad")

fails with TypeError: Can't implicitly convert non-string objects to strings.

After converting the columns to pandas arrays, the object can be saved:

# pd.array infers a nullable extension dtype (e.g. Int64, boolean) from the values
for c in adata.obs.columns:
    adata.obs[c] = pd.array(adata.obs[c].values)
adata.write_h5ad("test.h5ad")

Unfortunately, the pandas extension arrays are little known, and None values can end up in adata.obs for various reasons (for instance scverse/scirpy#434).

I was wondering if such columns should be automatically converted to the appropriate pandas array, e.g. on save?
Or maybe there should be an equivalent to AnnData.strings_to_categoricals that can be called to sanitize such columns?
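
To make the second idea concrete, here is a rough sketch of what such a sanitizing helper could look like. The name `nullable_to_extension_arrays` is hypothetical and not part of AnnData; the sketch only uses pandas and assumes that only object-dtype columns holding integers or booleans plus missing values should be touched.

```python
import numpy as np
import pandas as pd


def nullable_to_extension_arrays(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of `df` in which object columns
    holding only ints (or only bools) plus missing values are converted to
    the pandas nullable "Int64" / "boolean" extension dtypes."""
    out = df.copy()
    for col in out.columns:
        if not pd.api.types.is_object_dtype(out[col]):
            continue  # leave numeric, categorical, string, ... columns alone
        # infer_dtype ignores None/NaN when skipna=True
        kind = pd.api.types.infer_dtype(out[col], skipna=True)
        if kind == "integer":
            out[col] = pd.array(out[col].to_numpy(), dtype="Int64")
        elif kind == "boolean":
            out[col] = pd.array(out[col].to_numpy(), dtype="boolean")
    return out


# Usage on a frame shaped like the example above
obs = pd.DataFrame(
    {
        "test_int": np.array([1, 2, None, 3], dtype=object),
        "test_bool": [True, False, None, False],
        "label": ["a", "b", "c", "d"],
    }
)
fixed = nullable_to_extension_arrays(obs)
print(fixed.dtypes)
```

Restricting the conversion to integer and boolean columns is a deliberate choice in this sketch: pandas' own `DataFrame.convert_dtypes()` does something similar, but it also converts string columns to the nullable string dtype, which is not yet supported according to #679.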
