
Loom format

PaulRivaud edited this page Jan 30, 2023 · 13 revisions

About Loom files

The Loom format is a file format developed by the Linnarsson Lab and made specifically to store scRNA-seq data and associated metadata. An R version was also developed by the Satija Lab. Examples present on this page use the Loompy syntax (Python).

Compatibility requirements

To ensure compatibility with our viewers, Loom files storing datasets must be configured a certain way and contain key elements. We present our requirements below.

# Open and close a Loom file
# Don't forget to close the file connection properly when done
import loompy

df = loompy.connect(loom_path) # loom_path: path to an existing .loom file
# do something
df.close()

Global attributes

Global attributes provide information regarding the overall dataset. Mandatory global attributes are:

  • LOOM_SPEC_VERSION Loom version
  • Species Dataset species
  • Classes Comma-separated string of column attributes of interest (will be available as colors in viewer)
  • reductions JSON of dimensionality reductions (key) and associated coordinate labels (value, list of X and Y labels)
  • most_variable_genes Comma-separated string of most variable genes for the dataset
  • relevant_genes Comma-separated string of hand-picked relevant genes

For Spatial Transcriptomics datasets:

  • spatial_img_url A server path or URL to a slice image
  • spot_diameter_fullres The spot_diameter_fullres value from the scale_factors JSON file
  • reductions must contain a reduction called spatial
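
The two lists above can be turned into a quick pre-flight check. The sketch below is only an illustration: check_global_attrs is a hypothetical helper (it is not part of loompy or of the viewers), and it assumes the attributes are held in a plain dictionary with reductions stored either as a JSON string or as a dictionary.

```python
import json

# Attribute names copied from the lists above.
MANDATORY_ATTRS = ("LOOM_SPEC_VERSION", "Species", "Classes", "reductions",
                   "most_variable_genes", "relevant_genes")
SPATIAL_ATTRS = ("spatial_img_url", "spot_diameter_fullres")


def check_global_attrs(attrs, spatial=False):
    """Hypothetical helper: return a list of problems (empty if none)."""
    required = MANDATORY_ATTRS + (SPATIAL_ATTRS if spatial else ())
    problems = [f"missing attribute: {name}" for name in required
                if name not in attrs]
    if "reductions" in attrs:
        reductions = attrs["reductions"]
        if isinstance(reductions, str):
            reductions = json.loads(reductions)  # stored as a JSON string
        if spatial and "spatial" not in reductions:
            problems.append("reductions has no 'spatial' entry")
    return problems
```

Calling check_global_attrs(global_attrs, spatial=True) before creating the file lists every missing key at once instead of failing on the first one.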

A global attribute can be set directly on an open connection:

df.attrs['key'] = 'value'

We recommend collecting the global attributes in a dictionary and passing it to loompy.create():

global_attrs = {
    'Species' : 'Human',
    'LOOM_SPEC_VERSION' : loompy.__version__,
    'Classes' : 'cell_type,cluster,sample',
    'reductions' : '{"umap": ["X", "Y"], "tsne": ["X2", "Y2"], "pca": ["XPCA", "YPCA"]}', # JSON string
    'most_variable_genes' : 'CD34,LHX2,PTPRC,PAX8,PAX2',
    'relevant_genes' : 'SOX9,CD34,LHX2,NANOG'
}
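
When the values start out as Python lists and dictionaries, the string formats described above (comma-separated strings, JSON for reductions) can be produced mechanically. This is a minimal sketch under that assumption; build_global_attrs is a hypothetical helper name, and the spec_version value in the usage example is a placeholder (the example on this page passes loompy.__version__).

```python
import json


def build_global_attrs(species, spec_version, classes, reductions,
                       most_variable_genes, relevant_genes):
    """Hypothetical helper: format lists and dicts as attribute strings."""
    return {
        "Species": species,
        "LOOM_SPEC_VERSION": spec_version,
        "Classes": ",".join(classes),              # comma-separated string
        "reductions": json.dumps(reductions),      # JSON string
        "most_variable_genes": ",".join(most_variable_genes),
        "relevant_genes": ",".join(relevant_genes),
    }


global_attrs = build_global_attrs(
    species="Human",
    spec_version="placeholder",  # e.g. loompy.__version__
    classes=["cell_type", "cluster", "sample"],
    reductions={"umap": ["X", "Y"], "tsne": ["X2", "Y2"]},
    most_variable_genes=["CD34", "LHX2", "PTPRC"],
    relevant_genes=["SOX9", "CD34"],
)
```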

Column attributes

Column attributes store metadata for cells. Each column attribute must be an array with a size matching the number of cells in the dataset. Our requirements are:

  • Sample An array of cell IDs (barcodes)
  • Two column attributes for X and Y coordinates from a dimensionality reduction for visualization purposes, matching the reductions global attribute

Users can add any other metadata array they want. A column attribute can be set directly on an open connection:

df.ca['key'] = my_array

We recommend creating a dictionary that will be used with loompy.create(). All values must be arrays of matching size:

col_attrs = {
    'Sample' : barcodes,
    'X' : X,
    'Y' : Y,
    'cell_types' : celltypes
}
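
The column attribute requirements above (a Sample array, arrays of equal size, and coordinate labels matching the reductions global attribute) can be verified before the file is written. The following is a sketch only; check_column_attrs is a hypothetical helper, and it assumes reductions is available as a JSON string.

```python
import json


def check_column_attrs(col_attrs, reductions_json):
    """Hypothetical helper: return a list of problems found in col_attrs."""
    problems = []
    if "Sample" not in col_attrs:
        problems.append("missing column attribute: Sample")
    # Every array must hold one entry per cell.
    sizes = {name: len(values) for name, values in col_attrs.items()}
    if len(set(sizes.values())) > 1:
        problems.append(f"column attributes differ in size: {sizes}")
    # Every coordinate label named in 'reductions' must exist as a column attribute.
    for reduction, labels in json.loads(reductions_json).items():
        for label in labels:
            if label not in col_attrs:
                problems.append(f"{reduction}: no column attribute named {label}")
    return problems
```

An empty list means the dictionary satisfies the three requirements; any other metadata arrays are accepted as long as their size matches.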

Row attributes

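Row attributes store metadata for genes, in the same way that column attributes store metadata for cells. Our requirement is:

  • Symbol An array of gene symbols, with a size matching the number of genes in the dataset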

Users can add any other metadata array they want. A row attribute can be set directly on an open connection:

df.ra['key'] = my_array

We recommend creating a dictionary that will be used with loompy.create(). All values must be arrays of matching size:

row_attrs = {
    'Symbol' : symbols,
    'Synonyms' : synonyms,
    'Strand' : strand
}

Expression matrix

An expression matrix is required to create a Loom file. It enables gene querying in the viewer.

Creating Loom

Once the attribute dictionaries have been built, creating the Loom file takes a single call:

# M: expression matrix
loompy.create('my_dataset.loom', M, row_attrs=row_attrs, col_attrs=col_attrs, file_attrs=global_attrs)
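
In a Loom file, rows are genes and columns are cells, so row attributes must have one entry per matrix row and column attributes one entry per matrix column. The sketch below checks this before calling loompy.create(); check_matrix_shape is a hypothetical helper, and the tiny arrays are made-up illustration data.

```python
import numpy as np


def check_matrix_shape(M, row_attrs, col_attrs):
    """Hypothetical helper: raise ValueError if attribute sizes do not match M."""
    n_genes, n_cells = M.shape  # rows are genes, columns are cells
    for name, values in row_attrs.items():
        if len(values) != n_genes:
            raise ValueError(
                f"row attribute {name!r} has {len(values)} entries, expected {n_genes}")
    for name, values in col_attrs.items():
        if len(values) != n_cells:
            raise ValueError(
                f"column attribute {name!r} has {len(values)} entries, expected {n_cells}")


# Made-up data: 3 genes x 2 cells.
M = np.zeros((3, 2))
check_matrix_shape(M,
                   row_attrs={"Symbol": np.array(["SOX9", "CD34", "LHX2"])},
                   col_attrs={"Sample": np.array(["cell_1", "cell_2"])})
```

If the matrix was loaded with cells as rows, transposing it before creating the file is usually the fix.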

Spatial Transcriptomics

Key elements:

  • Global attribute reductions must contain a reduction called spatial
  • When calculating spatial X and Y coordinates, make sure to use the scaling factor matching the image resolution
  • Include spot_diameter_fullres as a global attribute. The value is found in the scalefactors_json.json file
  • Additional reductions (UMAP, TSNE) and metadata can be added
  • A low-resolution image must be stored under /groups/irset/archives/web/genoViewer/media/datasets/spatial/<my_dataset>/tissue_lowres_image.png
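
The scaling step from the list above can be sketched on its own. Spot positions are given in full-resolution pixels, so they need to be multiplied by the scale factor of the image that is actually displayed. scale_to_image is a hypothetical helper, and the numbers are made up for illustration; real values come from scalefactors_json.json.

```python
import json


def scale_to_image(fullres_coords, scale_factors, resolution="lowres"):
    """Hypothetical helper: convert full-resolution pixel coordinates
    to pixel coordinates of the image at the chosen resolution."""
    factor = scale_factors[f"tissue_{resolution}_scalef"]
    return [value * factor for value in fullres_coords]


# Made-up values; a real file is read with json.load() instead.
scale_factors = json.loads(
    '{"tissue_hires_scalef": 0.5, "tissue_lowres_scalef": 0.25, '
    '"spot_diameter_fullres": 80.0}'
)
x_lowres = scale_to_image([1000, 2000], scale_factors)  # for tissue_lowres_image.png
```

Since the viewer expects tissue_lowres_image.png, the lowres factor is the default here; using the hires factor with the low-resolution image would misplace every spot.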

Example script building a Loom file from 10x Visium data:

import os
import tables
import collections
import json
import loompy
import numpy as np
import pandas as pd
import scipy.sparse as sp_sparse


# H5 matrix
CountMatrix = collections.namedtuple('CountMatrix', ['feature_ref', 'barcodes', 'matrix'])
 
def get_matrix_from_h5(filename):
    with tables.open_file(filename, 'r') as f:
        mat_group = f.get_node(f.root, 'matrix')
        barcodes = f.get_node(mat_group, 'barcodes').read()
        barcodes = np.array([x.decode('utf-8') for x in barcodes])
        data = getattr(mat_group, 'data').read()
        indices = getattr(mat_group, 'indices').read()
        indptr = getattr(mat_group, 'indptr').read()
        shape = getattr(mat_group, 'shape').read()
        matrix = sp_sparse.csc_matrix((data, indices, indptr), shape=shape)
         
        feature_ref = {}
        feature_group = f.get_node(mat_group, 'features')
        feature_ids = getattr(feature_group, 'id').read()
        feature_names = getattr(feature_group, 'name').read()
        feature_names = np.array([x.decode('utf-8') for x in feature_names])
        feature_types = getattr(feature_group, 'feature_type').read()
        feature_ref['id'] = feature_ids
        feature_ref['name'] = feature_names
        feature_ref['feature_type'] = feature_types
        tag_keys = getattr(feature_group, '_all_tag_keys').read()
        for key in tag_keys:
            key = key.decode("utf-8")
            feature_ref[key] = getattr(feature_group, key).read()
         
        return CountMatrix(feature_ref, barcodes, matrix)
    
#-----------------------------------------------------------------------------------
# Load matrix
#-----------------------------------------------------------------------------------
base_path = "test_data_spatial/HUGODECA/10x_Visium_data/V11A27-300_D1"
filtered_matrix_h5 = os.path.join(f'{base_path}','filtered_feature_bc_matrix.h5')
filtered_feature_bc_matrix = get_matrix_from_h5(filtered_matrix_h5)

#-----------------------------------------------------------------------------------
# Load scale factors & positions
#-----------------------------------------------------------------------------------
scalefactors_file = os.path.join(f'{base_path}','spatial/scalefactors_json.json')
with open(scalefactors_file, 'r') as f:
    scale_factors = json.load(f)
    
barcodes_coord_file = os.path.join(f'{base_path}','spatial/tissue_positions_list.csv')
spots = pd.read_csv(barcodes_coord_file,names=['barcode','tissue','ygrid','xgrid','ycoord','xcoord'])

#-----------------------------------------------------------------------------------
# Spot selection & scaling
#-----------------------------------------------------------------------------------
spot_sel = spots[spots.tissue==1] # spot selection
spot_sel = spot_sel.set_index('barcode') # set index in coord dataframe
spot_sel_sorted = spot_sel.loc[filtered_feature_bc_matrix.barcodes] # reorder rows based on barcodes order
xcoord = spot_sel_sorted.xcoord.values*scale_factors['tissue_lowres_scalef']
ycoord = spot_sel_sorted.ycoord.values*scale_factors['tissue_lowres_scalef']

#-----------------------------------------------------------------------------------
# Loom attributes
#-----------------------------------------------------------------------------------
species = 'Human'
reductions = '{"spatial" : ["X","Y"]}'
classes = ''

col_attrs = dict() # empty dictionary to store column attributes
col_attrs['Sample'] = filtered_feature_bc_matrix.barcodes
col_attrs['X'] = xcoord
col_attrs['Y'] = ycoord

row_attrs = dict() # empty dictionary to store row attributes
row_attrs['Symbol'] = filtered_feature_bc_matrix.feature_ref['name'] # features

#-----------------------------------------------------------------------------------
# Variable and relevant genes
#-----------------------------------------------------------------------------------
n = 10 # number of variable genes desired
v = np.var(filtered_feature_bc_matrix.matrix.toarray(),axis=1) # compute per-gene variance (densified: np.var does not accept scipy sparse matrices)
idx = np.argsort(v)[::-1][:n] # sort and select in descending order
most_variable_genes = row_attrs['Symbol'][idx] # trim symbol array
most_variable_genes = most_variable_genes[most_variable_genes != 'nan'] # drop potential nan values
most_variable_genes = ','.join(most_variable_genes) # comma-separated format
relevant_genes = 'SOX9,CD34,LHX2,NANOG' # hand-picked genes

#-----------------------------------------------------------------------------------
# Create Loom
#-----------------------------------------------------------------------------------
global_attrs = {
    'Species' : species,
    'LOOM_SPEC_VERSION' : loompy.__version__,
    'Classes' : classes,
    'reductions' : reductions,
    'spot_diameter_fullres' : scale_factors['spot_diameter_fullres'],
    'spatial_img_url' : 'datasets/spatial/<my_dataset>/tissue_lowres_image.png',
    'most_variable_genes' : most_variable_genes,
    'relevant_genes' : relevant_genes
}

output_file = f'test_data_spatial/HUGODECA/looms/{base_path.split("/")[-1]}.loom'
loompy.create(output_file, filtered_feature_bc_matrix.matrix, row_attrs=row_attrs, col_attrs=col_attrs, file_attrs=global_attrs) # create loom file