Loom format
The Loom format is a file format developed by the Linnarsson Lab and made specifically to store scRNA-seq data and associated metadata. An R version was also developed by the Satija Lab. Examples present on this page use the Loompy syntax (Python).
To ensure compatibility with our viewers, Loom files storing datasets must be configured a certain way and contain key elements. We present our requirements below.
# Open and close a Loom file
import loompy
# Don't forget to close the file connection properly when done
df = loompy.connect(loom_path)
# do something
df.close()
Global attributes provide information regarding the overall dataset. Mandatory global attributes are:
- LOOM_SPEC_VERSION: Loom version
- Species: Dataset species
- Classes: Comma-separated string of column attributes of interest (will be available as colors in the viewer)
- reductions: JSON of dimensionality reductions (key) and associated coordinate labels (value, list of X and Y labels)
- most_variable_genes: Comma-separated string of the most variable genes for the dataset
- relevant_genes: Comma-separated string of hand-picked relevant genes
For Spatial Transcriptomics datasets:
- spatial_img_url: A server path or URL to a slice image
- spot_diameter_fullres: The spot_diameter_fullres value from the scalefactors_json.json file
- reductions: must contain a reduction called spatial
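The attribute lists above can be checked before a file is written. The following is a minimal sketch, not part of loompy: the helper name `missing_global_attrs` and the two tuples are hypothetical, introduced only for illustration.

```python
# Hypothetical helper (not a loompy function): reports which mandatory
# global attributes are absent from a dictionary of global attributes.
MANDATORY_ATTRS = (
    "LOOM_SPEC_VERSION", "Species", "Classes",
    "reductions", "most_variable_genes", "relevant_genes",
)
# Extra attributes required for Spatial Transcriptomics datasets
SPATIAL_ATTRS = ("spatial_img_url", "spot_diameter_fullres")


def missing_global_attrs(attrs, spatial=False):
    """Return the mandatory attribute names that are not keys of `attrs`."""
    required = MANDATORY_ATTRS + (SPATIAL_ATTRS if spatial else ())
    return [name for name in required if name not in attrs]


example = {
    "Species": "Human",
    "Classes": "cell_type",
    "reductions": '{"umap": ["X", "Y"]}',
}
print(missing_global_attrs(example))
# ['LOOM_SPEC_VERSION', 'most_variable_genes', 'relevant_genes']
```

Running the check with `spatial=True` additionally reports `spatial_img_url` and `spot_diameter_fullres` when they are absent.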
A global attribute value can be set easily
df.attrs['key'] = 'value'
We recommend creating a dictionary that will be used with loompy.create()
global_attrs = {
    'Species' : 'Human',
    'LOOM_SPEC_VERSION' : loompy.__version__,
    'Classes' : 'cell_type,cluster,sample',
    # reductions is stored as a JSON string, not as a Python dictionary
    'reductions' : '{"umap": ["X", "Y"], "tsne": ["X2", "Y2"], "pca": ["XPCA", "YPCA"]}',
    'most_variable_genes' : 'CD34,LHX2,PTPRC,PAX8,PAX2',
    'relevant_genes' : 'SOX9,CD34,LHX2,NANOG'
}
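Since reductions is described as JSON, and the full spatial example on this page passes it as a string, one option is to build it as a Python dictionary and serialize it with the standard `json` module. This is a sketch under that assumption; the reduction names and labels are illustrative.

```python
import json

# Reduction name -> [X label, Y label]; each label is expected to name
# a column attribute holding the corresponding coordinates.
reductions = {"umap": ["X", "Y"], "tsne": ["X2", "Y2"]}

# Serialize to a JSON string suitable for a global attribute value
reductions_json = json.dumps(reductions)
print(reductions_json)
# {"umap": ["X", "Y"], "tsne": ["X2", "Y2"]}

# The string can be parsed back into the same dictionary
assert json.loads(reductions_json) == reductions
```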
Column attributes store metadata for cells. Each column attribute must be an array with a size matching the number of cells in the dataset. Our requirements are:
- Sample: An array of cell IDs (barcodes)
- Two column attributes for X and Y coordinates from a dimensionality reduction for visualization purposes, matching the reductions global attribute
Users can add any metadata array that they want. A column attribute value can be set easily
df.ca['key'] = my_array
We recommend creating a dictionary that will be used with loompy.create(). All values must be arrays of matching size
col_attrs = {
    'Sample' : barcodes,
    'X' : X,
    'Y' : Y,
    'cell_types' : celltypes
}
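The column attribute requirements can be verified with a few lines of plain Python. The helper below is a hypothetical sketch (`check_col_attrs` is not a loompy function); it assumes the reductions value is available as a JSON string.

```python
import json


def check_col_attrs(col_attrs, reductions_json, n_cells):
    """Return a list of problems found in `col_attrs` (empty if none)."""
    problems = []
    if "Sample" not in col_attrs:
        problems.append("missing 'Sample'")
    # Every column attribute must hold one value per cell
    for name, values in col_attrs.items():
        if len(values) != n_cells:
            problems.append(f"'{name}' has {len(values)} values, expected {n_cells}")
    # Every coordinate label named in reductions must be a column attribute
    for reduction, labels in json.loads(reductions_json).items():
        for label in labels:
            if label not in col_attrs:
                problems.append(f"reduction '{reduction}' refers to missing '{label}'")
    return problems


col_attrs = {
    "Sample": ["AAAC", "AAAG", "AACT"],
    "X": [0.1, 0.5, 0.9],
    "Y": [1.0, 2.0],  # deliberately one value short
}
print(check_col_attrs(col_attrs, '{"umap": ["X", "Y"]}', n_cells=3))
# ["'Y' has 2 values, expected 3"]
```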
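Row attributes store metadata for genes, in the same way that column attributes store metadata for cells. Our requirement is:
- Symbol: Array of gene symbols, with a size matching the number of genes in the dataset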
Users can add any metadata array that they want. A row attribute value can be set easily
df.ra['key'] = my_array
We recommend creating a dictionary that will be used with loompy.create(). All values must be arrays of matching size
row_attrs = {
    'Symbol' : symbols,
    'Synonyms' : synonyms,
    'Strand' : strand
}
An expression matrix is required to create a Loom file. It enables gene querying in the viewer
With the attribute dictionaries prepared individually, creating the Loom file takes a single call
# M: expression matrix (rows are genes, columns are cells)
loompy.create('my_dataset.loom', M, row_attrs=row_attrs, col_attrs=col_attrs, file_attrs=global_attrs)
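Loom stores genes as rows and cells as columns, so every row attribute needs one value per matrix row and every column attribute one value per matrix column. The sketch below checks this on toy data before the creation call; `shapes_match` is a hypothetical helper and all values are made up.

```python
import numpy as np

n_genes, n_cells = 4, 3
# Toy expression matrix: rows are genes, columns are cells
M = np.arange(n_genes * n_cells, dtype=float).reshape(n_genes, n_cells)

row_attrs = {"Symbol": np.array(["SOX9", "CD34", "LHX2", "NANOG"])}
col_attrs = {
    "Sample": np.array(["cell_1", "cell_2", "cell_3"]),
    "X": np.zeros(n_cells),
    "Y": np.zeros(n_cells),
}


def shapes_match(matrix, row_attrs, col_attrs):
    """True if all attribute arrays agree with the matrix dimensions."""
    rows_ok = all(len(v) == matrix.shape[0] for v in row_attrs.values())
    cols_ok = all(len(v) == matrix.shape[1] for v in col_attrs.values())
    return rows_ok and cols_ok


print(shapes_match(M, row_attrs, col_attrs))    # True
print(shapes_match(M.T, row_attrs, col_attrs))  # False: matrix is transposed
```

A failed check usually means the matrix is transposed (cells by genes) or that an attribute was computed on an unfiltered set of cells.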
Key elements for Spatial Transcriptomics datasets:
- The global attribute reductions must contain a reduction called spatial
- When calculating spatial X and Y coordinates, make sure to use the scaling factor matching the image resolution
- Include spot_diameter_fullres as a global attribute. The value is found in the scalefactors_json.json file
- Additional reductions (UMAP, TSNE) and metadata can be added
- A low-resolution image must be stored under /groups/irset/archives/web/genoViewer/media/datasets/spatial/<my_dataset>/tissue_lowres_image.png
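The coordinate scaling mentioned in the list above amounts to multiplying full-resolution pixel positions by the factor of the image that is actually stored. A minimal sketch follows; the numeric values are made up for illustration, while the key names are those found in a 10x scalefactors_json.json file.

```python
import numpy as np

# Made-up values; real ones come from spatial/scalefactors_json.json
scale_factors = {
    "tissue_hires_scalef": 0.25,
    "tissue_lowres_scalef": 0.0625,
    "spot_diameter_fullres": 96.0,
}

# Full-resolution pixel coordinates of three spots (also made up)
xcoord_fullres = np.array([1600, 3200, 4800])
ycoord_fullres = np.array([800, 2400, 6400])

# The stored image is the low-resolution one, so use the matching factor
factor = scale_factors["tissue_lowres_scalef"]
x = xcoord_fullres * factor
y = ycoord_fullres * factor
print(x, y)
```

Using `tissue_hires_scalef` with a low-resolution image (or the reverse) would place the spots off the tissue, which is why the factor must match the image.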
import os
import tables
import collections
import json
import loompy
import numpy as np
import pandas as pd
import scipy.sparse as sp_sparse
# H5 matrix
CountMatrix = collections.namedtuple('CountMatrix', ['feature_ref', 'barcodes', 'matrix'])
def get_matrix_from_h5(filename):
    # Read a 10x feature-barcode matrix (HDF5) into a sparse matrix with its annotations
    with tables.open_file(filename, 'r') as f:
        mat_group = f.get_node(f.root, 'matrix')
        barcodes = f.get_node(mat_group, 'barcodes').read()
        barcodes = np.array([x.decode('utf-8') for x in barcodes])
        data = getattr(mat_group, 'data').read()
        indices = getattr(mat_group, 'indices').read()
        indptr = getattr(mat_group, 'indptr').read()
        shape = getattr(mat_group, 'shape').read()
        matrix = sp_sparse.csc_matrix((data, indices, indptr), shape=shape)
        feature_ref = {}
        feature_group = f.get_node(mat_group, 'features')
        feature_ids = getattr(feature_group, 'id').read()
        feature_names = getattr(feature_group, 'name').read()
        feature_names = np.array([x.decode('utf-8') for x in feature_names])
        feature_types = getattr(feature_group, 'feature_type').read()
        feature_ref['id'] = feature_ids
        feature_ref['name'] = feature_names
        feature_ref['feature_type'] = feature_types
        tag_keys = getattr(feature_group, '_all_tag_keys').read()
        for key in tag_keys:
            key = key.decode("utf-8")
            feature_ref[key] = getattr(feature_group, key).read()
        return CountMatrix(feature_ref, barcodes, matrix)
#-----------------------------------------------------------------------------------
# Load matrix
#-----------------------------------------------------------------------------------
base_path = "test_data_spatial/HUGODECA/10x_Visium_data/V11A27-300_D1"
filtered_matrix_h5 = os.path.join(base_path, 'filtered_feature_bc_matrix.h5')
filtered_feature_bc_matrix = get_matrix_from_h5(filtered_matrix_h5)
#-----------------------------------------------------------------------------------
# Load scale factors & positions
#-----------------------------------------------------------------------------------
scalefactors_file = os.path.join(base_path, 'spatial/scalefactors_json.json')
with open(scalefactors_file, 'r') as f:
    scale_factors = json.load(f)
barcodes_coord_file = os.path.join(base_path, 'spatial/tissue_positions_list.csv')
spots = pd.read_csv(barcodes_coord_file, names=['barcode','tissue','ygrid','xgrid','ycoord','xcoord'])
#-----------------------------------------------------------------------------------
# Spot selection & scaling
#-----------------------------------------------------------------------------------
spot_sel = spots[spots.tissue==1] # spot selection
spot_sel = spot_sel.set_index('barcode') # set index in coord dataframe
spot_sel_sorted = spot_sel.loc[filtered_feature_bc_matrix.barcodes] # reorder rows based on barcodes order
xcoord = spot_sel_sorted.xcoord.values*scale_factors['tissue_lowres_scalef']
ycoord = spot_sel_sorted.ycoord.values*scale_factors['tissue_lowres_scalef']
#-----------------------------------------------------------------------------------
# Loom attributes
#-----------------------------------------------------------------------------------
species = 'Human'
reductions = '{"spatial" : ["X","Y"]}'
classes = ''
col_attrs = dict() # empty dictionary to store column attributes
col_attrs['Sample'] = filtered_feature_bc_matrix.barcodes
col_attrs['X'] = xcoord
col_attrs['Y'] = ycoord
row_attrs = dict() # empty dictionary to store row attributes
row_attrs['Symbol'] = filtered_feature_bc_matrix.feature_ref['name'] # features
#-----------------------------------------------------------------------------------
# Variable and relevant genes
#-----------------------------------------------------------------------------------
n = 10 # number of variable genes desired
v = np.var(filtered_feature_bc_matrix.matrix.toarray(), axis=1) # compute per-gene variance (np.var needs a dense array)
idx = np.argsort(v)[::-1][:n] # sort in descending order and keep the top n
most_variable_genes = row_attrs['Symbol'][idx] # trim symbol array
most_variable_genes = np.delete(most_variable_genes, np.where(most_variable_genes == 'nan')) # delete potential nan values
most_variable_genes = ','.join(most_variable_genes) # comma-separated format
relevant_genes = 'SOX9,CD34,LHX2,NANOG' # hand-picked genes
#-----------------------------------------------------------------------------------
# Create Loom
#-----------------------------------------------------------------------------------
global_attrs = {
    'Species' : species,
    'LOOM_SPEC_VERSION' : loompy.__version__,
    'Classes' : classes,
    'reductions' : reductions,
    'spot_diameter_fullres' : scale_factors['spot_diameter_fullres'],
    'spatial_img_url' : 'datasets/spatial/<my_dataset>/tissue_lowres_image.png',
    'most_variable_genes' : most_variable_genes,
    'relevant_genes' : relevant_genes
}
output_file = f'test_data_spatial/HUGODECA/looms/{base_path.split("/")[-1]}.loom'
loompy.create(output_file, filtered_feature_bc_matrix.matrix, row_attrs=row_attrs, col_attrs=col_attrs, file_attrs=global_attrs) # create loom file
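The variable-gene selection step of the script can be tried in isolation on a toy matrix. The sketch below uses a small dense NumPy array and made-up symbols (including a literal 'nan' entry) to show how the top-n genes are ranked by variance and turned into the comma-separated string expected by most_variable_genes.

```python
import numpy as np

# Toy data: 5 genes (rows) x 4 cells (columns); symbols are illustrative
symbols = np.array(["SOX9", "CD34", "nan", "LHX2", "NANOG"])
M = np.array([
    [1, 1, 1, 1],    # variance 0
    [0, 10, 0, 10],  # variance 25
    [0, 20, 0, 20],  # variance 100
    [0, 2, 0, 2],    # variance 1
    [5, 5, 6, 5],    # variance 0.1875
], dtype=float)

n = 3                               # number of variable genes desired
v = M.var(axis=1)                   # per-gene variance across cells
idx = np.argsort(v)[::-1][:n]       # indices of the n largest variances
top = symbols[idx]
top = top[top != "nan"]             # drop genes without a usable symbol
most_variable_genes = ",".join(top)
print(most_variable_genes)
# CD34,LHX2
```

As in the script, filtering happens after truncation, so fewer than n genes may remain when some of the top-ranked symbols are 'nan'.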