This tutorial will guide you through the steps to run the Lib Prep Visualization pipeline locally using a Conda environment. Follow the instructions below to set up your environment, install dependencies, and execute the pipeline.
Clone the repository to your local machine and navigate to the project directory:
git clone https://github.com/UCSC-Treehouse/lib-prep-visualization.git
cd lib-prep-visualization
Ensure you have conda installed: https://www.anaconda.com/download
Create and activate a Conda environment with the required dependencies and the lib-prep-tools package installed:
conda env create -f environment.yaml
conda activate lib-prep-visualization
This tutorial will use a pilot dataset distributed with the repository. However, if you wish to download the full UCSC Treehouse tumor compendia, follow the instructions in this step.
Run the following command to download the two UCSC Treehouse tumor compendia (PolyA and RiboD):
python scripts/download_data.py --config configs/download_data/polyA_vs_riboD_v25.01.json
The downloaded compendia files will be saved in the data/ directory.
Run the following command to merge all of the downloaded compendia and run the UMAP algorithm:
python scripts/process_data.py --config configs/process_data/pilot_data/pilot_data_process.json --data-dir pilot_data
The processed data with UMAP results will be saved in the processed/pilot_data/ directory as merged_compendia.hd5ad.
Run the following command to generate a visualization colored by library preparation method:
python scripts/plot_data.py --config configs/plot_data/pilot_data/pilot_compendia_type.json
The output visualization will be saved in the figures/pilot_data directory as Pilot_Data_UMAP_Labeled_by_Library_Prep_Method.png.
Next, plot a figure colored by disease type:
python scripts/plot_data.py --config configs/plot_data/pilot_data/pilot_disease.json
The output visualization will be saved in the figures/pilot_data directory as Pilot_Data_UMAP_Labeled_by_Disease.png.
Pilot_Data_UMAP_Labeled_by_Disease.png uses the same legend colors as Pilot_Data_UMAP_Labeled_by_Library_Prep_Method.png for different values, which can be confusing.
To fix this, we can plot the disease type figure again using a custom color map.
python scripts/plot_data.py --config configs/plot_data/pilot_data/pilot_disease_cusom_colors.json
The output visualization will be saved in the figures/pilot_data directory as Pilot_Data_UMAP_Custom_Colormap_by_Disease.png.
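To illustrate the idea behind a custom color map, the sketch below assigns disease categories colors that do not overlap with those already used for the library prep legend. It is only an illustration in plain Python: the palette, the function name build_color_map, and the category labels are assumptions, and the format that the pipeline's configuration file actually expects is not shown here.

```python
from itertools import cycle

# Hypothetical palette of hex colors (an assumption, not taken from the repository).
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b"]


def build_color_map(categories, reserved=()):
    """Map each category to a palette color, skipping colors in `reserved`."""
    reserved = set(reserved)
    available = [color for color in PALETTE if color not in reserved]
    if not available:
        raise ValueError("no colors left after removing reserved ones")
    # Sort for a deterministic assignment; cycle if categories outnumber colors.
    return dict(zip(sorted(set(categories)), cycle(available)))


# Colors for the library prep legend (illustrative labels).
prep_colors = build_color_map(["PolyA", "RiboD"])
# Disease colors avoid anything already used by the library prep legend.
disease_colors = build_color_map(
    ["glioma", "synovial sarcoma"], reserved=prep_colors.values()
)
```

Reserving the colors of one legend before building the next keeps the two figures visually distinct when they are viewed side by side.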
You can hand-select the samples to include in the UMAP analysis when running the process_data.py script. The following example selects 10 glioma and 10 synovial sarcoma samples from the two input compendia. The list of sample IDs is provided to the process_data.py script through a .tsv file referenced in the configuration file.
python scripts/process_data.py --config configs/process_data/pilot_data/10_glioma_10_synovial_sarcoma.json --data-dir pilot_data
The output processed data with UMAP results for the selected samples will be saved in the processed/pilot_data_glioma_ss directory as merged_compendia.hd5ad.
To visualize the UMAP results for this subset of samples, run the following commands:
python scripts/plot_data.py --config configs/plot_data/pilot_data_glioma_ss/compendia_type.json
python scripts/plot_data.py --config configs/plot_data/pilot_data_glioma_ss/disease.json
The output visualizations will be saved in the figures/pilot_data_glioma_ss directory.
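A hand-selected sample list like the one used in this section could be assembled with a short script. The sketch below picks the first N sample IDs per disease from an in-memory metadata table and writes them to a tab-separated file. The field names sample_id and disease, the one-ID-per-line layout, the output file name, and the sample IDs themselves are all assumptions made for illustration; check the .tsv file shipped with the repository for the layout that process_data.py actually expects.

```python
import csv
import tempfile
from pathlib import Path


def select_samples(metadata, diseases, n_per_disease):
    """Return the first `n_per_disease` sample IDs for each requested disease."""
    selected = []
    for disease in diseases:
        ids = [row["sample_id"] for row in metadata if row["disease"] == disease]
        selected.extend(ids[:n_per_disease])
    return selected


def write_sample_list(sample_ids, path):
    """Write one sample ID per line to a tab-separated file (assumed layout)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for sample_id in sample_ids:
            writer.writerow([sample_id])


# Made-up metadata for illustration; real IDs come from the compendia.
metadata = [
    {"sample_id": f"S{i:03d}", "disease": "glioma" if i % 2 else "synovial sarcoma"}
    for i in range(40)
]
chosen = select_samples(metadata, ["glioma", "synovial sarcoma"], 10)
out_path = Path(tempfile.gettempdir()) / "example_sample_list.tsv"
write_sample_list(chosen, out_path)
```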
Sometimes one compendium may have significantly more samples than another. In such cases, it is desirable to match the sample counts coming from each compendium based on a specific metadata column. For example, if one compendium has more samples per disease type than another, we may want to subsample the larger compendium to match the counts of each disease type in the smaller one.
The matched_subsampling.py script performs count-matched subsampling based on a specified metadata column across multiple compendia and returns a list of sample IDs to include in the analysis. The output sample list can then be provided to the process_data.py script to generate UMAP results for the count-matched subsampled data.
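To make the idea of count-matched subsampling concrete, here is a minimal standard-library sketch. It is not the implementation in matched_subsampling.py: the function name, the input layout, the fixed random seed, and the choice to drop categories that are missing from any compendium are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def matched_subsample(compendia, seed=0):
    """Subsample so every compendium contributes equal counts per category.

    `compendia` maps a compendium name to a dict of {sample_id: category}.
    For each category, the target count is the smallest count found in any
    compendium; categories missing from a compendium are dropped (assumption).
    """
    rng = random.Random(seed)  # fixed seed for reproducible draws

    # Group sample IDs by category within each compendium.
    grouped = {}
    for name, samples in compendia.items():
        by_category = defaultdict(list)
        for sample_id, category in samples.items():
            by_category[category].append(sample_id)
        grouped[name] = by_category

    # Only categories present in every compendium can be matched.
    shared = set.intersection(*(set(g) for g in grouped.values()))

    selected = []
    for category in sorted(shared):
        target = min(len(g[category]) for g in grouped.values())
        for name in sorted(grouped):
            selected.extend(rng.sample(sorted(grouped[name][category]), target))
    return selected


# Made-up sample IDs and labels for illustration.
polyA = {"A1": "glioma", "A2": "glioma", "A3": "glioma", "A4": "sarcoma"}
riboD = {"B1": "glioma", "B2": "sarcoma", "B3": "sarcoma"}
picked = matched_subsample({"polyA": polyA, "riboD": riboD})
```

In this toy input, each compendium ends up contributing one glioma and one sarcoma sample, because one is the smaller of the two counts for each disease.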
The following example performs count-matched subsampling based on the disease metadata column when merging two input compendia.
python scripts/matched_subsampling.py --config configs/matched_subsampling/pilot_data/disease.json --data-dir pilot_data
The output sample list will be saved in the matched_subsamples/pilot_data_disease_matched directory as subset_samples.tsv.
Now we can run the process_data.py script using the output sample list to generate UMAP results for the count-matched subsampled data. Remember that the sample list is referenced in the configuration file.
python scripts/process_data.py --config configs/process_data/pilot_data/disease_matched.json --data-dir pilot_data
The output processed data with UMAP results for the count-matched subsampled disease data will be saved in the processed/pilot_data_disease_matched directory as merged_compendia.hd5ad.
To visualize the UMAP results for this count-matched subsampled data, run the following commands:
python scripts/plot_data.py --config configs/plot_data/pilot_data_disease_matched/compendia_type.json
python scripts/plot_data.py --config configs/plot_data/pilot_data_disease_matched/disease.json
View the output visualizations in the figures/pilot_data_disease_matched directory.
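If you rerun the pilot workflow often, the process and plot commands from this tutorial can be chained in a small driver script. The sketch below is one possible way to do that; the helper run_steps and its dry_run parameter are hypothetical and not part of the repository. The command strings are copied from the pilot-data steps above and assume you run the script from the project root with the Conda environment activated.

```python
import shlex
import subprocess

# Commands copied from the pilot-data steps of this tutorial, in order.
PILOT_STEPS = [
    "python scripts/process_data.py --config configs/process_data/pilot_data/pilot_data_process.json --data-dir pilot_data",
    "python scripts/plot_data.py --config configs/plot_data/pilot_data/pilot_compendia_type.json",
    "python scripts/plot_data.py --config configs/plot_data/pilot_data/pilot_disease.json",
]


def run_steps(steps, dry_run=True):
    """Parse each command and, unless dry_run is set, run them in order.

    With dry_run=True the commands are only parsed and returned, not executed.
    """
    parsed = [shlex.split(step) for step in steps]
    if not dry_run:
        for argv in parsed:
            # check=True raises CalledProcessError and stops at the first failure.
            subprocess.run(argv, check=True)
    return parsed


planned = run_steps(PILOT_STEPS)  # dry run: nothing is executed
```

Calling run_steps(PILOT_STEPS, dry_run=False) would execute the commands one after another and stop as soon as one of them exits with a non-zero status.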