Cancer-related exercises for the DTU course 22126
Adapted from original exercise by Marcin Krzystanek and Aron Eklund.
- Changes include: Some more tools, GATK calls updated to match vs. 4, new resources, comments and questions
These exercises will guide you through all steps starting from raw data (FASTQ files) and resulting in a list of somatic point mutations.
Estimated time: 2 hours
These exercises are tested with:
- Picard v. 2.21.6 (http://broadinstitute.github.io/picard)
- GATK v. 4.1.4.1 (https://github.com/broadinstitute/gatk)
- BWA v. 0.7.12 (http://bio-bwa.sourceforge.net/)
- Samtools v. 1.3.1 (http://samtools.sourceforge.net/)
- R v. 3.5.2 (https://www.r-project.org/)
- TrimGalore v. 0.6.5 (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)
- fastqCombinePairedEnd.py (by Eric Normandeau)
Known resources are important when working with human data and cancer seq. Luckily, the human is probably one of the most well-annotated species. It is generally a good idea to try and use the most up-to-date version of all resources as possible, although a common caveat is that resources have to match and some might not yet be available for the newest versions.
All resources are based on the newest genome build, hg38. The previous and still extensively used build is hg37 (also known as hg19). Matching other resources with the genome build is important as genomic coordinates (location of chromosome e.g. chr1:1001) differ between genome builds, meaning that, for instance, a reported SNP has different coordinates depending on the build. You can read about this here.
Important points are also that naming needs to match: mentionable naming conventions are defined by ENSEMBL, NCBI (RefSeq) and UCSC. In this exercise, the used resources are based on UCSC naming conventions: chromosomes are named chr1, chr2, ..., chrX, chrY and chrM (in "contrast" to ENSEMBL: 1, 2, X, Y and MT, and the less human-readable RefSeq: NC_000001.11, NC_000002.12, NC_000023.11, NC_000024.10 and NC_012920.1). Alternative scaffolds are named chr1_KI270765v1_alt.
The resources include:
- GRCh38 (hg38) with UCSC
- dbSNP build 138
- COSMIC with 30,178,158 known cancer variants
- Mills and 1000G genomes Gold Standard indels (known indels)
- gnomAD for allelic frequencies in population
All resources are in /home/27626/exercises/cancer_seq/resources
. The genome has been bwa-indexed. Consider taking a look at the hg38.fa file. You can look at the header encoding with
grep '^>' hg38.fa | head -15 # show first 15
Besides containing the chromosomes, the genome also contains decoy sequences and HLA-alleles.
You will analyze whole-exome sequencing data from a mix of tumor from infiltrating duct adenocarcinoma and head of pancreas, and a matched normal tissue. This is known as somatic variant calling or as paired variant calling because we have a tumor-normal pair, and germline variation will be filtered from our final set of variants. Thus we will only find mutations that are specific for the tumor and potential driver mutations.
The data used in this exercise has been released for scientific and educational use by the Texas Cancer Research Biobank and is fully described in this paper. You can read about the sequencing protocol here.
Please note the Conditions of Data Use:
By downloading or utilizing any part of this dataset, end users must agree to the following conditions of use:
- No attempt to identify any specific individual represented by these data or any derivatives of these data will be made.
- No attempt will be made to compare and/or link this public data set or derivatives in part or in whole to private health information.
- These data in part or in whole may be freely downloaded, used in analyses and repackaged in databases.
- Redistribution of any part of these data or any material derived from the data will include a copy of this notice.
- The data are intended for use as learning and/or research tools only.
- This data set is not intended for direct profit of anyone who receives it and may not be resold.
- Users are free to use the data in scientific publications if the providers of the data (Texas Cancer Research Biobank and Baylor College of Medicine Human Genome Sequencing Center) are properly acknowledged.
The raw data files are located on the server at /home/27626/exercises/cancer_seq
IMPORTANT IMPORTANT IMPORTANT - Since the full procedure takes a long time,
we will not ask you to perform the full alignment and full mutation calling, and assume that you have become familiar with this in previous exercises.
However, for reference, we provide the code needed for the full analysis.
Thus, you can use this code later in the course project or in your own work,
should you work with cancer patient sequencing data.
The parts where you should actually run the code are marked with a green square:
Much of this exercise is based on the Best Practices by Broad Institute by MIT and Harvard and located in Cambridge, Massachusetts. The Broad provides the Genome Analysis Toolkit (gatk) which you will probably become very familiar with if you continue with genomics and NGS in the future. Best practices for somatic variant calling can be found here.
First, let's define bash variables for keeping our workspace clean.
# Input files
f1n=/home/27626/exercises/cancer_seq/raw_data/TCRBOA2-N-WEX.read1.fastq.gz
f2n=/home/27626/exercises/cancer_seq/raw_data/TCRBOA2-N-WEX.read2.fastq.gz
f1t=/home/27626/exercises/cancer_seq/raw_data/TCRBOA2-T-WEX.read1.fastq.gz
f2t=/home/27626/exercises/cancer_seq/raw_data/TCRBOA2-T-WEX.read2.fastq.gz
# Resources
hg38=/home/27626/exercises/cancer_seq/resources/hg38.fa
mills=/home/27626/exercises/cancer_seq/resources/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
dbsnp=/home/27626/exercises/cancer_seq/resources/Homo_sapiens_assembly38.dbsnp138.vcf
cosmic=/home/27626/exercises/cancer_seq/resources/CosmicCodingMuts_chr_sorted.vcf
gnomad=/home/27626/exercises/cancer_seq/resources/af-only-gnomad.hg38.vcf.gz
# Tools
PICARD=/home/27626/bin/picard.jar
snpSift=/home/27626/bin/snpEff/SnpSift.jar
# rest are already in your path
Make a folder for the exercise:
mkdir cancer_seq
cd cancer_seq
outdir=`pwd`
Look at the data available. You do not need to copy it to your own folder because the files are quite large.
ls -l /home/27626/exercises/cancer_seq
zcat /home/27626/exercises/cancer_seq/TCRBOA2-N-WEX.read1.fastq.gz | head
- How long are the reads? Is your data single or paired end?
- What type would you prefer for cancer DNA sequencing, and why?
- What is the difference between whole-exome sequencing and RNA-seq?
We do this using Trim Galore!. Trim Galore is actually a wrapper using other trimmers. It can automatically detect adapters for removal, and envoke FastQC simultaneously.
### Trim reads with trim_galore wrapper, produce both fastqc and trimming reports
trim_galore --fastqc --fastqc_args "--outdir ${outdir}/trimmed_normal" --quality 20 --trim-n --length 50\
--trim1 --output_dir $outdir --paired $f1n $f2n
trim_galore --fastqc --fastqc_args "--outdir ${outdir}/trimmed_tumor" --quality 20 --trim-n --length 50\
--trim1 --output_dir $outdir --paired $f1t $f2t
-
What does the argument
--quality 20
mean? Get help by running:trim_galore --help
Trim Galore does not make sure that all reads are paired which is a requirement for bwa. This is handled with fastqCombinePairedEnd.py.
python fastqCombinePairedEnd.py $outdir/trimmed_normal/TCRBOA2-N-WEX.read1_trimmed.fq.gz $outdir/trimmed_normal/TCRBOA2-N-WEX.read2_trimmed.fq.gz
python fastqCombinePairedEnd.py $outdir/trimmed_normal/TCRBOA2-N-WEX.read1_trimmed.fq.gz $outdir/trimmed_normal/TCRBOA2-N-WEX.read2_trimmed.fq.gz
Set up new variables for the newly created files.
f1n_val=$outdir/trimmed_normal/TCRBOA2-N-WEX.read1_trimmed.fq.gz_pairs_R1.fastq.gz
f2n_val=$outdir/trimmed_normal/TCRBOA2-N-WEX.read2_trimmed.fq.gz_pairs_R2.fastq.gz
f1t_val=$outdir/trimmed_tumor/TCRBOA2-T-WEX.read1_trimmed.fq.gz_pairs_R1.fastq.gz
f2t_val=$outdir/trimmed_tumor/TCRBOA2-T-WEX.read2_trimmed.fq.gz_pairs_R2.fastq.gz
~4.5 hours with 4 processors, ~37 minutes with 14 processors
We use bwa mem for aligning reads to the genome. We align the tumor sample and normal sample separately.
Importantly, a Read Group ID line (@RG line) must be defined by the user, because Mutect2 and other programs in the pipeline below depend on information in this line. Here we demonstrate one way of adding the @RG line to the resulting BAM file:
### @RG ID # read group ID, needs to be unique for fastq file due to downstream processing, takes\
preference when used by some programs
### @RG SM # sample ID, unique for each tumor and normal sample, not to be confused with patient ID
### @RG PL # platform name
### @RG LB # library name
### @RG PU # Platform unit, needs to be unique for fastq file due to downstream processing, takes\
preference when used by some programs
### Let's create an @RG line that we will use when running bwa mem alignment
ReadGoupID_N="\"@RG\tID:TCRBOA2-N-WEX\tSM:TCRBOA2-N-WEX\tPL:ILLUMINA\tLB:libN\tPU:TCRBOA2-N-WEX"\"
ReadGoupID_T="\"@RG\tID:TCRBOA2-T-WEX\tSM:TCRBOA2-T-WEX\tPL:ILLUMINA\tLB:libT\tPU:TCRBOA2-T-WEX"\"
# make a directory for alignment and enter
mkdir aligned
cd aligned
### Run bwa mem
bwa mem -M -t 4 -R $ReadGoupID_N $hg38 $f1n_val $f2n_val \
| samtools view -Sb -@ 1 - | samtools sort -@ 3 > TCRBOA2-N-WEX.bam
bwa mem -M -t 4 -R $ReadGoupID_T $hg38 $f2t_val $f2t_val \
| samtools view -Sb -@ 1 - | samtools sort -@ 3 > TCRBOA2-T-WEX.bam
Optionally, the @RG line can provide additional information; please see the SAM format specification as well as samtools webpage if you want to know more.
The command after the pipe is to compress the output of bwa (sam-format) to binary (bam-format).
We also sorting the bam-file to avoid unnecessary intermediate files, keeping in mind a bam-file might take up a lot of space.
This can also be done separately with the following command:
Note that it is also possible to pipe the output of bwa directly to samtools
to sort (without writing an unsorted bam-file).
samtools sort -@ 3 TCRBOA2-N-WEX_unsorted.bam -o TCRBOA2-N-WEX.bam
samtools sort -@ 3 TCRBOA2-T-WEX_unsorted.bam -o TCRBOA2-N-WEX_unsorted.bam
~40/~58 min per file
We use Picard to mark PCR duplicates so that they will not introduce false positives and bias in the subsequent analysis.
mkdir tmp
java -Xmx5G -Xms1024M -XX:+UseParallelGC -XX:ParallelGCThreads=6 -jar $PICARD MarkDuplicates\
INPUT=TCRBOA2-N-WEX.bam OUTPUT=TCRBOA2-N-WEX_deduped.bam METRICS_FILE=TCRBOA2-N-WEX.metrics.txt \
TMP_DIR=./tmp
java -Xmx5G -Xms1024M -XX:+UseParallelGC -XX:ParallelGCThreads=6 -jar $PICARD MarkDuplicates\
INPUT=TCRBOA2-T-WEX.bam OUTPUT=TCRBOA2-T-WEX_deduped.bam METRICS_FILE=TCRBOA2-T-WEX.metrics.txt \
TMP_DIR=./tmp
~3 min per file
samtools index TCRBOA2-T-WEX_deduped.bam
samtools index TCRBOA2-N-WEX_deduped.bam
56/53 min per file
We use GATK to recalibrate base quality scores.
Each base in each sequencing read comes out of the sequencer with an individual quality score. Depending on the machine used for sequencing, these scores are subject to various sources of systematic technical error. Base quality score recalibration (BQSR) works by applying machine learning to model these errors empirically and adjust the quality scores accordingly. Read more about it here
Link to tool documentation. --known-sites specifies one or more databases of known polymorphic sites used to exclude regions around known polymorphisms from analysis.
There is more information on BSQR here.
gatk BaseRecalibrator \
-I TCRBOA2-N-WEX_deduped.bam \
-R $hg38 \
--known-sites $dbsnp \
--known-sites $mills \
-O normal.recal.table
gatk BaseRecalibrator \
-I TCRBOA2-T-WEX_deduped.bam \
-R $hg38 \
--known-sites $dbsnp \
--known-sites $mills \
-O tumor.recal.table
~34 and ~32 min per file
gatk ApplyBQSR \
-I TCRBOA2-N-WEX_deduped.bam \
-R $hg38 \
--bqsr-recal-file normal.recal.table \
-O TCRBOA2-N-WEX_recaled.bam
gatk ApplyBQSR \
-I TCRBOA2-T-WEX_deduped.bam \
-R $hg38 \
--bqsr-recal-file tumor.recal.table \
-O TCRBOA2-T-WEX_recaled.bam
Now, the resulting BAM files are ready to be processed with MuTect2.
All chromosomes took ~327 minutes, (~5.4 hours with four threads)
We are going to use "soft-links" to the input bam-files and their indexes. This a way to avoid working with long paths and is an alternative to defining bash-variables.
mkdir variant_calling
cd variant_calling
# create links
ln -s /home/27626/exercises/cancer_seq/aligned/TCRBOA2-T-WEX_recaled.bam* .
ln -s /home/27626/exercises/cancer_seq/aligned/TCRBOA2-N-WEX_recaled.bam* .
Use ls -l
to see how they appear in your directory.
We use MuTect2, a somatic mutation caller that identifies both SNV and indels. The produced VCF-file (variant calling format) is the standard format for storing variants, although the output of Mutect2 has some information specific for somatic variants. See here for specs: https://samtools.github.io/hts-specs/VCFv4.1.pdf
A big difference in cancer-seq variant calling using Mutect2 is that there are no ploidy assumptions. This accommodates tumor data which can have many copy number variants (CNVs).
Mutect2 is computationally intensive so we recommend to parallelize if possible. One way to achieve this is to split processes by chromosomes (calling variants for each chromosome and then merging vcf-files.)
Since we do not have the time and capacity to process the entire genome during our
exercises, we will call somatic mutations on a small part of chromosome 10,
from the 3,100,000th to the 5,100,000th base pair, which is set with the -L
option.
Now we are ready for finding variants!
# Set chromosome and location:
CHR_LOC=chr10:3100000-5100000
# Run Mutect2
gatk Mutect2 \
-R $hg38 \
-I TCRBOA2-N-WEX_recaled.bam \
-I TCRBOA2-T-WEX_recaled.bam \
-normal TCRBOA2-N-WEX \
-L ${CHR_LOC} \
--germline-resource $gnomad \
-O TCRBOA2_${CHR_LOC}.vcf
### To process the whole genome, simply omit the -L option.
Take a look at the resulting VCF file with less -RS TCRBOA2_${CHR_LOC}.vcf
Then try to count the number of raw variants:
grep -v "^#" TCRBOA2_${CHR_LOC}.vcf | wc -l
- How many variants did you find?
Just for comparison, we try to call all variants in the interval for the germline and the tumor sample.
gatk HaplotypeCaller -I TCRBOA2-T-WEX_recaled.bam -R $hg38 -L ${CHR_LOC} -O TCRBOA2-T.vcf --dbsnp $dbsnp
gatk HaplotypeCaller -I TCRBOA2-N-WEX_recaled.bam -R $hg38 -L ${CHR_LOC} -O TCRBOA2-N.vcf --dbsnp $dbsnp
Count the variant lines in each with the same command as above:
grep -v "^#" TCRBOA2-T.vcf | wc -l
grep -v "^#" TCRBOA2-N.vcf | wc -l
- Where do you find the most raw variants and does that make biological sense? What is the difference in the two numbers and does it match above?
Before continuing, we need to filter the raw vcf-output to only get confident variants:
gatk FilterMutectCalls \
-V TCRBOA2_${CHR_LOC}.vcf \
-R $hg38 \
-O TCRBOA2_${CHR_LOC}_filtered.vcf
Try to look at the output with less -RS TCRBOA2_${CHR_LOC}_filtered.vcf
.
- What does it look like in the filter column? What kind of filters were applied?
To add some extra information to the vcf-file, we will also annotate with SNP-ids. HaplotypeCaller can do this as it calls variants, but using Mutect2 we need to do it ourselves:
dbSNP needs to look like it is in your own folder, so once again we link it to your working directory:
ln -s $dbsnp dbsnp_link.vcf
java -jar $snpSift annotate dbsnp_link.vcf TCRBOA2_${CHR_LOC}_filtered.vcf > TCRBOA2_${CHR_LOC}_filtered_anno.vcf
Note that this command may take up to 3 minutes. Now try to filter mutational calls by selecting those with Mutect "PASS" annotation.
grep PASS TCRBOA2_${CHR_LOC}_filtered_anno.vcf | grep -v "^#"
You should at least see this line (without the header). Don't forget you can scroll to the sides!
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TCRBOA2-N-WEX TCRBOA2-T-WEX
chr10 3165513 rs9423502 G C . PASS CONTQ=93;DP=168;ECNT=1;GERMQ=93;MBQ=38,39;MFRL=225,184;MMQ=60,60;MPOS=4;NALOD=1.86;NLOD=21.37;POPAF=1.32;SEQQ=19;STRANDQ=16;TLOD=6.08;CAF=[0.9454,0.05464];COMMON=1;G5;GNO;HD;KGPROD;KGPhase1;NSM;OTHERKG;PH3;REF;RS=9423502;RSPOS=3207705;S3D;SAO=0;SLO;SSR=0;VC=SNV;VLD;VP=0x050300000a01150517000101;WGT=1;dbSNPBuildID=119 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:79,0:0.014:79:41,0:38,0:19,60,0,0 0/1:79,3:0.054:82:41,2:38,1:20,59,0,3
There is a lot of information in the line so no surprise of you're feeling a bit overwhelmed. A brief explanation of each part of the line above is in the header of the VCF file
(use less -RS TCRBOA2_${CHR_LOC}_filtered.vcf
to look at it, but it is also difficult to interpret).
Let's focus on the FORMAT lines. The information kept here is organized by GT:AD:AF:DP:F1R2:F2R1:SB
and the values are kept in the two proceeding columns, also seperated by colons.
The first starting with 0/0 refers to the normal sample, whereas the column beginning with 0/1 refers to the tumor.
This means that the tumor is heterozygote (REF/ALT) for the mutation which is not seen in the germline sample at all (it is REF/REF).
After genotype (GT) we have allelic depth (AD) which is "79,3" (i.e. 79 reads and 3 reads for the reference and mutant allele respectively) in the tumor and "79,0" in the normal sample. Then comes allelic frequency, which is a fraction of the mutant allele out of all aligned bases in this position and the depth. We will skip the remaining values for F1R2, F2R1 and SB for now.
A list of chromosome coordinates is kind of hard to interpret. Here are some ways to approach the results.
- Find the RS identifier from the cancer mutation and look it up at dbSNP.
- Find the frequency table tab. Is your mutation common in some populations? What does a high frequency tell you about its role in cancer?
So far you have processed and analyzed only a small section of chromosome 10.
Now, let's analyze a bigger piece of the genome. Pick your favorite chromosome and find the corresponding VCF file on the server. For example, if you choose chromosome 7, you would use this file:
ls -l /home/27626/exercises/cancer_seq/variant_calling/split_by_chromosome/chr7.vcf
Hint: your results will be more interesting if you pick chromosome 6, 13, 15, 17, 19, 20, 22, or X!
Filter the VCF to retain only the lines marked as "PASS".
grep "PASS" /home/27626/exercises/cancer_seq/variant_calling/split_by_chromosome/chr7.vcf > filtered.chr7.vcf
-
Download the filtered VCF to your own computer using
scp [email protected]:/path/to/file .
and submit it to the VEP website using default settings. When the results become available, look in the "Somatic status" column. Are there any known cancer mutations? -
If you find a known cancer mutation, find its COSMIC identifier (COSM######, e.g. COSM4597270) in the "existing variant" column. Search for your COSMIC identifier in the COSMIC database.
- In which tissues is this mutation found?
-
Go back to the VEP output and find the columns SIFT and PolyPhen (consequence predictors). Sort your results by either of the scores and see if you find any damaging or deleterious scores.
- What is the difference between how SIFT and PolyPhen predicts consequences? Look here for a hint
- Do SIFT and PolyPhen agree?
Go to cBioPortal, a website that provides tools to analyze several large cancer sequencing datasets. In Quick Select, choose "TCGA PanCancer Atlas studies". Then press "Query by Gene" and type in the name of the gene that was hit by this mutation. Choose "mutations" as we have not looked at Copy Number Alterations. Press "Submit Query". Look at the barcharts and play around with the options.
- How often is this gene mutated in various cancer types?
Next we'll do some analysis on a VCF file containing somatic mutations found throughout the entire genome:
ls /home/27626/exercises/cancer_seq/variant_calling/TCRBOA2_filtered.vcf
Unlike VEP, TumorTracer requires VCF files to have the header information. Thus, we will filter this VCF file to retain: 1) header lines (which begin with "#"), and 2) data lines with a PASS call.
grep -E "^#|PASS" /home/27626/exercises/cancer_seq/variant_calling/TCRBOA2_filtered.vcf > TCRBOA2_filtered_pass.vcf
-
Download the vcf-file and submit it to the TumorTracer server. Make sure to specify that this VCF was generated using GRCh38 coordinates.
-
What tissue does TumorTracer predict? Is it a confident prediction?