Tandem repeat catalog from public long-read sequence assemblies

bcgsc/tr_catalog

A tandem repeat (TR) catalog generated from high-quality long-read human genome assemblies

This repository contains the analysis scripts used to generate the TR catalog from public diploid long-read human genome assemblies from the following data sources:

  1. Human Pangenome Reference Consortium (HPRC)
  2. Human Genome Structural Variation Consortium (HGSVC2)
  3. 1000G ONT Sequencing Consortium

Workflow

[Figure: workflow]

[Figure: Mapping of TRs from assemblies to the reference genome]

Catalog

v1 v2

  • the first header line, prefixed with '#', lists the haplotype names separated by semicolons
  • column descriptions:

Column           Description
---------------  ------------------------------------------------------------
chrom            chromosome
start            start coordinate
end              end coordinate
motif            consensus repeat motif
copy_numbers     copy numbers in haplotypes, separated by semicolons ('-' for missing genotypes)
sizes            sizes (bp) in haplotypes, separated by semicolons ('-' for missing genotypes)
motifs           motifs in haplotypes, separated by semicolons ('-' for missing genotypes)
max_change       maximum difference in size (bp), over all haplotypes, from the reference genome size
num_samples      number of samples with genotype
num_calls        number of haplotypes with genotype
motif_frequency  number of haplotypes associated with each motif observed, e.g. CAG(10);CAA(2)
feature          gene element overlapped. Format: gene|transcript|element, where element = exon#|intron#|utr5|utr3|cds|promoter|exon_bound (exon boundary)
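
The column layout above can be read with a few lines of Python. The following is a minimal sketch, not part of this repository. It assumes the catalog is tab-delimited, that the columns appear in the order listed above, and that any extra header line either starts with '#' or begins with the literal column name chrom. The function names and the sample row are hypothetical, and the values in the sample row are made up for illustration only.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Column order as listed in the table above (assumed to match the file).
COLUMNS = [
    "chrom", "start", "end", "motif", "copy_numbers", "sizes", "motifs",
    "max_change", "num_samples", "num_calls", "motif_frequency", "feature",
]
MISSING = "-"  # marker for a missing genotype


def split_haplotypes(field: str, cast: Callable = str) -> List[Optional[object]]:
    """Split a semicolon-separated per-haplotype field; '-' becomes None."""
    return [None if v == MISSING else cast(v) for v in field.split(";")]


def parse_motif_frequency(field: str) -> Dict[str, int]:
    """Turn 'CAG(10);CAA(2)' into {'CAG': 10, 'CAA': 2}."""
    return {m: int(n) for m, n in re.findall(r"([^;()]+)\((\d+)\)", field)}


def read_catalog(lines: Iterable[str]) -> Tuple[List[str], List[dict]]:
    """Return (haplotype names, list of parsed rows)."""
    haplotypes: List[str] = []
    records: List[dict] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            # Only the first '#' line carries the haplotype names.
            if not haplotypes:
                haplotypes = line.lstrip("#").strip().split(";")
            continue
        fields = line.split("\t")  # assumption: tab-delimited
        if fields[0] == "chrom":   # assumption: optional column-name line
            continue
        if len(fields) < len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
        row = dict(zip(COLUMNS, fields))
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        row["copy_numbers"] = split_haplotypes(row["copy_numbers"], float)
        row["sizes"] = split_haplotypes(row["sizes"], int)
        row["motifs"] = split_haplotypes(row["motifs"])
        row["motif_frequency"] = parse_motif_frequency(row["motif_frequency"])
        records.append(row)
    return haplotypes, records


# Made-up example input, for illustration only.
SAMPLE = [
    "#sampleA.h1;sampleA.h2;sampleB.h1\n",
    "chr1\t1000\t1030\tCAG\t10.0;12.0;-\t30;36;-\tCAG;CAA;-\t6\t1\t2\t"
    "CAG(1);CAA(1)\tGENE1|TX1|exon1\n",
]

if __name__ == "__main__":
    names, rows = read_catalog(SAMPLE)
    for name, size in zip(names, rows[0]["sizes"]):
        print(name, size)
```

To read a real catalog file, pass an open file handle in place of SAMPLE. Because the per-haplotype lists keep the order of the header line, zipping them with the haplotype names recovers the genotype of each haplotype.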