This repository is designed for detecting transposable element (TE) insertions using short-read sequencing data. It enables both the assessment of annotated TE presence/absence in the reference genome and the detection and genotyping of novel insertions not represented in the reference. The repository includes a Snakemake pipeline used for training, deploying, and benchmarking TE insertion detection models based on LightGBM classifiers.
-
Snakemake Pipeline
- The main pipeline is controlled by
workflow/Snakefile. - It orchestrates rules for:
- Data preprocessing
- Model training
- Model deployment (prediction)
- Benchmarking
- The new user-facing training pipeline lives in
workflow/trainingand is launched viaTEforest_train.py. Some legacy training/benchmarking scripts still contain hard-coded paths.
- The main pipeline is controlled by
-
Scripts
- Supporting scripts for training and model deployment are located in
workflow/scripts.
- Supporting scripts for training and model deployment are located in
-
Pre-Trained Models
- Trained Random Forest models are stored in
workflow/models.
- Trained Random Forest models are stored in
-
Outputs
-
Final output files (predictions) will be generated in your chosen work directory under the path:
outputs/{sample}_TEforest_bps_nonredundant.bed outputs/{sample}_TEforest_bps_nonredundant.vcf -
Candidate regions prior to precise breakpoint detection are also avaliable:
outputs/{sample}_TEforest_nonredundant.bed
-
Two supported options are available:
Create and activate the environment with the provided YAML:
conda env create -f TEforest.yml
conda activate TEforestWe publish a container image via GitHub Container Registry (GHCR). Use the image from the repository to run TEforest without managing local dependencies. See the repository’s GHCR workflow for the current image name and tags.
A Python script named TEforest.py is provided for launching the pipeline. This script will run the trained models on one or more genomes of your choice.
Note: Ongoing usability developments may change the command-line arguments or environment requirements in the future.
python TEforest.py \
--workflow_dir <path/to/workflow> \
--workdir <path/to/desired_workdir> \
--threads 16 \
--consensusTEs <path/to/consensusTEs.fasta> \
--ref_genome <path/to/reference_genome.fasta> \
--ref_te_locations <path/to/te_locations.bed> \
--euchromatin <path/to/euchromatin.bed> \
--model <path/to/pretrained_model.pkl> \
--ref_model <path/to/reference_model.pkl> \
--fq_base_path <path/to/fastq/files> \
--cleanup_intermediates \
--samples A1 A2 A3--workflow_dir: Directory containing theSnakefile(workflow/Snakefile).--workdir: Directory to store outputs and logs.--threads: Number of CPU threads to use. 16 per sample is recommended.--consensusTEs,--ref_genome,--ref_te_locations,--euchromatin: Input reference files for TE detection. All calls outside of the regions denoted in euchromatin will be filtered. Example files used for Drosophila melanogaster are located in example_files/. Be aware the BWA-mem2 will treat IUPAC bases as missing, so TEforest may have reduced performance on consensus sequences with high IUPAC content.--model: Path to the non-reference random forest model. Select a model that best matches the coverage of your reads. Alignments are automatically downsampled to the nearest available coverage—whichever is immediately lower than your average—using one of the following trained models: 5X, 10X, 20X, 30X, 40X, or 50X.--ref_model: Path to the reference random forest model.--fq_base_path: Directory containing FASTQ files. Should contain sample in name formatted {sample}_1.fastq.gz and {sample}_2.fastq.gz or {sample}_1.fq.gz.--cleanup_intermediates: Optional flag to delete large intermediate files after they are used (e.g.,fastp/,aligned/,downsampled/,candidate_regions_data/). Omit this if you want to keep read alignments or candidate-region BAMs for debugging.--samples: List of sample identifiers to process (space-separated). Note more than one sample can be run in parallel.
The script will generate:
- A
config.yamlin your specifiedworkdirwith all parameters. - Intermediate files used to run the pipeline
- Output
.bedfiles ({sample}_TEforest_bps_nonredundant.bed) for each sample inoutput/within the specifiedworkdir. - Output
.vcffiles ({sample}_TEforest_bps_nonredundant.vcf) for each sample inoutput/within the specifiedworkdir.
VCF notes:
- Non-reference calls are represented as INS records with ALT
<INS:ME:<TEFAM>>. - Present reference TEs are represented as DEL records with GT=0/0 (absence will be GT=1/1 once deletion calling is added).
Use TEforest_train.py with the training Snakefile at workflow/training/Snakefile to train a model on your own data. This pipeline labels candidate regions by overlap with a truth BED and trains a LightGBM classifier. Candidate regions that do not overlap a truth locus are labeled absent (0). If you are interested in training on one region of the genome,be sure to use the --euchromatin flag to denote the regions you want to include (otherwise all regions of the genome will be considered for candidate regions).
Truth BED requirements:
- Columns:
seqnames,start,end,zygosity,sample(and optionaltefamily). zygosityvalues are interpreted as 1, 0.5, or 0; in classifier mode these are mapped to classes 2/1/0.
Example usage:
python TEforest_train.py \
--workflow_dir /path/to/TEforest/workflow/training \
--workdir /path/to/training_workdir \
--threads 32 \
--consensusTEs /path/to/consensusTEs.fasta \
--ref_genome /path/to/reference_genome.fasta \
--ref_te_locations /path/to/ref_te_locations.bed \
--euchromatin /path/to/euchromatin.bed \
--truth_bed /path/to/truth_labels.bed \
--fq_base_path /path/to/fastqs \
--samples SAMPLE1 SAMPLE2 \
--label_mode classifier \
--train_mode classifier \
--test_size 0.1 \
--model_out training/teforest_classifier.pklNotes:
--label_mode classifiermaps 1/0.5/0 to 2/1/0;regressorkeeps the numeric zygosity values.--test_sizecontrols the train/test split (0 disables the split).- Output model path is relative to the workdir unless you give an absolute path.
A small dataset is bundled under tests/data/simulate_te_insertions so you can sanity-check a new install after activating the TEforest conda environment.
# from the repo root
./tests/run_example.shThis runs TEforest on RL150IS400_rep0_fwd and compares the breakpoint output to the expected call. The expected breakpoint is:
3L 5001 5005 roo|non-reference|1|RL150IS400_rep0_fwd|TEforest|rp|1 0 .
You can also override the work directory:
./tests/run_example.sh /tmp/teforest_example- We are actively working on improving performance across different use cases and on large datasets.
- Issues and pull requests are welcome.