SAiLoR: SCRaMbLE Assembler with Long Reads

Figure 1: The SAiLoR pipeline can be used to solve SCRaMbLEd chromosomes. SAiLoR performs a step called subpath calling to identify the LUs of each read. The LU order on single reads is identified by mapping each LU to each read resulting in subpaths. The overlaps between the subpaths are found and used for the assembly, using either greedy algorithms (EL or GreedySCS), or by creating an overlap graph and finding all the solutions with a pathfinding algorithm (OG). In this directed and weighted graph, each subpath is a node, and each overlap between two subpaths is an edge with a value corresponding to the length of the overlap. Among all paths, the most likely solution is the one with maximum overlaps sum, called Shortest Common Superpath (SCS).

Getting Started

The whole pipeline can be launched with the script SAiLoR_pipeline.sh. However, you need to write within the script the name of your samples which should be in fastq format and the name of your reference which should be in fasta format.

./SAiLoR_pipeline.py

The pipeline can be tested on the circular synIXR SCRaMbLEd strain JS602 using the script work_test.sh in the "test" folder. Moreover, the pipelines can also be tested on simulated reads.

cd test
./work_test.sh

The assembly algorithm' performance can be tested on simulated SCRaMbLEd chromosomes using the script SCRaMbLE_simulation_work2_parallel2.1_short.excel.py

python analysis_SIM/SCRaMbLE_simulation_work2_parallel2.1_short.excel.py

Dependencies

All dependencies can be installed with conda. In venv/environment_sailor.yml there is the conda virtual environment I used with all the dependencies installed.

Dependencies for the subpath calling:

minimap2
seqtk
Biopython

The python module graphviz is needed to produce the Overlap Graph visualizations (PDF).

Users' Guide

Unlike other software, SAiLoR assembles reads not from DNA overlaps but from segment or LU overlaps, as it assumes that SCRaMbLE does not change DNA sequences within LUs. Regarding non-synthetic genomes, it is necessary to cut the reference genome into segments of the appropriate size, and also to cut in mapping breakpoints, which will ensure that a structural variant is not inside a segment and that it could be correctly assembled.

The SAiLoR pipeline contains two main steps. The subpath calling of the sequencing reads, and the assembly of the subpaths (Figure 1). The subpath calling contains three steps.

The first step is the division of the reference into LoxPsym Units (LUs) or in segments.
In the second step, minimap2 is used to map all LUs or segments against the long reads which are used as reference.
In the third step, the mapping result in SAM format is converted into a file with the list of subpaths.

The assembly using greedy algorithms (e.g. EL or GreedySCS) contains one step. The assembly using the OG algorithm contains three steps. The OG is an Overlap–Layout–Consensus (OLC) assembly algorithm.

The first step is the finding of the overlaps between all the subpaths and the creation of the overlap graph.
The second step is the cleaning and pruning of the overlap graph.
The third step is the generation of the solutions using a pathfinding algorithm.

Usage

Output

The output of the subpath calling is a file in .lst format which contains the list of subpaths. This file contains one row for each read and each row contains the read ID, the read length (bp), and the LUs sequence of the read. The outputs of the assembly are two files: a text file containing the LUs sequence of the assembly (one row for each solution); and another file in which all the LUs solutions from the first file are converted into DNA sequences using the script subpath_to_DNA1.py The assembly with the OG algorithm outputs one additional file that contains the Overlap Graph in PDF file format.

Compare chromosomes

There are different ways to compare two or more assemblies.

Dot Plot
Arrow Plot
Similarity Score
Global Alignment

To compare two chromosomes using a dot plot use the script dot_plot.py

python dot_plot.py -path1 LU_seq_path1 -path2 LU_seq_path2 -name1 sample_name1 -name2 sample_name2

To visualize the chromosomes as arrow plots you can use the python function arrowplot in the arrowplot.py script.

arrowplot(path, max_LU=0, essential=[], CEN=[], names="", essential_col=True, filename="Arrowplot", REF=False, Show=True)

To greedy try to finds the best alignment between two circular chromosomes use the python function align_two_paths_inv in the comparison_sol.py script.

align_two_paths_inv(path1, path2, circular=True)

The "Similarity Score" is defined as the number of LUs that are the same between the two solutions divided by the longest solution length.
To calculate the "Similarity Score" between two solutions you can use the python function compare_both in the comparison_sol.py script.

compare_both(path1: list, path2: list)

To calculate the "Global alignment score" between two solutions you can use the python function seq_to_alignement in the Global_alignment_MM_functions.py script. P.S. this function is very slow for sequences very far apart (e.g. with similarity score less than 50 %).

seq_to_alignement(path1, path2, match_score, mismatch_score, gap_penalty)
# You can set values for a match, mismatch and a gap. Following there are the default settings.
match_score = 2
mismatch_score = -1
gap_penalty = -2

Quality control of the assemblies

To validate all the SAiLoR solutions, we used two strategies.
Firstly, we verified that all the subpaths could fit in the generated solutions.
Secondly, we ensured an even coverage of the long reads against the solution.

To verify that all the subpaths could fit in the generated solutions we use the python function check_sol in the comparison_sol.py script.

check_sol(solutions, paths, debug=False)

An additional verification step is possible for circular chromosomes. In these chromosomes we can verify that the chromosome is circular by finding some subpaths that connect the beginning and end of the chromosome.
To make this verification step, use the python function circular_with_paths in the overlap_lists.py script.

circular_with_paths(solution, paths, k=3, force_1_sol=False, debug=False)

Assembly algorithm overview

SAiLoR contains three independent algorithms to assemble subpaths using overlaps.

Elongate Longest (EL or F_sol3): EL combines the longest subpath with the second longest subpath that overlaps with the first. After merging the two subpaths, it repeats the procedure until no overlaps are found. Therefore, EL generates a single solution as output.
Greedy Shortest Common Superpath (GreedySCS): GreedySCS uses dynamic programming to find the two subpaths with the longest overlap, and it merges them. It repeats the entire process until no overlaps are found, generating a single solution as output.
Overlap Graph (OG): OG finds the overlaps between all reads and generates an overlap graph where each subpath is a node, and the edges are the overlaps between the subpaths. This results in a directed weighted graph where the weights are the overlap lengths. The graph is assessed via a modified "depth-first search" (DFS) pathfinding algorithm to find all paths that visit all nodes.

The EL and GreedySCS algorithms keep joining subpaths until no overlaps are found anymore. If only one path remains, this is consider the only solution. If two or more path remain, the longest path is consider the most likely solution and the other subpaths are either other chromosomes or errors in the sequencing that generate subpaths without overlaps which cannot be join to the rest of the subpaths.

On the other hand, the OG algorithm should find all the possible solutions. Therefore, most of the time the EL and GreedySCS solutions are contained in the OG solutions. This happened more than 98 % of the times in our simulated dataset (Figure 2.C).

The three assembly algorithms can be used with the script get_sol_easy.py by using one of these three options --assembler OG, F_sol3, greedy_scs . Note: the algorithm EL here is called F_sol3

python get_sol_easy.py --subpaths file_name.subpath.lst --assembler OG

Figure 2: Performance of our algorithms in solving SCRaMbLEd chromosomes. We simulated 10,000 linear and 10,000 circular chromosomes using SCRaMbLE-SIM. The initial chromosome length was 44 LUs, and 15 SCRaMbLE events were performed on each chromosome. A) Statistics on the simulated dataset. The violin plots show the variation inside the dataset in chromosome length, the number of unique LUs, how many unique LUs are necessary to reach half of the length of the chromosome (LG50), the copy number of the LG50th LU (NG50), and mean LU copy number (also called duplication rate). B) The stacked bar chart shows the success rate or solved percentage of the three algorithms created plus the short-read algorithm "SynGenoR" and the long-read assembler "Canu". In yellow are the percentage of chromosomes solved with a single solution, in green are chromosomes solved with multiple solutions, and in violet are the unsolved chromosomes (e.g., the generated solutions were wrong). C) The Venn diagram shows the percentage of chromosomes solved by zero, one, two, or three algorithms. D) The scatter plots show the distribution in chromosome length, the number of solutions, assembly time of solved (yellow), and unsolved (violet) chromosomes by the two algorithms OG and SynGenoR. Note: For the OG algorithm, the solutions that were not compatible with all long reads were eliminated; this lowers the number of multiple solutions and causes many unsolved chromosomes to have zero solutions.

Benchmarking of SAiLoR

We tested and benchmarked the SAiLoR pipeline and its assembly algorithms using 26 synIXR SCRaMbLEd strains and 20,000 simulated linear or circular SCRaMbLEd synIXR-sim chromosomes. All the scripts and data used for the benchmarking used in the manuscript can be found in the folder "analysis_synIXR" or "analysis_SIM" for synIXR and the simulated data respectively.

Getting help

You can raise an issue on the issue page if you run into any problems or have any additional questions or requests.

Citation

If you use this software, please consider citing our manuscript.

Limitations

So far, SAiLoR was only tested on yeast genomes and on SCRaMbLEd strains with one or two chromosomes. In the future, we will test this software with more complex genomes and with synthetic strains with more than two SCRaMbLEd chromosomes.

Name		Name	Last commit message	Last commit date
Latest commit History 147 Commits
analysis_SIM		analysis_SIM
analysis_synIXR		analysis_synIXR
bin		bin
media		media
test		test
venv		venv
LICENCE.txt		LICENCE.txt
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SAiLoR: SCRaMbLE Assembler with Long Reads

Getting Started

Table of Contents

Dependencies

Users' Guide

Usage

Output

Compare chromosomes

Quality control of the assemblies

Assembly algorithm overview

Benchmarking of SAiLoR

Getting help

Citation

Limitations

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SAiLoR: SCRaMbLE Assembler with Long Reads

Getting Started

Table of Contents

Dependencies

Users' Guide

Usage

Output

Compare chromosomes

Quality control of the assemblies

Assembly algorithm overview

Benchmarking of SAiLoR

Getting help

Citation

Limitations

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages