Figure 1: The SAiLoR pipeline can be used to solve SCRaMbLEd chromosomes. SAiLoR performs a step called subpath calling to identify the LUs of each read. The LU order on single reads is identified by mapping each LU to each read resulting in subpaths. The overlaps between the subpaths are found and used for the assembly, using either greedy algorithms (EL or GreedySCS), or by creating an overlap graph and finding all the solutions with a pathfinding algorithm (OG). In this directed and weighted graph, each subpath is a node, and each overlap between two subpaths is an edge with a value corresponding to the length of the overlap. Among all paths, the most likely solution is the one with maximum overlaps sum, called Shortest Common Superpath (SCS).
The whole pipeline can be launched with the script SAiLoR_pipeline.sh. However, you need to write within the script the name of your samples which should be in fastq format and the name of your reference which should be in fasta format.
./SAiLoR_pipeline.pyThe pipeline can be tested on the circular synIXR SCRaMbLEd strain JS602 using the script work_test.sh in the "test" folder. Moreover, the pipelines can also be tested on simulated reads.
cd test
./work_test.shThe assembly algorithm' performance can be tested on simulated SCRaMbLEd chromosomes using the script SCRaMbLE_simulation_work2_parallel2.1_short.excel.py
python analysis_SIM/SCRaMbLE_simulation_work2_parallel2.1_short.excel.py- Getting Started
- Dependencies
- Users' Guide
- Usage
- Assembly algorithm overview
- Getting help
- Citation
- Limitations
All dependencies can be installed with conda. In venv/environment_sailor.yml there is the conda virtual environment I used with all the dependencies installed.
Dependencies for the subpath calling:
- minimap2
- seqtk
- Biopython
The python module graphviz is needed to produce the Overlap Graph visualizations (PDF).
Unlike other software, SAiLoR assembles reads not from DNA overlaps but from segment or LU overlaps, as it assumes that SCRaMbLE does not change DNA sequences within LUs. Regarding non-synthetic genomes, it is necessary to cut the reference genome into segments of the appropriate size, and also to cut in mapping breakpoints, which will ensure that a structural variant is not inside a segment and that it could be correctly assembled.
The SAiLoR pipeline contains two main steps. The subpath calling of the sequencing reads, and the assembly of the subpaths (Figure 1). The subpath calling contains three steps.
- The first step is the division of the reference into LoxPsym Units (LUs) or in segments.
- In the second step, minimap2 is used to map all LUs or segments against the long reads which are used as reference.
- In the third step, the mapping result in SAM format is converted into a file with the list of subpaths.
The assembly using greedy algorithms (e.g. EL or GreedySCS) contains one step. The assembly using the OG algorithm contains three steps. The OG is an Overlap–Layout–Consensus (OLC) assembly algorithm.
- The first step is the finding of the overlaps between all the subpaths and the creation of the overlap graph.
- The second step is the cleaning and pruning of the overlap graph.
- The third step is the generation of the solutions using a pathfinding algorithm.
The output of the subpath calling is a file in .lst format which contains the list of subpaths. This file contains one row for each read and each row contains the read ID, the read length (bp), and the LUs sequence of the read. The outputs of the assembly are two files: a text file containing the LUs sequence of the assembly (one row for each solution); and another file in which all the LUs solutions from the first file are converted into DNA sequences using the script subpath_to_DNA1.py The assembly with the OG algorithm outputs one additional file that contains the Overlap Graph in PDF file format.
There are different ways to compare two or more assemblies.
- Dot Plot
- Arrow Plot
- Similarity Score
- Global Alignment
To compare two chromosomes using a dot plot use the script dot_plot.py
python dot_plot.py -path1 LU_seq_path1 -path2 LU_seq_path2 -name1 sample_name1 -name2 sample_name2To visualize the chromosomes as arrow plots you can use the python function arrowplot in the arrowplot.py script.
arrowplot(path, max_LU=0, essential=[], CEN=[], names="", essential_col=True, filename="Arrowplot", REF=False, Show=True)To greedy try to finds the best alignment between two circular chromosomes use the python function align_two_paths_inv in the comparison_sol.py script.
align_two_paths_inv(path1, path2, circular=True)The "Similarity Score" is defined as the number of LUs that are the same between the two solutions divided by the longest solution length.
To calculate the "Similarity Score" between two solutions you can use the python function compare_both in the comparison_sol.py script.
compare_both(path1: list, path2: list)To calculate the "Global alignment score" between two solutions you can use the python function seq_to_alignement in the Global_alignment_MM_functions.py script. P.S. this function is very slow for sequences very far apart (e.g. with similarity score less than 50 %).
seq_to_alignement(path1, path2, match_score, mismatch_score, gap_penalty)
# You can set values for a match, mismatch and a gap. Following there are the default settings.
match_score = 2
mismatch_score = -1
gap_penalty = -2To validate all the SAiLoR solutions, we used two strategies.
Firstly, we verified that all the subpaths could fit in the generated solutions.
Secondly, we ensured an even coverage of the long reads against the solution.
To verify that all the subpaths could fit in the generated solutions we use the python function check_sol in the comparison_sol.py script.
check_sol(solutions, paths, debug=False)An additional verification step is possible for circular chromosomes. In these chromosomes we can verify that the chromosome is circular by finding some subpaths that connect the beginning and end of the chromosome.
To make this verification step, use the python function circular_with_paths in the overlap_lists.py script.
circular_with_paths(solution, paths, k=3, force_1_sol=False, debug=False)SAiLoR contains three independent algorithms to assemble subpaths using overlaps.
- Elongate Longest (EL or F_sol3): EL combines the longest subpath with the second longest subpath that overlaps with the first. After merging the two subpaths, it repeats the procedure until no overlaps are found. Therefore, EL generates a single solution as output.
- Greedy Shortest Common Superpath (GreedySCS): GreedySCS uses dynamic programming to find the two subpaths with the longest overlap, and it merges them. It repeats the entire process until no overlaps are found, generating a single solution as output.
- Overlap Graph (OG): OG finds the overlaps between all reads and generates an overlap graph where each subpath is a node, and the edges are the overlaps between the subpaths. This results in a directed weighted graph where the weights are the overlap lengths. The graph is assessed via a modified "depth-first search" (DFS) pathfinding algorithm to find all paths that visit all nodes.
The EL and GreedySCS algorithms keep joining subpaths until no overlaps are found anymore. If only one path remains, this is consider the only solution. If two or more path remain, the longest path is consider the most likely solution and the other subpaths are either other chromosomes or errors in the sequencing that generate subpaths without overlaps which cannot be join to the rest of the subpaths.
On the other hand, the OG algorithm should find all the possible solutions. Therefore, most of the time the EL and GreedySCS solutions are contained in the OG solutions. This happened more than 98 % of the times in our simulated dataset (Figure 2.C).
The three assembly algorithms can be used with the script get_sol_easy.py by using one of these three options --assembler OG, F_sol3, greedy_scs . Note: the algorithm EL here is called F_sol3
python get_sol_easy.py --subpaths file_name.subpath.lst --assembler OG
Figure 2: Performance of our algorithms in solving SCRaMbLEd chromosomes. We simulated 10,000 linear and 10,000 circular chromosomes using SCRaMbLE-SIM. The initial chromosome length was 44 LUs, and 15 SCRaMbLE events were performed on each chromosome. A) Statistics on the simulated dataset. The violin plots show the variation inside the dataset in chromosome length, the number of unique LUs, how many unique LUs are necessary to reach half of the length of the chromosome (LG50), the copy number of the LG50th LU (NG50), and mean LU copy number (also called duplication rate). B) The stacked bar chart shows the success rate or solved percentage of the three algorithms created plus the short-read algorithm "SynGenoR" and the long-read assembler "Canu". In yellow are the percentage of chromosomes solved with a single solution, in green are chromosomes solved with multiple solutions, and in violet are the unsolved chromosomes (e.g., the generated solutions were wrong). C) The Venn diagram shows the percentage of chromosomes solved by zero, one, two, or three algorithms. D) The scatter plots show the distribution in chromosome length, the number of solutions, assembly time of solved (yellow), and unsolved (violet) chromosomes by the two algorithms OG and SynGenoR. Note: For the OG algorithm, the solutions that were not compatible with all long reads were eliminated; this lowers the number of multiple solutions and causes many unsolved chromosomes to have zero solutions.
We tested and benchmarked the SAiLoR pipeline and its assembly algorithms using 26 synIXR SCRaMbLEd strains and 20,000 simulated linear or circular SCRaMbLEd synIXR-sim chromosomes. All the scripts and data used for the benchmarking used in the manuscript can be found in the folder "analysis_synIXR" or "analysis_SIM" for synIXR and the simulated data respectively.
You can raise an issue on the issue page if you run into any problems or have any additional questions or requests.
If you use this software, please consider citing our manuscript.
So far, SAiLoR was only tested on yeast genomes and on SCRaMbLEd strains with one or two chromosomes. In the future, we will test this software with more complex genomes and with synthetic strains with more than two SCRaMbLEd chromosomes.