Commit 54946ec

v0.5.0
1 parent c66e3a0 commit 54946ec

17 files changed, +268 -208 lines changed

CHANGELOG.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 # Changelog

-### v0.5.0 - 2024-xx-xx
+### v0.5.0 - 2024-12-18

 - New commands:
 - **`lexicmap utils remerge`: Rerun the merging step for an unfinished index**.
@@ -26,7 +26,7 @@
 - Improve the speed of anchor deduplication, genome information extraction, and result ordering.
 - Improve the speed of chaining for long queries.
 - Improve the speed of seed matching when using `-w/--load-whole-seeds`.
-- Improve the speed of alignment, and reduce the memory usage.
+- **Improve the speed of alignment, and reduce the memory usage**.
 - Remain compatible after the change of `lexicmap index`.
 - Add a new flag `--debug`.
 - `lexicmap utils genomes`:

README.md

Lines changed: 1 addition & 1 deletion
@@ -51,7 +51,7 @@ Running at this scale has previously only been achieved by [Phylign](https://git
 (prefiltering with [COBS](https://github.com/iqbal-lab-org/cobs) and alignment with [minimap2](https://github.com/lh3/minimap2)).
 1. For searching in all **2,340,672 Genbank+Refseq prokaryotic genomes**, *Blastn is unable to run with this dataset on common servers as it requires >2000 GB RAM*. (see [performance](#performance)).

-**With LexicMap** (48 CPUs),
+**With LexicMap v0.4.0** (48 CPUs),

 |Query |Genome hits|Time |RAM |
 |:-------------------|----------:|---------:|------:|

demo/README.md

Lines changed: 101 additions & 99 deletions
@@ -65,56 +65,57 @@ Overview

 ## Building an index

-
-21:14:53.264 [INFO] LexicMap v0.4.0 (12c33a3)
-21:14:53.264 [INFO] https://github.com/shenwei356/LexicMap
-21:14:53.264 [INFO]
-21:14:53.264 [INFO] checking input files ...
-21:14:53.265 [INFO] 15 input file(s) given
-21:14:53.265 [INFO]
-21:14:53.265 [INFO] --------------------- [ main parameters ] ---------------------
-21:14:53.265 [INFO]
-21:14:53.265 [INFO] input and output:
-21:14:53.265 [INFO] input directory: refs/
-21:14:53.265 [INFO] regular expression of input files: (?i)\.(f[aq](st[aq])?|fna)(\.gz|\.xz|\.zst|\.bz2)?$
-21:14:53.265 [INFO] *regular expression for extracting reference name from file name: (?i)(.+)\.(f[aq](st[aq])?|fna)(\.gz|\.xz|\.zst|\.bz2)?$
-21:14:53.265 [INFO] *regular expressions for filtering out sequences: []
-21:14:53.265 [INFO] min sequence length: 31
-21:14:53.265 [INFO] max genome size: 15000000
-21:14:53.265 [INFO] output directory: demo.lmi
-21:14:53.265 [INFO]
-21:14:53.265 [INFO] mask generation:
-21:14:53.265 [INFO] k-mer size: 31
-21:14:53.265 [INFO] number of masks: 40000
-21:14:53.265 [INFO] rand seed: 1
-21:14:53.265 [INFO]
-21:14:53.265 [INFO] seed data:
-21:14:53.265 [INFO] maximum sketching desert length: 200
-21:14:53.265 [INFO] distance of k-mers to fill deserts: 50
-21:14:53.265 [INFO] seeds data chunks: 16
-21:14:53.265 [INFO] seeds data indexing partitions: 1024
-21:14:53.265 [INFO]
-21:14:53.265 [INFO] general:
-21:14:53.265 [INFO] genome batch size: 5000
-21:14:53.265 [INFO] batch merge threads: 8
-21:14:53.265 [INFO]
-21:14:53.265 [INFO]
-21:14:53.265 [INFO] --------------------- [ generating masks ] ---------------------
-21:14:53.277 [INFO]
-21:14:53.277 [INFO] --------------------- [ building index ] ---------------------
-21:14:53.418 [INFO]
-21:14:53.418 [INFO] ------------------------[ batch 1/1 ]------------------------
-21:14:53.418 [INFO] building index for batch 1 with 15 files...
+19:13:55.369 [INFO] LexicMap v0.5.0 (c66e3a0)
+19:13:55.369 [INFO] https://github.com/shenwei356/LexicMap
+19:13:55.369 [INFO]
+19:13:55.369 [INFO] checking input files ...
+19:13:55.369 [INFO] scanning files from directory: refs/
+19:13:55.370 [INFO] 15 input file(s) given
+19:13:55.370 [INFO]
+19:13:55.370 [INFO] --------------------- [ main parameters ] ---------------------
+19:13:55.370 [INFO]
+19:13:55.370 [INFO] input and output:
+19:13:55.370 [INFO] input directory: refs/
+19:13:55.370 [INFO] regular expression of input files: (?i)\.(f[aq](st[aq])?|fna)(\.gz|\.xz|\.zst|\.bz2)?$
+19:13:55.370 [INFO] *regular expression for extracting reference name from file name: (?i)(.+)\.(f[aq](st[aq])?|fna)(\.gz|\.xz|\.zst|\.bz2)?$
+19:13:55.370 [INFO] *regular expressions for filtering out sequences: []
+19:13:55.370 [INFO] min sequence length: 31
+19:13:55.370 [INFO] max genome size: 15000000
+19:13:55.370 [INFO] output directory: demo.lmi
+19:13:55.370 [INFO]
+19:13:55.370 [INFO] mask generation:
+19:13:55.370 [INFO] k-mer size: 31
+19:13:55.370 [INFO] number of masks: 40000
+19:13:55.370 [INFO] rand seed: 1
+19:13:55.370 [INFO]
+19:13:55.370 [INFO] seed data:
+19:13:55.370 [INFO] maximum sketching desert length: 200
+19:13:55.370 [INFO] distance of k-mers to fill deserts: 50
+19:13:55.370 [INFO] seeds data chunks: 16
+19:13:55.370 [INFO] seeds data indexing partitions: 4096
+19:13:55.370 [INFO]
+19:13:55.370 [INFO] general:
+19:13:55.370 [INFO] genome batch size: 5000
+19:13:55.370 [INFO] threads: 16
+19:13:55.370 [INFO] batch merge threads: 8
+19:13:55.370 [INFO]
+19:13:55.370 [INFO]
+19:13:55.370 [INFO] --------------------- [ generating masks ] ---------------------
+19:13:55.382 [INFO]
+19:13:55.382 [INFO] --------------------- [ building index ] ---------------------
+19:13:56.018 [INFO]
+19:13:56.018 [INFO] ------------------------[ batch 1/1 ]------------------------
+19:13:56.018 [INFO] building index for batch 1 with 15 files...
 processed files: 15 / 15 [======================================] ETA: 0s. done
-21:14:58.041 [INFO] writing seeds...
-21:14:58.353 [INFO] finished writing seeds in 312.385058ms
-21:14:58.353 [INFO] finished building index for batch 1 in: 4.935232077s
-21:14:58.354 [INFO]
-21:14:58.354 [INFO] finished building LexicMap index from 15 files with 40000 masks in 5.0896074s
-21:14:58.354 [INFO] LexicMap index saved: demo.lmi
-21:14:58.354 [INFO]
-21:14:58.354 [INFO] elapsed time: 5.089632193s
-21:14:58.354 [INFO]
+19:14:00.601 [INFO] writing seeds...
+19:14:00.745 [INFO] finished writing seeds in 143.225662ms
+19:14:00.745 [INFO] finished building index for batch 1 in: 4.72683742s
+19:14:00.746 [INFO]
+19:14:00.746 [INFO] finished building LexicMap index from 15 files with 40000 masks in 5.392303552s
+19:14:00.746 [INFO] LexicMap index saved: demo.lmi
+19:14:00.746 [INFO]
+19:14:00.746 [INFO] elapsed time: 5.392329816s
+19:14:00.746 [INFO]

 Overview of index files:

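A note on the two file-name patterns printed in the build log of the hunk above: the sketch below applies them to a few file names to show which files would be picked up and what reference name would be extracted. The patterns are copied verbatim from the log. The file names are hypothetical, and Python's `re` module is used only for illustration, so treat this as an approximation of the tool's behaviour rather than its implementation.

```python
import re

# Both patterns are copied verbatim from the index-build log.
# Assumption: Python's `re` stands in for whatever regex engine the tool uses.
INPUT_FILE_RE = re.compile(r"(?i)\.(f[aq](st[aq])?|fna)(\.gz|\.xz|\.zst|\.bz2)?$")
REF_NAME_RE = re.compile(r"(?i)(.+)\.(f[aq](st[aq])?|fna)(\.gz|\.xz|\.zst|\.bz2)?$")


def is_input_file(filename: str) -> bool:
    """Return True if the file name matches the input-file pattern."""
    return INPUT_FILE_RE.search(filename) is not None


def reference_name(filename: str):
    """Return the first capture group (the reference name), or None if no match."""
    match = REF_NAME_RE.search(filename)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical file names, chosen only to exercise the patterns.
    for name in ["GCF_000006945.2.fa.gz", "genome.FNA", "reads.fastq.zst", "notes.txt"]:
        print(name, is_input_file(name), reference_name(name))
```

The greedy `(.+)` group keeps everything before the sequence-file extension, so dots inside an accession are preserved in the extracted name.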
@@ -136,40 +137,41 @@ Overview of index files:


 $ dirsize demo.lmi/
-demo.lmi/: 59.55 MB
-46.31 MB seeds
-12.93 MB genomes
-312.53 KB masks.bin
-375.00 B genomes.map.bin
-322.00 B info.toml
+demo.lmi/: 73.24 MiB (76,801,302)
+60.00 MiB seeds
+12.94 MiB genomes
+312.53 KiB masks.bin
+563 B info.toml
+375 B genomes.map.bin
+0 B genomes.chunks.bin

 ## Searching

 ### A 16S rRNA gene sequence

 $ lexicmap search -d demo.lmi/ q.gene.fasta -o q.gene.fasta.lexicmap.tsv
-21:16:05.831 [INFO] LexicMap v0.4.0 (12c33a3)
-21:16:05.832 [INFO] https://github.com/shenwei356/LexicMap
-21:16:05.832 [INFO]
-21:16:05.832 [INFO] checking input files ...
-21:16:05.832 [INFO] 1 input file given: q.gene.fasta
-21:16:05.832 [INFO]
-21:16:05.832 [INFO] loading index: demo.lmi/
-21:16:05.832 [INFO] reading masks...
-21:16:05.835 [INFO] reading indexes of seeds (k-mer-value) data...
-21:16:06.521 [INFO] creating genome reader pools, each batch with 16 readers...
-21:16:06.522 [INFO] index loaded in 690.267508ms
-21:16:06.522 [INFO]
-21:16:06.522 [INFO] searching ...
-
-21:16:06.569 [INFO]
-21:16:06.569 [INFO] processed queries: 1, speed: 1278.719 queries per minute
-21:16:06.569 [INFO] 100.0000% (1/1) queries matched
-21:16:06.569 [INFO] done searching
-21:16:06.569 [INFO] search results saved to: q.gene.fasta.lexicmap.tsv
-21:16:06.569 [INFO]
-21:16:06.569 [INFO] elapsed time: 737.449755ms
-21:16:06.569 [INFO]
+19:16:55.757 [INFO] LexicMap v0.5.0 (c66e3a0)
+19:16:55.757 [INFO] https://github.com/shenwei356/LexicMap
+19:16:55.757 [INFO]
+19:16:55.757 [INFO] checking input files ...
+19:16:55.757 [INFO] 1 input file given: q.gene.fasta
+19:16:55.757 [INFO]
+19:16:55.757 [INFO] loading index: demo.lmi/
+19:16:55.758 [INFO] reading masks...
+19:16:55.762 [INFO] reading indexes of seeds (k-mer-value) data...
+19:16:58.781 [INFO] creating genome reader pools, each batch with 16 readers...
+19:16:58.781 [INFO] index loaded in 3.023370768s
+19:16:58.781 [INFO]
+19:16:58.781 [INFO] searching with 16 threads...
+
+19:16:58.821 [INFO]
+19:16:58.821 [INFO] processed queries: 1, speed: 1506.171 queries per minute
+19:16:58.821 [INFO] 100.0000% (1/1) queries matched
+19:16:58.821 [INFO] done searching
+19:16:58.821 [INFO] search results saved to: q.gene.fasta.lexicmap.tsv
+19:16:58.821 [INFO]
+19:16:58.821 [INFO] elapsed time: 3.063458635s
+19:16:58.821 [INFO]

 Result preview.
 Here we create a `species` column from the genome ID column (`sgenome`) and replace the assembly accessions with species names.
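Returning to the updated `dirsize` listing earlier in this hunk: the total is reported both as `73.24 MiB` and as `76,801,302` bytes. A minimal sketch, assuming the `KiB`/`MiB` labels denote binary units (1024-based), checks that the two figures agree and that the per-entry sizes add up to the total within rounding. The values are copied from the listing; the helper names are illustrative only.

```python
# Assumption: KiB/MiB in the listing are binary units (powers of 1024).
UNIT_BYTES = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2}

# (value, unit) pairs copied from the new `dirsize demo.lmi/` listing.
ENTRIES = {
    "seeds": (60.00, "MiB"),
    "genomes": (12.94, "MiB"),
    "masks.bin": (312.53, "KiB"),
    "info.toml": (563, "B"),
    "genomes.map.bin": (375, "B"),
    "genomes.chunks.bin": (0, "B"),
}
TOTAL_BYTES = 76_801_302  # byte count printed next to the directory total


def to_bytes(value: float, unit: str) -> float:
    """Convert a (value, unit) pair to bytes."""
    return value * UNIT_BYTES[unit]


def to_mib(n_bytes: float) -> float:
    """Convert a byte count to MiB."""
    return n_bytes / UNIT_BYTES["MiB"]


def summed_bytes(entries: dict) -> float:
    """Sum the listed entries; each was rounded to two decimals, so this is approximate."""
    return sum(to_bytes(value, unit) for value, unit in entries.values())


if __name__ == "__main__":
    print(f"total: {to_mib(TOTAL_BYTES):.2f} MiB")
    print(f"sum of entries: {to_mib(summed_bytes(ENTRIES)):.2f} MiB (approximate)")
```

Because every entry is rounded to two decimals, the summed figure can differ from the exact byte count by a few KiB.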
@@ -353,28 +355,28 @@ Sbjct 460059 CAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA 460100
 Here we use the flag `-w/--load-whole-seeds` to accelerate searching.

 $ lexicmap search -d demo.lmi/ q.long-reads.fasta.gz -o q.long-reads.fasta.gz.lexicmap.tsv.gz -w -q 70
-21:19:44.244 [INFO] LexicMap v0.4.0 (12c33a3)
-21:19:44.244 [INFO] https://github.com/shenwei356/LexicMap
-21:19:44.244 [INFO]
-21:19:44.244 [INFO] checking input files ...
-21:19:44.244 [INFO] 1 input file given: q.long-reads.fasta.gz
-21:19:44.244 [INFO]
-21:19:44.244 [INFO] loading index: demo.lmi/
-21:19:44.245 [INFO] reading masks...
-21:19:44.248 [INFO] reading seeds (k-mer-value) data into memory...
-21:19:44.404 [INFO] creating genome reader pools, each batch with 16 readers...
-21:19:44.404 [INFO] index loaded in 159.465898ms
-21:19:44.404 [INFO]
-21:19:44.404 [INFO] searching ...
-processed queries: 3584, speed: 3958.189 queries per minute
-21:20:41.841 [INFO]
-21:20:41.841 [INFO] processed queries: 3692, speed: 3856.741 queries per minute
-21:20:41.841 [INFO] 76.2730% (2816/3692) queries matched
-21:20:41.841 [INFO] done searching
-21:20:41.841 [INFO] search results saved to: q.long-reads.fasta.gz.lexicmap.tsv.gz
-21:20:41.846 [INFO]
-21:20:41.846 [INFO] elapsed time: 57.601479821s
-21:20:41.846 [INFO]
+19:17:49.069 [INFO] LexicMap v0.5.0 (c66e3a0)
+19:17:49.069 [INFO] https://github.com/shenwei356/LexicMap
+19:17:49.069 [INFO]
+19:17:49.069 [INFO] checking input files ...
+19:17:49.069 [INFO] 1 input file given: q.long-reads.fasta.gz
+19:17:49.069 [INFO]
+19:17:49.069 [INFO] loading index: demo.lmi/
+19:17:49.069 [INFO] reading masks...
+19:17:49.073 [INFO] reading seeds (k-mer-value) data into memory...
+19:17:51.324 [INFO] creating genome reader pools, each batch with 16 readers...
+19:17:51.325 [INFO] index loaded in 2.256185788s
+19:17:51.325 [INFO]
+19:17:51.325 [INFO] searching with 16 threads...
+processed queries: 3584, speed: 2235.509 queries per minute
+19:19:33.442 [INFO]
+19:19:33.442 [INFO] processed queries: 3692, speed: 2169.281 queries per minute
+19:19:33.442 [INFO] 76.3543% (2819/3692) queries matched
+19:19:33.442 [INFO] done searching
+19:19:33.442 [INFO] search results saved to: q.long-reads.fasta.gz.lexicmap.tsv.gz
+19:19:33.449 [INFO]
+19:19:33.449 [INFO] elapsed time: 1m44.380463612s
+19:19:33.449 [INFO]

 Result overview:

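The summary figures in the long-read search logs above can be reproduced from the timestamps. The sketch below assumes that the reported speed is the number of processed queries divided by the wall time between the `searching` line and the final summary line. That is an inference from the numbers, not something the log states. All timestamps and counts are copied from the two logs; the function names are illustrative.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S.%f"  # log timestamps such as 19:17:51.325


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two same-day log timestamps."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds()


def queries_per_minute(n_queries: int, start: str, end: str) -> float:
    """Throughput, assuming speed = queries / search wall time (an inference)."""
    return n_queries * 60.0 / elapsed_seconds(start, end)


def matched_percent(matched: int, total: int) -> float:
    """Share of queries with at least one match, in percent."""
    return 100.0 * matched / total


if __name__ == "__main__":
    # v0.5.0 log: search phase from 19:17:51.325 to 19:19:33.442, 3692 queries.
    new_qpm = queries_per_minute(3692, "19:17:51.325", "19:19:33.442")
    # v0.4.0 log: search phase from 21:19:44.404 to 21:20:41.841, 3692 queries.
    old_qpm = queries_per_minute(3692, "21:19:44.404", "21:20:41.841")
    print(f"v0.5.0: {new_qpm:.1f} queries per minute")
    print(f"v0.4.0: {old_qpm:.1f} queries per minute")
    print(f"matched: {matched_percent(2819, 3692):.4f}%")
```

Timestamps carry millisecond precision, so the recomputed speeds agree with the logged values only to within a small rounding difference. Each log is a single demo run, so the two speeds are not a controlled benchmark.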