@@ -65,56 +65,57 @@ Overview
## Building an index

-
- 21:14:53.264 [INFO] LexicMap v0.4.0 (12c33a3)
- 21:14:53.264 [INFO] https://github.com/shenwei356/LexicMap
- 21:14:53.264 [INFO]
- 21:14:53.264 [INFO] checking input files ...
- 21:14:53.265 [INFO] 15 input file(s) given
- 21:14:53.265 [INFO]
- 21:14:53.265 [INFO] --------------------- [ main parameters ] ---------------------
- 21:14:53.265 [INFO]
- 21:14:53.265 [INFO] input and output:
- 21:14:53.265 [INFO] input directory: refs/
- 21:14:53.265 [INFO] regular expression of input files: (?i)\.(f[aq](st[aq])?|fna)(\.gz|\.xz|\.zst|\.bz2)?$
- 21:14:53.265 [INFO] *regular expression for extracting reference name from file name: (?i)(.+)\.(f[aq](st[aq])?|fna)(\.gz|\.xz|\.zst|\.bz2)?$
- 21:14:53.265 [INFO] *regular expressions for filtering out sequences: []
- 21:14:53.265 [INFO] min sequence length: 31
- 21:14:53.265 [INFO] max genome size: 15000000
- 21:14:53.265 [INFO] output directory: demo.lmi
- 21:14:53.265 [INFO]
- 21:14:53.265 [INFO] mask generation:
- 21:14:53.265 [INFO] k-mer size: 31
- 21:14:53.265 [INFO] number of masks: 40000
- 21:14:53.265 [INFO] rand seed: 1
- 21:14:53.265 [INFO]
- 21:14:53.265 [INFO] seed data:
- 21:14:53.265 [INFO] maximum sketching desert length: 200
- 21:14:53.265 [INFO] distance of k-mers to fill deserts: 50
- 21:14:53.265 [INFO] seeds data chunks: 16
- 21:14:53.265 [INFO] seeds data indexing partitions: 1024
- 21:14:53.265 [INFO]
- 21:14:53.265 [INFO] general:
- 21:14:53.265 [INFO] genome batch size: 5000
- 21:14:53.265 [INFO] batch merge threads: 8
- 21:14:53.265 [INFO]
- 21:14:53.265 [INFO]
- 21:14:53.265 [INFO] --------------------- [ generating masks ] ---------------------
- 21:14:53.277 [INFO]
- 21:14:53.277 [INFO] --------------------- [ building index ] ---------------------
- 21:14:53.418 [INFO]
- 21:14:53.418 [INFO] ------------------------[ batch 1/1 ]------------------------
- 21:14:53.418 [INFO] building index for batch 1 with 15 files...
+ 19:13:55.369 [INFO] LexicMap v0.5.0 (c66e3a0)
+ 19:13:55.369 [INFO] https://github.com/shenwei356/LexicMap
+ 19:13:55.369 [INFO]
+ 19:13:55.369 [INFO] checking input files ...
+ 19:13:55.369 [INFO] scanning files from directory: refs/
+ 19:13:55.370 [INFO] 15 input file(s) given
+ 19:13:55.370 [INFO]
+ 19:13:55.370 [INFO] --------------------- [ main parameters ] ---------------------
+ 19:13:55.370 [INFO]
+ 19:13:55.370 [INFO] input and output:
+ 19:13:55.370 [INFO] input directory: refs/
+ 19:13:55.370 [INFO] regular expression of input files: (?i)\.(f[aq](st[aq])?|fna)(\.gz|\.xz|\.zst|\.bz2)?$
+ 19:13:55.370 [INFO] *regular expression for extracting reference name from file name: (?i)(.+)\.(f[aq](st[aq])?|fna)(\.gz|\.xz|\.zst|\.bz2)?$
+ 19:13:55.370 [INFO] *regular expressions for filtering out sequences: []
+ 19:13:55.370 [INFO] min sequence length: 31
+ 19:13:55.370 [INFO] max genome size: 15000000
+ 19:13:55.370 [INFO] output directory: demo.lmi
+ 19:13:55.370 [INFO]
+ 19:13:55.370 [INFO] mask generation:
+ 19:13:55.370 [INFO] k-mer size: 31
+ 19:13:55.370 [INFO] number of masks: 40000
+ 19:13:55.370 [INFO] rand seed: 1
+ 19:13:55.370 [INFO]
+ 19:13:55.370 [INFO] seed data:
+ 19:13:55.370 [INFO] maximum sketching desert length: 200
+ 19:13:55.370 [INFO] distance of k-mers to fill deserts: 50
+ 19:13:55.370 [INFO] seeds data chunks: 16
+ 19:13:55.370 [INFO] seeds data indexing partitions: 4096
+ 19:13:55.370 [INFO]
+ 19:13:55.370 [INFO] general:
+ 19:13:55.370 [INFO] genome batch size: 5000
+ 19:13:55.370 [INFO] threads: 16
+ 19:13:55.370 [INFO] batch merge threads: 8
+ 19:13:55.370 [INFO]
+ 19:13:55.370 [INFO]
+ 19:13:55.370 [INFO] --------------------- [ generating masks ] ---------------------
+ 19:13:55.382 [INFO]
+ 19:13:55.382 [INFO] --------------------- [ building index ] ---------------------
+ 19:13:56.018 [INFO]
+ 19:13:56.018 [INFO] ------------------------[ batch 1/1 ]------------------------
+ 19:13:56.018 [INFO] building index for batch 1 with 15 files...
processed files: 15 / 15 [======================================] ETA: 0s. done
- 21:14:58.041 [INFO] writing seeds...
- 21:14:58.353 [INFO] finished writing seeds in 312.385058ms
- 21:14:58.353 [INFO] finished building index for batch 1 in: 4.935232077s
- 21:14:58.354 [INFO]
- 21:14:58.354 [INFO] finished building LexicMap index from 15 files with 40000 masks in 5.0896074s
- 21:14:58.354 [INFO] LexicMap index saved: demo.lmi
- 21:14:58.354 [INFO]
- 21:14:58.354 [INFO] elapsed time: 5.089632193s
- 21:14:58.354 [INFO]
+ 19:14:00.601 [INFO] writing seeds...
+ 19:14:00.745 [INFO] finished writing seeds in 143.225662ms
+ 19:14:00.745 [INFO] finished building index for batch 1 in: 4.72683742s
+ 19:14:00.746 [INFO]
+ 19:14:00.746 [INFO] finished building LexicMap index from 15 files with 40000 masks in 5.392303552s
+ 19:14:00.746 [INFO] LexicMap index saved: demo.lmi
+ 19:14:00.746 [INFO]
+ 19:14:00.746 [INFO] elapsed time: 5.392329816s
+ 19:14:00.746 [INFO]

Overview of index files:

@@ -136,40 +137,41 @@ Overview of index files:
$ dirsize demo.lmi/
- demo.lmi/: 59.55 MB
- 46.31 MB seeds
- 12.93 MB genomes
- 312.53 KB masks.bin
- 375.00 B genomes.map.bin
- 322.00 B info.toml
+ demo.lmi/: 73.24 MiB (76,801,302)
+ 60.00 MiB seeds
+ 12.94 MiB genomes
+ 312.53 KiB masks.bin
+ 563 B info.toml
+ 375 B genomes.map.bin
+ 0 B genomes.chunks.bin

## Searching

### A 16S rRNA gene sequence

$ lexicmap search -d demo.lmi/ q.gene.fasta -o q.gene.fasta.lexicmap.tsv
- 21:16:05.831 [INFO] LexicMap v0.4.0 (12c33a3)
- 21:16:05.832 [INFO] https://github.com/shenwei356/LexicMap
- 21:16:05.832 [INFO]
- 21:16:05.832 [INFO] checking input files ...
- 21:16:05.832 [INFO] 1 input file given: q.gene.fasta
- 21:16:05.832 [INFO]
- 21:16:05.832 [INFO] loading index: demo.lmi/
- 21:16:05.832 [INFO] reading masks...
- 21:16:05.835 [INFO] reading indexes of seeds (k-mer-value) data...
- 21:16:06.521 [INFO] creating genome reader pools, each batch with 16 readers...
- 21:16:06.522 [INFO] index loaded in 690.267508ms
- 21:16:06.522 [INFO]
- 21:16:06.522 [INFO] searching ...
-
- 21:16:06.569 [INFO]
- 21:16:06.569 [INFO] processed queries: 1, speed: 1278.719 queries per minute
- 21:16:06.569 [INFO] 100.0000% (1/1) queries matched
- 21:16:06.569 [INFO] done searching
- 21:16:06.569 [INFO] search results saved to: q.gene.fasta.lexicmap.tsv
- 21:16:06.569 [INFO]
- 21:16:06.569 [INFO] elapsed time: 737.449755ms
- 21:16:06.569 [INFO]
+ 19:16:55.757 [INFO] LexicMap v0.5.0 (c66e3a0)
+ 19:16:55.757 [INFO] https://github.com/shenwei356/LexicMap
+ 19:16:55.757 [INFO]
+ 19:16:55.757 [INFO] checking input files ...
+ 19:16:55.757 [INFO] 1 input file given: q.gene.fasta
+ 19:16:55.757 [INFO]
+ 19:16:55.757 [INFO] loading index: demo.lmi/
+ 19:16:55.758 [INFO] reading masks...
+ 19:16:55.762 [INFO] reading indexes of seeds (k-mer-value) data...
+ 19:16:58.781 [INFO] creating genome reader pools, each batch with 16 readers...
+ 19:16:58.781 [INFO] index loaded in 3.023370768s
+ 19:16:58.781 [INFO]
+ 19:16:58.781 [INFO] searching with 16 threads ...
+
+ 19:16:58.821 [INFO]
+ 19:16:58.821 [INFO] processed queries: 1, speed: 1506.171 queries per minute
+ 19:16:58.821 [INFO] 100.0000% (1/1) queries matched
+ 19:16:58.821 [INFO] done searching
+ 19:16:58.821 [INFO] search results saved to: q.gene.fasta.lexicmap.tsv
+ 19:16:58.821 [INFO]
+ 19:16:58.821 [INFO] elapsed time: 3.063458635s
+ 19:16:58.821 [INFO]

Result preview.

Here we create a `species` column from the genome ID column (`sgenome`) and replace the assembly accessions with species names.
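
As an illustration of the step described above, here is a minimal Python sketch of one way to derive a `species` column from `sgenome`. It is not the command used in the README. Only the column names `sgenome` and `species` come from the text; the function name, the sample rows, the accession strings and the mapping are all hypothetical placeholders.

```python
import csv
import io


def add_species_column(tsv_text, accession_to_species):
    """Append a `species` column to tab-separated search results.

    The value is looked up from the `sgenome` column. Accessions missing
    from the mapping are kept unchanged so that no row is dropped.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = list(reader.fieldnames) + ["species"]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        accession = row["sgenome"]
        row["species"] = accession_to_species.get(accession, accession)
        writer.writerow(row)
    return out.getvalue()


# Hypothetical rows and mapping, for illustration only (not real results).
example = "query\tsgenome\nq1\tACC_TEST_1\nq2\tACC_TEST_2\n"
mapping = {"ACC_TEST_1": "Species A"}
result = add_species_column(example, mapping)
print(result, end="")
```

Falling back to the accession when no species name is known keeps the table complete, which makes unmapped genomes easy to spot.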
@@ -353,28 +355,28 @@ Sbjct 460059 CAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA 460100
Here we use the flag `-w/--load-whole-seeds` to accelerate searching.

$ lexicmap search -d demo.lmi/ q.long-reads.fasta.gz -o q.long-reads.fasta.gz.lexicmap.tsv.gz -w -q 70
- 21:19:44.244 [INFO] LexicMap v0.4.0 (12c33a3)
- 21:19:44.244 [INFO] https://github.com/shenwei356/LexicMap
- 21:19:44.244 [INFO]
- 21:19:44.244 [INFO] checking input files ...
- 21:19:44.244 [INFO] 1 input file given: q.long-reads.fasta.gz
- 21:19:44.244 [INFO]
- 21:19:44.244 [INFO] loading index: demo.lmi/
- 21:19:44.245 [INFO] reading masks...
- 21:19:44.248 [INFO] reading seeds (k-mer-value) data into memory...
- 21:19:44.404 [INFO] creating genome reader pools, each batch with 16 readers...
- 21:19:44.404 [INFO] index loaded in 159.465898ms
- 21:19:44.404 [INFO]
- 21:19:44.404 [INFO] searching ...
- processed queries: 3584, speed: 3958.189 queries per minute
- 21:20:41.841 [INFO]
- 21:20:41.841 [INFO] processed queries: 3692, speed: 3856.741 queries per minute
- 21:20:41.841 [INFO] 76.2730% (2816/3692) queries matched
- 21:20:41.841 [INFO] done searching
- 21:20:41.841 [INFO] search results saved to: q.long-reads.fasta.gz.lexicmap.tsv.gz
- 21:20:41.846 [INFO]
- 21:20:41.846 [INFO] elapsed time: 57.601479821s
- 21:20:41.846 [INFO]
+ 19:17:49.069 [INFO] LexicMap v0.5.0 (c66e3a0)
+ 19:17:49.069 [INFO] https://github.com/shenwei356/LexicMap
+ 19:17:49.069 [INFO]
+ 19:17:49.069 [INFO] checking input files ...
+ 19:17:49.069 [INFO] 1 input file given: q.long-reads.fasta.gz
+ 19:17:49.069 [INFO]
+ 19:17:49.069 [INFO] loading index: demo.lmi/
+ 19:17:49.069 [INFO] reading masks...
+ 19:17:49.073 [INFO] reading seeds (k-mer-value) data into memory...
+ 19:17:51.324 [INFO] creating genome reader pools, each batch with 16 readers...
+ 19:17:51.325 [INFO] index loaded in 2.256185788s
+ 19:17:51.325 [INFO]
+ 19:17:51.325 [INFO] searching with 16 threads ...
+ processed queries: 3584, speed: 2235.509 queries per minute
+ 19:19:33.442 [INFO]
+ 19:19:33.442 [INFO] processed queries: 3692, speed: 2169.281 queries per minute
+ 19:19:33.442 [INFO] 76.3543% (2819/3692) queries matched
+ 19:19:33.442 [INFO] done searching
+ 19:19:33.442 [INFO] search results saved to: q.long-reads.fasta.gz.lexicmap.tsv.gz
+ 19:19:33.449 [INFO]
+ 19:19:33.449 [INFO] elapsed time: 1m44.380463612s
+ 19:19:33.449 [INFO]
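
The elapsed times in these logs use mixed units (`159.465898ms`, `57.601479821s`, `1m44.380463612s`), which makes the old and new runs awkward to compare by eye. The following is a rough Python sketch, not part of LexicMap, that converts such strings to seconds. It assumes the logs follow Go's duration notation; the helper name `go_duration_seconds` is hypothetical.

```python
import re

# Seconds per unit, for the unit suffixes assumed to appear in the logs.
_UNIT_SECONDS = {
    "h": 3600.0,
    "m": 60.0,
    "s": 1.0,
    "ms": 1e-3,
    "us": 1e-6,
    "µs": 1e-6,
    "ns": 1e-9,
}
# Two-letter units are listed first so "ms" is not read as "m" followed by "s".
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")


def go_duration_seconds(text):
    """Convert a duration such as '1m44.380463612s' to seconds."""
    total = 0.0
    pos = 0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            raise ValueError(f"unexpected characters in duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"not a recognised duration: {text!r}")
    return total


# Elapsed times copied from the two long-read search logs above.
old_run = go_duration_seconds("57.601479821s")
new_run = go_duration_seconds("1m44.380463612s")
print(f"old: {old_run:.3f} s, new: {new_run:.3f} s, ratio: {new_run / old_run:.2f}")
```

The sketch only normalises the numbers reported in the logs. It does not explain the difference between the two runs, which may come from hardware or load as much as from the version change.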

Result overview: