Commit 33891d8

docs: more documentation
1 parent da5965e commit 33891d8

8 files changed

+44
-20
lines changed

docs/count_example1.rst

+28-13
@@ -7,7 +7,7 @@
77
Basic Experiment workflow
88
=========================
99

10-
This example runs the count workflow on 5'/5' WT MPRA data in the HEPG2 cell line from `Klein J., Agarwal, V., Keith, A., et al. 2019 <https://www.biorxiv.org/content/10.1101/576405v1.full.pdf>`_.
10+
This example runs the count workflow on 5'/5' WT MPRA data in the HepG2 cell line from `Klein J., Agarwal, V., Keith, A., et al. 2019 <https://www.biorxiv.org/content/10.1101/576405v1.full.pdf>`_.
1111

1212
Prerequirements
1313
======================
@@ -21,8 +21,8 @@ Installing MPRAsnakeflow
2121
Please install conda, the MPRAsnakeflow environment, and clone the actual ``MPRAsnakeflow`` master branch. You will find more help under :ref:`Installation`.
2222

2323
Producing an association (.tsv.gz) file
24-
------------------------------------
25-
This workflow requires a python dictionary of candidate regulatory sequence (CRS) mapped to their barcodes in a tab separated (.tsv) format. For this example the file can be generated using :ref:`Assignment example` or it can be found in :code:`resources/count_basic` folder in `MPRAsnakelfow <https://github.com/kircherlab/MPRAsnakeflow/>`_.
24+
----------------------------------------
25+
This workflow requires a Python dictionary of candidate regulatory sequences (CRS) mapped to their barcodes in a tab-separated (.tsv) format. For this example, the file can be generated using :ref:`Assignment example`, or it can be found in the :code:`resources/count_basic` folder of `MPRAsnakeflow <https://github.com/kircherlab/MPRAsnakeflow/>`_ (file :code:`SRR10800986_barcodes_to_coords.tsv.gz`).
2626

2727
Alternatively, if the association file is in pickle (.pickle) format because you used MPRAflow, you can convert it to .tsv.gz format with the built-in function in MPRAsnakeflow using the following code:
2828

@@ -98,7 +98,7 @@ The folder should look like this:
9898
├── SRR10800886_1.fastq.gz
9999
├── SRR10800886_2.fastq.gz
100100
├── SRR10800886_3.fastq.gz
101-
└── SRR10800986_filtered_coords_to_barcodes.tsv.gz
101+
└── SRR10800986_barcodes_to_coords.tsv.gz
102102
103103
Here is an overview of the files:
104104

@@ -157,10 +157,15 @@ First we do a try run using snakemake :code:`-n` option. The MPRAsnakeflow comma
157157
You should see a list of rules that will be executed. This is the summary:
158158

159159
.. code-block:: text
160+
160161
Job stats:
161162
job count min threads max threads
162163
------------------------------------------------------------ ------- ------------- -------------
163164
all 1 1 1
165+
assigned_counts_assignBarcodes 6 1 1
166+
assigned_counts_dna_rna_merge 3 1 1
167+
assigned_counts_filterAssignment 1 1 1
168+
assigned_counts_make_master_tables 1 1 1
164169
counts_create_BAM_umi 6 1 1
165170
counts_dna_rna_merge_counts 6 1 1
166171
counts_filter_counts 6 1 1
@@ -185,7 +190,7 @@ You should see a list of rules that will be executed. This is the summary:
185190
statistic_counts_frequent_umis 6 1 1
186191
statistic_counts_stats_merge 2 1 1
187192
statistic_counts_table 12 1 1
188-
total 139 1 10
193+
total 94 1 1
189194
190195
When the dry-run does not give any errors, we run the workflow. We use a machine with 30 threads/cores to run the workflow. The MPRAsnakeflow command is:
191196

@@ -195,20 +200,30 @@ When dry-drun does not give any errors we will run the workflow. We use a machin
195200
196201
.. note:: Please modify your code when running in a cluster environment. We have an example SLURM config file here :code:`config/sbatch.yml`.
197202

198-
If everything works fine the 25 rules showed above will run:
203+
If everything works fine, the 29 rules shown above will run. Rules starting with :code:`counts_` belong to the raw count rules, those starting with :code:`assigned_counts_` to counts matched to the assignment, and those starting with :code:`statistic_` to statistics. Here is a brief description of the rules.
199204

200205
all
201206
The overall :code:`all` rule. It defines which final output files are expected.
202207
counts_create_BAM_umi
203-
TODO
204-
counts_dna_rna_merge_counts
205-
TODO
208+
Create a BAM file from FASTQ input, merge the FW and REV reads, and save the UMI in the XI flag.
209+
counts_raw_counts_umi
210+
Count BCxUMI combinations from the BAM files.
206211
counts_filter_counts
207-
TODO
212+
Filter the counts to keep only BCs of the correct length (defined in the config file).
208213
counts_final_counts_umi
209-
TODO
210-
counts_raw_counts_umi
211-
TODO
214+
Discard PCR duplicates (each BCxUMI combination is counted only once). The final counts can be found here: :code:`results/experiments/exampleCount/counts/HepG2_<1,2,3>_<DNA/RNA>_filtered_counts.tsv.gz`.
215+
counts_dna_rna_merge_counts
216+
Merge DNA and RNA counts together.
217+
This is done in two ways. First, do not allow zeros in DNA or RNA BCs (when :code:`min_counts` is not zero for DNA and RNA).
218+
Second, with zeros, so a BC can be defined only in the DNA or RNA (when :code:`min_counts` is zero for DNA or RNA).
219+
assigned_counts_filterAssignment
220+
Use only unique assignments.
221+
assigned_counts_assignBarcodes
222+
Assign RNA and DNA barcodes separately to compute the statistics for assigned barcodes.
223+
assigned_counts_dna_rna_merge
224+
Assign merged RNA/DNA barcodes. Filter BCs depending on the :code:`min_counts` option. The output for each replicate is here: :code:`results/experiments/exampleCount/assigned_counts/fromFile/exampleConfig/HepG2_<1,2,3>_merged_assigned_counts.tsv.gz`.
225+
assigned_counts_make_master_tables
226+
Final master table with all replicates combined. The output is here: :code:`results/experiments/exampleCount/assigned_counts/fromFile/exampleConfig/HepG2_allreps_merged.tsv.gz` and, with the :code:`bc-threshold` applied, here: :code:`results/experiments/exampleCount/assigned_counts/fromFile/exampleConfig/HepG2_allreps_minThreshold_merged.tsv.gz`.
212227
statistic_assigned_counts_combine_BC_assignment_stats
213228
TODO
214229
statistic_assigned_counts_combine_BC_assignment_stats_helper
Binary file not shown.
Binary file not shown.
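
The rule descriptions added to ``docs/count_example1.rst`` above mention two steps that are easy to confuse: keeping only BCs of the correct length and counting each BCxUMI combination once. The following is a minimal Python sketch of that idea, not the code MPRAsnakeflow actually runs. The function name ``final_counts``, the parameter ``bc_length``, and the toy reads are assumptions made for illustration, and the sketch folds both steps into one function although the workflow runs them as separate rules.

```python
from collections import Counter


def final_counts(bc_umi_pairs, bc_length):
    """Count barcodes once per distinct UMI, keeping only barcodes of the expected length.

    Hypothetical helper for illustration; it is not part of MPRAsnakeflow.
    """
    # A set drops repeated (barcode, UMI) pairs, which are treated as PCR duplicates.
    unique_pairs = {
        (bc, umi) for bc, umi in bc_umi_pairs if len(bc) == bc_length
    }
    # Each remaining pair contributes exactly one count to its barcode.
    return Counter(bc for bc, _ in unique_pairs)


# Toy input: (barcode, UMI) pairs as they might be read from a BAM file.
reads = [
    ("ACGTACGTACGTACG", "AAAAAAAAAA"),
    ("ACGTACGTACGTACG", "AAAAAAAAAA"),  # same barcode and UMI: PCR duplicate
    ("ACGTACGTACGTACG", "CCCCCCCCCC"),  # same barcode, different UMI
    ("TTTT", "GGGGGGGGGG"),             # barcode of the wrong length
]

counts = final_counts(reads, bc_length=15)
print(dict(counts))
```

In this toy input the 15 bp barcode ends up with a count of 2, because two of its three reads share a UMI, and the 4 bp barcode is dropped.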

resources/count_basic/config.yml

+1-1
@@ -9,7 +9,7 @@ experiments:
99
assignments:
1010
fromFile:
1111
type: file
12-
assignment_file: data/SRR10800986_filtered_coords_to_barcodes.tsv.gz
12+
assignment_file: data/SRR10800986_barcodes_to_coords.tsv.gz
1313
design_file: data/design.fa
1414
configs:
1515
exampleConfig:

resources/count_basic/experiment.csv

+1-1
@@ -1,4 +1,4 @@
11
Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R
22
HEPG2,1,SRR10800881_1.fastq.gz,SRR10800881_2.fastq.gz,SRR10800881_3.fastq.gz,SRR10800882_1.fastq.gz,SRR10800882_2.fastq.gz,SRR10800882_3.fastq.gz
33
HEPG2,2,SRR10800883_1.fastq.gz,SRR10800883_2.fastq.gz,SRR10800883_3.fastq.gz,SRR10800884_1.fastq.gz,SRR10800884_2.fastq.gz,SRR10800884_3.fastq.gz
4-
HEPG2,3,SRR10800885_1.fastq.gz,SRR10800885_2.fastq.gz,SRR10800885_3.fastq.gz,SRR10800886_1.fastq.gz,SRR10800886_2.fastq.gz,SRR10800886_3.fastq.gz
4+
HEPG2,3,SRR10800885_1.fastq.gz,SRR10800885_2.fastq.gz,SRR10800885_3.fastq.gz,SRR10800886_1.fastq.gz,SRR10800886_2.fastq.gz,SRR10800886_3.fastq.gz

workflow/rules/assigned_counts.smk

+6
@@ -85,6 +85,9 @@ rule assigned_counts_assignBarcodes:
8585

8686

8787
rule assigned_counts_dna_rna_merge:
88+
"""
89+
Assign merged RNA/DNA barcodes. Filter BC depending on the min_counts option.
90+
"""
8891
conda:
8992
"../envs/python3.yaml"
9093
input:
@@ -116,6 +119,9 @@ rule assigned_counts_dna_rna_merge:
116119

117120

118121
rule assigned_counts_make_master_tables:
122+
"""
123+
Final master table with all replicates combined. With and without threshold.
124+
"""
119125
conda:
120126
"../envs/r.yaml"
121127
input:

workflow/rules/assignment/statistic.smk

+3-3
@@ -12,7 +12,7 @@ rule assignment_statistic_totalCounts:
1212
output:
1313
"results/assignment/{assignment}/statistic/total_counts.tsv.gz",
1414
log:
15-
"results/log/assignment/statistic_totalCounts.{assignment}.log",
15+
"results/logs/assignment/statistic_totalCounts.{assignment}.log",
1616
shell:
1717
"""
1818
python {input.script} --input {input.bc} --output {output} &> {log}
@@ -31,7 +31,7 @@ rule assignment_statistic_assignedCounts:
3131
output:
3232
"results/assignment/{assignment}/statistic/assigned_counts.{assignment_config}.tsv.gz",
3333
log:
34-
"results/log/assignment/statistic_assignedCounts.{assignment}.{assignment_config}.log",
34+
"results/logs/assignment/statistic_assignedCounts.{assignment}.{assignment_config}.log",
3535
shell:
3636
"""
3737
python {input.script} --input {input.bc} --output {output} &> {log}
@@ -51,7 +51,7 @@ rule assignment_statistic_assignment:
5151
stats="results/assignment/{assignment}/statistic/assignment.{assignment_config}.tsv.gz",
5252
plot="results/assignment/{assignment}/statistic/assignment.{assignment_config}.png",
5353
log:
54-
"results/log/assignment/statistic_assignment.{assignment}.{assignment_config}.log",
54+
"results/logs/assignment/statistic_assignment.{assignment}.{assignment_config}.log",
5555
shell:
5656
"""
5757
Rscript {input.script} --input {input.bc} --statistic {output.stats} --plot {output.plot} &> {log}

workflow/rules/counts.smk

+5-2
@@ -99,6 +99,9 @@ rule counts_mergeTrimReads_demultiplexed_BAM_umi:
9999

100100

101101
rule counts_create_BAM_umi:
102+
"""
103+
Create a BAM file from FASTQ input, merge the FW and REV reads, and save the UMI in the XI flag.
104+
"""
102105
input:
103106
fw_fastq=lambda wc: getFW(wc.project, wc.condition, wc.replicate, wc.type),
104107
rev_fastq=lambda wc: getRev(wc.project, wc.condition, wc.replicate, wc.type),
@@ -267,8 +270,8 @@ rule counts_final_counts_umi_samplerer:
267270
rule counts_dna_rna_merge_counts:
268271
"""
269272
Merge DNA and RNA counts together.
270-
Is done in two ways. First no not allow zeros in DNA or RNA BCs (withoutZeros).
271-
Second with zeros, so a BC can be defined only in the DNA or RNA (withZeros)
273+
This is done in two ways. First, do not allow zeros in DNA or RNA BCs (RNA and DNA min_counts not zero).
274+
Second, with zeros, so a BC can be defined only in the DNA or RNA (RNA or DNA min_counts zero).
272275
"""
273276
conda:
274277
"../envs/default.yaml"

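The docstring of ``counts_dna_rna_merge_counts`` above describes two merge modes that depend on ``min_counts``. A small Python sketch may make the difference concrete. It is an illustration under stated assumptions, not the script the rule calls: the function ``merge_dna_rna`` and its parameters ``min_dna`` and ``min_rna`` are hypothetical stand-ins for the per-type ``min_counts`` settings, and barcodes missing from one table are assumed to count as zero.

```python
def merge_dna_rna(dna, rna, min_dna=1, min_rna=1):
    """Merge per-barcode DNA and RNA counts into {barcode: (dna, rna)}.

    Hypothetical helper for illustration; it is not part of MPRAsnakeflow.
    """
    merged = {}
    # Look at every barcode seen in either table.
    for bc in dna.keys() | rna.keys():
        d = dna.get(bc, 0)  # a barcode absent from the DNA table counts as 0
        r = rna.get(bc, 0)  # a barcode absent from the RNA table counts as 0
        # Keep the barcode only if both counts reach their thresholds.
        if d >= min_dna and r >= min_rna:
            merged[bc] = (d, r)
    return merged


dna = {"AAA": 5, "CCC": 2}
rna = {"AAA": 7, "GGG": 3}

# Thresholds above zero: only barcodes present in both tables survive.
no_zeros = merge_dna_rna(dna, rna, min_dna=1, min_rna=1)

# Thresholds of zero: barcodes seen in only one table are kept with a zero count.
with_zeros = merge_dna_rna(dna, rna, min_dna=0, min_rna=0)

print(sorted(no_zeros.items()))
print(sorted(with_zeros.items()))
```

With non-zero thresholds the merge behaves like an inner join on the barcode; with a zero threshold on one side it behaves like an outer join for that side.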