
Commit e969838

Merge branch 'main' into feature/update-loadable-refs-to-read_group
2 parents 75d6a1f + 181f36a commit e969838

60 files changed, +740 -227 lines changed

.github/workflows/docker_branches.yml

+12 -2
@@ -19,7 +19,9 @@ jobs:
 id: meta
 uses: docker/metadata-action@v3
 with:
-images: ${{ env.REPO_LOWER }}
+images: |
+${{ env.REPO_LOWER }}
+ghcr.io/${{ env.REPO_LOWER }}
 tags: |
 type=ref,event=branch,prefix=branch-
 type=ref,event=pr
@@ -35,8 +37,16 @@ jobs:
 with:
 username: ${{ secrets.DOCKERHUB_USERNAME }}
 password: ${{ secrets.DOCKERHUB_TOKEN }}
+-
+name: Login to GitHub Container Registry
+uses: docker/login-action@v2
+with:
+registry: ghcr.io
+username: ${{ secrets.GH_USERNAME }}
+password: ${{ secrets.GH_TOKEN }}
 - name: Push to Docker Hub
 uses: docker/build-push-action@v4
 with:
 push: true
-tags: ${{ steps.meta.outputs.tags }}
+tags: |
+${{ steps.meta.outputs.tags }}
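
With two entries under `images`, `docker/metadata-action` emits each generated tag once per image, which is what lets a single build-push step publish to both Docker Hub and GHCR. A minimal shell sketch of that expansion follows. The repository name `stjude/xenocp`, the tag `branch-main`, and the helper `expand_refs` are illustrative assumptions and are not part of the workflow.

```shell
#!/usr/bin/env bash
# Sketch only: shows how a list of images and a list of tags combine into
# fully qualified references. All values passed in below are illustrative.
set -euo pipefail

# expand_refs IMAGES TAGS -> prints one "image:tag" per line.
# IMAGES and TAGS are whitespace-separated lists.
expand_refs() {
  local images="$1" tags="$2" image tag
  for image in $images; do
    for tag in $tags; do
      printf '%s:%s\n' "$image" "$tag"
    done
  done
}

expand_refs "stjude/xenocp ghcr.io/stjude/xenocp" "branch-main"
```

Each extra registry added to `images` multiplies the number of pushed references, so every registry listed there needs a matching login step before the push.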

Dockerfile

+24 -24
@@ -13,23 +13,24 @@ RUN apt-get update \
 zlib1g-dev \
 && rm -rf /var/lib/apt/lists/*

-RUN pip3 install --user --ignore-installed \
+RUN pip3 install --ignore-installed \
+--prefix /usr/local \
 cwlref-runner \
 html5lib

 RUN cd /tmp \
-&& wget https://github.com/lh3/bwa/releases/download/v0.7.13/bwa-0.7.13.tar.bz2 \
-&& echo "559b3c63266e5d5351f7665268263dbb9592f3c1c4569e7a4a75a15f17f0aedc *bwa-0.7.13.tar.bz2" | sha256sum --check \
-&& tar xf bwa-0.7.13.tar.bz2 \
-&& cd bwa-0.7.13 \
+&& wget https://github.com/lh3/bwa/releases/download/v0.7.17/bwa-0.7.17.tar.bz2 \
+&& echo "de1b4d4e745c0b7fc3e107b5155a51ac063011d33a5d82696331ecf4bed8d0fd *bwa-0.7.17.tar.bz2" | sha256sum --check \
+&& tar xf bwa-0.7.17.tar.bz2 \
+&& cd bwa-0.7.17 \
 && make -j$(nproc) \
 && mv bwa /usr/local/bin

 RUN cd /tmp \
-&& wget https://github.com/alexdobin/STAR/archive/2.7.1a.tar.gz \
-&& echo "9a35bf4e8a12bec505e11132bc53f94671f596584a6a0dd8f237120dd0df740e *2.7.1a.tar.gz" | sha256sum --check \
-&& tar xf 2.7.1a.tar.gz \
-&& mv STAR-2.7.1a/bin/Linux_x86_64_static/STAR /usr/local/bin
+&& wget https://github.com/alexdobin/STAR/archive/refs/tags/2.7.10a.tar.gz \
+&& echo "af0df8fdc0e7a539b3ec6665dce9ac55c33598dfbc74d24df9dae7a309b0426a *2.7.10a.tar.gz" | sha256sum --check \
+&& tar xf 2.7.10a.tar.gz \
+&& mv STAR-2.7.10a/bin/Linux_x86_64_static/STAR /usr/local/bin

 # bz2 and lzma support is for CRAM files. curses is for `samtools tview`.
 RUN cd /tmp \
@@ -68,11 +69,13 @@ ENV PATH /opt/gradle/bin:${PATH}
 COPY bin /tmp/xenocp/bin
 COPY src /tmp/xenocp/src
 COPY dependencies /tmp/xenocp/dependencies
+COPY gradle /tmp/xenocp/gradle
+COPY gradlew /tmp/xenocp/gradlew
 COPY build.gradle /tmp/xenocp/build.gradle
 COPY settings.gradle /tmp/xenocp/settings.gradle

 RUN cd /tmp/xenocp \
-&& gradle installDist \
+&& ./gradlew installDist \
 && cp -r build/install/xenocp /opt

 FROM ubuntu:20.04
@@ -88,17 +91,14 @@ RUN apt-get update \
 file \
 && rm -rf /var/lib/apt/lists/*

-ENV PATH /root/.local/bin:$PATH
-
-COPY --from=builder /root/.local /root/.local
-COPY --from=builder /usr/local/bin/bwa /usr/local/bin/bwa
-COPY --from=builder /usr/local/bin/STAR /usr/local/bin/STAR
-COPY --from=builder /usr/local/bin/samtools /usr/local/bin/samtools
-COPY --from=builder /usr/local/bin/sambamba /usr/local/bin/sambamba
-COPY --from=builder /opt/picard /opt/picard
-COPY --from=builder /opt/xenocp /opt/xenocp
-COPY --from=builder /opt/xenocp/bin/* /usr/local/bin/
-
-COPY cwl /opt/xenocp/cwl
-
-ENTRYPOINT ["cwl-runner", "--parallel", "--outdir", "results", "--no-container", "/opt/xenocp/cwl/xenocp.cwl"]
+COPY --chmod=755 --from=builder /usr/local/bin/cwl* /usr/local/bin/
+COPY --chmod=755 --from=builder /usr/local/lib /usr/local/lib/
+COPY --chmod=755 --from=builder /usr/local/bin/bwa /usr/local/bin/bwa
+COPY --chmod=755 --from=builder /usr/local/bin/STAR /usr/local/bin/STAR
+COPY --chmod=755 --from=builder /usr/local/bin/samtools /usr/local/bin/samtools
+COPY --chmod=755 --from=builder /usr/local/bin/sambamba /usr/local/bin/sambamba
+COPY --chmod=755 --from=builder /opt/picard /opt/picard
+COPY --chmod=755 --from=builder /opt/xenocp /opt/xenocp
+COPY --chmod=755 --from=builder /opt/xenocp/bin/* /usr/local/bin/
+
+COPY --chmod=755 cwl /opt/xenocp/cwl
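
Both download steps in the builder stage follow the same pattern: fetch an archive, then pipe a pinned digest into `sha256sum --check` so the build stops if the file differs. The sketch below reproduces that pattern in isolation. It uses a throwaway local file instead of the real tarballs, and the file name, its contents, and the helper `verify_sha256` are placeholders introduced for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Stand-in for a downloaded release archive.
printf 'example payload\n' > archive.tar.bz2

# In the Dockerfile the digest is pinned as a literal; here it is computed
# once so the example does not depend on any external file.
expected="$(sha256sum archive.tar.bz2 | cut -d' ' -f1)"

# verify_sha256 DIGEST FILE -> exit status 0 only when the digest matches.
verify_sha256() {
  echo "$1 *$2" | sha256sum --check --status
}

verify_sha256 "$expected" archive.tar.bz2 && echo "archive verified"

# Any change to the file makes the same check fail.
printf 'tampered\n' >> archive.tar.bz2
if ! verify_sha256 "$expected" archive.tar.bz2; then
  echo "checksum mismatch detected"
fi
```

Because the digest is tied to a specific archive, bumping a tool version (as this commit does for bwa and STAR) requires updating the URL, the digest, and the extracted directory name together.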

README.md

+134 -42
@@ -1,11 +1,36 @@
 # XenoCP

+- [XenoCP](#xenocp)
+- [Quick Start](#quick-start)
+- [Introduction to XenoCP](#introduction-to-xenocp)
+- [Reference Files](#reference-files)
+- [BWA for DNA Reads](#bwa-for-dna-reads)
+- [STAR for RNA Reads](#star-for-rna-reads)
+- [Local Usage without Docker](#local-usage-without-docker)
+- [Prerequisites](#prerequisites)
+- [Obtain and Build XenoCP](#obtain-and-build-xenocp)
+- [Inputs](#inputs)
+- [Run](#run)
+- [Local Usage with Docker](#local-usage-with-docker)
+- [Build Docker image](#build-docker-image)
+- [Run](#run-1)
+- [Singularity as a Docker alternative](#singularity-as-a-docker-alternative)
+- [WDL workflow](#wdl-workflow)
+- [WDL reference files](#wdl-reference-files)
+- [Running WDL](#running-wdl)
+- [Evaluate test data results](#evaluate-test-data-results)
+- [St. Jude Cloud](#st-jude-cloud)
+- [Availability](#availability)
+- [Seeking help](#seeking-help)
+- [Citing XenoCP](#citing-xenocp)
+- [Common Issues](#common-issues)
+
 XenoCP is a tool for cleansing mouse reads in xenograft BAMs.
 XenoCP can be easily incorporated into any workflow, as it takes a BAM file
 as input and efficiently cleans up the mouse contamination. The output is a clean
 human BAM file that could be used for downstream genomic analysis.

-## Getting started
+## Quick Start

 XenoCP can be run in the cloud on DNAnexus at
 https://platform.dnanexus.com/app/stjude_xenocp
@@ -39,7 +64,38 @@ XenoCP workflow:
 <!--![Alt text](images/xenocp_workflow2.png) -->
 <img src="images/xenocp_workflow2.png" width="500">

-## Prerequisites
+## Reference Files
+
+XenoCP performs mapping against the host genome, so it requires indexes for the
+host reference genome and mapper being used.
+
+A common use case is cleansing DNA reads with a mouse host. For this use case,
+you can download the a BWA index for MGSCv37 from
+http://ftp.stjude.org/pub/software/xenocp/reference/MGSCv37
+
+To build your own reference files, first download the FASTA file for your genome
+assembly. Then, create the index for your mapper:
+
+### BWA for DNA Reads
+
+```
+$ bwa index -p $FASTA $FASTA
+```
+
+### STAR for RNA Reads
+
+Download an annotation file such as gencode, and then run:
+
+```
+$ STAR --runMode genomeGenerate --genomeDir STAR --genomeFastaFiles $FASTA --sjdbGTFfile $ANNOTATION --sjdbOverhang 125
+```
+
+## Local Usage without Docker
+
+### Prerequisites
+
+First, install the following prerequisites. Note that if you are only using one
+of the two mappers, bwa and STAR, you can omit the other.

 * [bwa] =0.7.13
 * [STAR] =2.7.1a
@@ -73,28 +129,25 @@ disabled.
 [zlib]: https://www.zlib.net/
 [sambamba]: http://lomereiter.github.io/sambamba/

+### Obtain and Build XenoCP

-
-## Local usage
-
-
-### Obtain XenoCP
-
-Clone XenoCP from GitHub:
+Clone XenoCP from GitHub:
 ```
 git clone https://github.com/stjude/XenoCP.git
 ```

-### Build XenoCP
-
-Once the prerequisites are satisfied, build XenoCP using Gradle.
+Build XenoCP using Gradle:

 ```
 $ gradle installDist
 ```

-Add the artifacts under `build/install/xenocp/lib` to your Java `CLASSPATH`.
-Add the artifacts under `build/install/xenocp/bin` to your `PATH`.
+Add the artifacts under `build/install/xenocp` to your `PATH` and your Java `CLASSPATH`:
+
+```
+export PATH=$PATH:`pwd`/build/install/xenocp/bin
+export CLASSPATH=$CLASSPATH:`pwd`/build/install/xenocp/lib/*
+```

 ### Inputs

@@ -113,8 +166,8 @@ aligner: "bwa aln"
 For example, a prefix of `MGSCv37.fa` would assume for bwa alignment that
 the following files in the same directory exist:
 `MGSCv37.fa.amb`, `MGSCv37.fa.ann`, `MGSCv37.fa.bwt`,
-`MGSCv37.fa.pac`, and `MGSCv37.fa.sa`.
-For STAR alignment, `ref_db_prefix` should be a directory and
+`MGSCv37.fa.pac`, and `MGSCv37.fa.sa`. `index` should be the path to that folder.
+For STAR alignment, `index` should be a directory and
 it would assume the following files exist in the directory:
 `chrLength.txt`, `chrNameLength.txt`, `chrName.txt`, `chrStart.txt`,
 `exonGeTrInfo.tab`, `exonInfo.tab`, `geneInfo.tab`, `Genome`,
@@ -134,25 +187,8 @@ output_prefix: xenocp-
 output_extension: bam
 ```

-### Create Reference Files
-
-Download the FASTA file for your genome assembly and run the following commands to create other files:
-#### BWA reference files
-```
-$ bwa index -p $FASTA $FASTA
-```
-#### STAR reference files
-In addition the genomic FASTA, STAR reference should use an annotation file (e.g. gencode).
-```
-$ STAR --runMode genomeGenerate --genomeDir STAR --genomeFastaFiles $FASTA --sjdbGTFfile $ANNOTATION --sjdbOverhang 125
-```
-
 [CWL inputs]: https://www.commonwl.org/user_guide/02-1st-example/index.html

-### Download MGSCv37 reference files
-
-Reference files are provided for version MGSCv37 of mouse and are available from http://ftp.stjude.org/pub/software/xenocp/reference/MGSCv37
-
 ### Run

 XenoCP uses [CWL] to describe its workflow.
@@ -162,12 +198,12 @@ Then run the following.

 ```
 $ mkdir results
-$ cwltool --outdir results cwl/xenocp.cwl sample_data/input_data/inputs_local.yml
+$ cwltool --preserve-environment CLASSPATH --no-container --outdir results cwl/xenocp.cwl sample_data/input_data/inputs_local.yml
 ```

 [CWL]: https://www.commonwl.org/

-## Docker
+## Local Usage with Docker

 XenoCP provides a [Dockerfile] that builds an image with all the included
 dependencies. To use this image, install [Docker] for your platform.
@@ -184,10 +220,10 @@ $ docker build --tag xenocp .

 ### Run

-The Docker image uses `cwl-runner cwl/xenocp.cwl` as its entrypoint.
+The Docker image does not provide an entrypoint.

-The image assumes three working directories: `/data` for inputs, `/references` for
-reference files, and `/results` for outputs. `/data` and `/references` can be
+The image assumes three working directories: `/data` for inputs, `/reference` for
+reference files, and `/results` for outputs. `/data` and `/reference` can be
 read-only, where as `/results` needs write access.

 The paths given in the input parameters file must be from inside the
@@ -197,13 +233,16 @@ container, not the host, e.g.,
 bam:
 class: File
 path: /data/sample.bam
-ref_db_prefix: /reference/ref.fa
+ref_db_prefix: ref.fa
+index:
+class: Directory
+path: /reference
 aligner: "bwa aln"
 ```

-The following is an example `run` command where files are stored in `test/{data,reference}`. Outputs are saved in `test/results`.
+The following is an example `run` command where the data files are stored in the current directory under `sample_data/input_data`. Outputs are saved in `results` in the current directory. The path to the reference files on the host machine needs to be provided.

-This example assumes you are running against Mus musculus (genome build MGSCv37). Set the path to the folder containing your reference data
+This example assumes you are running against *Mus musculus* (genome build MGSCv37). Set the path to the folder containing your reference data
 and run the following command to produce output from the included sample data. Test output for comparison is located at `sample_data/output_data`.

 ```
```
@@ -212,12 +251,65 @@ $ docker run \
212251
--mount type=bind,source=$(pwd)/sample_data/input_data,target=/data,readonly \
213252
--mount type=bind,source=/path/to/reference,target=/reference,readonly \
214253
--mount type=bind,source=$(pwd)/results,target=/results \
215-
xenocp \
254+
ghcr.io/stjude/xenocp:latest \
255+
cwl-runner \
256+
--parallel \
257+
--outdir results \
258+
--no-container \
259+
/opt/xenocp/cwl/xenocp.cwl \
260+
/data/inputs.yml
261+
```
262+
263+
### Singularity as a Docker alternative
264+
265+
Singularity is an experimental container solution that is an HPC-friendly alternative to Docker. For many reasons, `singularity` is not a drop-in replacement for Docker. Many applications require modification to fully run with `singularity`. This alternative is provided on a best-effort basis. If issues are encountered, please open an issue on this repository with details and the maintainers will try to provide support as possible.
266+
267+
```
268+
$ mkdir $(pwd)/results
269+
$ singularity run \
270+
--containall \ # Isolate container from host
271+
-W /path/to/directory \ # Provide a directory with sufficient space to use for working directory
272+
-B $(pwd)/sample_data/input_data:/data \
273+
-B /path/to/reference:/reference \
274+
-B $(pwd)/results:/results \
275+
docker://ghcr.io/stjude/xenocp:latest \
276+
cwl-runner \
277+
--parallel \
278+
--outdir results \
279+
--no-container \
280+
/opt/xenocp/cwl/xenocp.cwl \
216281
/data/inputs.yml
217282
```
218283

284+
Note: when running using Singularity on an HPC, problems can arise if the
285+
default temporary file location, /tmp, is small. To solve this, include
286+
`-W <dir>` when executing via Singularity to redirect temp files to a
287+
larger directory `<dir>`.
288+
289+
Note: By default, `singularity` makes many host resources available inside the container. This is in contrast with Docker's native isolation. This also tends to cause conflicts and errors when running Docker-based workflows. Therefore we recommend always using the `--containall` option to Singularity.
290+
219291
[Dockerfile]: ./Dockerfile
220292

293+
## WDL workflow
294+
295+
XenoCP includes a [WDL](https://github.com/openwdl/wdl) workflow implementation. This can be run locally or on a supported HPC system. It can also use Docker or Singularity for containerization.
296+
297+
### WDL reference files
298+
299+
As of v1.2, WDL does not support directory inputs. Therefore the reference files provided to the WDL workflow must be compressed (`.tar.gz`) before running. The compressed reference files can be downloaded from [Zenodo](https://zenodo.org/uploads/10162103).
300+
301+
### Running WDL
302+
303+
To run the WDL workflow, you will need a WDL engine. We suggest [miniwdl](https://github.com/chanzuckerberg/miniwdl), though the [Cromwell](https://github.com/broadinstitute/cromwell/) engine should work, but is untested with XenoCP.
304+
305+
After acquiring the reference files for your chosen aligner, you can run the sample data through the WDL workflow with the following command.
306+
307+
```
308+
miniwdl run https://raw.githubusercontent.com/stjude/XenoCP/main/wdl/workflows/xenocp.wdl input_bam=https://github.com/stjude/XenoCP/raw/main/sample_data/input_data/SJRB001_X.subset.bam input_bai=https://github.com/stjude/XenoCP/raw/main/sample_data/input_data/SJRB001_X.subset.bam.bai reference_tar_gz=MGSCv37_bwa.tar.gz aligner='bwa aln'
309+
```
310+
311+
This will run all of the steps on the local machine with Docker. The WDL runner `miniwdl` supports alternative execution modes, such as the [Singularity](https://miniwdl.readthedocs.io/en/latest/runner_backends.html#singularity-beta) container engine, [Slurm](https://github.com/miniwdl-ext/miniwdl-slurm) for batch systems, and [LSF](https://github.com/adthrasher/miniwdl-lsf) for batch systems. Alternative execution modes can be specified using `miniwdl`'s [configuration system](https://miniwdl.readthedocs.io/en/latest/runner_reference.html#configuration).
312+
221313
## Evaluate test data results
222314

223315
If you have [bcftools] and a [GRCh37-lite] reference file, the following will show two variants in the input file.
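
The updated README separates the index location (`index`, a directory) from the file prefix (`ref_db_prefix`) and lists the five files bwa expects next to that prefix. A small pre-flight check along those lines might look like the sketch below. The helper `check_bwa_index` is hypothetical and not part of XenoCP, and the demo uses empty placeholder files in a scratch directory.

```shell
#!/usr/bin/env bash
set -euo pipefail

# check_bwa_index INDEX_DIR PREFIX
# Reports each missing file on stderr and returns non-zero if any of the
# five BWA index files named in the README is absent.
check_bwa_index() {
  local dir="$1" prefix="$2" ext missing=0
  for ext in amb ann bwt pac sa; do
    if [ ! -f "$dir/$prefix.$ext" ]; then
      echo "missing: $dir/$prefix.$ext" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Demo against a scratch directory holding empty placeholder files.
demo="$(mktemp -d)"
trap 'rm -rf "$demo"' EXIT
for ext in amb ann bwt pac sa; do : > "$demo/MGSCv37.fa.$ext"; done

check_bwa_index "$demo" MGSCv37.fa && echo "index complete"
```

Running a check like this before mounting the directory at `/reference` catches a wrong prefix or an incomplete download before the workflow starts.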

RELEASE.md

+1
@@ -3,3 +3,4 @@
 * [ ] Update version in `dx_app/dxapp.json`.
 * [ ] Update `wdl/tools/xenocp.wdl` with version.
 * [ ] Update `wdl/workflows/xenocp.wdl` with version.
+* [ ] Update `build.gradle` with version.

bin/java.sh

+5
@@ -1,5 +1,10 @@
 #!/usr/bin/env bash

+# If the classpath is already set, then delegate directly to java
+if [ "$CLASSPATH" != "" ]; then exec java "$@"; fi
+
+# Otherwise, build an appropriate classpath
+# This section assumes you are running inside the container
 for arg in "$@"; do
 case $arg in
 org.stjude.compbio.*)
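
The new guard at the top of `bin/java.sh` gives the wrapper two modes: use the caller's `CLASSPATH` untouched, or fall through to the container-specific logic. The sketch below isolates that decision so it can be exercised without a JVM. The function `launch_mode` and the sample classpath value are illustrative assumptions, not code from the repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# launch_mode prints which branch a wrapper like bin/java.sh would take.
# It only inspects CLASSPATH; it never starts a JVM.
launch_mode() {
  if [ "${CLASSPATH:-}" != "" ]; then
    echo "delegate"          # caller-supplied classpath: hand off to java as-is
  else
    echo "build-classpath"   # nothing set: fall through to the container logic
  fi
}

( unset CLASSPATH; launch_mode )                  # prints: build-classpath
( CLASSPATH="/opt/xenocp/lib/*"; launch_mode )    # prints: delegate
```

This is also why the README's local `cwltool` command now passes `--preserve-environment CLASSPATH`: the variable has to reach the wrapper for the first branch to be taken.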
