Commit a0742f6

Merge pull request #361 from broadinstitute/development
Release 1.34.0
2 parents 1f0dcf5 + b820bbb commit a0742f6

14 files changed, with 3220 additions and 65 deletions

README.md

+9 -8
@@ -3,7 +3,7 @@
 File Ingest Pipeline for Single Cell Portal
 
 [![Build status](https://img.shields.io/circleci/build/github/broadinstitute/scp-ingest-pipeline.svg)](https://circleci.com/gh/broadinstitute/scp-ingest-pipeline)
-[![Code coverage](https://codecov.io/gh/broadinstitute/scp-ingest-pipeline/branch/master/graph/badge.svg)](https://codecov.io/gh/broadinstitute/scp-ingest-pipeline)
+[![Code coverage](https://codecov.io/gh/broadinstitute/scp-ingest-pipeline/branch/main/graph/badge.svg)](https://codecov.io/gh/broadinstitute/scp-ingest-pipeline)
 
 The SCP Ingest Pipeline is an ETL pipeline for single-cell RNA-seq data.
 
@@ -27,21 +27,21 @@ cd scp-ingest-pipeline
 python3 -m venv env --copies
 source env/bin/activate
 pip install -r requirements.txt
-scripts/setup-mongo-dev.sh <PATH_TO_YOUR_VAULT_TOKEN> # E.g. ~/.github-token
+source scripts/setup-mongo-dev.sh
 ```
 
 ### Docker
 
-With Docker running and Vault active on your local machine, run:
+With Docker running and `gcloud` authenticated on your local machine, run:
 
 ```
-scripts/docker-compose-setup.sh -t <PATH_TO_YOUR_VAULT_TOKEN> # E.g. ~/.github-token
+scripts/docker-compose-setup.sh
 ```
 
 If on Apple silicon Mac (e.g. M1), and performance seems poor, consider generating a docker image using the arm64 base. Example test image: gcr.io/broad-singlecellportal-staging/single-cell-portal:development-2.2.0-arm64, usage:
 
 ```
-scripts/docker-compose-setup.sh -i development-2.2.0-arm64 -t <PATH_TO_YOUR_VAULT_TOKEN>
+scripts/docker-compose-setup.sh -i development-2.2.0-arm64
 ```
 
 To update dependencies when in Docker, you can pip install from within the Docker Bash shell after adjusting your requirements.txt.
@@ -132,10 +132,10 @@ Pro-Tip: For local builds, you can try adding docker build options `--progress=p
 
 ### 2. Set up environment variables
 
-Run the following to pull database-specific secrets out of vault (passing in the path to your vault token):
+Run the following to pull database-specific secrets out of Google Secrets Manager (GSM):
 
 ```
-source scripts/setup-mongo-dev.sh ~/.your-vault-token
+source scripts/setup-mongo-dev.sh
 ```
 
 Now run `env` to make sure you've set the following values:
@@ -152,7 +152,8 @@ DATABASE_HOST=<ip address>
 Run the following to export out your default service account JSON keyfile:
 
 ```
-vault read -format=json secret/kdux/scp/development/$(whoami)/scp_service_account.json | jq .data > /tmp/keyfile.json
+GOOGLE_PROJECT=$(gcloud info --format="value(config.project)")
+gcloud secrets versions access latest --project=$GOOGLE_PROJECT --secret=default-sa-keyfile | jq > /tmp/keyfile.json
 ```
 
 ### 4. Start the Docker container

ingest/anndata_.py

+36 -6
@@ -15,19 +15,19 @@ def _field_template(self, field, precision):
 
 
 try:
-    from ingest_files import IngestFiles
+    from ingest_files import DataArray, IngestFiles
     from expression_files.expression_files import GeneExpression
-    from monitor import log_exception
+    from monitor import log_exception, bypass_mongo_writes
     from validation.validate_metadata import list_duplicates
 except ImportError:
     # Used when importing as external package, e.g. imports in single_cell_portal code
-    from .ingest_files import IngestFiles
+    from .ingest_files import DataArray, IngestFiles
     from .expression_files.expression_files import GeneExpression
-    from .monitor import log_exception
+    from .monitor import log_exception, bypass_mongo_writes
     from .validation.validate_metadata import list_duplicates
 
 
-class AnnDataIngestor(GeneExpression, IngestFiles):
+class AnnDataIngestor(GeneExpression, IngestFiles, DataArray):
     ALLOWED_FILE_TYPES = ['application/x-hdf5']
 
     def __init__(self, file_path, study_file_id, study_id, **kwargs):
@@ -57,6 +57,36 @@ def basic_validation(self):
         except ValueError:
             return False
 
+    def create_cell_data_arrays(self):
+        """Extract cell name DataArray documents for raw data"""
+        adata = self.obtain_adata()
+        cells = list(adata.obs_names)
+        # use filename denoting a raw 'fragment' to allow successful ingest and downstream queries
+        raw_filename = "h5ad_frag.matrix.raw.mtx.gz"
+        data_arrays = []
+        for data_array in GeneExpression.create_data_arrays(
+            name=f"{raw_filename} Cells",
+            array_type="cells",
+            values=cells,
+            linear_data_type="Study",
+            linear_data_id=self.study_file_id,
+            cluster_name=raw_filename,
+            study_file_id=self.study_file_id,
+            study_id=self.study_id
+        ):
+            data_arrays.append(data_array)
+
+        return data_arrays
+
+    def ingest_raw_cells(self):
+        """Insert raw count cells into MongoDB"""
+        arrays = self.create_cell_data_arrays()
+        if not bypass_mongo_writes():
+            self.load(arrays, DataArray.COLLECTION_NAME)
+        else:
+            dev_msg = f"Extracted {len(arrays)} DataArray for {self.study_file_id}:{arrays[0]['name']}"
+            IngestFiles.dev_logger.info(dev_msg)
+
     @staticmethod
     def generate_cluster_header(adata, clustering_name):
         """
@@ -117,7 +147,7 @@ def generate_metadata_file(adata, output_name):
         headers = adata.obs.columns.tolist()
         types = []
         for header in headers:
-            if pd.api.types.is_number(adata.obs[header]):
+            if pd.api.types.is_numeric_dtype(adata.obs[header]):
                 types.append("NUMERIC")
             else:
                 types.append("GROUP")
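
The last hunk above swaps `pd.api.types.is_number` for `pd.api.types.is_numeric_dtype`. A minimal sketch of why that matters, assuming a small hypothetical DataFrame in place of `adata.obs` (the column names and the `infer_types` helper are illustrative and not part of the commit): `is_number` tests whether its argument is a scalar number, so it returns False for any Series, while `is_numeric_dtype` inspects the Series dtype.

```python
import pandas as pd

# Hypothetical stand-in for adata.obs: one numeric column, one string column.
obs = pd.DataFrame(
    {"n_genes": [1200, 850, 943], "cell_type": ["B", "T", "NK"]}
)


def infer_types(frame, check):
    """Label each column NUMERIC or GROUP using the supplied predicate."""
    return ["NUMERIC" if check(frame[col]) else "GROUP" for col in frame.columns]


# Old check: a Series is never a scalar number, so every column becomes GROUP.
old_types = infer_types(obs, pd.api.types.is_number)

# New check: looks at the dtype, so numeric columns are labelled NUMERIC.
new_types = infer_types(obs, pd.api.types.is_numeric_dtype)

print(old_types)
print(new_types)
```

Under these assumptions the old predicate labels both columns `GROUP`, whereas the new one labels `n_genes` as `NUMERIC` and `cell_type` as `GROUP`.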

ingest/expression_files/expression_files.py

+10
@@ -101,6 +101,16 @@ def is_raw_count_file(study_id, study_file_id, client):
         QUERY = {"_id": study_file_id, "study_id": study_id}
 
         study_file_doc = list(client[COLLECTION_NAME].find(QUERY)).pop()
+        # special handling of non-reference AnnData files to always return false
+        # this will allow normal extraction of expression data as raw count cells are already ingested during
+        # the "raw_counts" extract phase
+        if (
+            study_file_doc.get("file_type") == "AnnData" and
+            "ann_data_file_info" in study_file_doc.keys() and
+            not study_file_doc["ann_data_file_info"].get("reference_file")
+        ):
+            return False
+
         # Name of embedded document that holds 'is_raw_count_files is named expression_file_info.
         # If study files does not have document expression_file_info
         # field, "is_raw_count_files", will not exist.:
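
The added guard can be exercised in isolation. The sketch below restates the condition as a standalone predicate over plain dictionaries; the function name `is_non_reference_anndata` and the sample documents are hypothetical, and no MongoDB client is involved. One deliberate difference from the hunk: this version also tolerates an `ann_data_file_info` value of `None`, which the original condition would not.

```python
def is_non_reference_anndata(study_file_doc):
    """Return True for AnnData study files that are not reference files.

    Hypothetical helper mirroring the condition added in the hunk above.
    """
    info = study_file_doc.get("ann_data_file_info")
    return (
        study_file_doc.get("file_type") == "AnnData"
        and info is not None
        and not info.get("reference_file")
    )


# Hypothetical documents shaped like the fields the hunk reads.
non_reference = {"file_type": "AnnData", "ann_data_file_info": {"reference_file": False}}
reference = {"file_type": "AnnData", "ann_data_file_info": {"reference_file": True}}
dense_matrix = {"file_type": "Expression Matrix"}

print(is_non_reference_anndata(non_reference))
print(is_non_reference_anndata(reference))
print(is_non_reference_anndata(dense_matrix))
```

Only the first document satisfies the guard, which is the case where `is_raw_count_file` now returns False early.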

ingest/ingest_pipeline.py

+6
@@ -54,6 +54,9 @@
 # Ingest AnnData - happy path processed expression data only extraction
 python ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 ingest_anndata --ingest-anndata --anndata-file ../tests/data/anndata/trimmed_compliant_pbmc3K.h5ad --extract "['processed_expression']"
 
+# Ingest AnnData - happy path raw count cell name only extraction
+python ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 ingest_anndata --ingest-anndata --anndata-file ../tests/data/anndata/trimmed_compliant_pbmc3K.h5ad --extract "['raw_counts']"
+
 # Ingest AnnData - happy path cluster and metadata extraction
 python ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 ingest_anndata --ingest-anndata --anndata-file ../tests/data/anndata/trimmed_compliant_pbmc3K.h5ad --extract "['cluster', 'metadata']" --obsm-keys "['X_umap','X_tsne']"
 
@@ -537,6 +540,9 @@ def extract_from_anndata(self):
                     "extract"
                 ):
                     self.anndata.generate_processed_matrix(self.anndata.adata)
+
+                if self.kwargs.get('extract') and "raw_counts" in self.kwargs.get('extract'):
+                    self.anndata.ingest_raw_cells()
                 self.report_validation("success")
                 return 0
         # scanpy unable to open AnnData file
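
The new branch runs only when `raw_counts` is among the requested extract types. A minimal sketch of that membership test, assuming the `--extract` value may arrive either as a list or as the string literal shown in the usage examples; how the pipeline actually parses the argument is not shown in this diff, and `should_ingest_raw_cells` is a hypothetical helper:

```python
import ast


def should_ingest_raw_cells(extract):
    """Return True when 'raw_counts' is among the requested extract types.

    Accepts a list, or a string such as "['raw_counts']" (assumed CLI form).
    """
    if not extract:
        return False
    if isinstance(extract, str):
        # Safely turn the list literal into a real Python list.
        extract = ast.literal_eval(extract)
    return "raw_counts" in extract


print(should_ingest_raw_cells("['raw_counts']"))
print(should_ingest_raw_cells("['cluster', 'metadata']"))
print(should_ingest_raw_cells(None))
```

Parsing to a list first makes the check an exact element match. Applied directly to an unparsed string, `in` would be a substring test and could also match longer names that merely contain `raw_counts`.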

ingest/monitor.py

+2 -2
@@ -129,13 +129,13 @@ def integrate_sentry():
     See also: links to Sentry resources atop this module
     """
 
-    # Ultimately stored in Vault, passed in as environmen variable to PAPI
+    # Ultimately stored in GSM, passed in as environment variable to PAPI
     sentry_DSN = os.environ.get("SENTRY_DSN")
 
     if sentry_DSN is None:
         # Don't log to Sentry unless its DSN is set.
         # This disables Sentry logging in development and test (i.e.,
-        # environments without a SENTRY_DSN in their scp_config vault secret).
+        # environments without a SENTRY_DSN in their scp-config-json GSM secret).
         return
 
     sentry_logging = LoggingIntegration(

scripts/docker-compose-setup.sh

+2 -12
@@ -10,23 +10,18 @@ usage=$(
 cat <<EOF
 $0 [OPTION]
 -i Set URL for GCR image; helpful if not using latest development
--t Set GitHub Vault token (e.g. ~/.github-token)
 -h print this text
 EOF
 )
 
 GCR_IMAGE=""
 VAULT_TOKEN_PATH=""
-while getopts "i:t:h" OPTION; do
+while getopts "i:h" OPTION; do
 case $OPTION in
   i)
     echo "### SETTING GCR IMAGE ###"
     export GCR_IMAGE="$OPTARG"
     ;;
-  t)
-    echo "### SETTING VAULT TOKEN ###"
-    VAULT_TOKEN_PATH="$OPTARG"
-    ;;
   h)
     echo "$usage"
     exit 0
@@ -45,13 +40,8 @@ if [[ $GCR_IMAGE = "" ]]; then
   export GCR_IMAGE="${IMAGE_NAME}:${LATEST_TAG}"
 fi
 
-if [[ $VAULT_TOKEN_PATH = "" ]]; then
-  echo "Did not provide VAULT_TOKEN_PATH"
-  exit 1
-fi
-
 echo "### SETTING UP ENVIRONMENT ###"
-./scripts/ingest-local-setup.sh $VAULT_TOKEN_PATH
+./scripts/ingest-local-setup.sh
 
 docker pull $GCR_IMAGE
 
scripts/ingest-local-setup.sh

+3 -14
@@ -4,25 +4,14 @@
 #
 # Keep "Dev env vars" synced with `setup-mongo-dev.sh`
 
-VAULT_TOKEN_PATH="$1"
-if [[ -z "$VAULT_TOKEN_PATH" ]]
-then
-  echo "You must provide a path to a GitHub token to proceed, e.g. ~/.github-token"
-  exit 1
-fi
-vault login -method=github token=$(cat $VAULT_TOKEN_PATH)
-if [[ $? -ne 0 ]]
-then
-  echo "Unable to authenticate into Vault"
-  exit 1
-fi
+GOOGLE_PROJECT=$(gcloud info --format="value(config.project)")
 
 # Dev env vars
 BROAD_USER=`whoami`
 MONGODB_USERNAME='single_cell'
 DATABASE_NAME='single_cell_portal_development'
-MONGODB_PASSWORD=`vault read secret/kdux/scp/development/$BROAD_USER/mongo/user | grep password | awk '{ print $2 }' `
-DATABASE_HOST=`vault read secret/kdux/scp/development/$BROAD_USER/mongo/hostname | grep ip | awk '{ $2=substr($2,2,length($2)-2); print $2 }' `
+MONGODB_PASSWORD=$(gcloud secrets versions access latest --project=$GOOGLE_PROJECT --secret=mongo-user | jq .password)
+DATABASE_HOST=$(gcloud secrets versions access latest --project=$GOOGLE_PROJECT --secret=mongo-hostname| jq -r '.ip[0]')
 BYPASS_MONGO_WRITES='yes'
 BARD_HOST_URL="https://terra-bard-dev.appspot.com"
 
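
The two `jq` filters above differ in one detail: `DATABASE_HOST` uses `-r` (raw output) while `MONGODB_PASSWORD` does not, so the password value keeps its surrounding JSON quotes. The sketch below reproduces both behaviours with Python's `json` module. The secret payloads are hypothetical, shaped only by the `.password` and `.ip[0]` filters in the script, since the real secret contents are not part of this commit.

```python
import json

# Hypothetical payloads; field names are inferred from the jq filters above.
mongo_user_secret = '{"username": "single_cell", "password": "s3cret"}'
mongo_hostname_secret = '{"ip": ["10.0.0.5"]}'

# Equivalent of `jq -r .password`: the bare string value.
password_raw = json.loads(mongo_user_secret)["password"]

# Equivalent of `jq .password` (no -r): the JSON-encoded string, quotes included.
password_json = json.dumps(password_raw)

# Equivalent of `jq -r '.ip[0]'`: first element of the ip array, unquoted.
host = json.loads(mongo_hostname_secret)["ip"][0]

print(password_raw)
print(password_json)
print(host)
```

Whether the quoted form is intended depends on how `MONGODB_PASSWORD` is consumed downstream, which this diff does not show.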
