Parallelization of build_all.py #175

Merged: 52 commits, May 31, 2024
Commits
1ed02b9
Docker compose integrated for parallel docker builds.
jjacobson95 May 10, 2024
998e39f
Local Build Script Optimized for speed.
jjacobson95 May 10, 2024
82e0a67
updated cptac version
jjacobson95 May 10, 2024
9bb08ad
simulation command time updated to be non-rand
jjacobson95 May 10, 2024
5138871
added docker compose print statement
jjacobson95 May 10, 2024
8366aa5
fixed issue when trying to run any method besides --all
jjacobson95 May 13, 2024
57778ba
beataml typo fix.
jjacobson95 May 13, 2024
e0aef45
For some reason, the arguments are parsed in a weird way for this scr…
jjacobson95 May 13, 2024
fac64d6
Fix to early end of script
jjacobson95 May 13, 2024
f672149
essentially added caching for samples files
jjacobson95 May 13, 2024
01f8cde
reverting a previous change that was specific to a unique scenario
jjacobson95 May 13, 2024
ea0e9df
Adding checker for drug files. We need a better method than this
jjacobson95 May 13, 2024
b98efe6
Added logic for gene files
jjacobson95 May 13, 2024
a4fca49
added genes to the docker compose file
jjacobson95 May 13, 2024
4e2b90c
Reduced some parallelism due to memory issues
jjacobson95 May 13, 2024
81c3e00
working on figshare and pypi upload
jjacobson95 May 14, 2024
33e1188
update move logic
jjacobson95 May 14, 2024
e6ca2f9
update move logic
jjacobson95 May 14, 2024
a911583
bug fix. file re-org should be working now
jjacobson95 May 14, 2024
ff7d106
added upload docker file
jjacobson95 May 14, 2024
6c3e744
working on sed command for versioning
jjacobson95 May 14, 2024
90608d6
working on sed command for versioning pt2
jjacobson95 May 14, 2024
cd9da5d
working on sed command for versioning pt3
jjacobson95 May 14, 2024
984bc9d
still working on sed command. Moved to python script instead
jjacobson95 May 14, 2024
d140ac8
still working on sed command. Moved to python script instead 2
jjacobson95 May 14, 2024
2d62425
working on figshare command
jjacobson95 May 14, 2024
5e6603f
working on figshare command 2
jjacobson95 May 14, 2024
ff054f2
updating docker to work on this branch
jjacobson95 May 14, 2024
4eb2aeb
reverting version.
jjacobson95 May 14, 2024
bddff32
Debugging beataml Drug file
jjacobson95 May 14, 2024
8ad9468
New approach to the Figshare / PyPI push
jjacobson95 May 14, 2024
1eeefb5
ensured that figshare file would not get compressed
jjacobson95 May 14, 2024
92e2e4c
update syntax of setup.py during edit step
jjacobson95 May 14, 2024
997f98d
Integrating schema checker
jjacobson95 May 14, 2024
ec2512d
Added check_all_schemas.py
jjacobson95 May 14, 2024
750d85d
small update
jjacobson95 May 14, 2024
e5bc3f7
Working on schema checker
jjacobson95 May 14, 2024
ecb80e7
fixed schema
jjacobson95 May 14, 2024
713525b
cleaning up script a bit
jjacobson95 May 14, 2024
a2469b8
Bug fixes. Added high_mem option for users.
jjacobson95 May 21, 2024
6545b71
Pulled updates from main to this branch
jjacobson95 May 21, 2024
ab08bf2
Adding MPLCONFIGDIR to broad_sanger Dockerfiles.
jjacobson95 May 21, 2024
36ab819
Added more robust checks / validations between steps
jjacobson95 May 22, 2024
fe3cd6b
check_all_schemas.py now run in parallel
jjacobson95 May 24, 2024
aca012b
Update check_all_schemas.py
jjacobson95 May 24, 2024
d7cb910
Merge branch 'docker-build-multi' of https://github.com/PNNL-CompBio/…
jjacobson95 May 24, 2024
2541a97
clean up
jjacobson95 May 24, 2024
579cbb0
Update README.md
jjacobson95 May 24, 2024
5cbe991
Merge remote-tracking branch 'origin/main' into docker-build-multi
jjacobson95 May 29, 2024
8ecff40
updated hcmi dockerfile
jjacobson95 May 29, 2024
097afc1
small bug fix when running full pipeline with all options
jjacobson95 May 30, 2024
01d1b8c
Updating All site pages and links and counts and tables with latest data
jjacobson95 May 31, 2024
4 changes: 1 addition & 3 deletions .github/workflows/main.yml
@@ -3,9 +3,7 @@ name: CI
 on:
   push:
     branches:
-      - builder_branch_JJ
-      - docs_update_4_5_24
-      - doc_update_4_23_24
+      - docker-build-multi
   # Allows you to run this workflow manually from the Actions tab
   workflow_dispatch:

29 changes: 29 additions & 0 deletions build/README.md
@@ -2,6 +2,35 @@

All data collected for this package has been collated from stable/reproducible sources using the scripts contained here. The figure below shows a brief description of the process, which is designed to be run serially, as new identifiers are generated as data are added.

## build_all.py script

This script initializes all Docker containers, builds all datasets, validates them, and uploads them to Figshare and PyPI.

It requires the following authorization tokens to be set in the local environment depending on the use case:
- `SYNAPSE_AUTH_TOKEN`: Required for beataml and mpnst datasets. Join the [CoderData team](https://www.synapse.org/#!Team:3503472) on Synapse and generate an access token.
- `PYPI_TOKEN`: This token is required to upload to PyPI.
- `FIGSHARE_TOKEN`: This token is required to upload to Figshare.
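
A pre-flight check can catch a missing token before a long build starts. The following is a minimal sketch, assuming the step-to-token mapping described in the list above; `REQUIRED_TOKENS` and `missing_tokens` are hypothetical names and are not part of `build_all.py`.

```python
import os

# Hypothetical mapping (an assumption for illustration): which steps or
# datasets need which environment variable, following the list above.
REQUIRED_TOKENS = {
    "beataml": ["SYNAPSE_AUTH_TOKEN"],
    "mpnst": ["SYNAPSE_AUTH_TOKEN"],
    "pypi": ["PYPI_TOKEN"],
    "figshare": ["FIGSHARE_TOKEN"],
}


def missing_tokens(steps, env=None):
    """Return the sorted names of required tokens that are unset or empty."""
    env = os.environ if env is None else env
    needed = {tok for step in steps for tok in REQUIRED_TOKENS.get(step, [])}
    return sorted(tok for tok in needed if not env.get(tok))


if __name__ == "__main__":
    absent = missing_tokens(["beataml", "pypi", "figshare"])
    if absent:
        print("Missing environment variables:", ", ".join(absent))
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.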

Available arguments:

- `--docker`: Initializes and builds all docker containers.
- `--samples`: Processes and builds the sample data files.
- `--omics`: Processes and builds the omics data files.
- `--drugs`: Processes and builds the drug data files.
- `--exp`: Processes and builds the experiment data files.
- `--all`: Executes all available processes above (docker, samples, omics, drugs, exp).
- `--validate`: Validates the generated datasets using the schema check scripts.
- `--figshare`: Uploads the datasets to Figshare.
- `--pypi`: Uploads the package to PyPI.
- `--high_mem`: Utilizes high memory mode for concurrent data processing.
- `--dataset`: Specifies the datasets to process (default='broad_sanger,hcmi,beataml,mpnst,cptac').
- `--version`: Specifies the version number for the package and data upload title. Required when uploading to Figshare or PyPI.
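
The flags listed above could be declared with `argparse` roughly as follows. This is a sketch only: the flag names and the `--dataset` default come from the list, but the actual parser in `build_all.py` is not shown here, and `build_parser` and `parse` are hypothetical helpers.

```python
import argparse

BOOLEAN_FLAGS = ("docker", "samples", "omics", "drugs", "exp", "all",
                 "validate", "figshare", "pypi", "high_mem")


def build_parser():
    """Declare the documented flags (an illustrative reconstruction)."""
    parser = argparse.ArgumentParser(description="Sketch of the build_all.py CLI")
    for flag in BOOLEAN_FLAGS:
        parser.add_argument(f"--{flag}", action="store_true")
    parser.add_argument("--dataset",
                        default="broad_sanger,hcmi,beataml,mpnst,cptac")
    parser.add_argument("--version", default=None)
    return parser


def parse(argv):
    """Parse argv and enforce that uploads carry a version number."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if (args.figshare or args.pypi) and not args.version:
        # parser.error prints a usage message and exits with status 2
        parser.error("--version is required with --figshare or --pypi")
    args.datasets = args.dataset.split(",")
    return args
```

Checking the `--version` requirement at parse time means an upload cannot fail for that reason after the datasets have already been built.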

Example usage:
```bash
python build/build_all.py --all --high_mem --validate --pypi --figshare --version 0.1.29
```
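
How `--high_mem` changes scheduling is not shown in this excerpt. One plausible reading is that dataset builds run concurrently only when memory allows. The sketch below illustrates that idea with a thread pool; `run_builds` and `build_one` are hypothetical names, not functions from `build_all.py`.

```python
from concurrent.futures import ThreadPoolExecutor


def run_builds(datasets, build_one, high_mem=False):
    """Call build_one(dataset) for every dataset and return {dataset: result}.

    Assumption for illustration: with high_mem all datasets run at once,
    otherwise a single worker processes them one after another.
    """
    workers = len(datasets) if high_mem else 1
    with ThreadPoolExecutor(max_workers=max(workers, 1)) as pool:
        # pool.map yields results in the same order as the inputs
        return dict(zip(datasets, pool.map(build_one, datasets)))


if __name__ == "__main__":
    names = "broad_sanger,hcmi,beataml,mpnst,cptac".split(",")
    print(run_builds(names, lambda name: f"built {name}", high_mem=True))
```

Threads are sufficient when each build mostly waits on an external process such as a container run; CPU-bound work in pure Python would call for processes instead.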

### Directory structure

We have created a separate directory with scripts that collect data from distinct sources as described below.
5 changes: 4 additions & 1 deletion build/beatAML/GetBeatAML.py
@@ -259,6 +259,9 @@ def merge_drug_info(d_df,drug_map):
     #print(drug_map)
     #print(d_df.columns)
     #print(d_df)
+    print(d_df['isoSMILES'].dtype, drug_map['isoSMILES'].dtype)
+    d_df['isoSMILES'] = d_df['isoSMILES'].astype(str)
+    drug_map['isoSMILES'] = drug_map['isoSMILES'].astype(str)
     result_df = d_df.merge(drug_map[['isoSMILES', 'improve_drug_id']], on='isoSMILES', how='left')
     return result_df
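
The added casts suggest the two `isoSMILES` columns could arrive with different dtypes. pandas typically refuses to merge an object column with a float64 one, and a column that is entirely missing is read as float64. The sketch below reproduces the cast-then-merge pattern on toy data; `merge_on_smiles` is a hypothetical helper and the sample values are invented for illustration.

```python
import numpy as np
import pandas as pd


def merge_on_smiles(d_df, drug_map):
    """Left-merge improve_drug_id onto d_df after casting both keys to str."""
    left = d_df.copy()  # copy so the caller's frames are not modified
    right = drug_map[["isoSMILES", "improve_drug_id"]].copy()
    left["isoSMILES"] = left["isoSMILES"].astype(str)
    right["isoSMILES"] = right["isoSMILES"].astype(str)
    return left.merge(right, on="isoSMILES", how="left")


if __name__ == "__main__":
    drug_map = pd.DataFrame({"isoSMILES": ["CCO"], "improve_drug_id": ["SMI_1"]})
    # An all-missing key column is float64, which is the mismatch case
    all_missing = pd.DataFrame({"isoSMILES": [np.nan, np.nan], "dose": [1, 2]})
    print(all_missing["isoSMILES"].dtype, drug_map["isoSMILES"].dtype)
    print(merge_on_smiles(all_missing, drug_map))
```

One caveat: in many pandas versions `astype(str)` turns missing values into the literal string `nan`, so missing keys on both sides could match each other. Dropping or masking missing keys before the merge avoids that.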

@@ -607,7 +610,7 @@ def generate_drug_list(drug_map_path,drug_path):
     if args.samples:
         if args.prevSamples is None or args.prevSamples=='':
             print("Cannot run sample file generation without previous samples")
-            edit()
+            exit()
         else:
             print("Only running Samples File Generation")
             prev_samples_path = args.prevSamples
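
This hunk swaps `edit()` for `exit()`. As far as the diff shows, `edit` is not defined, so the old line would have raised a `NameError` instead of stopping cleanly. In scripts, `sys.exit` is often preferred over the interactive `exit` helper because it also sets a non-zero status. A minimal sketch of that variant follows; `require_prev_samples` is a hypothetical helper, not code from `GetBeatAML.py`.

```python
import sys


def require_prev_samples(prev_samples):
    """Return the previous-samples path, or stop the script if it is missing."""
    if prev_samples is None or prev_samples == "":
        # sys.exit raises SystemExit; a string argument is printed to stderr
        # and the process exit status becomes 1
        sys.exit("Cannot run sample file generation without previous samples")
    return prev_samples
```

A non-zero exit status lets a calling pipeline detect the failure rather than continue with missing inputs.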