Prep/Artifact human filtering #102
Closed
8 commits
8513628 init changes (antgonza)
af8a225 Merge branch 'main' of https://github.com/qiita-spots/qp-knight-lab-p… (antgonza)
20eb229 adding prep_NuQCJob (antgonza)
8559557 addressing some of @wasade comments (antgonza)
c3498c4 add traceback (antgonza)
4567a79 error-traceback.err (antgonza)
5bdbb67 making some progress (antgonza)
73bb155 fix test (antgonza)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,18 @@ | ||
from functools import partial | ||
import sample_sheet | ||
import pandas as pd | ||
from os.path import basename, join | ||
from os import symlink, makedirs | ||
from datetime import datetime | ||
from metapool import MetagenomicSampleSheetv90 | ||
from pathlib import Path | ||
|
||
from .Protocol import Illumina | ||
from sequence_processing_pipeline.Pipeline import Pipeline | ||
from .Assays import Metagenomic | ||
from .Assays import ASSAY_NAME_METAGENOMIC | ||
from .FailedSamplesRecord import FailedSamplesRecord | ||
-from .Workflows import Workflow | ||
+from .Workflows import Workflow, WorkflowError | ||
|
||
|
||
class StandardMetagenomicWorkflow(Workflow, Metagenomic, Illumina): | ||
|
@@ -49,3 +58,130 @@ def __init__(self, **kwargs): | |
"type bool") | ||
|
||
self.update = kwargs['update_qiita'] | ||
|
||
|
||
class PrepNuQC(StandardMetagenomicWorkflow): | ||
def __init__(self, **kwargs): | ||
qclient = kwargs['qclient'] | ||
job_id = kwargs['job_id'] | ||
parameters = kwargs['parameters'] | ||
out_dir = kwargs['out_dir'] | ||
config_fp = kwargs['config_fp'] | ||
status_line = kwargs['status_line'] | ||
|
||
out_path = partial(join, out_dir) | ||
self.final_results_path = out_path('final_results') | ||
makedirs(self.final_results_path, exist_ok=True) | ||
|
||
pid = parameters.pop('prep_id') | ||
|
||
prep_info = qclient.get(f'/qiita_db/prep_template/{pid}/') | ||
dt = prep_info['data_type'] | ||
sid = prep_info['study'] | ||
if dt not in {'Metagenomic', 'Metatranscriptomic'}: | ||
raise WorkflowError(f'Prep {pid} has a not valid data type: {dt}') | ||
aid = prep_info['artifact'] | ||
if not str(aid).isnumeric(): | ||
raise WorkflowError(f'Prep {pid} has a not valid artifact: {aid}') | ||
|
||
files, pt = qclient.artifact_and_preparation_files(aid) | ||
html_summary = qclient.get_artifact_html_summary(aid) | ||
if html_summary is None: | ||
raise WorkflowError(f'Artifact {aid} does not have a summary, ' | ||
'please generate one.') | ||
df_summary = pd.read_html(html_summary)[0] | ||
pt.set_index('sample_name', inplace=True) | ||
|
||
project_name = f'qiita-{pid}-{aid}_{sid}' | ||
|
||
sheet = MetagenomicSampleSheetv90() | ||
sheet.Header['IEMFileVersion'] = '4' | ||
sheet.Header['Date'] = datetime.today().strftime('%m/%d/%y') | ||
sheet.Header['Workflow'] = 'GenerateFASTQ' | ||
sheet.Header['Application'] = 'FASTQ Only' | ||
sheet.Header['Assay'] = prep_info['data_type'] | ||
sheet.Header['Description'] = f'prep_NuQCJob - {pid}' | ||
sheet.Header['Chemistry'] = 'Default' | ||
sheet.Header['SheetType'] = 'standard_metag' | ||
sheet.Header['SheetVersion'] = '90' | ||
sheet.Header['Investigator Name'] = 'Qiita' | ||
sheet.Header['Experiment Name'] = project_name | ||
|
||
sheet.Bioinformatics = pd.DataFrame( | ||
columns=['Sample_Project', 'ForwardAdapter', 'ReverseAdapter', | ||
'library_construction_protocol', | ||
'experiment_design_description', | ||
'PolyGTrimming', 'HumanFiltering', 'QiitaID'], | ||
data=[[project_name, 'NA', 'NA', 'NA', 'NA', | ||
'FALSE', 'TRUE', sid]]) | ||
|
||
df_summary = df_summary[df_summary.file_type == 'raw_forward_seqs'] | ||
data = [] | ||
for k, vals in pt.iterrows(): | ||
k = k.split('.', 1)[-1] | ||
rp = vals['run_prefix'] | ||
sample = { | ||
'Sample_Name': k, | ||
'Sample_ID': k.replace('.', '_'), | ||
'Sample_Plate': '', | ||
'well_id_384': '', | ||
'I7_Index_ID': '', | ||
'index': vals['index'], | ||
'I5_Index_ID': '', | ||
'index2': vals['index2'], | ||
'Sample_Project': project_name, | ||
'Well_description': '', | ||
'Sample_Well': '', | ||
'Lane': '1'} | ||
sheet.add_sample(sample_sheet.Sample(sample)) | ||
_d = df_summary[ | ||
df_summary.filename.str.startswith(rp)] | ||
if _d.shape[0] != 1: | ||
raise ValueError(f'The run_prefix {rp} from {k} has {_d.shape[0]} ' | ||
'matches with files') | ||
data.append({ | ||
'Lane': '1', 'SampleID': rp, 'Sample_Project': project_name, | ||
'Index': vals['index'], '# Reads': _d.reads.values[0]}) | ||
|
||
sheet.Contact = pd.DataFrame( | ||
columns=['Email', 'Sample_Project'], | ||
data=[['[email protected]', project_name]]) | ||
|
||
new_sample_sheet = out_path('sample-sheet.csv') | ||
with open(new_sample_sheet, 'w') as f: | ||
sheet.write(f, 1) | ||
|
||
# now that we have a sample_sheet we can fake the | ||
# ConvertJob folder so we are ready for the restart | ||
convert_path = out_path('ConvertJob') | ||
project_folder = out_path('ConvertJob', project_name) | ||
makedirs(project_folder, exist_ok=True) | ||
# creating Demultiplex_Stats.csv | ||
reports_folder = out_path('ConvertJob', 'Reports') | ||
makedirs(reports_folder, exist_ok=True) | ||
pd.DataFrame(data).set_index('SampleID').to_csv( | ||
f'{reports_folder}/Demultiplex_Stats.csv') | ||
|
||
for fs in files.values(): | ||
for f in fs: | ||
bn = basename(f['filepath']).replace( | ||
'.trimmed.fastq.gz', '.fastq.gz') | ||
symlink(f['filepath'], f'{project_folder}/{bn}') | ||
|
||
# create job_completed file to skip this step | ||
Path(f'{convert_path}/job_completed').touch() | ||
|
||
kwargs = {'qclient': qclient, | ||
'uif_path': new_sample_sheet, | ||
'lane_number': "1", | ||
'config_fp': config_fp, | ||
'run_identifier': '250225_LH00444_0301_B22N7T2LT4', | ||
'output_dir': out_dir, | ||
'job_id': job_id, | ||
'status_update_callback': status_line.update_job_status, | ||
# set 'update_qiita' to False to avoid updating Qiita DB | ||
# and copying files into uploads dir. Useful for testing. | ||
'update_qiita': True, | ||
'is_restart': True} | ||
|
||
super().__init__(**kwargs) |
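The per-sample loop in the diff above derives the sample sheet names from the Qiita sample name and expects exactly one forward-read file per run_prefix. Below is a minimal standalone sketch of those two steps. The helper names `sheet_names` and `match_run_prefix` are hypothetical and not part of this PR, a plain list of filenames stands in for the pandas summary table, and the sample name in the usage note is made up for illustration.

```python
def sheet_names(qiita_sample_name):
    """Return (Sample_Name, Sample_ID) for a Qiita sample name.

    Mirrors the loop in the diff: drop everything up to the first '.'
    (the study id prefix), then replace the remaining dots with
    underscores for Sample_ID.
    """
    name = qiita_sample_name.split('.', 1)[-1]
    return name, name.replace('.', '_')


def match_run_prefix(run_prefix, filenames):
    """Return the single filename that starts with run_prefix.

    Raises ValueError when zero or several files match, so a bad
    run_prefix stops the job instead of being silently ignored.
    """
    matches = [fn for fn in filenames if fn.startswith(run_prefix)]
    if len(matches) != 1:
        raise ValueError(
            f'The run_prefix {run_prefix} has {len(matches)} '
            'matches with files')
    return matches[0]
```

For example, `sheet_names('10317.sample.1')` gives `('sample.1', 'sample_1')`. With the two forward-read filenames from the test summary table in this PR, `match_run_prefix('S22205_S104', ...)` returns the first file, while an ambiguous prefix such as `'S2'` raises.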
1 change: 1 addition & 0 deletions
qp_klp/tests/data/250225_LH00444_0301_B22N7T2LT4/RTAComplete.txt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
this is a test |
16 changes: 16 additions & 0 deletions
qp_klp/tests/data/250225_LH00444_0301_B22N7T2LT4/RunInfo.xml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
<?xml version="1.0"?> | ||
<RunInfo xmlns:xsd="http://www.w3.org/2001/XMLSchema" | ||
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" Version="2"> | ||
<Run Id="170523_M09999_0010_000000000-XXXXX" Number="10"> | ||
<Flowcell>000000000-XXXXX</Flowcell> | ||
<Instrument>M09999</Instrument> | ||
<Date>170523</Date> | ||
<Reads> | ||
<Read NumCycles="151" Number="1" IsIndexedRead="N" /> | ||
<Read NumCycles="8" Number="2" IsIndexedRead="Y" /> | ||
<Read NumCycles="8" Number="3" IsIndexedRead="Y" /> | ||
<Read NumCycles="151" Number="4" IsIndexedRead="N" /> | ||
</Reads> | ||
<FlowcellLayout LaneCount="1" SurfaceCount="2" SwathCount="1" TileCount="14" /> | ||
</Run> | ||
</RunInfo> |
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
<table border="1" class="dataframe"> | ||
<thead> | ||
<tr style="text-align: right;"> | ||
<th>filename</th> | ||
<th>md5</th> | ||
<th>file_type</th> | ||
<th>reads</th> | ||
</tr> | ||
</thead> | ||
<tbody> | ||
<tr> | ||
<td>S22205_S104_L001_R1_001.fastq.gz</td> | ||
<td>9dcfb0c77674fdada176262963196db0</td> | ||
<td>raw_forward_seqs</td> | ||
<td>1000000</td> | ||
</tr> | ||
<tr> | ||
<td>S22282_S102_L001_R1_001.fastq.gz</td> | ||
<td>9dcfb0c77674fdada176262963196db0</td> | ||
<td>raw_forward_seqs</td> | ||
<td>1000000</td> | ||
</tr> | ||
<tr> | ||
<td>S22205_S104_L001_R2_001.fastq.gz</td> | ||
<td>9dcfb0c77674fdada176262963196db0</td> | ||
<td>raw_reverse_seqs</td> | ||
<td>1000000</td> | ||
</tr> | ||
<tr> | ||
<td>S22282_S102_L001_R2_001.fastq.gz</td> | ||
<td>9dcfb0c77674fdada176262963196db0</td> | ||
<td>raw_reverse_seqs</td> | ||
<td>1000000</td> | ||
</tr> | ||
</tbody> | ||
</table> |
The traceback is lost here; see this SO post. I think the intent is for `str(e)` to instead be `traceback.format_exc()`, right? If so, then `import traceback` is also needed.
AFAIK, the intent is completely the opposite: just raise/report the error in the GUI with minimal information for the user. I think that way, if it is something obvious and handled, like "sample sheet has wrong OverwriteCycles value", that's what's shown, but if it is something less obvious, users will need to contact the admins/devs to investigate. FWIW, this has been a useful interaction for this specific plugin: wet/dry-lab interactions.
How do the devs know what line in the codebase is throwing the exception without the traceback?
Good question! Each step is reported in the job's step in the db and as a new folder in the working directory; each folder has its own logs and details. In other words, via the job's "step" and the last folder written.
But the developer will not know the exact line of code raising the exception. Won't that require the developer to then guess or perform a much more time-consuming debugging process to determine what specifically failed?
I see what you are saying, but in our experience so far that's not the case. However, we might be missing something, so I'll change it and we can revert if users get too annoyed.
Thanks! For users, the traceback could either be post-processed, or an additional item in the tuple could be returned (the original `str(e)`).
Yeah, decided to write a log in the outdir so devs can see the full traceback but keep it simple ("str(e)") for users.
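The compromise reached in this thread (full traceback in a log for devs, plain `str(e)` for users) could look roughly like the sketch below. This is not the code merged in the PR: the function name `run_step` and the `(success, message)` return shape are assumptions for illustration, and the log file name is only borrowed from the "error-traceback.err" commit message.

```python
import traceback
from pathlib import Path


def run_step(step, out_dir):
    """Run `step` and return (success, user_message).

    On failure, the full traceback is written to a log file inside
    out_dir for developers, while only str(e) is returned so the GUI
    can show a short message to the user.
    """
    try:
        step()
    except Exception as e:
        # Assumed file name, taken from the commit message in this PR.
        log_fp = Path(out_dir) / 'error-traceback.err'
        # format_exc() must be called inside the except block, while
        # the exception is still being handled.
        log_fp.write_text(traceback.format_exc())
        return False, str(e)
    return True, ''
```

A caller would surface the second tuple item in the GUI and point admins at the log file in the job's output directory when the short message is not enough.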