-
Notifications
You must be signed in to change notification settings - Fork 4
Prep/Artifact human filtering #102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! It seems like some of the logic here would be better placed in the sample sheet object, and it would be valuable to have a test asserting successful execution
qp_klp/klp.py
Outdated
sheet = MetagenomicSampleSheetv90() | ||
sheet.Header['IEMFileVersion'] = '4' | ||
sheet.Header['Date'] = datetime.today().strftime('%m/%d/%y') | ||
sheet.Header['Workflow'] = 'GenerateFASTQ' | ||
sheet.Header['Application'] = 'FASTQ Only' | ||
sheet.Header['Assay'] = prep_info['data_type'] | ||
sheet.Header['Description'] = f'prep_NuQCJob - {pid}' | ||
sheet.Header['Chemistry'] = 'Default' | ||
sheet.Header['SheetType'] = 'standard_metag' | ||
sheet.Header['SheetVersion'] = '90' | ||
sheet.Header['Investigator Name'] = 'Qiita' | ||
sheet.Header['Experiment Name'] = project_name | ||
|
||
sheet.Bioinformatics = pd.DataFrame( | ||
columns=['Sample_Project', 'ForwardAdapter', 'ReverseAdapter', | ||
'library_construction_protocol', | ||
'experiment_design_description', | ||
'PolyGTrimming', 'HumanFiltering', 'QiitaID'], | ||
data=[[project_name, 'NA', 'NA', 'NA', 'NA', 'FALSE', 'TRUE', sid]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason this stuff isn't done by the MetagenomicSampleSheetv90
constructor?
qp_klp/klp.py
Outdated
for k, vals in pt.iterrows(): | ||
k = k.split('.', 1)[-1] | ||
sample = { | ||
'Sample_Name': k, | ||
'Sample_ID': k.replace('.', '_'), | ||
'Sample_Plate': '', | ||
'well_id_384': '', | ||
'I7_Index_ID': '', | ||
'index': vals['index'], | ||
'I5_Index_ID': '', | ||
'index2': vals['index2'], | ||
'Sample_Project': project_name, | ||
'Well_description': '', | ||
'Sample_Well': '', | ||
'Lane': '1'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like a lot of what's here are attributes specific to the sample sheet. Is there a reason this is done outside of the sample sheet, rather than controlled by the sample sheet?
qp_klp/klp.py
Outdated
sheet.Contact = pd.DataFrame( | ||
columns=['Email', 'Sample_Project'], | ||
data=[['[email protected]', project_name]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. Or this perhaps could be sheet.set_contact(email, project_name)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just want to make sure I'm not missing something obvious but are you sure this method exists or are you asking why not put it together?
FWIW, neither the original sample_sheet or the augmented version metapool provides it:
In [1]: import sample_sheet
In [2]: sample_sheet.set_contact
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[2], line 1
----> 1 sample_sheet.set_contact
AttributeError: module 'sample_sheet' has no attribute 'set_contact'
In [3]: from metapool import MetagenomicSampleSheetv90
In [4]: MetagenomicSampleSheetv90.set_contact
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[4], line 1
----> 1 MetagenomicSampleSheetv90.set_contact
AttributeError: type object 'MetagenomicSampleSheetv90' has no attribute 'set_contact'
Now, just to be clear, the idea of this code is to quickly create a "valid" sample_sheet from a loaded prep in qiita so we can human-filter/qc their files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the method doesn't exist, then for the same scope of responsibility reasons noted elsewhere, creating it seems like it would improve the overall codebase. In general, I advise making these type of low risk easy changes (such as this and abstracting sbatch) under the principle of continuous improvement, rather than creating issues and deferring. But you are the lead on this project so it is your decision about how to proceed.
qp_klp/klp.py
Outdated
project_folder = out_path('ConvertJob', project_name) | ||
makedirs(project_folder, exist_ok=True) | ||
|
||
for _, fs in files.items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for _, fs in files.items(): | |
for fs in files.values(): |
qp_klp/klp.py
Outdated
with open(f'{convert_path}/job_completed', 'w') as f: | ||
f.write('') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See SO post, but Path(f'{convert_path}/job_completed').touch()
qp_klp/klp.py
Outdated
try: | ||
kwargs = {'qclient': qclient, | ||
'uif_path': new_sample_sheet, | ||
'lane_number': "1", | ||
'config_fp': CONFIG_FP, | ||
'run_identifier': '211021_A00000_0000_SAMPLE', | ||
'output_dir': out_dir, | ||
'job_id': job_id, | ||
'status_update_callback': status_line.update_job_status, | ||
# set 'update_qiita' to False to avoid updating Qiita DB | ||
# and copying files into uploads dir. Useful for testing. | ||
'update_qiita': True, | ||
'is_restart': True} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try: | |
kwargs = {'qclient': qclient, | |
'uif_path': new_sample_sheet, | |
'lane_number': "1", | |
'config_fp': CONFIG_FP, | |
'run_identifier': '211021_A00000_0000_SAMPLE', | |
'output_dir': out_dir, | |
'job_id': job_id, | |
'status_update_callback': status_line.update_job_status, | |
# set 'update_qiita' to False to avoid updating Qiita DB | |
# and copying files into uploads dir. Useful for testing. | |
'update_qiita': True, | |
'is_restart': True} | |
kwargs = {'qclient': qclient, | |
'uif_path': new_sample_sheet, | |
'lane_number': "1", | |
'config_fp': CONFIG_FP, | |
'run_identifier': '211021_A00000_0000_SAMPLE', | |
'output_dir': out_dir, | |
'job_id': job_id, | |
'status_update_callback': status_line.update_job_status, | |
# set 'update_qiita' to False to avoid updating Qiita DB | |
# and copying files into uploads dir. Useful for testing. | |
'update_qiita': True, | |
'is_restart': True} | |
try: |
except (PipelineError, WorkflowError) as e: | ||
# assume AttributeErrors are issues w/bad sample-sheets or | ||
# mapping-files. | ||
return False, None, str(e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The traceback is lost here:
In [4]: def foo():
...: raise ValueError("stuff")
...:
In [5]: try:
...: foo()
...: except ValueError as e:
...: print(str(e))
...:
stuff
See this SO post. I think the intent is str(e)
to instead be traceback.format_exc()
, right? If so then import traceback
is also needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, the intent is completely the oposite, just raise/report the error in the GUI with minimal information for the user. I think that way, if is something obvious and handled, like "sample sheet has wrong OverwriteCycles value", that's what's shown but if there is something less obvious, users will need to contact the admins/devs to investigate. FWIW, this has been a useful interaction for this specific plugin: wet/dry-lab interactions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do the dev's know what line in the codebase is throwing the exception without the traceback?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! Each step is reported in the job's step in the db and as a new folder in the working directory, each folder has its own logs and details. In other words, via the jobs "step" & last folder written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the developer will not know the exact line of code raising the exception. Won't that require the developer to then guess or perform a much more time expensive debugging process to determine what specifically failed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you are saying but in experience so far that's not the case. However, we might be missing something so I'll change and we can revert back if users get too annoyed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! For users, the traceback could either be post processed, or an additional item in the tuple could be returned (the original str(e)
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, decided to write a log in the outdir so devs can see the full traceback but keep it simple ("str(e)") for users.
qp_klp/tests/test_prep_nuqcjob.py
Outdated
self.qclient, job_id, {'prep_id': pid}, out_dir) | ||
self.assertTrue( | ||
msg.startswith("Execute command-line statement failure:")) | ||
self.assertTrue('sbatch: command not found' in msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could a test be included which asserts successful execution?
@wasade, thank you for the prompt review! I agree on moving the main parts of the code to a new object/class but note that the main difference is that the current objects read a sample-sheet, while the code creates a sample-sheet and then uses that to process the reads. Now, allowing for a full run in the github CI might be really difficult as it depends on slurm and running the full pipeline (which some steps are really compute intensive) - IMOO, we should tests correct execution on qiita-rc; from start to finish. The good news is that I already have this mostly written and I could add that "script" to this repo in a future PR. |
I wasn't suggesting a new object, but to place it in the sample sheet objects? I may not know enough about this project though to comment sufficiently, other than it looked like object specific responsibilities might be handled outside of the object.
An example of mocking out |
Thank you. Agree on "other than it looked like object specific responsibilities"; I'll move the code around. I'll also add an issue to "mock out sbatch for pipeline testing" so we don't forget about it but also do not stop this branch to move forward as I consider that outside the scope of this branch. |
Closing in favor of: #132 |
This piggybacks from the current workflows so we can human filter existing prep's raw data in Qiita. This version should be ready for testing in qiita-rc, which I'll do when available.