Skip to content

Prep/Artifact human filtering #102

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from
Closed

Conversation

antgonza
Copy link
Member

This piggybacks from the current workflows so we can human filter existing prep's raw data in Qiita. This version should be ready for testing in qiita-rc, which I'll do when available.

@antgonza antgonza requested a review from wasade March 20, 2025 17:49
Copy link
Contributor

@wasade wasade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! It seems like some of the logic here would be better placed in the sample sheet object, and it would be valuable to have a test asserting successful execution

qp_klp/klp.py Outdated
Comment on lines 209 to 227
sheet = MetagenomicSampleSheetv90()
sheet.Header['IEMFileVersion'] = '4'
sheet.Header['Date'] = datetime.today().strftime('%m/%d/%y')
sheet.Header['Workflow'] = 'GenerateFASTQ'
sheet.Header['Application'] = 'FASTQ Only'
sheet.Header['Assay'] = prep_info['data_type']
sheet.Header['Description'] = f'prep_NuQCJob - {pid}'
sheet.Header['Chemistry'] = 'Default'
sheet.Header['SheetType'] = 'standard_metag'
sheet.Header['SheetVersion'] = '90'
sheet.Header['Investigator Name'] = 'Qiita'
sheet.Header['Experiment Name'] = project_name

sheet.Bioinformatics = pd.DataFrame(
columns=['Sample_Project', 'ForwardAdapter', 'ReverseAdapter',
'library_construction_protocol',
'experiment_design_description',
'PolyGTrimming', 'HumanFiltering', 'QiitaID'],
data=[[project_name, 'NA', 'NA', 'NA', 'NA', 'FALSE', 'TRUE', sid]])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this stuff isn't done by the MetagenomicSampleSheetv90 constructor?

qp_klp/klp.py Outdated
Comment on lines 229 to 243
for k, vals in pt.iterrows():
k = k.split('.', 1)[-1]
sample = {
'Sample_Name': k,
'Sample_ID': k.replace('.', '_'),
'Sample_Plate': '',
'well_id_384': '',
'I7_Index_ID': '',
'index': vals['index'],
'I5_Index_ID': '',
'index2': vals['index2'],
'Sample_Project': project_name,
'Well_description': '',
'Sample_Well': '',
'Lane': '1'}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like a lot of what's here are attributes specific to the sample sheet. Is there a reason this is done outside of the sample sheet, rather than controlled by the sample sheet?

qp_klp/klp.py Outdated
Comment on lines 246 to 248
sheet.Contact = pd.DataFrame(
columns=['Email', 'Sample_Project'],
data=[['[email protected]', project_name]])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. Or this perhaps could be sheet.set_contact(email, project_name)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want to make sure I'm not missing something obvious but are you sure this method exists or are you asking why not put it together?

FWIW, neither the original sample_sheet or the augmented version metapool provides it:

In [1]: import sample_sheet

In [2]: sample_sheet.set_contact
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[2], line 1
----> 1 sample_sheet.set_contact

AttributeError: module 'sample_sheet' has no attribute 'set_contact'

In [3]: from metapool import MetagenomicSampleSheetv90

In [4]: MetagenomicSampleSheetv90.set_contact
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[4], line 1
----> 1 MetagenomicSampleSheetv90.set_contact

AttributeError: type object 'MetagenomicSampleSheetv90' has no attribute 'set_contact'

Now, just to be clear, the idea of this code is to quickly create a "valid" sample_sheet from a loaded prep in qiita so we can human-filter/qc their files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the method doesn't exist, then for the same scope of responsibility reasons noted elsewhere, creating it seems like it would improve the overall codebase. In general, I advise making these type of low risk easy changes (such as this and abstracting sbatch) under the principle of continuous improvement, rather than creating issues and deferring. But you are the lead on this project so it is your decision about how to proceed.

qp_klp/klp.py Outdated
project_folder = out_path('ConvertJob', project_name)
makedirs(project_folder, exist_ok=True)

for _, fs in files.items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for _, fs in files.items():
for fs in files.values():

qp_klp/klp.py Outdated
Comment on lines 266 to 267
with open(f'{convert_path}/job_completed', 'w') as f:
f.write('')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See SO post, but Path(f'{convert_path}/job_completed').touch()

qp_klp/klp.py Outdated
Comment on lines 269 to 281
try:
kwargs = {'qclient': qclient,
'uif_path': new_sample_sheet,
'lane_number': "1",
'config_fp': CONFIG_FP,
'run_identifier': '211021_A00000_0000_SAMPLE',
'output_dir': out_dir,
'job_id': job_id,
'status_update_callback': status_line.update_job_status,
# set 'update_qiita' to False to avoid updating Qiita DB
# and copying files into uploads dir. Useful for testing.
'update_qiita': True,
'is_restart': True}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
try:
kwargs = {'qclient': qclient,
'uif_path': new_sample_sheet,
'lane_number': "1",
'config_fp': CONFIG_FP,
'run_identifier': '211021_A00000_0000_SAMPLE',
'output_dir': out_dir,
'job_id': job_id,
'status_update_callback': status_line.update_job_status,
# set 'update_qiita' to False to avoid updating Qiita DB
# and copying files into uploads dir. Useful for testing.
'update_qiita': True,
'is_restart': True}
kwargs = {'qclient': qclient,
'uif_path': new_sample_sheet,
'lane_number': "1",
'config_fp': CONFIG_FP,
'run_identifier': '211021_A00000_0000_SAMPLE',
'output_dir': out_dir,
'job_id': job_id,
'status_update_callback': status_line.update_job_status,
# set 'update_qiita' to False to avoid updating Qiita DB
# and copying files into uploads dir. Useful for testing.
'update_qiita': True,
'is_restart': True}
try:

except (PipelineError, WorkflowError) as e:
# assume AttributeErrors are issues w/bad sample-sheets or
# mapping-files.
return False, None, str(e)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The traceback is lost here:

In [4]: def foo():
   ...:     raise ValueError("stuff")
   ...: 

In [5]: try:
   ...:     foo()
   ...: except ValueError as e:
   ...:     print(str(e))
   ...: 
stuff

See this SO post. I think the intent is str(e) to instead be traceback.format_exc(), right? If so then import traceback is also needed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, the intent is completely the oposite, just raise/report the error in the GUI with minimal information for the user. I think that way, if is something obvious and handled, like "sample sheet has wrong OverwriteCycles value", that's what's shown but if there is something less obvious, users will need to contact the admins/devs to investigate. FWIW, this has been a useful interaction for this specific plugin: wet/dry-lab interactions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do the dev's know what line in the codebase is throwing the exception without the traceback?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! Each step is reported in the job's step in the db and as a new folder in the working directory, each folder has its own logs and details. In other words, via the jobs "step" & last folder written.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the developer will not know the exact line of code raising the exception. Won't that require the developer to then guess or perform a much more time expensive debugging process to determine what specifically failed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you are saying but in experience so far that's not the case. However, we might be missing something so I'll change and we can revert back if users get too annoyed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! For users, the traceback could either be post processed, or an additional item in the tuple could be returned (the original str(e))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, decided to write a log in the outdir so devs can see the full traceback but keep it simple ("str(e)") for users.

self.qclient, job_id, {'prep_id': pid}, out_dir)
self.assertTrue(
msg.startswith("Execute command-line statement failure:"))
self.assertTrue('sbatch: command not found' in msg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could a test be included which asserts successful execution?

@antgonza
Copy link
Member Author

@wasade, thank you for the prompt review!

I agree on moving the main parts of the code to a new object/class but note that the main difference is that the current objects read a sample-sheet, while the code creates a sample-sheet and then uses that to process the reads.

Now, allowing for a full run in the github CI might be really difficult as it depends on slurm and running the full pipeline (which some steps are really compute intensive) - IMOO, we should tests correct execution on qiita-rc; from start to finish. The good news is that I already have this mostly written and I could add that "script" to this repo in a future PR.

@wasade
Copy link
Contributor

wasade commented Mar 20, 2025

I wasn't suggesting a new object, but to place it in the sample sheet objects? I may not know enough about this project though to comment sufficiently, other than it looked like object specific responsibilities might be handled outside of the object.

sbatch can be thought of as fancy bash, why not just mock it out so the commands can run? If the concern is runtime in CI, then one solution would be to run a subset of the data.

An example of mocking out sbatch can be found here with the capture here. It assumes use of sbatch is by PIPE although a similar type of approach could be used for a .sbatch script if that's what the framework here does (but again, I don't know much about this project)

@antgonza
Copy link
Member Author

Thank you.

Agree on "other than it looked like object specific responsibilities"; I'll move the code around.

I'll also add an issue to "mock out sbatch for pipeline testing" so we don't forget about it but also do not stop this branch to move forward as I consider that outside the scope of this branch.

@antgonza
Copy link
Member Author

antgonza commented May 9, 2025

Closing in favor of: #132

@antgonza antgonza closed this May 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants