Prep/Artifact human filtering #102

antgonza · 2025-03-20T17:49:55Z

This piggybacks on the current workflows so we can human-filter the raw data of existing preps in Qiita. This version should be ready for testing in qiita-rc, which I'll do when available.

wasade

Thanks! It seems like some of the logic here would be better placed in the sample sheet object, and it would be valuable to have a test asserting successful execution.

wasade · 2025-03-20T18:13:44Z

qp_klp/klp.py

+    sheet = MetagenomicSampleSheetv90()
+    sheet.Header['IEMFileVersion'] = '4'
+    sheet.Header['Date'] = datetime.today().strftime('%m/%d/%y')
+    sheet.Header['Workflow'] = 'GenerateFASTQ'
+    sheet.Header['Application'] = 'FASTQ Only'
+    sheet.Header['Assay'] = prep_info['data_type']
+    sheet.Header['Description'] = f'prep_NuQCJob - {pid}'
+    sheet.Header['Chemistry'] = 'Default'
+    sheet.Header['SheetType'] = 'standard_metag'
+    sheet.Header['SheetVersion'] = '90'
+    sheet.Header['Investigator Name'] = 'Qiita'
+    sheet.Header['Experiment Name'] = project_name
+
+    sheet.Bioinformatics = pd.DataFrame(
+        columns=['Sample_Project', 'ForwardAdapter', 'ReverseAdapter',
+                 'library_construction_protocol',
+                 'experiment_design_description',
+                 'PolyGTrimming', 'HumanFiltering', 'QiitaID'],
+        data=[[project_name, 'NA', 'NA', 'NA', 'NA', 'FALSE', 'TRUE', sid]])


Is there a reason this stuff isn't done by the MetagenomicSampleSheetv90 constructor?
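One way to read this suggestion is to gather the header defaults in a single place instead of assigning them one by one in klp.py. The sketch below is only illustrative: `default_prep_header` is a hypothetical helper, not something metapool or qp_klp provides, and the `today` parameter is added purely so the date can be pinned in a test. The values themselves are copied from the diff above.

```python
from datetime import datetime


def default_prep_header(pid, project_name, data_type, today=None):
    """Return the [Header] values that the diff sets individually.

    Hypothetical helper for illustration; not part of metapool or qp_klp.
    """
    today = today or datetime.today()
    return {
        'IEMFileVersion': '4',
        'Date': today.strftime('%m/%d/%y'),
        'Workflow': 'GenerateFASTQ',
        'Application': 'FASTQ Only',
        'Assay': data_type,
        'Description': f'prep_NuQCJob - {pid}',
        'Chemistry': 'Default',
        'SheetType': 'standard_metag',
        'SheetVersion': '90',
        'Investigator Name': 'Qiita',
        'Experiment Name': project_name,
    }
```

Since the diff already assigns with `sheet.Header[key] = value`, a caller could apply these with a plain loop over `default_prep_header(...).items()`. Whether this belongs in the constructor or in a separate classmethod is a design decision for the sample sheet maintainers.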

wasade · 2025-03-20T18:14:48Z

qp_klp/klp.py

+    for k, vals in pt.iterrows():
+        k = k.split('.', 1)[-1]
+        sample = {
+            'Sample_Name': k,
+            'Sample_ID': k.replace('.', '_'),
+            'Sample_Plate': '',
+            'well_id_384': '',
+            'I7_Index_ID': '',
+            'index': vals['index'],
+            'I5_Index_ID': '',
+            'index2': vals['index2'],
+            'Sample_Project': project_name,
+            'Well_description': '',
+            'Sample_Well': '',
+            'Lane': '1'}


It seems like a lot of what's here are attributes specific to the sample sheet. Is there a reason this is done outside of the sample sheet, rather than controlled by the sample sheet?
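As a rough illustration of what "controlled by the sample sheet" could look like, the per-sample dictionary can be produced by one small function that owns the column names and defaults. `sample_row` is a hypothetical name and plain arguments stand in for the `vals` row from `pt.iterrows()`; the keys and the name handling mirror the diff above.

```python
def sample_row(qiita_sample_name, index, index2, project_name, lane='1'):
    """Build one [Data] row for a prep sample.

    Hypothetical helper for illustration; not part of metapool or qp_klp.
    """
    # Drop everything up to the first dot, as k.split('.', 1)[-1] does in
    # the diff (presumably the Qiita study ID prefix).
    name = qiita_sample_name.split('.', 1)[-1]
    return {
        'Sample_Name': name,
        'Sample_ID': name.replace('.', '_'),
        'Sample_Plate': '',
        'well_id_384': '',
        'I7_Index_ID': '',
        'index': index,
        'I5_Index_ID': '',
        'index2': index2,
        'Sample_Project': project_name,
        'Well_description': '',
        'Sample_Well': '',
        'Lane': lane,
    }
```

With this in place, the loop in klp.py would shrink to one call per row, and any future change to the column set would happen in a single spot.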

wasade · 2025-03-20T18:15:37Z

qp_klp/klp.py

+    sheet.Contact = pd.DataFrame(
+        columns=['Email', 'Sample_Project'],
+        data=[['[email protected]', project_name]])


Same here. Or this perhaps could be sheet.set_contact(email, project_name)

I just want to make sure I'm not missing something obvious: are you sure this method exists, or are you suggesting we add it?

FWIW, neither the original sample_sheet nor the augmented version in metapool provides it:

    In [1]: import sample_sheet

    In [2]: sample_sheet.set_contact
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    Cell In[2], line 1
    ----> 1 sample_sheet.set_contact

    AttributeError: module 'sample_sheet' has no attribute 'set_contact'

    In [3]: from metapool import MetagenomicSampleSheetv90

    In [4]: MetagenomicSampleSheetv90.set_contact
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    Cell In[4], line 1
    ----> 1 MetagenomicSampleSheetv90.set_contact

    AttributeError: type object 'MetagenomicSampleSheetv90' has no attribute 'set_contact'

Now, just to be clear, the idea of this code is to quickly create a "valid" sample_sheet from a loaded prep in Qiita so we can human-filter/QC its files.

If the method doesn't exist, then for the same scope-of-responsibility reasons noted elsewhere, creating it seems like it would improve the overall codebase. In general, I advise making these types of low-risk, easy changes (such as this and abstracting sbatch) under the principle of continuous improvement, rather than creating issues and deferring. But you are the lead on this project, so it is your decision how to proceed.

wasade · 2025-03-20T18:16:52Z

qp_klp/klp.py

+    project_folder = out_path('ConvertJob', project_name)
+    makedirs(project_folder, exist_ok=True)
+
+    for _, fs in files.items():


Suggested change

-    for _, fs in files.items():
+    for fs in files.values():

wasade · 2025-03-20T18:18:20Z

qp_klp/klp.py

+    with open(f'{convert_path}/job_completed', 'w') as f:
+        f.write('')


See SO post, but Path(f'{convert_path}/job_completed').touch()
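A minimal sketch of the `Path.touch()` variant follows. The function name `mark_job_completed` is made up for this example; only `pathlib.Path.touch` and the `job_completed` file name come from the thread.

```python
import tempfile
from pathlib import Path


def mark_job_completed(convert_path):
    """Create an empty 'job_completed' marker file and return its path."""
    marker = Path(convert_path) / 'job_completed'
    # touch() creates the file if it is missing and otherwise only updates
    # its modification time; open(..., 'w') would truncate existing content.
    marker.touch()
    return marker


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    marker = mark_job_completed(tmp)
    print(marker.name, marker.stat().st_size)
```

For an empty marker file the two approaches end in the same state; `touch()` mainly states the intent more directly.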

wasade · 2025-03-20T18:19:23Z

qp_klp/klp.py

+    try:
+        kwargs = {'qclient': qclient,
+                  'uif_path': new_sample_sheet,
+                  'lane_number': "1",
+                  'config_fp': CONFIG_FP,
+                  'run_identifier': '211021_A00000_0000_SAMPLE',
+                  'output_dir': out_dir,
+                  'job_id': job_id,
+                  'status_update_callback': status_line.update_job_status,
+                  # set 'update_qiita' to False to avoid updating Qiita DB
+                  # and copying files into uploads dir. Useful for testing.
+                  'update_qiita': True,
+                  'is_restart': True}


Suggested change

-    try:
-        kwargs = {'qclient': qclient,
-                  'uif_path': new_sample_sheet,
-                  'lane_number': "1",
-                  'config_fp': CONFIG_FP,
-                  'run_identifier': '211021_A00000_0000_SAMPLE',
-                  'output_dir': out_dir,
-                  'job_id': job_id,
-                  'status_update_callback': status_line.update_job_status,
-                  # set 'update_qiita' to False to avoid updating Qiita DB
-                  # and copying files into uploads dir. Useful for testing.
-                  'update_qiita': True,
-                  'is_restart': True}
+    kwargs = {'qclient': qclient,
+              'uif_path': new_sample_sheet,
+              'lane_number': "1",
+              'config_fp': CONFIG_FP,
+              'run_identifier': '211021_A00000_0000_SAMPLE',
+              'output_dir': out_dir,
+              'job_id': job_id,
+              'status_update_callback': status_line.update_job_status,
+              # set 'update_qiita' to False to avoid updating Qiita DB
+              # and copying files into uploads dir. Useful for testing.
+              'update_qiita': True,
+              'is_restart': True}
+    try:

wasade · 2025-03-20T18:22:03Z

qp_klp/klp.py

+    except (PipelineError, WorkflowError) as e:
+        # assume AttributeErrors are issues w/bad sample-sheets or
+        # mapping-files.
+        return False, None, str(e)


The traceback is lost here:

    In [4]: def foo():
       ...:     raise ValueError("stuff")
       ...:

    In [5]: try:
       ...:     foo()
       ...: except ValueError as e:
       ...:     print(str(e))
       ...:
    stuff

See this SO post. I think the intent is for str(e) to instead be traceback.format_exc(), right? If so, then import traceback is also needed.

AFAIK, the intent is completely the opposite: just raise/report the error in the GUI with minimal information for the user. That way, if it is something obvious and handled, like "sample sheet has wrong OverwriteCycles value", that's what's shown; if it is something less obvious, users will need to contact the admins/devs to investigate. FWIW, this has been a useful interaction for this specific plugin: wet/dry-lab interactions.

How do the devs know what line in the codebase is throwing the exception without the traceback?

Good question! Each step is reported in the job's step in the db and as a new folder in the working directory, each folder has its own logs and details. In other words, via the jobs "step" & last folder written.

But the developer will not know the exact line of code raising the exception. Won't that require the developer to then guess, or to go through a much more time-consuming debugging process, to determine what specifically failed?

I see what you are saying, but in our experience so far that's not the case. However, we might be missing something, so I'll change it and we can revert if users get too annoyed.

Thanks! For users, the traceback could either be post-processed, or an additional item could be returned in the tuple (the original str(e)).

Yeah, decided to write a log in the outdir so devs can see the full traceback but keep it simple ("str(e)") for users.
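The compromise described above could look roughly like the following. This is a sketch under assumptions, not the code that was committed: `PipelineError` is a local stand-in for the plugin's exception class, `run_and_report` is a made-up name, the success return value is a placeholder, and the log file name is borrowed from the "error-traceback.err" commit title in this PR.

```python
import traceback
from pathlib import Path


class PipelineError(Exception):
    """Local stand-in for the plugin's PipelineError (assumption)."""


def run_and_report(workflow, out_dir, log_name='error-traceback.err'):
    """Run a callable; on failure log the traceback and return str(e)."""
    try:
        workflow()
    except PipelineError as e:
        # Developers get the full traceback, including the failing line.
        Path(out_dir, log_name).write_text(traceback.format_exc())
        # Users only see the short message, as in the original diff.
        return False, None, str(e)
    # Placeholder success value; the real plugin returns artifact info.
    return True, None, ''
```

`traceback.format_exc()` has to be called inside the `except` block, while the exception is still being handled; called outside of it, there is no active exception to format.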

wasade · 2025-03-20T18:26:14Z

qp_klp/tests/test_prep_nuqcjob.py

+            self.qclient, job_id, {'prep_id': pid}, out_dir)
+        self.assertTrue(
+            msg.startswith("Execute command-line statement failure:"))
+        self.assertTrue('sbatch: command not found' in msg)


Could a test be included which asserts successful execution?

antgonza · 2025-03-20T18:34:21Z

@wasade, thank you for the prompt review!

I agree on moving the main parts of the code to a new object/class but note that the main difference is that the current objects read a sample-sheet, while the code creates a sample-sheet and then uses that to process the reads.

Now, allowing for a full run in the GitHub CI might be really difficult, as it depends on Slurm and on running the full pipeline (some steps of which are really compute-intensive). IMO, we should test correct execution on qiita-rc, from start to finish. The good news is that I already have this mostly written, and I could add that "script" to this repo in a future PR.

wasade · 2025-03-20T18:42:25Z

I wasn't suggesting a new object, but to place it in the sample sheet objects? I may not know enough about this project though to comment sufficiently, other than it looked like object specific responsibilities might be handled outside of the object.

sbatch can be thought of as fancy bash, why not just mock it out so the commands can run? If the concern is runtime in CI, then one solution would be to run a subset of the data.

An example of mocking out sbatch can be found here with the capture here. It assumes use of sbatch is by PIPE although a similar type of approach could be used for a .sbatch script if that's what the framework here does (but again, I don't know much about this project)
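For readers without access to the linked example, here is a generic sketch of the idea using only the standard library. It is not the example linked above, and it assumes the code under test calls sbatch through `subprocess.run`; how qp_klp actually invokes sbatch is not shown in this thread. `submit` is a hypothetical wrapper.

```python
import subprocess
from unittest import mock


def submit(script_path):
    """Submit a script with sbatch and return the Slurm job ID as a string."""
    result = subprocess.run(['sbatch', script_path],
                            capture_output=True, text=True, check=True)
    # sbatch normally prints "Submitted batch job <id>" on success.
    return result.stdout.strip().split()[-1]


def test_submit_without_slurm():
    """Exercise submit() on a machine where sbatch is not installed."""
    fake = subprocess.CompletedProcess(
        args=['sbatch', 'job.sh'], returncode=0,
        stdout='Submitted batch job 12345\n', stderr='')
    # Replace subprocess.run for the duration of the block only.
    with mock.patch('subprocess.run', return_value=fake) as run:
        assert submit('job.sh') == '12345'
    run.assert_called_once()
```

A fuller mock could execute the submitted script locally on a small subset of the data, which would let CI assert successful execution rather than only the "sbatch: command not found" failure path.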

antgonza · 2025-03-20T18:49:33Z

Thank you.

Agree on "other than it looked like object specific responsibilities"; I'll move the code around.

I'll also add an issue to "mock out sbatch for pipeline testing" so we don't forget about it, but I don't want it to block this branch from moving forward, as I consider it outside the scope of this branch.

antgonza · 2025-05-09T12:40:07Z

Closing in favor of: #132

antgonza added 3 commits March 15, 2025 08:31

init changes

8513628

Merge branch 'main' of https://github.com/qiita-spots/qp-knight-lab-p…

af8a225

…rocessing into prep-nuqc

adding prep_NuQCJob

20eb229

antgonza requested a review from wasade March 20, 2025 17:49

wasade requested changes Mar 20, 2025

antgonza added 5 commits March 24, 2025 09:05

addressing some of @wasade comments

8559557

add traceback

c3498c4

error-traceback.err

4567a79

making some progress

5bdbb67

fix test

73bb155

antgonza closed this May 9, 2025
