Update paper after carpentries lab review process (#554)

svenvanderburg · carschno · tobyhodges · web-flow · commit a4667415d21a · 2025-02-25T16:42:18.000+01:00
* Update paper after carpentries lab review process

* Improve the paper a bit more

* Fix typo

* Mention that there are probably more taught workshops

Co-authored-by: Toby Hodges <tobyhodges@carpentries.org>

* Apply suggestions from code review

Co-authored-by: Toby Hodges <tobyhodges@carpentries.org>

---------

Co-authored-by: Carsten Schnober <c.schnober@esciencecenter.nl>
Co-authored-by: Toby Hodges <tobyhodges@carpentries.org>
diff --git a/paper.bib b/paper.bib
@@ -190,3 +190,117 @@ @article{gaviria_rojas_dollar_2022
   pages      = {12979--12990},
   file       = {Full Text PDF:/Users/carstenschnober/Zotero/storage/PJZDNZTV/Gaviria Rojas et al. - 2022 - The Dollar Street Dataset Images Representing the.pdf:application/pdf}
 }
+
+
+@article{huber_ms2deepscore_2021,
+	title = {{MS2DeepScore}: a novel deep learning similarity measure to compare tandem mass spectra},
+	volume = {13},
+	issn = {1758-2946},
+	shorttitle = {{MS2DeepScore}},
+	url = {https://doi.org/10.1186/s13321-021-00558-4},
+	doi = {10.1186/s13321-021-00558-4},
+	abstract = {Mass spectrometry data is one of the key sources of information in many workflows in medicine and across the life sciences. Mass fragmentation spectra are generally considered to be characteristic signatures of the chemical compound they originate from, yet the chemical structure itself usually cannot be easily deduced from the spectrum. Often, spectral similarity measures are used as a proxy for structural similarity but this approach is strongly limited by a generally poor correlation between both metrics. Here, we propose MS2DeepScore: a novel Siamese neural network to predict the structural similarity between two chemical structures solely based on their MS/MS fragmentation spectra. Using a cleaned dataset of {\textgreater} 100,000 mass spectra of about 15,000 unique known compounds, we trained MS2DeepScore to predict structural similarity scores for spectrum pairs with high accuracy. In addition, sampling different model varieties through Monte-Carlo Dropout is used to further improve the predictions and assess the model’s prediction uncertainty. On 3600 spectra of 500 unseen compounds, MS2DeepScore is able to identify highly-reliable structural matches and to predict Tanimoto scores for pairs of molecules based on their fragment spectra with a root mean squared error of about 0.15. Furthermore, the prediction uncertainty estimate can be used to select a subset of predictions with a root mean squared error of about 0.1. Furthermore, we demonstrate that MS2DeepScore outperforms classical spectral similarity measures in retrieving chemically related compound pairs from large mass spectral datasets, thereby illustrating its potential for spectral library matching. Finally, MS2DeepScore can also be used to create chemically meaningful mass spectral embeddings that could be used to cluster large numbers of spectra. 
+Added to the recently introduced unsupervised Spec2Vec metric, we believe that machine learning-supported mass spectral similarity measures have great potential for a range of metabolomics data processing pipelines.},
+	number = {1},
+	urldate = {2025-02-11},
+	journal = {Journal of Cheminformatics},
+	author = {Huber, Florian and van der Burg, Sven and van der Hooft, Justin J. J. and Ridder, Lars},
+	month = oct,
+	year = {2021},
+	keywords = {Deep learning, Mass spectrometry, Metabolomics, Spectral similarity measure, Supervised machine learning},
+	pages = {84},
+	file = {Full Text PDF:/Users/svenvanderburg/Zotero/storage/Y3KAXM5F/Huber et al. - 2021 - MS2DeepScore a novel deep learning similarity mea.pdf:application/pdf;Snapshot:/Users/svenvanderburg/Zotero/storage/BIH5UWCE/s13321-021-00558-4.html:text/html},
+}
+
+@misc{van_der_burg_dollar_2024,
+	title = {Dollar street 10 - 64x64x3},
+	url = {https://zenodo.org/records/10970014},
+	doi = {10.5281/zenodo.10970014},
+	abstract = {The MLCommons Dollar Street Dataset is a collection of images of everyday household items from homes around the world that visually captures socioeconomic diversity of traditionally underrepresented populations. It consists of public domain data, licensed for academic, commercial and non-commercial usage, under CC-BY and CC-BY-SA 4.0. The dataset was developed because similar datasets lack socioeconomic metadata and are not representative of global diversity.
+
+This is a subset of the original dataset that can be used for multiclass classification with 10 categories. It is designed to be used in teaching, similar to the widely used, but unlicensed CIFAR-10 dataset.
+
+These are the preprocessing steps that were performed:
+
+- Only take examples with one imagenet\_synonym label
+- Use only examples with the 10 most frequently occurring labels
+- Downscale images to 64 x 64 pixels
+- Split data in train and test
+- Store as numpy array
+
+
+This is the label mapping (category: label):
+
+- day bed: 0
+- dishrag: 1
+- plate: 2
+- running shoe: 3
+- soap dispenser: 4
+- street sign: 5
+- table lamp: 6
+- tile roof: 7
+- toilet seat: 8
+- washing machine: 9
+
+Check out this notebook to see how the subset was created.
+
+The original dataset was downloaded from https://www.kaggle.com/datasets/mlcommons/the-dollar-street-dataset. See https://mlcommons.org/datasets/dollar-street/ for more information.},
+	urldate = {2025-02-11},
+	publisher = {Zenodo},
+	author = {van der Burg, Sven},
+	month = apr,
+	year = {2024},
+	keywords = {CC-BY, CIFAR-10, Deep learning, Image classification, Machine learning},
+	file = {Snapshot:/Users/svenvanderburg/Zotero/storage/QPJDIYXH/10970014.html:text/html},
+}
+
+@misc{noauthor_cifar-10_nodate,
+	title = {{CIFAR}-10 and {CIFAR}-100 datasets},
+	url = {https://www.cs.toronto.edu/~kriz/cifar.html},
+	urldate = {2025-02-11},
+	file = {CIFAR-10 and CIFAR-100 datasets:/Users/svenvanderburg/Zotero/storage/CTXZX76B/cifar.html:text/html},
+}
+
diff --git a/paper.md b/paper.md
@@ -83,9 +83,9 @@ The lesson starts by explaining the basic concepts of neural networks,
 and then guides learners through the different steps of a deep learning workflow.  
 After following this lesson, 
 learners will be able to prepare data for deep learning, 
-implement a basic deep learning model in Python with Keras, 
-monitor and troubleshoot the training process, and implement different layer types, 
-such as convolutional layers.
+implement a basic deep learning model in Python with Keras,
+and monitor and troubleshoot the training process.
+In addition, they will be able to implement and understand different layer types, such as convolutional layers and dropout layers, and apply transfer learning.
 
 We use data with permissive licenses and designed for real world use cases:
 
@@ -148,16 +148,22 @@ and these can even be included at the level of the lesson content.
 In addition, the Carpentries Workbench prioritises accessibility of the content, for example by having clearly visible figure captions
 and promoting alt-texts for pictures.
 
-The lesson is split into a general introduction, and 3 episodes that cover 3 distinct increasingly more complex deep learning problems.
+
+The lesson is split into a general introduction and 4 episodes that cover 3 distinct, increasingly complex deep learning problems.
 Each of the deep learning problems is approached using the same 10-step deep learning workflow (https://carpentries-lab.github.io/deep-learning-intro/1-introduction.html#deep-learning-workflow).
+
 By going through the deep learning cycle three times with different problems, learners become increasingly confident in applying this deep learning workflow to their own projects.
+We end with an outlook episode. First, the outlook episode discusses a real-world application of deep learning in chemistry [@huber_ms2deepscore_2021]. In addition, it discusses bias in datasets, large language models, and good practices for organising deep learning projects. Finally, it offers ideas for next steps after finishing the lesson.
 
 # Feedback
-This course was taught 12 times over the course of 3 years, both online and in-person, by the Netherlands eScience Center
-(Netherlands, https://www.esciencecenter.nl/) and Helmholz-Zentrum Dresden-Rossendorf (Germany, https://www.hzdr.de/).
-Apart from the core group of contributors, the workshop was also taught at 3 independent institutes, namely:
+This course was taught 13 times over the course of 4 years, both online and in-person, by the Netherlands eScience Center
+(Netherlands, https://www.esciencecenter.nl/) and Helmholtz-Zentrum Dresden-Rossendorf (Germany, https://www.hzdr.de/).
+Apart from the core group of contributors, the workshop was also taught at no fewer than 3 independent institutes, namely:
 University of Wisconsin-Madison (US, https://www.wisc.edu/), University of Auckland (New Zealand, https://www.auckland.ac.nz/), 
 and EMBL Heidelberg (Germany, https://www.embl.org/sites/heidelberg/).
+
+An up-to-date list of the workshops that the authors know to have used this lesson can be found in the `workshops.md` file in the GitHub repository (https://github.com/carpentries-incubator/deep-learning-intro/blob/main/workshops.md).
+
 In general, adoption of the lesson material by the instructors not involved in the project went well.
 The feedback gathered from our own and others' teachings was used to polish the lesson further.
 
@@ -193,6 +199,13 @@ The results from these 2 workshops are a good representation of the general feed
 Table 2: Quality of the different episodes of the workshop as rated by students from 2 workshops taught at the Netherlands eScience Center. 
 The results from these 2 workshops are a good representation of the general feedback we get when teaching this workshop.
 
+## Carpentries Lab review process
+Prior to submitting this paper, the lesson went through a substantial review as part of becoming an official Carpentries Lab (https://carpentries-lab.org/) lesson. This led to a number of improvements to the lesson. In general, the accessibility and user-friendliness improved, for example through updated alt-texts and clearer, more beginner-friendly wording. Additionally, the instructor notes were improved and many missing explanations of important deep learning concepts were added to the lesson.
+
+Most importantly, the reviewers pointed out that the CIFAR-10 [@noauthor_cifar-10_nodate] dataset that we initially used does not have a license. We were surprised to find out that this dataset, which is one of the most widely used datasets in the field of machine learning and deep learning, was scraped unethically from the internet, without permission from the image owners. As an alternative, we now use 'Dollar street 10' [@van_der_burg_dollar_2024], a dataset that was adapted for this lesson from The Dollar Street Dataset [@gaviria_rojas_dollar_2022]. The Dollar Street Dataset is representative and contains accurate demographic information, which supports robustness and fairness, especially for smaller subpopulations. In addition, it is a great entry point to teach learners about ethical AI and bias in datasets.
+
+You can find all details of the review process on GitHub: https://github.com/carpentries-lab/reviews/issues/25.
+
 # Conclusion
 This lesson can be taught as a stand-alone workshop to students already familiar with machine learning and Python.
 It can also be taught in a broader curriculum after an introduction to Python programming (for example: @azalee_bostroem_software_2016) 
@@ -208,6 +221,7 @@ Nidhi Gowdra (University of Auckland, New Zealand, https://www.auckland.ac.nz/),
 Renato Alves and Lisanna Paladin (EMBL Heidelberg, Germany, https://www.embl.org/sites/heidelberg/),
 that piloted this workshop at their institutes.
 We thank the Carpentries for providing such a great framework for developing this lesson material.
+We thank Sarah Brown, Johanna Bayer, and Mike Laverick for giving us excellent feedback on the lesson during the Carpentries Lab review process.
 We thank all students enrolled in the workshops that were taught using this lesson material for providing us with feedback.
 
 # References