Update paper after carpentries lab review process (#554)
* Update paper after carpentries lab review process
* Improve the paper a bit more
* Fix typo
* Mention that there are probably more taught workshops
Co-authored-by: Toby Hodges <[email protected]>
* Apply suggestions from code review
Co-authored-by: Toby Hodges <[email protected]>
---------
Co-authored-by: Carsten Schnober <[email protected]>
Co-authored-by: Toby Hodges <[email protected]>
file = {Full Text PDF:/Users/carstenschnober/Zotero/storage/PJZDNZTV/Gaviria Rojas et al. - 2022 - The Dollar Street Dataset Images Representing the.pdf:application/pdf}
}

@article{huber_ms2deepscore_2021,
title = {{MS2DeepScore}: a novel deep learning similarity measure to compare tandem mass spectra},
abstract = {Mass spectrometry data is one of the key sources of information in many workflows in medicine and across the life sciences. Mass fragmentation spectra are generally considered to be characteristic signatures of the chemical compound they originate from, yet the chemical structure itself usually cannot be easily deduced from the spectrum. Often, spectral similarity measures are used as a proxy for structural similarity but this approach is strongly limited by a generally poor correlation between both metrics. Here, we propose MS2DeepScore: a novel Siamese neural network to predict the structural similarity between two chemical structures solely based on their MS/MS fragmentation spectra. Using a cleaned dataset of {\textgreater} 100,000 mass spectra of about 15,000 unique known compounds, we trained MS2DeepScore to predict structural similarity scores for spectrum pairs with high accuracy. In addition, sampling different model varieties through Monte-Carlo Dropout is used to further improve the predictions and assess the model’s prediction uncertainty. On 3600 spectra of 500 unseen compounds, MS2DeepScore is able to identify highly-reliable structural matches and to predict Tanimoto scores for pairs of molecules based on their fragment spectra with a root mean squared error of about 0.15. Furthermore, the prediction uncertainty estimate can be used to select a subset of predictions with a root mean squared error of about 0.1. Furthermore, we demonstrate that MS2DeepScore outperforms classical spectral similarity measures in retrieving chemically related compound pairs from large mass spectral datasets, thereby illustrating its potential for spectral library matching. Finally, MS2DeepScore can also be used to create chemically meaningful mass spectral embeddings that could be used to cluster large numbers of spectra. 
Added to the recently introduced unsupervised Spec2Vec metric, we believe that machine learning-supported mass spectral similarity measures have great potential for a range of metabolomics data processing pipelines.},
number = {1},
urldate = {2025-02-11},
journal = {Journal of Cheminformatics},
author = {Huber, Florian and van der Burg, Sven and van der Hooft, Justin J. J. and Ridder, Lars},
file = {Full Text PDF:/Users/svenvanderburg/Zotero/storage/Y3KAXM5F/Huber et al. - 2021 - MS2DeepScore a novel deep learning similarity mea.pdf:application/pdf;Snapshot:/Users/svenvanderburg/Zotero/storage/BIH5UWCE/s13321-021-00558-4.html:text/html},
}

@misc{van_der_burg_dollar_2024,
title = {Dollar street 10 - 64x64x3},
url = {https://zenodo.org/records/10970014},
doi = {10.5281/zenodo.10970014},
abstract = {The MLCommons Dollar Street Dataset is a collection of images of everyday household items from homes around the world that visually captures socioeconomic diversity of traditionally underrepresented populations. It consists of public domain data, licensed for academic, commercial and non-commercial usage, under CC-BY and CC-BY-SA 4.0. The dataset was developed because similar datasets lack socioeconomic metadata and are not representative of global diversity.

This is a subset of the original dataset that can be used for multiclass classification with 10 categories. It is designed to be used in teaching, similar to the widely used, but unlicensed CIFAR-10 dataset.

These are the preprocessing steps that were performed:

Only take examples with one imagenet\_synonym label
Use only examples with the 10 most frequently occurring labels
Downscale images to 64 x 64 pixels
Split data in train and test
Store as numpy array

This is the label mapping:

Category         label
day bed          0
dishrag          1
plate            2
running shoe     3
soap dispenser   4
street sign      5
table lamp       6
tile roof        7
toilet seat      8
washing machine  9

Check out this notebook to see how the subset was created.

The original dataset was downloaded from https://www.kaggle.com/datasets/mlcommons/the-dollar-street-dataset. See https://mlcommons.org/datasets/dollar-street/ for more information.},
urldate = {2025-02-11},
publisher = {Zenodo},
author = {van der Burg, Sven},
month = apr,
year = {2024},
keywords = {CC-BY, CIFAR-10, Deep learning, Image classification, Machine learning},

--- a/paper.md
+++ b/paper.md
@@ -83,9 +83,9 @@ The lesson starts by explaining the basic concepts of neural networks,
 and then guides learners through the different steps of a deep learning workflow.
 After following this lesson,
 learners will be able to prepare data for deep learning,
-implement a basic deep learning model in Python with Keras,
-monitor and troubleshoot the training process, and implement different layer types,
-such as convolutional layers.
+implement a basic deep learning model in Python with Keras,
+and monitor and troubleshoot the training process.
+In addition, they will be able to implement and understand different layer types, such as convolutional layers and dropout layers, and apply transfer learning.
 
 We use data with permissive licenses and designed for real world use cases:
 
@@ -148,16 +148,22 @@ and these can even be included at the level of the lesson content.
 In addition, the Carpentries Workbench prioritises accessibility of the content, for example by having clearly visible figure captions
 and promoting alt-texts for pictures.
 
-The lesson is split into a general introduction, and 3 episodes that cover 3 distinct increasingly more complex deep learning problems.
+
+The lesson is split into a general introduction and 4 episodes that cover 3 distinct, increasingly complex deep learning problems.
 Each of the deep learning problems is approached using the same 10-step deep learning workflow (https://carpentries-lab.github.io/deep-learning-intro/1-introduction.html#deep-learning-workflow).
+
 By going through the deep learning cycle three times with different problems, learners become increasingly confident in applying this deep learning workflow to their own projects.
+We end with an outlook episode. First, the outlook episode discusses a real-world application of deep learning in chemistry [@huber_ms2deepscore_2021]. In addition, it discusses bias in datasets, large language models, and good practices for organising deep learning projects. Finally, we end with ideas for next steps after finishing the lesson.
 
 # Feedback
-This course was taught 12 times over the course of 3 years, both online and in-person, by the Netherlands eScience Center
-(Netherlands, https://www.esciencecenter.nl/) and Helmholz-Zentrum Dresden-Rossendorf (Germany, https://www.hzdr.de/).
-Apart from the core group of contributors, the workshop was also taught at 3 independent institutes, namely:
+This course was taught 13 times over the course of 4 years, both online and in-person, by the Netherlands eScience Center
+(Netherlands, https://www.esciencecenter.nl/) and Helmholtz-Zentrum Dresden-Rossendorf (Germany, https://www.hzdr.de/).
+Apart from the core group of contributors, the workshop was also taught at 3 or more independent institutes, namely:
 University of Wisconsin-Madison (US, https://www.wisc.edu/), University of Auckland (New Zealand, https://www.auckland.ac.nz/),
 and EMBL Heidelberg (Germany, https://www.embl.org/sites/heidelberg/).
+
+An up-to-date list of workshops that the authors know to have used this lesson can be found in a `workshops.md` file in the GitHub repository (https://github.com/carpentries-incubator/deep-learning-intro/blob/main/workshops.md).
+
 In general, adoption of the lesson material by the instructors not involved in the project went well.
 The feedback gathered from our own and others' teachings was used to polish the lesson further.
 
@@ -193,6 +199,13 @@ The results from these 2 workshops are a good representation of the general feed
 Table 2: Quality of the different episodes of the workshop as rated by students from 2 workshops taught at the Netherlands eScience Center.
 The results from these 2 workshops are a good representation of the general feedback we get when teaching this workshop.
 
+## Carpentries Lab review process
+Prior to submitting this paper, the lesson went through a substantial review in the process of becoming an official Carpentries Lab (https://carpentries-lab.org/) lesson. This led to a number of improvements to the lesson. In general, the accessibility and user-friendliness improved, for example by updating alt-texts and using more beginner-friendly and clearer wording. Additionally, the instructor notes were improved and many missing explanations of important deep learning concepts were added to the lesson.
+
+Most importantly, the reviewers pointed out that the CIFAR-10 [@noauthor_cifar-10_nodate] dataset that we initially used does not have a license. We were surprised to find out that this dataset, which is one of the most widely used datasets in the field of machine learning and deep learning, was actually unethically scraped from the internet without permission from the image owners. As an alternative we now use 'Dollar street 10' [@van_der_burg_dollar_2024], a dataset that was adapted for this lesson from The Dollar Street Dataset [@gaviria_rojas_dollar_2022]. The Dollar Street Dataset is representative and contains accurate demographic information to ensure robustness and fairness, especially for smaller subpopulations. In addition, it is a great entry point to teach learners about ethical AI and bias in datasets.
+
+You can find all details of the review process on GitHub: https://github.com/carpentries-lab/reviews/issues/25.
+
 # Conclusion
 This lesson can be taught as a stand-alone workshop to students already familiar with machine learning and Python.
 It can also be taught in a broader curriculum after an introduction to Python programming (for example: @azalee_bostroem_software_2016)
@@ -208,6 +221,7 @@ Nidhi Gowdra (University of Auckland, New Zealand, https://www.auckland.ac.nz/),
 Renato Alves and Lisanna Paladin (EMBL Heidelberg, Germany, https://www.embl.org/sites/heidelberg/),
 that piloted this workshop at their institutes.
 We thank the Carpentries for providing such a great framework for developing this lesson material.
+We thank Sarah Brown, Johanna Bayer, and Mike Laverick for giving us excellent feedback on the lesson during the Carpentries Lab review process.
 We thank all students enrolled in the workshops that were taught using this lesson material for providing us with feedback.
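
The abstract of the `van_der_burg_dollar_2024` entry lists the preprocessing steps behind the 'Dollar street 10' subset. As a rough illustration only, the label-selection and train/test split steps could be sketched as below. This is not the authors' notebook: the function names `select_subset` and `train_test_split`, the input format (a list of `(image_id, labels)` pairs), the 20% test fraction, and the fixed seed are all assumptions made for this sketch. Downscaling to 64 x 64 pixels and storing as a numpy array are left out to keep the example dependency-free.

```python
from collections import Counter
import random


def select_subset(examples, n_classes=10):
    """Keep single-label examples whose label is among the n most frequent.

    `examples` is a list of (image_id, labels) pairs (hypothetical format),
    where `labels` is a list of strings.
    Returns (subset, label_to_id); subset holds (image_id, class_id) pairs.
    """
    # Step 1: only take examples with exactly one label.
    single = [(img, labels[0]) for img, labels in examples if len(labels) == 1]

    # Step 2: keep only the n_classes most frequently occurring labels.
    counts = Counter(label for _, label in single)
    top = [label for label, _ in counts.most_common(n_classes)]

    # Assumption: integer ids follow alphabetical order, which is consistent
    # with the label mapping in the abstract (day bed = 0 ... washing machine = 9).
    label_to_id = {label: i for i, label in enumerate(sorted(top))}
    subset = [(img, label_to_id[label]) for img, label in single
              if label in label_to_id]
    return subset, label_to_id


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` with a fixed seed, then split into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Sorting the selected labels before assigning ids keeps the mapping reproducible regardless of the order in which examples are read.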