diff --git a/documentation/report/Appendices/Appendix_B.tex b/documentation/report/Appendices/Appendix_B.tex index 0d14f5a..f9c0c2b 100644 --- a/documentation/report/Appendices/Appendix_B.tex +++ b/documentation/report/Appendices/Appendix_B.tex @@ -12,7 +12,18 @@ \subsection{Supplementary Information} \subsubsection{File moving service} \label{subsec:FileService} -In this section it is in detail described how the service of filemoving is constructed. As Windows is used as the operating system of the sorting machine, the development was done with the .NET framework\footnote{For further information on the .NET framework, see \url{https://docs.microsoft.com/en-us/dotnet/framework/get-started/overview}} in the programming language C\#. The package provided is called Topshelf.\footnote{For further information on Topshelf, see \url{https://github.com/Topshelf/Topshelf}} Topshelf is a service hosting framework for building Windows services using .NET. With the package, it is possible to develop a console application in the development phase, compile it as a service, and install it later via the console. Previously, it was not possible to debug services during the development phase. The function of the service is based on the FileSystemWatcher object from the System.IO namespace.\footnote{For further information on the SystemFileWatcher, see \url{https://docs.microsoft.com/en-us/dotnet/api/system.io.filesystemwatcher?view=netframework-4.8}} In the main program, a list of files in the source folder is kept. Files that are older than one hour are moved to the target folder on the external drive. The selected files are moved by a function that is called, when an event is triggered. The event is triggered by the FileSystemWatcher after subscribing to different flags. Shortly after initialization, the service was adjusted because removing the images from the C disk straight away caused the sorting program to stop. The problem is solved by keeping the most recent 1000 images and moving older images to the external disk. +In this section, it is described in detail how the file moving service is constructed. As Windows is used as the operating system of the sorting machine, the development was done with the .NET framework\footnote{For further information on the .NET framework, see \url{https://docs.microsoft.com/en-us/dotnet/framework/get-started/overview} (visited on 04/24/2020)} in the programming language C\#. The package used is called Topshelf.\footnote{For further information on Topshelf, see \url{https://github.com/Topshelf/Topshelf} (visited on 04/24/2020)} Topshelf is a service hosting framework for building Windows services using .NET. With the package, it is possible to develop a console application in the development phase, compile it as a service, and install it later via the console. Without such a framework, it was not possible to debug the service during the development phase. The function of the service is based on the FileSystemWatcher object from the System.IO namespace.\footnote{For further information on the FileSystemWatcher, see \url{https://docs.microsoft.com/en-us/dotnet/api/system.io.filesystemwatcher?view=netframework-4.8} (visited on 04/24/2020)} In the main program, a list of files in the source folder is kept. Files that are older than one hour are moved to the target folder on the external drive. The selected files are moved by a function that is called when an event is triggered. The event is triggered by the FileSystemWatcher after subscribing to different flags.
Shortly after initialization, the service was adjusted because removing the images from the C drive straight away caused the sorting program to stop. The problem was solved by keeping the most recent 1000 images and moving older images to the external disk. + +\subsubsection{Benefits of Data Set Creation with TensorFlow} +\label{subsec:BenefitsDataSet} + +In the following, TensorFlow's own binary storage format \texttt{TFRecord} is introduced. This format was chosen because it promises faster training, avoids loss of information, and copes well with the large amount of data collected. + +The file format is optimized for image and text data, which are stored as tuples that always consist of a file and its label. In our case, the difference in reading time is significant because the data is stored on the network and not on an SSD in the local PC. The serialized file format allows the data to be streamed over the network efficiently. This storage format therefore facilitates mixing and matching data sets and network architectures. Another advantage is that the files are portable across systems. + +Working with these files simplifies the subsequent image transformations. With the \mbox{\texttt{tf.data}} API, complex input pipelines can be built from simple and reusable components, even for large data sets. The preferred pipeline for our asparagus project can apply complex transformations to the images and combine them into stacks for training, testing, and validation in arbitrary ratios. A data set can be changed, e.g.\ by using different labels or by transformations like mapping, repeating, batching, and many others. + +Besides the described functional transformations of the input pipeline built on \mbox{\texttt{tf.data.Dataset}}, an iterator gives sequential access to the elements in the data set. The iterator keeps track of the current position and allows retrieving the next element as a tuple of tensors. Initializable iterators can in addition be initialized with different parameters before each run, which is especially handy when searching for the right parameters in parallel. \newpage diff --git a/documentation/report/Chapters/Classification.tex b/documentation/report/Chapters/Classification.tex index 78dbe2e..813ca63 100644 --- a/documentation/report/Chapters/Classification.tex +++ b/documentation/report/Chapters/Classification.tex @@ -6,50 +6,50 @@ \section{Classification} \label{ch:Classification} -Given the structure of our data set, namely image data which is partially annotated with corresponding class labels and partially hand-labeled with corresponding feature labels, different machine learning and computer vision methods were chosen to tackle the problem of image classification. +Given the structure of our data set, different machine learning and computer vision methods were chosen to tackle the problem of image classification. \bigskip -Image classification refers to the method of identifying to which category an image belongs to, according to its visual information. Classification problems can be divided into three different types: binary, multiclass and multi-label. Moreover, there are different methods on how to approach image classification. Those can be divided into three main groups: supervised learning, semi-supervised learning and unsupervised learning~\citep{har2003constraint}. +Image classification refers to the method of identifying to which category an image belongs, according to its visual information.
Classification problems can be divided into three different types: binary, multiclass and multi-label. Moreover, there are different methods on how to approach image classification. Those can be divided into three main groups: supervised learning, semi-supervised learning and unsupervised learning~\citep{har2003constraint}. -Binary classification can be used to decide whether a certain feature is present in an asparagus spear, or not. Multiclass classification solves this problem as well, but is also applicable to data with more than two classes. As the class label results from a combination of the presence of certain features and the absence of others, it is also reasonable to go for a multi-label classification approach. In this approach, several features are learned simultaneously without being mutually exclusive. +Binary classification can be used to decide whether a certain feature is present or not. Multiclass classification solves this problem as well, but is also applicable to data with more than two classes. As the class label results from a combination of the presence of certain features and the absence of others, it is also reasonable to go for a multi-label classification approach. In this approach, several features are learned simultaneously without being mutually exclusive. During our group work, algorithms of the different classification types as well as of the different learning types were applied. -In the long run, an integrated model was aimed at that is predicting all features of a single asparagus spear, from which the final class label can be inferred. However, as intermediate steps towards that goal, the focus was to optimize models on identifying the presence of the features described in~\autoref{sec:AutomaticFeatureExtraction}. +In the long run, an integrated model was aimed at predicting all features of a single asparagus spear, from which the final class label can be inferred. However, as intermediate steps towards that goal, the focus was to optimize models on identifying the presence of the features described in~\autoref{sec:AutomaticFeatureExtraction}. \bigskip -This chapter gives a general background of the different approaches chosen for our image classification problem, as well as a detailed overview of the concrete implementations of the models and the mechanisms of their hyperparameters. +This chapter gives a general background of the different approaches chosen for our image classification problem, as well as a detailed overview of the concrete implementations of the models and the mechanisms of their hyperparameters. \subsection{Supervised learning} \label{sec:SupervisedLearning} -In machine learning, there are different approaches for an application to be trained on a set of data~\citep{geron2019hands,bishop2006pattern}. Depending on the level of supervision that the system receives during the training phase, the learning process is grouped into one of four major categories~\citep{geron2019hands}. One of these categories is supervised learning. For supervised learning approaches, the training data includes not only the input but also its corresponding target labels. The objective is to find a mapping between object \((x)\) and label \((t)\) when a set of both \((x,t)\) is provided as training data to the application~\citep{olivier2006semi}. An advantage of supervised learning is that the problem is well defined and the model can be evaluated in respect to its performance on labeled data~\citep{daume2012course,olivier2006semi}. 
In other words, the labels are used as a direct basis for the model optimizing function during training. +In machine learning, there are different approaches for an application to be trained on a set of data~\citep{geron2019hands,bishop2006pattern}. Depending on the level of supervision that the system receives during the training phase, the learning process is grouped into one of three major categories~\citep{geron2019hands}. One of these categories is supervised learning. For supervised learning approaches, the training data includes not only the input but also its corresponding target labels. The objective is to find a mapping between object \((x)\) and label \((t)\) when a set of both \((x,t)\) is provided as training data to the application~\citep{olivier2006semi}. An advantage of supervised learning is that the problem is well defined and the model can be evaluated with respect to its performance on labeled data~\citep{daume2012course,olivier2006semi}. In other words, the labels are used as a direct basis for the model’s optimization function during training. Supervised learning spans a large set of different methods, from decision trees and random forests to \acrfullpl{svm} and \acrlongpl{ann}~\citep{caruana2006comparison,geron2019hands}. -A classical task for supervised learning systems is the classification of received data and mapping it to one of a finite number of categories~\citep{bishop2006pattern}. -The disadvantage of supervised learning is the effort of receiving enough labeled data. It can be challenging to obtain fully labeled data because labeling experts are needed to classify the data which is usually a time consuming and expensive task~\citep{zhu05survey,figueroa2012predicting}. +A classical task for supervised learning systems is the classification of received data, which means mapping the input data to one of a finite number of categories~\citep{bishop2006pattern}. +The disadvantage of supervised learning is the effort of obtaining enough labeled data. It can be challenging to obtain fully labeled data of high quality~\citep{zhu05survey,figueroa2012predicting}. \bigskip -In the following sections, different supervised learning methods were chosen to solve the classification task using the data that was manually labeled for features as described in~\autoref{ch:Dataset}. In \ref{subsec:FeatureEngineering}~\nameref{subsec:FeatureEngineering}, an approach using an \acrshort{mlp} is described for feature classification. The section of \ref{subsec:SingleLabel}~\nameref{subsec:SingleLabel} is concerned with labeling the input images with a \acrshort{cnn} in a binary setup for their designated features. In section \ref{subsec:MultiLabel}~\nameref{subsec:MultiLabel}, a neural network is trained on the data to label it not only for one feature but all features at the same time. In the fourth section, \ref{subsec:HeadNetwork}~\nameref{subsec:HeadNetwork}, a \acrshort{cnn} is used to train solely on the head image data for the features flower and rusty head. +In the following sections, different supervised learning methods were implemented to solve the classification task using the data that was manually labeled for features as described in~\autoref{ch:Dataset}. In \ref{subsec:FeatureEngineering}~\nameref{subsec:FeatureEngineering}, an approach using an \acrshort{mlp} is described for feature classification.
The section of \ref{subsec:SingleLabel}~\nameref{subsec:SingleLabel} is concerned with labeling the input images with a \acrshort{cnn} in a binary setup for their designated features. In section \ref{subsec:MultiLabel}~\nameref{subsec:MultiLabel}, a neural network is trained on the data to label it not only for one feature but all features at the same time. In the fourth section, \ref{subsec:HeadNetwork}~\nameref{subsec:HeadNetwork}, a \acrshort{cnn} is used to train solely on the head image data for the features flower and rusty head. Finally, in \ref{subsec:FeaturesToLabels}~\nameref{subsec:FeaturesToLabels}, a random forest approach is described to map the features of the image data to their class label. \subsubsection{Prediction based on feature engineering} \label{subsec:FeatureEngineering} -Besides approaches that directly use images as an input one may use high level feature engineering. That is, one can retrieve sparse representations that contain relevant information in a condensed form and apply classical machine learning classifiers such as \acrshortpl{mlp} to predict labels~\citep{zheng2018feature}. These classifiers are comparatively simple, fast to train and only few network hyperparameters have to be defined. One may argue that this is one of the major benefits as compared to networks of higher complexity (e.g.\ deep \acrshortpl{cnn}). As a consequence, finding suitable parameters for \acrshortpl{mlp} (i.e.\ the number of hidden layers and neurons per layer) is comparatively easy.\footnote{The challenge of finding appropriate network parameters is well known in the deep learning community: ``Designing and training a network using backprop requires making many seemingly arbitrary choices $[...]$. These choices can be critical, yet there is no foolproof recipe for deciding them because they are largely problem and data dependent’’~\citep[p.~9]{lecun2012efficient}. The requirement to specify hyperparameters is a disadvantage of neural networks (including \acrshortpl{mlp}) as compared to parameter free methods~\citep{scikit2019neural}. Due to the combinatorial explosion, the challenge of finding suitable parameter settings is harder for more complex networks as more options must be considered.} Beside of that retrieving sparse representations highly reduces the amount of variance and hence the required amount of labeled data. +Besides approaches that directly use images as an input, one may use high-level feature engineering. That is, one can retrieve sparse representations that contain relevant information in a condensed form and apply classical machine learning classifiers such as \acrshortpl{mlp} to predict labels~\citep{zheng2018feature}. Note that we use the ambiguous term \acrshort{mlp} in the strict sense, i.e.\ for networks that comprise fully connected layers only, and speak of \acrshortpl{cnn} if the network contains one or more convolutional layers. As we used \acrshortpl{mlp} for sparse features and \acrshortpl{cnn} for images, we contrast these typical approaches for the respective data domains. \acrshort{mlp} classifiers for sparse features are considered comparatively simple, as only a few network hyperparameters have to be defined, and they typically comprise a limited number of neurons (see below). Besides, retrieving sparse features can make the learning task very simple and allows for a higher degree of control over what is measured. -The simplicity of \acrshortpl{mlp} also means that the best suitable structure could be found more easily as compared to deepleaning \acrshortpl{cnn}.
In \acrshortpl{cnn} for deep learning, suitable means are required to avoid the vanishing gradient problem~\citep{wang2019vanishing}. Further, kernel sizes, strides, the number of kernels and other parameters must be defined. Therefore, designing shallow \acrshortpl{mlp} appears to be easier: As less decisions must be made, there is less one could possibly do wrong. Underfitting can potentially be due to an unsuitable network design or because predictions are impossible due to incongruencies or missing information in the sparse data set~\citep{lecun2012efficient}. +The simplicity of shallow \acrshortpl{mlp} means that a suitable structure can be found more easily than for deep \acrshortpl{cnn}, which require a special structure to address the vanishing gradient problem and have many other hyperparameters, such as the size and number of kernels~\citep{wang2019vanishing}. As a consequence, finding suitable hyperparameters for \acrshortpl{mlp} (i.e.\ the number of hidden layers and neurons per layer) is comparatively easy.\footnote{The challenge of finding appropriate network parameters is well known in the deep learning community: \enquote{Designing and training a network using backprop requires making many seemingly arbitrary choices $[...]$. These choices can be critical, yet there is no foolproof recipe for deciding them because they are largely problem and data dependent}~\citep[p.~9]{lecun2012efficient}. The requirement to specify hyperparameters is a disadvantage of neural networks (including \acrshortpl{mlp}) as compared to parameter free methods~\citep{scikit2019neural}. Due to the combinatorial explosion, the challenge of finding suitable parameter settings is harder for more complex networks as more options must be considered.} This is beneficial because an unsuitable network design is one potential explanation for underfitting, and it becomes less likely when the design is easy to get right~\citep{lecun2012efficient}. -An extensive search in hyperparameter space is practicable for simple \acrshortpl{mlp}. This is because of them having few parameters only and because training on sparse representations limits the number of neurons in the networks. Taken together this results in fast training that allows for many experiments. If the learning task is simple enough to be accomplished by \acrshortpl{mlp} (e.g.\ finding combinations of partial angles that correspond to the impression of curvature), one may hence speculate that underfitting can rather be explained by incongruencies or missing information in the labels than a result of issues in network design and training parameters. +An extensive search in hyperparameter space is possible for simple \acrshortpl{mlp} for sparse features, because they have few network hyperparameters and because networks for training on sparse representations typically have a limited number of neurons. This becomes clear in our example of learning the binary feature curvature using 18 partial angles as the input. The dimensionality of the input layer of networks that are trained on dense features (here: three images of size $1040\times1376$ with three color channels, i.e.\ more than twelve million values) is substantially larger. The reduced dimensionality of sparse features could even be considered the defining criterion for the term. For our example, a network that uses the raw input requires an input layer with more than twelve million neurons as compared to only 531 neurons in the whole \acrshort{mlp} for partial angles.
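For reference, and assuming that the partial-angle network uses the same four hidden layers of 128 neurons each as the histogram network described below, these numbers can be reconstructed as \(3 \cdot (1040 \times 1376) \cdot 3 = 12\,879\,360\) raw input values for the three images with three color channels, versus \(18 + 4 \cdot 128 + 1 = 531\) neurons in the whole partial-angle \acrshort{mlp}.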
Taken together, this results in fast training of a network that is trained on sparse features, typically an \acrshort{mlp}.\footnote{\acrshortpl{mlp} can be used for sparse data because there is no necessity to further limit the size of the network by use of convolutional layers, and it is not always the case that a locality criterion holds.} This allows for many experiments to find the most suitable network hyperparameters. -This classical machine learning approach which relies on feature engineering is applied to predict features based on color and partial angles of asparagus spears because of these benefits.\footnote{~See \url{https://github.com/CogSciUOS/asparagus/tree/FinalProject/classification/supervised/mlps\_and\_feature\_engineering}} +The use of \acrshortpl{mlp} in combination with feature engineering certainly also has drawbacks as compared to the direct application of deeper \acrshortpl{cnn}. Most strikingly, one is at risk of discarding relevant information when defining the features and has to find and implement ways to reliably measure them. Because of these benefits, we nonetheless tested the performance of shallow \acrshortpl{mlp} using color histograms and partial angles of asparagus spears.\footnote{~See \url{https://github.com/CogSciUOS/asparagus/tree/FinalProject/classification/supervised/mlps\_and\_feature\_engineering} (as of 11/27/2020)} \bigskip \textbf{Violet and rust prediction based on color histograms} -The feature violet is based on the distribution of sufficiently intense color hues in the violet color range. The initial approach of measuring the feature faces at least two drawbacks. First, it requires defining two thresholds. Second, the impression of a violet asparagus spear could possibly be affected by the combination of colors that are potentially outside the violet range or that are too pale to be considered (see~\autoref{subsec:Violet}). The same holds for the features rusty body and rusty head. Hence, in a second approach histograms were computed for foreground pixels after transforming the images to palette images with 256 color hues (see \autoref{sec:Preprocessing}). The resulting representation is a sparse descriptor that allows to predict color features using explicitly defined rules or trainable machine learning models. +The feature violet is based on the distribution of sufficiently intense color hues in the violet color range. The initial approach of automatically measuring the feature as described in~\autoref{sec:AutomaticFeatureExtraction} faces at least two drawbacks. First, it requires defining two thresholds. Second, the impression of a violet asparagus spear could possibly be affected by the combination of colors that are potentially outside the violet range or that are too pale to be considered (see~\autoref{subsec:Violet}). The same holds for the features rusty body and rusty head. Hence, in a second approach, histograms were computed for foreground pixels after transforming the images to palette images with 256 color hues (see \autoref{sec:Preprocessing}). The resulting representation is a sparse descriptor that allows predicting color features using explicitly defined rules or trainable machine learning models. \begin{figure}[!htb] \centering @@ -85,7 +85,7 @@ \subsubsection{Prediction based on feature engineering} \label{fig:FeatureEngineeringPaletteColor} \end{figure} -The sample set contains a histogram for each labeled asparagus. Twenty percent are randomly assigned to the evaluation set.
A simple \acrshort{mlp} with four hidden layers and 128 neurons in each of them was trained on the resulting normalized histograms of palette colors (ReLU activation / sigmoid activation in the final layer). Hyperparameters were optimized and the network was trained for a total of 500 epochs as the learning curve indicated convergence at this point. +The sample set contains a histogram for each labeled asparagus. Twenty percent are randomly assigned to the evaluation set. A simple \acrshort{mlp} with four hidden layers and 128 neurons in each of them was trained on the resulting normalized histograms of palette colors (ReLU activation / sigmoid activation in the final layer). Hyperparameters were optimized and the network was trained for a total of 500 epochs as the learning curve indicates convergence at this point. \bigskip \textbf{Curvature prediction based on partial angles} @@ -100,6 +100,8 @@ \subsubsection{Prediction based on feature engineering} \label{fig:FeatureEngineeringNetStructureCurve} \end{figure} +The receiver operating characteristic reveals that the prediction is of better quality for violet and curvature prediction as compared to rust prediction which is reflected in a smaller area under the curve for the latter. Possibly this reflects that rather small brown spots were considered rust by some raters but not by others. + \begin{table}[!h] \centering \resizebox{\columnwidth}{!}{% @@ -117,32 +119,30 @@ \subsubsection{Prediction based on feature engineering} \label{tab:performance_angle_based} \end{table} -The receiver operating characteristic reveals that the prediction is of better quality for violet and curvature prediction as compared to rust prediction which is reflected in a smaller area under the curve for the latter. Possibly this reflects that rather small brown spots were considered rust by some raters but not by others. - -Considering the low agreement in labeling, high values for the specificity and sensitivity of the classifier are not expected. A likely explanation for rather low values is that the model generalizes deviating and potentially contradicting perceptual color-concepts that we had when attributing labels which in return affects the reliability of the data. Arguably, only little information was discarded by computing the sparse representations that served as an input for training. However, information about irregularities in the outline are not reflected in the partial angles but might contribute to the perception of curvature. The same holds for the spatial distribution of colored pixels which might contain additional information regarding the detection of rust and violet. Nonetheless, the major criteria are captured. As \acrshortpl{mlp} are suitable to establish non-linear mappings and as the task of mapping high level features (such as partial angles) to human estimates appears to be rather simple, one may speculate that there is rather little potential to improve the predictive quality using other techniques. By introducing a bias, the sensitivity of the classifier can be adjusted at the cost of more false positives. Here, introducing a bias means that the threshold that is used to convert the floating point outputs of a neural network to booleans that indicate whether a feature is present or not is set to values other than 0.5. The possibility of making the classifier more or less sensitive appears to be a good option to be implemented as a feature for customization by the user in asparagus sorting machines. 
+Considering the low agreement in labeling, high values for the specificity and sensitivity of the classifier are not expected. A likely explanation for rather low values is that the model generalizes deviating and potentially contradicting perceptual color-concepts that we had when attributing labels. Arguably, only little information was discarded by computing the sparse representations that served as an input for training. However, information about irregularities in the outline are not reflected in the partial angles but might contribute to the perception of curvature. The same holds for the spatial distribution of colored pixels which might contain additional information regarding the detection of rust and violet. Nonetheless, the major criteria are captured. As \acrshortpl{mlp} are suitable to establish non-linear mappings and as the task of mapping high level features (such as partial angles) to human estimates appears to be rather simple, one may speculate that there is rather little potential to improve the predictive quality using other techniques. By introducing a bias, the sensitivity of the classifier can be adjusted at the cost of more false positives. Here, introducing a bias means that the threshold that is used to convert the floating point outputs of a neural network to booleans that indicate whether a feature is present or not is set to values other than 0.5. The possibility of making the classifier more or less sensitive appears to be a good option to be implemented as a feature for customization by the user in asparagus sorting machines. \begin{figure}[!hb] - \centering - \includegraphics[scale=0.7]{Figures/chapter04/fe_curve.png} - \decoRule - \caption[Feature Engineering Learning Curve For Angle Based Prediction]{\textbf{Learning Curve For Angle Based Prediction}~~~The depiction shows the loss per training episode for the \acrshort{mlp} trained on partial angles of the centerline of asparagus spears.} - \label{fig:FeatureEngineeringCurve} + \centering + \includegraphics[scale=0.7]{Figures/chapter04/fe_curve.png} + \decoRule + \caption[Feature Engineering Learning Curve For Angle Based Prediction]{\textbf{Learning Curve For Angle Based Prediction}~~~The depiction shows the loss per training episode for the \acrshort{mlp} trained on partial angles of the centerline of asparagus spears.} + \label{fig:FeatureEngineeringCurve} \end{figure} \begin{figure}[!htb] - \centering - \includegraphics[scale=0.6]{Figures/chapter04/fe_roc.png} - \caption[Feature Engineering ROC Curve]{\textbf{ROC for MLPs Trained on High Level Features}~~~The depiction shows the \acrfull{roc} for the classifiers that were trained on the features retrieved via feature engineering with \acrshortpl{mlp}. This allows to compare the performance. A larger area under the \acrshort{roc} curve indicates better performance while a curve close to the diagonal line indicates poor results.} - \label{fig:FeatureEngineeringROC} + \centering + \includegraphics[scale=0.6]{Figures/chapter04/fe_roc.png} + \caption[Feature Engineering ROC Curve]{\textbf{ROC for MLPs Trained on High Level Features}~~~The depiction shows the \acrfull{roc} for the classifiers that were trained on the features retrieved via feature engineering with \acrshortpl{mlp}. This allows for comparison of the performance. 
A larger area under the \acrshort{roc} curve indicates better performance while a curve close to the diagonal line indicates poor results.} + \label{fig:FeatureEngineeringROC} \end{figure} \subsubsection{Single-label classification} \label{subsec:SingleLabel} -In the following chapter, a \acrlong{cnn} is described, which is used for single-label classification on features.\footnote{It can be found under \url{https://github.com/CogSciUOS/asparagus/tree/FinalProject/classification/supervised/single-label-CNN}} The approach was tested on the 13319 hand-labeled data images. 13 models were created, each predicting one feature.~\footnote{The 13 features to predict are: fractured, hollow, flower, rusty head, rusty body, bent, violet, very thick, thick, medium thick, thin, very thin, and not classifiable.} +In the following section, a \acrlong{cnn} is described, which is used for single-label classification on features.\footnote{It can be found at \url{https://github.com/CogSciUOS/asparagus/tree/FinalProject/classification/supervised/single-label-CNN} (as of 11/27/2020).} The approach was tested on the 13319 hand-labeled data images. Thirteen models were created, each predicting one feature.\footnote{The 13 features to predict are: fractured, hollow, flower, rusty head, rusty body, bent, violet, very thick, thick, medium thick, thin, very thin, and not classifiable.} -A general model structure was needed as inspiration for the \acrshort{cnn}. For example, the \acrfull{vgg}~networks with varying depth seem to be a good choice for image classification as their \acrshort{vgg}16 won the ImageNet challenge of 2014 and is often implemented for image classification tasks~\citep{hassan2018vgg,vgg2014original}. However, there are two major drawbacks to using them. That is, they are slow to train and in need of a lot of memory storage due to their depth and the amount of fully-connected nodes \citep{hassan2018vgg,zhang2015accelerating}. Part of these problems also arise with even deeper networks like ResNet~\citep{resnet2016original,hassan2019resnet}. Thus, AlexNet is chosen as a blueprint for the \acrshort{cnn} because it is small in relation to other networks while still performing comparatively good~\citep{hassan2019alexnet,alexnet2012original,geron2019hands}. As the variance in the data images is relatively small it is assumed that not as many layers are needed as employed in deeper networks like \acrshort{vgg}~\citep{geron2019hands}. +A general model structure was needed as inspiration for the \acrshort{cnn}. For example, the \acrfull{vgg}~networks with varying depth seem to be a good choice for image classification as their \acrshort{vgg}16 won the ImageNet challenge of 2014 and is often implemented for image classification tasks~\citep{hassan2018vgg,vgg2014original}. However, there are two major drawbacks to using them. That is, they are slow to train and in need of a lot of memory due to their depth and the number of fully-connected layers \citep{hassan2018vgg,zhang2015accelerating}. Part of these problems also arises with even deeper networks like ResNet~\citep{resnet2016original,hassan2019resnet}. Thus, AlexNet is chosen as a blueprint for the \acrshort{cnn} because it is small in relation to other networks while still performing comparatively well~\citep{hassan2019alexnet,alexnet2012original,geron2019hands}. As the variance in the data images is relatively small, it is assumed that a smaller number of layers is sufficient as compared to deeper networks like \acrshort{vgg}~\citep{geron2019hands}.
\begin{figure}[!htb] \centering @@ -152,10 +152,10 @@ \subsubsection{Single-label classification} \end{figure} \bigskip -The network comprises four hidden layers: a convolutional layer, followed by a pooling layer, a second convolutional layer, and a dense layer. The input is an array of multiple horizontally stacked images with no background removed and reduced by a factor of six. This input is trained on a set of binary labels containing information on whether the respective feature label applies to the current image. The output of the network gives a prediction on each entered image gated by a sigmoid function on a range between zero and one. The rounded integer values of this output give a prediction of the apparent feature label. +The network comprises four hidden layers: a convolutional layer, followed by a pooling layer, a second convolutional layer, and a dense layer. The input is an array of multiple horizontally stacked images that keep their background and are reduced in size by a factor of six. The network is trained on a set of binary labels indicating whether the respective feature applies to the current image. For each input image, the network outputs a value between zero and one, gated by a sigmoid function. Rounding this output yields the predicted feature label. For the training phase of the model, the Adam optimizer is used because of its general acceptance as the state of the art optimizer for backpropagation~\citep{bushaev2018adam,kingma2014adam}. As a loss function, binary cross-entropy is used as it promises good results for binary single-label classification tasks~\citep{geron2019hands,godoy2018understanding,dertat2017applied}. -When training \acrlongpl{ann}, it can be difficult to find clear guidelines on how to implement an architecture such that an optimal training performance is given~\citep{heaton2015aifh,geron2019hands,bettilyon2018classify}. Hence, the idea was to start with the simplest form of a \acrshort{cnn} and then gradually increase the complexity of the network. While AlexNet provides a good baseline for an image classification network, its architecture is still assumed to be unnecessarily complex for the given task. First, the architecture was reduced to the minimum number of layers and parameters needed for a \acrshort{cnn}. Over the period of training optimization, various processing steps and hyperparameters were implemented and compared according to their performance. During this process, the data was split between 12000 samples for training data and 1319 samples of validation data in order to have a reasonable overview on the possible test performance and to directly check for overfitting. The data used as test data was randomly chosen from the whole data set.\footnote{Only the models trained on detecting the features hollow, flower, rusty head, rusty body, bent, and violet use the same validation set. Like this, a better comparison can be made between them. For the other features, the validation set was always randomly composed anew before model training.} +When training \acrlongpl{ann}, it can be difficult to find clear guidelines on how to implement an architecture such that an optimal training performance is given~\citep{heaton2015aifh,geron2019hands,bettilyon2018classify}. Hence, the idea was to start with the simplest form of a \acrshort{cnn} and then gradually increase the complexity of the network.
While AlexNet provides a good baseline for an image classification network, its architecture is still assumed to be unnecessarily complex for the given task. First, the architecture was reduced to the minimum number of layers and parameters needed for a \acrshort{cnn}. Over the period of training optimization, various processing steps and hyperparameters were implemented and compared according to their performance. During this process, the data was split into 12000 samples for training and 1319 samples for validation in order to have a reasonable overview of the possible test performance and to directly check for overfitting. The data used as test data was randomly chosen from the whole data set.\footnote{Only the models trained on detecting the features hollow, flower, rusty head, rusty body, bent, and violet use the same validation set. Like this, a better comparison can be made between them. For the other features, the validation set was always randomly composed anew before model training.} \bigskip In the following, the development of the hyperparameters over time is explained. @@ -170,11 +170,11 @@ \subsubsection{Single-label classification} Both convolutional layers of the model are built with batch normalization. The gradients were inspected visually and the results give no reason for assuming exploding or vanishing gradients~\citep{pascanu2012understanding}. -A next step was to weight the loss function because of the largely unbalanced data~\citep{he2009learning,batista2004study}. However, this did not lead to any changes. The model still tended to make an unbalanced prediction by classifying all values as negative samples. Another idea is to reduce the data set to make the number of images with the regarding feature present even to the number of images where it is absent. This would mean to throw away valuable data, which can otherwise provide information about negative cases to support the model in its training~\citep{batista2004study}. It was decided to keep all data images and instead balance the data by multiplying the minority of samples to match the number of contrary samples. As there was no feature positively exceeding a presence of 50\% in the data, solely positive labels were oversampled. The balancing was only performed on the training data, while the test data was not changed. +A next step was to weight the loss function because of the largely unbalanced data~\citep{he2009learning,batista2004study}. However, this did not lead to any changes. The model still tended to make an unbalanced prediction by classifying all values as negative samples. Another idea is to reduce the data set so that the number of images in which the respective feature is present equals the number of images in which it is absent. This would mean throwing away valuable data, which can otherwise provide information about negative cases to support the model in its training~\citep{batista2004study}. It was decided to keep all images and instead balance the data by duplicating the minority samples until they match the number of majority samples. As no feature was present in more than 50\% of the data, only positive labels were oversampled. The balancing was only performed on the training data, while the test data was not changed. To prevent overfitting of the negative data samples, \(L_2\) regularization was applied.
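To make this setup more concrete, the following is a minimal sketch in Keras of the kind of model and training configuration described above: two convolutional layers with batch normalization, a pooling layer, a dense layer with \(L_2\) regularization, a single sigmoid output, the Adam optimizer, and a binary cross-entropy loss. The function name, filter counts, kernel sizes, and input shape are illustrative assumptions, not the exact values used for the models in this section.

\begin{verbatim}
# Illustrative sketch only: layer sizes, kernel sizes, and the input
# shape are assumptions, not the exact values used in the project.
from tensorflow.keras import layers, models, regularizers

def build_single_feature_cnn(input_shape=(230, 520, 3)):
    model = models.Sequential([
        # First convolutional layer with batch normalization.
        layers.Conv2D(16, (3, 3), activation="relu",
                      input_shape=input_shape),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        # Second convolutional layer with batch normalization.
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.BatchNormalization(),
        layers.Flatten(),
        # Dense layer with L2 regularization against overfitting.
        layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4)),
        # One sigmoid unit: probability that the feature is present.
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
\end{verbatim}

Oversampling of the positive class, as described above, would then be applied to the training data only, before the model is fitted.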
-To improve training performance, some kinds of data augmentation were tested \citep{brownlee2019augmentation} like horizontal flipping or small changes in the angle (up to 5\textdegree ) but they are not used in the final version. +To improve training performance, different kinds of data augmentation were tested \citep{brownlee2019augmentation} like horizontal flipping or small changes in the angle (up to 5\textdegree ) but not used in the final version. \bigskip Around 2700 to 5400 training steps are performed for the training, translating roughly to 120 epochs (with an exception for the model of feature fractured, which is trained for 60 epochs). Due to the balancing of the data, the number of training data (and therefore the ratio of training steps to epochs) varied when training the \acrshort{cnn} on the different features. @@ -216,7 +216,7 @@ \subsubsection{Single-label classification} \hline \end{tabular}% } - \caption[Single-Label CNN Classification Results]{\textbf{Single-Feature Label Classification Results}~~~In this table, the sensitivity, specificity, validation accuracy, balanced accuracy, and the number of training steps are given for each feature model after training on 12000 original hand-labeled data images. The numbers in brackets indicate the best result for the model in relation to its balanced accuracy. For most features, training lasted for 120 epochs. However, the number of data samples (and thus training steps) varies between features because of the data balancing.} + \caption[Single-Label CNN Classification Results]{\textbf{Single-Feature Label Classification Results}~~~In this table, the sensitivity, specificity, validation accuracy, balanced accuracy, and the number of training steps are given for each feature model after training on 12000 hand-labeled data images. The numbers in brackets indicate the best result for the model in relation to its balanced accuracy. For most features, training lasted for 120 epochs. However, the number of data samples (and thus training steps) varies between features because of the data balancing.} \label{tab:SingleLabelResults} \end{table} @@ -228,7 +228,7 @@ \subsubsection{Single-label classification} \label{fig:SingleLabelROC} \end{figure} -The \acrshort{cnn} is trained on 13 features separately, resulting in 13 trained models. For some features, there are no labels in the csv-file but they can be calculated from the parameters length and width of an asparagus. That is, the features very thick, thick, medium, thin, and very thin are all calculated from the thickness measured by the automatic feature extraction algorithm for width, according to the boundaries for each class label (for reference, see \autoref{tab:AsparagusLabels} and \autoref{fig:LabelTree} in \autoref{sec:BackgroundSortingAsparagus}, and also \autoref{subsec:Width}). For the feature fractured, the length was set to a threshold of 210 mm, with all asparagus of smaller length labeled as fractured. Additionally, asparagus labeled as not classifiable was included for training the model that is detecting the feature fractured. The reason is that fractured asparagus without a head part was previously labeled as not classifiable (see the feature descriptions in \autoref{subsec:Length}). The feature not classifiable was also trained on separately. 
Further, for all other features, not classifiable samples were removed before training to prevent a bias in the occurrence of false positives.\footnote{Not classifiable asparagus was not sorted for the presence of any other features (see~\autoref{subsec:NotClassifiable}). If a sample is sorted by a model, e.g.\ that is detecting the presence of feature bent, and the sample shows the sorting criteria, i.e.\ it is bent, the model will classify it as positive but its feature label will be negative. This will disturb the model and, thus, it was decided to exclude not classifiable samples with an exception for training on feature fractured and for feature not classifiable.} +The \acrshort{cnn} is trained on 13 features separately, resulting in 13 trained models. For some features, there are no labels in the csv-file but they can be calculated from the parameters length and width of an asparagus. That is, the features very thick, thick, medium, thin, and very thin are all calculated from the thickness measured by the automatic feature extraction algorithm for width, according to the boundaries for each class label (for reference, see \autoref{tab:AsparagusLabels} and \autoref{fig:LabelTree} in \autoref{sec:BackgroundSortingAsparagus}, and also \autoref{subsec:Width}). For the feature fractured, the length was set to a threshold of 210 mm, with all asparagus of smaller length labeled as fractured. Additionally, asparagus labeled as not classifiable was included for training the model that is detecting the feature fractured. The reason is that fractured asparagus without a head part was previously labeled as not classifiable (see the feature descriptions in \autoref{subsec:Length}). The feature not classifiable was also trained on separately. Further, for all other features, not classifiable samples were removed before training to prevent a bias in the occurrence of false positives. \begin{figure}[!htb] \centering @@ -310,20 +310,20 @@ \subsubsection{Single-label classification} In~\autoref{tab:SingleLabelResults} the 13 features are listed on which the \acrshort{cnn} architecture was trained 13 times. It further shows the results for the sensitivity, specificity, validation accuracy, and balanced accuracy of each feature after a certain number of training steps, indicated in the last column of the table. Sensitivity is a measure to assess the performance of the model in labelling positive samples correctly, while the specificity shows how correctly the model predicts negative samples. The validation accuracy is the accuracy of the validation set which is a representative subset of the entire data set. The balanced accuracy of a feature label is the mean of the sum of its sensitivity and specificity. It represents the accuracy of the feature label if positive and negative samples in the data set were evenly balanced. -Additionally, the performance of six models is calculated in a \acrshort{roc} curve in~\autoref{fig:SingleLabelROC}.\footnote{For an explanation of \acrshort{roc} curve, see \autoref{fig:FeatureEngineeringROC} in the previous section.} Models that determine features which indicate the thickness and length, as well as the feature not classifiable are excluded from the curve. 
+Additionally, the performance of six models is calculated in a \acrshort{roc} curve in~\autoref{fig:SingleLabelROC}.\footnote{For an explanation of \acrshort{roc} curve, see \autoref{fig:FeatureEngineeringROC} in the previous section.} Models that determine features which indicate the thickness and length, as well as the feature not classifiable are excluded. In the following discussion, it is referred to the best result of a model (depicted in brackets in~\autoref{tab:SingleLabelResults}) and not the last result. The results reveal that for every feature, the sum of sensitivity and specificity exceeds 1. This corresponds to a balanced accuracy over 50\%, which is better than chance level. For all features, the balanced accuracy is above 65\%. Best results are achieved for features that indicate the thickness of an asparagus. The feature very thick reaches the best results with 98\% sensitivity, 99\% specificity, and a balanced accuracy of 98.5\%. Besides features for thickness, the best prediction is observed for the feature fractured, which relies on the parameter length. It reaches a sensitivity of 88\%, a specificity of 99.8\%, and a balanced accuracy of 94\%. -On average, for the hand-labeled features hollow, flower, rusty head, rusty body, bent, and violet a balanced accuracy above 72\% is reached. Feature rusty head performs worst with 52\% sensitivity, 81\% specificity, and 67\% balanced accuracy. In general, the specificity of all features is relatively high. Most features reach a specificity above 90\%, except for the features flower (83\%), rusty head (81\%), rusty body (80\%), and bent (73\%). For all features, the sensitivity is above 50\%. After visual inspection of test loss and training accuracy, none of the models showed any form of overfitting in their respective range of training steps. -Further, example images of wrong classification are shown for feature hollow in~\autoref{fig:ExampleImagesHollow} and for feature bent in~\autoref{fig:ExampleImagesBent}. Additional example images of false negative and false positive classification for the other features can be found in the appendix in~\autoref{subsec:AdditionalSingleLabelCNN}. For feature fractured, training and test loss as well as training and test accuracy are plotted as an exemplar in the appendix in~\autoref{fig:ExamplePlotsFractured}.\footnote{The models and their log files for accuracy and loss can be found at~\url{https://github.com/CogSciUOS/asparagus/tree/FinalProject/classification/supervised/single-label-CNN/asparanet}.} +On average, for the hand-labeled features a balanced accuracy above 72\% is reached. The feature rusty head performs worst with 52\% sensitivity, 81\% specificity, and 67\% balanced accuracy. In general, the specificity of all features is relatively high. Most features reach a specificity above 90\%, except for the features flower (83\%), rusty head (81\%), rusty body (80\%), and bent (73\%). For all features, the sensitivity is above 50\%. After visual inspection of test loss and training accuracy, none of the models showed any form of overfitting in their respective range of training steps. +Further, example images of wrong classification are shown for feature hollow in~\autoref{fig:ExampleImagesHollow} and for feature bent in~\autoref{fig:ExampleImagesBent}. Additional example images of false negative and false positive classification for the other features can be found in the appendix in~\autoref{subsec:AdditionalSingleLabelCNN}. 
For feature fractured, training and test loss as well as training and test accuracy are plotted as an exemplar in the appendix in~\autoref{fig:ExamplePlotsFractured}.\footnote{The models and their log files for accuracy and loss can be found at~\url{https://github.com/CogSciUOS/asparagus/tree/FinalProject/classification/supervised/single-label-CNN/asparanet} (as of 11/27/2020).} \bigskip -The results indicate that the \acrshort{cnn} architecture is able to learn every feature. Features relying on the parameters of length and width achieve a good performance, with both validation accuracy and balanced accuracy above 90\%. The prediction of features like hollow or flower was expected to be more difficult. However, the balanced accuracy of both is above 75\%, with feature hollow even reaching a balanced accuracy of 88\%. This shows that the model predicts them relatively well. Features that depend on color (like rusty body, rusty head, and violet) do not reach equivalent results. It should be further tested whether an increase in model depth might lead to better results for features depending on color or for features depending on a more complex shape (like the feature bent). +The results indicate that the \acrshort{cnn} architecture is able to learn every feature. Features relying on the parameters of length and width achieve a good performance, with both validation accuracy and balanced accuracy above 90\%. The prediction of features like hollow or flower is expected to be more difficult. However, the balanced accuracy of both is above 75\%, with feature hollow even reaching a balanced accuracy of 88\%. This shows that the model predicts them relatively well. Features that depend on color (like rusty body, rusty head, and violet) do not reach equivalent results. It should be further tested whether an increase in model depth might lead to better results for features depending on color or for features depending on a more complex shape. An inspection of false positive and false negative images at the end of the training process suggests that training performance might be influenced by mislabeled data to a certain extent. The random images for feature hollow in \autoref{fig:ExampleImagesHollow} propose that the model might use the thickness of an asparagus as an indicator. Some of the samples look like hollow asparagus that could have been labeled incorrectly. -For the feature bent, the randomly selected examples in~\autoref{fig:ExampleImagesBent} might reveal a labeling bias by the human annotators. When the feature is only slightly present, it seems to become a random choice whether the sample was labeled as positive or negative by the human labelers. If true, this can make it difficult for the machine to perform above a certain level of accuracy. It might also be that the difference between the samples is not prominent enough to the model. -Again, these images are just randomly chosen examples and can mean that the model is simply not able to label them correctly. However, the results can also suggest that correct labeling might sometimes be difficult for the model because of an inconsistency in the labeling behavior of the human labelers (see \autoref{subsec:Reliability}). +For the feature bent, the randomly selected examples in~\autoref{fig:ExampleImagesBent} might reveal a labeling bias by the human annotators. When the feature is only slightly present, it seems to become a random choice whether the sample was labeled as positive or negative.
If true, this can make it difficult for the machine to perform above a certain level of accuracy. It might also be that the difference between the samples is not prominent enough to the model. +The results can also suggest that correct classification might sometimes be difficult for the model because of an inconsistency in the labeling behavior of the human labelers (see \autoref{subsec:Reliability}). On a general note, the architecture of the model is very flexible. It can be applied to many tasks (i.e.\ predicting different features) without much preprocessing of the image data beforehand. Further, the model is quite small, which makes it fast and robust for practical applications. However, instead of having the same architecture for all features, more precise adjustment of each model to its feature is needed. @@ -333,91 +333,90 @@ \subsubsection{Multi-label classification} \label{subsec:MultiLabel} -Building on the standard single-label classification we were further interested in how well a model, that predicts several feature labels at the same time, performs. A multi-label classification model hereby gets an image as the input and learns to predict the presence or absence of the feature labels. +Building on the standard single-label classification, we were further interested in how well a model that predicts several feature labels at the same time performs. A multi-label classification model hereby gets an image as the input and learns to predict the presence or absence of the feature labels. For this model, we use a small \acrshort{cnn} as described below and the features that we labeled by hand. Each of the six features (hollow, flower, rusty head, rusty body, bent and violet) is encoded by a binary output in the target vector, indicating whether the asparagus exhibits the feature in question or not. \bigskip Multi-label classification is a useful tool for classification problems in which several classes are assigned to a single input. In contrast to a multiclass classification, where the model is supposed to predict the most likely class for an input, the multi-label classification makes a prediction for each class separately, determining whether the class is present in the image or not. While the different classes are mutually-exclusive in the multiclass classification, they can be related in the multi-label classification. Further, there is no limit on how many classes can be depicted in one image. It is possible that all or none of the classes are present. -Multi-label classification tasks can be thought of as consisting of different sub tasks. Therefore, the problem can be transformed to multiple binary classification tasks. In this transformation, a new model for each feature is trained, which are then combined to give a single output. That means that all features are independent of one another because they are learned separately. This can be seen as one of the major drawbacks as it is not always clear whether features are related, but in many cases they are. +Multi-label classification tasks can be thought of as consisting of different subtasks. Therefore, the problem can be transformed into multiple binary classification tasks. In this transformation, a separate model is trained for each feature, and the models are then combined to give a single output. That means that all features are independent of one another because they are learned separately. This can be seen as one of the major drawbacks, as in many cases features are related.
Therefore, we decided to not only use single-label classification as described in~\autoref{subsec:SingleLabel} but to explore the possibilities of multi-label classification. -A second approach to transform a multi-label classification task is to interpret each possible combination of features as on class. Hereby, the problem is redefined as a multiclass classification task. For a classification problem with six features that means there are \(2^{6} = 64\) classes to be learned. The problems with this approach are, on the one hand, the exponentially increasing number of classes, and on the other hand, the sparsity of samples per class. In many cases, some of the classes are highly underrepresented or even empty. For that reason, we decided not to elaborate this approach further and implement a model for multi-label classification without transforming the task to a multiclass problem. +A second approach to transform a multi-label classification task is to interpret each possible combination of features as one class. Hereby, the problem is redefined as a multiclass classification task. For a classification problem with six features that means there are \(2^{6} = 64\) classes to be learned. The problems with this approach are, on the one hand, the exponentially increasing number of classes, and on the other hand, the sparsity of samples per class. In many cases, some of the classes are highly underrepresented or even empty. For that reason, we decided not to elaborate this approach further and implement a model for multi-label classification without transforming the task to a multiclass problem. \bigskip -Inspiration for the model gave a blogpost~\citep{blogpostMulti} which aims to classify images of the MNIST fashion data set in the context of multi-label, rather than multiclass classification. The author altered the data set in such a way that each input image contains four randomly selected items from the MNIST fashion data set. The model then learns to predict which classes are present in the image. The target vector has ten values, one for each class, which are either 0 or 1 depending on whether that class can be found in the input image or not. - +Inspiration for the model came from a blogpost~\citep{blogpostMulti} which aims to classify images of the MNIST fashion data set in the context of multi-label, rather than multiclass classification. This model was chosen as inspiration for two main reasons. Firstly, it tackles a problem similar to ours and the number of classes is similar. Secondly, the model uses a data set of similar size. Despite the rather small data set in comparison to many other image classification problems, good results with an accuracy of 95 -- 96\% were reached~\citep{blogpostMulti}. This leads us to think that it might be a model of suitable complexity for our problem too, as it is complex enough to model the underlying distribution, but not too complex for the medium-sized data set. \begin{table}[!htb] - \centering - \includegraphics[scale=0.8]{Figures/chapter04/multilabel_structure.png} - \decoRule - \caption[Multi-Label Model Summary]{\textbf{Multi-Label Model Summary}~~~The summary of the multi-label classification model is shown.
It describes which layers are implemented, how the output changes in each layer and how many parameters are trained in each layer and in total.} - \label{tab:MultilabelStructure} + \centering + \includegraphics[scale=0.8]{Figures/chapter04/multilabel_structure.png} + \decoRule + \caption[Multi-Label Model Summary]{\textbf{Multi-Label Model Summary}~~~The summary of the multi-label classification model is shown. It describes which layers are implemented, how the output changes in each layer and how many parameters are trained in each layer and in total.} + \label{tab:MultilabelStructure} \end{table} \bigskip A classical \acrshort{cnn} was chosen for the multi-label classification task. It consists of five blocks of convolution layers with max pooling layers each, followed by a global average pooling layer and a dense layer. \begin{figure}[!htb] - \centering - \includegraphics[width=0.95\textwidth]{Figures/chapter04/multilabel_net_structure.png} - \decoRule - \caption[Multi-Label Net Structure]{\textbf{Multi-Label Net Structure}~~~The depiction shows the network structure of the multi-label \acrshort{cnn}.} - \label{tab:MultilabelNetStructure} + \centering + \includegraphics[width=0.95\textwidth]{Figures/chapter04/multilabel_net_structure.png} + \decoRule + \caption[Multi-Label Net Structure]{\textbf{Multi-Label Net Structure}~~~The depiction shows the network structure of the multi-label \acrshort{cnn}.} + \label{tab:MultilabelNetStructure} \end{figure} In contrast to multiclass classification models, where usually a softmax activation function is used in the last layer together with a categorical cross-entropy loss, the multi-label classification model uses a sigmoid activation function and a binary cross-entropy loss. -As the input of the model, a concatenated image of the three perspectives of each spear is used in order to maximize the information the model gets. This yields input images that look like the three asparagus spears are laying side by side. Further, the images are downscaled by a factor of six to facilitate training (see~\autoref{sec:AsparagusDataSet}). +As the input of the model, a concatenated image of the three perspectives of each spear is used in order to maximize the information the model gets. This yields input images where three asparagus spears are lying side by side. Further, the images are downscaled by a factor of six to facilitate training (see~\autoref{sec:AsparagusDataSet}). -The output of the model is a vector of length six in which each position encodes one of the six hand-labeled features (hollow, flower, rusty head, rusty body, bent and violet). Each feature can either be present in the input or not, which leads to a 1 or 0 in the target vector, respectively. +As stated above, the output of the model is a vector of length six in which each position encodes one of the six hand-labeled features (hollow, flower, rusty head, rusty body, bent and violet). Each feature can either be present in the input or not, which leads to a 1 or 0 in the target vector, respectively. Three loss functions are tested to improve the model's performance. The first two losses are built-in functions from Keras, namely binary cross-entropy loss and hamming loss. The latter uses the fraction of the wrong labels to the total number of labels. Additionally, a custom loss function was implemented that penalizes false negatives more strongly than false positives. The motivation for this custom loss was the fact that the two labels 0 and 1 are highly unbalanced.
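
To make the described loss setup concrete, the following is a minimal sketch of such a weighted binary cross-entropy in Keras, together with the six-unit sigmoid output layer used for the target vector. The weighting factor \texttt{fn\_weight} is an illustrative assumption and not necessarily the exact value used in our implementation.

\begin{verbatim}
import tensorflow as tf
from tensorflow.keras import backend as K

def weighted_binary_crossentropy(fn_weight=3.0):
    # Binary cross-entropy in which errors on positive labels (missed
    # features, i.e. false negatives) are up-weighted by the hypothetical
    # factor fn_weight (> 1 penalizes false negatives more strongly).
    def loss(y_true, y_pred):
        y_true = K.cast(y_true, "float32")
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
        fn_term = -fn_weight * y_true * K.log(y_pred)          # false negatives
        fp_term = -(1.0 - y_true) * K.log(1.0 - y_pred)        # false positives
        return K.mean(fn_term + fp_term, axis=-1)
    return loss

# Output layer of the multi-label CNN: one sigmoid unit per hand-labeled feature.
output_layer = tf.keras.layers.Dense(6, activation="sigmoid")

# model.compile(optimizer="adam",
#               loss=weighted_binary_crossentropy(fn_weight=3.0),  # or "binary_crossentropy"
#               metrics=["binary_accuracy"])
\end{verbatim}
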
As previously stated, there are noticeably more 0s than 1s in many classes. To be more precise, the model can reach an accuracy of 77\% by labeling all features as 0. By penalizing this error more, we intended to counteract the unbalanced data set. But in the end, the binary cross-entropy loss remains the one with the best results. -Further, it was tested whether regularization would improve the performance of the model on the validation data by preventing overfitting. For this, the model was trained by adding \(L_1\) or \(L_2\) regularization, respectively, to all five of the convolutional layers. Hereby, a kernel regularization was implemented with a value of 0.01. +Further, it was tested whether regularization would improve the performance of the model on the validation data by preventing overfitting. For this, the model was trained by adding \(L_1\) or \(L_2\) regularization, respectively, to all five of the convolutional layers. Hereby, a kernel regularization was implemented with a value of 0.01. \(L_1\) and \(L_2\) regularization can both be interpreted as constraints on the optimization that have to be considered when minimizing the loss term. The main difference between the two is that \(L_1\) regularization reduces the coefficient of irrelevant features to zero, which means they are removed completely. Hence, \(L_1\) regularization allows for sparse models and can be seen as a selection mechanism for features. The inputs to this model are images that consist of a large number of pixels, and additionally a large portion of those pixels are black, because the background was removed. Therefore, it appears to be a good idea to reduce the number of features taken into account in the early layers. \(L_2\) regularization, on the contrary, does not set coefficients to zero, but punishes large coefficients more than smaller ones. This way the error is better distributed over the whole vector. \begin{figure}[!htb] - \centering - \includegraphics[scale=0.37]{Figures/chapter04/multilabel_crossentropy.png} - \decoRule - \caption[Multi-Label Binary Cross-Entropy Loss]{\textbf{Binary Cross-Entropy Loss}~~~These graphs show the evaluation of the training with binary cross-entropy loss. The model was trained over 50 epochs and the accuracy and loss was measured. Further, the false/true positive/negative rates were determined.} - \label{fig:MultilabelCrossentropy} + \centering + \includegraphics[scale=0.37]{Figures/chapter04/multilabel_crossentropy.png} + \decoRule + \caption[Multi-Label Binary Cross-Entropy Loss]{\textbf{Binary Cross-Entropy Loss}~~~These graphs show the evaluation of the training with binary cross-entropy loss. The model was trained over 50 epochs and the accuracy and loss was measured. Further, the false/true positive/negative rates were determined.} + \label{fig:MultilabelCrossentropy} \end{figure} \begin{figure}[!htb] - \centering - \includegraphics[scale=0.37]{Figures/chapter04/multilabel_hamming.png} - \decoRule - \caption[Multi-Label Hamming Loss]{\textbf{Hamming Loss}~~~These graphs show the evaluation of the training with hamming loss.
The model was trained over 50 epochs and the accuracy and loss was measured. Further, the false/true positive/negative rates were determined.} + \label{fig:MultilabelHammingLoss} \end{figure} \begin{figure}[!htb] - \centering - \includegraphics[scale=0.37]{Figures/chapter04/multilabel_costum.png} - \decoRule - \caption[Multi-Label Custom Loss]{\textbf{Custom Loss}~~~These graphs show the evaluation of the training with custom loss that punishes falsely classified ones more than falsely classified 0s. The model was trained over 50 epochs and the accuracy and loss was measured. Further, the false/true positive/negative rates were determined.} - \label{fig:MultilabelCostumLoss} + \centering + \includegraphics[scale=0.37]{Figures/chapter04/multilabel_costum.png} + \decoRule + \caption[Multi-Label Custom Loss]{\textbf{Custom Loss}~~~These graphs show the evaluation of the training with custom loss that punishes falsely classified ones more than falsely classified zeros. The model was trained over 50 epochs and the accuracy and loss was measured. Further, the false/true positive/negative rates were determined.} + \label{fig:MultilabelCostumLoss} \end{figure} \begin{figure}[!htb] - \centering - \includegraphics[scale=0.37]{Figures/chapter04/multilabel_L1.png} - \decoRule - \caption[Multi-Label \(L_1\) Regularization]{\textbf{\(L_1\) Regularization}~~~These graphs show the evaluation of the training with \(L_1\) regularization. As the loss function the binary cross-entropy loss was used. The model was trained over 25 epochs and the accuracy and loss was measured. Further, the false/true positive/negative rates were determined.} - \label{fig:MultilabelL1Regularization} + \centering + \includegraphics[scale=0.37]{Figures/chapter04/multilabel_L1.png} + \decoRule + \caption[Multi-Label \(L_1\) Regularization]{\textbf{\(L_1\) Regularization}~~~These graphs show the evaluation of the training with \(L_1\) regularization. As the loss function the binary cross-entropy loss was used. The model was trained over 25 epochs and the accuracy and loss was measured. Further, the false/true positive/negative rates were determined.} + \label{fig:MultilabelL1Regularization} \end{figure} \begin{figure}[!htb] - \centering - \includegraphics[scale=0.37]{Figures/chapter04/multilabel_L2.png} - \decoRule - \caption[Multi-Label \(L_2\) Regularization]{\textbf{\(L_2\) Regularization}~~~These graphs show the evaluation of the training with \(L_2\) regularization. As the loss function the binary cross-entropy loss was used. The model was trained over 25 epochs and the accuracy and loss was measured. Further, the false/true positive/negative rates were determined.} - \label{fig:MultilabelL2Regularization} + \centering + \includegraphics[scale=0.37]{Figures/chapter04/multilabel_L2.png} + \decoRule + \caption[Multi-Label \(L_2\) Regularization]{\textbf{\(L_2\) Regularization}~~~These graphs show the evaluation of the training with \(L_2\) regularization. As the loss function the binary cross-entropy loss was used. The model was trained over 25 epochs and the accuracy and loss was measured. Further, the false/true positive/negative rates were determined.} + \label{fig:MultilabelL2Regularization} \end{figure} As shown in \autoref{fig:MultilabelCrossentropy}, \autoref{fig:MultilabelHammingLoss} and \autoref{fig:MultilabelCostumLoss}, all the different approaches explained above show a similar behavior in accuracy and loss values. 
The training and validation accuracy increase slowly but steadily with the training accuracy always being a little higher than the validation accuracy. The training loss decreases rapidly, while the validation loss only decreases very little and shows random fluctuations. This can be an indicator of overfitting. Usually, \(L_1\) and \(L_2\) regularization are used to prevent overfitting, but in our case they did not improve the results, as shown in \autoref{fig:MultilabelL1Regularization} and \autoref{fig:MultilabelL2Regularization}. @@ -425,14 +424,14 @@ \subsubsection{Multi-label classification} When looking at the sensitivity and specificity, it becomes apparent that both increase during the training process, while the false negative and false positive rates decrease with the same slope. The false positive and false negative rates are mirror images to the true positive and true negative rates with the mirroring axis at the 50\% mark. It can be observed that the rates change rapidly in the first two to four epochs, after which the change progresses slowly in the same direction with no greater disturbances. The model trained with the \(L_2\) loss is an exception as it does not show these large changes in either of the rates. \bigskip -When comparing the three different loss functions, it is noticeable that the binary cross-entropy loss has significantly larger accuracy values than the hamming loss and the custom loss. The values of the binary cross-entropy loss start at 75\%, while they start at around 30\% for the other two loss functions. The behavior of the curves and the (mis-)classification rates, however, are very similar in all three approaches. The specificities start off very high with values around 78.46\% for the binary cross-entropy loss, 78.73\% for the hamming loss and 74.74\% for the custom loss, and increase further during the training. The highest values are reached with the binary cross-entropy loss (91.41\%), closely followed by the hamming loss (91.37\%) and the custom loss (88.75\%). The sensitivity values start off lower, at around 37\% to 42\%, and increase rapidly in the first few epochs, after which the rates proceed to increase but with a narrower slope. They reach values of up to 67.27\% with the binary cross-entropy loss, 68.17\% with the hamming loss and 68.17\% with the custom loss. As stated above, the false negative and false positive rates show the same slope but in the opposite direction. +When comparing the three different loss functions, it is noticeable that the binary cross-entropy loss has significantly larger accuracy values than the hamming loss and the custom loss. The values of the binary cross-entropy loss start at 75\% and reach a maximum of 87\%, while they start at around 30\% and only reach values between 45\% and 47\% for the other two loss functions. The behavior of the curves and the (mis-)classification rates, however, are very similar in all three approaches. The specificities start off very high with values around 78.46\% for the binary cross-entropy loss, 78.73\% for the hamming loss and 74.74\% for the custom loss, and increase further during the training. The highest values are reached with the binary cross-entropy loss (91.41\%), closely followed by the hamming loss (91.37\%) and the custom loss (88.75\%). The sensitivity values start off lower, at around 37\% to 42\%, and increase rapidly in the first few epochs, after which the rates proceed to increase but with a narrower slope.
They reach values of up to 67.27\% with the binary cross-entropy loss, 68.17\% with the hamming loss and 68.17\% with the custom loss. As stated above, the false negative and false positive rates show the same slope but in the opposite direction. The accuracy values of the models that are trained with \(L_1\) or \(L_2\) regularization, respectively, do not change over the epochs. The same holds for the validation loss. The training loss decreases in the first few epochs and remains stable thereafter. While the (mis-)classification rates of the model trained with \(L_1\) regularization behave similarly to the ones trained with no regularization, the rates of the model trained with \(L_2\) regularization show a smaller increase and lack the fast change in the first epochs. The slopes of all curves indicate that the model is learning, because they are increasing in the case of the accuracy, sensitivity and specificity and decreasing in the case of the loss, false positive and false negative rates until the end of training. Hence, one might think that a longer training period will lead to better results. However, the training loss decreases very rapidly while the validation loss does not. This suggests overfitting of the model, a problem which gets worse when increasing the training steps. Therefore, a longer training period most likely will not increase performance unless overfitting is prevented. As shown in the result section, neither \(L_1\) nor \(L_2\) regularization alone were able to prevent overfitting. -Another common practice that can be tested is the drop-out, in which a certain amount of nodes are left out in different backpropagation steps. This way the model learns to not rely on a small number of nodes but distribute the information between all nodes available. Hence, the coefficients remain smaller. Another method to prevent overfitting is to reduce the model's complexity. A model with fewer parameters to train, is less prone to overfitting. A fitting degree of complexity should be found to model the data sufficiently good without losing the possibility of generalization. +Another common practice that can be tested in the future is dropout, in which a certain fraction of nodes is left out in different backpropagation steps. This way, the model learns not to rely on a small number of nodes but to distribute the information across all available nodes. Hence, the coefficients remain smaller. Another method to prevent overfitting is to reduce the model's complexity. A model with fewer parameters to train is less prone to overfitting. A fitting degree of complexity should be found to model the data sufficiently well without losing the ability to generalize. Accuracy alone might not be a good indicator to evaluate a multi-label model~\citep{gibaja2015}. As it highly depends on the loss function, it may yield misleading results. This can be seen in the comparison between the three different loss functions. Although the sensitivity and specificity show similar values, the accuracy values suggest that the binary cross-entropy loss outperforms the other two loss functions by far. The model trained with the binary cross-entropy loss has an accuracy more than twice as high. But when looking at the slope of the curve, it appears that the model with the binary cross-entropy loss does not perform better than the other two models. All three have an increase of accuracy of roughly 10\% and a similar sensitivity and specificity.
This indicates that the slope of the accuracy function can be considered to evaluate the training process of the model, but the real values should be interpreted with caution. @@ -462,29 +461,11 @@ \subsubsection{A dedicated network for head-related features} \includegraphics[width=0.95\textwidth]{Figures/chapter04/head_net_structure.png} \decoRule \caption[Head Features Net Structure]{\textbf{Head Features Net Structure}~~~The depiction shows the network structure of the \acrshort{cnn} specifically aimed at head features.} - \label{tab:HeadNetStructure} + \label{fig:HeadNetStructure} \end{figure} \bigskip -A simple feedforward \acrshort{cnn} was trained on the images.\footnote{~See \url{https://github.com/CogSciUOS/asparagus/tree/FinalProject/classification/supervised/dedicated\_head\_network}} The features flower and rusty head are chosen as target categories. Hence, the model is an example for multi-label classification. The network comprises the input layer, three convolutional layers with kernel size two, a fully connected layer with 128 neurons as well as the output layer. For the final layer, the sigmoid activation function is applied while the hidden layers have ReLU activations. A dropout layer was added to avoid overfitting. The network was trained using \acrfull{mse} as an error function. The development of loss in the learning curve indicates convergence after 40 epochs (see \autoref{fig:HeadCurve}). - -\begin{table}[!b] - \centering - \resizebox{\columnwidth}{!}{% - \begin{tabular}{lrrrrrr} - {} & False positive & False negative & True positive & True negative & Sensitivity & Specificity \\ - \noalign{\smallskip} - \hline - \noalign{\smallskip} - flower & 0.04 & 0.06 & 0.08 & 0.82 & 0.55 & 0.95 \\ - rusty head & 0.02 & 0.13 & 0.03 & 0.83 & 0.19 & 0.98 \\ - \noalign{\smallskip} - \hline - \end{tabular}% - } - \caption[Head Features CNN Performance]{\textbf{Performance of Head Features CNN}~~~Performance of the \acrshort{cnn} trained on asparagus heads.} - \label{tab:performance_measures_head_based} -\end{table} +A simple feedforward \acrshort{cnn} was trained on the images.\footnote{~See \url{https://github.com/CogSciUOS/asparagus/tree/FinalProject/classification/supervised/dedicated\_head\_network} (as of 11/27/2020)} The features flower and rusty head are chosen as target categories. Hence, the model is an example for multi-label classification. The network comprises the input layer, three convolutional layers with kernel size two, a fully connected layer with 128 neurons as well as the output layer. For the final layer, the sigmoid activation function is applied while the hidden layers have ReLU activations. A dropout layer was added to avoid overfitting. The network was trained using \acrfull{mse} as an error function. The development of loss in the learning curve indicates convergence after 40 epochs (see \autoref{fig:HeadCurve}). \begin{figure}[!htb] \centering @@ -508,50 +489,68 @@ \subsubsection{A dedicated network for head-related features} The \acrshort{roc} curve indicates how the classifiers respond to the introduction of a bias and shows the overall prediction quality. In \autoref{fig:HeadROC} the area under the curve is small for the feature rusty head. Beside incongruencies in the labels, this is possibly due to the choice of the size of the head region. It might be the case that brown spots in regions other than the cropped part were considered as an indicator for a rusty head when attributing labels. 
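
For reference, the dedicated head network described above can be summarized in a short Keras sketch. Kernel size two, the 128-unit dense layer, ReLU activations, the dropout layer, the two sigmoid output units and the \acrshort{mse} loss follow the description given above, whereas the input size, the number of filters, the pooling between the convolutional layers and the dropout rate are illustrative assumptions.

\begin{verbatim}
from tensorflow.keras import layers, models

def build_head_feature_cnn(input_shape=(128, 128, 3)):
    # Small CNN for the two head-related features (flower, rusty head).
    model = models.Sequential([
        layers.Conv2D(16, kernel_size=2, activation="relu",
                      input_shape=input_shape),
        layers.MaxPooling2D(),              # assumption: pooling between blocks
        layers.Conv2D(32, kernel_size=2, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, kernel_size=2, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),                # assumption: dropout rate 0.5
        layers.Dense(2, activation="sigmoid"),  # flower, rusty head
    ])
    model.compile(optimizer="adam", loss="mse",
                  metrics=["binary_accuracy"])
    return model
\end{verbatim}
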
Improvements by increasing the cropped head region appear to be possible. +\begin{table}[!htb] + \centering + \resizebox{\columnwidth}{!}{% + \begin{tabular}{lrrrrrr} + {} & False positive & False negative & True positive & True negative & Sensitivity & Specificity \\ + \noalign{\smallskip} + \hline + \noalign{\smallskip} + flower & 0.04 & 0.06 & 0.08 & 0.82 & 0.55 & 0.95 \\ + rusty head & 0.02 & 0.13 & 0.03 & 0.83 & 0.19 & 0.98 \\ + \noalign{\smallskip} + \hline + \end{tabular}% + } + \caption[Head Features CNN Performance]{\textbf{Performance of Head Features CNN}~~~Performance of the \acrshort{cnn} trained on asparagus heads.} + \label{tab:performance_measures_head_based} +\end{table} + \subsubsection{From hand-labeled features to class labels} \label{subsec:FeaturesToLabels} -Approximately 200 asparagus spears per class label are pre-sorted\footnote{Pre-sorted in this context means that the asparagus spears were first sorted by the sorting machine and, if needed, re-sorted manually by professional workers.} and serve as ground truth mappings between input images and output class labels. We manually annotated the images with features (see \autoref{sec:ManualLabeling}). This allows us to divide the classification process into two steps: In the first step we predict feature values from images and in a second step we predict class labels from those feature values. +Approximately 200 asparagus spears per class label serve as ground truth mappings between input images and output class labels. We manually annotated the images with features (see \autoref{sec:ManualLabeling}). This allows us to divide the classification process into two steps: In the first step we predict feature values from images and in a second step we predict class labels from those feature values. \begin{figure}[!htb] - \centering - \includegraphics[width=0.60\textwidth]{Figures/chapter04/ftl_pipeline.png} - \decoRule - \caption[Feature to Class Label Pipeline]{\textbf{Feature to Class Label Pipeline}~~~The figure shows the flow of data and possible visualizations of our end to end solution app. We developed an app into which the user can load the asparagus images and the corresponding annotations. The app then visualizes the class label distribution. The user then can choose a model to predict the features of the asparagus pieces and inspect predictions for individually selected samples. Next, the user chooses a model that predicts class labels based on features of the asparagus. The app then presents a classification report, a confusion matrix and the possibility to inspect individual samples to visualize the prediction performance.} - \label{fig:FeatureEngineeringNetStructure} + \centering + \includegraphics[width=0.60\textwidth]{Figures/chapter04/ftl_pipeline.png} + \decoRule + \caption[Feature to Class Label Pipeline]{\textbf{Feature to Class Label Pipeline}~~~The figure shows the flow of data and visualizations of our end to end solution app. We developed an app into which the user can load the asparagus images and the corresponding annotations. The app then visualizes the class label distribution. The user can choose a model to predict the features of the asparagus pieces and inspect predictions for individually selected samples. Next, the user chooses a model that predicts class labels based on features of the asparagus. 
The app then presents a classification report, a confusion matrix and the possibility to inspect individual samples to visualize the prediction performance.} + \label{fig:FeatureEngineeringNetStructure} \end{figure} \begin{figure}[!htb] - \centering - \includegraphics[scale=0.4]{Figures/chapter04/ftl_streamlit_app.png} - \decoRule - \caption[Screenshot of the Streamlit App]{\textbf{Screenshot of Streamlit App}~~~Screenshot of the streamlit app showing the three images of one asparagus spear with the corresponding labeled features.} - \label{fig:FeaturetoLabelStreamlitApp} -\end{figure} + \centering + \includegraphics[scale=0.4]{Figures/chapter04/ftl_streamlit_app.png} + \decoRule + \caption[Screenshot of the Streamlit App]{\textbf{Screenshot of Streamlit App}~~~Screenshot of the streamlit app showing the three images of one asparagus spear with the corresponding labeled features.} + \label{fig:FeaturetoLabelStreamlitApp} +\end{figure} \begin{figure}[!htb] - \centering - \includegraphics[scale=0.55]{Figures/chapter04/ftl_visualization.png} - \decoRule - \caption[Distribution of Class Labels]{\textbf{Distribution of Class Labels}~~~Absolute number of the asparagus spears for which a ground truth class label is available.} - \label{fig:FeatureToLabelVisualization} + \centering + \includegraphics[scale=0.55]{Figures/chapter04/ftl_visualization.png} + \decoRule + \caption[Distribution of Class Labels]{\textbf{Distribution of Class Labels}~~~Absolute number of the asparagus spears for which a ground truth class label is available.} + \label{fig:FeatureToLabelVisualization} \end{figure} \begin{figure}[!htb] - \centering - \includegraphics[scale=0.4]{Figures/chapter04/ftl_confusion_recall_random_forest.png} - \decoRule - \caption[Random Forest Classifier Confusion Matrices]{\textbf{Random Forest Classifier Confusion Matrices}~~~Confusion matrices showing the absolute and relative number of true positives of the random forest model.} - \label{fig:FeatureToLabelRandomForest} + \centering + \includegraphics[scale=0.4]{Figures/chapter04/ftl_confusion_recall_random_forest.png} + \decoRule + \caption[Random Forest Classifier Confusion Matrices]{\textbf{Random Forest Classifier Confusion Matrices}~~~Confusion matrices showing the absolute and relative number of true positives of the random forest model.} + \label{fig:FeatureToLabelRandomForest} \end{figure} %\begin{table}[!ht] -% \centering -% \includegraphics[scale=0.5]{Figures/chapter04/ftl_classification_report.png} -% \decoRule -% \caption[Classification Report of the Random Forest Classifier]{\textbf{Classification Report of the Random Forest Classifier}~~~The classification report shows key metrics for the trained random forest classifier.} -% \label{tab:FeatureToLabelReport} +% \centering +% \includegraphics[scale=0.5]{Figures/chapter04/ftl_classification_report.png} +% \decoRule +% \caption[Classification Report of the Random Forest Classifier]{\textbf{Classification Report of the Random Forest Classifier}~~~The classification report shows key metrics for the trained random forest classifier.} +% \label{tab:FeatureToLabelReport} %\end{table} \bigskip @@ -559,7 +558,7 @@ \subsubsection{From hand-labeled features to class labels} The user can also inspect the distribution of the selected data (\autoref{fig:FeatureToLabelVisualization}) and the absolute and relative number of correctly and incorrectly classified asparagus spears in a confusion matrix (\autoref{fig:FeatureToLabelRandomForest}). 
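
The second step of the pipeline, predicting class labels from the hand-labeled features, can be sketched with a random forest classifier as follows. The file name, the column layout and the hyperparameters are illustrative assumptions and do not necessarily match the trained model behind \autoref{fig:FeatureToLabelRandomForest}.

\begin{verbatim}
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical column layout: one column per hand-labeled feature plus the
# pre-sorted class label; the CSV produced by the annotation app may differ.
FEATURES = ["hollow", "flower", "rusty_head", "rusty_body",
            "bent", "violet", "length", "width"]

annotations = pd.read_csv("annotations.csv")          # assumed file name
X = annotations[FEATURES].values
y = annotations["class_label"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))  # precision and recall per class
print(confusion_matrix(y_test, y_pred))       # counts per true/predicted class
\end{verbatim}
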
-The precision and recall are measures on how \enquote{useful and complete} the results are \citep{wiki:precisionrecall}. Precision is the ratio between true positives and the set of all true and false positives.Recall is the ratio between the true positives and the set of all true positives and false negatives. The confusion matrix (\autoref{fig:FeatureToLabelRandomForest}) gives us insight about which kind of errors the models make. We can observe that classes such as I~A~Anna, Dicke, Hohle, Rost, Köpfe, and Suppe can be recalled well (relative recall $\geq$ 0.8), while other classes, such as II~A and II~B have much lower recall ratings (relative recall $\leq$ 0.6). That means that II~A and II~B are the most commonly mislabeled classes (cf.~\autoref{tab:FeatureToLabelReport}). +The precision and recall are measures on how \enquote{useful and complete} the results are \citep{wiki:precisionrecall}. Precision is the ratio between true positives and the set of all true and false positives. Recall is the ratio between the true positives and the set of all true positives and false negatives. The confusion matrix (\autoref{fig:FeatureToLabelRandomForest}) gives us insight about which kind of errors the models make. We can observe that classes such as I~A~Anna, Dicke, Hohle, Rost, Köpfe, and Suppe can be recalled well (relative recall $\geq$ 0.8), while other classes, such as II~A and II~B have much lower recall ratings (relative recall $\leq$ 0.6). That means that II~A and II~B are the most commonly mislabeled classes (cf.~\autoref{tab:FeatureToLabelReport}). \begin{table}[!h] \centering @@ -625,13 +624,13 @@ \subsection{Unsupervised learning} Unsupervised learning is a class of machine learning techniques that deals with unlabeled data. More specifically, they work without a known goal, a reward system or prior training, and are usually used to find structure within data. Dimension reduction algorithms and clustering algorithms have been identified as the two main classes of unsupervised machine learning algorithms which are used in image categorization~\citep{olaode2014}. \bigskip -Multivariate data sets are generally high dimensional. However, it is common that some parts of that variable space are more filled with data points than others. A large part of the high dimensional variable space is not used. In order to recognize a structure or pattern in the data, it is necessary to reduce the number of dimensions. For this, both linear and non-linear approaches can be applied. Linear unsupervised learning methods for which also descriptive statistics can be acquired are e.g.\ \acrfull{pca}, Non-negative matrix factorization, and Independent Component Analysis~\citep{olaode2014}. Some examples for non-linear approaches are Kernel \acrshort{pca}~\citep{olivier2006semi}, Isometric Feature Mapping, Local Linear Embedding, and Local Multi-Dimensional Scaling. We employed the linear dimension reduction algorithm \acrshort{pca} as well as autoencoders for nonlinear dimension reduction. +Multivariate data sets are generally high dimensional. However, it is common that some parts of that variable space are more filled with data points than others. A large part of the high dimensional variable space is not used. In order to recognize a structure or pattern in the data, it is necessary to reduce the number of dimensions. For this, both linear and non-linear approaches can be applied. 
Linear unsupervised learning methods, for which descriptive statistics can also be acquired, are e.g.\ \acrfull{pca}, non-negative matrix factorization, and Independent Component Analysis~\citep{olaode2014}. Some examples for non-linear approaches are Kernel \acrshort{pca}~\citep{olivier2006semi}, Isometric Feature Mapping, Local Linear Embedding, and Local Multi-Dimensional Scaling. We employed the linear dimension reduction algorithm \acrshort{pca} as well as autoencoders for nonlinear dimension reduction. \subsubsection{Principal Component Analysis} \label{subsec:PCA} -\acrlong{pca} was chosen, because it is one of the standard unsupervised machine learning methods. Moreover, it is a linear, non-parametric method and a widespread application to extract relevant information from high dimensional data sets. The goal is to reduce the complexity of the data by only a minor loss of information~\citep{shlens2014}. Besides being a dimension reduction algorithm, \acrshort{pca} can also be useful to visualize or compress the data, filter noise, or extract features. +\acrlong{pca} was performed because it is one of the standard unsupervised machine learning methods. Moreover, it is a linear, non-parametric method that is widely applied to extract relevant information from high dimensional data sets. The goal is to reduce the complexity of the data by only a minor loss of information~\citep{shlens2014}. Besides being a dimension reduction algorithm, \acrshort{pca} can also be useful to visualize or compress the data, filter noise, or extract features. \bigskip Our initial aim was to reduce the dimension of our data set for further models. As \acrshort{pca} was applied at the beginning of the data inspection, we also had the aim to visualize our data in a three-dimensional space, in order to get a better understanding of the data distribution. The images that comprise our original data set have a high quality, and instead of only reducing the pixel size, we aimed to reduce the information contained in these images. This was achieved by analyzing the principal components in a first step, followed by projecting all relevant images into the lower dimensional space. This information could serve as the input to supervised machine learning algorithms or as a simple lookup scheme to retrieve the label of the example with the most similar low dimensional representation. @@ -645,18 +644,18 @@ \subsubsection{Principal Component Analysis} However, there are several different ways to do so. First of all, we performed the \acrshort{pca} on black and white images. When working on black and white images, only one data point per pixel is given. Therefore, performing a \acrshort{pca} is less computationally expensive, and finding structure is easier. However, as we need to be able to recognize violet as well as rust, and therefore be able to differentiate between color nuances, it was decided to work on colored images. \bigskip -There are different possibilities regarding our project which can be considered as useful ways on how to perform a \acrshort{pca}. First, it can be performed on images of different classes at the same time – similar to capturing several images of several people in one database, on which one \acrshort{pca} is performed. In our case this would mean that a \acrshort{pca} is applied to a data set of several input images of all 13 class labels. The second way would be to perform a \acrshort{pca} separately on each class.
This way, an \enquote{Eigenasparagus} in each class would be calculated, and distances between the Eigenasparagus of different classes could be measured. Thirdly, \acrshort{pca} can be employed feature-wise. In this case, the data set would consist of a collection of images with a certain feature present vs a collection of the same size with the feature absent. +There are various possibilities regarding the data to which the \acrshort{pca} can be applied. First, it can be performed on images of different classes at the same time – similar to capturing several images of several people in one database, on which one \acrshort{pca} is performed. In our case this would mean that a \acrshort{pca} is applied to a data set of several input images of all 13 class labels. The second way would be to perform a \acrshort{pca} separately on each class. This way, an \enquote{Eigenasparagus} in each class would be calculated, and distances between the Eigenasparagus of different classes could be measured. Thirdly, \acrshort{pca} can be employed feature-wise. In this case, the data set would consist of a collection of images with a certain feature present in contrast to a collection of the same size with the feature absent. \bigskip After trying several different approaches, we decided to perform our final \acrshort{pca} on sliced RGB images with background. The images were labeled for their features with the hand-label app, as this yielded the best results. A total of 400 pictures per feature was used to perform a binary \acrshort{pca} for each feature (either the feature is absent or present). -The 200 pictures where a certain feature is present as well as the 200 pictures where a certain feature is absent are extracted via loops over the csv-file, where all hand-labeled information is stored as well as the path to the labeled pictures. For each feature a matrix is created, storing 200 pictures with the present feature and 200 pictures without the feature. E.g., m\textunderscore hollow is the matrix created for the feature hollow (shape = 400, img\textunderscore shape[0] $\times$ img\textunderscore shape[1] $\times$ img\textunderscore shape[2]). The first 200 entries in the matrix are pictures of hollow asparagus. The last 200 pictures show asparagus, which is not hollow. These matrices were calculated for the features hollow, flower, rusty head, rusty body, bent, violet, length, and width. The data points of those 400 images in 2D space can be seen in~\autoref{fig:PCAscatter}. +For each feature a matrix is created, storing 200 pictures with the present feature and 200 pictures without the feature. E.g., m\textunderscore hollow is the matrix created for the feature hollow (shape = 400, img\textunderscore shape[0] $\times$ img\textunderscore shape[1] $\times$ img\textunderscore shape[2]). The first 200 entries in the matrix are pictures of hollow asparagus. The last 200 pictures show asparagus that is not hollow. These matrices were calculated for the features hollow, flower, rusty head, rusty body, bent, violet, length, and width. The data points of those 400 images in 2D space can be seen in~\autoref{fig:PCAscatter}. \begin{figure}[!htb] \centering \includegraphics[width=0.98\textwidth]{Figures/chapter04/pca_process.png} \decoRule - \caption[PCA Process]{\textbf{PCA Process}~~~Before the \acrshort{pca} can be conducted, 200 images per feature are identified by the \texttt{get\textunderscore feature\textunderscore ids} function. Then, a matrix for each feature is calculated. The \acrshort{pca} is conducted feature-wise.
Thereafter, an Eigenspace for each feature is calculated. To evaluate the performance of the \acrshort{pca}, a verification process was performed. Therefore, new feature-labeled asparagus pictures are taken as input and by the help of all Eigenspaces, they evaluate if the feature is present \enquote{1} or absent \enquote{0}. The result is compared to the label in order to evaluate the performance of the Eigenspace. To improve the algorithm, a pipeline of calculations can be implemented such that a feature vector is calculated. This feature-vector gives a class prediction. This is compared to the known label to the image.} + \caption[PCA Process]{\textbf{PCA Process}~~~Before the \acrshort{pca} can be conducted, 200 images per feature are identified by the \texttt{get\textunderscore feature\textunderscore ids} function. Then, a matrix for each feature is calculated. The \acrshort{pca} is conducted feature-wise. Thereafter, an Eigenspace for each feature is calculated. To evaluate the performance of the \acrshort{pca}, a verification process was performed. Therefore, new feature-labeled asparagus pictures are taken as input and by the help of all Eigenspaces, they evaluate if the feature is present \enquote{1} or absent \enquote{0}. The result is compared to the label in order to evaluate the performance of the Eigenspace. To improve the algorithm, a pipeline of calculations can be implemented such that a feature vector is calculated. This feature-vector gives a class prediction. This is compared to the known label of the image.} \label{fig:PCAprocess} \end{figure} @@ -664,7 +663,7 @@ \subsubsection{Principal Component Analysis} \centering \begin{subfigure}{0.7\textwidth} \includegraphics[width=0.9\linewidth]{Figures/chapter04/pca_length_graph.png} - \caption{This plots shows the magnitude of the first ten eigenvalues for the feature length.} + \caption{This plot shows the magnitude of the first ten eigenvalues for the feature length.} \end{subfigure} \vspace{20pt} @@ -689,7 +688,7 @@ \subsubsection{Principal Component Analysis} \centering \begin{subfigure}{0.7\textwidth} \includegraphics[width=0.9\linewidth]{Figures/chapter04/pca_hollow_graph.png} - \caption{This plots shows the magnitude of the first ten eigenvalues for the feature hollow.} + \caption{This plot shows the magnitude of the first ten eigenvalues for the feature hollow.} \end{subfigure} \vspace{20pt} @@ -714,7 +713,7 @@ \subsubsection{Principal Component Analysis} \centering \begin{subfigure}{0.7\textwidth} \includegraphics[width=0.9\linewidth]{Figures/chapter04/pca_bent_graph.png} - \caption{This plots shows the magnitude of the first ten eigenvalues for the feature bent.} + \caption{This plot shows the magnitude of the first ten eigenvalues for the feature bent.} \end{subfigure} \vspace{20pt} @@ -739,7 +738,7 @@ \subsubsection{Principal Component Analysis} \centering \begin{subfigure}{0.7\textwidth} \includegraphics[width=0.9\linewidth]{Figures/chapter04/pca_violet_graph.png} - \caption{This plots shows the magnitude of the first ten eigenvalues for the feature violet.} + \caption{This plot shows the magnitude of the first ten eigenvalues for the feature violet.} \end{subfigure} \vspace{20pt} @@ -764,7 +763,7 @@ \subsubsection{Principal Component Analysis} \centering \begin{subfigure}{0.7\textwidth} \includegraphics[width=0.9\linewidth]{Figures/chapter04/pca_width_graph.png} - \caption{This plots shows the magnitude of the first ten eigenvalues for the feature width.} + \caption{This plot shows the magnitude of the first ten 
eigenvalues for the feature width.} \end{subfigure} \vspace{20pt} @@ -789,7 +788,7 @@ \subsubsection{Principal Component Analysis} \centering \begin{subfigure}{0.7\textwidth} \includegraphics[width=0.9\linewidth]{Figures/chapter04/pca_rustybody_graph.png} - \caption{This plots shows the magnitude of the first ten eigenvalues for the feature rusty body.} + \caption{This plot shows the magnitude of the first ten eigenvalues for the feature rusty body.} \end{subfigure} \vspace{20pt} @@ -814,7 +813,7 @@ \subsubsection{Principal Component Analysis} \centering \begin{subfigure}{0.7\textwidth} \includegraphics[width=0.9\linewidth]{Figures/chapter04/pca_flower_graph.png} - \caption{This plots shows the magnitude of the first ten eigenvalues for the feature flower.} + \caption{This plot shows the magnitude of the first ten eigenvalues for the feature flower.} \end{subfigure} \vspace{20pt} @@ -836,9 +835,9 @@ \subsubsection{Principal Component Analysis} %\end{figure} \bigskip -For all these features a \acrshort{pca} is calculated by first standardizing the matrix pixel-wise (dimensions: $1340\times346\times3$), calculating the covariance matrix, and then extracting the ordered eigenvalues. The principal components are calculated multiplying these eigenvectors with the standardized matrix. The feature space, the principal components, and the standardized matrices are saved to later perform a classification function. The highest ten eigenvalues are plotted to visually decide where to set the threshold of how many principal components will be further included. The first ten eigenvalues and the first ten Eigenasparagus for each feature can be seen in \autoref{fig:PCAlength} -- \autoref{fig:PCAflower}. +For all these features a \acrshort{pca} is calculated by first standardizing the matrix pixel-wise (dimensions: $1340\times346\times3$), calculating the covariance matrix, and then extracting the ordered eigenvalues. The principal components are calculated by multiplying these eigenvectors with the standardized matrix. The feature space, the principal components, and the standardized matrices are saved to later perform a classification function. The highest ten eigenvalues are plotted to visually decide where to set the threshold of how many principal components will be further included. The first ten eigenvalues and the first ten Eigenasparagus for each feature can be seen in \autoref{fig:PCAlength} -- \autoref{fig:PCAflower}. -The classification function is a control function, which performs on unseen images and predicts if a feature is absent or present. It reads in a new picture of one asparagus, which is not part of the asparagus database, meaning not within the 400 images that were used to calculate the \acrshort{pca}. Then, it searches for the asparagus that is most similar to it in the feature matrices (which are 200 pictures of asparagus carrying the feature, 200 pictures of asparagus without the feature). This comparison is made with the reduced representation of the original pictures. +The classification function is a control function, which performs on unseen images and predicts if a feature is absent or present. It reads in a new picture of one asparagus, which is not part of the 400 images that were used to calculate the \acrshort{pca}. Then, it searches for the asparagus that is most similar to it in the feature matrices (which are 200 pictures of asparagus carrying the feature, 200 pictures of asparagus without the feature). 
This comparison is made with the reduced representation of the original pictures. In greater detail, the input picture is first centered by subtracting the mean asparagus and then the picture is projected into the corresponding feature-space. That means the picture is translated into the lower dimensional space, in order to compare it to the known 400 pictures. The comparison is made by calculating the distance between the single centered Eigenasparagus and the 400 pictures in the feature space, by using the cdist function of the SciPy package. The smallest distance is considered as the most similar asparagus. If the index of the most similar asparagus is smaller than 200, we know that the feature is present; if it is above 200, the feature is absent. By comparing this to the information of the single asparagus spear, we know if the new asparagus has the same feature as its closest asparagus in the feature space, or not. By doing this for several images, we can already presume if the two features are likely to be easily separable or not. By evaluating this, we have a measure of how well our used principal components capture the distinguishing information of each feature. @@ -855,22 +854,22 @@ \subsubsection{Principal Component Analysis} The scatterplots in~\autoref{fig:PCAscatter} show the data of each feature lined up along the axes of the first two principal components of each feature. \begin{figure}[h] - \centering - \includegraphics[scale=0.35]{Figures/chapter04/pca_scatterplot.png} - \decoRule - \caption[Feature-Wise Scatterplots]{\textbf{Feature-wise Scatterplots}~~~These scatterplots show the 200 data points where the feature is present in blue and the 200 data points where the feature is absent in orange. The data is projected into a 2D subspace, which is spanned by the first two principal components of each feature.} - \label{fig:PCAscatter} + \centering + \includegraphics[scale=0.35]{Figures/chapter04/pca_scatterplot.png} + \decoRule + \caption[Feature-Wise Scatterplots]{\textbf{Feature-wise Scatterplots}~~~These scatterplots show the 200 data points where the feature is present in blue and the 200 data points where the feature is absent in orange. The data is projected into a 2D subspace, which is spanned by the first two principal components of each feature.} + \label{fig:PCAscatter} \end{figure} -For the classification function, we used the same ten images, to test each feature. The classification works best for the features length and hollow (10/10 classified correctly) and then width (8/10 classified correctly). It performs around chance-level for flower (6/10 classified correctly), violet, and rusty body (5/10 classified correctly), and extremely poor for bent (2/10 classified correctly). +For the classification function, we used the same ten images to test each feature. The classification works best for the features length (10/10 classified correctly) and hollow (8/10 classified correctly, sensitivity 33.3\%, specificity 100\%) and then width (8/10 classified correctly, sensitivity 100\%, specificity 60\%)\footnote{Sensitivity in this analysis refers to the proportion of samples with the feature present that are correctly identified, and specificity to the proportion of samples with the feature absent that are correctly identified.}.
It performs around chance-level for flower (6/10 classified correctly, sensitivity 50\%, specificity 62.5\%), violet (5/10 classified correctly, sensitivity 0\%, specificity 55.5\%), and rusty body (5/10 classified correctly, sensitivity 100\%, specificity 44.4\%), and extremely poorly for bent (2/10 classified correctly, sensitivity 16.6\%, specificity 25\%). From the images of the first ten principal components, we can visually assume that there is information about the length and shape stored in the first principal component, as a clear asparagus spear can be seen. The following images leave a lot of room for interpretation about what information is contained there. We performed the \acrshort{pca} on each feature separately to extract the principal components. It is interesting to see that the pictures of the different features are all very similar, see \autoref{fig:PCAlength} -- \autoref{fig:PCAflower}. One reason for this might be that many of the 400 input pictures for each feature are overlapping between the remaining features. Another reason might be that even though the images vary between features, the general information of all asparagus images is very similar. \bigskip -From the results of the classification function, one can see that there are large differences between the features in how well our \acrshort{pca} performs (20\% -- 100\%). -One reason for this could be that certain features are simply more difficult to distinguish than others. Another reason for this large variation can be that certain features are also more difficult to label consistently (see~\autoref{subsec:AgreementMeasures}), and that the results are due to inconsistencies within the data. One indicator that this is a considerable reason is that the performance of the width and length features, which is information that is not hand-labeled, is very high. Moreover, the poorest results can be observed for the features bent and rusty body. Those are the features, for which the agreement measures show the largest discrepancies between annotators (see~\autoref{subsec:Reliability}). +From the results of the classification function, one can see that there are large differences between the features in how well our \acrshort{pca} performs. Admittedly, the test set comprised only a small number of images and the tested categories are unbalanced, which could cause ambiguous results. +One reason for this could be that certain features are simply more difficult to distinguish than others. Additionally, asparagus spears appear to be similar across all classes, so a major part of the variance is due to the position or orientation of the spear. Further preprocessing of the images could alleviate this problem. Another reason for this large variation can be that certain features are also more difficult to label consistently (see~\autoref{subsec:AgreementMeasures}), and that the results are due to inconsistencies within the data. One indicator that this is a considerable reason is that the performance of the width and length features, which is information that is not hand-labeled, is very high. Moreover, the poorest results can be observed for the features bent and rusty body. Those are the features for which the agreement measures show the largest discrepancies between annotators (see~\autoref{subsec:Reliability}).
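
Before turning to further possible explanations, the feature-wise Eigenspace computation and the nearest-neighbor lookup of the classification function evaluated above can be condensed into the following minimal sketch. Variable names, the exact standardization and the number of retained components are illustrative assumptions; the actual implementation may differ in these details.

\begin{verbatim}
import numpy as np
from scipy.spatial.distance import cdist

def fit_feature_eigenspace(feature_matrix, n_components=10):
    # feature_matrix: shape (400, n_pixels); rows 0-199 contain flattened
    # RGB images with the feature present, rows 200-399 without it.
    mean = feature_matrix.mean(axis=0)
    std = feature_matrix.std(axis=0) + 1e-8
    standardized = (feature_matrix - mean) / std
    # Eigen-decomposition of the 400 x 400 covariance matrix between the
    # images, then mapping the eigenvectors back to pixel space.
    cov = np.cov(standardized)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order].T @ standardized   # "Eigenasparagus"
    projections = standardized @ components.T               # 400 x n_components
    return mean, std, components, projections

def feature_present(image, mean, std, components, projections):
    # Project an unseen image into the Eigenspace and look up the most
    # similar reference image; indices below 200 mean "feature present".
    centered = (image.ravel() - mean) / std
    projected = centered @ components.T
    distances = cdist(projected[None, :], projections)
    return int(np.argmin(distances) < 200)
\end{verbatim}
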
\bigskip Another reason why the results are partly only moderate is that RGB image data possesses complicated structures, and by representing it in a linear, low dimensional feature space, it might be that simply too much information is lost. Even though there are papers reporting good results of \acrshort{pca} on image data~\citep{turk1991face,lata2009}, there are other papers claiming that nonlinear dimension reduction algorithms are needed for this kind of image data~\citep{olaode2014}. @@ -879,22 +878,22 @@ \subsubsection{Principal Component Analysis} \subsubsection{Autoencoder} \label{subsec:Autoencoder} -Beside \acrshort{pca}, there are further techniques for dimension reduction. An alternative that can be employed to deduce sparse representations and automatically extract features by learning from examples are autoencoders.\footnote{See \url{https://github.com/CogSciUOS/asparagus/tree/FinalProject/classification/semisupervised/variational\_auto\_encoder}} +Besides \acrshort{pca}, there are further techniques for dimension reduction. An alternative that can be employed to deduce sparse representations and automatically extract features by learning from examples is the autoencoder.\footnote{See \url{https://github.com/CogSciUOS/asparagus/tree/FinalProject/classification/semisupervised/variational\_auto\_encoder} (as of 11/27/2020)} \bigskip -Simple autoencoders, in which the decoder and encoder consist of \acrshortpl{mlp}, were already proposed as an alternative to \acrshort{pca} in the early days of \acrlong{ann} when computational resources were still comparatively limited~\citep{kramer1991nonlinear}. Today one can choose from a multitude of network architectures and designs that all have one property in common: A bottleneck layer. For image classification it is common practice to use convolutional autoencoders. There are numerous papers about applications in various domains. Examples range from medical images to aerial radar measurements~\citep{chen2017deep}. The autoencoders employed include not only shallow networks but more recently the benefits of deep autoencoders were demonstrated as well ~\citep{geng2015high}. In addition, more complex architectures combining autoencoders with generative adversarial models were proposed lately ~\citep{bao2017cvae}. In many cases the purpose of autoencoders is dimension reduction and feature extraction. Hence the activation of the bottleneck layer neurons (the output of the encoder) is of main interest. In other cases, autoencoders are used to map images from one domain to another. For example, camera recordings can be mapped to pixel-wise labeled images. In this approach the labeled image can be retrieved from the decoder’s output layer after successfully training the network ~\citep{iglovikov2018ternausnet}. In short, there are many possible ways to apply autoencoders and also many possible architectures to realize them. +Simple autoencoders, in which the decoder and encoder consist of \acrshortpl{mlp}, were already proposed as an alternative to \acrshort{pca} in the early days of \acrlongpl{ann} when computational resources were still comparatively limited~\citep{kramer1991nonlinear}. Today one can choose from a multitude of network architectures and designs that all have one property in common: a bottleneck layer. For image classification it is common practice to use convolutional autoencoders. There are numerous papers about applications in various domains. Examples range from medical images to aerial radar measurements~\citep{chen2017deep}.
The autoencoders employed include not only shallow networks but more recently the benefits of deep autoencoders were demonstrated as well~\citep{geng2015high}. An example that became famous because of its good performance is the deep \acrshort{cnn} U-Net, which includes skip connections that bypass the deeper layers of the encoder-decoder assembly~\citep{ronneberger2015u}. In addition, more complex architectures combining autoencoders with generative adversarial models were proposed lately~\citep{bao2017cvae}. In other cases, autoencoders are used to map images from one domain to another. For example, camera recordings can be mapped to pixel-wise labeled images. In this approach, the labeled image can be retrieved from the decoder’s output layer after successfully training the network~\citep{iglovikov2018ternausnet}. In short, there are many possible ways to apply autoencoders and also many possible architectures to realize them.
-This motivates the question of how autoencoders work. As mentioned, all autoencoders have a bottleneck layer. If applied for dimension reduction, autoencoders are usually used to predict the input -- in this case the image -- with the input itself. The sparse representation corresponds to the activation of the latent layer for a given input. Autoencoders consist of an encoder that contains the initial layers as well as the bottleneck layer, and a decoder that maps the respective latent space back to the image. The desired mapping function of the input to a sparse representation is generated as a by-product of optimization in end-to-end training, as weights of the decoder are trained such that meaningful features are extracted. The main difference to \acrshort{pca} is that nonlinear functions can be approximated. Feedforward neural networks such as the encoder of an autoencoder are non-linear function approximators. Networks with multiple layers are especially well-known for establishing named nonlinear correlation. Hence, autoencoders allow for non-linear mappings to the latent space~\citep{kramer1991nonlinear}. This means that in the latter, multiple features may be represented in a two dimensional space. It shows that compared to \acrshort{pca}, where one dimension typically corresponds to one feature, more information can be represented in fewer dimensions. Different properties of the input are mapped to different areas of the latent space.
+If applied for dimension reduction, autoencoders are usually used to predict the input -- in this case the image -- from the input itself. The sparse representation corresponds to the activation of the latent layer for a given input. Autoencoders consist of an encoder that contains the initial layers as well as the bottleneck layer, and a decoder that maps the respective latent space back to the image. The desired mapping function of the input to a sparse representation is generated as a by-product of optimization in end-to-end training, as weights of the decoder are trained such that meaningful features are extracted. The main difference to \acrshort{pca} is that nonlinear functions can be approximated. Feedforward neural networks such as the encoder of an autoencoder are non-linear function approximators. Networks with multiple layers are especially well suited to capture such nonlinear correlations. Hence, autoencoders allow for non-linear mappings to the latent space~\citep{kramer1991nonlinear}. This means that multiple features may be represented in a two-dimensional latent space.
It shows that compared to \acrshort{pca}, where one dimension typically corresponds to one feature, more information can be represented in fewer dimensions. Different properties of the input are mapped to different areas of the latent space. \bigskip For this project, we used a certain kind of autoencoder as basis, namely a \acrfull{vae} for unsupervised data exploration. In \acrshortpl{vae}, a special loss function is used that ensures features in the latent space are mapped to a compact cluster of values. This allows for interpolation between samples by moving on a straight line. Regions between points in the latent space lie within the data and hence reconstructions of the decoder are more realistic~\citep{keras2020vae}. Other than that, \acrshortpl{vae} share most properties with regular autoencoders. The location of a point in latent space refers to a compressed representation of the input. These can be interpreted as features of the input. \begin{figure}[h] - \centering - \includegraphics[scale=0.8]{Figures/chapter04/autoencoder_latent_asparagus.png} - \decoRule - \caption[Latent Asparagus Space for MLP-VAE and Reconstruction Manifold]{\textbf{Latent Asparagus Space for MLP-VAE and Reconstruction Manifold}~~~The depiction illustrates the mapping by the encoder and decoder of the \acrshort{vae}: On the left you can see scatterplots illustrating the activation of latent layer neurons for the test data set (the mapping by the encoder). Image-features are mapped to the respective latent asparagus space. There is a scatterplot for each feature of interest where colors indicate positive (yellow) and negative (purple) samples. On the right a manifold of decoded images is shown. The axes relate to the points sampled in latent asparagus space that correspond to the reconstructions (mapping by the decoder).} - \label{fig:AutoencoderLatentSpace} + \centering + \includegraphics[scale=0.8]{Figures/chapter04/autoencoder_latent_asparagus.png} + \decoRule + \caption[Latent Asparagus Space for MLP-VAE and Reconstruction Manifold]{\textbf{Latent Asparagus Space for MLP-VAE and Reconstruction Manifold}~~~The depiction illustrates the mapping by the encoder and decoder of the \acrshort{vae}: On the left you can see scatterplots illustrating the activation of latent layer neurons for the test data set (the mapping by the encoder). Image-features are mapped to the respective latent asparagus space. There is a scatterplot for each feature of interest where colors indicate positive (yellow) and negative (purple) samples. On the right a manifold of decoded images is shown. The axes relate to the points sampled in latent asparagus space that correspond to the reconstructions (mapping by the decoder).} + \label{fig:AutoencoderLatentSpace} \end{figure} Different \acrshortpl{vae} were tested in the scope of the project. First, a simple \acrshort{vae} with a \acrshort{mlp} as decoder was implemented. The second approach was a comparatively shallow convolutional \acrshort{vae}. The third approach relates to a convolutional \acrshort{vae} with a deeper encoder that was later used as a basis to design the networks for semi-supervised learning. The third approach is more complex but does not improve the mapping of the properties of asparagus spears to a two dimensional latent asparagus space. Thus, only the results for the second of the three mentioned networks are reported in the following. 
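To make the reported setup more tangible, the following minimal sketch shows how a comparatively shallow convolutional \acrshort{vae} with a two-dimensional latent space can be assembled in Keras. It is an illustration of the principle only and not the architecture used in the project; the input size, the filter counts, and all variable names are assumptions chosen for the example.

\begin{verbatim}
# Minimal convolutional VAE sketch (illustrative only; shapes and names are
# assumptions, not the project's actual configuration).
import tensorflow as tf
from tensorflow.keras import layers, Model

H, W, C = 136, 40, 3   # assumed input size; H and W must be divisible by 4
LATENT_DIM = 2         # two-dimensional "latent asparagus space"

# Encoder: two stride-2 convolutions, then mean and log-variance of the code.
enc_in = layers.Input(shape=(H, W, C))
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(enc_in)
x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Flatten()(x)
z_mean = layers.Dense(LATENT_DIM)(x)
z_log_var = layers.Dense(LATENT_DIM)(x)

def sample(args):
    mean, log_var = args
    eps = tf.random.normal(tf.shape(mean))
    return mean + tf.exp(0.5 * log_var) * eps

z = layers.Lambda(sample)([z_mean, z_log_var])

# Decoder: mirror of the encoder, upsampling by a factor of two per layer.
d = layers.Dense((H // 4) * (W // 4) * 64, activation="relu")(z)
d = layers.Reshape((H // 4, W // 4, 64))(d)
d = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(d)
d = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(d)
recon = layers.Conv2DTranspose(C, 3, padding="same", activation="sigmoid")(d)

# Full VAE: reconstruction loss plus KL divergence keeps the latent space compact.
vae = Model(enc_in, recon)
rec_loss = tf.reduce_mean(
    tf.reduce_sum(tf.square(enc_in - recon), axis=[1, 2, 3]))
kl_loss = -0.5 * tf.reduce_mean(
    tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1))
vae.add_loss(rec_loss + kl_loss)
vae.compile(optimizer="adam")
# vae.fit(images, epochs=..., batch_size=...)  # trained on images only
\end{verbatim}

After training, the activations of the two latent neurons (here \texttt{z\_mean}) for a given image provide its point in latent asparagus space, and decoding a grid of latent points yields a reconstruction manifold as shown on the right of \autoref{fig:AutoencoderLatentSpace}.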
@@ -904,16 +903,15 @@ \subsubsection{Autoencoder}
Similar to the application of \acrshort{pca}, batches of images that contain only one perspective are used as input to the network. The downsampled data set is used. Images have to be padded, as the implementation does not work with inputs of arbitrary shape. This is because the deconvolutional layers of the decoder can only increase dimensionality by an integer factor: The filters that are used for the deconvolution in the given network double the tensor dimensionality. An increase of the vertical dimension from 34 to 68 and finally to the desired 136 pixels is achieved in the last three layers of the network (which is impossible for the original height of 134 pixels). The input shape, which also defines the shape of the output layer, must therefore be divisible by four, that is, divisible by two without remainder twice.
\bigskip
-\autoref{fig:AutoencoderLatentSpace} shows the results. It demonstrates that the features short, thick, and thin are mapped to separable clusters. As a tendency the feature bent correlates with a region in the lower periphery, as indicated especially by the deconstruction depicted on the right side of \autoref{fig:AutoencoderLatentSpace}. The other features (violet, rust and flower) are not mapped adequately, thus, they are not visible in the reconstruction. This shows that only some features of interest are mapped to the latent space and used to decode images. Reconstructions of autoencoders are known to miss many details \citep{kramer1991nonlinear}. One may speculate that better results can be achieved using larger input images as we applied autoencoders on downscaled images.
+\autoref{fig:AutoencoderLatentSpace} shows the results. It demonstrates that the features short, thick, and thin are mapped to separable clusters. As a tendency, the feature bent correlates with a region in the lower periphery, as indicated especially by the reconstruction manifold depicted on the right side of \autoref{fig:AutoencoderLatentSpace}. The other features (violet, rust and flower) are not mapped adequately; thus, they are not visible in the reconstruction. This shows that only some features of interest are mapped to the latent space and used to decode images. Reconstructions of autoencoders are known to miss many details \citep{kramer1991nonlinear}. One may speculate that better results can be achieved using larger input images, as we applied the autoencoders to downscaled images.
-The possibility to generate, for example, more or less bent asparagus spears may help to define a precise decision boundary and classify images accordingly. As a potential feature for asparagus sorting machines, this would allow the user to customize the definition of features such as bent to their own taste. For this approach to be viable for all features, however, the network performance appears to be insufficient. Some features are poorly separated by the network.
+The possibility to generate, for example, more or less bent asparagus spears may help to define a precise decision boundary and classify images accordingly. As a potential feature for asparagus sorting machines, this would allow the user to customize the definition of features such as bent to their own taste. For this approach to be viable for all features, however, the network performance appears to be insufficient. Some features are separated poorly by the network.
\subsection{Semi-supervised learning}
\label{sec:SemiSupervisedLearning}
-We collected more than 100000 samples.
Considering the uniform appearance of the images this represents a substantial amount. However, labels had to be manually generated. This was done for only around 10\% of the samples. As a consequence, there is only a small subset of data with attributed labels. Smaller amounts of labeled data mean that predictions can be successful only if the variance in the source values is limited. Hence, for high dimensional data such as images, sparse representations are desirable. Extracting features automatically instead of relying on manual feature engineering is a strategy that is especially
-appealing if large amounts of unlabeled data are available.
+Only a small subset of the data (around 10\%) is attributed with labels. Smaller amounts of labeled data mean that predictions can be successful only if the variance in the source values is limited. Hence, for high dimensional data such as images, sparse representations are desirable. Extracting features automatically instead of relying on manual feature engineering is a strategy that is especially appealing if large amounts of unlabeled data are available.
In semi-supervised learning, features are extracted from the input images in an unsupervised fashion. If labels are available for a sample, they are used to ensure that the extracted features correspond to the target categories~\citep{keng2017semi}. Hence, semi-supervised learning promises better results by using not only labeled samples but also unlabeled data points of partially labeled data sets.
@@ -921,84 +919,84 @@ \subsection{Semi-supervised learning}
\subsubsection{Semi-supervised autoencoder}
\label{subsec:VariationalAutoencoder}
-In the previous chapter, methods and results for unsupervised learning are presented. One example is a convolutional autoencoder. In this section, it is shown how convolutional autoencoders with additional soft constraints in the loss function can be used for semi-supervised learning.\footnote{See \url{https://github.com/CogSciUOS/asparagus/tree/FinalProject/classification/semisupervised/semi_supervised_autoencoders}} Instead of using unsupervised methods to compute another data set of sparse representations and use the latter to predict labels, in semi-supervised learning sparse representations are retrieved and mapped to latent features at the same time~\citep{keng2017semi}. Bottleneck layer activations represent automatically extracted features. For semi-supervised learning one tries to enforce that latent layer activations of autoencoders correlate with the target categories.
+In the previous chapter, methods and results for unsupervised learning are presented. One example is a convolutional autoencoder. In this section, it is shown how convolutional autoencoders with additional soft constraints in the loss function can be used for semi-supervised learning.\footnote{See \url{https://github.com/CogSciUOS/asparagus/tree/FinalProject/classification/semisupervised/semi_supervised_autoencoders} (as of 11/27/2020)} Instead of using unsupervised methods to compute another data set of sparse representations and use the latter to predict labels, in semi-supervised learning sparse representations are retrieved and mapped to the target categories at the same time~\citep{keng2017semi}. Bottleneck layer activations represent automatically extracted features. For semi-supervised learning, one tries to enforce that latent layer activations of autoencoders correlate with the target categories.
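Written out as a formula, this soft constraint amounts to an additional term in the training objective. Using a weighting factor $k$ for the label term (the factor is described further below), the combined loss for a labeled batch can be summarized schematically as
\[
\mathcal{L} = \mathcal{L}_{\text{reconstruction}} + k \cdot \mathcal{L}_{\text{label}},
\]
whereas for unlabeled batches only the reconstruction term remains. Note that this is a schematic summary of the description in the following paragraphs and not the exact formula used in the implementation, which additionally scales the label term with the current reconstruction loss.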
\begin{figure}[!htb] \centering \includegraphics[scale=0.8]{Figures/chapter04/semi_supervised_network.png} \decoRule - \caption[Semi-Supervised Learning Network Structures]{\textbf{Network Structures for Semi-Supervised Learning}~~~The depiction illustrates the network structure for the autoencoders used for semi-supervised learning. Left: The structure for the semi-supervised convolutional \acrshort{vae}. Right: The structure for the semi-supervised convolutional autoencoder. Further explanations can be found in the text.} + \caption[Semi-Supervised Learning Network Structures]{\textbf{Network Structures for Semi-Supervised Learning}~~~The depiction illustrates the network structure for the autoencoders used for semi-supervised learning. Left: The structure for the semi-supervised convolutional \acrshort{vae}. Right: The structure for the semi-supervised convolutional autoencoder.} \label{fig:SemiSupervisedNetworkStructures} \end{figure} \bigskip -The general network structure is derived from the convolutional autoencoder used for unsupervised learning in \autoref{subsec:Autoencoder}. The feedforward \acrshort{cnn} is replaced by a network that has proven to be suitable to detect at least some features with sufficient adequacy when trained on asparagus heads (see~\autoref{subsec:HeadNetwork}). It comprises three convolutional layers with 32 kernels, respectively. The first layer has a kernel size of $2\times2$, and the two subsequent layers have a kernel size of $3\times3$. A max pooling layer with stride size two is added mainly to reduce the number of total neurons while maintaining a high number of kernels. Additionally, a dropout layer is added to avoid overfitting. In contrast to other implementations for semi-supervised learning~\citep{keng2017semi} the same network is used for the prediction of labels (when they exist for the current batch) as for the encoder that retrieves the sparse representation for reconstructing images. We chose the same decoder as in the \acrlong{vae} presented in the previous \autoref{subsec:Autoencoder}. The effects of a bypass layer that contains neurons not being subject to the label layer loss (see~\autoref{fig:SemiSupervisedNetworkStructures}) were tested. As accuracy did not improve, it was later dismissed. Two variations of the network were tested: A convolutional \acrshort{vae} for semi-supervised learning and a convolutional autoencoder for semi-supervised learning. The architecture for both networks can be seen in \autoref{fig:SemiSupervisedNetworkStructures}. +The general network structure is derived from the convolutional autoencoder used for unsupervised learning in \autoref{subsec:Autoencoder}. The feedforward \acrshort{cnn} is replaced by a network that has proven to be suitable to detect at least some features with sufficient adequacy when trained on asparagus heads (see~\autoref{subsec:HeadNetwork}). It comprises three convolutional layers with 32 kernels, respectively. The first layer has a kernel size of $2\times2$, and the two subsequent layers have a kernel size of $3\times3$. A max pooling layer with stride size two is added mainly to reduce the number of total neurons while maintaining a high number of kernels. Additionally, a dropout layer is added to avoid overfitting. In contrast to other implementations for semi-supervised learning~\citep{keng2017semi} the same network is used for the prediction of labels (when they exist for the current batch) as for the encoder that retrieves the sparse representation for reconstructing images. 
We chose the same decoder as in the \acrlong{vae} presented in the previous \autoref{subsec:Autoencoder}. The effects of a bypass layer that contains neurons not being subject to the label layer loss (see~\autoref{fig:SemiSupervisedNetworkStructures}) were tested. As accuracy did not improve, it was later dismissed. Two variations of the network were tested: A convolutional \acrshort{vae} for semi-supervised learning and a convolutional autoencoder for semi-supervised learning. The architecture for both networks is shown in \autoref{fig:SemiSupervisedNetworkStructures}. A challenge results from training with multiple inputs. As deep learning frameworks usually require a connected graph that links inputs to outputs, a trick is used in order to handle that two tensors -- images as well as labels -- are given as an input. A dummy layer is introduced where all information derived from the labels is multiplied by zero. The output vector is concatenated with the bottleneck of the encoder. As it contains no variance, training and, more importantly, validation accuracies remain unaffected even though information about the categories to be predicted is added on the input side. Nevertheless, the labels are part of the network graph and could hence be used in the loss function. \begin{figure}[!htb] - \centering - \includegraphics[scale=0.35]{Figures/chapter04/semi_supervised_latent_asparagus.png} - \decoRule - \caption[Semi-Supervised Convolutional VAE Latent Asparagus Space]{\textbf{Latent Asparagus Space for Semi-Supervised VAE}~~~The depiction illustrates the mapping by the encoder and decoder of the \acrshort{cnn}-\acrshort{vae} with additional constraints in the loss function for semi-supervised learning: On the left one can see scatterplots illustrating the activation of latent layer neurons for the test data set (the mapping by the encoder). Image-features are mapped to the respective latent asparagus space. Mind that there is a scatterplot for each feature of interest where colors indicate positive and negative samples. On the right, a manifold of decoded images can be found. The axis relates to the points sampled in latent asparagus space that correspond to the reconstructions (mapping by the decoder).} - \label{fig:SemiSupervisedLatentSpace} + \centering + \includegraphics[scale=0.35]{Figures/chapter04/semi_supervised_latent_asparagus.png} + \decoRule + \caption[Semi-Supervised Convolutional VAE Latent Asparagus Space]{\textbf{Latent Asparagus Space for Semi-Supervised VAE}~~~The depiction illustrates the mapping by the encoder and decoder of the \acrshort{cnn}-\acrshort{vae} with additional constraints in the loss function for semi-supervised learning: On the left one can see scatterplots illustrating the activation of latent layer neurons for the test data set (the mapping by the encoder). Image-features are mapped to the respective latent asparagus space. Mind that there is a scatterplot for each feature of interest where colors indicate positive and negative samples. On the right, a manifold of decoded images can be found. The axis relates to the points sampled in latent asparagus space that correspond to the reconstructions (mapping by the decoder).} + \label{fig:SemiSupervisedLatentSpace} \end{figure} -A custom conditional loss function is used. If labels are present for the current batch, the custom loss is equal to a combined loss that comprises the reconstruction loss and the label loss. 
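The following sketch illustrates how the zero-multiplication trick and a conditional combined loss can be written down in Keras. It is a simplified stand-in for the networks described here and not the project code: the image size, the value of the factor $k$, the use of $-1$ as a marker for unlabeled samples, the per-sample (instead of per-batch) masking of the label loss, and the reduction of the bottleneck to the label layer are assumptions made for brevity, and the additional scaling of the label loss with the reconstruction loss is omitted.

\begin{verbatim}
# Sketch of the dummy-label input and the conditional combined loss
# (illustrative only; sizes, names and the masking scheme are assumptions).
import tensorflow as tf
from tensorflow.keras import layers, Model

H, W, C = 136, 40, 3   # assumed input size
N_FEATURES = 6         # e.g. bent, violet, rusty body, fractured, thick, thin
K_FACTOR = 10.0        # assumed value for the custom weighting factor k

img_in = layers.Input(shape=(H, W, C), name="image")
lbl_in = layers.Input(shape=(N_FEATURES,), name="labels")  # -1 marks "unlabeled"

# Encoder roughly following the description: three convolutional layers with
# 32 kernels each, max pooling with stride two, and dropout against overfitting.
x = layers.Conv2D(32, 2, padding="same", activation="relu")(img_in)
x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
x = layers.Dropout(0.25)(x)
x = layers.Flatten()(x)

# Label layer: its sigmoid activations are pushed towards the hand labels.
label_layer = layers.Dense(N_FEATURES, activation="sigmoid", name="label_layer")(x)

# Dummy layer: the true labels are multiplied by zero and concatenated to the
# bottleneck, so they become part of the graph without adding any variance.
zeroed_labels = layers.Lambda(lambda t: 0.0 * t)(lbl_in)
bottleneck = layers.Concatenate()([label_layer, zeroed_labels])

# Decoder (schematic): map the bottleneck back to an image of the input size.
d = layers.Dense((H // 4) * (W // 4) * 32, activation="relu")(bottleneck)
d = layers.Reshape((H // 4, W // 4, 32))(d)
d = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(d)
recon = layers.Conv2DTranspose(C, 3, strides=2, padding="same",
                               activation="sigmoid")(d)

model = Model([img_in, lbl_in], recon)

# Combined loss: reconstruction loss always, label loss only where labels exist
# (here per sample, using -1 in the first label entry as "no label available").
rec_loss = tf.reduce_mean(tf.square(img_in - recon))
has_label = tf.cast(tf.not_equal(lbl_in[:, 0], -1.0), tf.float32)
label_loss = tf.reduce_mean(
    has_label * tf.reduce_mean(tf.square(lbl_in - label_layer), axis=-1))
model.add_loss(rec_loss + K_FACTOR * label_loss)
model.compile(optimizer="adam")
# model.fit([images, labels_or_minus_one], epochs=..., batch_size=...)
\end{verbatim}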
Here, reconstruction loss refers to the pixel-wise loss that is used for the main task of the network -- namely, the mapping of input images back to the same images (fed into the network as target values to the output layer). The label loss is used with the goal of mapping label layer activations to the actual labels. It is low if activations in the sigmoid transformed label layer match the target values i.e.\ the sum of the error layer is low. The loss that is due to labels is multiplied by a custom factor $(k)$. In addition it is defined such that it scales with the current pixel-wise reconstruction loss and converges to a constant $(c)$. These values are chosen with the aim of increasing the contribution of the label loss to the combined loss, especially in late stages of training. +A custom conditional loss function is used. If labels are present for the current batch, the custom loss is equal to a combined loss that comprises the reconstruction loss and the label loss. Here, reconstruction loss refers to the pixel-wise loss that is used for the main task of the network -- namely, the mapping of input images back to the same images. The label loss is used with the goal of mapping label layer activations to the actual labels. It is low if activations in the sigmoid transformed label layer match the target values i.e.\ the sum of the error layer is low. The loss for the labels is multiplied by a custom factor $(k)$. In addition, it is defined such that it scales with the current pixel-wise reconstruction loss and converges to a constant $(c)$. These values are chosen with the aim of increasing the contribution of the label loss to the combined loss, especially in late stages of training. \begin{table}[!htb] - \centering - \resizebox{\columnwidth}{!}{% - \begin{tabular}{lrrrrrr} - {} & False positive & False negative & True positive & True negative & Sensitivity & Specificity \\ - \noalign{\smallskip} - \hline - \noalign{\smallskip} - bent & 0.04 & 0.37 & 0.04 & 0.55 & 0.10 & 0.94 \\ - violet & 0.00 & 0.08 & 0.00 & 0.92 & 0.00 & 1.00 \\ - rusty body & 0.14 & 0.27 & 0.20 & 0.39 & 0.42 & 0.73 \\ - fractured & 0.00 & 0.02 & 0.00 & 0.98 & 0.00 & 1.00 \\ - thick & 0.00 & 0.07 & 0.00 & 0.93 & 0.00 & 1.00 \\ - thin & 0.00 & 0.14 & 0.16 & 0.70 & 0.54 & 0.99 \\ - \noalign{\smallskip} - \hline - \end{tabular}% - } - \caption[Semi-Supervised Convolutional VAE Performance]{\textbf{Convolutional VAE Performance}~~~Performance of the semi-supervised convolutional \acrshort{vae}.} - \label{tab:performance_convolutional_vae} + \centering + \resizebox{\columnwidth}{!}{% + \begin{tabular}{lrrrrrr} + {} & False positive & False negative & True positive & True negative & Sensitivity & Specificity \\ + \noalign{\smallskip} + \hline + \noalign{\smallskip} + bent & 0.04 & 0.37 & 0.04 & 0.55 & 0.10 & 0.94 \\ + violet & 0.00 & 0.08 & 0.00 & 0.92 & 0.00 & 1.00 \\ + rusty body & 0.14 & 0.27 & 0.20 & 0.39 & 0.42 & 0.73 \\ + fractured & 0.00 & 0.02 & 0.00 & 0.98 & 0.00 & 1.00 \\ + thick & 0.00 & 0.07 & 0.00 & 0.93 & 0.00 & 1.00 \\ + thin & 0.00 & 0.14 & 0.16 & 0.70 & 0.54 & 0.99 \\ + \noalign{\smallskip} + \hline + \end{tabular}% + } + \caption[Semi-Supervised Convolutional VAE Performance]{\textbf{Convolutional VAE Performance}~~~Performance of the semi-supervised convolutional \acrshort{vae}.} + \label{tab:performance_convolutional_vae} \end{table} \begin{table}[!htb] - \centering - \resizebox{\columnwidth}{!}{% - \begin{tabular}{lrrrrrr} - {} & False positive & False negative & True positive & True negative & 
Sensitivity & Specificity \\ - \noalign{\smallskip} - \hline - \noalign{\smallskip} - bent & 0.02 & 0.34 & 0.07 & 0.57 & 0.18 & 0.96 \\ - violet & 0.00 & 0.08 & 0.00 & 0.92 & 0.00 & 1.00 \\ - rusty body & 0.23 & 0.19 & 0.28 & 0.30 & 0.59 & 0.57 \\ - fractured & 0.00 & 0.01 & 0.01 & 0.97 & 0.57 & 1.00 \\ - thick & 0.00 & 0.06 & 0.00 & 0.93 & 0.04 & 1.00 \\ - thin & 0.02 & 0.07 & 0.23 & 0.68 & 0.76 & 0.97 \\ - \noalign{\smallskip} - \hline - \end{tabular}% - } - \caption[Semi-Supervised Convolutional Autoencoder Performance]{\textbf{Convolutional Autoencoder Performance}~~~Performance of the Semi-Supervised Convolutional Autoencoder.} - \label{tab:performance_semi_supervised_autoencoder} + \centering + \resizebox{\columnwidth}{!}{% + \begin{tabular}{lrrrrrr} + {} & False positive & False negative & True positive & True negative & Sensitivity & Specificity \\ + \noalign{\smallskip} + \hline + \noalign{\smallskip} + bent & 0.02 & 0.34 & 0.07 & 0.57 & 0.18 & 0.96 \\ + violet & 0.00 & 0.08 & 0.00 & 0.92 & 0.00 & 1.00 \\ + rusty body & 0.23 & 0.19 & 0.28 & 0.30 & 0.59 & 0.57 \\ + fractured & 0.00 & 0.01 & 0.01 & 0.97 & 0.57 & 1.00 \\ + thick & 0.00 & 0.06 & 0.00 & 0.93 & 0.04 & 1.00 \\ + thin & 0.02 & 0.07 & 0.23 & 0.68 & 0.76 & 0.97 \\ + \noalign{\smallskip} + \hline + \end{tabular}% + } + \caption[Semi-Supervised Convolutional Autoencoder Performance]{\textbf{Convolutional Autoencoder Performance}~~~Performance of the Semi-Supervised Convolutional Autoencoder.} + \label{tab:performance_semi_supervised_autoencoder} \end{table} \bigskip -The results for the semi-supervised \acrshort{vae} are illustrated in~\autoref{tab:performance_convolutional_vae} and visualized by~\autoref{fig:SemiSupervisedLatentSpace}. As one can immediately see, the feature thin is adequately mapped to a decisive region in latent space. For the other features, no such clear cut clustering is visible. Reconstructions indicate that the main purpose of the network of predicting the input images is accomplished successfully although the reconstructions look rather uniform. Values for accuracy and sensitivity indicate only poor performance. Sensitivity is only above zero for the features rusty body (0.42), thin (0.54) and bent (0.1). +The results for the semi-supervised \acrshort{vae} are illustrated in~\autoref{tab:performance_convolutional_vae} and visualized by~\autoref{fig:SemiSupervisedLatentSpace}. As one can immediately see, the feature thin is adequately mapped to a decisive region in latent space. For the other features, no such clear cut division is visible. Reconstructions indicate that the main purpose of the network of predicting the input images is accomplished successfully although the reconstructions look rather uniform. Values for accuracy and sensitivity indicate only poor performance. Sensitivity is only above zero for the features rusty body (0.42), thin (0.54) and bent (0.1). -Compared to the variational autoencoder for semi-supervised learning, the simple convolutional autoencoder for semi-supervised learning performs better. However, there are substantial potentials for improvements. \autoref{tab:performance_semi_supervised_autoencoder} shows a summary of results. Violet detection is not successful at all, as indicated by a sensitivity of zero. For the other features that the network was trained on, mediocre results are achieved. Thickness detection shows little sensitivity (0.04), however a high specificity (1.0). 
Better results in sensitivity exist for bent (0.18) rusty body (0.42), and short spears (0.6). The specificity for rusty body is lower (0.6) as compared to named other features (1.0 and 0.96). Thin spears are detected in 76\% of all cases and few false positives characterize the detection of named feature (0.97 specificity).
+Compared to the variational autoencoder, the simple convolutional autoencoder performs better for semi-supervised learning. However, there is still substantial potential for improvement. \autoref{tab:performance_semi_supervised_autoencoder} shows a summary of results. Violet detection is not successful at all, as indicated by a sensitivity of zero. For the other features that the network was trained on, mediocre results are achieved. Thickness detection shows low sensitivity (0.04) but a high specificity (1.0). Better results in sensitivity exist for bent (0.18), rusty body (0.42), and short spears (0.6). The specificity for rusty body (0.6) is lower compared to that of the other features mentioned (1.0 and 0.96). Thin spears have the highest sensitivity (0.76) and a high specificity as well (0.97).
\bigskip
The approach for semi-supervised learning presented here faces two challenges. First, the networks are trained only on one perspective although labels are attributed per asparagus spear -- that is, for three concatenated images only one label exists. Information that might only be visible on one or two out of the three perspectives cannot be mapped to the desired target category. Image-wise labels would be desirable to improve the approach.
-Second, reconstructions using convolutional autoencoders contain little detail. However, small details in the image, such as small brown spots that indicate rust, play an important role for classification. These features are not sufficiently reconstructed by the \acrshort{vae}. Arguably, they are hence not reflected in the sparse representation that corresponds to latent layer activations. One may speculate that a substantial increase of the network size would help to reconstruct more details and hence extract more features. As \acrfullpl{gan} are known to generate more detailed images~\citep{bao2017cvae} they could possibly be adopted for semi-supervised asparagus classification with greater success. However, this is a question that must be answered empirically.
+Second, reconstructions using convolutional autoencoders contain little detail. However, small details in the image, such as small brown spots that indicate rust, play an important role for classification. These features are not sufficiently reconstructed by the \acrshort{vae}. Hence, they are not reflected in the sparse representation that corresponds to latent layer activations. One may speculate that a substantial increase of the network size would help to reconstruct more details and thereby extract more features. As \acrfullpl{gan} are known to generate more detailed images~\citep{bao2017cvae}, they could possibly be adopted for semi-supervised asparagus classification with greater success. However, this is a question that must be answered empirically.
\bigskip
-In summary, one may conclude that automatic retrieval of sparse representations with autoencoders appears as an alternative to manual feature engineering (rule based retrieval of sparse representations) if a large data set is available and only a subset contains labeled samples. However, more research is necessary to find best suitable network structures for asparagus classification.
+In summary, one may conclude that automatic retrieval of sparse representations with autoencoders appears as an alternative to manual feature engineering if a large data set is available and only a subset contains labeled samples. However, more research is necessary to find the most suitable network structures for asparagus classification.
\ No newline at end of file
diff --git a/documentation/report/Chapters/Conclusion.tex b/documentation/report/Chapters/Conclusion.tex
index 4cddb06..3f44878 100644
--- a/documentation/report/Chapters/Conclusion.tex
+++ b/documentation/report/Chapters/Conclusion.tex
@@ -4,27 +4,29 @@ \section{Conclusion}
\label{ch:Conclusion}
-In the scope of our project, we sucessfully adopted various classical as well as deep learning based computer vision approaches to classify asparagus spears according to their descriptive features or class labels. After collecting images, labeling, and processing the data, different trainable models were implemented from the fields of supervised learning, unsupervised learning and semi-supervised learning.
+In the scope of our project, we successfully adopted various classical as well as deep learning based computer vision approaches to classify asparagus spears based on their descriptive features or class labels. After collecting images, labeling, and processing the data, different trainable models were implemented from the fields of supervised learning, unsupervised learning and semi-supervised learning.
\bigskip
-We could prove that computer vision and machine learning based techniques are practicable for the asparagus classification problem.
+We could prove that computer vision and machine learning based techniques are practicable for the asparagus classification problem, as they can be performed in a reasonable amount of time and our exploratory analysis yields promising results.
-Our explorative study gave the possibility of a better overview on which techniques seem promising for asparagus classification and which might not need to be pursued in the future. As a next step, we would concentrate on using binary feedforward \acrshortpl{cnn} because they offer a flexible and simple solution. Additional fine-tuning of the architecture to fit the individual needs of the to-be-predicted features (or class labels) and further preprocessing of the data will improve the network (as demonstrated in \autoref{subsec:FeatureEngineering}). Considering the amount of labeled data that is available, unsupervised and semi-supervised approaches gave some insight into the data but, in the end, they were more effortful to implement while not promising much better results. If there had been less labeled data, these approaches might have been more useful than approaches relying on labeled data.
+Our explorative study provided a better overview of which techniques seem promising for asparagus classification and which might not need to be pursued in the future. Considering the amount of labeled data that is available, unsupervised and semi-supervised approaches gave some insight into the data but, in the end, they were more effortful to implement while not promising much better results than the supervised approaches. If there had been less labeled data, these approaches might have been more useful than approaches relying on labeled data.
-Whether we have succeeded in improving the currently running sorting algorithm can not be said, yet.
In cooperation with the local asparagus farm Gut Holsterfeld and the manufacturer of the asparagus sorting machine Autoselect ATS II, a method for evaluation can now be developed. +Whether we have succeeded in improving the currently running sorting algorithm can not be said, yet. In cooperation with the local asparagus farm Gut Holsterfeld and the manufacturer of the asparagus sorting machine Autoselect ATS II, an implementation into the existing system can take place now and a method for evaluation can be defined. \bigskip -Due to the direct start of the harvesting season at the beginning of the project, the sorting machine and the hardware setup had to be used as available (see also \autoref{sec:DiscussionMethodology}). However, improvements were discussed with the manufacturer of the machine. +Due to the direct start of the harvesting season at the beginning of the project, the sorting machine and the hardware setup had to be used as available (see also \autoref{sec:DiscussionMethodology}). However, possible improvements were discussed with the manufacturer of the machine. -A second camera to capture the head of the asparagus is needed in particular\footnote{This is already the case for newer versions of the Autoselect ATS II like the one at Querdel’s Hof.} which is reflected in our results. Especially for features like flower or rusty head, an additional head camera helps greatly with the classification (as shown in \nameref{subsec:HeadNetwork}). Another camera taking an image from the bottom of the spear could improve the detection of the feature hollow.\footnote{These cameras are already part of new asparagus sorting hardware like the one mentioned here: \url{https://www.neubauer-automation.de/uk/asparagus-sorting-machine-espaso-technicaldata.php}}Additional perspectives of the hole asparagus spears give more usable information to determine certain features more accurately. For example differences in lighting and reflection can carry useful information about the shape. +A second camera to capture the head of the asparagus is needed in particular\footnote{This is already the case for newer versions of the Autoselect ATS II like the one at Querdel’s Hof.} which is reflected in our results. Especially for features like flower or rusty head, an additional head camera helps greatly with the classification (as shown in \nameref{subsec:HeadNetwork}). Another camera taking an image from the bottom of the spear could improve the detection of the feature hollow.\footnote{These cameras are already part of new asparagus sorting hardware like the one mentioned here: \url{https://www.neubauer-automation.de/uk/asparagus-sorting-machine-espaso-technicaldata.php} (visited on 04/28/2020)} Additional perspectives of the whole asparagus spears provide more information to determine certain features more accurately. For example, differences in lighting and reflection can carry useful information about the shape. -To further improve the setup, it is also conceivable that other sensor systems, such as laser technology, could help to find relevant properties for asparagus classification \citep{bhargava2018fruits}. As mentioned in the \nameref{ch:Discussion}, the most crucial component for improving future asparagus sorting with the Autoselect ATS II is a further development of the setup. This requires hardware changes to the sorting machine and is therefore beyond the scope of our project. 
+As mentioned in the \nameref{ch:Discussion}, the most crucial component for improving future asparagus sorting with the Autoselect ATS II is a further development of the setup. To improve the setup, it is conceivable that other sensor systems, such as laser technology to measure the exact size, could help to find relevant properties for asparagus classification \citep{bhargava2018fruits}. As this requires hardware changes to the sorting machine, it is beyond the scope of our project.
\bigskip
Another claim that we started to address in the \nameref{sec:DiscussionMethodology}
-is that the asparagus classification varies between different farms. Further it also varies within one farm throughout the harvesting season. For example, one request of the farmer of Gut Holsterfeld is to have the possibility to sort asparagus in a higher category if weather conditions reduce the usual amount of high quality asparagus. In order to meet this requirement, manual adjustment of parameters and a smooth transition for several features is needed. At the moment this is impossible with most contemporary neural networks. Since there is the temporal, sequential component of the harvesting season, it may be worthwhile to consider \acrshortpl{lstm} in combination with \acrshortpl{cnn}. According to the literature there are still many wide possibilities, for example applying bayesian networks learning.
+is that the asparagus classification varies between different farms. Further, it also varies within one farm throughout the harvesting season. For example, one request of the farmer of Gut Holsterfeld is to have the possibility to sort asparagus in a better category if weather conditions reduce the usual amount of high quality asparagus. In order to meet this requirement, manual adjustment of parameters and a smooth transition for several features is needed. At the moment, this is impossible with most contemporary neural networks. Since there is the temporal, sequential component of the harvesting season, it may be worthwhile to consider \acrshortpl{lstm} in combination with \acrshortpl{cnn}. According to the literature, there is still a wide range of possibilities, for example applying Bayesian network learning.
+\bigskip
+If this work is followed by another project, the collected images, the different datasets and also the hand label app can be used for further approaches. Access to all images and the datasets can be gained through the university server.\footnote{ Until 01/31/2021, it can be accessed via \\ /net/projects/scratch/summer/valid\textunderscore until\textunderscore 31\textunderscore January\textunderscore 2021/jzerbe.} The documentation of the different datasets and the hand label app can be found in this report. The code can be found in our GitHub repository.\footnote{ at \url{https://github.com/CogSciUOS/asparagus} (as of 11/27/2020)} A time-consuming part of our project was to collect data and explore different computer vision and machine learning approaches. A follow-up project could directly focus on developing and evaluating end-to-end pipelines for asparagus classification. Our results suggest that using binary feedforward \acrshortpl{cnn} would be a promising starting point for future work, because we have enough labeled data and they offer a flexible and simple solution.
Additional fine-tuning of the architecture to fit the individual needs of the to-be-predicted features (or class labels) and further preprocessing of the data will improve the network (as demonstrated in \autoref{subsec:FeatureEngineering}). The exploratory approaches from this project should be looked at for inspiration, but the unsupervised and semi-supervised approaches are not suggested to be continued. If additional labeled data is wanted, the hand label app can be used.\footnote{The code can be found at our Github repository at \url{https://github.com/CogSciUOS/asparagus/tree/FinalProject/labeling/hand\textunderscore label\textunderscore assistant} (as of 11/27/2020).} To finalize the project, the finished algorithm should be integrated into the sorting machine and evaluated in the real life setting. \bigskip -In conclusion, we can confirm that modern approaches from computer vision and machine learning bear huge potential for the improvement of asparagus classification. Effective means of agile development allow for efficient collaboration in the production of the respective implementations. We demonstrated that the algorithms we selected could be used not only for scientific purposes but also in industrial applications. We strongly believe, machine learning approaches can help to improve the classification of asparagus into commercial quality classes. \ No newline at end of file +In conclusion, we can confirm that modern approaches from computer vision and machine learning bear huge potential for the improvement of asparagus classification. We demonstrated that the algorithms we selected and adapted from scientific purposes could also be used in industrial applications. We strongly believe, machine learning approaches can help to improve the classification of asparagus into commercial quality classes. \ No newline at end of file diff --git a/documentation/report/Chapters/Dataset.tex b/documentation/report/Chapters/Dataset.tex index b388ce2..30a5d06 100644 --- a/documentation/report/Chapters/Dataset.tex +++ b/documentation/report/Chapters/Dataset.tex @@ -7,36 +7,36 @@ \section{Preprocessing and data set creation} In this chapter, the different preparatory steps for the recorded data are described, including the creation of a data set which is usable for any machine learning or computer vision approach to analyze the image data. In~\ref{sec:Preprocessing}~\nameref{sec:Preprocessing} the data is assessed and simplified for any further processing. The second section,~\ref{sec:AutomaticFeatureExtraction}~\nameref{sec:AutomaticFeatureExtraction}, deals with the creation of feature scripts that were researched and implemented for an automatic recognition of single features. -The results were combined in an application which is described in detail in~\ref{sec:LabelApp}~\nameref{sec:LabelApp}. In~\ref{sec:ManualLabeling}~\nameref{sec:ManualLabeling}, the process of hand-labeling the images for their feature class with the label application is described, followed by a section analyzing the results and comparing the overall agreement of the labelers. The last section~\ref{sec:AsparagusDataSet}~\nameref{sec:AsparagusDataSet} concludes with the creation of the final data set, used for the later training of the neural networks and other approaches to detect the label of a spear from its three images. +The results were combined in an application which is described in detail in~\ref{sec:LabelApp}~\nameref{sec:LabelApp}. 
In~\ref{sec:ManualLabeling}~\nameref{sec:ManualLabeling}, the process of hand-labeling the images for their features with the label application is described, followed by a section analyzing the results and comparing the overall agreement of the labelers. The last section~\ref{sec:AsparagusDataSet}~\nameref{sec:AsparagusDataSet} concludes with the creation of the final data set, used for the training of the neural networks and other approaches to detect the label of a spear from its three images. \begin{figure}[!hb] - \centering - \includegraphics[width=0.98\textwidth]{Figures/chapter03/working_steps.png} - \decoRule - \caption[Working Steps from Data Collection to Classification]{\textbf{Working Steps from Data Collection to Classification}~~~All in all, 578226 unlabeled images were collected. Additionally, 13271 images with a class label were collected (see \autoref{tab:LabeledClassNumber} for the number of images per class label). The collected images go through different preprocessing steps. The preprocessed images are then taken as input to the computer vision based feature extraction algorithms. The preprocessed images and the computer vision based, extracted measures are taken as input to the hand-label app. By the help of the application, features are manually labeled by seven annotators. Agreement measures are calculated to compare agreement across annotators. The preprocessed images, plus the computer vision based features, plus the manual labels are taken to create different datasets, which are further used for the different classification approaches.} - \label{fig:WorkingSteps} + \centering + \includegraphics[width=0.98\textwidth]{Figures/chapter03/working_steps.png} + \decoRule + \caption[Working Steps from Data Collection to Classification]{\textbf{Working Steps from Data Collection to Classification}~~~Overall, 578226 unlabeled images were collected. Additionally, 13271 images with a class label were collected (see \autoref{tab:LabeledClassNumber} for the number of images per class label). The collected images go through different preprocessing steps. The preprocessed images are then taken as input to the various algorithms implemented in this project. Further, the preprocessed images combined with the computer vision based, extracted features (see \autoref{sec:AutomaticFeatureExtraction}) are taken as input to the hand-label app. With the help of the application, features are manually labeled by seven annotators. Agreement measures are calculated to compare agreement across annotators. The preprocessed images, the computer vision based features, and the manual labels are taken to create different datasets, which are further used for the different classification approaches.} + \label{fig:WorkingSteps} \end{figure} \subsection{Preprocessing} \label{sec:Preprocessing} -Before implementing any approach that allows to predict a label to an asparagus spear, the recorded image data has to go through multiple preprocessing steps. +Before implementing any approach, the recorded image data has to go through multiple preprocessing steps. The goal of this preprocessing is to reduce the variance by removing the background, shifting the asparagus spear to the center of the image patch and rotating it upwards. Correct orientation is especially important to make approaches such as \acrshort{pca} applicable and facilitates direct measuring of features. 
Background removal facilitates the measurement of features like width and height of the spears, because the position of foreground pixels per column can be easily evaluated (see \autoref{sec:AutomaticFeatureExtraction}). -In the following, the different preprocessing steps are elaborated in detail.\footnote{See \url{https://github.com/CogSciUOS/asparagus/blob/FinalProject/preprocessing/perform\_preprocessing.py}} +In the following, the different preprocessing steps are elaborated in detail.\footnote{See \url{https://github.com/CogSciUOS/asparagus/blob/FinalProject/preprocessing/perform\textunderscore preprocessing.py} (as of 11/27/2020)} \bigskip -As described in~\autoref{sec:DataCollection}, each asparagus can be found in three pictures, one in each of the three positions – left, center and right. The image names are used to find the three relevant images and determine in which position the asparagus is captured. The images are cut into three pieces and renamed in a way that makes clear which images belong together. Each asparagus gets a unique identification number and the three perspectives are denoted with textit{a} for left, textit{b} for center, and textit{c} for right. For example, the texttt{image 42\_b.png} is the center image of the asparagus spear with the identification number 42. +As described in~\autoref{sec:DataCollection}, each asparagus is depicted in three pictures, one in each of the three positions \textendash{} left, center and right. The image names are used to find the three relevant images and determine in which position the asparagus is captured. The images are cut into three pieces and renamed in a way that makes clear which images belong together. Each asparagus gets a unique identification number and the three perspectives are denoted with \textit{a} for left, \textit{b} for center, and \textit{c} for right. For example, the \texttt{image 42\textunderscore b.png} is the center image of the asparagus spear with the identification number 42. -Another step is to remove the background of the image. As the conveyor belt is blue, there is a high contrast to the bright asparagus spears which facilitates background removal (see \autoref{fig:PreprocessingCropping}). Hence, it is possible to mask the asparagus spear using the hue of the HSV representation of each image. All pixels with a blue hue and very dark regions are marked as background through threshold limitation of the value component. This is particularly important for the automatic feature extraction (see~\autoref{sec:AutomaticFeatureExtraction}). +Another step is to remove the background of the image. As the conveyor belt is blue, there is a high contrast to the bright asparagus spears which facilitates background removal (see \autoref{fig:PreprocessingCropping}). Hence, it is possible to mask the asparagus spear using the hue of the HSV representation of each image. All pixels with a blue hue and very dark regions are marked as background through threshold limitation of the value component. This is particularly important for the automatic feature extraction (see~\autoref{sec:AutomaticFeatureExtraction}). \begin{figure}[!ht] \centering \includegraphics[width=0.98\textwidth]{Figures/chapter03/preprocessing_pipeline.png} \decoRule - \caption[Preprocessing Pipeline]{\textbf{Preprocessing Pipeline}~~~The depiction shows the preprocessing pipeline that was used to generate different datasets. For each asparagus spear three images were saved. Downscaled versions were computed for each image dataset. 
A: The processing steps to retrieve full-scale images with and without background. B: The approach to retrieve color palette images and color histograms by feature engineering. C: Processing steps for the retrieval of partial angles. For details on B and C see also \autoref{subsec:FeatureEngineering}.}
+ \caption[Preprocessing Pipeline]{\textbf{Preprocessing Pipeline}~~~The depiction shows the preprocessing pipeline that was used to generate different datasets. For each asparagus spear three images were saved. Downscaled versions were computed for each image dataset. A: The processing steps to retrieve full-scale images with and without background. B: The approach to retrieve color palette images and color histograms by feature engineering. C: Processing steps for the retrieval of partial angles. For details on B and C see also \autoref{subsec:FeatureEngineering}.}
\label{fig:PreprocessingPipeline}
\end{figure}
@@ -49,13 +49,13 @@ \subsection{Preprocessing}
\end{figure}
\bigskip
-Subsequently, each triple of images that depict one piece from different angles is determined and the area that shows the correct piece is cropped. The cropping regions are set to three patches that are a little larger than the compartments of the conveyor belt. They are located such that they cover the spear of interest which can either be at the left, center or right (see \autoref{fig:ExampleImagesAnna}). As it has been found that some tilted spears span across the borders between the conveyor compartments, the cropping window in the images was moved. The new coordinates are determined by shifting the center of the current cropping box horizontally to the center of mass of the contained foreground pixels (see \autoref{fig:PreprocessingCropping}). Repeating the procedure in a second iteration can further improve the result. As this is the case for few examples only we refrained from doing so to increase processing speed. Small parts of the neighboring depiction potentially end up in the region of interest as well. These remainders are removed by masking out all pixels that do not belong to the main blob of connected pixels.
+Subsequently, each triple of images that depict one piece from different angles is determined and the area that shows the correct piece is cropped. The cropping regions are set to three patches that are a little larger than the compartments of the conveyor belt. They are located such that they cover the spear of interest which can either be at the left, center or right (see \autoref{fig:ExampleImagesAnna}). As it has been found that some tilted spears span across the borders between the conveyor compartments, the cropping window in the images was moved. The new coordinates are determined by shifting the center of the current cropping box horizontally to the center of mass of the contained foreground pixels (see \autoref{fig:PreprocessingCropping}). Repeating the procedure in a second iteration can further improve the result. As this is the case for only a small number of examples, we refrained from doing so to increase processing speed. Small parts of the neighboring depiction potentially end up in the region of interest as well. These remainders are removed by masking out all pixels that do not belong to the main blob of connected pixels.
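The hue-based background removal, the main-blob masking, and the horizontal recentering of the cropping window described above can be summarized in a short sketch. It is illustrative only: the hue and value thresholds, the crop width, and the function names are assumptions and not the values used for the actual data set.

\begin{verbatim}
# Sketch of background removal and crop recentering (illustrative only;
# thresholds, crop width and function names are assumptions).
import cv2
import numpy as np

def remove_background(img_bgr, blue_hue=(90, 140), min_value=40):
    """Mask the blue conveyor belt and very dark regions as background."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    background = ((h >= blue_hue[0]) & (h <= blue_hue[1])) | (v < min_value)
    foreground = ~background
    # Keep only the largest connected blob to drop remains of neighbouring spears.
    n, labels = cv2.connectedComponents(foreground.astype(np.uint8))
    if n > 1:
        sizes = [(labels == i).sum() for i in range(1, n)]
        foreground = labels == (1 + int(np.argmax(sizes)))
    out = img_bgr.copy()
    out[~foreground] = 0
    return out, foreground

def recenter_crop(mask, crop_width, start_center_x):
    """Shift the crop window horizontally to the foreground's center of mass."""
    ys, xs = np.nonzero(mask)
    center_x = int(xs.mean()) if xs.size else start_center_x
    left = max(0, center_x - crop_width // 2)
    return left, left + crop_width
\end{verbatim}

\bigskip
To further reduce the variance, the asparagus spears are rotated upwards.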
The rotation angle is achieved by binarizing the image into foreground and background pixels, calculating the centerline as the mean pixel location along the vertical axis, and fitting linear regression to the centerline pixels. \bigskip -Another preprocessing step, which generates an additional set of images, is quantization using a common color palette. It was mainly employed to allow for the computation of meaningful color histograms.\footnote{ The number of colors in the original 24 bit RGB colorspace is too large to use them as bins for the histograms: The low number of pixels per bin would not allow for meaningful statistics. By reducing the number of colors the problem is solved.} An appropriate palette and the respective mapping of RGB values to palette colors is determined using clustering in color space. First, a set of RGB tuples is collected by adding pixel values of 10000 asparagus spears. Second, the resulting list of RGB tuples is converted to an image such that a palette can be determined using standard tools for quantization. In the last step a clustering algorithm is employed that determines the position of cluster centers while maximizing the maximal coverage. The resulting cluster centers can be displayed as a list of RGB values which represent the color palette.\footnote{Here we used a standard implementation for image quantization that employs a maximal coverage algorithm ~\citep{pil_quantization}. An optimal solution to the problem of maximum coverage relates to the challenge of distributing a given number of circular areas named facilities such that they cover the largest possible area of the sample space ~\citep{zarandi2011large}. One can interpret the centers of named facilities as cluster centers. For each data point (here: pixel), the closest cluster center is determined and the respective value attributed. This means the data is quantized.} The color palette is used for quantization of the images: Each image of the downscaled data set is transformed to the palette representation. Visual inspection shows little quality loss such that it can be assumed that the relevant information for image classification is well preserved. +Another preprocessing step, which generates an additional set of images, is quantization using a common color palette. It was mainly employed to allow for the computation of meaningful color histograms.\footnote{The number of colors in the original 24 bit RGB colorspace is too large to use them as bins for the histograms: The low number of pixels per bin would not allow for meaningful statistics. By reducing the number of colors the problem is solved.} An appropriate palette and the respective mapping of RGB values to palette colors is determined using clustering in color space. First, a set of RGB tuples is collected by adding pixel values of 10000 asparagus spears. Second, the resulting list of RGB tuples is converted to an image such that a palette can be determined using standard tools for quantization. In the last step, a clustering algorithm is employed that determines the position of cluster centers while maximizing the maximal coverage. The resulting cluster centers can be displayed as a list of RGB values which represent the color palette.\footnote{Here we used a standard implementation for image quantization that employs a maximal coverage algorithm ~\citep{pil_quantization}. 
An optimal solution to the problem of maximum coverage relates to the challenge of distributing a given number of circular areas named facilities such that they cover the largest possible area of the sample space ~\citep{zarandi2011large}. One can interpret the centers of named facilities as cluster centers. For each data point (here: pixel), the closest cluster center is determined and the respective value attributed. This means the data is quantized.} The color palette is used for quantization of the images: Each image of the downscaled data set is transformed to the palette representation. Visual inspection shows little quality loss such that it can be assumed that the relevant information for image classification is well preserved. \bigskip Several additional collections of preprocessed images are computed based on the data without background. This holds for downscaled versions as well as for a version that contains the asparagus heads only. To compute the latter, the images are padded to avoid sampling outside of the valid image boundaries and the uppermost foreground row is detected. Subsequently, the center pixel is determined and the image is cropped such that this uppermost central pixel of the asparagus is the center of the uppermost row of the snippet. The resulting partial images of asparagus heads are rotated using the centerline regression approach described above. The approach has proven reliable and the resulting depictions are used to train a dedicated network for head related features (see~\autoref{subsec:HeadNetwork}). @@ -70,10 +70,10 @@ \subsection{Feature extraction} \bigskip We decided to label the images for their features rather than final class labels. The main reason for this was to make the labeling process easier for the inexperienced annotator. The boundaries between class labels are not always clear and can be difficult to detect from image data. Deciding whether a single feature is present or absent in an asparagus spear is more straightforward. Further, the training for the hand labeling and communication about special cases is facilitated. Another reason to label features rather than class labels was to break down the classification problem into smaller problems. Additionally, it is possible to detect which features are more difficult to learn than others, which provides meaningful insight into the classification task. Last but not least, deciding on the class labels after the features are detected reliably is a small step and can be easily done with a decision tree or rule-based approach. -Even though the chosen features closely resemble the class labels defined by Gut Holsterfeld, they are different and should not be confused with one another. The 6 features we labeled by hand are as follows: hollow, flower, rusty body, rusty head, bent and violet. Additionally, the length and width was automatically detected as described below and used to set supplementary labels for very thick, medium thick, thick, thin, very thin and for fractured. Further, images that could not be classified thoroughly were labeled as ``not classifiable’’ (e.g.\ when the spear is cut off the image). +Even though the chosen features closely resemble the class labels defined by Gut Holsterfeld, they are different and should not be confused with one another. The 6 features we labeled by hand are as follows: hollow, flower, rusty body, rusty head, bent and violet. 
Additionally, the length and width were automatically detected as described below and used to set supplementary labels for very thick, medium thick, thick, thin, very thin, and fractured. Further, images that could not be classified thoroughly were labeled as \enquote{not classifiable} (e.g.\ when the spear is cut off the image). \bigskip -In this chapter, the different features as well as their extraction methods will be described. The results that are achieved by computationally extracting the features are reported alongside future steps that could be taken to improve the results further. For each feature detection method, the images with removed background are used as is displayed in a respective example image shown per feature. Additionally, it is described which features were hand labeled by us. The feature functions, that provide reliable predictions, were integrated into an application, which is described in the subsequent \autoref{sec:LabelApp}, with which human annotators could manually label the unlabeled data. +In this chapter, the different features as well as their extraction methods will be described. The results that are achieved by computationally extracting the features are reported alongside future steps that could be taken to improve the results further. For each feature detection method, the images with removed background are used, as displayed in a respective example image for each feature. Additionally, it is described which features were hand labeled by us. The feature functions that provide reliable predictions were integrated into an application, described in the subsequent \autoref{sec:LabelApp}, with which human annotators could manually label the unlabeled data. \subsubsection{Length} @@ -82,7 +82,7 @@ \subsubsection{Length} The length detection described in the following paragraph was later used to automatically calculate the presence of the feature fractured in an image. An asparagus spear includes the feature fractured if it is broken or if it does in any other way not fulfill the required minimal length of 210 mm (see \autoref{fig:ExampleFractured}). \bigskip -The length detection uses a pixel-based approach. It counts the number of rows from the highest to the lowest pixel that is not a black background pixel and therefore not zero. The asparagus is rotated upwards, as described in~\autoref{sec:Preprocessing}. This is done to improve the results, as the rows between the highest and the lowest pixel are counted and not the pixels themselves. This technique is a simplification, which does not represent curved asparagus very well, because it will have a shorter count than it would have if the pixels were counted along the asparagus spear. +The length detection uses a pixel-based approach. It counts the number of rows from the highest to the lowest pixel that is not a black background pixel and therefore not zero. The asparagus is rotated upwards, as described in~\autoref{sec:Preprocessing}. This is done to improve the results, as the rows between the highest and the lowest pixel are counted and not the pixels themselves. This technique is a simplification, which does not represent curved asparagus accurately, because it will have a shorter count than it would have if the pixels were counted along the asparagus spear.
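A minimal sketch of this row-counting approach, assuming a background-free image loaded as a NumPy array (the function and variable names are illustrative and not taken from the project code), could look as follows:
\begin{verbatim}
import numpy as np

def spear_length_in_rows(img):
    """Count the rows between the top- and bottommost foreground pixels."""
    # Foreground = any pixel that is not exactly zero after background removal.
    foreground = img.any(axis=-1) if img.ndim == 3 else img > 0
    rows = np.where(foreground.any(axis=1))[0]
    if rows.size == 0:
        return 0                        # no spear visible in the image
    return int(rows[-1] - rows[0] + 1)
\end{verbatim}
A spear would then be flagged as fractured whenever this count falls below whatever pixel threshold is chosen to correspond to the minimal length of 210 mm.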
\begin{wrapfigure}{!I}{0.35\textwidth} \begin{center} @@ -93,38 +93,36 @@ \subsubsection{Length} \label{fig:ExampleFractured} \end{wrapfigure} -However in reality, there are not a lot of asparagus spears close to the decision boundary between a fractured spear and a whole spear. Usually, the asparagus is harvested a few centimeters longer than necessary and then cut to the desired length. The only asparagus shorter than that length are the ones that break during the sorting process. Moreover, if they break, they generally break closer to the center of the asparagus rather than at the ends. Therefore, the difference in length detection does not matter for our classification. +However in reality, there are not a lot of asparagus spears close to the decision boundary between a fractured spear and a whole spear. Usually, the asparagus is harvested a few centimeters longer than necessary and then cut to the desired length. The only asparagus shorter than that length are the ones that break during the sorting process. Moreover, if they break, they generally break closer to the center of the asparagus rather than at the ends. Therefore, the difference in length detection does not matter for our purposes. \bigskip -All in all, by visual inspection the length detection yields good results that are very helpful for the hand-label app. The next step would be to train a decision boundary that determines which number of pixels should be the threshold to differentiate between fractured and not fractured. At first, we tried to calculate this threshold by finding a conversion factor from pixel to millimeter, as we know the cut off in millimeters. But this approach appeared to be more difficult than anticipated, because the conversion factor varies in the different image positions. This problem only became apparent after the asparagus season had ended, for which reason we could not reproduce the camera calibrations in retrospective in order to take well-measured images, for example from a chessboard pattern. Accordingly, the threshold needs to be deduced from the data manually or learned with a machine learning approach. +All in all, by visual inspection the length detection yields good results that are very helpful for the hand-label app. The next step would be to train a decision boundary that determines which number of pixels should be the threshold to differentiate between fractured and not fractured. At first, we tried to calculate this threshold by finding a conversion factor from pixel to millimeter, as we know the cut off in millimeters. But this approach appeared to be more difficult than anticipated, because the conversion factor varies in the different image positions. This problem only became apparent after the asparagus season had ended, for which reason we could not reproduce the camera calibrations in retrospective in order to take well-measured images. Accordingly, the threshold needs to be deduced from the data manually or learned with a machine learning approach. \begin{wrapfigure}{!I}{0.35\textwidth} \centering \includegraphics[width=0.15\textwidth]{Figures/chapter03/example_img_thick.png} - \caption[Example Image Not Classifiable]{ \textbf{Feature Not Classifiable} \\ Example image for the feature not classifiable. Further, a difference in width of the asparagus is observable.} + \caption[Example Image Not Classifiable]{ \textbf{Feature Not Classifiable} \\ Example image for the feature not classifiable. 
Further, a difference in width of both asparagus is observable but the exact thickness is hard to determine by view alone.} \label{fig:ExampleThickness} - \vspace{-10pt} + \vspace{-15pt} \end{wrapfigure} \subsubsection{Width} \label{subsec:Width} -Like the length of a spear, thickness is a feature that is hardly recognizable by view alone. Fortunately, it could also be automatically extracted with a classical approach described in the following section. +Like the length of a spear, thickness is a feature that is hardly recognizable by view alone (see for example \autoref{fig:ExampleThickness}). Fortunately, it can also be automatically extracted with a classical approach described in the following section. \bigskip The division into different ranges of width can be inferred by the overall thickness of the spear. The feature ‘very thick’ is attributed to asparagus that is more than 26 mm in width. The feature ‘thick’ corresponds to 20 -- 26 mm, the feature ‘medium thick’ to 18 -- 20 mm, and the feature ‘thin’ to 16 -- 18 mm. Every asparagus with less than 16 mm in width is described with the feature ‘very thin’. \bigskip -The width detection uses a very similar approach as the length detection. It takes the pixel count from the left-most to the right-most pixel in a certain row as a width measure. But in contrast to the length, the width was measured at several image rows from which the mean width was taken. Since the width detection works reliably, it is integrated in the hand-label app (see \autoref{sec:LabelApp}). +The width detection uses a very similar approach as the length detection. It takes the pixel count from the left-most to the right-most pixel in a certain row as a width measure. But in contrast to the length, the width is measured at several image rows from which the mean width is taken. Since the width detection works reliably, it is integrated in the hand-label app (see \autoref{sec:LabelApp}). -The algorithm operates as follows: Firstly, the images are binarized into foreground and background, which means setting all pixels that are not zero, and therefore not background, to one. After that, the uppermost foreground pixel is detected and the length is calculated with the length detection function as described above. The length of the asparagus is used to divide it into even parts. This is done by determining a start pixel and dividing the remaining rows that contain foreground pixels by the number of positions one wants to measure at. This way several rows are selected in which the number of foreground pixels is counted. One can interpret each row as a cross-section of the asparagus, therefore the number of foreground pixels is a direct measure for the width. Then, the mean of these counts is calculated and used as the final width value. As the head of the asparagus can be of varying form and does not represent the width of the whole asparagus well, it is excluded from the measure. This is done by selecting a start pixel below the head area instead of naively choosing the uppermost pixel. To be precise, the start pixel is chosen 200 pixels, which corresponds to roughly 25 mm, below the uppermost pixel in order to bypass the head area with certainty. As described in the section \nameref{subsec:Length}, also the width detection might lead to slightly different outcomes on curved asparagus spears than the true values. Again, this difference is regarded as irrelevant in our case. 
+The algorithm operates as follows: Firstly, the images are binarized into foreground and background, which means setting all pixels that are not zero, and therefore not background, to one. After that, the uppermost foreground pixel is detected and the length is calculated with the length detection function as described above. The length of the asparagus is used to divide it into even parts. This is done by determining a start pixel and dividing the remaining rows that contain foreground pixels by the number of positions one wants to measure at. This way several rows are selected in which the number of foreground pixels is counted. One can interpret each row as a cross-section of the asparagus; therefore, the number of foreground pixels is a direct measure for the width. Then, the mean of these counts is calculated and used as the final width value. As the head of the asparagus can be of varying form and does not represent the width of the whole asparagus well, it is excluded from the measure. This is done by selecting a start pixel below the head area instead of naively choosing the uppermost pixel. To be precise, the start pixel is chosen 200 pixels, which corresponds to roughly 25 mm, below the uppermost pixel in order to bypass the head area with certainty. As described in the section \nameref{subsec:Length}, the width detection might also lead to slightly different outcomes than the true values on curved asparagus spears. Again, this difference is regarded as irrelevant in our case. \subsubsection{Rust} \label{subsec:Rust} -The feature rust is split into the (sub-) features rusty body (see \autoref{fig:ExampleRustyBody}) and rusty head (see \autoref{fig:ExampleRustyHead}), because in the case of rust being only on the body, it is removable by peeling. Rust at the top part of the spear cannot be removed without damaging the head. Thus, rust on the head region is a decisive factor for the quality and later categorization into a price class. - \begin{wrapfigure}{!I}{0.35\textwidth} \begin{center} \includegraphics[width=0.15\textwidth]{Figures/chapter03/example_img_rustybody.png} @@ -135,6 +133,8 @@ \subsubsection{Rust} \label{fig:ExampleRustyBody} \end{wrapfigure} +The feature rust is split into the sub-features rusty body (see \autoref{fig:ExampleRustyBody}) and rusty head (see \autoref{fig:ExampleRustyHead}). Rust that only covers the body can be removed by peeling, whereas rust at the top part of the spear cannot be removed without damaging the head. Thus, rust on the head region is a decisive factor for the quality and later categorization into a price class. + If a spear has rust, it is visible as a dark brown color. It often starts at the tips of developing leaves or at the bottom part. The color is not to be confused with bruises, pressure marks, or a slightly yellow complexion, which can occur in a ripe asparagus. The latter coloring is neglected. Rust is set to be present even when only the tip of a leaf shows a dark spot. Other brownish bruises are not classified as rust. @@ -157,7 +157,7 @@ \subsubsection{Rust} Visual inspection shows that the rust detection algorithm works well to detect rusty areas and barely misses any rusty parts. The difficulty lies in setting a threshold for the number of pixels needed to be classified as rusty. Only clusters of brown pixels are reliable indicators for rust. Many pixels with a brown color distributed over the whole spear are not supposed to be classified as rust.
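The following sketch illustrates how such a brown-pixel count could be obtained; the HSV bounds used here for \enquote{brown} are illustrative assumptions and not the thresholds used in the project:
\begin{verbatim}
import numpy as np
from matplotlib.colors import rgb_to_hsv

def rust_pixel_count(img, hue_range=(0.02, 0.11), max_value=0.6):
    """Count foreground pixels whose color falls into an assumed brownish range."""
    hsv = rgb_to_hsv(img.astype(float) / 255.0)   # img: 8-bit RGB, background = zero
    hue, sat, val = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    foreground = img.any(axis=-1)
    brown = (foreground
             & (hue >= hue_range[0]) & (hue <= hue_range[1])
             & (sat > 0.3)             # exclude pale, greyish pixels
             & (val < max_value))      # exclude bright, healthy white
    return int(brown.sum())
\end{verbatim}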
It might be the case that a simple pixel count is not sufficient to set a classification threshold. More sophisticated approaches to detect clusters, such as morphological operators, could be beneficial for this feature detection. It remains unsolved to set a robust threshold that works well on the whole data set. \bigskip -One problem that cannot be solved algorithmically is dirt in the sorting machine. If the machine is not cleaned thoroughly and regularly, dirt can be falsely classified as rust because it often falls in the same color range. Another problem can be a change of lighting when taking the images. Both issues can be controlled for, but have to be communicated well to the farmers. +One problem that cannot be solved algorithmically is dirt in the sorting machine. If the machine is not cleaned thoroughly and regularly, dirt can be falsely classified as rust because it often falls into the same color range. Another problem can be a change of illumination when taking the images, as a small change in illumination can produce a large change in the appearance of a spear. \subsubsection{Violet} @@ -178,12 +178,12 @@ \subsubsection{Violet} \end{wrapfigure} \bigskip -According to the UNECE-norm asparagus of the highest quality grade may only be marginally violet or not violet at all ~\citep{unspargelnorm}. Hence it is crucial to sort asparagus pieces according to this binary attribute. In a simple procedure color hues are evaluated. More precisely, this strategy is based on evaluating histograms of color hues that are calculated for foreground pixels of the asparagus images after converting them to the HSV color space. Pale pixels are removed from the selection by thresholding based on the value component of the HSV representation. Finding the optimal threshold has proven difficult because of named subjectivity in color perception. A threshold of 0.3 for the value component is considered a good compromise: If applied, white and slightly rose pixels are masked out. All three perspectives are taken into account to compute a single histogram per asparagus spear. A score is calculated by summing up the number of pixels that lie within the violet color range. A second threshold is used as the decision boundary for violet detection. The direct and intuitive feedback in the hand-label app showed the relation between varying thresholds and the prediction. It could be seen that lowering the threshold also means that the feature extractor becomes more sensitive at the price of a reduced specificity. Best overall matches (accuracies) with the subjective perception are found for very low thresholds. In many cases, however, measurements based on this definition of violet do not match the feature label attributed by human coders. +According to the UNECE-norm, asparagus of the highest quality grade may only be marginally violet or not violet at all~\citep{unspargelnorm}. Hence it is crucial to sort asparagus pieces according to this binary attribute. In a simple procedure, color hues are evaluated. More precisely, this strategy is based on evaluating histograms of color hues that are calculated for foreground pixels of the asparagus images after converting them to the HSV color space. Pale pixels are removed from the selection by thresholding based on the value component of the HSV representation. Finding the optimal threshold has proven difficult because of the subjectivity of color perception. A threshold of 0.3 for the value component is considered a good compromise.
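A condensed sketch of this masking and histogram computation (the number of bins is an illustrative choice, and the direction of the value comparison reflects our reading of the description above) might look like this:
\begin{verbatim}
import numpy as np
from matplotlib.colors import rgb_to_hsv

def hue_histogram(img, value_threshold=0.3, bins=64):
    """Histogram of hues of the non-pale foreground pixels of one perspective."""
    hsv = rgb_to_hsv(img.astype(float) / 255.0)   # img: 8-bit RGB, background = zero
    hue, val = hsv[..., 0], hsv[..., 2]
    mask = img.any(axis=-1) & (val < value_threshold)   # drop background and pale pixels
    hist, _ = np.histogram(hue[mask], bins=bins, range=(0.0, 1.0))
    return hist
\end{verbatim}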
All three perspectives are taken into account to compute a single histogram per asparagus spear. A score is calculated by summing up the number of pixels that lie within the violet color range. A second threshold is used as the decision boundary for violet detection. The direct and intuitive feedback in the hand-label app showed the relation between varying thresholds and the prediction. It could be seen that lowering the threshold also means that the feature extractor becomes more sensitive at the price of a reduced specificity. Best overall matches (accuracies) with the subjective perception are found for very low thresholds. In many cases, however, measurements based on this definition of violet do not match the feature label attributed by human coders. -Hence, another sparse descriptor is derived from the input images. Instead of setting thresholds for pale values and calculating the histograms of color hues, this approach relies directly on the colors that are present in the depiction of an asparagus spear. As the 24 bit representations contain a vast amount of color information in relation to the number of pixels, it is, however, unfeasible to use these as input. Instead, the color palette images can be used. Histograms of palette images can serve as the basis to define the feature violet in a way that captures more of the initial color information. At the same time it is simple and understandable enough to allow for customizations by users of sorting algorithms or machines. As a consensus regarding such an explicit definition is hard to achieve and somewhat arbitrary, the descriptor is used to learn implicit definitions of the feature through examples (see~\autoref{subsec:FeatureEngineering}). +Hence, another sparse descriptor is derived from the input images. This approach relies directly on the colors that are present in the depiction of an asparagus spear. As the 24 bit representations contain a vast amount of color information in relation to the number of pixels, it is, however, unfeasible to use these as input. Instead, the color palette images can be used. Histograms of palette images can serve as the basis to define the feature violet in a way that captures more of the initial color information. At the same time it is simple and understandable enough to allow for customizations by users of sorting algorithms or machines. As a consensus regarding such an explicit definition is hard to achieve and somewhat arbitrary, the descriptor is used to learn implicit definitions of the feature through examples (see~\autoref{subsec:FeatureEngineering}). \bigskip -The lack of a formal definition for violet asparagus spears has proven to be a major challenge to approaches of measuring this feature. It has been shown that directly measuring whether an asparagus spear is violet heavily depends on the definition of this feature. It is to mention that color impression is highly subjective across and even within subjects ~\citep{luo2000review}. Effects of meta contrast that make minor variations in color more visible arguably affected the attribution of labels when many similar spears were assessed in succession ~\citep{reeves1981metacontrast}. Using machine learning can help to find the definition that generalizes best over varying color perceptions and retrieve objective rules to measure the degree to which an asparagus is violet. In other words, the task of establishing a rule that is a good compromise for several human attributors is shifted to an optimization algorithm. 
Hence, machine learning approaches that are trained on human labeled data appear to be more promising. +The lack of a formal definition for violet asparagus spears has proven to be a major challenge to approaches of measuring this feature. It has been shown that directly measuring whether an asparagus spear is violet heavily depends on the definition of this feature. It is to mention that color impression is highly subjective across and even within subjects ~\citep{luo2000review}. Effects of meta contrast that make minor variations in color more visible affected the attribution of labels when many similar spears were assessed in succession ~\citep{reeves1981metacontrast}. Using machine learning can help to find the definition that generalizes best over varying color perceptions and retrieve objective rules to measure the degree to which an asparagus is violet. In other words, the task of establishing a rule that is a good compromise for several human attributors is shifted to an optimization algorithm. Hence, machine learning approaches that are trained on human labeled data appear to be more promising. The automatic detection of the feature violet was integrated into the hand-label app as a helper function for the human annotators. @@ -191,7 +191,7 @@ \subsubsection{Violet} \subsubsection{Curvature} \label{subsec:Curvature} -The curvature score of an asparagus image is expected to automatically detect the presence of the feature bent. The function was used as a help to the human annotators during the manual labeling with the hand-label app. +The curvature score of an asparagus image is expected to automatically detect the presence of the feature bent. The function was used to help the human annotators during the manual labeling with the hand-label app. \begin{wrapfigure}{!I}{0.35\textwidth} \begin{center} @@ -203,7 +203,7 @@ \subsubsection{Curvature} \end{wrapfigure} \bigskip -An asparagus is categorized as having the feature bent, if the shape of the asparagus is curved and not straight (see \autoref{fig:ExampleBent}). +An asparagus is categorized as having the feature bent, if the shape of the asparagus is curved rather than straight (see \autoref{fig:ExampleBent}). If it is only slightly curved but can otherwise be thought of as straight -- that means fitting next to other straight spears without standing out -- it is labeled as straight. If the spear looks close to the same on all three pictures regarding its shape, it might indicate that it is heavily bent and therefore cannot be turned on the machine’s conveyor belt. @@ -214,12 +214,12 @@ \subsubsection{Curvature} Multiple curvature scores can easily be computed based on regression fits to the centerline of an asparagus spear. For example, the parameters of linear or polynomial regression can be interpreted as a description of how bent an asparagus spear is. \bigskip -Deriving sparse descriptions is based on a two-stage approach. In the first stage, the centerline of an asparagus spear is computed because it is considered to be a good description of the curvature of asparagus spears. In each image the asparagus spear is roughly vertically oriented. This means that also for bent spears the head relies within the top center of the image (see \autoref{sec:Preprocessing}). The centerline is computed by binarizing the image into foreground and background and computing the mean of pixel locations along the vertical axis (i.e.\ for each row). The resulting binary representation shows a single pixel line. 
It serves as the input to the second stage of curvature estimation. +Deriving sparse descriptions is based on a two-stage approach. In the first stage, the centerline of an asparagus spear is computed because it is considered to be a good description of the curvature of asparagus spears. In each image the asparagus spear is roughly vertically oriented. This means that also for bent spears the head lies within the top center of the image (see \autoref{sec:Preprocessing}). The centerline is computed by binarizing the image into foreground and background and computing the mean of pixel locations along the vertical axis (i.e.\ for each row). The resulting binary representation shows a single pixel line. It serves as the input to the second stage of curvature estimation. In the second stage, curves are fit to the pixel locations of the centerline. For a simple score, linear regression is employed and the sum of squared errors is thresholded and interpreted as a curvature score. This score is small for perfectly straight asparagus spears and increases the more bent an asparagus is. As an S-shaped asparagus is arguably perceived as bent even when the overall deviations from the center line are small, a second descriptor was computed as the ratio between the error of a linear fit and polynomial regression of degree three. Thresholding values and employing a voting scheme for the results for all three perspectives yields a rule to measure curvature (e.g.\ at least one of the three perspectives indicates that the asparagus is bent). However, it has again proven difficult to set thresholds appropriately to reliably capture the visual impression. Hence, another sparse representation was calculated by dividing the spears into several segments and fitting linear regression to each segment. A \acrfull{mlp} was trained on the resulting 18 angles per asparagus (see~\autoref{subsec:FeatureEngineering}). \bigskip -Calculating a score for curvature is fast and efficient. While the respective approach is suitable to define curvature, it does not necessarily meet up with the subjective perception of how bent an asparagus appears. Just like histograms of palette images, curvature scores are the results of feature engineering: The use of extensive domain knowledge to filter relevant features~\citep{zheng2018feature}. They can serve as an input to a machine learning approach that maps this sparse representation to the target categories (see~\autoref{subsec:FeatureEngineering}). +Calculating a score for curvature is fast and efficient. While the respective approach is suitable to define curvature on a technical level, it does not necessarily meet up with the subjective perception of how bent an asparagus appears. Just like histograms of palette images, curvature scores are the results of feature engineering: The use of extensive domain knowledge to filter relevant features~\citep{zheng2018feature}. They can serve as an input to a machine learning approach that maps this sparse representation to the target categories (see~\autoref{subsec:FeatureEngineering}). \begin{wrapfigure}{!I}{0.35\textwidth} \begin{center} @@ -238,7 +238,7 @@ \subsubsection{Flower} When a bud is in full bloom, it is clearly visible. However, it can be quite difficult to distinguish between an asparagus with clearly cut but closed petals and an asparagus that has just begun to develop a flower. It was decided to label the feature as absent when the asparagus does not clearly show the characteristic flower. 
With this decision, we aimed to reduce the correspondence error between annotators. \bigskip -The implementation of the flower detection function turned out to be difficult to realize. Several approaches have been tested, but none of them generated sufficiently good results. Two main notions were tried. The first approach uses the shape of the head as an indicator for a flower. The idea is that asparagus spears with a flowery head exhibit a less closed head shape. In other words, the head looks less round and has no smooth outline, but shows fringes. The second approach focuses on the structure within the head. Supposedly, asparagus with flowery heads exhibit more edges and lines in the head area. In both cases, it is challenging to find a way to discriminate between asparagus with and without flowery heads. One reason for that is the poor resolution of the camera that is installed in the sorting machine. With a pixel to millimeter ratio of around four to one\footnote{The ratio was not calculated by us but is an information provided by the manufacturer \citep{autoselectanleitung}.}, it is even difficult to detect flowers with the human eye. Likewise, the current software in the machine struggles greatly with the classification of this feature as well. +The implementation of the flower detection function turned out to be difficult to realize. Several approaches have been tested, but none of them generated sufficiently good results. Two main notions were tried. The first approach uses the shape of the head as an indicator for a flower. The idea is that asparagus spears with a flowery head exhibit a more open head shape. In other words, the head looks less round and has no smooth outline, but shows fringes. The second approach focuses on the structure within the head. Supposedly, asparagus with flowery heads exhibit more edges and lines in the head area. In both cases, it is challenging to find a way to discriminate between asparagus with and without flowery heads. One reason for that is the poor resolution of the camera that is installed in the sorting machine. With a pixel to millimeter ratio of around four to one\footnote{The ratio was not calculated by us but is an information provided by the manufacturer \citep{autoselectanleitung}.}, it is even difficult to detect flowers with the human eye. Likewise, the current software in the machine struggles greatly with the classification of this feature. \begin{wrapfigure}{!I}{0.35\textwidth} \begin{center} @@ -254,13 +254,13 @@ \subsubsection{Flower} \subsubsection{Hollow} \label{subsec:Hollow} -It was not possible to us to implement the feature hollow in an automatic, classical computer vision approach. Thus, only labels from the human annotators are available for this feature. +It was not possible for us to implement the feature hollow in an automatic, classical computer vision approach. Thus, only labels from the human annotators are available for this feature. \bigskip The feature hollow indicates if the spear has a cavity inside. -This might be expressed by a bulgy center and a line running vertically along the spear’s body. Another, more distinct indicator is when the asparagus looks like two spears fused together, forming a single asparagus (see \autoref{fig:ExampleHollow}). A hollow asparagus can be confused with a very thick asparagus. +This might be expressed by a bulgy center and a line running vertically along the spear’s body. 
Another, more distinct indicator is an asparagus that looks like two spears fused together (see \autoref{fig:ExampleHollow}). A hollow asparagus can be confused with a very thick asparagus. -The feature can be easily checked when you have physical access to the asparagus. If the asparagus is actually hollow, it will have a hole at its bottom that is noticeable when turning the spear around. Unfortunately, this cannot be done when only looking at the spears from the side. The feature hollow sometimes occurs without showing a clear line or obvious bulge at its center. Therefore, there is a high risk of wrong classification. +The feature hollow can occur without showing a clear line or obvious bulge at its center. Therefore, there is a high risk of wrong classification. The feature can be easily checked when one has physical access to the asparagus. If the asparagus is actually hollow, it will have a hole at its bottom. Unfortunately, this cannot be discovered when only looking at the spears from the side. \subsubsection{Not classifiable} @@ -268,7 +268,7 @@ \subsubsection{Not classifiable} The feature not classifiable is not a feature of an asparagus per se. It is therefore not implemented as an automatic feature extraction approach. However, it is integrated into the hand-label application and can be selected by the human annotators if applicable. -That is, whenever the spear is unrecognizable, the head part of the spear was severed, two spears were present in one picture (as in \autoref{fig:ExampleThickness}), the spear was cut off by the image, or other unusual circumstances occurred, it falls into the category of being unclassifiable. +That is, whenever the spear is unrecognizable, the head part of the spear is severed, two spears are present in one picture (as in \autoref{fig:ExampleThickness}), the spear is cut off by the image, or other unusual circumstances occur, it falls into the category of being unclassifiable. \subsection{The hand-label app: A GUI for labeling asparagus} @@ -286,7 +286,7 @@ \subsubsection{Motivation} \bigskip The options to reduce the variance and hence the need to attribute labels to a very large number of samples are limited. We employed preprocessing and manual feature engineering to reduce the variance and tested strategies on the algorithmic domain such as unsupervised and semi-supervised learning as they promise to work with relatively few labels (see \autoref{sec:SemiSupervisedLearning} and \autoref{subsec:FeatureEngineering}). Nonetheless, for training and, more importantly, for the performance evaluation of machine learning models, a substantial number of labels is required. Otherwise quality metrics such as accuracies, sensitivity and specificity cannot be calculated. Hence labels had to be manually attributed. -Annotating labels manually requires plenty of effort. \blockquote{Data set annotation and/or labeling is a difficult, confusing and time consuming task} \citep[p.~2]{al2018labeling}. Human performance is often acknowledged as the baseline or ``gold standard’’ that image classifiers are evaluated by. Hence, in many scenarios data is labeled by humans such that machine learning algorithms can be applied. This holds especially for image classification.\footnote{For example the performance of GoogLeNet is compared to human level performance using the ImageNet data set \citep{russakovsky2015imagenet}.} In the present case some features were reliably measurable by means of classical computer vision algorithms (e.g.\ the width or the length).
For features such as a flower or the evaluation whether or not a spear is affected by rust, this has proven to be difficult (see \autoref{subsec:Rust}). Considering the amount of data that could potentially be labeled, a custom interface is required that allows for time efficient attribution of labels. +Annotating labels manually requires plenty of effort. \blockquote{Data set annotation and/or labeling is a difficult, confusing and time consuming task} \citep[p.~2]{al2018labeling}. Human performance is often acknowledged as the baseline or \enquote{gold standard}. Especially for image classification, human-labeled data is used.\footnote{For example the performance of GoogLeNet is compared to human level performance using the ImageNet data set \citep{russakovsky2015imagenet}.} Considering the amount of data that has to be labeled, a custom interface is required that allows for time efficient attribution of labels. We decided to attribute labels for each of the previously described features other than the width and the length to at least 10000 asparagus spears (and hence evaluate 30000 images). This means that several ten thousand judgements had to be made, which highlights the importance of a tool that makes this process as quick as possible. @@ -294,37 +294,33 @@ \subsubsection{The Labeling Application} \label{subsec:LabelApp} -A custom application was built aiming for a quick and intuitive labeling process.\footnote{ See \url{https://github.com/CogSciUOS/asparagus/tree/FinalProject/labeling/hand\_label\_assistant}} Questions appear alongside depictions of an asparagus spear and the user answers them using the designated buttons or keys. Once all questions are answered the next asparagus appears automatically and the procedure is repeated. To assist the users in their judgement some automatically extracted features such as the length or width of the asparagus spear as well as a color histogram is displayed. +A custom application was built aiming for a quick and intuitive labeling process.\footnote{See \url{https://github.com/CogSciUOS/asparagus/tree/FinalProject/labeling/hand\_label\_assistant} (as of 11/27/2020)} Questions appear alongside depictions of an asparagus spear and the user answers them using the designated buttons or keys. Once all questions are answered, the next asparagus appears automatically and the procedure is repeated. To assist the users in their judgement, some automatically extracted features, such as the length or width of the asparagus spear, as well as a color histogram are displayed. \begin{figure}[!htb] - \centering - \includegraphics[scale=0.3]{Figures/chapter03/labelapp_example.png} - \decoRule - \caption[The Labeling Dialog of the Hand-Label App]{\textbf{The Labeling Dialog of the Hand-Label App}~~~The depiction shows the main dialog used for labeling. In the left part you can see all three available images (perspectives) for the asparagus with the ID 40. The current question that targets at one of the features of interest is displayed in the area below the images. They are phrased such that the user can answer them with yes or no using the respective buttons or keyboard controls. On the right side you can see the results of the automatic feature extraction.
The upper right panel shows the histogram of color hues.} - \label{fig:LabelAppGUI} + \centering + \includegraphics[scale=0.3]{Figures/chapter03/labelapp_example.png} + \decoRule + \caption[The Labeling Dialog of the Hand-Label App]{\textbf{The Labeling Dialog of the Hand-Label App}~~~The depiction shows the main dialog used for labeling. In the left part all three available images (perspectives) for the asparagus are shown with the ID 40. The current question that targets at one of the features of interest is displayed in the area below the images. They are phrased such that the user can answer them with yes or no using the respective buttons or keyboard controls. The results of the automatic feature extraction are displayed on the right side. The upper right panel shows the histogram of color hues.} + \label{fig:LabelAppGUI} \end{figure} -The app comprises two user interfaces: A startup window that allows for a preview of asparagus images and the attributed labels (represented by Ui\textunderscore Asparator) and the main labeling interface (Ui\textunderscore LabelDialog) (see \autoref{fig:LabelAppDiagram}). Using the labeling interface is possible only after the user selects the source folder of images and specifies or loads a file that contains the attributed labels. This ensures that the file paths for input and output are set. A dictionary that maps indices to images can be parsed from file names and the minimum index is determined. As such, the label dialog and the respective controller class always reside in a valid initial state. For labeling, the user answers questions that are displayed alongside the images that depict each asparagus spear from three perspectives. This can be done using the respective buttons or the arrow keys (see \autoref{fig:LabelAppGUI}). The result is saved and the next question will automatically appear upon answering. +The app comprises two user interfaces: A startup window that allows for a preview of asparagus images and the attributed labels (represented by Ui\textunderscore Asparator) and the main labeling interface (Ui\textunderscore LabelDialog) (see \autoref{fig:LabelAppDiagram}). Using the labeling interface is possible only after the user selects the source folder of images and specifies or loads a file that contains the attributed labels. This ensures that the file paths for input and output are set. A dictionary that maps indices to images can be parsed from file names and the minimum index is determined. As such, the label dialog and the respective controller class always reside in a valid initial state. For labeling, the user answers questions that are displayed alongside the images. The result is saved and the next question will automatically appear upon answering. -Automatic feature detection can be selected as an alternative for manual labeling for specific features. The result is displayed and saved to a file. This flexible approach was chosen as it is initially unclear and disputed in how far automatic feature extraction yields results that meet up with the individual, subjective perception. It also allowed to improve automatic feature extraction methods and to develop a direct intuition for the relation to the data. On top of that, it has proven to be useful for debugging automatic feature extraction methods that initially failed for some images. +Automatic feature detection can be selected as an alternative for manual labeling for specific features. The result is displayed and saved to a file. 
This flexible approach was chosen as it was initially unclear and disputed in how far automatic feature extraction yields satisfying results. Further, it allowed to improve automatic feature extraction methods and to develop an intuition for the characteristics of the features. On top of that, it has been proven to be useful for debugging. \bigskip -The development of the app was accompanied by three major challenges. First, handling a large data set of several hundred gigabytes that is accessible in a network drive. Second, changing requirements that resulted from group decision processes with respect to automatic feature extraction as well as from unforeseen necessities in (parallel) preprocessing. This required substantial changes of the initial architecture and the reimplementation of parts of the code. The third challenge is related to the question of the handling of internal states of the app. The latter may be further explained in the following. - -Internal states of the app are handled such that it is possible for the user to navigate into invalid states for which no images are available. Note that preprocessing was done such that each asparagus spear has a unique identification number and a specifier for perspectives textit{a}, textit{b} and textit{c} in it’s filename. While generally the identification numbers are in a continuous range from zero to n, some indices are missing. As preprocessing jobs were scheduled to run in parallel and preprocessing failed for few corrupted files, it has proven almost inevitable to end up with some few missing indices although a dictionary of input filename and output filename was passed to each grid job. In addition, the large amount of data did not allow to save all files in a single directory. In summary, this means that one could not simply iterate over asparagus identification numbers (represented by the state of a spin box in the user interface), determine the file path, and display the related images. Instead, parsing file names from a slow network drive is necessary which requires limiting the number of selected images. As GUI elements such as spin boxes and keyboard controls allow for setting an integer, and it was a requirement that this integer relates to the asparagus identification number, one ends up with the following situation: Either one prevents that the asparagus identification number is set or incremented freely to a value that does not exist, or one allows to navigate into an invalid state. The latter solution was considered to be easier and thus implemented.\footnote{The earlier approach showed to have several implementation specific drawbacks. Note for example that upon entering multiple digits in an input field, an event is triggered multiple times. Upon entering the value 10 for the asparagus identification number, one ends up with the value being set to 1 before being set to 10 where 1 relates to a potentially missing identification number. This means the user cannot freely enter IDs because setting them to certain values is impossible.} Hence, all cascades of methods of the app including preprocessing functions that require the respective images as an input were adjusted such that they can handle this case. +The development of the app was accompanied by three major challenges. First, handling a large data set of several hundred gigabytes that is accessible in a network drive. Second, changing requirements that resulted from group decisions. 
This required substantial changes of the initial architecture and the reimplementation of parts of the code. The third challenge is related to the question of the handling of internal states of the app. +Internal states of the app are handled such that it is possible for the user to navigate into states for which no images are available. All cascades of methods of the app including preprocessing functions that require the respective images as an input are adjusted such that they can handle this case. The architecture is beneficial as it allows the user to manually iterate over asparagus IDs without being interrupted because an ID relates to an invalid or missing file. It was chosen over alternatives to handle invalid states as it was considered the easiest to implement and meets requirements given by the PyQt framework. \begin{figure}[!t] - \centering - \includegraphics[scale=0.3]{Figures/chapter03/label_app_diagram.png} - \decoRule - \caption[UML Diagram for the Hand-Label App]{\textbf{UML Diagram for the Hand-Label App}~~~The depiction shows the class diagram for the hand-label app in \acrshort{uml}.} - \label{fig:LabelAppDiagram} + \centering + \includegraphics[scale=0.3]{Figures/chapter03/label_app_diagram.png} + \decoRule + \caption[UML Diagram for the Hand-Label App]{\textbf{UML Diagram for the Hand-Label App}~~~The depiction shows the class diagram for the hand-label app in \acrshort{uml}.} + \label{fig:LabelAppDiagram} \end{figure} -The app is implemented using the PyQt5 framework\footnote{see \url{https://pypi.org/project/PyQt5/}} while coarsely following the model, view controller principle. Model and controller are not strictly separate and thus no distinct model class or database is used. Instead the labels are managed as a Pandas DataFrame and serialized as a csv-file. Upon state change (i.e.\ index increment), images are loaded from the network drive that is mounted on the level of the operating system. The views are designed using QtDesigner. Four manually coded classes are essential for the architecture of the app: (1) The class HandLabelAssistant in which the PyQt5 app is instantiated, (2) the controllers MainApp and (3) LabelingDialog as well as (4) Features which is member class of the latter. Features is of type QThread and represents the API to the automatic feature extraction methods in FeatureExtraction. Ui\textunderscore Asparator, UiLabelDialog and classes for file dialogs represent the views. A class ImageDisplay is required to display images with the correct aspect ratio. \autoref{fig:LabelAppDiagram} shows the \acrshort{uml} class diagram alongside methods and attributes that are considered relevant to understand the architecture of the app. - -\bigskip -Developing a custom app for the labeling process required substantial time resources. However, it was found that existing solutions did not meet the specific requirements. Our custom hand-label app allowed us to attribute labels to more than 10000 asparagus spears in a manageable amount of time. Details of the manual labeling process are described in the next section. +The app is implemented using the PyQt5 framework\footnote{see \url{https://pypi.org/project/PyQt5/} (visited on 04/24/2020)} while coarsely following the model, view controller principle. Model and controller are not strictly separated and thus no distinct model class or database is used. Instead the labels are managed as a Pandas DataFrame and serialized as a csv-file. 
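A minimal sketch of this pattern (class, column, and file names are illustrative and do not correspond to the app's actual identifiers) could look as follows:
\begin{verbatim}
import pandas as pd

class LabelStore:
    """Hold all attributed labels in a DataFrame and mirror them to a csv-file."""

    def __init__(self, csv_path):
        self.csv_path = csv_path
        try:
            self.df = pd.read_csv(csv_path, index_col="id")
        except FileNotFoundError:
            self.df = pd.DataFrame(columns=["hollow", "flower", "rusty_head"])
            self.df.index.name = "id"

    def set_label(self, asparagus_id, feature, value):
        self.df.loc[asparagus_id, feature] = value   # update in memory ...
        self.df.to_csv(self.csv_path)                # ... and serialize immediately
\end{verbatim}
Writing the complete file after every answer is the simplest way to ensure that no judgement is lost if the application is closed unexpectedly.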
Upon state change (i.e.\ index increment), images are loaded from the network drive that is mounted on the level of the operating system. The views are designed using QtDesigner. Controller classes, utility classes for feature extraction (Features) and a custom ImageDisplay were manually implemented. \autoref{fig:LabelAppDiagram} shows the \acrshort{uml} class diagram. \subsection{Manual labeling} @@ -332,9 +328,6 @@ \subsection{Manual labeling} In this section, the process and the results of manually labeling the data with the help of the hand-label app is laid out. The labeling criteria which allocate each spear to a single quality class were explained in the subsections of the section \nameref{sec:AutomaticFeatureExtraction}. The outcome of the labeling process and one approach to measure the agreement of the manual labeling will be described in the following subsections. -\bigskip -The images were labeled for their features by all members of the group with the hand-label app (described in \autoref{sec:LabelApp}). As none of the team members were experts in asparagus labeling, a general guideline for the feature labeling had to be established (see \autoref{sec:AutomaticFeatureExtraction}). The guideline was written in accordance with the owner of the asparagus farm Gut Holsterfeld, Mr. Silvan Schulze-Weddige. He was consulted in all questions regarding the labeling of the asparagus. - \begin{figure}[!htb] \centering \vspace{20pt} @@ -350,15 +343,16 @@ \subsection{Manual labeling} \includegraphics[width=0.80\linewidth]{Figures/chapter03/diff-img-violet.png} \caption{violet?} \end{subfigure} - \caption[Examples of Critical Images]{\textbf{Examples of Critical Images}~~~Three examples of asparagus are shown where the corresponding feature label is difficult to determine -- thus, a critical decision has to be made. Image (A) displays an asparagus that shows a slight S-curve, however, not very strongly. It might be labeled as being bent or as not being bent by the annotator. Images (B) and (C) show asparagus with brown spots which can be judged as rust or as pressure marks (no rust), again depending on the annotator. Additionally, it is not obvious whether the asparagus in (B) exhibits a flower or whether the spear in (C) is of violet color at the head region.} + \caption[Examples of Critical Images]{\textbf{Examples of Critical Images}~~~Three examples of asparagus are shown where the corresponding feature label is difficult to determine -- thus, a critical decision has to be made. Image (A) displays an asparagus that shows a slight S-curve, however, not very strong. It might be labeled as being bent or as not being bent by the annotator. Images (B) and (C) show asparagus with brown spots which can be judged either as rust or as pressure marks (no rust), again depending on the annotator. Additionally, it is not obvious whether the asparagus in (B) exhibits a flower or whether the spear in (C) is of violet color at the head region.} \label{fig:CriticalExampleImages} \end{figure} -The features labeled by the human annotators included the features hollow, flower, rusty head, rusty body, bent, and violet. Additionally, the feature not classifiable could be selected and attributed to spears where a feature detection was not possible (see \autoref{subsec:NotClassifiable}). 
The features fractured, very thick, thick, medium thick, thin, and very thin were extracted via the automatic feature detection functions described in the respective sections for \nameref{subsec:Length} and \nameref{subsec:Width} which were integrated in the hand-label app. Other features that were integrated into the hand-label app, and which helped as an orientation to the human annotators, were the feature extraction functions for feature bent and feature violet. +\bigskip +The images were labeled for their features by all members of the group with the hand-label app (described in \autoref{sec:LabelApp}). As none of the team members were experts in asparagus labeling, a general guideline for the feature labeling had to be established (see \autoref{sec:AutomaticFeatureExtraction}). The guideline was written in accordance with the owner of the asparagus farm Gut Holsterfeld, Mr. Silvan Schulze-Weddige. He was consulted in all questions regarding the labeling of the asparagus. -General challenges in the manual labeling in front of a computer screen, including the respective image quality and the variance in the agreement of the project members, were expected from the start. As the task relies on the subjective view of individual humans, opinions about the presence or absence of features can diverge. By consulting Mr. Schulze-Weddige on difficult decision-making cases, it became clear that some examples are difficult to classify even for experts (see \autoref{fig:CriticalExampleImages}). +General challenges of the manual labeling in front of a computer screen, including the limited image quality and the variance in the agreement of the project members, were expected from the start. As the task relies on the subjective view of individual humans, opinions about the presence or absence of features can diverge. By consulting Mr. Schulze-Weddige on difficult cases, it became clear that even for experts some examples are difficult to classify from image data (see \autoref{fig:CriticalExampleImages}). -To tackle the issue and to have an overview of the general agreement of the labeling between group members, a measure was applied, namely the Kappa Agreement. The Kappa Agreement was used to assess the degree of accordance in labeling between the single members and monitor how the labeling agreement developed during the manual labeling process. +To tackle this issue and to obtain an overview of the general agreement in labeling between group members, a measure was applied, namely the Kappa Agreement. The Kappa Agreement was used to assess the degree of accordance in labeling between the individual members and to monitor how the labeling agreement developed during the manual labeling process. \subsubsection{Labeling outcome} @@ -366,26 +360,15 @@ \subsubsection{Labeling outcome} In this section, the process and the results of the labeling with the hand-label app are described. -The labels of all labeled images are stored in a csv-file, as shown in~\autoref{fig:CSVfileOverview}. The first entry is the image identification number. Every feature can be of value 0, value 1 or empty. Whenever a feature is present in an image, the value is set to 1. If the feature is absent, it is set to 0. For images labeled as not classifiable, all feature values remain empty. The image path to every of the three images for one spear is also saved in the label file in a separate column. After the labeling process, the individual csv-files with the labels are merged into one large \texttt{combined\_new.csv} file.
The content of this file is later used for the classification of the data with the different approaches (see \autoref{ch:Classification}). It can be found in the study project’s GitHub repository.\footnote{~See \url{https://github.com/CogSciUOS/asparagus/tree/FinalProject/preprocessing/get_data}} - \begin{figure}[!ht] \centering \includegraphics[scale=0.5]{Figures/chapter03/csv_overview.png} \decoRule - \caption[Manual Labeling Output CSV-File]{\textbf{Label File}~~~The feature labels extracted by the manual labeling process were saved in a csv file. This image shows the beginning of the file \texttt{combined\_new.csv} in which all label files were later combined as one.} + \caption[Manual Labeling Output CSV-File]{\textbf{Label File}~~~The feature labels extracted by the manual labeling process are saved in a csv file. This image shows the beginning of the file \texttt{combined\textunderscore new.csv} in which all label files were later combined as one.} \label{fig:CSVfileOverview} \end{figure} -\bigskip -The manual labeling lasted over the period of November 2019 to January 2020. A session usually consisted of 500 images, with an asparagus spear being viewed in three positions. A session of labeling 500 images took between two and four hours. One minor factor sometimes influencing the time spent for labeling were difficulties with the external \acrshort{ssh} connection to the \acrshort{ikw} storage, as all images are stored on the university servers. The average time spent for labeling one asparagus is around 27 -- 48 seconds. - -All in all, 13319 triples of images were labeled for their features. There is a large variance in the presence of the features in the data as can be seen in~\autoref{tab:FeatureRepresentation}. Of the acquired 13319 images, the feature most present in the data is rusty body with 45.5\%, followed by the feature bent with 40\%. Features whose representation is below 10\% include the features violet (7.9\%), hollow (3.3\%), fractured (3.5\%), and very thick (4\%). The feature not classifiable shows least presence with 2.1\%. - -It emerges that many features are only sparsely present in the data. This poses an imbalance in the data that is relevant for later classification tasks and the usage of the data set. - -Further, every member of the group participated in the labeling but not everybody labeled the same number of images. Due to this circumstance, the labeling bias of certain members is more present in the data than of others. - -The manual labeling was stopped when the amount of classified data exceeded 13000 samples. Reason for this is that we had reached our time limit for labeling samples and needed to begin with training our classification approaches.\footnote{To have an intuition of how much labeled data we might need, we found some of the suggestions in a blogpost helpful which can be visited at \url{https://machinelearningmastery.com/much-training-data-required-machine-learning/}. However, we did not find an exact number when to stop labeling.} +The labels of all labeled images are stored in a csv-file, as shown in~\autoref{fig:CSVfileOverview}. The first entry is the image identification number. Every feature can be of value 0, value 1 or empty. Whenever a feature is present in an image, the value is set to 1. If the feature is absent, it is set to 0. For images labeled as not classifiable, all feature values remain empty. The image path to every of the three images for one spear is also saved in the label file in a separate column. 
After the labeling process, the individual csv-files with the labels are merged into one large \texttt{combined\textunderscore new.csv} file. The content of this file is later used for the classification of the data with the different approaches (see \autoref{ch:Classification}). It can be found in the study project’s GitHub repository.\footnote{~See \url{https://github.com/CogSciUOS/asparagus/tree/FinalProject/preprocessing/get_data} (as of 11/27/2020)} \begin{table}[!hb] \centering @@ -409,6 +392,17 @@ \subsubsection{Labeling outcome} \label{tab:FeatureRepresentation} \end{table} +\bigskip +The manual labeling took place over the period of November 2019 to January 2020. A session usually consisted of 500 images. A session of labeling 500 images took between two and four hours. The average time spent for labeling one asparagus is around 27 -- 48 seconds. + +All in all, 13319 triples of images were labeled for their features. There is a large variance in the presence of the features in the data as shown in~\autoref{tab:FeatureRepresentation}. Of the acquired 13319 images, the feature most present in the data is rusty body with 45.5\%, followed by the feature bent with 40\%. Features whose representation is below 10\% include the features violet (7.9\%), hollow (3.3\%), fractured (3.5\%), and very thick (4\%). The feature not classifiable shows least presence with 2.1\%. + +It emerges that many features are only sparsely present in the data. This poses an imbalance in the data that is relevant for later classification tasks and the usage of the data set. + +Further, every member of the group participated in the labeling but not everybody labeled the same amount of images. Due to this circumstance, the labeling bias of certain members is more present in the data than of others. + +The manual labeling was stopped when the amount of classified data exceeded 13000 samples. We had reached our time limit for labeling samples and needed to begin with training our classification approaches.\footnote{To get an intuition of how much labeled data we might need, we found some of the suggestions in a blogpost helpful which can be visited at \url{https://machinelearningmastery.com/much-training-data-required-machine-learning/} (visited on 04/29/2020). However, we did not find an exact number when to stop labeling.} + \subsubsection{Agreement Measures} \label{subsec:AgreementMeasures} @@ -416,9 +410,9 @@ \subsubsection{Agreement Measures} When different annotators label data, it is indispensable to verify the degree of agreement among raters. In order to judge how consistently the data is labeled, several statistical methods (inter-rater-reliability) can be applied. \bigskip -For the current purpose, different agreement measures, all implemented by scikit-learn, are used. The first is Cohen’s Kappa. It is seen as a more robust measure than a simple agreement percentage, such as a measure of true positives and true negatives, which was traditionally used for those cases~\citep{cohen1960coefficient}. Cohen’s Kappa is more robust, as the rate of agreement occurring by chance is included in the calculation. This method is applicable to compare the rating of two raters on a classification problem. The degree of agreement is always within $-1.0$ and 1.0 inclusive. The higher the Kappa value, the higher the agreement. Values around zero indicate no agreement and negative values indicate negative agreement which can be interpreted as systematic disagreement. 
Values between 0.41 -- 0.6 are seen as moderate agreement, 0.61 -- 0.8 as substantial agreement, and everything above as almost perfect agreement. All scores below 0.4 are interpreted as unacceptable \citep{mchugh2012interrater}. +For the current purpose, different agreement measures, all implemented by scikit-learn, are used. The first is Cohen’s Kappa. It is seen as a more robust measure than a simple agreement percentage, such as a measure of true positives and true negatives, which was traditionally used for those cases~\citep{cohen1960coefficient}. Cohen’s Kappa is more robust, as the rate of agreement occurring by chance is included in the calculation. This method is applicable to compare the rating of two raters on a classification problem. The degree of agreement is always within $-1.0$ and 1.0 inclusive. The greater the Kappa value, the higher the agreement. Values around zero indicate no agreement and negative values indicate negative agreement which can be interpreted as systematic disagreement. Values between 0.41 -- 0.6 are seen as moderate agreement, 0.61 -- 0.8 as substantial agreement, and everything above as almost perfect agreement. All scores below 0.4 are interpreted as unacceptable \citep{mchugh2012interrater}. -Another statistical method used to measure the agreement is the F1 score. The F1 score is used for the evaluation of binary classification. It relies on both precision as well as recall of a test. An F1 score value lies between 0.0 and 1.0 -- the higher the F1 score, the higher the agreement. +Another statistical method used to measure our agreement is the F1 score. The F1 score is used for the evaluation of binary classification. It relies on both precision as well as recall of a test. An F1 score value lies between 0.0 and 1.0 -- the greater the F1 score, the higher the agreement. Lastly, we calculated the accuracy measure. For a normalized accuracy score, the values lie between 0.0 and 1.0, and the best performance is 1.0. This measure returns the fraction of correctly classified samples. It is a less robust measure than Cohen’s Kappa score~\citep{mchugh2012interrater}. @@ -426,29 +420,29 @@ \subsubsection{Agreement Measures} \subsubsection{Reliability} \label{subsec:Reliability} -In order to evaluate the degree of agreement of our data, we made agreement measures at two points in time.\footnote{for our API documentation see \\ \url{https://asparagus.readthedocs.io/en/latest/api/measure\textunderscore agreement.html}} +In order to evaluate the degree of agreement of our data, we measured the agreement at two different points in time.\footnote{for our API documentation see \url{https://asparagus.readthedocs.io/en/latest/api/measure\textunderscore agreement.html} (as of 11/27/2020)} -The first time, six different annotators assigned feature labels to images out of each class of the pre-sorted asparagus that we obtained by running the spears through the machine twice. We ensured that always two different annotators labeled the same set of images. The Kappa scores varied strongly between groups and features from $-0.03$ to 0.76, while the accuracy scores ranged from 0.49 to 1. We were surprised that the agreement scores are low, even though the raters gave the same label to many of the asparagus spears. This is an acknowledged problem~\citep{powers2012problem,sim2005kappa,feinstein1990high,posterFlight}. 
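To make the pairwise comparison concrete, a minimal sketch of how such agreement scores can be computed with scikit-learn is given below. The file names and the column layout of the per-annotator label files are assumptions for illustration only and do not correspond to the project's actual scripts (see the API documentation referenced in this section for those).

\begin{verbatim}
# Sketch: pairwise inter-rater agreement for binary feature labels.
# The CSV file names and column names are illustrative assumptions.
import pandas as pd
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

FEATURES = ["hollow", "flower", "rusty_head", "rusty_body", "bent", "violet"]

rater_a = pd.read_csv("labels_rater_a.csv")   # one row per spear
rater_b = pd.read_csv("labels_rater_b.csv")   # same spears, same order

for feature in FEATURES:
    a = rater_a[feature].astype(int)
    b = rater_b[feature].astype(int)
    print(feature,
          "kappa=%.2f" % cohen_kappa_score(a, b),
          "F1=%.2f" % f1_score(a, b),
          "accuracy=%.2f" % accuracy_score(a, b))
\end{verbatim}

Aggregated per-feature scores, as shown in the box plots, can then be obtained by repeating this comparison for every annotator pair and averaging the results.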
+First, six annotators assigned feature labels to images out of each class of the pre-sorted asparagus that we obtained by running the spears through the machine twice. We ensured that two different annotators labeled the same set of images. The Kappa scores varied strongly between annotator pairs and features from $-0.03$ to 0.76, while the accuracy scores ranged from 0.49 to 1. We were surprised that the agreement scores are low, even though the raters gave the same label to many of the asparagus spears. This is an acknowledged problem~\citep{powers2012problem,sim2005kappa,feinstein1990high,posterFlight}. \begin{figure}[!ht] - \centering - \includegraphics[scale=0.55]{Figures/chapter03/kappa_measurewise.png} - \decoRule - \caption[Agreement Measure-Wise Comparison of all Features]{\textbf{Comparing Measures}~~~The figure shows the agreement measures accuracy, F1 and Cohen’s Kappa, separately for each manually labelled feature. Shown are the box-plots, so the middle line indicates the median, the box indicates the IQR. All scores are aggregated scores over all annotator pairs.} - \label{fig:KappaMeasurewise} + \centering + \includegraphics[scale=0.55]{Figures/chapter03/kappa_measurewise.png} + \decoRule + \caption[Agreement Measure-Wise Comparison of all Features]{\textbf{Comparing Measures}~~~The figure shows the agreement measures accuracy, F1 and Cohen’s Kappa, separately for each manually labelled feature. Shown are the box-plots, so the middle line indicates the median, the box indicates the IQR. All scores are aggregated scores over all annotator pairs.} + \label{fig:KappaMeasurewise} \end{figure} \begin{figure}[!ht] - \centering - \includegraphics[scale=0.55]{Figures/chapter03/kappa_featurewise.png} - \decoRule - \caption[Feature-Wise Comparison of Agreement Measure Scores]{\textbf{Comparing Features}~~~The figure shows each feature separately. For each feature, the corresponding accuracy, F1 and Cohen’s Kappa score is given. Shown are the box-plots, so the middle line indicates the median, the box indicates the IQR. All scores are aggregated scores over all annotator pairs.} - \label{fig:KappaFeaturewise} + \centering + \includegraphics[scale=0.55]{Figures/chapter03/kappa_featurewise.png} + \decoRule + \caption[Feature-Wise Comparison of Agreement Measure Scores]{\textbf{Comparing Features}~~~The figure shows each feature separately. For each feature, the corresponding accuracy, F1 and Cohen’s Kappa score is given. Shown are the box-plots, so the middle line indicates the median, the box indicates the IQR. All scores are aggregated scores over all annotator pairs.} + \label{fig:KappaFeaturewise} \end{figure} -One reason for our results could be that we compared the agreement class-wise. The occurrence of 1s and 0s per class is therefore very unbalanced. For Kappa scores, if the distribution of 0s and 1s is not balanced, disagreement of the underrepresented value is punished more heavily.\footnote{see also \\ \url{https://stats.stackexchange.com/questions/47973/strange-values-of-cohens-kappa}} Therefore, we decided to repeat our agreement measure feature-wise on non-labeled images, so that the annotators cannot anticipate a specific group label. In order to better understand the reliability of our data, we additionally decided to look at the accuracy score and the F1 score. Beforehand, the team labeled another 50 images all together, clarified classification boundaries again and discussed unclear images (see \autoref{fig:CriticalExampleImages}). 
+A reason for the variance in our results could be that we compared the agreement class-wise. Thus, the occurrence of 1s and 0s per class is very unbalanced (for example, the class I~A~Anna has very few 1s because the binary features rusty head, rusty body, hollow, bent and violet are not present in this class label). For Kappa scores, if the distribution of 0s and 1s is not balanced, disagreement of the underrepresented value is punished more heavily.\footnote{see also \url{https://stats.stackexchange.com/questions/47973/strange-values-of-cohens-kappa} (visited on 04/24/2020)} Therefore, we decided to repeat the agreement measure feature-wise on non-labeled images, so that the annotators cannot anticipate a specific class label. In order to better understand the reliability of our data, we additionally decided to look at the accuracy score and the F1 score. Beforehand, the team labeled another 50 images all together, clarified classification boundaries again and discussed unclear images (see \autoref{fig:CriticalExampleImages}).

-The second time, 50 images were labeled by four annotators. The agreements were measured annotator-pair-wise, and then averaged. The results in the Cohen’s Kappa score vary between and within features, and between annotator pairs as well. The highest aggregated kappa score over all annotator pairs is reached for the features flower (0.79) and hollow (0.79), then violet (0.76), rusty head (0.72), bent (0.55) and lastly for rusty body (0.47). +The second time, 50 images were labeled by four annotators. The agreements were measured annotator-pair-wise, and then averaged. The resulting Cohen’s Kappa scores vary between features and between annotator pairs. The highest aggregated Kappa score over all annotator pairs is reached for the features flower (0.79) and hollow (0.79), then violet (0.76), rusty head (0.72), bent (0.55) and lastly for rusty body (0.47).

For the features flower, rusty head and violet, the interquartile range (IQR) is quite small, whereas the IQR for hollow, bent and rusty body is much larger (see \autoref{fig:KappaMeasurewise} and \autoref{fig:KappaFeaturewise}). The agreement scores accuracy and F1 yield very similar results. Results are slightly better than the Kappa scores, in total and for each feature. The highest median accuracy score is reached for the feature hollow (0.96), then flower (0.93), then violet (0.92), then rusty head (0.86), then bent (0.75) and then rusty body (0.74). The order is the same for the F1 scores. The median F1 scores lie between 0.71 and 0.97.

@@ -460,19 +454,19 @@ \subsubsection{Reliability}

\subsection{The asparagus data set}
\label{sec:AsparagusDataSet}

-In the following chapter, the asparagus data sets will be discussed. The data that we collected is restructured and preprocessed into several different versions which are used for the classification approaches. Further, the possibility to save a Tensor or a \texttt{TFRecord} file and the \mbox{\texttt{tf.data}} API and the \mbox{\texttt{tf.data.dataset}} API are introduced. These methods are compared and followed by a recommendation for further work with the data set. +In the following chapter, the asparagus data sets will be discussed. The data that we collected is restructured and preprocessed into several different versions which are used for the classification approaches. Further, the possibility to save the data set either as a tensor with the NumPy API or as serialized \texttt{TFRecord} files with the \mbox{\texttt{tf.data.dataset}} API is introduced.
These methods are compared and followed by a recommendation for further work with the data set.

-One peculiarity about our data is its variance. The differences in the features which decide the class label of an asparagus are very small. All in all, the asparagus spears look very similar. This differentiates our classification task from other examples in the literature which usually have a high inter-class difference. We decided to reduce the variance within the images even further by removing all differences that are not relevant for the classification task (see Chapter ~\ref{sec:Preprocessing}). By doing so we facilitate the learning process and help the model focus on the features that matter. +One peculiarity about our data is its variance. The differences in the features which decide the class label of an asparagus are very small. Overall, asparagus spears look very similar. This differentiates our classification task from other examples in the literature which usually have a high inter-class difference. To facilitate the learning process and help the model focus on the most important information, we removed irrelevant differences (e.g.\ in the background) from the images (see Chapter~\ref{sec:Preprocessing}).

\subsubsection{Description of our data set(s)}
\label{subsec:DifferentDataSets}

-With the help of the preprocessing as described in ~\ref{sec:Preprocessing}~\nameref{sec:Preprocessing}, the raw images are transformed and distributed into folders which serve as input for the networks.
-The different data sets can be distinguished in three ways: the images have the original background or the background is removed, the asparagus spears are moved to the center of the image patch and rotated upwards or not and the images are down sampled to reduce memory consumption or in the original resolution.
+With the help of the preprocessing as described in ~\ref{sec:Preprocessing}~\nameref{sec:Preprocessing}, the raw images are transformed and distributed into different folders which serve as input for the approaches.
+The images in the different folders can be distinguished in three ways: the images either have the original background or the background is removed, the asparagus spears are either centered and rotated upwards or left in place, and the images are either down-sampled to reduce memory consumption or kept in the original resolution.

-Further, a data set is constructed with all the labeled images in a single numpy array, which can be stored and loaded at once. As the three perspectives of each asparagus spear are concatenated horizontally, they appear to lie next to each other in the resulting image. These concatenated images are then combined to the final file. The first out of four dimensions in this file depicts the number of labeled asparagus spears. The second and third dimension represent the height and the width of the images, respectively. Further, the fourth dimension represents the three RGB values.
-In addition, the images are downscaled to facilitate the training process and reduce memory consumption. Each image is downsampled by a factor of six, that means every 6th pixel is used in the reduced image. This factor can be easily changed and a new data set can be created.
+One data set is constructed with all the labeled images in a single numpy array, which can be stored and loaded at once. As the three perspectives of each asparagus spear are concatenated horizontally, they appear to lie next to each other in the resulting image. These concatenated images are then combined into the final file. The first out of four dimensions in this file depicts the number of labeled asparagus spears. The second and third dimension represent the height and the width of the images, respectively. Further, the fourth dimension represents the three RGB values.
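A minimal sketch of how such an array can be assembled with NumPy is given below. The loading helper, the file naming scheme and the spear identifiers are hypothetical placeholders; only the horizontal concatenation of the three perspectives, the four-dimensional layout and the factor-six downsampling mentioned in the next paragraph follow the description given here.

\begin{verbatim}
# Sketch: stack the three perspectives of each labeled spear into one array.
# load_image(), the file naming scheme and the spear ids are placeholders;
# all preprocessed images are assumed to share the same size.
import numpy as np
from PIL import Image

def load_image(path):
    return np.asarray(Image.open(path).convert("RGB"))  # (height, width, 3)

spear_ids = [42, 43, 44]                                 # hypothetical ids
samples = []
for sid in spear_ids:
    views = [load_image(f"{sid}_{k}.png") for k in range(3)]
    samples.append(np.concatenate(views, axis=1))        # views side by side

data = np.stack(samples)            # shape: (spears, height, 3 * width, 3)
data_small = data[:, ::6, ::6, :]   # keep every sixth pixel (factor six)
np.save("labeled_asparagus.npy", data_small)
\end{verbatim}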
+In addition, the images are downscaled to facilitate the training process and reduce memory consumption. Each image is downsampled by a factor of six, meaning that every sixth pixel is used in the reduced image. This factor can be easily changed to create a new data set.

An additional data set is generated which contains images of the head region of the asparagus spears exclusively. This data set was used to train a dedicated network for head related features (see~\autoref{subsec:HeadNetwork}).

@@ -480,17 +474,11 @@ \subsubsection{Description of our data set(s)}

\subsubsection{Data set creation with Tensorflow}
\label{subsec:DataSetTheory}

+As pointed out before, more than 800 gigabytes (GB) of preprocessed image data were generated. Handling this large amount of data for training and evaluation in the different approaches was challenging. The main difficulties were memory management, acceptable training times and data organisation. In particular, the size and format of the images have a significant impact on our import pipeline and, therefore, on the total training time.

-In the following, TensorFlow's own binary storage format \texttt{TFRecord} is introduced. This approach facilitates the mix and match of data sets and network architectures. The large amount of data that was collected has a significant impact on our import pipeline and, therefore, on the total training time. The file format is optimized for images and text data. These are stored in tuples which always consist of file and label. In our case, the difference in reading time is significant, because the data is stored in the network and not on a SSD on the local PC. The serialized file format allows the data to be streamed efficiently through the network efficiently. Another advantage is that the file is transportable over several systems, regardless of the model one wants to train. - -Two data sets are created in this file format. One with all files that were preprocessed, and another one with the preprocessed and labeled data. -The first binary file includes images with background in png format and has a size of 225 Gigabyte (GB). The \texttt{TFRecord} file does not only take up less memory capacity, but can also be read more efficiently. The second \texttt{TFRecord} file includes all labeled images and their labels. - -Working with these files simplifies the next steps of transformations. With the \mbox{\texttt{tf.data}} API complex, input pipelines from simple and reusable components are created, even for large data sets. The preferred pipeline for our asparagus project can apply complex transformations to the images and combine them into stacks for training, testing and validating in arbitrary ratios. A data set can be changed, e.g.\ by using different labels or by transformations like mapping, repeating, batching, and many others. These dozens of transformations can be combined. - -Besides the described functional transformations of the input pipeline under \mbox{\texttt{tf.data.dataset}}, an iterator gives sequential access to the elements in the data set. The iterator stays at the current position and allows to call the next element as a tuple of tensors. Initializable iterators go through the data set in parallel. In addition, different parameters are passed to start the call. This is especially handy when searching for the right parameters in parallel.
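The following sketch illustrates, under assumed file names, a PNG input and a simplified three-element label vector, how image and label pairs can be serialized into a \texttt{TFRecord} file and streamed back through a \mbox{\texttt{tf.data}} input pipeline. It is meant as an illustration of the workflow described in this subsection, not as a copy of the project's actual pipeline.

\begin{verbatim}
# Sketch: write (image, label) pairs to a TFRecord file and read them back.
# File names, the PNG inputs and the three-element label vector are assumptions.
import tensorflow as tf

def serialize_example(png_bytes, labels):
    feature = {
        "image": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[png_bytes])),
        "labels": tf.train.Feature(
            float_list=tf.train.FloatList(value=labels)),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()

with tf.io.TFRecordWriter("asparagus_sample.tfrecord") as writer:
    for path, labels in [("42_0.png", [0.0, 1.0, 0.0])]:   # placeholder sample
        writer.write(serialize_example(tf.io.read_file(path).numpy(), labels))

def parse(example):
    spec = {"image": tf.io.FixedLenFeature([], tf.string),
            "labels": tf.io.FixedLenFeature([3], tf.float32)}
    parsed = tf.io.parse_single_example(example, spec)
    image = tf.io.decode_png(parsed["image"], channels=3)
    return tf.image.convert_image_dtype(image, tf.float32), parsed["labels"]

dataset = (tf.data.TFRecordDataset("asparagus_sample.tfrecord")
           .map(parse)
           .shuffle(1000)
           .batch(32))
\end{verbatim}

Iterating over \texttt{dataset} then yields batches of decoded image tensors together with their label vectors, which corresponds to the sequential access described above.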
+With respect to the importance of preventing any loss of information, the \texttt{TFRecord} storage format was chosen to store the data set. Two data sets are created in this file format. One contains all files that were preprocessed and has a size of 225 GB; the other contains only the preprocessed and labeled data and is correspondingly smaller.

-In summary there are two advantages using \texttt{tf.data}. On the one hand, it is possible to build a data set with different data sources. On the other hand, there are many functional transformations and an iterator with sequential access to the data. +A detailed description of the benefits of the used data format can be found in the appendix at \autoref{subsec:BenefitsDataSet}. In summary, there are two main advantages to using \texttt{tf.data}. On the one hand, it is possible to reduce the disk space and to read the image data faster due to an iterator with parallel and sequential access to the data. On the other hand, many functional transformations to adjust the images with custom preprocessing steps according to the approach can be used in the pipeline, as we did manually.

\bigskip
-Further we tried to add our data set to the \texttt{\acrshort{tfds}} \footnote{See~\url{https://www.tensorflow.org/datasets/beam\_datasets}}. \texttt{\acrshort{tfds}} enables all users to access the data set directly using the TensorFlow API. For this, we need to publish the data. However, the process of publishing the data set turned out to be too time consuming. Especially the large amount of data was a problem. Questions like: How do we deal with the fact that only a part of the images has labels? How should we pass the labels: each as a feature, in a list, or as several features? It would have been better, faster, and more helpful for the development of our network approaches, if we had continued to search and to integrate the \texttt{TFRecord} files with the \mbox{\texttt{tf.data}} API in our pipeline early. +We tried to add our data set to the \texttt{\acrshort{tfds}} \footnote{See~\url{https://www.tensorflow.org/datasets/beam\_datasets} (visited on 04/26/2020)}. \texttt{\acrshort{tfds}} enables all TensorFlow users to access the data set directly using the TensorFlow API. However, the process of publishing the data set turned out to be too time consuming because of the large number of images. In order to use the benefits of the \mbox{\texttt{tf.data}} API, we should have integrated it into our pipeline at an earlier stage.
\ No newline at end of file
diff --git a/documentation/report/Chapters/Discussion.tex b/documentation/report/Chapters/Discussion.tex
index 5cfa0ae..35b3e98 100644
--- a/documentation/report/Chapters/Discussion.tex
+++ b/documentation/report/Chapters/Discussion.tex
@@ -4,79 +4,74 @@ \section{Discussion}
\label{ch:Discussion}

-In our study project we pursued three main objectives. The main goal was to explore and implement different algorithms for asparagus classification. The second objective is closely linked to this and relates to best practices in relation to applied data science and big data. This included storage of data on remote servers and computationally expensive procedures that are required for training in the computational grid of Osnabr{\"u}ck University: The methodological aspect of our study project.
As our work also served as a sample project to learn more about possibilities to effectively organize collaborative work, we also targeted a third objective, that is, the organizational aspect that is closely linked to project management. In the following, the core results with respect to these three objectives are shortly named and discussed. +In our study project we pursued three main objectives. The first one is to explore and implement different algorithms for asparagus classification. The second one is closely linked to this and relates to best practices in relation to applied data science and big data. This included storage of data on remote servers and computationally expensive procedures that are required for training in the computational grid of Osnabr{\"u}ck University: The methodological aspect of our study project. As our work also served as a sample project to learn more about possibilities to effectively organize collaborative work, we also targeted a third objective, that is, the organizational aspect such as project management. In the following, the core results with respect to these three objectives are shortly named and discussed. \subsection{Classification results} \label{sec:DiscussionResults} -Asparagus spears have several features that we aimed to extract. Some features such as the length and width of asparagus spears were undoubtedly measurable using a pure computer vision approach that does not rely on machine learning. For others, direct filtering was not easily possible because no clear cut definition of features, such as bent or violet exists, that is precise enough to be implemented directly. Although relevant information can easily be extracted, the rules to infer the desired binary features are inaccessible: On one side, decision boundaries for binary classifiers have to be found. On the other side, the perception of the features color or bent has been shown to be very subjective. Moreover, filtering features such as flower has been proven difficult. Named attribute relates to details in a few pixels, comes in different forms, and highly depends on the perspective. These are the reasons machine learning must be employed to successfully classify asparagus. +Asparagus spears have several features that we aimed to extract. Some features such as the length and width of asparagus spears were undoubtedly measurable using a pure computer vision approach that does not rely on machine learning. For others, direct filtering was not easily possible because no clear cut definition of features, such as bent or violet exists, that is precise enough to be implemented directly. Although relevant information can easily be extracted, the rules to infer the desired binary features are inaccessible: On one side, decision boundaries for binary classifiers have to be found. On the other side, the perception of the features color or bent has been shown to be very subjective. Moreover, filtering features such as flower has been proven difficult. Named attribute relates to details in a few pixels, comes in different forms, and highly depends on the perspective. These are the reasons machine learning promises better results to successfully classify asparagus from image data. We designed several neural networks and applied them in different ways to analyze and classify a large-scale asparagus image data set. Some approaches worked better than others, as can be concluded from the \nameref{ch:Summary}. 
\bigskip -Training \acrshort{mlp} classifiers on histograms of palette images has proven a promising approach to predict color features. Named histograms contain information about the fraction of foreground pixels that correspond to violet or rusty colors. As \acrshortpl{mlp} have few parameters, the design is rather trivial and the training process quick. The fact that predictions are far from perfect might be due to inconsistencies in the training data. One may assume, however, that the models generalize well and represent rules that relate to average opinions or definitions of highly subjective features such as color or bent. +Training \acrshort{mlp} classifiers on histograms of palette images is a promising approach to predict color features. As \acrshortpl{mlp} have fewer hyperparameters, the design is rather trivial. Moreover networks for sparse data can be very small. The fact that predictions are far from perfect might be due to inconsistencies in the training data. One may assume, however, that the models generalize well and represent rules that relate to average opinions or definitions of highly subjective features such as color or bent. \bigskip -The single-label \acrshort{cnn} for the classification of 13 features promises a flexible and easy-to-use solution. Not much preprocessing is needed\footnote{The input images are only slightly reduced in pixel size to increase training speed, however, the background is not removed.} and it is able to learn every feature. The network architecture is not aimed at a small, specific subset of only one or two features. Rather it is a basic network that could now be fine-tuned to the individual needs of each feature. +The single-label \acrshort{cnn} for the classification of 13 features promises a flexible and easy-to-use solution. Not much preprocessing is needed\footnote{The input images are only slightly reduced in pixel size to increase training speed, however, the background is not removed.} and it is able to learn every feature. The network architecture is not aimed at a small, specific subset of only one or two features. Rather it is a basic network that could now be fine-tuned to the individual needs of each feature. This would be a good starting point for a follow-up project (see \autoref{ch:Conclusion}). -As already mentioned before (like for the feature engineering network from \autoref{subsec:FeatureEngineering}), the lack of a clear threshold regarding certain features like rusty body or bent might be a factor that reduces the possible performance of the \acrshort{cnn}. The smooth transition of the presence and absence of a feature makes it more difficult to categorize images close to named transition. In contrast, the performance of the model is well on features that were previously not labeled by humans but labeled automatically, like width and length. This difference between humanly labeled and automatically labeled features is striking in the network’s results. - -A drawback of the approach is the need of more labeled data with a feature being present to make the network more robust to outliers. +As already mentioned before (like for the feature engineering network from \autoref{subsec:FeatureEngineering}), the lack of a clear threshold regarding certain features like rusty body or bent might be a factor that reduces the possible performance of the \acrshort{cnn}. The smooth transition of the presence and absence of a feature makes it more difficult to categorize images close to named transition. 
In contrast, the model performs well on features that were previously not labeled by humans but labeled automatically, like width and length. This difference between hand-annotated and automatically labeled features is striking in the network’s results.
+A drawback of the approach is the need for more labeled data in which a feature is present to make the network more robust.

\bigskip
-Feedforward \acrshortpl{cnn} were applied to predict individual features, for multilabel prediction, and predictions based on snippets that depict asparagus heads. In addition, effects of a custom loss function were tested. Promising in the multilabel prediction is that not only individual features, but also the relation between features can be considered in the learning process. +Feedforward \acrshortpl{cnn} were applied to predict individual features, for multi-label prediction, and predictions based on snippets that depict asparagus heads. In addition, effects of a custom loss function were tested. Promising in the multi-label prediction is that not only individual features, but also the relation between features can be considered in the learning process.

-Our multilabel \acrshort{cnn} reaches an accuracy up to 87\%, which seems high. However, when looking at the accuracy and loss values over time, one can see that the model does not improve much. While sensitivity and specificity improve, and therefore indicate learning, the validation loss remains high, indicating overfitting. This model seems to be especially sensitive to the imbalance between 0 and 1 in the label vectors. Concerning this, there is still room to play around with our parameters to further improve the architecture. +Our multi-label \acrshort{cnn} reaches an accuracy of up to 87\%, which seems high. However, when looking at the accuracy and loss values over time, one can see that the model does not improve much. While sensitivity and specificity improve, and therefore indicate learning, the validation loss remains high, indicating overfitting. This model seems to be especially sensitive to the imbalance between 0 and 1 in the label vectors. After solving the problem of overfitting, more fine-tuning of the parameters and the network architecture could improve the performance of this approach.

\bigskip
Applying \acrshort{pca} on individual features and projecting the image information into a smaller dimensional subspace showed promising results. It revealed that the first principal components managed to capture most of the information. However, differences between most features seem to be too small to be adequately represented in the low dimensional space.

-In this approach, the features width, length and hollow seem to be classifiable with high performance, and the features bent and rusty body seem to be most difficult. Width, length and hollow (as hollow asparagus is likely to be thick) are features that can be related to the shape and spatial appearance of the spear in the picture. This walks together with the findings that the first principle components (so the most important ones) only refer to the appearance of the asparagus in the picture and show the same picture for all asparagus. This leads to the assumption that the spatial appearance is counted as the most important feature, rather than taking the surface of the spears into account. This problem could be improved by generating pictures, where the possible asparagus positions are equally distributed over the pictures. Another reason for the similarities between the first principal component pictures is that one asparagus can have many features. Therefore the same pictures can be used for several feature \acrshortpl{pca}. As asparagus with only one present feature is rather difficult to find, another solution to solve this problem needs to be found.
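As an illustration of this procedure, the sketch below shows how flattened images could be projected into a low-dimensional space with scikit-learn; the input file and the number of components are assumptions. Reshaping a leading component back into image shape makes it possible to inspect whether it mainly encodes the position of the spear rather than its surface.

\begin{verbatim}
# Sketch: project flattened asparagus images into a low-dimensional PCA space.
# The input file and the number of components are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

images = np.load("preprocessed_images.npy")     # (n_samples, height, width)
flattened = images.reshape(len(images), -1)     # one row per spear

pca = PCA(n_components=50)
projected = pca.fit_transform(flattened)

print(pca.explained_variance_ratio_[:5])        # variance share per component
# Reshape a component into image shape to inspect what it responds to.
first_component_image = pca.components_[0].reshape(images.shape[1:])
\end{verbatim}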
+In this approach, the features width, length and partly hollow seem to be classifiable with high performance, and the features bent, violet and rusty body seem to be most difficult. Width, length and hollow (as hollow asparagus is likely to be thick) are features that can be related to the shape and spatial appearance of the spear in the picture. This is in line with the finding that the first principal components only refer to the appearance of the asparagus in the picture. This leads to the assumption that the spatial appearance is counted as the most important feature, rather than taking the color and structure of the spears into account. This problem could be improved by generating pictures, where the possible asparagus positions are equally distributed over the pictures. Another reason for the similarities between the first principal component pictures is that one asparagus can have many features. Therefore, the same pictures can be used for the \acrshortpl{pca} of several features. As asparagus with only one present feature is rather difficult to find, another solution to solve this problem needs to be found.

\bigskip
-Similarly, \acrlongpl{vae} were used to derive a low dimensional representation using unsupervised learning. While some features such as the width and length are mapped to clearly differentiable regions in latent asparagus space, this is not the case for many others. Only as a tendency, spears labeled as bent are for example mapped to regions in the lower periphery. Autoencoders are known for blurry reconstructions. This is a possible explanation for the lack of clusters in latent space for features that relate to details that are not sufficiently reconstructed. +Similarly, \acrlongpl{vae} were used to derive a low dimensional representation using unsupervised learning. While some features such as the width and length are mapped to clearly differentiable regions in latent asparagus space, this is not the case for many others. For example, spears labeled as bent tend to be mapped to regions in the lower periphery. Autoencoders are known for blurry reconstructions. This is a possible explanation for the lack of clusters in latent space for features that relate to details that are not sufficiently reconstructed.

\bigskip
-Convolutional autoencoders were used for semi-supervised learning. However the results for this approach can be described as merely mediocre. One problem is arguably the mentioned insufficiency in reconstructing details. As details such as brown spots define target classes (e.g.\ rusty head) and they are not present in latent space. It is hard to establish a correlation of the respective latent layer activation and the target labels. Larger input image sizes or different network architectures that are suitable to reconstruct higher detail images could potentially help to improve performance of these semi-supervised learning algorithms. +Convolutional autoencoders were used for semi-supervised learning. However, the results for this approach can be described as merely mediocre. One problem is arguably the mentioned insufficiency in reconstructing details.
As details such as brown spots define target classes (e.g.\ rusty head) but are not present in latent space, it is hard to establish a correlation of the respective latent layer activation and the target labels. Larger input image sizes or different network architectures that are suitable to reconstruct images with higher detail could potentially help to improve performance of these semi-supervised learning algorithms. \bigskip -Detecting the feature rusty head has proven rather difficult even though a dedicated network was trained on snippets that show asparagus heads in rather high resolution. This is potentially the case because details that are hardly visible even to the human eye have to be considered that occur in different locations. Although better results are achieved for the feature flower, the same most likely holds for this category as well. In contrast, better results are achieved for features that relate to the overall shape of asparagus spears instead of fine details. This holds for the category hollow as well as for bent. Color features are detected especially well based on histograms of palette images while \acrshortpl{cnn} have proven suitable to detect shape related features. +Detecting the feature rusty head has proven rather difficult even though a dedicated network was trained on snippets that show asparagus heads in rather high resolution. This is potentially the case because some details are hardly visible even to the human eye. Although better results are achieved for the feature flower, the same holds for this category as well. Even better results are achieved for features that relate to the overall shape of asparagus spears instead of fine details. This holds for the category hollow as well as for bent. Color features are detected especially well based on histograms of palette images while \acrshortpl{cnn} have proven suitable to detect shape related features. As previously noted (i.e.\ in \autoref{subsec:FeatureEngineering} and \autoref{subsec:SingleLabel}), an obstructive factor for applications relying on labeled data proves to be the inconsistent labelling of certain data samples. Even for an expert in asparagus labeling like the owner of Gut Holsterfeld, setting a clear threshold for the absence or presence of specific features (and thus the attribution of a class label) becomes difficult in certain cases. This is partly evident in our classification approaches. Additionally, the agreement of manual annotators has to be better controlled during labeling. Thus, one suggestion for improvement would be to label the data a second time, with clear and consistent thresholds for feature presence and absence, then adapt and improve the supervised approaches. \bigskip -Our work was specially directed at the Autoselect ATS II sorting machine. Whether we have succeeded in improving its currently running sorting algorithm could not yet be clarified systematically because of a lack of time and resources for a suitable comparison and evaluation. An idea for evaluating the current sorting method and our developed methods would be to run pre-sorted asparagus through the machine, test our approaches on the generated images, and then compare the performance of both. - -During our meetings we discussed existing difficulties of evaluation. If our algorithm controls the sorting machine, new possibilities to create an evaluation are given. -One method is to measure and compare the sorting of the harvesters who control the asparagus label after it was sorted by the machine. 
Further methods need to be considered. If necessary, the setup for automated procedures can be extended. +Our work was specially directed at the Autoselect ATS II sorting machine. Whether we have succeeded in improving its currently running sorting algorithm could not yet be clarified systematically because of a lack of time and resources for a suitable comparison and evaluation. An idea for evaluating the current sorting method and our developed methods would be to run pre-sorted asparagus through the machine, test our approaches on the generated images, and then compare the performance of both. -In cooperation with the local asparagus farm Gut Holsterfeld and the manufacturer of the sorting machine, a concrete realization of our approaches should now be developed and tested. +In cooperation with the local asparagus farm Gut Holsterfeld and the manufacturer of the sorting machine, a concrete realization of our approaches should now be developed and tested. \bigskip -In summary, we successfully measured the width and height of asparagus spears and were able to develop detectors for the other features that performed surprisingly good, given the moderate inter-coder reliability that was partially due to unclear definitions of binary features such as bent or violet. +In summary, we successfully measured the width and height of asparagus spears and were able to develop detectors for the other features that performed surprisingly good, given the moderate inter-coder reliability that was partially due to unclear definitions of binary features such as bent or violet. Further, we provide an extensive theoretical and practical groundwork that can guide future developments of asparagus classification algorithms. \subsection{Methodology} \label{sec:DiscussionMethodology} -Looking back, there are several methodological issues, which we would process differently now. We started our study project at the beginning of April. This is as well the beginning of the asparagus harvesting season. Therefore, we were able to start collecting data straight away. On the down-side, we had to start collecting data without a detailed plan in advance. Planning the data collection ahead could have made the data acquisition more efficient, and structured. Afterwards we could not change relevant parameters to answer a lot of organizational and methodological questions such as: How much data do we need? What format do we need it in? Is autonomous calibration possible? How exactly do we store the images effectively and efficiently? Is the hardware setup as we need it? And if not, how can we improve it? What kind of measurements or changes do we want to perform with the camera? Is the illumination as we want it? Could stereo cameras or other 3D viewing techniques such as depth cameras or laser grid be used? Would we integrate an additional camera taking pictures from the bottom of the spear or from the head region separately? What should our pipeline look like? How can we get labeled data right away? +Looking back, there are several methodological issues, which we would process differently now. We started our study project at the beginning of April. This aligns with the beginning of the asparagus harvesting season. Therefore, we were able to start collecting data straight away. On the down-side, we had to start collecting data without a detailed plan. Planning the data collection ahead could have made the data acquisition more efficient, and structured. 
Afterwards we could not change relevant parameters to answer a lot of organizational and methodological questions such as: How much data do we need? What format and hardware setup satisfies our purpose? Is autonomous calibration possible? How exactly do we store the images effectively and efficiently? What kind of measurements or changes could be applied to the camera? Is the illumination sufficient? Could stereo cameras or other 3D viewing techniques such as depth cameras or laser grids be useful? Is it possible to integrate an additional camera to capture the bottom of the spear or the head region separately? How can we efficiently obtain labeled data? \bigskip -As already mentioned in \autoref{sec:Roadmap}, there was a misunderstanding between the group and the supporting asparagus farm about the type of data necessary. The already existing images were too few, and unlabeled. Therefore, we spent the entire asparagus season with data acquisition instead of starting with preprocessing as had been planned at project start. The number of labeled images that were collected by running pre-sorted classes though the machine is arguably insufficient to learn classes using the chosen deep learning approaches~\citep{russakovsky2013detecting,russakovsky2010attribute,how_many_images}. Therefore, a lot of time was spent on preprocessing and labeling the data manually. - -Another discussion point concerns our data. The image quality in terms of pixel size of our images is really high. Due to limited memory capacity and long runtimes of some of the tested networks, images needed to be down-sampled. Additionally, images of different file types were used (e.g.\ png, jpg), which also includes reduction in disc space. We should further investigate to what extent images can be down-sampled without losing critical information, in order to optimize the named difficulties. +As already mentioned in \autoref{sec:Roadmap}, a misunderstanding between the group and the supporting asparagus farm about the type of data necessary occured. The already existing images were too few, and unlabeled. Therefore, we spent the entire asparagus season with data acquisition instead of starting with preprocessing as planned. The number of labeled images that were collected by running pre-sorted classes though the machine is arguably insufficient to learn classes using the chosen deep learning approaches~\citep{russakovsky2013detecting,russakovsky2010attribute,how_many_images}. Therefore, additional time was spent on preprocessing and labeling the data manually. -Even though three images of every asparagus spear are given, they are all taken from the same angle – from above. In the ideal case, the asparagus spear rotates over the conveyor belt, such that each spear is depicted in the pictures from a different viewpoint. The better the asparagus rotates, the more reliable is a later judgement of the spear in terms of class labels or features. Since the rotation is often missing when the spear is too bent, an additional angle could improve the rating. Concrete ideas on how to improve the setup are given in the last chapter, \ref{ch:Conclusion}~\nameref{ch:Conclusion}. +Another discussion point concerns the data. The image quality in terms of pixel size of our images is really high. Due to limited memory capacity and long runtimes of the tested networks, images needed to be down-sampled. It should be further investigated to what extent images can be down-sampled without losing critical information. 
-As previously mentioned, our labels of asparagus features are partly achieved by computer vision algorithms, partly based on human perception. As previously outlined, human performance is commonly acknowledged as the baseline performance in classification tasks. While the performance of our automatic feature extraction for length and width is really high, we decided that for the features violet, rusty head, rusty body, bent, hollow, and flower a human perception would be more accurate. Even though this is commonly used as the \enquote{gold standard}, it can of course also bring more variation, and maybe even inconsistency between raters, than an algorithm. +Even though three images of every asparagus spear are given, they are all taken from above. In the ideal case, the asparagus spear rotates over the conveyor belt, such that each spear is depicted in the pictures from a different viewpoint. The better the asparagus rotates, the more reliable is a later judgement of the spear in terms of class labels or features. Since the rotation is often missing when the spear is too bent, an additional camera could improve the rating. Concrete ideas on how to improve the setup are given in the last chapter, \ref{ch:Conclusion}~\nameref{ch:Conclusion}. -As explained in the section ~\nameref{sec:Preprocessing}, during preprocessing for the labeling procedure, the three images of one asparagus spear are joined together and then labeled. Therefore, labeling was faster because three perspectives are labeled at once. However, it could also be tried to use the single perspectives and conclude the classes from a combination of the recognized features. +As previously mentioned, our labels of asparagus features are partly achieved by computer vision algorithms and partly based on human perception. Human performance is commonly acknowledged as the baseline performance in classification tasks. While the performance of our automatic feature extraction for length and width is really high, for the features violet, rusty head, rusty body, bent, hollow, and flower a human perception is more accurate. Even though this is commonly used as the \enquote{gold standard}, it holds space for variation, and maybe inconsistency between and even within raters, in contrast to an algorithm. -We kept the features binary, as this is easier to label, and easier to use for our supervised classification approaches. The down side of a binary label is, however, that a clear boundary is set, where in real life there is a smooth transition. Even for our supervising farmer, it is sometimes difficult to decide on a boundary. This is arguably due to the small differences between class labels, and vague borders between positive and negative examples. While the binary representation makes certain analyses and classification much easier, it also brings restrictions. +As explained in the section ~\nameref{sec:Preprocessing}, during preprocessing for the labeling procedure, the three images from different perspectives of one asparagus spear are labeled together. Therefore, labeling was faster. +We kept the features binary, as this is easier to label, and suitable for supervised classification approaches. The down side of a binary label is, however, that a clear boundary is set, where in real life there is a smooth transition. Even for our supervising farmer, it is sometimes difficult to decide on a boundary due to the small differences between class labels, and vague borders between positive and negative examples. 
While the binary representation makes certain analyses and classification much easier, it also brings restrictions. -Moreover, we observed difficulties concerning the labels in the communication between the group and the farmer. The communicated need is that the sorting algorithm works \enquote{better}. But what does that technically mean? And what is technically possible? For the farmer, the sorting would already be \enquote{better}, if the sorting mistakes would be more systematic. This would not necessarily mean that the overall accuracy of correctly sorted asparagus into one out of 13 class labels needs to improve, but that the overall impression of all spears sorted into one tray is more homogeneous. +Moreover, we observed difficulties in the communication between the group and the farmer. The goal was to improve the sorting algorithm. But what that technically refers to and what improvements are technically possible remained unclear. For example, the sorting would already be improved, if the sorting mistakes would occur more systematically. This would not necessarily mean that the overall accuracy of correctly sorted asparagus into classes is greater, but that the overall impression of all spears sorted into one tray is more homogeneous. \subsection{Organization} @@ -93,7 +88,7 @@ \subsection{Organization} Further, the strengths of the single members have to be evaluated before the project starts, so they can be used efficiently. Although, not everyone has to do the task he or she is best at. One should also have the opportunity to work on tasks that are new, challenging and interesting. It allows each member to broaden their skills and it avoids discouragement. -Finally, the team agrees to have more focus on the overall goal than only think of what directly lies ahead. For this, concrete goals have to be formulated well. Milestones or intermediate goals should be defined and evaluated more rigorously. More time has to be taken into consideration when planning ahead as well as for including adjustments. +Finally, the team agrees to focus more on the overall goal than to only think of what directly lies ahead. For this, concrete goals have to be formulated well. Milestones or intermediate goals should be defined and evaluated more rigorously. More time has to be taken into consideration when planning ahead as well as for including adjustments. \bigskip In conclusion, the experience of having two different working structures gave us the ability to compare and judge what is essential to successful teamwork. It also helped to understand how each member can contribute to the team regarding personal skills and interests, and what each member wants to improve for future teamwork. diff --git a/documentation/report/Chapters/Introduction.tex b/documentation/report/Chapters/Introduction.tex index 9476a23..b19bff6 100644 --- a/documentation/report/Chapters/Introduction.tex +++ b/documentation/report/Chapters/Introduction.tex @@ -9,9 +9,9 @@ \section{Introduction} \bigskip The aim of this study project was to investigate how techniques from computer vision, both classical and deep learning-based, can be applied to improve classification results for asparagus sorting machines. Asparagus is sorted by quality. The quality is defined based on several features that represent differences in color, shape, or texture. Hence, improving sorting algorithms for the classification of asparagus requires reliable means of measuring these features. 
As such, we were interested in the question of how the quality-defining features of asparagus can be estimated using state of the art computer vision and machine learning techniques. -The project was supported by a local asparagus farm and a company specializing in asparagus sorting machines. They provided training data and expertise on the classification of the asparagus. The idea to apply new machine learning approaches to a commercial problem of the agricultural industry seemed promising and challenging at the same time. As the data received from the farm was unlabeled, a larger focus on the preprocessing of the recorded image data became inevitable. Further, the variance in quality classes and the subjective view of human sorting behavior toughened the perspective on a practical application into the actual sorting machine. Nevertheless, the original goal of studying different techniques in computer vision and testing their usability in a fixed hardware setting for asparagus classification gave valuable insights into the practical application of previously theoretically assessed knowledge. +The project was supported by a local asparagus farm and a company specializing in asparagus sorting machines. They provided training data and expertise on the classification of the asparagus. The idea to apply new machine learning approaches to a commercial problem of the agricultural industry seemed promising and challenging at the same time. As the data received from the farm was unlabeled, a larger focus on the preprocessing of the recorded image data became inevitable. Further, the variance in quality classes and the subjective view of human sorting behavior complicated the prospect of a practical application in the actual sorting machine. Therefore, the goal of the project shifted to a more broadly based, exploratory study in which we developed different techniques in computer vision and tested their usability in a fixed hardware setting. This gave us valuable insights into the practical application of previously theoretically assessed knowledge. -The hands-on experience on a long-term project gave us the chance to improve our project management skills, learn how to distribute work, and how to organize ourselves in a larger team. Over the course of one year, the team members worked on the implementation of different approaches to tackle the issue of image classification. After intense engagement with the subject, the methodologies and the project outcome were documented and critically reviewed. On the following pages, the different stages of the project together with the results of the applied computer vision approaches are described in detail. The report chapters are mostly in chronological order, each with the focus on a different stage of the study project throughout the year. +The hands-on experience on a long-term project gave us the chance to improve our project management skills, learn how to distribute work, and how to organize ourselves in a larger team. Over the course of one year, the team members worked on the implementation of different approaches to tackle the issue of image classification. After intense engagement with the subject, the methodologies and the project outcome were documented and critically reviewed. On the following pages, the different stages of the project together with the results of the applied approaches are described in detail. 
The report chapters are mostly in chronological order, each with the focus on a different stage of the study project throughout the year. \bigskip In chapter~\ref{ch:Introduction}~\nameref{ch:Introduction}, the idea of the project is presented together with an introduction to the current standards of image classification in machine learning as well as a general background on the quality classes, the sorting process, and classification of asparagus in the agricultural industry. @@ -22,9 +22,9 @@ \section{Introduction} Chapter~\ref{ch:Classification}~\nameref{ch:Classification} constitutes the different approaches that were chosen to tackle the issue of image classification. All approaches are critically reviewed and discussed in the subchapters. An emphasis was put on the application of deep learning techniques for the classification of the asparagus. As the collected data includes labeled and unlabeled samples, methods from the range of supervised learning, semi-supervised learning, and unsupervised learning were tested. -Chapter~\ref{ch:Summary}~\nameref{ch:Summary} contains the overall results of the classification approaches. The outcome of each approach is described and compared to the other methods as well as to the original classification accuracy of the sorting machine at the asparagus farm. Further, the overall results are discussed with an evaluation of their practical application in the food industry in chapter~\ref{ch:Discussion}~\nameref{ch:Discussion}. The overall outcome of the project is judged as a scientific study and as a team management experience. +Chapter~\ref{ch:Summary}~\nameref{ch:Summary} contains the overall results of the classification approaches. The outcome of each approach is described and compared with each other as well as to the original classification accuracy of the sorting machine at the asparagus farm. Further, the overall results are discussed with an evaluation of their practical application in the food industry in chapter~\ref{ch:Discussion}~\nameref{ch:Discussion}. The overall outcome of the project is judged as an exploratory scientific study and as a team management experience. -In Chapter~\ref{ch:Conclusion}~\nameref{ch:Conclusion}, the findings are set into a broader perspective. The project and its results are summarized on a scientific basis as well as on an organizational level while the future prospects of the project are assessed. +In Chapter~\ref{ch:Conclusion}~\nameref{ch:Conclusion}, the findings are set into a broader perspective. The project and its results are summarized on a scientific basis as well as on an organizational level. As an outlook, the existing preliminary work is summed up, while the future prospects of the project are assessed by evaluating possible starting points for follow-up projects. \bigskip Before analyzing the specific context of software development for the application in classification tasks further, a thorough insight into the idea of the study project is given in the following chapter. Additionally, background on the addressed topics of machine learning and the sorting of asparagus is provided. @@ -34,19 +34,19 @@ \section{Introduction} \subsection{The project} \label{sec:Project} -The objective of the study project was to find both, conventional and deep learning computer vision-based approaches that can be tested for their practical application in the commercial sorting of white asparagus. 
The different methods were implemented based on the image data that was received from the automatic sorting machine Autoselect ATS II employed by the asparagus farm ``Gut Holsterfeld’’~\footnote{see the website of Gut Holsterfeld at~\url{https://www.gut-holsterfeld.de/}}. It is explored whether approaches from the fields of machine learning and computer vision can be applied to improve the classification behavior of the asparagus sorting machine. Further it is investigated, if these approaches can be used in a more industrial than scientific setting for the specific task of asparagus classification. The initial intention to directly implement new software into the machine was postponed to allow for intensive research on the different approaches and on fine-tuning the received data to build a practical data set to train neural networks. +The objective of the study project was to find both conventional and deep learning-based computer vision approaches that can be tested for their practical application in the commercial sorting of white asparagus. The different methods were implemented based on the image data that was received from the automatic sorting machine Autoselect ATS II employed by the asparagus farm \enquote{Gut Holsterfeld}~\footnote{see the website of Gut Holsterfeld at~\url{https://www.gut-holsterfeld.de/} (visited on 03/15/2020)}. It is explored whether approaches from the fields of machine learning and computer vision can be applied to improve the classification behavior of the asparagus sorting machine. Further, it is investigated whether these approaches can be used in a more industrial than scientific setting for the specific task of asparagus classification. The initial intention to directly implement new software into the machine was postponed to allow for intensive research on the different approaches and on fine-tuning the received data to build a practical data set to train neural networks. -The idea for the project was formed by one of the students of the study group. She has a familial relationship to the asparagus farm Gut Holsterfeld and had received note of the unsatisfying classification performance of its sorting machine. The practical application to a real-world problem sparked the interest of the group and the general curiosity towards computer vision, in particular neural networks, further inspired to deal with the sorting issue in a deep learning context, as well as with classical methods. +The idea for the project was formed by one of the students of the study group. She has a familial relationship to the asparagus farm Gut Holsterfeld and had been made aware of the unsatisfying classification performance of its sorting machine. The practical application to a real-world problem sparked the interest of the group, and the general curiosity towards computer vision and machine learning further inspired the group to deal with the sorting issue by experimenting with a broad selection of methods. All project members are students in the field of Cognitive Science at the University of Osnabr{\"u}ck. The study project is part of the Master Program in Cognitive Science at the University of Osnabr{\"u}ck. It is supervised by Dr.\ Ulf Krumnack and Axel Schaffland. The study project intends to confirm the ability of its participants to independently formulate and solve an unknown problem from the scientific context of one subject area using the methods and terms they previously learned~\citep{moduledescription,studyregulations}. 
This includes the documentation and presentation of the results, the methodologies as well as the reflection on the work process. Within the scope of the project, for example, the development of software, analysis, and interpretation of statistical data material is practiced. A further aspect of the study project is to deepen the communicative and decision-making competence of its participants~\citep{moduledescription}. The idea is to train independent project work in groups of students under conditions that are common for research projects in science or industry. -As the project took part in the scope of an examination for the Cognitive Science Master Program at the University of Osnabr{\"u}ck, most of the work was developed at the university, that is, at the \acrfull{ikw}. +As the project took part in the scope of an examination for the Cognitive Science Master Program at the University of Osnabr{\"u}ck, most of the work was developed at the university, that is, at the \acrfull{ikw}. -Additional image data was received from the asparagus farm ``Querdel’s Hof’’~\footnote{see the website of Querdel’s Hof at~\url{https://www.querdel.de/}}. Further cooperation existed with the mechanical engineering company HMF Hermeler Maschinenbau GmbH~\footnote{see the website of HMF Hermeler at~\url{www.hmf-hermeler.de}} that developed the sorting machine Autoselect ATS~II and provided valuable expertise on the sorting issue. +Additional image data was received from the asparagus farm \enquote{Querdel’s Hof}~\footnote{see the website of Querdel’s Hof at~\url{https://www.querdel.de/} (visited on 03/15/2020)}. Further cooperation existed with the mechanical engineering company HMF Hermeler Maschinenbau GmbH~\footnote{see the website of HMF Hermeler at~\url{www.hmf-hermeler.de} (visited on 03/15/2020)} that developed the sorting machine Autoselect ATS~II and provided valuable expertise on the sorting issue. -All associated software is stored online, in our GitHub repository, and at the institute internal storing system.\footnote{The documentation to the project can be found at~\url{https://asparagus.readthedocs.io/en/latest}. The GitHub repository can be found at~\url{https://github.com/CogSciUOS/asparagus}. \\ Until 31/07/2020, the internal storing folder to the project at the University of Osnabr{\"u}ck could be accessed via~/net/projects/scratch/winter/valid\_until\_31\_July\_2020/asparagus. Since then, it can be accessed via~/net/projects/scratch/summer/valid\_until\_31\_January\_2021/jzerbe, until 31/01/2021.} +All associated software is stored online, in our GitHub repository, and at the institute internal storing system.\footnote{The documentation to the project can be found at~\url{https://asparagus.readthedocs.io/en/latest} (as of 11/27/2020). The GitHub repository is located at~\url{https://github.com/CogSciUOS/asparagus} (as of 11/27/2020). \\ Until 07/31/2020, the internal storing folder to the project at the University of Osnabr{\"u}ck could be accessed via~/net/projects/scratch/winter/valid\_until\_31\_July\_2020/asparagus. Since then, it can be accessed via~/net/projects/scratch/summer/valid\_until\_31\_January\_2021/jzerbe, until 01/31/2021.} \subsection{Classification based on computer vision} @@ -54,25 +54,25 @@ \subsection{Classification based on computer vision} In this chapter, the current standard of computer-based image classification is described. The main focus will be on \acrfullpl{ann} and classical computer vision techniques. 
A broad overview of relevant topics will be given and their importance for this project will be underlined. -Computer vision is a field of computer science that aims to automatically extract high-level understanding from image data and provide appropriate output. It is closely linked with machine learning, which describes the ability of a system to learn and improve from experience rather than being specifically programmed. Machine learning is frequently used to solve computer vision tasks. +Computer vision is a field of computer science that aims to extract high-level understanding from image data. It is closely linked with machine learning, which describes the ability of a system to learn and improve from experience rather than being specifically programmed. Machine learning is frequently used to solve computer vision tasks. \bigskip -Image classification is one of the main subfields of computer vision which gains a lot of attention in the scientific as well as the economic world. Besides the classical computer vision techniques that use algorithmic approaches to determine patterns, edges, and other points of interest that can help to classify images, artificial intelligence was introduced to the field in the 1960s \citep{szeliski2010computer}. Since then, more and more creative and complex artificial neural networks were introduced to solve numerous classification tasks, including the recognition of letters, faces, and street signs~\citep{mironczuk2018recent,balaban2015deep,stallkamp2011german}. +Image classification is one of the main subfields of computer vision which gains a lot of attention in the scientific as well as the economic world. Besides the classical computer vision techniques that use algorithmic approaches to determine patterns, edges, and other points of interest that can help to classify images, artificial intelligence was introduced to the field in the 1960s to automate intelligent behaviour~\citep{szeliski2010computer}. Since then, more and more creative and complex technologies such as artificial neural networks were introduced to solve numerous classification tasks, including the recognition of letters, faces, and street signs~\citep{mironczuk2018recent,balaban2015deep,stallkamp2011german}. -Some experts claim that artificial neural networks revolutionized the field of image classification, yielding better results than ever before~\citep{he2016deep,alexnet2012original}. More and more challenges, for example ImageNet \citep{russakovsky2015imagenet}, were introduced and computer scientists all over the world implemented creative solutions for the proposed problems. Additionally, influential companies like Google have a deep interest in finding solutions for image-based classification problems and push the research on these and related topics even further. In the era of optimization, computer-based classification became indispensable in many industries and also found its way into agriculture which is the field of interest in this study project. -In the following, a short introduction in both artificial neural networks and classical computer vision techniques is given which will span a bridge to our current classification problem. +Some experts claim that especially artificial neural networks, as one of the popular methods in the field of artificial intelligence, revolutionized the field of image classification, yielding better results than ever before~\citep{he2016deep,alexnet2012original}. 
More and more challenges, for example ImageNet \citep{russakovsky2015imagenet}, were introduced and computer scientists all over the world implemented creative solutions for the proposed problems. Additionally, influential companies like Google have a deep interest in finding solutions for image-based classification problems and push the research on these and related topics even further. In the era of optimization, computer-based classification became indispensable in many industries and also found its way into agriculture which is the field of interest in this study project. +In the following, a short introduction to artificial neural networks and classical computer vision techniques is given, which will build a bridge to our current classification problem. \bigskip -Neural networks are used for image classification in many domains. In contrast to the algorithmic approaches, it is not determined by the programmer how and what exactly the neural network learns. A large amount of data is provided to the model, which then extracts relevant information to learn the classification task. However, what is relevant to the model is not necessarily relevant or interpretable to a human observer. This is one of the biggest disadvantages compared to classical computer vision approaches, which are understandable and, therefore, interpretable and adjustable. Another problem that comes along with this is that the bias for those approaches often does not lie in the code itself but in the data, which is by far more difficult to detect and surpass. For this reason, many people see artificial neural networks as a black box, which may yield great results but cannot be fully understood. Recent advances try to tackle that issue by systematically researching artificial intelligence with the aim of making it more explainable~\citep{tjoa2019survey,gilpin2018explaining}. +Artificial neural networks are used for image classification in many domains. In contrast to the algorithmic approaches, it is not determined by the programmer how and what exactly the neural network learns. A large amount of data is provided to the model, which then extracts relevant information to learn the classification task. However, what is relevant to the model is not necessarily relevant or interpretable to a human observer. This is one of the biggest disadvantages compared to classical computer vision approaches, which are understandable (human readable) and, therefore, interpretable and adjustable. Another problem that comes along with this is that the bias for those approaches often does not lie in the code itself but in the data, which is by far more difficult to detect and overcome. For this reason, many people see artificial neural networks as a black box, which may yield great results but cannot be fully understood. Recent advances try to tackle that issue by systematically researching artificial neural networks with the aim of making them more explainable~\citep{tjoa2019survey,gilpin2018explaining}. In image classification, usually \acrfullpl{cnn} are used \citep{geron2019hands,lecun1995convolutional}, which are loosely inspired by the visual cortex of the brain. The idea is that highly specialized components or filters learn a very specific task, which is similar to the receptive fields of neurons in the visual cortex~\citep{hubel1962receptive}. 
These components can then be combined to high-level features which, in turn, can be combined to objects that can be used for classification~\citep{geron2019hands,bishop2006pattern,lecun1995convolutional}. In \acrshortpl{cnn} this concept is implemented by several successive convolutional layers in which one or more filters are slid over the input, generating so-called feature maps. Each unit in one feature map looks for the same feature but in different locations of the input. In recent years, \acrshortpl{cnn} improved so much that they outperform humans in many classification tasks~\citep{russakovsky2015imagenet,KarpathyConvNet}. \bigskip -Although machine learning exhibits very promising results and a lot of research and literature is available on the topic, many branches of industry still rely on traditional computer vision techniques in their implementation of image classification. This also applies to the asparagus sorting paradigm. To the best of our knowledge, no asparagus sorting machine is currently on the market that uses artificial intelligence for its classification algorithm. +Although artificial neural networks exhibit very promising results and a lot of research and literature is available on the topic, many branches of industry still rely on traditional computer vision techniques in their implementation of image classification. This also applies to the asparagus sorting paradigm. To the best of our knowledge, no asparagus sorting machine is currently on the market that uses artificial neural networks for its classification algorithm. -Many classical computer vision algorithms aim to detect and describe points of interest in the input images that can be generalized to features. For these features, low-level attributes such as rapid changes in color or luminance can be used. In contrast to the features learned with the help of machine learning, these features are not specific to any training data set and therefore do not depend on it being well-constructed. Further, they are usually created in a way that is interpretable by humans which makes them very flexible and easily adaptable to specific use cases. This is why in some cases, traditional computer vision techniques can solve a problem much more efficiently than deep learning approaches. +Many classical computer vision algorithms aim to detect and describe points of interest in the input images that can be generalized to features. For these features, low-level attributes such as rapid changes in color or luminance can be used. In contrast to the features learned with the help of artificial neural networks, these features are not specific to any training data set and therefore do not depend on it being well-constructed. Further, they are usually created in a way that is interpretable by humans which makes them very flexible and easily adaptable to specific use cases. This is why in some cases, traditional computer vision techniques can solve a problem much more efficiently than machine learning approaches. -One of the big advantages of deep learning algorithms is that they can extract underlying features from the data. In machine learning, it is essential to know about the components constructing a problem, whereas artificial neural networks have the ability to learn the features of the data, combine and correlate them and thus enable faster learning without an explicit command how to do so \citep{LeCun2015}. 
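+In contrast to such hand-crafted features, a \acrshort{cnn} learns its filters directly from data. As a purely illustrative sketch, a small binary classifier in TensorFlow/Keras could look as follows; the framework choice, the layer sizes, the input resolution, and the single sigmoid output are assumptions made for this example and do not correspond to any model developed in this project.
+\begin{verbatim}
+import tensorflow as tf
+
+# Each Conv2D layer slides a set of learnable filters over its input
+# and produces one feature map per filter; pooling layers reduce the
+# spatial resolution so that later layers combine larger contexts.
+model = tf.keras.Sequential([
+    tf.keras.layers.Conv2D(16, 3, activation="relu",
+                           input_shape=(128, 128, 3)),
+    tf.keras.layers.MaxPooling2D(),
+    tf.keras.layers.Conv2D(32, 3, activation="relu"),
+    tf.keras.layers.MaxPooling2D(),
+    tf.keras.layers.Flatten(),
+    tf.keras.layers.Dense(64, activation="relu"),
+    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary decision
+])
+model.compile(optimizer="adam", loss="binary_crossentropy",
+              metrics=["accuracy"])
+\end{verbatim}
+The filters of the convolutional layers correspond to the specialized components mentioned above; training such a model on labeled images would then proceed with \texttt{model.fit}, leaving it to the optimization to decide which features are extracted.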
+One of the big advantages of machine learning algorithms, such as artificial neural networks, is that they can extract underlying features from the data. In algorithmic approaches (where predictions are based on fixed formulas), it is essential to know about the components constructing a problem, whereas artificial neural networks have the ability to learn the features of the data, combine and correlate them and thus enable faster learning without an explicit command how to do so \citep{LeCun2015}. \bigskip In summary, both approaches have interesting implications for computer-based image classification tasks and provide promising techniques for our problem of asparagus classification. While we rely mostly on machine learning for the classification task itself, traditional computer vision algorithms are used to detect important features from the images. @@ -129,7 +129,7 @@ \subsection{Background on sorting asparagus} \label{tab:AsparagusLabels} \end{table} -In the European Union, there is a uniform system for the sorting of asparagus into quality classes~\citep{euspargelnorm,unspargelnorm}.\footnote{see \url{https://mlr.baden-wuerttemberg.de/de/unser-service/presse-und-oeffentlichkeitsarbeit/pressemitteilung/pid/nationale-handelsklassen-fuer-frisches-obst-und-gemuese-abgeschafft-1/}}\footnote{see \url{https://www.bzfe.de/inhalt/spargel-kennzeichnung-5876.html}} However, supply and demand usually determine the number and accuracy of these classes. One of the first defining features is the color of the asparagus which comprises four categories: white, violet, violet-green, and green~\citep{euspargelnorm}. For this project, only the first two colors are of relevance. A further distinction is made between the quality classes Extra, class I, and class II. The class Extra defines the product as perfect quality, class I defines it as good quality, and class II includes products that do not qualify for the other classes but satisfy the minimum requirements for commercial distribution~\citep{euspargelnorm}. The last distinction is made for the characteristics of length and width. White and violet asparagus may not exceed 220 mm in length. The minimal length of the asparagus varies but should be above 170 mm for long asparagus. Additionally, there is some level of tolerance accepted for the quality classes. According to the quality class, 5\% - 15\% of wrongly sorted asparagus is tolerated in a package or bundle~\citep{euspargelnorm}. +In the European Union, there is a uniform system for the sorting of asparagus into quality classes~\citep{euspargelnorm,unspargelnorm}.\footnote{see \url{https://mlr.baden-wuerttemberg.de/de/unser-service/presse-und-oeffentlichkeitsarbeit/pressemitteilung/pid/nationale-handelsklassen-fuer-frisches-obst-und-gemuese-abgeschafft-1/} (visited on 03/16/2020)}\footnote{see \url{https://www.bzfe.de/inhalt/spargel-kennzeichnung-5876.html} (visited on 03/16/2020)} However, supply and demand usually determine the number and accuracy of these classes. One of the first defining features is the color of the asparagus which comprises four categories: white, violet, violet-green, and green~\citep{euspargelnorm}. For this project, only the first two colors are of relevance. A further distinction is made between the quality classes Extra, class I, and class II. 
The class Extra defines the product as perfect quality, class I defines it as good quality, and class II includes products that do not qualify for the other classes but satisfy the minimum requirements for commercial distribution~\citep{euspargelnorm}. The last distinction is made for the characteristics of length and width. White and violet asparagus may not exceed 220 mm in length. The minimal length of the asparagus varies but should be above 170 mm for long asparagus. Additionally, there is some level of tolerance accepted for the quality classes. According to the quality class, 5\%--15\% of wrongly sorted asparagus is tolerated in a package or bundle~\citep{euspargelnorm}. \begin{figure}[!b] \centering diff --git a/documentation/report/Chapters/Preparations.tex b/documentation/report/Chapters/Preparations.tex index d2f670c..60de94f 100644 --- a/documentation/report/Chapters/Preparations.tex +++ b/documentation/report/Chapters/Preparations.tex @@ -18,7 +18,7 @@ \subsection{Roadmap of the project} \centering \includegraphics[scale=0.43]{Figures/chapter02/new_timetable.png} \decoRule - \caption[Timetable of the Project]{\textbf{Timetable of the Project}~~~The upper timeline shows the estimated time of the study project from April 2019 to April/Mai 2020. The lower timeline displays how the time was spent. Both timelines differ in that the year was more optimistically planned than realized. A major factor was the lack of experience of the participants concerning the conduction of a larger project with many co-workers as well as concerning the general implementation of the preprocessing stage for machine learning classification. Another factor influencing the shifted timeline was the appearance of a fifth major stage, the \emph{Manual Labeling}.} + \caption[Timetable of the Project]{\textbf{Timetable of the Project}~~~The upper timeline shows the estimated time of the study project from April 2019 to April/May 2020. The lower timeline displays how the time was spent. Both timelines differ in that the year was more optimistically planned than realized. Contributing factors were the lack of experience in managing a larger project and, on the technical side, the absence of expertise regarding the general implementation of the preprocessing stage. Additionally influencing the shifted timeline was the appearance of a fifth major stage, the \emph{Manual Labeling}.} \label{fig:Timetable} \end{figure} @@ -38,37 +38,38 @@ \subsection{Roadmap of the project} \label{fig:RoadmapActual} \end{figure} -The timetable in~\autoref{fig:Timetable} gives a broad outline of the major stages of the project. In the upper timeline of the figure, it was estimated how much time for a specific phase is needed, whereas in the lower timeline the real time spent for the stage is given. Both timelines are structured to display the project year, starting in April 2019 and ending in April/Mai 2020. The months are represented by the x-axis while the colors mark the different working stages. +The timetable in~\autoref{fig:Timetable} gives a broad outline of the major stages of the project. The upper timeline of the figure shows how much time for a specific phase was estimated, whereas the lower timeline shows the real time spent on each stage. Both timelines are structured to display the project year, starting in April 2019 and ending in April/May 2020. The months are represented by the x-axis while the colors mark the different working stages. 
-The project comprises four to five major stages: \textit{Data Collection \& Organization}, \textit{Preprocessing}, \textit{Manual Labeling}, \textit{Classification}, and \textit{Evaluation}. A detailed representation of the single tasks attributed to each stage can be found in the roadmaps in~\autoref{fig:RoadmapPlanned} and~\autoref{fig:RoadmapActual}. The project started with the data collection and the organization of the study project. During the first stage, the images were recorded with the sorting machine, while the major planning and research for the project took place. In the second phase, most of the preprocessing happened, that is, preparing and labeling the image data. The classification stage includes the time spent on the machine learning approaches which were implemented and trained on the asparagus data. In the last stage, the approaches were evaluated and their results were compared. The different stages overlapped to a certain degree. For the purpose of this figure, the start and end time is displayed as a hard boundary. +The project comprises five major stages: \textit{Data Collection \& Organization}, \textit{Preprocessing}, \textit{Manual Labeling}, \textit{Classification}, and \textit{Evaluation}. A detailed representation of the single tasks attributed to each stage can be found in the roadmaps in~\autoref{fig:RoadmapPlanned} and~\autoref{fig:RoadmapActual}. The project started with the data collection. During the first stage, the images were recorded with the sorting machine, while the major planning and research for the project took place. The second phase covered the preprocessing, that is, preparing and labeling the image data. The classification stage includes the time spent on the implementation of the machine learning approaches. In the last stage, the approaches were evaluated and their results were compared. The different stages overlapped to a certain degree. For the purpose of this figure, the start and end time is displayed as a hard boundary. -When comparing both timelines, some distinctions can be recognized. The upper timeline shows that preprocessing was estimated to be done by September. However, the phase continued until October, as can be observed in the lower timeline. Furthermore, the time for labeling a sufficient amount of images was underestimated, resulting in adjustments of the time attributed to this task. More specifically, it led to the \textit{Manual Labeling} of the image data receiving its own phase in the timetable, independent of the preprocessing phase. +Both timelines clearly differ. In detail, preprocessing was estimated to be done by September; however, the phase continued until October. Further, the time for labeling a sufficient amount of images was underestimated, resulting in adjustments of the time attributed to this task. The differences are depicted with different color codings. While the main focus of this project was supposed to be the application of different machine learning techniques to classify the data (color-coded in blue and green), the preprocessing phase and the data set creation/manual labeling proved to be the most time-consuming (color-coded in orange and yellow). -In~\autoref{fig:RoadmapPlanned} and~\autoref{fig:RoadmapActual}, the stage specific tasks can be seen in more detail. Again, both figures display the estimated time and the actual time, respectively. 
The headlines serve as a division into the major stages except for the first heading, \emph{Constant Work}, which shows the tasks that demanded continuous attention and effort throughout the year. The duration of tasks is represented in blue, while the yellow lines mark milestones that are explained in the legends. +In~\autoref{fig:RoadmapPlanned} and~\autoref{fig:RoadmapActual}, the stage-specific tasks are displayed in more detail. Again, both figures display the estimated time and the actual time, respectively. The headlines serve as a division into the major stages except for the first heading, \emph{Constant Work}, which shows the tasks that demanded continuous attention and effort throughout the year. The duration of tasks is represented in blue, while the yellow lines mark milestones that are explained in the legends. + \subsection{Organisation of the study group} \label{sec:Organization} -In this chapter, the management of the work distribution and the communication are taken into focus. For this the tools that were used for communication and organization will be examined as well as the structure of the group work. +This chapter focuses on the management of the work distribution and the communication. For this, the communication and organization tools that were used will be examined, as well as the structure of the group work. \subsubsection{Communication} \label{subsec:Communication} -The main communication consisted of weekly meetings in which the working process was discussed and new tasks were distributed. In addition to those meetings, different platforms were used, which worked with varying degrees of success. Used platforms were Asana, GitHub and Telegram to facilitate the communication aside from group meetings. The different means of communication will be described and evaluated in the following section. +The main communication took place in weekly meetings in which the working process was discussed and new tasks were distributed. In addition, different platforms such as Asana, GitHub, and Telegram were tried and used to varying degrees. The different means of communication will be described and evaluated in the following section. \bigskip -Regular meetings made up the core organization of our project. During the meetings a discussion leader and a protocol writer were picked.\footnote{The protocols were saved for review in the GitHub project at \\ \url{https://github.com/CogSciUOS/asparagus}} The meetings were characterized by long discussions about how to approach the upcoming project step and how to tackle the next challenge. The project’s supervisors were usually present at the meetings, to bring in their expertise and to give the opportunity to ask concrete questions. During the first half of the project, tasks were distributed at the end of each meeting. In the subsequent meeting the working progress was discussed. This procedure was changed in the second half of the project. A schedule was used that described the different tasks and deadlines in detail. Additionally, we regularly gathered for co-working. The organizational meetings were continued, during which everyone gave precise and structured reports on their area of responsibility. This helped us to spend less time discussing and have more time for task-relevant work. 
+The weekly meetings were organized by a previously announced discussion leader and documented by a protocol writer.\footnote{The protocols were saved for review in the GitHub project at \\ \url{https://github.com/CogSciUOS/asparagus} (as of 11/27/2020)} The project’s supervisors were usually present at the meetings to contribute their expertise. At the beginning, tasks were distributed at the end of the meeting. In the second half of the project, the work was distributed through a schedule describing tasks and deadlines in detail. Additionally, a weekly co-working space was established. -The majority of important information was exchanged via Telegram~\footnote{Telegram is a cloud-based instant messaging service for the use on smartphones, tablets and computers.}. Starting from the first meeting, we had a constant conversation in a group chat on Telegram, in which we informed each other about the status of the project as well as support each other by answering questions. The group chat also created space for mutual motivation when needed. + Starting from the first meeting, we had a constant information exchange through a group chat on Telegram~\footnote{Telegram is a cloud-based instant messaging service for use on smartphones, tablets, and computers.}. The group chat helped to update the others about the progress and to support each other with technical issues, but it also created space for mutual motivation when needed. -Additionally, Asana was used in the beginning of the project. The communication platform is usually used to distribute tasks and to communicate about them. Many integrations of other applications, such as Slack, can help to achieve this. However, the tasks were easier to distribute in direct consultation at physical meetings and results could be easier demonstrated or discussed. If we had relied on communication with Slack or other agreed services or applications, it might have made more sense, but Asana alone has proven to be inefficient in our use case. +Asana was used in the beginning of the project to distribute tasks. However, the tasks were easier to distribute in direct consultation at physical meetings, and results could be demonstrated and discussed more easily. -During the project, it was further learned how to work with GitHub~\footnote{GitHub is a web-based popular platform using the version control system Git that helps developers to store and manage their code, and track and control changes to their project.}. Git allowed us to work from anywhere, which facilitated the workflow. +During the project, we gained a lot of expertise in using GitHub~\footnote{GitHub is a popular web-based platform using the version control system Git that helps developers to store and manage their code, and track and control changes to their project.}. Git allowed us to contribute to each other's work from anywhere, which facilitated the workflow. -Furthermore, we were able to automatically create documentations via Sphinx. This means that by adhering to the style conventions, the protocols, work schedules, manuals, and code comments were automatically included in our documentation.\footnote{see our documentation at~\url{https://asparagus.readthedocs.io/en/latest/}} +Furthermore, we automatically generated documentation via Sphinx. 
Thus, by adhering to the style conventions, the protocols, work schedules, manuals, and code comments were automatically included in our documentation.\footnote{see our documentation at~\url{https://asparagus.readthedocs.io/en/latest/} (as of 11/27/2020)} \subsubsection{Teamwork} @@ -77,21 +78,20 @@ \subsubsection{Teamwork} This section starts by introducing the team members and their previous experiences. It is followed by a description of the practical aspects of teamwork, the working structure, and the distribution of project-relevant tasks. \bigskip -The project was an initiative of one of the students. A large part of the project members knew each other in private but had not yet worked together. Further students joined the project after its public announcement to complete the team. Thus, the group consisted of members with varying degrees of knowledge about each other. The team was initially made up by Josefine Zerbe, Katharina Groß, Malin Spaniol, Maren Born, Michael Gerstenberger, Richard Ruppel, Sophia Schulze-Weddige, Luana Vaduva, Thomas Klein, and Subir Das. None of the members had yet worked together as such on a project of this scope. During the course of the project, three members left the team for various reasons. Thomas left in July due to a change in his study program. Further, Luana and Subir left in October to pursue different study projects. - -The members brought a wide variety of backgrounds into the team through different bachelor programs or different majors in the broader field of Cognitive Science. In the beginning of the project, the team members had little to no experience in the application of computer vision or neural networks. The motivation of most students was to pursue new and interesting tasks in these fields. Four students had a theoretical background in computer vision, six students had gained some experience with neural networks through the course ``\acrshortpl{ann} with TensorFlow’’, taught at the University of Osnabr{\"u}ck. Some had also taken machine learning classes during their study program. Git was previously only used by three students, but none of them were experts on its usage. Further, the team had neither experience with the Grid system of the \acrshort{ikw}, nor with running jobs on different machines. None of the members had prior knowledge about project management or task organization on a broader level. +The project was an initiative of one member of the project group. Further students joined the project after its public announcement to complete the team. The team was initially made up of Josefine Zerbe, Katharina Groß, Malin Spaniol, Maren Born, Michael Gerstenberger, Richard Ruppel, Sophia Schulze-Weddige, Luana Vaduva, Thomas Klein, and Subir Das. None of the members had yet worked together as such on a project of this scope. During the course of the project, three members left the team for various reasons. +The members brought a wide variety of backgrounds into the team through different bachelor programs or different majors in the broader field of Cognitive Science. In the beginning of the project, the team members had little to no experience in the application of computer vision or neural networks. The motivation of most students was to pursue new and interesting tasks in these fields. Four students had a theoretical background in computer vision, and six students had gained some experience with neural networks. Git was previously only used by three students, but none of them were experts on its usage. 
Further, the team had neither experience with the Grid system of the \acrshort{ikw}, nor with running jobs on different machines. None of the members had prior knowledge about project management or task organization on a broader level. \bigskip -In the beginning, the team lacked some structure and a clear distribution of individual roles. One reason for this could have been the harmonious atmosphere between team members. Further tasks such as the trips to the asparagus farm strengthened the team spirit and the social interactions. Thus, the task distribution was very dynamically structured by making every decision democratically. Most tasks were performed in smaller teams of two to three people. During meetings, possible next tasks were formulated but without assigning them to specific members or working groups. This resulted in a lot of unassigned tasks and a discontinuous workflow. +In the beginning, the team lacked some structure and a clear distribution of individual roles. One reason for this could have been the harmonious atmosphere between team members. Further tasks such as the trips to the asparagus farm strengthened the team spirit and the social interactions. Thus, the task distribution was very dynamically structured by making every decision democratically. Most tasks were performed in smaller teams of two to three people. During meetings, tasks were formulated but not clearly assigned to members which led to a discontinuous workflow. -In August, the organization was restructured. On the one hand, a new structure for task distribution seemed more appropriate, instead of a democratic distribution. On the other hand, the strengths of the individual team members should be used more efficiently. Some team members had less programming experience than others. They had difficulties realizing certain tasks in an equal time period and with the same precision as others. Although they had good ideas in terms of concept, these were not implemented quickly enough to include them into the project. Nevertheless, it gave the opportunity to acquire new programming skills. To integrate more of the strengths that the single team members brought and to tackle the issue of time management, it was decided to write a work schedule that distributed the work more appropriately, gave an overview of the tasks that still had to be done and showed how much time was left to do them. +In August, a new structure for task distribution was introduced, such that the strengths of the individual team members could be used more efficiently. Due to the different backgrounds, team members could not resolve programming tasks equally efficiently, such that good concepts could not be realized quickly enough to include them into the project. Nevertheless, it gave the opportunity to acquire new programming skills. To integrate more of the strengths that the single team members brought and to tackle the issue of time management, it was decided to write a work schedule that distributed the work more appropriately, gave an overview of the tasks that still had to be done and showed how much time was left to do them. -The supervision of the work was divided into manager roles, which means that the work was split into different main fields. Each member was responsible for managing their assigned area, distributing tasks and keeping an overview of the relevant work inside their working field. The manager could be consulted for questions, when in need of discussion, or for feedback. 
The meetings became more effective due to the new structure, and there was less discussion concerning task distribution. +The supervision of the work was divided into manager roles and the work itself into different main fields. Each member was responsible for managing their assigned area, distributing tasks and keeping an overview of the relevant work inside their working field. The manager could be consulted for questions, when in need of discussion or feedback. This distribution made the meetings more effective. -Further, common working hours on campus were introduced. The common working hours ensured that questions and decisions that arose could be discussed in person. This was especially helpful when different tasks overlapped and required communication and agreement. +During common working hours, questions and decisions that arose could be discussed in person. This was especially helpful when different tasks overlapped and required communication and agreement. \bigskip -In conclusion, the team structure and the distribution of work changed over the course of the project. The strengths of single members were used more efficiently and the supervision of working areas led to a more structured time management and task distribution. +In conclusion, the team structure and the distribution of work changed over the course of the project. The strengths of the single members were used more efficiently and the supervision of working areas led to a more structured time management and task distribution. \subsection{Data collection} @@ -99,8 +99,6 @@ \subsection{Data collection} In this section the asparagus sorting machine at Gut Holsterfeld is described. Then the process of collecting labeled and unlabeled data is reported. -The machine Autoselect ATS~II (2003) is designed for sorting white and green asparagus (see~\autoref{fig:SortingMachine}) \citep{autoselectanleitung}. The asparagus is arranged on a conveyor belt that runs it through the recording section of the machine. Here, a camera takes three pictures per asparagus spear (see \autoref{fig:SortingMachineSketch} and \autoref{fig:ExampleImagesAnna}). Small wheels on the conveyor belt rotate the asparagus in the meantime so that it can be photographed from several positions. In the best case, on each image a different side of the asparagus is recorded. The conveyor belt transports the spear further and it is sorted into a tray depending on the chosen class label by the machine. The sorting is based on the parameters for width, length, shape, curvature, rust, and color. A total of 30 criteria for classifying an asparagus spear are used to describe these parameters. The calculation of the single features is based on a classical analytical approach. For example, the parameter for color detection is composed of eight sub-parameters. Each spear is reviewed at different areas (the head of the asparagus, the area below the head, and the stem) and judged for its hue in percentage. The values are compared and, according to a threshold, the spear is sorted into a color category (e.g.\, white or violet). For all parameters, there is a minimal threshold and a maximal threshold. As another example, the parameter for width detection calculates at three points at the asparagus (top, middle, and bottom part). From these three values, an average value is calculated that decides in which category the asparagus is sorted. If an asparagus exceeds the maximal threshold for parameter detection, it is not recognized and cannot be sorted accordingly. 
The same holds for values below the minimal threshold. Thus, all parameters have to have an upper and a lower threshold, including parameters that decide the presence of features like shape, curvature, and color. When evaluating what parameter boundaries to choose, it is recommended to check that most asparagus spears tend to be in between the average value and the maximal threshold, with a larger tendency to accumulate around the average value. Reportedly, the parameters and their respective ranges can be freely chosen by the user and can in this way be fitted to the needs of the respective asparagus farm~\citep{autoselectanleitung}. - \begin{figure}[!t] \centering \includegraphics[width=0.9\textwidth]{Figures/chapter02/asparagusconveyerbelt_new.png} @@ -111,10 +109,13 @@ \subsection{Data collection} \centering \includegraphics[scale=0.6]{Figures/chapter02/sortingmachine_front.png} \decoRule - \caption[The Autoselect ATS II at Gut Holsterfeld]{\textbf{Autoselect ATS~II at Gut Holsterfeld}~~~In the figure the asparagus sorting machine Autoselect ATS~II at Gut Holsterfeld can be seen. The conveyor belt transports the asparagus from the left side of the image to the right side. It thereby passes the camera system. The display of the machine gives information on the parameters and the images. The machine is mainly controlled from here.} + \caption[The Autoselect ATS II at Gut Holsterfeld]{\textbf{Autoselect ATS~II at Gut Holsterfeld}~~~In the figure, the asparagus sorting machine Autoselect ATS~II at Gut Holsterfeld is displayed. The conveyor belt transports the asparagus from the left side of the image to the right side. It thereby passes the camera system. The display of the machine gives information on the parameters and the images. The machine is mainly controlled from here.} \label{fig:SortingMachine} \end{figure} +\bigskip +The machine Autoselect ATS~II (2003) is designed for sorting white and green asparagus (see~\autoref{fig:SortingMachine}) \citep{autoselectanleitung}. First, the asparagus is arranged on a conveyor belt that runs it through the recording section of the machine. Here, a camera takes three pictures per asparagus spear (see \autoref{fig:SortingMachineSketch} and \autoref{fig:ExampleImagesAnna}). Small wheels on the conveyor belt rotate the asparagus throughout the process so that it can be photographed from several positions. In the best case, every image shows a different side of the asparagus. Subsequently, the asparagus is transported into a tray depending on its class label. The sorting is based on the parameters for width, length, shape, curvature, rust, and color. A total of 30 criteria for classifying an asparagus spear are used to describe these parameters. The calculation of the single features is based on a classical analytical approach. For example, the parameter for color detection is composed of eight sub-parameters. Each spear is reviewed at different areas (the head of the asparagus, the area below the head, and the stem) and judged for its hue in percentage. The values are compared and, according to a threshold, the spear is sorted into a color category (e.g.\, white or violet). For all parameters, there is a minimal threshold and a maximal threshold. As another example, the parameter for width detection is calculated at three points along the asparagus (top, middle, and bottom part). From these three values, an average value is calculated that decides in which category the asparagus is sorted. 
If an asparagus exceeds the maximal threshold for parameter detection, it is not recognized and cannot be sorted accordingly. The same holds for values below the minimal threshold. Thus, all parameters have to have an upper and a lower threshold, including parameters that decide the presence of features like shape, curvature, and color. When evaluating what parameter boundaries to choose, it is recommended to check that most asparagus spears tend to be in between the average value and the maximal threshold, with a larger tendency to accumulate around the average value. Reportedly, the parameters and their respective ranges can be freely chosen by the user and can in this way be fitted to the needs of the respective asparagus farm~\citep{autoselectanleitung}. A minimal, illustrative sketch of this threshold logic is given further below. + \begin{figure}[!htb] \centering \includegraphics[scale=0.3]{Figures/chapter02/sorting_machine_slots.png} @@ -125,19 +126,19 @@ \subsection{Data collection} Before the first use of the machine, all parameters are selected after a calibrating charge of asparagus has run through the machine. Then, the user can adjust the thresholds accordingly. -According to the manual, the number of quality classes is selectable. The user can define the order of quality classes by choosing the arrangement of parameters. The manufacturer suggests to first sort for length and width, then use the parameters that sort for color, and the parameters for shape detection last. +According to the manual, the number of quality classes is selectable by choosing the arrangement of parameters. The manufacturer suggests first sorting for length and width, then using the parameters that sort for color, and applying the parameters for shape detection last. -The accuracy of the sorting machine is described to be as good as 90\% best case by the manufacturer, while the farmer at Gut Holsterfeld reported it to be around 70\% at best, with re-sorting being necessary by professional sorters. Especially categories like Blume or Hohle were considered to be inconsistent by both, manufacturer and farmer. Further information could not be given on the software of the machine. A meeting with a representative of the engineering company HMF~\footnote{see \url{www.hmf-hermeler.de}} that manufactured the sorting machine was arranged. Unfortunately, the source code itself was not available to HMF as it was produced by another company. +The accuracy of the sorting machine is described by the manufacturer to be as high as 90\% in the best case, while the farmer at Gut Holsterfeld reported it to be around 70\% at best, with re-sorting by professional sorters being necessary. Especially categories like Blume or Hohle were considered to be inconsistent by both manufacturer and farmer. Further information could not be given on the software of the machine. A meeting with a representative of the engineering company HMF~\footnote{see \url{www.hmf-hermeler.de} (visited on 03/15/2020)} that manufactured the sorting machine was arranged. Unfortunately, the source code itself was not available to HMF as it was produced by another company. \bigskip -In the following it is described how the image data was collected. It is possible to save images with the Autoselect ATS II, however, the storage space on the machine is very limited. Further, the selection of images to be saved is restricted to only 1000 images. -One workaround to the problem is the installation of the Teamviewer software~\footnote{see \url{https://www.teamviewer.com/en/}} on the machine and the connection of an external hard drive. 
After the installation, the process of image collection could be started remotely. This work was very ineffective and time-consuming. The data could not be directly transmitted to another computer because the internet speed at the farm is too slow. An automatic transfer of the images to the external hard drive was not possible until the installation of an automatic file moving service, for which the requirements are described below. +It is possible to save images with the Autoselect ATS II; however, the storage space on the machine is very limited. Further, the selection of images to be saved is restricted to only 1000 images at a time. +One workaround to the problem is the installation of the Teamviewer software~\footnote{see \url{https://www.teamviewer.com/en/} (visited on 03/15/2020)} on the machine and the connection of an external hard drive. After the installation, the process of image collection could be started remotely. This procedure was very inefficient and time-consuming. The data could not be transmitted directly to another computer because the internet connection at the farm is too slow. An automatic transfer of the images to the external hard drive was not possible until the installation of an automatic file moving service, for which the requirements are described below. -The file moving program needs to transfer the images to a new saving destination and has to run in the background without disturbing the workflow of the sorting machine. After research on background processes and programs, the decision was made to use a service, that is, a system process running independently of any program. The service manages moving the newly generated image files, as described in detail in the appendix in \autoref{subsec:FileService}. +The file moving program transfers the images to a new storage location and runs in the background without disturbing the workflow of the sorting machine. We decided to use a service, that is, a system process running independently of any program. The service manages moving the newly generated image files, as described in detail in the appendix in \autoref{subsec:FileService}. -The project members split in groups of two and exchanged the hard drive two times a week. The collected images were then transferred to storage capacities of the university. +The external hard drive was exchanged twice a week, and the collected images were subsequently transferred to the storage facilities of the university. -The label that the machine attributes to each asparagus is not reliable. Therefore, sending the asparagus through the machine a second time would be the only way to gather labeled images. Unfortunately, a second sorting is not good for the quality of the asparagus. Further, at least one project member has to be involved in the re-sorting and, thus, has to be present at the farm. The sessions of exchanging the external hard drive and collecting labeled image data were combined. +The label that the machine attributes to each asparagus is not reliable. Therefore, sending the asparagus through the machine a second time would be the only way to gather labeled images. Unfortunately, a second sorting run degrades the quality of the asparagus. Consequently, only a limited amount of labeled data was collected.
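+To make the threshold-based decision logic of the sorting parameters described above more concrete, the following minimal sketch illustrates how a width parameter could be evaluated. It is purely illustrative: the function name, the threshold values, and the simplified subset of width categories are assumptions made here for the example and do not reflect the actual implementation of the Autoselect ATS~II, which was not available to us.
+
+\begin{verbatim}
+# Illustrative sketch only: a parameter is measured at three points,
+# averaged, and compared against user-configurable lower and upper
+# thresholds; values outside this range cannot be sorted by the machine.
+def categorize_width(top_mm, middle_mm, bottom_mm,
+                     lower_mm=12.0, upper_mm=28.0):
+    average = (top_mm + middle_mm + bottom_mm) / 3.0
+    if average < lower_mm or average > upper_mm:
+        return "not classifiable"   # outside the configured thresholds
+    if average < 16.0:
+        return "thin"
+    if average < 20.0:
+        return "medium thick"
+    return "thick"
+
+# Example: a spear measured at 21.3, 22.0, and 22.5 mm
+print(categorize_width(21.3, 22.0, 22.5))   # -> "thick"
+\end{verbatim}
+
+The same pattern of per-parameter thresholds applies analogously to the other criteria, such as color or curvature, whose exact sub-parameters are configured on the machine itself.
+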
\begin{figure}[!h] \centering @@ -154,12 +155,11 @@ \subsection{Data collection} \includegraphics[width=0.95\linewidth]{Figures/chapter02/anna_c.png} \caption{right} \end{subfigure} - \caption[Example Asparagus Images]{\textbf{Example Asparagus Images}~~~Example pictures of the quality class I~A~Anna. The asparagus is arranged on a conveyor belt that runs it through the recording section of the machine, where a camera takes three pictures. In picture (A) the target asparagus is to the left, in (B) it is in the middle, and in (C) it is to the right. Small wheels on the conveyor belt rotate the asparagus in the meantime so that it can be photographed from several positions. In the best case, on each image a different side of the asparagus is recorded. The conveyor belt transports the asparagus further and it is sorted into a tray depending on the chosen quality class by the machine. + \caption[Example Asparagus Images]{\textbf{Example Asparagus Images}~~~Example pictures of the quality class I~A~Anna; the capturing process is described in \autoref{fig:SortingMachineSketch}. In picture (A) the target asparagus is to the left, in (B) it is in the middle, and in (C) it is to the right. Small wheels on the conveyor belt rotate the asparagus in the meantime so that it can be photographed from several positions.} \label{fig:ExampleImagesAnna} \end{figure} -\bigskip An example image of the received data can be seen in~\autoref{fig:ExampleImagesAnna}. There are three pictures per asparagus. The image resolution is $1040\times1376$ pixels per image, with an RGB color space. \begin{table}[!htb] @@ -177,14 +177,10 @@ \subsection{Data collection} \hline \end{tabular}% } - \caption[Collected Images with Class Label]{\textbf{Collected Images with Class Label} \\ In this table, the number of collected images with a class label is reported. This was achieved by running pre-sorted spears a second time through the sorting machine} + \caption[Collected Images with Class Label]{\textbf{Collected Images with Class Label} \\ In this table, the number of collected images with a class label is reported. This was achieved by running pre-sorted spears through the sorting machine a second time.} \label{tab:LabeledClassNumber} \end{table} -In total, 612113 images were collected, with each class label being represented with at least 309 images, corresponding to 103 asparagus. - -At the asparagus farm Gut Holsterfeld, 591495 labeled and unlabeled images were collected with the Autoselect ATS~II. The number of unlabeled data is 578226 images, thus, around 192742 different asparagus spears. Of the labeled data, the number of images that were collected per quality class can be found in \autoref{tab:LabeledClassNumber}. The image number does not represent the number of different asparagus spears, as each asparagus spear is represented by three distinct images. - \begin{figure}[!ht] \centering \vspace{20pt} @@ -210,23 +206,28 @@ \subsection{Data collection} \label{fig:ExampleImagesQuerdel} \end{figure} -Additionally, a few images could be recorded at another asparagus farm, Querdel’s Hof~\footnote{see~\url{https://www.querdel.de/}}, in Emsb{\"u}ren. The farm sorts the asparagus with an updated version of the Autoselect ATS~II at Gut Holsterfeld, that is, it uses the same software but other hardware. In particular the resolution of the camera was improved and a second camera was installed that focuses on the head region of the asparagus.
At Querdel’s Hof, 20616 images were collected in total, 76 from the class label ``normal’’, 152 from the class label ``violet/flower’’, and 20388 unlabeled images. Each asparagus spear is represented by four images: three images show the asparagus from different perspectives and a fourth image depicts solely the head region. Example images for one asparagus can be seen in~\autoref{fig:ExampleImagesQuerdel}. No internet connection could be established to the farm, thus, no further images were collected. Moreover, the data format of the images from Querdel’s Hof is different to the data from the farm Gut Holsterfeld, due to the additional head image. Therefore a combination of both data sets was not convenient. +\bigskip +In total, 612113 images were collected, with each class label being represented by at least 309 images, corresponding to 103 asparagus spears. + +At the asparagus farm Gut Holsterfeld, 591495 labeled and unlabeled images were collected with the Autoselect ATS~II. The unlabeled data comprises 578226 images, corresponding to around 192742 different asparagus spears. Of the labeled data, the number of images that were collected per quality class can be found in \autoref{tab:LabeledClassNumber}. The image number does not represent the number of different asparagus spears, as each asparagus spear is represented by three distinct images. + +Additionally, a few images could be recorded at another asparagus farm, Querdel’s Hof~\footnote{see~\url{https://www.querdel.de/} (visited on 03/15/2020)}, in Emsb{\"u}ren. The farm sorts the asparagus with an updated version of the Autoselect ATS~II used at Gut Holsterfeld, that is, it uses the same software but different hardware. In particular, the resolution of the camera was improved and a second camera was added that focuses on the head region of the asparagus. At Querdel’s Hof, 20616 images were collected in total, 76 from the class label \enquote{normal}, 152 from the class label \enquote{violet/flower}, and 20388 unlabeled images. Each asparagus spear is represented by four images: three images show the asparagus from different perspectives and a fourth image depicts solely the head region. Example images for one asparagus can be seen in~\autoref{fig:ExampleImagesQuerdel}. No internet connection could be established at the farm; thus, no further images were collected. Moreover, the data format of the images from Querdel’s Hof differs from that of the farm Gut Holsterfeld. Therefore, a combination of both data sets was not practical. \subsection{Literature on food classification using computer vision} \label{sec:Literature} -In the classification of food products, there are numerous possibilities to apply classical machine learning and \acrshort{ann} approaches for classification tasks on image data~\citep{bhargava2018fruits,brosnan2002inspection}. +In the classification of food products, there are numerous possibilities to apply machine learning approaches for classification tasks on image data~\citep{bhargava2018fruits,brosnan2002inspection}. For the scope of this investigation, we decided to focus our literature search on fruit and vegetable quality evaluation using computer vision and machine learning. Compared to other fields, research and evaluation in agricultural classification shares many characteristics and faces similar difficulties. -The quality inspection based on computer vision is usually constituted into five main steps: image acquisition, preprocessing, image segmentation, feature extraction and classification~\citep{bhargava2018fruits}.
Moreover, most data in agriculture is based on photographic images. Also the features of interest are similar for different kinds of fruit or vegetable. Frequently by traditional computer vision techniques inspected features concern color, shape, size, texture, and defect~\citep{bhargava2018fruits}. This makes other papers in the field of agricultural evaluation directly comparable to our case. Moreover, we hope to get an impression of the state of the art of how many images are needed in our case, how high the image resolution needs to be, what kind of computer vision approaches could be helpful as a starting point, and also to become aware of known challenges. +The quality inspection based on computer vision usually consists of five main steps: image acquisition, preprocessing, image segmentation, feature extraction, and classification~\citep{bhargava2018fruits}. Moreover, most data in agriculture is based on photographic images. Also, the features of interest are similar for different kinds of fruit and vegetables. Features that are frequently inspected by traditional computer vision techniques concern color, shape, size, texture, and defects~\citep{bhargava2018fruits}. This makes other papers in the field of agricultural evaluation directly comparable to our case. Moreover, we hope to get an impression of the state of the art: how many images are needed in our case, how high the image resolution needs to be, what kind of computer vision approaches could be helpful as a starting point, and which challenges are already known. \bigskip -None of the found papers were suitable as blueprints for the asparagus classification project. However, some of them helped to get an idea of how to proceed with the project. For example, some papers show how the preprocessing phase could be structured~\citep{mery2013automated}, or they evaluate the machine learning methods that were already used on other food classification tasks~\citep{bhargava2018fruits}. Further, some of the literature is concerned with the classification of food products but not with differentiating between as many classes as 13~\citep{diaz2004comparison,kilicc2007classification}. Often, the variance in the food products, that is, the quality as well as the type of food used is either too high~\citep{zhang2012classification} or too low~\citep{kilicc2007classification,al2011dates} in comparison to the variance in our project data. One paper evaluates the sorting of asparagus, however, it only does so on a small data set with three categories of green asparagus~\citep{donis2016classification}. Further papers on food classification are not detailed enough in their explanations and do not share the information needed for replication~\citep{pedreschi2016grading}. Another paper is mainly about the implementation of a certain toolbox~\citep{mery2013automated}. +None of the papers found were suitable as blueprints for the asparagus classification project. However, some of them helped to get an idea of how to proceed with the project. For example, some papers show how the preprocessing phase could be structured~\citep{mery2013automated}, or they evaluate the machine learning methods that were already used on other food classification tasks~\citep{bhargava2018fruits}. Further, some of the literature is concerned with the classification of food products but not with differentiating between as many as 13 classes~\citep{diaz2004comparison,kilicc2007classification}.
Often, the variance in the food products, that is, in the quality as well as the type of food used, is either too high~\citep{zhang2012classification} or too low~\citep{kilicc2007classification,al2011dates} in comparison to the variance in our project data. One paper evaluates the sorting of asparagus; however, it only does so on a small data set with three categories of green asparagus~\citep{donis2016classification}. Further papers on food classification are not detailed enough in their explanations and do not share the information needed for replication~\citep{pedreschi2016grading}. -Even though no specific paper was used as guidance to our project, some specific papers inspires us to try out certain algorithms, such as PCA ~\citep{Vijayarekha2008, Zhu2007}) or neural networks ~\citep{Jhuria2013, Pujari2014}. Moreover, the literature review made us aware of the limiting fact that images of fruits and vegetables are captured mainly from one direction~\citep{bhargava2018fruits}.The literature suggests that performance might improve, if more perspectives are taken into account. Moreover, the literature shows that different authors use different color spaces such as CIE Lab, RGB or HSI ~\citep{Liming2010, Garrido-Novell2012, Kondo2010}. This further inspired us to apply color quantization on our data. +Even though no specific paper was used as a guide for our project, some papers inspired us to try out certain algorithms, such as \acrshort{pca}~\citep{Vijayarekha2008, Zhu2007} or neural networks~\citep{Jhuria2013, Pujari2014}. Moreover, the literature review made us aware of the limiting fact that images of fruits and vegetables are captured mainly from one direction~\citep{bhargava2018fruits}. The literature suggests that performance might improve if more perspectives are taken into account. Moreover, the literature shows that different authors use different color spaces such as CIE Lab, RGB, or HSI~\citep{Liming2010, Garrido-Novell2012, Kondo2010}. This further inspired us to apply color quantization to our data. \bigbreak -As the available data was only sparsely labeled, further research was done to evaluate the use of a semi-supervised learning approach~\citep{olivier2006semi,zhu05survey}. Details about the corresponding literature can be found in~\autoref{sec:SemiSupervisedLearning}. In regards to deep learning-based approaches, classical neural networks -- such as AlexNet \citep{alexnet2012original}, \acrshort{vgg}16/\acrshort{vgg}19 \citep{vgg2014original}, GoogleNet \citep{googlenet2015original}, Capsule Networks \citep{capsulenet2017original}, DenseNet \citep{densenet2017original}, ResNet \citep{resnet2016original} or \acrfull{nin} \citep{lin2013network} -- were assessed for better understanding of the range of possible pre-trained networks and ideas for network structures. Also, classical computer vision approaches were considered, like multiclass \acrfullpl{svm} \citep{prakash2012multi}. +As the available data was only sparsely labeled, further research was done to evaluate the use of a semi-supervised learning approach~\citep{olivier2006semi,zhu05survey}. Details about the corresponding literature can be found in~\autoref{sec:SemiSupervisedLearning}.
Regarding deep learning-based approaches, classical neural networks -- such as AlexNet \citep{alexnet2012original}, \acrshort{vgg}16/\acrshort{vgg}19 \citep{vgg2014original}, GoogleNet \citep{googlenet2015original}, Capsule Networks \citep{capsulenet2017original}, DenseNet \citep{densenet2017original}, ResNet \citep{resnet2016original} or \acrfull{nin} \citep{lin2013network} -- were assessed to better understand the range of possible pre-trained networks and to gather ideas for network structures. Additional machine learning approaches were considered, such as multiclass \acrfullpl{svm} \citep{prakash2012multi}. diff --git a/documentation/report/Chapters/Summary.tex b/documentation/report/Chapters/Summary.tex index 720aa0e..cc4faad 100644 --- a/documentation/report/Chapters/Summary.tex +++ b/documentation/report/Chapters/Summary.tex @@ -1,28 +1,86 @@ +%---------------------------------------------------------------------------------------- +% SUMMARY OF RESULTS +%---------------------------------------------------------------------------------------- + \section{Summary of results} \label{ch:Summary} -The study project was conducted to investigate the state of the art of asparagus classification aiming at the development of approaches that could help to improve the current sorting algorithm implemented in the Autoselect ATS II at the asparagus farm Gut Holsterfeld. Data was collected, preprocessed and then analyzed with seven different approaches. Out of our 591495 images, roughly corresponding to 197165 different asparagus spears, 13271 images were collected by re-sorting with the machine and 13319 images were manually labeled by us. This labeled data is considered for the supervised approaches. The semi-supervised approach is in addition based on approximately equally many unlabeled images, 20000 in total. The unsupervised learning approaches are based on roughly 5500 images. The results illustrate that classifying asparagus is not a trivial problem. However, the results also show that it is possible to extract relevant features that might improve current sorting approaches. +The study project was conducted to explore and develop various approaches for asparagus classification with the aim of improving the current sorting performance of the Autoselect ATS II at the asparagus farm Gut Holsterfeld. Data was collected, preprocessed and then analyzed with seven different approaches. Out of our 591495 images, roughly corresponding to 197165 different asparagus spears, 13271 images were collected by re-sorting with the machine and 13319 images were manually labeled by us. This labeled data is used for the supervised approaches. The semi-supervised approach is additionally based on approximately as many unlabeled images, 20000 in total. The unsupervised learning approaches are based on roughly 5500 images.
+ +\begin{table}[!htb] +\centering +\resizebox{\columnwidth}{!}{% +\begin{tabular}{llllllll} + & Semi-supervised & Semi-supervised & PCA & Head & Color & Partial & Single-label \\ + & VAE & autoencoder & & network & histograms & angles & CNN \\ +\noalign{\smallskip} +\hline +\noalign{\smallskip} +Flower & & & 0.33 & 0.62 & & & 0.46 \\ +Rusty head & & & & 0.29 & & & 0.42 \\ +Bent & 0.16 & 0.28 & 0.2 & & & 0.72 & 0.66 \\ +Violet & 0 & 0 & 0 & & 0.59 & & 0.48 \\ +Rusty body & 0.49 & 0.67 & 0.29 & & 0.67 & & 0.69 \\ +Fractured & 0 & 0.67 & & & & & 0.91 \\ +Very thick & & & & & & & 0.93 \\ +Thick & 0 & 0 & & & & & 0.94 \\ +Medium thick & & & & & & & 0.8 \\ +Thin & 0.7 & 0.84 & & & & & 0.88 \\ +Very thin & & & & & & & 0.92 \\ +Hollow & & & 0.5 & & & & 0.63 \\ +Length & & & 1 & & & & \\ +Width & & & 0.83 & & & & \\ +Not classifiable & & & & & & & 0.6 +\end{tabular}% +} +\caption[Comparing F1 Scores for Features]{\textbf{Comparing F1 Scores for Features}~~~Here, the F1 scores of the different approaches are compared for the individual features. If an approach does not include a feature, the corresponding cell is empty.} + \label{tab:f1ScoresLarge} +\end{table} + +The results illustrate that classifying asparagus is not a trivial problem. However, the results also show that it is possible to extract relevant features that might improve current sorting approaches. \bigskip -For supervised learning, we employed \acrshortpl{mlp} and \acrshortpl{cnn}. Whereas the former were trained on sparse descriptions retrieved by high level feature engineering, the latter were directly trained on preprocessed images. They include networks for single-label classification as well as multi-label classification. +For supervised learning, we employed \acrshortpl{mlp} and \acrshortpl{cnn}. Whereas the former were trained on sparse descriptions retrieved by high level feature engineering, the latter were directly trained on preprocessed images. They include networks for single-label classification as well as multi-label classification. The feature engineering \acrshort{mlp} for curvature prediction and the single-label \acrshort{cnn} perform binary classification, whereas the \acrshort{cnn} for the multi-label approach as well as the network for head-related features perform multi-label classification. All approaches aim to solve the same image classification problem using supervised learning. -Each approach has drawbacks and benefits. The complexity and the requirement to specify many parameters has proven to be a disadvantage of relatively deep but also rather shallow \acrshortpl{cnn} as compared to e.g.\ \acrshortpl{mlp}. In contrast, stronger preprocessing or even feature engineering is required to successfully employ the latter. After all, however, the most important criterion to evaluate an approach is its predictive performance. +\begin{table}[!htb] + \centering + \resizebox{.70\linewidth}{!}{% + \begin{tabular}{lr} + {Approach} & (mean) F1 \\ + \noalign{\smallskip} + \hline + \noalign{\smallskip} + Partial angles & 0.72 \\ + Color histograms & 0.66 \\ + Multi-label CNN with binary cross-entropy loss & 0.67 \\ + Multi-label CNN with Hamming loss & 0.68 \\ + Multi-label CNN with custom loss & 0.65 \\ + Single-label CNN & 0.72 \\ + Head network & 0.46 \\ + Semi-supervised VAE & 0.19 \\ + Semi-supervised autoencoder & 0.41 \\ + PCA & 0.45 + \end{tabular}% + } + \caption[Mean F1 Score]{\textbf{Mean F1 Score}~~~The mean F1 score for each approach is displayed.
Note that a different set of features was selected for different approaches. Hence, the mean F1 score can only give a first impression of differences in performance. For details, see \autoref{tab:f1ScoresLarge}.} + \label{tab:f1ScoresSmall} +\end{table} -The heterogeneity of approaches with respect to the number of target categories and the variety of performance measures pose challenges for a direct comparison using the overall accuracies. Therefore, feature-wise evaluation appears most promising. As the distribution of some features (e.g.\ violet) has proven to be very unbalanced in our data set, even high accuracies might relate to poor predictions (e.g.\ when the feature is never detected). Hence, feature-wise accuracies are only a coarse indicator of the model’s performance that may nonetheless give insights where difficulties lie and what features are more difficult to determine than others. However, for some promising approaches we computed the sensitivity and specificity per feature to reveal a more fine-grained picture of the predictive performance. +The heterogeneity of approaches with respect to the number of target categories and the variety of performance measures pose challenges for a direct comparison using the overall accuracies. Therefore, feature-wise evaluation appears most promising. As the distribution of some features (e.g.\ violet) has proven to be very unbalanced in our data set, even high accuracies might relate to poor predictions (e.g.\ when the feature is never detected). Hence, feature-wise accuracies are only a coarse indicator of the model’s performance that may nonetheless give insights into where difficulties lie and which features are more difficult to determine than others. However, for some promising approaches we computed the sensitivity and specificity per feature to reveal a more fine-grained picture of the predictive performance. Further, F1 scores were calculated for each approach. For the feature-wise approaches, the F1 score was calculated for each feature individually (see \autoref{tab:f1ScoresLarge}) as well as averaged into a mean value for easy comparison with the other approaches (see \autoref{tab:f1ScoresSmall}). The best overall performance was reached by the single-label \acrshort{cnn} with an F1 score of 0.72. Additionally, this approach reached the best feature-wise performances for rusty head, rusty body, fractured and hollow as well as all features related to the width of the asparagus. The best results for the feature violet were reached by the color histograms (0.59) and the best results for the feature flower by the dedicated head network (0.62). Detailed summaries of the results of each approach can be found in the following. \bigskip -In the single-label \acrshort{cnn}, very good results are achieved for features relying on the thickness and length of the asparagus (see \autoref{subsec:SingleLabel}). +In the single-label \acrshort{cnn}, very good results are achieved for features relying on the thickness and length of the asparagus (see \autoref{subsec:SingleLabel}).
All of these features achieve a balanced accuracy above 90\%, with best results for the feature very thick (98\% sensitivity and 99\% specificity). Of the solely hand-labeled features, the feature hollow shows the best performance (77\% sensitivity and 98\% specificity). The feature rusty head has the worst performance (52\% sensitivity and 81\% specificity). -The multi-label approach has an overall accuracy of 75\% (see \autoref{subsec:MultiLabel}). The performance of the \acrshort{cnn} for the two head-related features is indicated by sensitivity and specificity values. Flower detection reaches 55\% sensitivity and 95\% specificity while rusty head detection attains only 19\% sensitivity at 98\% specificity. The overall accuracy of the multi-label \acrshort{cnn} approach reaches up to 87\%. For this model, accuracies are not calculated per feature. +The multi-label approach for head-related features has an overall accuracy of 75\% (see \autoref{subsec:MultiLabel}). Its performance is further indicated by sensitivity and specificity values. Flower detection reaches 55\% sensitivity and 95\% specificity, while rusty head detection attains only 19\% sensitivity at 98\% specificity. The multi-label \acrshort{cnn} approach with a binary cross-entropy loss reaches an overall accuracy of 87\%, a specificity of 91.41\%, and a sensitivity of 67.27\%. For this model, accuracies are not calculated per feature. -In contrast, feature-wise accuracies for binary classification can be reported (see \autoref{subsec:FeatureEngineering}). The same holds for feature-wise performance measures that were calculated for some of the other approaches. The feature engineering based approaches show good results on all of its three detected features, namely for bent (82\% sensitivity, 67\% specificity) and similarly for violet detection (62\% sensitivity and 96\% specificity) as well as for rusty body (71\% sensitivity, 65\% specificity). +In contrast, feature-wise accuracies for binary classification can be reported (see \autoref{subsec:FeatureEngineering}). The same holds for feature-wise performance measures that were calculated for some of the other approaches. The feature engineering based approaches show good results on all three of their detected features, namely for bent (82\% sensitivity, 67\% specificity) and similarly for violet detection (62\% sensitivity and 96\% specificity) as well as for rusty body (71\% sensitivity, 65\% specificity). \bigskip -The unsupervised learning approaches, namely \acrshort{pca} and the convolutional autoencoder, both deal with dimension reduction. Both were trained on the sample set for which labels are available. While the classification method based on \acrshort{pca} targets at binary feature prediction (absence or presence of a feature) (see \autoref{subsec:PCA}), the unsupervised autoencoder does not predict labels (see \autoref{subsec:Autoencoder}). The accuracy of \acrshort{pca} is promising for length and hollow (100\%) but extremely poor for bent detection (20\%). It has to be mentioned that only very few samples were used for training and evaluation of the named approach. As such it is yet to be proven whether or not these results generalize. +The unsupervised learning approaches, namely \acrshort{pca} and the convolutional autoencoder, both deal with dimension reduction.
While the classification method based on \acrshort{pca} targets binary feature prediction (absence or presence of a feature; see \autoref{subsec:PCA}), the unsupervised autoencoder does not predict labels (see \autoref{subsec:Autoencoder}). The accuracy of \acrshort{pca} is promising for length (100\%) and width (sensitivity 100\%, specificity 60\%) but extremely poor for violet detection (sensitivity 0\%). It has to be mentioned that only very few samples were used for training and evaluation of this approach. As such, it is yet to be proven whether or not these results generalize. \bigskip A semi-supervised learning method was based on a partially labeled data set (see \autoref{subsec:VariationalAutoencoder}). A semi-supervised autoencoder and a semi-supervised variational autoencoder perform multi-label classification. The simpler semi-supervised autoencoder performs better. Unfortunately, the predictive power is still rather poor; it is best for fractured (57\% sensitivity, 100\% specificity) and worst for violet, as no violet spears are detected (0\% sensitivity). \bigskip -In the last approach, a random forest model that predicts the class labels based on the annotated features instead of features as the aforementioned approaches, delivers an average accuracy of about 75\%. Our analysis shows that the model recalls some class labels like I A Anna or Hohle more reliably than class labels like II A or II B. \ No newline at end of file +In the last approach, a random forest model, which predicts the class labels based on the annotated features rather than predicting the features themselves as the aforementioned approaches do, delivers an average accuracy of about 75\%. Our analysis shows that the model recalls some class labels like I~A~Anna or Hohle more reliably than class labels like II~A or II~B.
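+
+For reference, the performance measures reported above are used in their standard form; the definitions are restated here only for the reader's convenience and are not specific to our implementation. With $TP$, $FP$, $TN$, and $FN$ denoting the number of true positive, false positive, true negative, and false negative predictions for a given feature,
+\[
+\mathrm{sensitivity} = \frac{TP}{TP+FN}, \qquad
+\mathrm{specificity} = \frac{TN}{TN+FP}, \qquad
+\mathrm{precision} = \frac{TP}{TP+FP},
+\]
+\[
+F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{sensitivity}}{\mathrm{precision} + \mathrm{sensitivity}},
+\]
+and the balanced accuracy is the mean of sensitivity and specificity.
+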
\ No newline at end of file diff --git a/documentation/report/asparagus-report.pdf b/documentation/report/asparagus-report.pdf index c92020f..e83304c 100644 Binary files a/documentation/report/asparagus-report.pdf and b/documentation/report/asparagus-report.pdf differ diff --git a/documentation/report/bibliography.bib b/documentation/report/bibliography.bib index 77831c4..fac8c5f 100644 --- a/documentation/report/bibliography.bib +++ b/documentation/report/bibliography.bib @@ -256,6 +256,7 @@ @inproceedings{caruana2006comparison publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/1143844.1143865}, +urldate = {2020-03-25}, doi = {10.1145/1143844.1143865}, booktitle = {Proceedings of the 23rd International Conference on Machine Learning}, pages = {161–168}, @@ -805,8 +806,8 @@ @misc{wiki:precisionrecall author = "{Wikipedia contributors}", title = "Precision and recall --- {Wikipedia}{,} The Free Encyclopedia", year = "2020", - url = "https://en.wikipedia.org/w/index.php?title=Precision_and_recall&oldid=945184172", - note = "[Online; accessed 21-April-2020]" + url = {https://en.wikipedia.org/w/index.php?title=Precision_and_recall&oldid=945184172}, + urldate = {2020-04-21} } @online{hassan2018vgg, @@ -893,6 +894,7 @@ @book{szeliski2010computer @article{LeCun2015, doi = {10.1038/nature14539}, url = {https://doi.org/10.1038/nature14539}, + urldate = {2020-04-24}, year = {2015}, publisher = {Springer Science and Business Media {LLC}}, volume = {521}, @@ -924,6 +926,7 @@ @article{Zhu2007 Pages = {741 - 749}, Title = {Gabor feature-based apple quality inspection using kernel principal component analysis}, Url = {http://www.sciencedirect.com/science/article/pii/S0260877407000489}, + urldate = {2020-04-24}, Volume = {81}, Year = {2007}, } @@ -940,7 +943,8 @@ @inproceedings{Jhuria2013 Pages = {521-526}, Title = {Image processing for smart farming: Detection of disease and fruit grading}, Year = {2013}, - URL = {https://doi.org/10.1109/ICIIP.2013.6707647} + URL = {https://doi.org/10.1109/ICIIP.2013.6707647}, + urldate = {2020-04-25} } @@ -954,7 +958,8 @@ @article {Pujari2014 volume = {17}, number = {2}, Pages = {29 - 34}, - url = {https://content.sciendo.com/view/journals/ata/17/2/article-p29.xml} + url = {https://content.sciendo.com/view/journals/ata/17/2/article-p29.xml}, + urldate = {2020-04-25} } @@ -967,7 +972,8 @@ @article{Liming2010 Note = {Special issue on computer and computing technologies in agriculture}, Pages = {S32 - S39}, Title = {Automated strawberry grading system based on image processing}, - Url = {http://www.sciencedirect.com/science/article/pii/S016816990900204X}, + url = {http://www.sciencedirect.com/science/article/pii/S016816990900204X}, + urldate = {2020-04-25}, Volume = {71}, Year = {2010}, } @@ -983,6 +989,7 @@ @article{Garrido-Novell2012 Pages = {281 - 288}, Title = {Grading and color evolution of apples using RGB and hyperspectral imaging vision cameras}, Url = {http://www.sciencedirect.com/science/article/pii/S0260877412002701}, + urldate = {2020-04-25}, Volume = {113}, Year = {2012}, } @@ -998,7 +1005,19 @@ @article{Kondo2010 Pages = {145 - 152}, Title = {Automation on fruit and vegetable grading system and food traceability}, Url = {http://www.sciencedirect.com/science/article/pii/S0924224409002611}, + urldate = {2020-04-26}, Volume = {21}, Year = {2010} } +@inproceedings{ronneberger2015u, + title={U-net: Convolutional networks for biomedical image segmentation}, + author={Ronneberger, Olaf and Fischer, 
Philipp and Brox, Thomas}, + booktitle={International Conference on Medical image computing and computer-assisted intervention}, + pages={234--241}, + year={2015}, + organization={Springer} +} + + +