-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathNotCoveredPathExample.tex
57 lines (46 loc) · 5.49 KB
/
NotCoveredPathExample.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
\subsection{DFA Coverage Example}
\label{sec:coverageexample}
Bug reports related to regular expressions abound.
A search for ``regex OR regular expression'' in GitHub yields over 555,000 issues, with 22\% of those still being open.
One in particular illustrates how coverage metrics on the DFA could have brought a particular bug to the developer's attention sooner. This bug report\footnote{\url{https://github.com/maven-nar/nar-maven-plugin/issues/228}} describes an issue with the regular expression \verb!\d+\.d+! in the NAR plugin for Maven.
Figure~\ref{fig:bug} shows the DFA of this regular expression built using RE2~\cite{re2}, and we take this opportunity to describe the DFA notation used throughout this paper\footnote{The regular expression in the bug is triggered by {\tt Matcher.find()} with a ManyMatch DFA. For simplicity, we show the FullMatch DFA, a subgraph of the ManyMatch.}.
Node~0 is the start-state, indicated by the incoming arrow.
Nodes with double-circles are accept states, such as Node~4.
Node~e is the error state, denoting a mismatch.
The edges are labeled with transitions, often using syntactic sugar for ease of interpretation.
The edge $\overrightarrow{01}$ is traversed when a digit from \verb!0-9! is read.
If any other character is read at Node~0, (i.e., \verb!not 0-9!), edge $\overrightarrow{0e}$ is traversed.
There is a self-loop on Node~1 for digits \verb!0-9!.
If the period character is read from Node~1, then edge $\overrightarrow{12}$ is traversed.
%If any other character is read (i.e., not `\verb!/!), then edge $\overrightarrow{21}$ is traversed to Node~1. Since Node~1 is an accept state, any other character read from Node~1 reaches the error state, Node~e.
In RE2, when reading an input string, byte~{\tt [256]}, is added as a text-end marker. %, to mark the end of processing.
For example, the input string ``0.0'' is transformed to the byte stream {\tt [48 46 48 256]}, as {\tt [48]} is the byte for `0', {\tt [46]} is for `.', and {\tt [256]} marks the end of the string. Byte [256] is matched on edges `\verb![0-256]!', `\verb!not 0-9!', `\verb!not d!', or `\verb!any except 0-9 and .!'.
%\todo{replace `[0-256]` in the DFA with `any`; I think it's best not to explain what 256 means at this point}
%\adh{I would delete this paragraph and simply go with the last paragraph: the issue is not testing one of the branches, our approach would have fixed that. Therefore the problem we are talking about exists in the wild at least sometimes.}
%With the regular expression in Figure~\ref{fig:bug}, input ``/root'' is rejected when it should be accepted, traversing $0 \rightarrow 2 \rightarrow 1 \rightarrow e$.
%The string ``///'' is accepted should actually be rejected, traversing $0 \rightarrow 2 \rightarrow 2 \rightarrow 2 \rightarrow 1$. This final transition from $2 \rightarrow 1$ occurs because the RE2 tool that built the DFA reads a final byte (byte [256], to be exact) following the last character in the input string to signify the end of processing. The same though would have occurred for ``/root'', except the error state was reached when the first `o' was read, thus no further input characters were processed.
%and As a result, the urlPattern ``/*'' matches an empty url and does not match the normal url.
%The traversed path of these three inputs are , , .
\begin{figure}[t]
\centering
\includegraphics[width=0.48\textwidth]{figures/bug4}
\vspace{-12pt}
\caption{Full-match DFA for regular expression: {\tt \textbackslash d+\textbackslash .d+}}
\label{fig:bug}
\vspace{-12pt}
\end{figure}
The bug report mentions that the regular expression \verb!\d+\.d+! is buggy and the patch adds an escape before the second d, \verb!\d+\.\d+!. The intended behavior is to match input strings with one or more digits, followed by a period, followed by one or more digits.
In this work, the structural metrics could reveal this fault.
With the DFA in Figure~\ref{fig:bug}, when Node~3 is reached, the fault may be revealed.
Input "0.d" traverses $0 \rightarrow 1 \rightarrow 2 \rightarrow 3 \rightarrow 4$ and ends in an accept state, when it should fail. However, input "0.d3" traverses $0 \rightarrow 1 \rightarrow 2 \rightarrow 3 \rightarrow 4 \rightarrow e$ and ends in an error state, as expected.
Covering edge $\overrightarrow{2e}$ may also reveal the fault; input "2.3" traverses $0 \rightarrow 1 \rightarrow 2 \rightarrow e$ and ends in and error state, when it should be accepted.
Requiring coverage of all feasible nodes and edges could have revealed this fault in the regular expression.
%That is, input "0" traverses $0 \rightarrow 1 \rightarrow e$, the expected behavior, and the fault is not revealed.
%
%the problematic situation when the empty string ``'' is accepted while should be rejected.
%The empty string is read as simply {\tt [256]}, traversing $\overrightarrow{01}$ to the accept state.
%Without testing the empty string, the edge $\overrightarrow{01}$ would be uncovered.
%Presumably, in this case, failure to test the empty string led to a bug report.
%%According to the bug report, the fix for this regular expression is to replace \verb!/*! with \verb!/.*+!, however, fixing the regular expression is not the goal of the work.
%The goal of this work is to identify uncovered edges and nodes, such as $\overrightarrow{01}$.
As with code coverage, uncovered artifacts alert the programmer to untested behavior. Such coverage information can indicate that a regular expression is not well tested and for some inputs it may not behave as intended, as is the case here.