rq1.tex

\section{Bug Report Study (RQ1)}
\label{sec:rq1}
In this section, we present four common types of regular expression.related bugs. We describe the bug causes and its impacts of each bug type. At the end of discussing each type, we propose some possible solutions to them.

\subsection{Artifacts}
 To explore the impact of buggy regular expressions, we explored bug databases for large bug reporting systems, specifically Bugzilla, GitHub, and Apache JIRA.
  We filtered bugs related to regular expressions with searching keywords "regular expression", "regex", and "regexp". 
  There are \todo{num} bugs collected, and we randomly selected \todo{num} to analyze. The goal was to determine the extent to which low coverage poses a problem for regular expressions. 
  
  \subsection{Analysis}
  For each bug report, we \todo{describe your process}

\subsection{Results}
\subsubsection{Catastrophic Backtracking}
Backtracking is the most difficult type of regular expression.related bugs. Backtracking happens in the regular expression engines of Nondeterministic Finite Automaton (NFA) implementation. When NFA decides the states it should take, it tries one state a time. If the rest matching process arrives an error state, it backtracks to where it choose the state and selects another state to try.

\noindent {\textbf{Impact: }} Backtracking usually takes exponential time to complete. It also consumes a lot of computing resources (e.g., CPU and Memory) and therefore slows down the program responses~\cite{stackoverflow}. Sometimes, backtracking in regular expression could even cause crashes of systems and softwares (e.g., browsers, websites, etc)~\cite{browser}. It can be exploit for ReDos attacks as well.

\noindent {\textbf{Solutions: }} a) Use a regular expression engine implemented in DFA. While the NFA implementation takes regular expression engines exponential time to finish, the DFA version takes linear time. However, it is not always feasible to choose regular expression engines. b) Avoid the regular expression features which causes backtracking. Backtracking is only used for some of the regular expression engines. Finding equivalent regular expression without these features could avoid the backtracking.

\subsubsection{Strings of zero length and long string} If a valid regular expression is given to match a zero-length string, some regular expression engines will try to read a character infinitely. This problem is manifested in Bug 374923\todo{add}. If the regular expression is given a very long string, its execution time is linear to the string length in DFA implemented regular expression engine. This situation is common in regular expression containing wildcard * and +. Take the regular expression "a+b" as an example. The matching for string "ab" 
\noindent {\textbf{Impact: }} Similar to backtracking, strings of zero length could cause system hangs.
\noindent {\textbf{Solutions: }}
\subsubsection{Escape Characters}
\noindent {\textbf{Impact: }}
\noindent {\textbf{Solutions: }}
For the characters which have dual meaning in the regular expression.
\subsubsection{Case sensitivity}
\noindent {\textbf{Impact: }}
\noindent {\textbf{Solutions: }}
Need test cases to test its sensitivity to match its regular expression purpose.
\subsubsection{Special Characters}
\noindent {\textbf{Impact: }}
\noindent {\textbf{Solutions: }}
Some regular expression may accept unexpected characters because the composition of regular expression does not consider it. Some paths are not explored during testing. low coverage.
Some regular expression reject characters but they should accept them.

The sample should consider as many samples as possible. Although the regular expression may have a 100% coverage, we should also analyze the rejected strings. The weakness of strings generated by regular expression tools is that these strings are usually constructed to match the regular expression, not to mismatch it.

Another solution is collecting similar regular expressions of the same purpose.
\subsubsection{Match pattern and alternation order}
\noindent {\textbf{Impact: }}
\noindent {\textbf{Solutions: }}
Match pattern matters. \todo{add an example about first match and longest match}
alternation order matters. h|s|m|ms or h|s|ms|m----------what is the results of First match longest match for them.