%\begin{figure}[th]
% \centering
% \includegraphics[width=0.5\textwidth]{figures/bugfirstMatch}
% \vspace{-6pt}
% \caption{DFA graph generated by RE2~\cite{re2} for the regular expression \texttt{[0-9]+(h|m|s|ms)}. \todo{rename -1 to E}}
% \label{fig:firstMatch}
% \vspace{-6pt}
%\end{figure}~\begin{figure}[th]
% \centering
% \includegraphics[width=0.5\textwidth]{figures/bugfullMatch}
% \vspace{-6pt}
% \caption{DFA graph generated by RE2~\cite{re2} for the regular expressions \texttt{[0-9]+(h|ms|s|m)} and \texttt{[0-9]+(h|m|s|ms)\$}. \todo{rename -1 to E}}
% \label{fig:fullMatch}
% \vspace{-6pt}
%\end{figure}
\section{Background and Motivation}
\label{sec:motive}
To facilitate a precise description of the work, we define regular expressions more formally.
A regular expression is a sequence of characters that defines a search pattern.
The set of strings matched by the regular expression is a language.
That is, a regular expression $R$ represents a language $L(R)$ over an alphabet $\Sigma$, where $L(R)$ is a (possibly infinite) set of strings.
For a given language, there are many regular expressions that can describe it.
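For example, over $\Sigma = \{\texttt{a}, \texttt{b}\}$, the regular expressions \texttt{a+b} and \texttt{aa*b} both describe the same language
\[ L = \{\, \texttt{a}^n\texttt{b} \mid n \geq 1 \,\}. \]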
A regular expression can be represented as a string of tokens or as a finite state automaton in deterministic (DFA) or non-deterministic (NFA) form.%, or a parse tree.
In this work, we propose fine-grained test coverage metrics over the DFA representing a regular expression. Before developing these metrics, we conduct two informal explorations to assess the feasibility and potential impact of the idea. First, we test regular expressions collected from an existing Python dataset~\cite{chapman2016} for regularity, in the sense of formal language theory~\cite{sipser2006introduction}. Second, since our coverage metrics are based on the structure of the DFA, we present a motivating example showing that faults can lie along untested DFA paths.
\subsection{How Regular are Regular Expressions?}
\label{regularregularexpressions}
Regular expressions in source code often contain non-regular features, such as backreferences. An example is the regular expression \verb!([a-z]+)\1!, which matches a repeated word in a string, such as ``appleapple''. Because the language it describes is not regular, no DFA can be built for it. For regular expressions in source code that are indeed regular, we can build DFAs and measure coverage based on a test suite.
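For illustration, the following minimal Java sketch (ours; Java is also the language instrumented later in this study) shows that the \texttt{java.util.regex} engine accepts this pattern via backtracking even though no DFA exists for it:
\begin{verbatim}
import java.util.regex.Pattern;

public class BackrefDemo {
    public static void main(String[] args) {
        // Group 1 captures a word; \1 requires that same word to repeat.
        Pattern repeat = Pattern.compile("([a-z]+)\\1");
        System.out.println(repeat.matcher("appleapple").matches()); // true
        System.out.println(repeat.matcher("applepear").matches());  // false
    }
}
\end{verbatim}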
Here, we are testing how many of the regular expressions in the wild are truly regular.
%\adh{These first two paragraphs can be combined and shortened to just say some languages allow non-regular expressions, we need to be able to build a DFA, we're testing how many of the regexes in the wild are truly regular.}
%Regular expression libraries in modern programming languages can contain non-regular features, such as backreferences.
%This means that a DFA cannot be constructed to represent it. An example is the regular expression, \verb!([a-z]+\1)!, which matches repeat words in a string, such as ``appleapple". Building an NFA or DFA to recognize such a pattern is infeasible; it requires a push-down automata (PDA). These features may improve the usability of the libraries, but they allow developers to define regular expressions that are not {\em regular} in terms of formal language theory.
%
%For regular expressions in source code that are indeed regular, we can build DFAs and measure coverage based on a test suite.
%However, if only a small percentage of regular expressions in the wild are in fact regular, the potential for impact of code coverage metrics that depend on DFAs is limited.
We explore an existing, publicly available dataset of 13,597 regular expressions scraped from Python projects on GitHub. To test for regularity, we take an empirical approach: the ability to build a DFA for a regular expression implies that it is regular from a formal language theory perspective~\cite{sipser2006introduction}.
Of the 13,597 Python regular expressions, 13,029 (95.8\%) are regular in that we successfully built a DFA for each using the RE2~\cite{re2} regular expression processing engine. We investigated each of the remaining 568 by hand. One regular expression was removed because its repetition bound exceeds the RE2 limits; while it may indeed be regular, we conservatively mark it as non-regular. An additional 81 contained comments within the regular expression, which RE2 does not support, so these were also conservatively treated as non-regular; 128 contained unsupported characters. The remaining 358 were non-regular because they contained backreferences.
In the end, with nearly 96\% of the regular expressions being regular (a conservative lower bound), we conclude that most regular expressions found in the wild are regular and thus can be modeled with DFAs. %Thus, the potential for impact was high enough to justify further exploration.
\input{NotCoveredPathExample}
\iffalse
\begin{figure}[t]
\centering
\includegraphics[width=0.35\textwidth]{figures/bugnfa}
\vspace{-6pt}
\caption{Finite automata generated for the regular expression \texttt{[0-9]+(h|m|s|ms)}.}
\label{fig:nfa}
\vspace{-6pt}
\end{figure}
\subsection{Are Non-covered Portions of a DFA Problematic?}
Bug reports related to regular expressions abound. One in particular illustrates how our proposed coverage metrics could have brought a bug to the developer's attention sooner. This bug report\footnote{\url{https://github.com/ebu/ebu-tt-live-toolkit/issues/67}} describes the desired matching behaviors of the regular expression \verb![0-9]+(h|s|m|ms)!. Figure~\ref{fig:nfa} shows the finite automaton, and we take this opportunity to describe the DFA notation used throughout this paper.
Node~0 is the start state, indicated by the incoming arrow. Nodes with double circles are accept states, such as Node~3. The edges are labeled with transitions, often using syntactic sugar for ease of interpretation. The edge $\overrightarrow{01}$ is traversed when a character in the range \verb![0-9]! is read. If any other character is read (i.e., \verb!not 0-9!), the error state (not shown) is reached. There is a self-loop on Node~1 for characters in the range \verb![0-9]!, and an edge $\overrightarrow{13}$ for when \verb!h!, \verb!m!, or \verb!s! is read.
With the regular expression in Figure~\ref{fig:nfa}, the inputs ``9m'' and ``9ms'' are accepted by Python's regular expression matching library, with ``9m'' as the extracted string; this is correct for the former input but not the latter. It also accepts ``9mh'', which should be rejected.
Each of these inputs will traverse nodes $0 \rightarrow 1 \rightarrow 3$, and then stop.
%The reason lies in the recursive backtracking implementation of Python's {\tt re} engine.
It interprets the priorities of `h', `s', `m', and ``ms'' as the order of their appearance in the regular expression, short-circuiting the evaluation of the entire string.
%This buggy regular expression could also be explicated by a regular expression engine of DFA implementation. Figure~\ref{fig:firstMatch} shows the DFA graph for the same regular expression. Since the expected matching results for ``9m", ``9ms" and ``9mh" is different, they should cover different nodes and edges of the graph. The matching to ``9m" follows transitions 0$\rightarrow$1$\rightarrow$2$\rightarrow$3, while the latter two follow the same transitions 0$\rightarrow$1$\rightarrow$2$\rightarrow$3$\rightarrow$-1.
The fix for this regular expression is to switch the order of ``ms'' and `m', or to change the matching from the beginning of the input\footnote{In Python, {\tt re.match} by default matches from the beginning of the input rather than requiring the entire string to match.} to matching from the beginning to the end of the input. However, fixing the regular expression is not the goal of this work. The goal of this work is to identify edges $\overrightarrow{11}$, $\overrightarrow{12}$, and $\overrightarrow{23}$, and Node~2 as uncovered, alerting the programmer to a possible issue with the regular expression. Such coverage information can indicate when a regular expression is not behaving as intended, as is the case here. The expected path for input ``9ms'' would be $0 \rightarrow 1 \rightarrow 2 \rightarrow 3$, whereas the actual path is $0 \rightarrow 1 \rightarrow 3$.
\fi
%In this work, we define and explore coverage metrics for regular expressions. By adopting standard graph-based coverage metrics from the literature, we model regular expressions as DFAs and explore the coverage levels of regular expressions found in GitHub projects.
%Figure~\ref{fig:fullMatch} shows that ``9m", ``9ms" and ``9mh" have transitions different from each other. ``9m" follows transitions 0$\rightarrow$1$\rightarrow$3$\rightarrow$4; ``9ms" follows transitions 0$\rightarrow$1$\rightarrow$3$\rightarrow$5$\rightarrow$4; and ``9mh" follows transitions 0$\rightarrow$1$\rightarrow$3$\rightarrow$4$\rightarrow$-1.
\input{coverage}
\section{Research Questions}
\label{sec:rq}
To explore the potential of using fine-grained test coverage metrics over a DFA for regular expressions, we evaluate the following research questions.
%
%\noindent {\textbf{RQ1:}} {\em What is the impact of under-tested regular expressions?}
%
%\noindent To answer RQ1, we explore publicly available bug reports related to regular expression. By analyzing \todo{number} bug reports and the implicated code, we categorize the impacts of the bug on the software functionality and performance and propose solutions to fix them.
\noindent {\textbf{RQ1:}} {\em How well are regular expressions tested on GitHub?}
\noindent To answer RQ1, we identify 1,225 Java projects that have analyzable regular expressions and existing test suites covering the regular expressions. From these, we extract 15,096 regular expressions and 899,804 total test input strings, measuring NC, EC, and EPC for each regular expression.
To obtain the regular expressions and the strings matched against them during testing, we use the Java bytecode manipulation framework Javassist~\cite{javassist} to record both whenever a pattern matching method is triggered by a test case.
%After eliminating the regular expressions unsupported by RE2~\cite{re2}, we compute coverage based on the strings collected during instrumentation. %We also explore other aspects of regular expressions, such as the percentage of unsupported features, the proportion of pattern mismatching in the total pattern usage, and the replacement of regular expressions with simple string operations.
\noindent {\textbf{RQ2:}} {\em How well can existing regular expression string generation tools improve the test coverage of regular expressions?}
\noindent Using the regular expressions from RQ1, we generate test strings using Rex~\cite{rex} and calculate the regular expression coverage, comparing it to the coverage of the user-defined test suites from RQ1.
Using Rex, we generate test suites of three sizes, one to match the size of the user-defined test suites from the GitHub projects, one 5x that size, and one 10x that size.
%Using Rex, we generate test suites of three sizes, one to match the size of the user-defined test suites from the GitHub projects so not to bias the results based on test suite size~\cite{coveragetestsuitecorrelation},\adh{the title at least of this reference suggests that coverage does not have an effect on performance, so why would bias occur?} one 5x the size of the user-defined test suite, and one 10x the size of the user-defined test suite, so not to bias against the Rex tool.
%\footnote{The number of accepted strings for some restricted regular expressions is less than 100.}.
%This is because some regular expressions are very restricted and only accept a small number of strings while some other regular expressions are very generic and their accepted strings are innumerable.
%Similar to RQ1, we feed the pairs of regular expressions and their strings into the process of coverage calculation.
By comparing the coverage statistics from RQ2 to those from RQ1, we evaluate the test coverage achievable using off-the-shelf tools.
%As the research questions operate over different data sets and with different instrumentation, we present each in turn.
%Because some strings may not increase the coverage of regular expressions at all, we also discuss the least number of strings to achieve the same coverage in RQ3. The results could be a hint for a better method of string compositions in regular expression tools.
%\noindent {\textbf{RQ4:}} {\em How likely is a regular expression regular?}
%To answer this question, we calculate the percentage of regular expressions based on which we can successfully build the DFA graphs. For the regular expressions on which we cannot successfully build a DFA, they are not regular. Besides the regularity, we also explore the density of DFA graphs, and the impacts of different matching types to the construction of DFA graph.
\section{Study}
\label{sec:method}
To address RQ1, applying the coverage metrics defined in Section~\ref{coveragemetrics} to
regular expressions from the wild requires (1) instrumentation to capture the regular expressions and strings matched against them (Section~\ref{rq1instrumentation}), (2) a tool to measure coverage given a regular expression and a set of strings (Section~\ref{subsec:cov}), and (3) a large corpus of projects with regular expressions and test suites that execute the regular expressions (Section~\ref{rq1:artifacts}).
%To achieve (1), we found 1,665 Java projects on GitHub that contain regular expressions and test suites with Maven (Section~\ref{rq1:artifacts}). For (2), we use the Java bytecode manipulation framework Javassist~\cite{javassist} to capture regular expressions and strings matched against them during runtime (Section~\ref{rq1instrumentation}). For (3), we modify the RE2~\cite{re2} regular expression engine to extract the DFA representation and measure coverage (Section~\ref{subsec:cov}).
To address RQ2, we use the Rex~\cite{rex} tool to generate input strings for the regular expressions in our study (Section~\ref{rq2:artifacts}).
%The attribute \textit{Projects} represents the number of Java Projects in which the regular expression is used. The attributes of \textit{total inputs}, \textit{unique inputs}, and \textit{projects} indicates the existence of regular expressions used in third-party libraries. They are the noisy data which we will detect and remove from the dataset for coverage analysis. \todo{what does unique inputs mean?} \todo{what does Projects mean?}
\subsection{Instrumentation}
\label{rq1instrumentation}
This section describes our approach to collecting regular expressions from GitHub projects and the strings evaluated against the regular expressions during testing. % during GitHib projec of program instrumentation to obtain the regular expressions in a project and their input strings for pattern matching during the execution of unit tests.
\subsubsection{Instrumented Functions}
\label{sec:instrumentedfunctions}
There are different types of matching between a regular expression and a string. The Java function \emph{Pattern.matches} requires the regular expression to match the string from its beginning to its end; Python's \emph{re.match} requires a match only from the beginning of the string, not necessarily to its end; and the C\# function \emph{Regex.Match} requires only that some substring of the input match. We call these \emph{FullMatch}, \emph{FirstMatch}, and \emph{ManyMatch}, respectively.
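The following minimal Java sketch (ours, for illustration) contrasts the three behaviors; {\tt lookingAt()} gives Java the \emph{FirstMatch} semantics of Python's {\tt re.match}, and {\tt find()} gives \emph{ManyMatch} semantics:
\begin{verbatim}
import java.util.regex.Pattern;

public class MatchTypes {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("[a-z]+");
        // FullMatch: the entire input must match.
        System.out.println(p.matcher("abc123").matches());   // false
        // FirstMatch: match anchored at the beginning only.
        System.out.println(p.matcher("abc123").lookingAt()); // true ("abc")
        // ManyMatch: any substring may match.
        System.out.println(p.matcher("123abc").find());      // true ("abc")
    }
}
\end{verbatim}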
%There are two types of matching between a regular expression and a string: \emph{FullMatch} and \emph{ManyMatch}. \emph{FullMatch} requires the entire string to match the regular expression while \emph{ManyMatch} could find all substrings which match the regular expression.
In this work, we consider only \emph{FullMatch} matching in Java projects.
The Java functions with FullMatch semantics are:
%\todo{how do you identify if a a full match or many match should be used? fullmatch can be achieved with end-point anchors - do you require the presence of endpoint anchors?}The pattern matching functions we instrumented are:
\begin{itemize}
\item \texttt{java.lang.String.matches(String regex)}
\item \texttt{java.util.regex.Matcher.matches()}
\item \texttt{java.util.regex.Pattern.matches(String regex,\\ CharSequence input)}
%\item \texttt{public boolean String.matches(String regex)}
%\item \texttt{public boolean Matcher.matches()}
%\item \texttt{public static boolean Pattern.matches(String regex, CharSequence input)}
\end{itemize}
\iffalse
\begin{itemize}
\item \texttt{public boolean java.lang.String.matches(String regex)}
\item \texttt{public boolean java.util.regex.Matcher.matches()}
\item \texttt{public static boolean java.util.regex.Pattern.matches(String regex, CharSequence input)}
\end{itemize}
\fi
\noindent In these functions, the entire string is required to match the regular expression~\cite{friedl2002mastering}. Thus, a regular expression behaves identically with and without end-point anchors (i.e., \verb!^! and \verb!$!).
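A two-line check (our illustration) confirms this:
\begin{verbatim}
public class AnchorDemo {
    public static void main(String[] args) {
        // Under matches(), explicit end-point anchors are redundant:
        System.out.println("abc".matches("[a-z]+"));    // true
        System.out.println("abc".matches("^[a-z]+$"));  // true, same result
    }
}
\end{verbatim}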
%We instrumented these functions and recorded the parameters of pattern and string.
\subsubsection{Bytecode Manipulation}
Our instrumentation is built on top of the Java bytecode manipulation framework Javassist~\cite{javassist}, which can dynamically change class bytecode in the JVM. All projects are run on JDK~1.7. First, we built a Java instrumentation library, packaged as a {\em jar} file, to intercept the invocations of FullMatch functions in Java. For each invocation, we collect information about the regular expression itself, its location in the code, and any strings matched against it during test suite execution. The strings matched against a regular expression are referred to as its \emph{input strings} or \emph{test inputs}; the set of all input strings matched against a regular expression is called $S$.
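The sketch below illustrates such interception with Javassist (our illustration, not the study's actual library); in practice the transformation runs inside a {\tt -javaagent} class-file transformer, and instrumenting a JDK class such as {\tt Pattern} additionally requires the agent to handle boot-classpath classes:
\begin{verbatim}
import javassist.ClassPool;
import javassist.CtClass;
import javassist.CtMethod;

// Inject a logging statement into Pattern.matches(String, CharSequence).
public class RegexInstrumenter {
    public static byte[] instrumentPattern() throws Exception {
        ClassPool pool = ClassPool.getDefault();
        CtClass cc = pool.get("java.util.regex.Pattern");
        CtMethod m = cc.getDeclaredMethod("matches");
        // In Javassist source snippets, $1 is the regex parameter and
        // $2 is the input string parameter.
        m.insertBefore(
            "{ System.err.println(\"regex=\" + $1 + \" input=\" + $2); }");
        return cc.toBytecode(); // bytecode handed back to the transformer
    }
}
\end{verbatim}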
% Our instrumentation is built on top of the Java bytecode manipulation framework Javassist~\cite{javassist}, which can dynamically change the class bytecode in the JVM. All the projects are run in jdk1.7. First, we built a Java library for the instrumentation that uses {\em jar} files to intercept the invocations of FullMatch functions in Java. For each invocation, we collect information about the regular expression itself, its location in the code, and any strings matched against it during test suite execution. These strings matched against the regular expression are referred to as the \emph{input strings} \todo{we need to pick a term and be consistent with it}; the set of all strings matched against a regular expression is called $S$.
%Regular expressions and their input strings are the main information we collected from unit test executions.
Since a regular expression may also appear in third-party libraries (including Maven and JUnit), we use the Java Reflection API to additionally record the caller stack of the instrumented methods and extract the file name, class name, and method name of their caller methods.
This allows us to identify when the regular expression being executed is from the system under test and when it is from a third-party library.
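One common way to recover this information is via the current thread's stack trace, sketched below (our illustration; the study reports using the Java Reflection API, and the exact frame offsets depend on where the helper is invoked):
\begin{verbatim}
// Frame [0] is Thread.getStackTrace, [1] is this helper, [2] is the
// helper's caller (e.g., the instrumented matches method), and [3] is
// that method's caller in the system under test.
public class CallerInfo {
    public static String callerOf() {
        StackTraceElement caller =
            Thread.currentThread().getStackTrace()[3];
        return caller.getClassName() + "." + caller.getMethodName()
             + " (" + caller.getFileName() + ":"
             + caller.getLineNumber() + ")";
    }
}
\end{verbatim}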
%This allows for some data cleanup (Section~\ref{datacleanup}).
%However, because same regular expression may appear in different places of source code and then tested differently in the test suite, we need to calculate their test coverage separately.
%In order to differentiate between the same regular expression (i.e., that match the same language) but that appear in different code locations, we additionally recorded the caller function invocation of the instrumented methods and extracted the file name, class name, and method name of the caller method. Together with the regular expression, these four pieces of information compose a unique regular expression to calculate its test coverage.
%\begin{figure}[tb]
%\begin{footnotesize}
%\noindent\fbox{%
%\begin{minipage}{0.45\textwidth}
%(1) matches in class: java.util.regex.Matcher[on line number: 559 of file: Matcher.java]\\
%(2) compileRoutePattern in class: kirouter.SinatraRouteParser[on line number: 38 of file: SinatraRouteParser.java]\\
%...\\
%(3) singleParameter in class: KiRouterFilterTest[on line number: 19 of file: KiRouterFilterTest.java]\\
%...\\
%(4) invoke in class: java.lang.reflect.Method[on line number: 606 of file: Method.java]
%...\\
%(5) run in class: org.junit.runners.ParentRunner\$3[on line number: 238 of file: ParentRunner.java]\\
%(6) run in class: org.apache.maven.surefire.junitcore.pc.Scheduler\$1[on line number: 370 of file: Scheduler.java]\\
%...\\
%(7) Matcher matches()---regex: ((:\textbackslash w+)|\textbackslash*)---input: one-name---\#
%\end{minipage}}
%\caption{Instrumentation output for collecting regular expressions and the input strings \todo{can cut for space}}
%\label{instrumentationoutput}
%\end{footnotesize}
%\end{figure}
\subsubsection{Example}
As an example, we illustrate the information recorded for the regular expression \verb!((:\w+)|\*)! and the string ``one-name'' from a project used in our study.\footnote{\url{https://github.com/mikko-apo/KiRouter.java}} The following information is retained for analysis:
%Figure~\ref{instrumentationoutput} shows a snippet of recorded informations for the regular expression \verb!((:\w+)|\*)! and a string ``one-name'' (shown in line (7)) from a project used in our study.\footnote{\url{https://github.com/mikko-apo/KiRouter.java}} Lines (1) to (6) are the caller method information in terms of method name, class name, line number, and file name. For example, line (1) shows the matching is invoked in class {\tt java.util.regex.Matches}, the invoked method is {\tt matches}, and the invocation happens at the line 559 of file {\tt Matcher.java}. Since line (2) is the caller method of the instrumented function, we retain the following location and regular expression information for analysis:
\begin{itemize}% \itemsep -1pt
\item system under test: {\tt mikko-apo/KiRouter.java}
\item test method: {\tt compileRoutePattern}
\item test class: {\tt kirouter.SinatraRouteParser}
\item test file: {\tt SinatraRouteParser.java}
\item regular expression: \verb!((:\w+)|\*)!
\item input string: ``one-name''
\end{itemize}
% In the text snippet above, the test file is \textit{KiRouterFilterTest.java} while the matching occurs in file \textit{SinatraRouteParser.java}.
%\todo{is this correct?}
%Regular expressions which are syntactically same are differentiated by recorded stacks, which consists of a project name, a test method, a test class, a test file, and a regular expression. This is the unit to calculate the coverages for each regular expression. %The unit of our regular expression coverage calculation is a unique regular expression on a recorded stack, consisting a project name, a test method, a test class, a test file, a regular expression, and a set of input strings.
%However, these frameworks sometimes use regular expressions as well.
%This muddies the data by including invocations of the targeted Java APIs by code outside the system under test.
%The matching does not necessarily occur in source code for testing.
%Next, for valid Java Maven Projects in RepoReaper~\cite{repo},
%\subsection{Extracting Regular Expressions and Input Strings}
%Given that Maven tests are run in parallel, we forked the JVM and ran the tests separately for each module. In this way, we could collect regular expression information for modules with passing tests even if other modules' tests failed
%In this way we dropped 8,496 regular expressions, 1,259 regular expressions which are unique in syntax.
%The resulting dataset contains 1,256 projects, 15,562 regular expressions, and 14,040 regular expressions which are syntactically unique.
%Note that the total number of syntactically unique regular expression is larger than 15091 because syntactically regular expressions could appear independently in both stacks of a third-party library and stacks of a GitHub project and thus are countered twice.
\subsubsection{Limitations}
One drawback of JVM dynamic instrumentation is that it intercepts all invocations of the instrumented functions, including those from third-party libraries imported by the projects (e.g., {\tt org.junit}, {\tt org.apache.maven}) as well as those from the system under test.
To remove the regular expressions used in third-party libraries, we first dropped all regular expressions from the packages \texttt{org.junit.runner.*} and \texttt{org.apache.maven.plugins.*} \todo{where would these packages appear? test class? test file?} because these two packages are always used in Maven tests. %This left us with 1,665 projects and 24,058 regular expressions.
\subsection{Coverage Analysis}
\label{subsec:cov}
This section details the construction of DFAs for computing coverage.
Given a regular expression $R$ and a set of input strings $S$, we first build a DFA for $L(R)$ and then track the nodes and edges visited in the DFA during pattern matching with each string $s \in S$. We built our infrastructure on top of RE2~\cite{re2}, an automata-based regular expression engine whose syntax is similar to that of PCRE and Perl.\footnote{Original RE2 at \url{https://github.com/google/re2} and modified code at [blinded for review]} %Then, we calculate graph coverages according to the description in Section~\ref{sec:coverage}.
\subsubsection{DFA Types.} Given a regular expression and an input string to match, multiple DFAs can be built, each with different considerations. We can build a static DFA from the regular expression alone, or build a DFA on-the-fly (a dynamic DFA) from both the regular expression and an input string; for the same regular expression, different input strings yield different dynamic DFAs. We can also build a Forward DFA or a Backward DFA, depending on the direction in which matching proceeds. For example, for \verb!a+b!, a forward DFA consumes the {\tt a}s and then the {\tt b}, whereas a backward DFA scans the input in reverse, consuming the {\tt b} first. These decisions come with various performance tradeoffs during the matching process.
%Figure~\ref{fig:forwardbackward} shows the differences between a Forward DFA and a Backward DFA for the regular expression \verb!a+b!, which matches one or more {\tt a} followed by exactly one {\tt b}. Its Forward DFA (Figure~\ref{fig:forward}) and Backward DFA (Figure~\ref{fig:backward}) differ in both DFA size and DFA structure.
For the purpose of our work, we need each DFA to be built in a consistent way, so we use a static DFA. We chose the forward direction
%We choose the Forward static DFA for the regular expression and the Forward dynamic DFA for {\em FullMatch} of the regular expression and the input string, since the forward direction is applied in regular expression engines by default and these
as it seems the most natural for interpretation.
%\subsubsection{DFA Generation}
\subsubsection{DFA Mapping}
\label{dfamapping}
%\url{ https://github.com/wangpeipei90/re2}.}.
%\begin{figure}[t]
%\centering
% \begin{subfigure}[t]{0.40\textwidth}
% \centering
% \includegraphics[width=\textwidth]{figures/forwardab}
% \vspace{-6pt}
% \caption{Forward FullMatch DFA }
% \label{fig:forward}
% \end{subfigure}
%
% \begin{subfigure}[t]{0.4\textwidth}
% \centering
% \includegraphics[width=\textwidth]{figures/backwardab}
% \vspace{-6pt}
% \caption{Backward FullMatch DFA}
% \label{fig:backward}
% \end{subfigure}
% \vspace{-6pt}
%\caption{Forward DFA and Backward DFA for the same regular expression: {\tt a+b} \todo{can drop for space}}
%\vspace{-6pt}
%\label{fig:forwardbackward}
%\end{figure}
%The process of pattern parsing which express the regular expression as the state transition from one state to the next state according to every byte input. It begins with the start state and ends in either a dead state(Ox1) or a match state. By traversing all the transitions and states in this process, we build a graph as the static DFA. Similarly, we build other graphs as dynamic DFAs during the process pattern matching.
When matching an input string against a regular expression, RE2 builds a dynamic DFA. However, our coverage is computed over the static DFA, which requires a mapping to aggregate the coverage of a regular expression across multiple input strings.
For a single regular expression, different input strings often result in different dynamic DFAs; worse, these DFAs name their states inconsistently.
Therefore, to calculate the coverage of a regular expression on a single DFA, each dynamic DFA must be mapped onto the static DFA, on which coverage is then computed. This mapping is straightforward because a dynamic DFA is always isomorphic to a subgraph of the static DFA.
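A minimal sketch of this mapping as a lock-step traversal from the two start states is shown below (our illustration; the {\tt Dfa} interface is a hypothetical stand-in for RE2's internal state structures):
\begin{verbatim}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public final class DfaMapper {
    // Hypothetical DFA view: states are ints; next() returns the target
    // state for a byte (0-255) or end-of-input (256), or -1 if absent.
    public interface Dfa { int start(); int next(int state, int input); }

    // Walk both DFAs in lock-step; since the dynamic DFA is a subgraph
    // of the static DFA, every dynamic transition also exists statically.
    public static Map<Integer, Integer> map(Dfa dynamic, Dfa staticDfa) {
        Map<Integer, Integer> mapping = new HashMap<>();
        Deque<int[]> work = new ArrayDeque<>();
        mapping.put(dynamic.start(), staticDfa.start());
        work.push(new int[]{dynamic.start(), staticDfa.start()});
        while (!work.isEmpty()) {
            int[] pair = work.pop();
            for (int b = 0; b <= 256; b++) {
                int d = dynamic.next(pair[0], b);
                if (d < 0) continue;              // transition not taken
                int s = staticDfa.next(pair[1], b);
                if (mapping.putIfAbsent(d, s) == null)
                    work.push(new int[]{d, s});   // newly mapped state
            }
        }
        return mapping;
    }
}
\end{verbatim}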
% \todo{is it not always isomorphic?}
%Their start states are equivalent. Each state transition of the dynamic DFA contains one byte input while the equivalent state transition of the static DFA contains a set of byte input. Thus, by traversing the dynamic DFA from the same start state and identifying the arrival states with byte input states, we could make the naming of equivalent states consistent among the static DFA and the dynamic DFAs.
\begin{figure}[t]
\begin{subfigure}[t]{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/digitsDFA}
\vspace{-12pt}
\caption{Fully specified static DFA for: {\tt \textbackslash d+}}
\label{fig:staticfull}
\end{subfigure}
%\vspace{12pt}
%\begin{subfigure}[t]{0.45\textwidth}
% \centering
% \includegraphics[width=0.6\textwidth]{figures/digits1001mapping2}
% \vspace{-6pt}
% \caption{Dynamic DFA prior to mapping for regular expression: {\tt \textbackslash d+} and input: ``1001''}
% \label{fig:map}
%% \vspace{-6pt}
%\end{subfigure}
\vspace{12pt}
\begin{subfigure}[t]{0.45\textwidth}
\centering
\includegraphics[width=0.5\textwidth]{figures/digits2}
\vspace{-6pt}
\caption{Dynamic DFA for regular expression: {\tt \textbackslash d+} and input: ``2''}
\label{fig:match1}
\end{subfigure}
\vspace{12pt}
\begin{subfigure}[t]{0.45\textwidth}
\centering
\includegraphics[width=0.6\textwidth]{figures/digits1001}
\vspace{-6pt}
\caption{Dynamic DFA for regular expression: {\tt \textbackslash d+} and input: ``1001''}
\label{fig:match2}
\end{subfigure}
\vspace{12pt}
\begin{subfigure}[t]{0.45\textwidth}
\centering
\includegraphics[width=0.3\textwidth]{figures/digitsunew}
\vspace{-6pt}
\caption{Dynamic DFA for regular expression: {\tt \textbackslash d+} and input: ``u''}
\label{fig:mismatch1}
\end{subfigure}
\vspace{12pt}
\begin{subfigure}[t]{0.45\textwidth}
\centering
\includegraphics[width=0.7\textwidth]{figures/digits100u}
\vspace{-6pt}
\caption{Dynamic DFA for regular expression: {\tt \textbackslash d+} and input: ``100u''}
\label{fig:mismatch2}
\end{subfigure}
\vspace{-6pt}
\caption{Visited DFA subgraphs for the regular expression {\tt \textbackslash d+}. In each subfigure, $N_0$ is the initial node (Node 0), $N_m$ is the accept node (Node 3), and $N_e$ is the error node (Node e).
%State 2 is the initial state. State 5 is a final matching state. State 1 is a final non-matching state.
The arrows colored blue represent transitions in successful matches. The arrows colored red represent transitions in failed matches. \todo{is there any reason to have these colored edges? If no, make them all black. It doesn't seem to add value.}
%Each ellipse in the graph represents one state of the DFA. The ellipse labeled `2' with an incoming arrow is the start state of DFA. The ellipse with double circles is the end states of DFA. The unfilled double-circled ellipses are matching end state of DFA, which implies a successful match between the regular expression and the string. The gray double-circled ellipse with double circles labeled `1' is the error end state, which implies a failure in matching the string to regular expression. A state with an input of a byte transitions to another state.
The characters without square brackets are literal characters in the state transitions. For example, `u' prompts the transition from Node 0 to Node e. The label [256] indicates that no more bytes remain in the input string.}
\vspace{-6pt}
\label{fig:dfas}
\end{figure}
Consider the regular expression \verb!\d+! and $S = \{s_0, s_1, s_2, s_3\}$ from Section~\ref{coveragemetrics}, where $s_0$ = ``2'', $s_1$ = ``1001'', $s_2$ = ``u'', and $s_3$ = ``100u''. Figure~\ref{fig:staticfull} shows the static forward DFA; from this point, all DFAs are assumed to be forward.
Figure~\ref{fig:dfas} shows the dynamic DFAs corresponding to these four inputs in Figures~\ref{fig:match1}, \ref{fig:match2}, \ref{fig:mismatch1}, and~\ref{fig:mismatch2}, respectively. Blue arrows identify the visited edges in the dynamic DFAs when the input string matches; red arrows identify the visited edges when it does not.
% after mapping to its static DFA, excepting Figure~\ref{fig:map}, which shows the original dynamic DFA for \verb!\d+! and $s_3$ \emph{prior} to mapping onto the static DFA.
Note that in Figure~\ref{fig:dfas} we have renamed the nodes of the dynamic DFAs according to the static DFA in Figure~\ref{fig:staticfull}; the figure thus shows the DFAs after mapping.
In reality, there is a step in which each node of a dynamic DFA is mapped onto the static DFA. This mapping is straightforward, as $N_0$, $N_e$, and $N_m$ are consistently labeled in the static and dynamic DFAs. %\todo{Peipei, is this last sentence correct?}
% we temporarily denote them as \emph{A,B,C,D}. The traversal of this DFA is $A \rightarrow B \rightarrow C \rightarrow C \rightarrow C \rightarrow D$. Consider the static DFA in Figure~\ref{fig:staticfull} (equivalent to Figure~\ref{fig:static}, but repeated here for easier reading).
%The initial state is straightforward to map, and $A \mapsto 0$.
%The transition in the dynamic DFA is along edge $\overline{AB}$ with input $1$.
%The same input in the static DFA traverses edge $\overline{01}$, and so $B \mapsto 1$.
%Similarly, $C \mapsto 2$.
%Since the final state D is not the error state, state D is regarded as the final accept state and $D \mapsto 3$. The results of mapping Figure~\ref{fig:map} is shown in Figure~\ref{fig:match2}.%\todo{why is ((2)) an accept state if it's impossible to accept anything from it (that is, there is no in-arrow of byte 256)?}
\subsubsection{RE2 Limitations and Modifications}
\label{dfageneration}
% There are some other modifications of the RE2 source code so that the DFAs could be publicly visible and accessible to operate.
We changed the default memory size of a cached DFA from 1MB to 256MB so that it can accommodate large DFA graphs.
However, due to the maximum length of a single argument in a Linux environment (131,072 bytes), input strings longer than 131,071 characters are dropped from dynamic DFA generation. The null byte \verb!\x00! cannot be passed in an argument, so input strings containing \verb!\x00! are dropped as well.
These situations are rare, impacting $<1$\% of the collected regular expressions (see Section~\ref{rq1:artifacts}).
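A one-method sketch (our illustration) of the resulting input filter:
\begin{verbatim}
public class InputFilter {
    // Drop strings that cannot be passed to the modified RE2 harness:
    // longer than the Linux single-argument limit, or containing NUL.
    public static boolean usable(String s) {
        return s.length() <= 131_071 && s.indexOf('\0') < 0;
    }
}
\end{verbatim}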
\subsubsection{Coverage Calculation}
With consistent naming between the static DFA and the dynamic DFAs, all nodes, edges, and edge pairs in the latter are regarded as visited nodes, edges, and edge pairs of the former.
That is, a node appears in a dynamic DFA only if it is visited during matching; dynamic DFAs can be thought of as \emph{just-in-time} DFA constructions in the context of a string to match.
The coverage metrics from Section~\ref{coveragemetrics} are computed over the static DFAs, aggregating over all input strings observed during testing.
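Restating the definitions of Section~\ref{coveragemetrics} in this notation (assuming the standard graph-coverage formulations): for a static DFA with node set $N$, edge set $E$, and edge-pair set $EP$,
\[ NC = \frac{\lvert N_{vis} \rvert}{\lvert N \rvert}, \qquad EC = \frac{\lvert E_{vis} \rvert}{\lvert E \rvert}, \qquad EPC = \frac{\lvert EP_{vis} \rvert}{\lvert EP \rvert}, \]
where the visited sets are the unions of the nodes, edges, and edge pairs traversed over all $s \in S$.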
%For regular expressions and tool-generated strings, all strings are matched to their corresponding regular expressions successfully. We report the test coverages of nodes, edges, and edge pairs for each regular expression. However, for data collected in GitHub projects, some input strings could fail to match their regular expression.s. As a result, we report the test coverages differently:
%As there exists both successful matches and failed matches, we report the test coverage in three aspects: 1) the test coverages of nodes, edges, and edge pairs for the regular expression given successfully matched input strings; 2) the test coverages of exits, nodes, edges, and edge pairs for the regular expression given failed matched input strings; 3) the test coverages of nodes, edges, and edge pairs for the regular expression given all input strings.
%As explained in Section~\ref{coveragemetrics}, the coverage metrics are computed over the total set of input strings $S$ as well as over the set of input strings that lead to an accept state, $S_{succ}$ and those that lead to a failure state, $S_{fail}$.
%\subsection{Existing Tools Analysis (RQ2)}
%We first calculated the coverages with same number of unique inputs as these in GitHub project, then we calculated again with all the unique inputs we got from Rex.
\subsection{Artifacts for RQ1}
\label{rq1:artifacts}
%This section illustrates the artifacts used in our analysis related to test coverages of regular expressions. We show first how we collect regular expression.related bugs, then describe the repository to analyze test coverage in community, and finally present the dataset used to measure the effectiveness of regular expression tools.
RepoReaper~\cite{reporeaper} provides a curated list of GitHub projects with the ability to sort on project properties, such as the availability of test suites, which is a prerequisite for our study.
We focused on Java projects due to the language's popularity on GitHub and the availability of a bytecode manipulation framework for instrumentation.
%We extracted regular expressions and matching strings from GitHub Java projects.
\subsubsection{Project Selection}
Because the density of regular expressions in projects tends to be low, we automated project builds and test suite execution in order to collect sufficient data. As such, we require all projects we analyze to use
\textit{maven} and \textit{junit} to automatically run unit tests.
We first selected the 136,196 Java projects in RepoReaper (as of December 2017) that have unit tests. Of these, we identified 13,637 Java Maven projects that use the pattern matching functions mentioned above. From those, we selected the ones that could be successfully compiled and tested with Maven, leaving 5,691 projects on which we attempted to collect coverage information.
%Due to the environment dependency limits and execution time limits, we could not compile and run unit tests for all projects listed in RepoReaper.
%To achieve a regular expression corpus similar in size to previous work using Python~\cite{chapman2016}, we set a goal of finding 15,000 unique regular expressions.
\begin{table}[tb]
%\centering
\caption{Description of the 1,225 Java Projects Analyzed. All numbers are rounded to the nearest integer except test ratio and kLOC.}
\label{regex:distriprojects}
\begin{small}
%\resizebox{0.5\textwidth}{!}{%
%\begin{tabular}{lllllllll}
\begin{tabular}{p{2cm}
>{\raggedleft\arraybackslash}p{0.6cm}
>{\raggedleft\arraybackslash}p{0.6cm}
>{\raggedleft\arraybackslash}p{0.6cm}
>{\raggedleft\arraybackslash}p{0.6cm}
>{\raggedleft\arraybackslash}p{0.6cm}
>{\raggedleft\arraybackslash}p{0.7cm}}
\hline
Attributes & mean & 25\% & 50\% & 75\% & 90\% & 99\% \\
\hline
%Regular Exp. & 14 & 2 & 5 & 13 & 25 & 90 \\
%Stars& 40 & 0 & 1 & 4 & 26 & 771 \\
%Test ratio & 0.230 & 0.080 & 0.196 & 0.340 & 0.473 & 0.728\\
%kLOC & 63 & 2 & 8 & 32 & 119 & 910 \\
%Size(KB) & 19530 & 318 & 1347 & 9118 & 41408 & 273233 \\
Regular Exp. & 12 & 1 & 3 & 7 & 18 & 99 \\
%Stars& 16 & 0 & 1 & 3 & 19 & 356 \\
Stars& 35 & 0 & 1 & 5 & 30 & 833 \\
Test ratio & 0.238 & 0.096 & 0.210 & 0.346 & 0.482 & 0.691\\
%kLOC & 55,382 & 2,046 & 6,743 & 25,101 & 86,718 & 951,007 \\
kLOC & 55.4 & 2.0 & 6.7 & 25.1 & 86.7 & 951.0 \\
Size(KB) & 19,062 & 286 & 1,079 & 6,449 & 33,163 & 249,915 \\
\hline
\end{tabular}
%}
\end{small}
%
%\vspace{5pt}
%
%\vspace{-12pt}
\end{table}
\begin{table}[tb]
%\centering
\caption{Description of the 15,096 regular expressions analyzed for RQ1. All numbers are rounded to the nearest integer.}% \todo{shouldn't this be 15,096?}}
\label{regex:distriregex}
\begin{small}
%\resizebox{0.5\textwidth}{!}{%
%\begin{tabular}{lllllllll}
\begin{tabular}{p{2.2cm}
>{\raggedleft\arraybackslash}p{0.6cm}
>{\raggedleft\arraybackslash}p{0.6cm}
>{\raggedleft\arraybackslash}p{0.6cm}
>{\raggedleft\arraybackslash}p{0.6cm}
>{\raggedleft\arraybackslash}p{0.6cm}
>{\raggedleft\arraybackslash}p{0.6cm}}
\hline
Attributes & mean & 25\% & 50\% & 75\% & 90\% & 99\% \\
\hline
Nodes ($\lvert N \rvert$) & 144 & 12 & 28 & 70 & 324 & 939 \\
Edges ($\lvert E \rvert$) & 565 & 24 & 75 & 212 & 938 & 2,813 \\
Edge pairs& 2,115 & 25 & 99 & 414 & 1,647 & 16,850 \\
Regular exp. len. & 31 & 13 & 18 & 39 & 67 & 161 \\
\# Input strings ($\lvert S \rvert$) & 60 & 1 & 2 & 7 & 27 & 662 \\
Input string len. & 125 & 9 & 17 & 63 & 318 & 948 \\
%Input length dev. & 115.76 & 0.00 & 2.12 & 15.45 & 65.50 & 647.24 \\
\hline
\end{tabular}
%}
%
%\vspace{3pt}
%All deviations \adh{deviations?} are rounded to 2 decimal places. All input length averages are rounded to nearest integer.
\end{small}
%
%\vspace{5pt}
%
%\vspace{-12pt}
\end{table}
%\begin{table}[tb]
%%\centering
%\caption{Description of dataset changes.}
%\label{data:change}
%\begin{small}
%%\resizebox{0.5\textwidth}{!}{%
%%\begin{tabular}{lllllllll}
%\begin{tabular}{p{2.5cm}
%>{\raggedleft\arraybackslash}p{0.6cm}
%>{\raggedleft\arraybackslash}p{1.6cm}
%>{\raggedleft\arraybackslash}p{2.6cm}}
%\hline
%Operation & Projects & Regular Exp. & Unique Regular Exp.\\
%\hline
%Instrumentation & 1,665 & 24,058 & 15,091 \\
%Regular Exp. Removal & 1,256 & 15,562 & 14,040 \\
%Static DFA & 1,225 & 15,105 & 13,638 \\
%Dynamic DFA & 1,225 & 15,096 & 13,632\\
%Multi-inputs & 1,014 & 8,778 & 7,947 \\
%Rex(ASCII) & 1,057 & 12,184 & 10,946\\
%Rex(Unicode) & 1,057 & 12,181 & 10,943\\
%\hline
%\end{tabular}
%%}
%\end{small}
%%
%%\vspace{5pt}
%%
%%\vspace{-12pt}
%\end{table}
\subsubsection{Regular Expression and Test Input Collection}
To collect the input strings for each regular expression, we instrumented each project and executed the test suites.
We changed the configuration of the plugin \emph{maven-surefire-plugin} by adding a {\em -javaagent} argument to {\em argLine} so that, when Maven forks a VM to run the unit tests, the VM loads the instrumentation library.
Each project module that runs tests executes in a different VM, and its information is recorded in a separate file. {\em testFailureIgnore} is set to {\em true} so that one test failure does not abort the remaining tests, allowing us to record as many regular expressions per project as possible.
%In collecting the test inputs for each regular expression, we changed the configurations of the plugin \emph{maven-surefire-plugin} by adding {\em -javaagent} argument to {\em argLine} so that when Maven forks a VM to run the unit tests the VM can load the instrumentation library.
%Each project module that runs tests executes in different VMs and the information is recorded in different files. {\em testFailureIgnore} is configured to {\em true} so that one test failure does not affect the other tests, allowing us to record as many regular expressions in the project as possible. %\todo{This information seems important and I don't know where to put it}
Of the 5,691 projects with Maven, test suites, and pattern matching functions, 1,665 projects contained 24,058 regular expressions executed by their test suites. The remaining projects contained regular expressions \emph{not} executed by the test suites, so no coverage data could be collected for them.
\subsubsection{Filtering Out Third-Party Regular Expressions}
%We regard the remaining regular expressions being used in method invocations of third-party libraries are outliers. We detect them using the rule of IQR(InterQuartile Range)~\footnote{\url{https://en.wikipedia.org/wiki/Interquartile_range}}. We calculate the total amount of matching invocations for all regular expression. If the first and third quartiles are denoted by $Q1$ and $Q3$, then $IQR=Q3-Q1$. For a regular expression $r$, its total invocations is denoted $t(r)$. If $t(r)>Q3+1.5*IQR$, then $r$ will be dropped as a regular expression from third-party libraries.
%The remaining regular expressions being used in method invocations of third-party libraries are identified by their proportions in the whole data set. A regular expression is dropped if it is used widely across different projects. We set the threshold to 10\% of all projects.
%\todo{is the following still relevant?} If a regular expression is matched to a large amount of different inputs (i.e, above the 90 percentile of them from 15091 regular expressions), we drop it as well. \todo{if the unique inputs that it is used frequently to match is above the 90 percentile of all regular expressions -- reword please. I don't follow.}.
If a regular expression comes from a third-party library, we can detect it by looking for syntactically identical regular expressions invoked from the same file, same class, and same method, but in different GitHub projects.
%We assume that there are no such projects which call the same third-party function and pass along the same regular expression as the argument of that function.
Therefore, we detect third-party library invocations by identifying pairs of identical regular expressions and stack information that appear in different projects.
If such a pair appears in more than one project, we treat it as a third-party regular expression and drop all records with the same stack information.
A limitation of this approach is that we miss third-party invocations present in only a single project; given the large number of projects analyzed, the impact of this is likely small.
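A compact sketch of this filter (our illustration, using a hypothetical invocation record type and modern Java for brevity):
\begin{verbatim}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public final class ThirdPartyFilter {
    // Hypothetical record of one instrumented invocation.
    public record Rec(String project, String regex,
                      String file, String cls, String method) {
        String key() { return String.join("|", regex, file, cls, method); }
    }

    // Keep only records whose (regex, file, class, method) signature
    // is observed in exactly one project.
    public static List<Rec> filter(List<Rec> records) {
        Map<String, Set<String>> projects = new HashMap<>();
        for (Rec r : records)
            projects.computeIfAbsent(r.key(), k -> new HashSet<>())
                    .add(r.project());
        List<Rec> kept = new ArrayList<>();
        for (Rec r : records)
            if (projects.get(r.key()).size() == 1) kept.add(r);
        return kept;
    }
}
\end{verbatim}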
We identified 8,496 regular expressions as coming from third-party libraries. %, 1,259 of which were syntactically unique.
The resulting dataset contains 1,256 projects and 15,562 regular expressions, 14,040 of which are syntactically unique. %The overall full matches in the 1,256 projects are 1,8514.
%In this way we dropped 8,496 regular expressions. %, 1,259 of which were syntactically unique.
%The resulting dataset contains 1,256 projects and 15,562 regular expressions, 14,040 of which are syntactically unique. The overall full matches in the 1,256 projects are 1,8514.
%A limitation of this approach is that we miss some third-party invocations that are only present in a single project. \todo{any mitigation for this? Perhaps move this to threats to validity}
\subsubsection{RE2 Analysis}
Since RE2 supports only the most common regular expression language features, we filtered out the regular expressions containing advanced or non-regular features.
% (e.g., capturing group, possessive match, backreference, etc). In this way, 402 syntactically unique regular expressions are removed from the original 14,040 ones(the dataset after removal third-party libraries). The size of regular expression to analyze dropped from 15,562 to 15,105. The number of GitHub projects are decreased from 1,256 to 1,225.
RE2 failed to construct DFAs for 457 regular expressions, leaving 15,105 regular expressions spread across 1,225 projects.\footnote{Assuming all 457 are non-regular, this means over 97\% of the regular expressions sampled are regular, echoing findings from the Python analysis in Section~\ref{regularregularexpressions}.}
%From the 14,040 syntactically unique regular expressions collected, RE2 failed to generate DFAs for 402, leaving 13,638 syntactically unique regular expressions, spread across 1,225 projects and constructs 15,105 regular expressions.
The RE2 limitations on input string length and the null byte affected 56 regular expressions and 191 input strings; nine of the 56 regular expressions were removed from the coverage analysis because their only input string was dropped.
%The remaining dataset has 1,225 projects, 15,096 regular expressions, of which 13,632 are syntactically unique.
The final dataset used for analysis contains 1,225 projects with 15,096 regular expressions, of which 13,632 are syntactically unique. %The overall full matches in the 1,225 projects are 1,8426.
\subsubsection{Project Characteristics}
Table~\ref{regex:distriprojects} describes the 1,225 projects in terms of stars (a measure of popularity), kLOC (thousands of lines of code), size (repository size in KB), test ratio (the ratio of lines of code in test files to total lines of code in the repository, as reported by RepoReaper), and number of regular expressions. The
\emph{mean} column gives the average value of each attribute. Columns {\em 25\%, 50\%, 75\%, 90\%}, and {\em 99\%} give the 25th percentile, median, 75th percentile, 90th percentile, and 99th percentile of each attribute, respectively.
Because we sampled regular expressions through instrumentation, only the regular expressions reaching the instrumented functions appear in our dataset; the number of regular expressions per project reported in Table~\ref{regex:distriprojects} is therefore an underestimate. However, this is unlikely to affect the measured coverage levels, which are the focus of this research.
The average number of regular expressions collected per project was 12 with a range of 1 to 2,004. %\todo{is the upper bound on this range correct?}
We note that the 15,096 regular expressions we study come from only 3,093 call sites in the 1,225 projects. Across all 1,225 projects, there are 18,426 call sites for the instrumented functions in Section~\ref{sec:instrumentedfunctions}. This means that 15,333 (83.21\%) of the call sites are not covered by the test suites. For those that are, the median number of unique regular expressions per call site is one, with an average of \todo{X}. \todo{Figure~\ref{}} illustrates the distribution of unique regular expressions executed at each call site. This distribution is heavily skewed, as many of the call sites are never invoked by the tests; for this reason, we focus our discussion on the regular expressions that are executed by the test suites.
%RE2 failed to interpret 454 regular expressions out of 15,091 as a DFA graph.
\subsubsection{Regular Expression Characteristics}
Table~\ref{regex:distriregex} shows the DFA information for the regular expressions. \emph{Nodes}, \emph{edges}, and \emph{edge pairs} are the total numbers of nodes, edges, and edge pairs in the DFA graph of a regular expression. The average DFA is quite large, with 144 nodes, though this is skewed: the median is 28 nodes.
The density of the DFA graphs\footnote{$density=\frac{\lvert E \rvert}{\lvert N \rvert(\lvert N \rvert -1)}$, as defined at \url{https://en.wikipedia.org/wiki/Dense_graph}.} is low, showing that the DFAs are sparse graphs; at the median values, $density = 75/(28 \times 27) \approx 0.10$.
\emph{Regular exp. len.} measures the length of the string representing the regular expression itself in characters.
\emph{\# Input strings} is the number of syntactically unique input strings executed by a project's test suite.
The average number of syntactically unique test inputs per regular expression is 60, but the median is 2. %It means there are a small amount of regular expressions which are likely to be tested well enough while most of the regular expressions have a small test suite. % reptition times of same regex and input is ignored. It means there are a small amount of regular expressions which are used more frequently than the others and are used to match the same inputs several times.
\emph{Input string len.} shows the lengths of the input strings (i.e., each $s \in S$) in terms of the number of characters. %; \emph{Input length dev.} shows the standard deviation of input string length of every regular expression.
%Table~\ref{data:change} shows the changes of projects, regular expressions, syntactically unique regular expressions after each step. \emph{Instrumentation} shows the three values after we dynamically instrumented GitHub projects. \emph{Regular Exp. Removal} shows the values after we detected and removed regular expression from third-party libraries. \emph{Static DFA} shows the values after we generated static DFAs for each regular expression and removed the invalid ones. \emph{Dynamic DFA} shows the values after we generated dynamic DFAs for pairs of regular expressions and their input strings and removed invalid strings. \emph{Multi-inputs} shows the values of dataset in which regular expression valid inputs are more than one. \emph{Rex} shows the values of regular expression dataset for which Rex could successfully generate strings.
%The ratio of visited nodes in all nodes of the static DFA is the node coverage of the corresponding regular expression given one input string.
%Given all dynamic DFAs of the regular expression. the ratio of all unique visited nodes constitute the overall node coverage of the regular expression with its all input strings. Similar calculation applies to edge and edge pair coverages as well.
%After building the DFAs for each regular expression as explained in Section~\ref{subsec:cov}, we measured the number of nodes, edges, and edge-pairs per regular expression.
\subsection{Artifacts for RQ2}
\label{rq2:artifacts}
\begin{table}[tb]
\caption{Description of 7,926 regular expressions for RQ2.}
\label{succ:ascii}
\begin{small}
\begin{tabular}{p{2cm}
>{\raggedleft\arraybackslash}p{0.6cm}
>{\raggedleft\arraybackslash}p{0.6cm}
>{\raggedleft\arraybackslash}p{0.6cm}
>{\raggedleft\arraybackslash}p{0.6cm}
>{\raggedleft\arraybackslash}p{0.6cm}
>{\raggedleft\arraybackslash}p{0.6cm}}
\hline
Attributes & mean & 25\% & 50\% & 75\% & 90\% & 99\% \\
\hline
Nodes ($\lvert N \rvert$) & 220 & 13 & 31 & 162 & 618 & 970 \\
Edges ($\lvert E \rvert$) & 773 & 30 & 97 & 663 & 1,468 & 3,694 \\
Edge pairs& 2,422 & 36 & 186 & 1,021 & 1,999 & 21,274 \\
$\lvert S \rvert$ & 70 & 1 & 2 & 8 & 39 & 961 \\
$\lvert S_{succ} \rvert$& 34 & 1 & 1 & 2 & 8 & 208 \\
Regular exp. len. & 29 & 12 & 15 & 31 & 71 & 160 \\
\hline
\end{tabular}
\end{small}
\end{table}
To explore the coverage of regular expressions achievable with tool-generated inputs, we selected Rex~\cite{rex} due to its high language feature coverage~\cite{chapman2016}.
We compare the coverage achieved by Rex-generated strings to the coverage achieved by the tests in the repositories.
\subsubsection{Artifact Selection}
We need a set of regular expressions that 1) are covered by tests; 2) can be analyzed by RE2 for coverage analysis; and 3) can be handled by Rex for test input generation. To satisfy 1) and 2), we begin with the RQ1 dataset of 1,225 projects and 15,096 regular expressions. To satisfy 3), we select all the regular expressions that Rex supports and for which $\lvert S_{succ} \rvert > 0$, since Rex generates only matching strings.
Rex has feature limitations similar to RE2~\cite{rexlimits}. It does not support literal characters between \emph{\textbackslash Q} and \emph{\textbackslash E}, nor anchors (e.g., \emph{\textbackslash z}, \emph{\textbackslash A}, \emph{\textbackslash G}, \emph{\textbackslash b}, and \emph{\textbackslash B}). It also does not support advanced features (e.g., named groups, lookahead, lookbehind, backreferences, conditional alternation, substitution). Additionally, regular expressions with Unicode are not fully supported in Rex.
Thus, after filtering out all the unsupported regular expressions, our reported coverages for Rex-generated strings in ASCII encoding are based on 7,926 regular expressions (7,007 syntactically unique) from 985 GitHub projects.
\subsubsection{Rex Setup and Limitations}
Rex follows the regular expression syntax of C\#, which defaults to \emph{ManyMatch} as opposed to the \emph{FullMatch} behavior of our dataset.
To force Rex to treat each regular expression as a full match and generate strings accordingly, we added endpoint anchors (i.e., \verb!^! and \verb!$!) to each regular expression.
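As an illustration, one way to perform this preprocessing follows; this is a hedged sketch, and the non-capturing group is a safeguard of our illustration (the preprocessing itself simply prepends \verb!^! and appends \verb!$!), guarding against a top-level alternation such as \verb!a|b!, where bare anchors would bind to only the first and last branches:

\begin{verbatim}
def anchor_for_full_match(pattern: str) -> str:
    # Wrap the pattern so ManyMatch semantics behave like FullMatch.
    # The non-capturing group prevents a top-level alternation like
    # a|b from becoming ^a|b$, which anchors only the outer branches.
    return "^(?:" + pattern + ")$"

assert anchor_for_full_match("a|b") == "^(?:a|b)$"
\end{verbatim}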
Because Rex may get stuck generating input strings for certain regular expressions, we set a timeout of one hour for Rex to generate strings; regular expressions that exceed the timeout are discarded.
Of the 10,155 regular expressions in GitHub with $\lvert S_{succ} \rvert > 0$, only two hit the timeout.
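A hedged sketch of such a timeout harness is shown below; the command passed in is a placeholder, not Rex's actual command-line interface:

\begin{verbatim}
import subprocess

def generate_with_timeout(cmd, seconds=3600):
    # Run a string-generation command with a one-hour wall-clock
    # budget; return its stdout, or None if the budget expires
    # (in which case the regular expression is discarded).
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              timeout=seconds).stdout
    except subprocess.TimeoutExpired:
        return None
\end{verbatim}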
Another complication arises at the intersection of Rex's and RE2's language support: strings generated by Rex must be processable by RE2 for the coverage analysis.
For example, the character class \verb!\s! in Rex accepts six whitespace characters: a space, a newline (\verb!\n!), a horizontal tab (\verb!\t!), a vertical tab (\verb!\v!), a carriage return (\verb!\r!), and a form feed (\verb!\f!); RE2, however, excludes the vertical tab (\verb!\v!) from \verb!\s!. As another example, some strings generated by Rex in Unicode mode could not be processed because Rex encodes them in UTF-16, while RE2 handles Unicode sequences encoded in UTF-8 or Latin-1.
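The \verb!\s! discrepancy is easy to observe directly. The sketch below assumes the google-re2 Python binding (imported as \verb!re2!), which mirrors the built-in \verb!re! API:

\begin{verbatim}
import re    # Python's engine: \s includes the vertical tab
import re2   # google-re2 binding: RE2's \s is [\t\n\f\r ]

vt = "\x0b"  # vertical tab, \v
print(bool(re.fullmatch(r"\s", vt)))   # True
print(bool(re2.fullmatch(r"\s", vt)))  # False: RE2 excludes \v
\end{verbatim}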
To simplify the experiment, we configured Rex to generate strings in ASCII. We dropped strings containing features or characters unsupported by either RE2 or Python 3, as well as strings that led to failed matches, and we report coverage based on successful matches.
Table~\ref{succ:ascii} shows the attributes of the regular expressions for which Rex could generate input strings. Coverage is based on $S_{succ}$, so only the successful inputs are shown.
\subsubsection{Input String Generation}
Rex generates input strings for a regular expression by encoding the regular expression as symbolic constraints, solving those constraints with the SMT solver Z3~\cite{de2008z3}, and searching for input strings with guidance from Pex~\cite{tillmann2008pex}.
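To illustrate the flavor of constraint-based generation, the following minimal sketch (not Rex itself) enumerates distinct matching strings by solving a regex-membership constraint with Z3's regex theory, via the \verb!z3-solver! Python package, and blocking each solution:

\begin{verbatim}
from z3 import (Concat, InRe, Plus, Range, Re, Solver,
                String, Union, sat)

x = String("x")
# Encode [0-9]+(h|m|s) as a Z3 regular expression.
pattern = Concat(Plus(Range("0", "9")),
                 Union(Re("h"), Re("m"), Re("s")))

solver = Solver()
solver.add(InRe(x, pattern))

witnesses = []
for _ in range(3):             # enumerate three distinct matches
    if solver.check() != sat:
        break
    model = solver.model()
    witnesses.append(model[x].as_string())
    solver.add(x != model[x])  # block it; force a fresh string
print(witnesses)               # e.g. ['0h', '0m', '0s']
\end{verbatim}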
Rex generates only matching strings, so for this experiment, for each regular expression $R$, we generate input string sets sized relative to the number of matching strings, $\lvert S_{succ} \rvert$.
For a regular expression $R$, we generated input string sets of three sizes: $\lvert S_{succ} \rvert$, $5 \times \lvert S_{succ} \rvert$, and $10 \times \lvert S_{succ} \rvert$. We refer to these experiments as \RexSOne, \RexSFive, and \RexSTen, respectively. We repeated the string generation ten times for \RexSOne and five times each for \RexSFive and \RexSTen, using the system time as the random seed to encourage diversity among the generated strings. For each string set, NC, EC, and EPC are computed, and the averages over the runs of each metric are reported as Rex's coverage of $R$.
For example, say a regular expression $R$ from GitHub has five input strings ($\lvert S \rvert = 5$), three of which match ($\lvert S_{succ} \rvert = 3$). For this experiment, Rex would generate three strings ten times, 15 strings five times, and 30 strings five times, totaling $30 + 75 + 150 = 255$ generated strings. For each set of $\{3, 15, 30\}$ strings, NC, EC, and EPC are computed and averaged over the $\{10, 5, 5\}$ runs.
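The bookkeeping generalizes to any $\lvert S_{succ} \rvert$; a minimal sketch of the arithmetic (the function name is ours, for illustration):

\begin{verbatim}
def total_generated(s_succ, multipliers=(1, 5, 10), runs=(10, 5, 5)):
    # Total strings requested from Rex for one regular expression
    # across Rex-S1 (10 runs), Rex-S5 (5 runs), Rex-S10 (5 runs).
    return sum(m * s_succ * r for m, r in zip(multipliers, runs))

assert total_generated(3) == 255  # 3*10 + 15*5 + 30*5
\end{verbatim}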
Rex may fail to generate the required number of input strings for some regular expressions. For example, the regular expression `\textbackslash d' has exactly ten matching input strings in ASCII (i.e., 0--9). If GitHub also has three matching input strings for it, Rex can generate three strings ten times, but only ten strings five times and ten strings five times, totaling $30 + 50 + 50 = 130$ generated strings. There may be duplicates among the ten sets of three strings, and all the sets of ten strings are identical because they exhaust the matching language.
In at least one run, Rex generated fewer input strings than the target: for 833 regular expressions (708 syntactically unique) in \RexSOne, for 2,041 (1,758 syntactically unique) in \RexSFive, and for 2,336 (1,988 syntactically unique) in \RexSTen.
The calculation of NC, EC, and EPC is best-effort: in each run of every regular expression, we compute coverage with up to $\lvert S_{succ} \rvert$ input strings in \RexSOne, $5 \times \lvert S_{succ} \rvert$ in \RexSFive, and $10 \times \lvert S_{succ} \rvert$ in \RexSTen, and each regular expression's coverage is the average over its $\{10, 5, 5\}$ runs. In other words, if Rex fails to generate the required number of input strings, coverage is computed from the strings Rex does generate.
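A minimal sketch of this best-effort averaging for NC (EC and EPC are analogous), assuming each run reports the set of DFA nodes its generated strings visited:

\begin{verbatim}
def node_coverage(visited_per_run, total_nodes):
    # Each run contributes whatever strings Rex managed to generate;
    # NC for a run is |visited| / |N|, and the reported NC is the
    # mean over the runs actually executed.
    per_run = [len(v) / total_nodes for v in visited_per_run]
    return sum(per_run) / len(per_run)

# e.g., three runs over a 5-node DFA:
print(node_coverage([{0, 1, 2}, {0, 1}, {0, 1, 2, 3}], 5))  # 0.6
\end{verbatim}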