-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathpygrate_paper.ltx
608 lines (509 loc) · 31.5 KB
/
pygrate_paper.ltx
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
%&pygrate_paper_preamble
\endofdump
\begin{document}
\maketitle
\begin{abstract}
\noindent
Many changes one might wish to make to a programming language break existing
programs, causing disruption and complaint amongst users. Over time, this
causes ever more of a language to be considered unchangeable.
In this paper, we
consider a notably disruptive change to a mainstream programming language:
the transition from Python 2 to Python 3. We introduce
the concept of \emph{Temporary Migration Variants} (TMV) --- intermediate
language versions which allow
code to be gradually migrated until the TMV is
no longer needed. We implement TMVs for both CPython 2.x and CPython 3.x,
allowing code to run with forward/backward compatible fixes and
warnings. Our simple TMVs, though neither complete or perfect, suggest that
it is possible to keep languages evolving by providing the ability
for users to gradually migrate their code from one version to another.
\end{abstract}
\section{Introduction}
Programming languages evolve, either to fix existing imperfections, or because
of changes in the wider computing context. However, users -- for whom programming
languages are generally a means rather than an end -- often resist language evolution,
either because they worry that the language will become too complex,
or because their existing programs will no longer run correctly. In this paper
we are motivated by the latter challenge: many users have large, long-lived
codebases which they must keep running.
Though rarely discussed in the literature, there are many instances in
programming language history of users forcing a language's designers to
reconsider breaking changes, often in the early stages of design.
There are also instances, though harder to quantify, of languages
scaring away potential users after developing a reputation for `too frequent'
breaking changes. Language evolution thus ends up split in two: it
is relatively easy to add new features that don't interfere with existing
programs; but it is very hard to change existing features. This split
inevitably means that large parts of a language ossify.
An obvious way out of this conundrum is to automatically migrate programs as the
underlying language changes. Perhaps surprisingly, there are few automated
tools available to assist to do so. Some languages ship with tools
(e.g.~Python's \texttt{2to3} tool or Rust's \texttt{cargo
fix}) that can fix some changes, though these are typically restricted to
changes that can be detected syntactically, or with minimal static analysis.
As we shall see later, the effects of many language changes on a given program
can only be detected dynamically.
In this paper we introduce the concept of
\emph{Temporary Migration Variants}: variants of the `old' and `new'
versions of a language which reduce \emph{migration friction}. As well
as traditional static (i.e.~`compile-time') warnings of incompatibilities, TMVs aim to
deal with changes which can only be detected dynamically (i.e.~`run time').
Features can be backwards or forward ported between both versions, warning users of
unsupported, or changed, features while still allowing programs to run. Once
all warnings for a given program are fixed, the TMVs have done their job, and
the user can move wholesale to the `new' version of the language.
\begin{figure}[t]
\begin{center}
\includegraphics[width=0.8\textwidth]{semantic_gap.pdf}
\end{center}
\caption{This paper's concepts in the context of Python 2 and Python 3. On
the top row we see the status quo: Python 2 and Python 3 have a substantial semantic gap.
Users who wish to migrate their programs from Python 2 to Python 3 must manually
bridge this gap. This involves running the program, detecting
undesirable changes (not all of which manifest as simple crashes), fixing
them, and then trying again. For much of this process the program will only
partly execute, making it difficult to understand the full effect of any changes made.
On the middle row we have the Platonic ideal of TMVs: variants of Python 2 and
Python 3 that have no semantic gap, and thus allow the smoothest possible migration
between the two. On the bottom row we have the reality of the \pygratetwo and
\pygratethree TMVs: a much smaller, though still perceptible, semantic gap
compared to Python 2 and Python 3. \pygratetwo and \pygratethree make
migrating programs from Python 2 to Python 3 easier, though they cannot remove
remove all sources of migration friction.}
\label{fig:semantic_gap}
\end{figure}
As a concrete case study, we explore TMVs in the face of the most
disruptive change to a mainstream language in living memory: the transition
from Python 2 to Python 3. We create two TMVs: \pygratetwo, a variant of
CPython 2.7 which warns when users use features that are missing or
incompatible in Python 3.x, and which backports some features from Python 3.x;
and \pygratethree, a variant of CPython 3.12 which forwards ports some features
from Python 2.x, warning when they are used. \pygratetwo and
\pygratethree narrow the semantic gap between Python 2 and Python 3,
allowing users to keep programs running during much of the migration process (see
\cref{fig:semantic_gap} for a visualisation of this idea). We migrate
several programs, including \emph{Twisted}, an event-driven networking engine
implemented in Python, showing that our TMVs do meaningfully ease
migration, even though neither yet supports all the differences between Python
2 and Python 3.
This paper thus serves two purposes: it introduces TMVs and provides a case study
of their use. Although we do not want to over-generalise from a single
example, \pygratetwo and \pygratethree suggest that TMVs meaningfully reduce
migration friction. This concept may prove useful for other languages, including
for languages which are very different from Python, in future migrations.
\section{The Migration Challenge}
Most readers will probably have experienced the need to adjust one of their
programs after a programming language has evolved. For example, to print a
value to \lstinline{stdout} in Python 2 one could write:
\begin{lstlisting}[language=Python]
print 2+3
\end{lstlisting}
However, this program fails to compile\footnote{Although often
described as `interpreted' languages, most implementations of languages like
Python first compile programs to another representation (sometimes called
`bytecode') which they then use as a basis for execution.} with Python 3,
reporting a syntax error. Python 2 has a \lstinline{print} statement built-in
to the language's grammar, but Python 3 does not.
Instead a global \lstinline{print} function is provided, which must be called
in the same way as other functions:
\begin{lstlisting}[language=Python]
print(2+3)
\end{lstlisting}
Since \lstinline{print} statements in Python 2 are common, fixing all of
them when migrating to Python 3 may appear tiresome. Fortunately, because
\lstinline{print} is a grammar-level construct in Python 2, it is possible
to statically pinpoint each place that needs fixing with 100\% accuracy. Python 3 comes with a tool called
\lstinline{2to3} which takes as input Python source code and reports, or (with
the `\lstinline{-w}' flag) modifies each \lstinline{print} statement which
needs fixing.
Unfortunately, not all such changes between Python 2 and Python 3 are
handled quite so easily. Consider this Python 2 fragment:
\begin{lstlisting}[language=Python]
x = cmp(2, 3)
\end{lstlisting}
The global \lstinline{cmp(x, y)} function in Python 2 compares its two arguments
and returns -1 if $x < y$, 0 if $x = y$, or 1 if $x > y$ --- in this example,
\lstinline{x} will be assigned the value -1. \lstinline{cmp} is
often used as the basis for determining the order to sort collections such
as lists.
Python 3 removes the global \lstinline{cmp} function, causing the fragment
above to produce a generic `\lstinline{name not defined}' error. Unfortunately,
the \lstinline{2to3} tool does not produce a warning when it sees uses of \lstinline{cmp},
let alone attempt to automatically fix them. The probable reason for this is that,
unlike \lstinline{print}, it is impossible to statically identify calls to
\lstinline{cmp} with 100\% accuracy. For example, the user can introduce
their own local function called \lstinline{cmp} which overrides the global
function or import a \lstinline{cmp} function from another module. Such examples
may seem like they can be statically detected, but it is easy to use Python's
dynamic features to alter namespaces. A deliberately ludicrous, but fully legal,
example shows the extent of this issue:
\begin{lstlisting}[language=Python]
self_mod = __import__(__name__)
n = input()
setattr(self_mod.__builtins__, input(), "xyz")
print(cmp)
\end{lstlisting}
This example first gets a reference to the current module using a dynamic name
space lookup (line 1). It then uses \lstinline{input} to read a string from
the user at run-time (line 2). The value of that string is then used to
create, or if it already exists alters, a global variable,
assigning to it the value of \lstinline{xyz} (line 3). Finally, the current value
of the \lstinline{cmp} global variable is printed out (line 4). For all but one
user input, this will print \lstinline{<built-in function cmp>} --- but if the
user happens to enter the string \lstinline{cmp} then it will print out
\lstinline{xyz}. The point of this example is not, to put it mildly, to showcase good programmer
practise, but to show that Python's highly dynamic nature defeats most seemingly
reasonable static analyses.
The impossibility of fully deriving a program's dynamic behaviour from a static
examination presents a difficult choice for static migration tools such as
\lstinline{2to3}. In this case, \lstinline{2to3} can either: assume that all
uses of the name \lstinline{cmp} point to the global function, risking false
positives; or assume that it can't tell, guaranteeing false negatives.
Neither outcome is desirable, but false positives are arguably worse
since they cause the user's source code to be incorrectly changed.
Perhaps for this reason, \lstinline{2to3} (and other similar tools we are
aware of) nearly always err on the side of caution, heavily favouring
false negatives over false positives.
The ease of migrating \lstinline{print}, and the difficulty of migrating
\lstinline{cmp}, between Python 2 and Python 3 show the opportunities and
challenges that language designers must consider when evolving a language.
Those changes (like
\lstinline{print}) that can be statically detected and fixed can be made with little fear
of substantial user complaint. However, changes which can only be detected
dynamically are far more dangerous. This heavily constrains the possible
changes that most language designers will consider plausible.
In this paper we are therefore
chiefly concerned with what happens when language designers wish to
consider, or actually carry out, changes to a language which can only be
detected dynamically. Must the burden of migrating programs be placed entirely
on users or can we help them carry out their migration? If so, what might `help'
look like and to what degree does it actually help?
\section{Temporary Migration Variants}
The underlying problem that motivates our work is that many desirable changes
to a programming language break user programs in ways that can only
reasonably be detected dynamically. Our solution to this problem is the concept
of \emph{Temporary Migration Variants}: implementations of the `old' and `new'
versions of a language which narrow the semantic gap, helping ease the
migration path. As we shall see, this basic concept can come in several flavours,
though we explore only one in depth in this paper. In the rest of this section
we present a semi-formal definition of TMVs to better clarify some of the
concepts we use in the rest of the paper.
We define a \emph{programming language} (henceforth just `language') $L$ to
have \emph{semantics} that are a combination of the `core'
language (e.g.~syntax and dynamic execution rules) and its standard libraries.
For a given language $L$ we have an `old' version of the language with
semantics $L_O$ and a `new' version with semantics $L_N$. We have corresponding
\emph{implementations} of each language $I_O$ and $I_N$. For brevities sake,
we conflate `language' and `implementation` except in those cases where it is
important to separate the two.
The challenge we address is that an arbitrary program $P$ written for $L_O$
may \emph{break} when executed on $L_N$: it may fail
compile-time checks (and thus not compile); it may run and \emph{abort}
execution when it tries to use a feature that has changed or been removed; or
it may run and \emph{diverge} from the expected execution path.
TMVs are variants of both a language's semantics and its corresponding
implementation. TMVs can be defined for both $L_O$ and $L_N$ though
there are subtle asymmetries in what the resulting TMVs can sensibly do.
If our starting point is $L_O$ then
defining a TMV requires defining a `new' language $L'\!_O$. At the very
least, we expect $L'\!_O$ to produce warnings when a program uses features
not supported in $L_N$. We also generally expect $L'\!_O$ to (fully or
partially) backport features from $L_N$, at which point $L'\!_O$ becomes
a true `intermediate' language. A program $P$ written for $L_O$ that has
been modified to $P'$ for $L'\!_O$ may no longer be correct for $L_O$ or
$L_N$ (though, relative to $P$, we would expect $P'$
to break in fewer ways under $L_N$ and $I_N$).
If our starting point is $L_N$ then a TMV $L'\!_N$ does not meaningfully have
the option to simply produce warnings for features not supported in $L_O$,
since those features are unsupported by definition. Instead, for $L'\!_N$ to be useful, it
must forward port (fully or partially) features from $L_O$. Doing so allows
$L'\!_N$ to warn when such features are used. In other words, it doesn't make
sense for $L'\!_N$ to produce warnings if it has not at least partly forwarded
the feature it is warning about.
It may not be obvious why it is useful to have both $L'\!_O$ and $L'\!_N$ ---
indeed, we at first assumed that only $L'\!_O$ was really necessary. However,
we eventually realised that false positives on $I'\!_P$ are common, because $P$
will rely on libraries for $L_O$ for which no partially migrated version exists
(though, in many cases, a fully migrated version for $L_N$ does exist). In
contrast, warnings on $L'\!_N$ are rarely false positives, since it is unlikely
that someone would try migrating $P$ before, or until, the associated library is
migrated. See \laurie{Section XYZ} for an example.
The definition of TMVs above has deliberately presented them as two
absolutes ($L'\!_O$ and $L'\!_N$), partly for simplicity, and partly because
that is how our case study is structured. However, conceptually, there is no
reason that one could not have multiple TMVs between $L_O$ and $L_N$: when a
particularly disruptive change occurs, it may be useful, perhaps even
necessary, to migrate in stages.
\section{\pygratetwo and \pygratethree}
\begin{figure}[t!]
\centering
\begin{subfigure}[b]{0.8\columnwidth}
\centering
\includegraphics[width=\columnwidth]{pygratetwo_ex.pdf}
\caption{An example \pygrate2 warning for the Python 2 \lstinline{file()} method.}
\label{fig:pygratetwo_example}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.8\columnwidth}
\centering
\includegraphics[width=\columnwidth]{pygratethree_ex.pdf}
\caption{An example \pygrate3 warning for the Python 2 \lstinline{cmp()} method.}
\label{fig:pygratethree_example}
\end{subfigure}
\caption{We show a warning for the \lstinline{file()} method that is deprecated
in Python 3, and suggests to fix the code by using the \lstinline{open()} method that is neutral for this case.
This is an example where \lstinline{open()} is a backport from \pygratethree, facilitating a slow migration
where \lstinline{open()} mutually exists with \lstinline{file()} in Python 2. For context the \lstinline{cmp}
function used as example in Figure \ref{fig:pygratethree_example} was replaced by multiple comparison
operators in Python 3 like \lstinline{>, <, =}, for the user code to work, \lstinline{cmp()} is
also forward ported to Pygrate 3 to faciliate the warning.}
\label{fig:pygrate_examples}
\end{figure}
In this paper, we use the challenge of migrating code from Python 2 to Python 3
as a concrete case study. We introduce two TMVs: \pygratetwo, a modification of
CPython 2.7.\laurie{which point version exactly?}; and \pygratethree, a
modification of CPython 3.\laurie{which version exactly?}.
Both TMVs provide various warnings and, when possible, suggest fixes.
\pygratetwo warns when features present in Python 2 are not present, or work
differently, in Python 3; \pygratetwo also backports some features from Python
3 so that users can fix more warnings while still having a running program.
\pygratethree forward ports features from Python 2 and warns when they are
used. As shown in \cref{fig:semantic_gap}, \pygratetwo and \pygratethree reduce
the semantic gap between Python 2 and Python 3, though they do not yet fully remove
it. We estimate that we have reduced the semantic gap to around 40\% of its original
size: further work could reduce that further, though \laurie{``for reasons
we will show later''?} it is unlikely that the gap can be eliminated entirely.
\subsection{Overview}
The goal of warnings in both \pygratetwo and \pygratethree is to give warnings with useful information
on a like the warning message, fix and line number that the user can then use to adjust their code in
a realistic way, rather than syntax errors that just fail the code without any useful immediate information.
With these warnings, the assumption is that the user runs all their code under \pygratetwo, fixing all
warnings using the provided hints on potential fixes. After which, they can run their code under \pygratethree,
fixing all warnings again. At which point we hope that they can run their code unchanged in python3.
Figure~\ref{fig:pygrate_examples} shows an example of the warnings shown in both \pygratetwo and \pygratethree.
The case is Figure~\ref{} is a simpler case where for gentle migration, the method \lstinline{open()} is backported
to \pygratetwo . The second case where \lstinline{cmp()} is forward ported is much more interesting.
There are about three flavors of \lstinline{cmp()}, the dunder method \lstinline{(__cmp__)} , the function,
\lstinline{cmp()} and the argument. I will describe each in the sequel. The dunder cmp requires that we replace
the \lstinline{_cmp__()} definition with the corresponding rich comparison operations
like \lstinline{__eg__ ,__lt__} etc. For the \lstinline{cmp()} function, since it was removed in Python 3, we h
ave no option but to provide its improvised implementation for compatibility.
This maybe is where Python 3 needs a hack because if not, then I think the way the cmp() function is called
is different, on one hand it may be a stdlib function, no import required while if we have our own implementation
in a utility library maybe an import is required which may require a conditional for Python 3. Also, sorting can use
a cmp argument which was removed in python 3. Solving this requires using the key function instead for all
uses of the cmp argument. Also a function cmp style wrapper function can be written similar
to \lstinline{functools.cmp_to_key()}. To conclude, for the dunder \lstinline{__cmp__} and \lstinline{cmp()} function
we have to issue warnings because they are not supported in Python 3 but we have to take care of the differences
where \lstinline{__ne__} delegates to \lstinline{__eq__}. Even the use on the cmp arguments needs a warning because
the cmp argument should be replaced with the key function.
The general rationale is to by default issue warnings in \pygratetwo for as many cases as is technically
possible. In some exceptions we decide to move the warnings to pygrate3 under two circumstances: 1)
the need for precision and 2) implementation constraints.
\joannah{Laurie: Do we put our categorization table here?}
Depending on what changed in either syntax or semantics between Python 2 and 3, it is possible
that warning in \pygratethree will give us an opportunity to warn more accurately, with
less false positives. This in turn means less warnings to deal with for the user. For example,
if a method \lstinline{{M1} in python2 changes its name to \lstinline{{M2} in python3,
we get more false positives by adding \lstinline{M2} as an alias in \pygratetwo than we do
adding \lstinline{M1} as an alias in pygrate3. Note that aliasing is the best case scenario.
However, sometimes we also have to monitor the engineering trade offs of supporting the \lstinline{{M1}
syntax in pygrate3 for the warning purposes, it may be just impossible for both \lstinline{{M1} and
\lstinline{{M2} to coexist, case in point strings.
When Python transitioned to Python 3, not only did the syntax and semantics change, but the
Core team also took time to re-organize and clean up the implementation. It is not a black
and white decision to judge whether all decisions made in \pygratethree are cleaner.
In general several parts of the code are cleaner, some equally less modular and in general
other parts introduce complexity.
Modular code is generally easier to accommodate when warning compared to slots. When we have
to warn for features exposed as slots we have to either look for a version (\pygratetwo or \pygratethree)
with better modularity or change the implementation to be modular.
Executing warnings can introduce unwanted overhead for default language settings. To remedy this behavior,
we instead introduce optional flags \lstinline{{-2} for the \pygratethree variant and \lstinline{{-3} for
the \pygratetwo variant. This way the variants do not have to be completely throw-away.
Warnings are generated at all stages of the compilation process, by introspecting mostly higher order
abstractions, from for example the grammar, the abstract syntax tree, the compiler and the symbol
table to correctly identify both syntactic and semantic incompatibilities.
\subsection{Warning Anatomy}
Warnings are implemented as custom exceptions in both \pygratetwo and \pygratethree,
thereby introducing new exception classes \lstinline{Py2xWarning} and \lstinline{Py3xWarning}
respectively tied to their corresponding flags, instead of using a generic
\lstinline{Deprecation} type.
The CPython warnings module is extended to support compatibility warnings that capture
the warning message, potential fix and the level of the stack where the warning occurs.
The stack level helps us with other aspects related to dynamically and precisely warning
for certain changes where static warning is hard to achieve. The warning is however formatted
in a more useful way for display to the user to include even optionally a filename and line
number in addition to the message and fix in the following default format:
\lstinline{<Py3xWarning>: <message>: <fix>}
In the context of a language like Python, warnings are emitted at any phase of the interpreter.
Warnings are emitted on abstractions during scanning, parsing, and generating the symbol table
as well as in the standard library. When generating and processing the syntax as tokens, we inspect the tokens to identify any
incompatible token of interest for warning by performing basic string comparison. Likewise,
during parsing, we identify any syntax by inspecting the nodes of the abstract syntax tree,
and checking for any conditions in the nodes that require compatibility warnings:
\begin{lstlisting}[language=Python]
# Warn by inspecting tokens
c = tok_nextc(tok);
if (c == 'r' || c == 'R') {
if (Py_Py3XWarningFlag) {
warn()
}
}
#warn by inspecting the AST
ch = CHILD(ch, 1);
if (NCH(ch) != 1) {
if (Py_Py3kWarningFlag){
warn()
}
\end{lstlisting}
Warnings that relate to list comprehension and generators are much easier and accurately
tracked when generating the symbol table, therefore as we visit the statement blocks,
we inspect the statements in the block to check for any syntax or semantics that require
warning. Finally warnings are emitted in the standard library methods directly for
especially deprecated or renamed modules, written in either Python or C:
\begin{lstlisting}[language=Python]
#warn while building the symbol table
VISIT_IN_BLOCK(st, expr, elt, (void*)e);
if (Py_Py3kWarningFlag && st->st_cur->ste_generator) {
warn()
exit_block();
}
#warn in the standard library
#stringobject.py
static PyObject *
string_concat(register PyStringObject *a, register PyObject *bb)
{
...
if (Py_Py3kWarningFlag){
warn()
}
}
\end{lstlisting}
Most warnings are easy to emit by \emph{static} inspection of the syntax while others require
Some extra processing to dynamically and accurately extract the correct information required
for both the warning and the fix. An example of easy static inspection is the not equal
statement:
\begin{lstlisting}[language=Python]
int c2 = tok_nextc(tok);
int token = PyToken_TwoChars(c, c2);
if (Py_Py3kWarningFlag && token == NOTEQUAL && c == '<') {
if (PyErr_WarnExplicit(PyExc_Py3xWarning,
"<> not supported in 3.x; use !=",
tok->filename, tok->lineno,
NULL, NULL)){....}
\end{lstlisting}
By inspecting syntax from the outputs of the tokenizer, we can issue a warning when we
encounter the “<>” token quite easily. On the other hand when warning for the leaking
variable in list comprehension, we have do some extra processing to track the variable
in the list comp, in relation to the function where it is defined, to correctly warn
about the free variable:
\begin{lstlisting}[language=Python]
if (Py_Py3kWarningFlag) {
nc = asdl_seq_LEN(body);
for (i = 0; i < nc; i++) {
st = (stmt_ty)asdl_seq_GET(body, i);
islistcomp = compiler_islistcomp(st);
if (islistcomp) {
lstcomp = asdl_seq_GET(body, i);
var_of_int = asdl_seq_GET(lstcomp, 3);
for (y=i; y < nc; y++) {
var_st = (stmt_ty)asdl_seq_GET(body, y);
if ((var_st == var_of_int) &&
!ast_3x_warn(c, n, "This listcomp does not leak a variable in 3.x",
"assign the variable before use"))
return 0;
}
}
}
}
\end{lstlisting}
\section{Case Study}
To demonstrate the significance of TMVs, by providing a case study,
discussing our insight of the autonomy achieved towards reducing
Migration friction. For this case study, we port the Twisted, a Python
event-driven networking engine, that allows for easy custom network
application prototyping.
The main upstream fork of Twisted was ported to Python 3 about
5 years ago, so we searched back as far as May 2016, for an unported
revision of the project to facilitate our porting work. In terms of complexity
the revision we port contains the 19 modules supported by Twisted, and
had about 2663 tests skipped because they failed when the code ran on
Python 3.
Porting is done by first running the project tests on \pygratetwo, identifying
and fixing any warnings, then running the tests on \pygratethree to identify
any pending warnings. With this methodology, the assumption is that the
tests cover all the code paths in the project, which is usually the recommended
case, that to start any migration work, thorough test coverage should be
achieved for a realistic port. Table~\ref{} has a breakdown of the modifications
done to the code base for the major modules of the project.
\begin{table}
\begin{center}
\begin{tabular}{cccc}
Module name & Warnings issued & Impacted Tests & Total Tests \\ \midrule
Conch & 218 & 83 & 336 \\
Mail & 128 & 79 & 436 \\
words & 70 & 102 & 162 \\
logger & 182 & 8 & 182 \\ \bottomrule
\end{tabular}
\end{center}
\end{table}
\subsection{Discussion}
Much of porting Twisted is dealing with the complexities of strings, that did
not fully manage to find autonomy for but in addition, we discuss other major
areas of porting that were not trivial to engineer warnings.
\paragraph*{Strings and Bytes}
\section{Related work}
\subsection{Program Pattern Analysis}
Existing work has pattern languages like TXl~\cite{cordy06txl} and frameworks SCRUPLE~\cite{paul94framework} which
identify certain patterns in the program to be ported, transforming the programs according
to specified transformation rules.
Pattern analysis involves viewing or converting the source code in an intermediate form that
abates pattern identification. the two common cases involve, either converting the program into
a parse tree and feeding the modified parse tree to a compiler of the target language/version
or viewing the source code as a stream of strings~\cite{kontogiannis10code}.
Pattern idenfication for program migration is mostly achieved through pattern matching of the
original source code by matching strings and regular expressions~\cite{kontogiannis10code}
to identify statements that require modification to the target syntax. Tradititional pattern
maching can be extended genenerating parametized suffix tree introducing parametized strings
and pattern matching~\cite{baker93theory, baker95parameterized} to for
example identify duplicates in the code.
The Boyer-Moore-Type algorithm~\cite{watson03boyer} have been used to
perform efficient strings matching by comparing patterns with text identifying patterns
that can cause a shift rule to be invoked on the text from left to right or vice versa
before or after comparison to the pattern~\cite{baker95parameterized}. The original
Boyer-Moore-Type algorithm has gone through modifcations producing variations that make it run
faster~\cite{baker95parameterized, watson03boyer, waga16boyer}.
Patterns can also be identified through hashing algorithms that use finger print computations
to search for any source code changes or similarities between the source and target
user programs~\cite{watson03boyer}.
Patterns can equally be identified by maintaining and browsing a program database of the user
application, making queries on the code through the database on the files, functions and
variables~\cite{paul94framework}.
\subsection{Source Code Transformation}
For large code bases, the main aim of code transformation
tooling~\cite{guo03unique, kontogiannis10code}
is to make migration of extension modules to a new API a feasible
alternative, with strong syntactical guarantees. Semantic guarantees
are harder and thorough testing is required to verify the correctness
of the generated code.
We achieve code transformations through a specification using \emph{rules},
in the following format:
\lstinline{[pattern : action]}
The first part of a rule specification is a pattern in the code. The second corresponds to
the steps or procedures that must be taken when the first part matches. For example,
when performing the transformation for a block of code, the operation that manages the
conversion of a \lstinline{Statement} is invoked. A matching procedure isolates
the \lstinline{Statement} kind to be transformed and the relevant conversion
function is invoked.
\subsection{Patch Backporting}
When languages release new versions, it is common to maintain older versions in a branch
for use by programs that can not adopt early to new versions. Interventions either automated
or manual are used to port some critical features like security patches from the new version
to the old version~\cite{shariffdeen21automated}.
Backporting doesnt modfy the user program, instead the language is modified but depending
on the level of compatibility of the ported patch, the program can be modified.
\bibliographystyle{plain}
\bibliography{bib}
\end{document}