
%&pygrate_paper_preamble
\endofdump

\begin{document}
\maketitle

\begin{abstract}
\noindent
Many changes one might wish to make to a programming language break existing
programs, causing disruption and complaint amongst users. Over time, this
causes ever more of a language to be considered unchangeable.

In this paper, we
consider a notably disruptive change to a mainstream programming language:
the transition from Python 2 to Python 3. We introduce
the concept of \emph{Temporary Migration Variants} (TMV) --- intermediate
language versions which allow
code to be gradually migrated until the TMV is
no longer needed. We implement TMVs for both CPython 2.x and CPython 3.x,
allowing code to run with forward/backward compatible fixes and
warnings. Our simple TMVs, though neither complete nor perfect, suggest that
it is possible to keep languages evolving by providing the ability
for users to gradually migrate their code from one version to another.
\end{abstract}


\section{Introduction}

Programming languages evolve, either to fix existing imperfections, or because
of changes in the wider computing context. However, users -- for whom programming
languages are generally a means rather than an end -- often resist language evolution,
either because they worry that the language will become too complex,
or because their existing programs will no longer run correctly. In this paper
we are motivated by the latter challenge: many users have large, long-lived
codebases which they must keep running.

Though rarely discussed in the literature, there are many instances in
programming language history of users forcing a language's designers to
reconsider breaking changes, often in the early stages of design.
There are also instances, though harder to quantify, of languages
scaring away potential users after developing a reputation for `too frequent'
breaking changes. Language evolution thus ends up split in two: it
is relatively easy to add new features that don't interfere with existing
programs; but it is very hard to change existing features. This split
inevitably means that large parts of a language ossify.

An obvious way out of this conundrum is to automatically migrate programs as the
underlying language changes. Perhaps surprisingly, there are few automated
tools available to help do so. Some languages ship with tools
(e.g.~Python's \texttt{2to3} tool or Rust's \texttt{cargo
fix}) that can fix some changes, though these are typically restricted to
changes that can be detected syntactically, or with minimal static analysis.
As we shall see later, the effects of many language changes on a given program
can only be detected dynamically.

In this paper we introduce the concept of
\emph{Temporary Migration Variants}: variants of the `old' and `new'
versions of a language which reduce \emph{migration friction}. As well
as traditional static (i.e.~`compile-time') warnings of incompatibilities, TMVs aim to
deal with changes which can only be detected dynamically (i.e.~`run time').
Features can be backward or forward ported between the two versions, warning users of
unsupported, or changed, features while still allowing programs to run. Once
all warnings for a given program are fixed, the TMVs have done their job, and
the user can move wholesale to the `new' version of the language.

\begin{figure}[t]
\begin{center}
  \includegraphics[width=0.8\textwidth]{semantic_gap.pdf}
\end{center}
\caption{This paper's concepts in the context of Python 2 and Python 3. On
the top row we see the status quo: Python 2 and Python 3 have a substantial semantic gap.
Users who wish to migrate their programs from Python 2 to Python 3 must manually
bridge this gap. This involves running the program, detecting
undesirable changes (not all of which manifest as simple crashes), fixing
them, and then trying again. For much of this process the program will only
partly execute, making it difficult to understand the full effect of any changes made.
On the middle row we have the Platonic ideal of TMVs: variants of Python 2 and
Python 3 that have no semantic gap, and thus allow the smoothest possible migration
between the two. On the bottom row we have the reality of the \pygratetwo and
\pygratethree TMVs: a much smaller, though still perceptible, semantic gap
compared to Python 2 and Python 3. \pygratetwo and \pygratethree make
migrating programs from Python 2 to Python 3 easier, though they cannot
remove all sources of migration friction.}
\label{fig:semantic_gap}
\end{figure}

As a concrete case study, we explore TMVs in the face of the most
disruptive change to a mainstream language in living memory: the transition
from Python 2 to Python 3. We create two TMVs: \pygratetwo, a variant of
CPython 2.7 which warns when users use features that are missing or
incompatible in Python 3.x, and which backports some features from Python 3.x;
and \pygratethree, a variant of CPython 3.12 which forward ports some features
from Python 2.x, warning when they are used. \pygratetwo and
\pygratethree narrow the semantic gap between Python 2 and Python 3,
allowing users to keep programs running during much of the migration process (see
\cref{fig:semantic_gap} for a visualisation of this idea). We migrate
several programs, including \emph{Twisted}, an event-driven networking engine
implemented in Python, showing that our TMVs do meaningfully ease
migration, even though neither yet supports all the differences between Python
2 and Python 3.

This paper thus serves two purposes: it introduces TMVs and provides a case study
of their use. Although we do not want to over-generalise from a single
example, \pygratetwo and \pygratethree suggest that TMVs meaningfully reduce
migration friction. This concept may prove useful for other languages, including
for languages which are very different from Python, in future migrations.


\section{The Migration Challenge}

Most readers will probably have experienced the need to adjust one of their
programs after a programming language has evolved. For example, to print a
value to \lstinline{stdout} in Python 2 one could write:

\begin{lstlisting}[language=Python]
print 2+3
\end{lstlisting}

However, this program fails to compile\footnote{Although often
described as `interpreted' languages, most implementations of languages like
Python first compile programs to another representation (sometimes called
`bytecode') which they then use as a basis for execution.} with Python 3,
reporting a syntax error. Python 2 has a \lstinline{print} statement built-in
to the language's grammar, but Python 3 does not.
Instead a global \lstinline{print} function is provided, which must be called
in the same way as other functions:

\begin{lstlisting}[language=Python]
print(2+3)
\end{lstlisting}

Since \lstinline{print} statements in Python 2 are common, fixing all of
them when migrating to Python 3 may appear tiresome. Fortunately, because
\lstinline{print} is a grammar-level construct in Python 2, it is possible
to statically pinpoint each place that needs fixing with 100\% accuracy. Python 3 comes with a tool called
\lstinline{2to3} which takes as input Python source code and reports, or (with
the `\lstinline{-w}' flag) modifies each \lstinline{print} statement which
needs fixing.

Unfortunately, not all such changes between Python 2 and Python 3 are
handled quite so easily. Consider this Python 2 fragment:

\begin{lstlisting}[language=Python]
x = cmp(2, 3)
\end{lstlisting}

The global \lstinline{cmp(x, y)} function in Python 2 compares its two arguments
and returns -1 if $x < y$, 0 if $x = y$, or 1 if $x > y$ --- in this example,
\lstinline{x} will be assigned the value -1. \lstinline{cmp} is
often used as the basis for determining the order to sort collections such
as lists.

Python 3 removes the global \lstinline{cmp} function, causing the fragment
above to produce a generic `\lstinline{name not defined}' error. Unfortunately,
the \lstinline{2to3} tool does not produce a warning when it sees uses of \lstinline{cmp},
let alone attempt to automatically fix them. The probable reason for this is that,
unlike \lstinline{print}, it is impossible to statically identify calls to
\lstinline{cmp} with 100\% accuracy. For example, the user can introduce
their own local function called \lstinline{cmp} which overrides the global
function or import a \lstinline{cmp} function from another module. Such examples
may seem like they can be statically detected, but it is easy to use Python's
dynamic features to alter namespaces. A deliberately ludicrous, but fully legal,
example shows the extent of this issue:

\begin{lstlisting}[language=Python]
self_mod = __import__(__name__)
n = input()
setattr(self_mod.__builtins__, n, "xyz")
print(cmp)
\end{lstlisting}

This example first gets a reference to the current module using a dynamic name
space lookup (line 1). It then uses \lstinline{input} to read a string from
the user at run-time (line 2). The value of that string is then used to
create, or if it already exists alter, a global variable,
assigning to it the value of \lstinline{xyz} (line 3). Finally, the current value
of the \lstinline{cmp} global variable is printed out (line 4). For all but one
user input, this will print \lstinline{<built-in function cmp>} --- but if the
user happens to enter the string \lstinline{cmp} then it will print out
\lstinline{xyz}. The point of this example is not, to put it mildly, to showcase good programming
practice, but to show that Python's highly dynamic nature defeats most seemingly
reasonable static analyses.

The impossibility of fully deriving a program's dynamic behaviour from a static
examination presents a difficult choice for static migration tools such as
\lstinline{2to3}. In this case, \lstinline{2to3} can either: assume that all
uses of the name \lstinline{cmp} point to the global function, risking false
positives; or assume that it can't tell, guaranteeing false negatives.
Neither outcome is desirable, but false positives are arguably worse
since they cause the user's source code to be incorrectly changed.
Perhaps for this reason, \lstinline{2to3} (and other similar tools we are
aware of) nearly always err on the side of caution, heavily favouring
false negatives over false positives.

The ease of migrating \lstinline{print}, and the difficulty of migrating
\lstinline{cmp}, between Python 2 and Python 3 show the opportunities and
challenges that language designers must consider when evolving a language.
Those changes (like
\lstinline{print}) that can be statically detected and fixed can be made with little fear
of substantial user complaint. However, changes which can only be detected
dynamically are far more dangerous. This heavily constrains the possible
changes that most language designers will consider plausible.

In this paper we are therefore
chiefly concerned with what happens when language designers wish to
consider, or actually carry out, changes to a language which can only be
detected dynamically. Must the burden of migrating programs be placed entirely
on users or can we help them carry out their migration? If so, what might `help'
look like and to what degree does it actually help?


\section{Temporary Migration Variants}

The underlying problem that motivates our work is that many desirable changes
to a programming language break user programs in ways that can only
reasonably be detected dynamically. Our solution to this problem is the concept
of \emph{Temporary Migration Variants}: implementations of the `old' and `new'
versions of a language which narrow the semantic gap, helping ease the
migration path. As we shall see, this basic concept can come in several flavours,
though we explore only one in depth in this paper. In the rest of this section
we present a semi-formal definition of TMVs to better clarify some of the
concepts we use in the rest of the paper.

We define a \emph{programming language} (henceforth just `language') $L$ to
have \emph{semantics} that are a combination of the `core'
language (e.g.~syntax and dynamic execution rules) and its standard libraries.
For a given language $L$ we have an `old' version of the language with
semantics $L_O$ and a `new' version with semantics $L_N$. We have corresponding
\emph{implementations} of each language $I_O$ and $I_N$. For brevity's sake,
we conflate `language' and `implementation' except in those cases where it is
important to separate the two.

The challenge we address is that an arbitrary program $P$ written for $L_O$
may \emph{break} when executed on $L_N$: it may fail
compile-time checks (and thus not compile); it may run and \emph{abort}
execution when it tries to use a feature that has changed or been removed; or
it may run and \emph{diverge} from the expected execution path.

TMVs are variants of both a language's semantics and its corresponding
implementation. TMVs can be defined for both $L_O$ and $L_N$ though
there are subtle asymmetries in what the resulting TMVs can sensibly do.

If our starting point is $L_O$ then
defining a TMV requires defining a `new' language $L'\!_O$. At the very
least, we expect $L'\!_O$ to produce warnings when a program uses features
not supported in $L_N$. We also generally expect $L'\!_O$ to (fully or
partially) backport features from $L_N$, at which point $L'\!_O$ becomes
a true `intermediate' language. A program $P$ written for $L_O$ that has
been modified to $P'$ for $L'\!_O$ may no longer be correct for $L_O$ or
$L_N$ (though, relative to $P$, we would expect $P'$
to break in fewer ways under $L_N$ and $I_N$).

If our starting point is $L_N$ then a TMV $L'\!_N$ does not meaningfully have
the option to simply produce warnings for features not supported in $L_O$,
since those features are unsupported by definition. Instead, for $L'\!_N$ to be useful, it
must forward port (fully or partially) features from $L_O$. Doing so allows
$L'\!_N$ to warn when such features are used. In other words, it doesn't make
sense for $L'\!_N$ to produce warnings if it has not at least partly forward ported
the feature it is warning about.
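
As a rough, user-level illustration of this forward-port-and-warn behaviour, the
following sketch reinstates a \lstinline{cmp} built-in on Python 3 that warns
each time it is called. A real TMV would do this inside the interpreter; here
\lstinline{MigrationWarning} and \lstinline{install_forward_port} are
hypothetical names chosen only for the example.

\begin{lstlisting}[language=Python]
import builtins
import warnings


class MigrationWarning(Warning):
    """Hypothetical category for uses of forward-ported features."""


def _forward_ported_cmp(x, y):
    # Python 2 semantics: -1 if x < y, 0 if x == y, 1 if x > y.
    warnings.warn(
        "cmp() is not supported in Python 3; use comparison operators",
        MigrationWarning,
        stacklevel=2,  # attribute the warning to the caller
    )
    return (x > y) - (x < y)


def install_forward_port():
    # Make cmp visible as a built-in again, as it was in Python 2.
    builtins.cmp = _forward_ported_cmp


install_forward_port()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = cmp(2, 3)  # the program keeps running, but the use is reported
\end{lstlisting}

The call still returns a result, so the program continues to run, while the
recorded warning tells the user which call site remains to be migrated.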

It may not be obvious why it is useful to have both $L'\!_O$ and $L'\!_N$ ---
indeed, we at first assumed that only $L'\!_O$ was really necessary. However,
we eventually realised that false positives on $I'\!_O$ are common, because $P$
will rely on libraries for $L_O$ for which no partially migrated version exists
(though, in many cases, a fully migrated version for $L_N$ does exist). In
contrast, warnings on $L'\!_N$ are rarely false positives, since it is unlikely
that someone would try migrating $P$ before, or until, the associated library is
migrated. See \laurie{Section XYZ} for an example.

The definition of TMVs above has deliberately presented them as two
absolutes ($L'\!_O$ and $L'\!_N$), partly for simplicity, and partly because
that is how our case study is structured. However, conceptually, there is no
reason that one could not have multiple TMVs between $L_O$ and $L_N$: when a
particularly disruptive change occurs, it may be useful, perhaps even
necessary, to migrate in stages.


\section{\pygratetwo and \pygratethree}

\begin{figure}[t!]
     \centering
     \begin{subfigure}[b]{0.8\columnwidth}
         \centering
         \includegraphics[width=\columnwidth]{pygratetwo_ex.pdf}
         \caption{An example \pygratetwo warning for the Python 2 \lstinline{file()} built-in.}
         \label{fig:pygratetwo_example}
     \end{subfigure}
     \hfill
     \begin{subfigure}[b]{0.8\columnwidth}
         \centering
         \includegraphics[width=\columnwidth]{pygratethree_ex.pdf}
         \caption{An example \pygratethree warning for the Python 2 \lstinline{cmp()} function.}
         \label{fig:pygratethree_example}
     \end{subfigure}
      \caption{Example warnings from the two TMVs. The warning in Figure~\ref{fig:pygratetwo_example}
      reports a use of \lstinline{file()}, which is not available in Python 3, and suggests fixing the
      code with \lstinline{open()}, which is available in both versions. Here \lstinline{open()} is
      backported to \pygratetwo, where it coexists with \lstinline{file()}, allowing a gradual migration.
      The \lstinline{cmp()} function in Figure~\ref{fig:pygratethree_example} was removed in Python 3
      in favour of comparison operators such as \lstinline{>}, \lstinline{<} and \lstinline{==}. So that
      user code keeps working, \lstinline{cmp()} is forward ported to \pygratethree, which also makes
      the warning possible.}
      \label{fig:pygrate_examples}
\end{figure}

In this paper, we use the challenge of migrating code from Python 2 to Python 3
as a concrete case study. We introduce two TMVs: \pygratetwo, a modification of
CPython 2.7.\laurie{which point version exactly?}; and \pygratethree, a
modification of CPython 3.\laurie{which version exactly?}.

Both TMVs provide various warnings and, when possible, suggest fixes.
\pygratetwo warns when features present in Python 2 are not present, or work
differently, in Python 3; \pygratetwo also backports some features from Python
3 so that users can fix more warnings while still having a running program.
\pygratethree forward ports features from Python 2 and warns when they are
used. As shown in \cref{fig:semantic_gap}, \pygratetwo and \pygratethree reduce
the semantic gap between Python 2 and Python 3, though they do not yet fully remove
it. We estimate that we have reduced the semantic gap to around 40\% of its original
size: further work could reduce that further, though \laurie{``for reasons
we will show later''?} it is unlikely that the gap can be eliminated entirely.

\subsection{Overview}

The goal of both \pygratetwo and \pygratethree is to give warnings that carry useful information,
such as a message, a suggested fix, and a line number, which the user can use to adjust their code
in a realistic way. This contrasts with syntax errors, which simply stop the program without any
immediately useful information.

We assume that the user first runs all their code under \pygratetwo, fixing every warning using
the hints provided about potential fixes. They then run their code under \pygratethree, again
fixing every warning. At that point we hope that they can run their code unchanged in Python 3.

Figure~\ref{fig:pygrate_examples} shows an example warning from each of \pygratetwo and \pygratethree.
The case in Figure~\ref{} is the simpler one: to allow a gentle migration, \lstinline{open()} is
backported to \pygratetwo. The second case, where \lstinline{cmp()} is forward ported, is much more
interesting. There are three flavours of \lstinline{cmp}: the dunder method \lstinline{__cmp__}, the
function \lstinline{cmp()}, and the \lstinline{cmp} argument. We describe each in turn. For the dunder
method, the \lstinline{__cmp__()} definition must be replaced with the corresponding rich comparison
methods such as \lstinline{__eq__} and \lstinline{__lt__}. Since the \lstinline{cmp()} function was
removed in Python 3, we have no option but to provide our own implementation of it for compatibility.

Where that implementation lives affects how \lstinline{cmp()} is called. As a built-in, it requires
no import; if our implementation were instead placed in a utility library, an import would be
required, which might in turn need to be made conditional on running under Python 3. Sorting can
also take a \lstinline{cmp} argument, which was removed in Python 3. Fixing this requires replacing
every use of the \lstinline{cmp} argument with a \lstinline{key} function; a \lstinline{cmp}-style
function can be wrapped for this purpose in a manner similar to \lstinline{functools.cmp_to_key()}.
In summary, we must issue warnings for both \lstinline{__cmp__} and the \lstinline{cmp()} function
because neither is supported in Python 3, while taking care of differences such as \lstinline{__ne__}
delegating to \lstinline{__eq__}. Uses of the \lstinline{cmp} argument also need a warning, because
the argument should be replaced with a \lstinline{key} function.
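
The three cases can be illustrated with a short Python 3 sketch. It is not the
\pygratethree implementation: \lstinline{cmp_compat} and \lstinline{Version} are
hypothetical names, and only \lstinline{functools.cmp_to_key} and
\lstinline{functools.total_ordering} come from the standard library.

\begin{lstlisting}[language=Python]
import functools


def cmp_compat(a, b):
    # Stand-in for the removed cmp() built-in.
    return (a > b) - (a < b)


# The cmp argument to sorting: wrap the comparison function as a key.
words = ["pear", "fig", "banana"]
by_length = sorted(
    words,
    key=functools.cmp_to_key(lambda a, b: cmp_compat(len(a), len(b))),
)


# __cmp__: replaced by rich comparison methods.
@functools.total_ordering
class Version:
    def __init__(self, major, minor):
        self.key = (major, minor)

    def __eq__(self, other):
        return self.key == other.key

    def __lt__(self, other):
        return self.key < other.key
    # In Python 3, __ne__ is derived from __eq__ by default.
\end{lstlisting}

In this sketch \lstinline{total_ordering} fills in the remaining ordering
methods from \lstinline{__eq__} and \lstinline{__lt__}, which keeps the rewrite
of a \lstinline{__cmp__} method short.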


Our general rationale is to issue warnings in \pygratetwo by default, in as many cases as is
technically possible. We move warnings to \pygratethree in two circumstances: 1) when more
precision is needed; and 2) when implementation constraints require it.

\joannah{Laurie: Do we put our categorization table here?}

Depending on what changed in syntax or semantics between Python 2 and Python 3, warning in
\pygratethree can give us an opportunity to warn more accurately, with fewer false positives.
This in turn means fewer warnings for the user to deal with. For example, if a method
\lstinline{M1} in Python 2 is renamed to \lstinline{M2} in Python 3, we get more false positives
by adding \lstinline{M2} as an alias in \pygratetwo than we do by adding \lstinline{M1} as an
alias in \pygratethree. Note that aliasing is the best-case scenario. We must sometimes also weigh
the engineering trade-offs of supporting \lstinline{M1} in \pygratethree purely for warning
purposes: it may simply be impossible for \lstinline{M1} and \lstinline{M2} to coexist, strings
being a case in point.

When Python transitioned to Python 3, not only did the syntax and semantics change, but the
core team also took the time to reorganise and clean up the implementation. Whether every
decision inherited by \pygratethree is cleaner is not a black-and-white judgement: several
parts of the code are cleaner, some are less modular, and others introduce new complexity.

Modular code is generally easier to add warnings to than code exposed through slots. When we have
to warn for features exposed as slots, we either look for the variant (\pygratetwo or \pygratethree)
with better modularity, or change the implementation to be more modular.

Emitting warnings can introduce unwanted overhead when the language is used with its default
settings. To avoid this, warnings are enabled by optional flags: \lstinline{-2} for the
\pygratethree variant and \lstinline{-3} for the \pygratetwo variant. This way the variants do
not have to be completely throw-away.

Warnings are generated at all stages of the compilation process, mostly by inspecting higher-level
abstractions such as the grammar, the abstract syntax tree, the compiler, and the symbol table,
in order to correctly identify both syntactic and semantic incompatibilities.

\subsection{Warning Anatomy}

Warnings are implemented as custom exceptions in both \pygratetwo and \pygratethree.
Rather than using the generic \lstinline{DeprecationWarning} type, we introduce new exception
classes, \lstinline{Py3xWarning} and \lstinline{Py2xWarning} respectively, each tied to its
corresponding flag.

The CPython \lstinline{warnings} module is extended to support compatibility warnings that capture
the warning message, a potential fix, and the level of the stack at which the warning occurs.
The stack level helps us to warn dynamically and precisely about changes for which static
warning is hard to achieve. When displayed to the user, the warning is formatted to include the
message and fix, and optionally a filename and line number, in the following default format:

\lstinline{<Py3xWarning>: <message>: <fix>}
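
To make the roles of the message, fix, and stack level concrete, the following
pure-Python sketch mimics this shape with the standard \lstinline{warnings}
module. The TMVs themselves implement warnings inside CPython; the helper
\lstinline{warn_py3x} and the way the category is declared here are assumptions
made for illustration.

\begin{lstlisting}[language=Python]
import warnings


class Py3xWarning(Warning):
    """Sketch of a dedicated category for Python 3 incompatibilities."""


def warn_py3x(message, fix, stacklevel=2):
    # Combine message and fix as "<message>: <fix>". With stacklevel=2
    # the warning is attributed to the caller of warn_py3x.
    warnings.warn("%s: %s" % (message, fix), Py3xWarning, stacklevel=stacklevel)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_py3x("file() not supported in 3.x", "use open()")

w = caught[0]
formatted = "<%s>: %s" % (w.category.__name__, w.message)
# The recorded warning also carries the optional file name and line number.
print(formatted, "(%s:%d)" % (w.filename, w.lineno))
\end{lstlisting}

A dedicated category, rather than a generic one, lets users filter migration
warnings separately from any other warnings their program produces.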

In a language like Python, warnings can be emitted in any phase of the interpreter: during
scanning, parsing, and symbol table generation, as well as in the standard library. When
tokenising the source, we inspect each token, using basic string comparison to identify
incompatible tokens that need a warning. Likewise, during parsing, we inspect the nodes of the
abstract syntax tree, checking for conditions that require compatibility warnings:

\begin{lstlisting}[language=C]
/* Warn by inspecting tokens */
c = tok_nextc(tok);
if (c == 'r' || c == 'R') {
  if (Py_Py3kWarningFlag) {
    warn();
  }
}
/* Warn by inspecting the AST */
ch = CHILD(ch, 1);
if (NCH(ch) != 1) {
  if (Py_Py3kWarningFlag) {
    warn();
  }
}
\end{lstlisting}

Warnings that relate to list comprehensions and generators are tracked more easily and accurately
when generating the symbol table. As we visit statement blocks, we therefore inspect the
statements in each block for syntax or semantics that require a warning. Finally, warnings are
emitted directly in standard library code, written in either Python or C, especially for
deprecated or renamed modules:

\begin{lstlisting}[language=C]
/* Warn while building the symbol table */
VISIT_IN_BLOCK(st, expr, elt, (void*)e);
if (Py_Py3kWarningFlag && st->st_cur->ste_generator) {
  warn();
  exit_block();
}
/* Warn in the standard library (stringobject.c) */
static PyObject *
string_concat(register PyStringObject *a, register PyObject *bb)
{
  ...
  if (Py_Py3kWarningFlag) {
    warn();
  }
}
\end{lstlisting}

Most warnings are easy to emit by \emph{static} inspection of the syntax, while others require
some extra processing to dynamically and accurately extract the information required
for both the warning and the fix. An example of easy static inspection is the \lstinline{<>}
not-equal operator:

\begin{lstlisting}[language=C]
int c2 = tok_nextc(tok);
int token = PyToken_TwoChars(c, c2);
if (Py_Py3kWarningFlag && token == NOTEQUAL && c == '<') {
  if (PyErr_WarnExplicit(PyExc_Py3xWarning,
                         "<> not supported in 3.x; use !=",
                         tok->filename, tok->lineno,
                         NULL, NULL)) {
    /* ... */
  }
}
\end{lstlisting}

By inspecting the output of the tokenizer, we can quite easily issue a warning when we
encounter the \lstinline{<>} token. On the other hand, when warning about the variable leaked
by a list comprehension, we have to do some extra processing to track the variable
in the list comprehension, in relation to the function where it is defined, in order to
correctly warn about the free variable:

\begin{lstlisting}[language=C]
if (Py_Py3kWarningFlag) {
    nc = asdl_seq_LEN(body);
    for (i = 0; i < nc; i++) {
        st = (stmt_ty)asdl_seq_GET(body, i);
        islistcomp = compiler_islistcomp(st);
        if (islistcomp) {
            lstcomp = asdl_seq_GET(body, i);
            var_of_int = asdl_seq_GET(lstcomp, 3);
            for (y = i; y < nc; y++) {
                var_st = (stmt_ty)asdl_seq_GET(body, y);
                if ((var_st == var_of_int) &&
                    !ast_3x_warn(c, n, "This listcomp does not leak a variable in 3.x",
                                 "assign the variable before use"))
                    return 0;
            }
        }
    }
}
\end{lstlisting}
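
The behaviour that this warning targets can be seen from the user's side. The
sketch below runs on Python 3, where the comprehension variable is local to the
comprehension; on Python 2 the same code would rebind the outer \lstinline{x}.
The function name is chosen only for illustration.

\begin{lstlisting}[language=Python]
def value_after_comprehension():
    x = "outer"
    squares = [x * x for x in range(4)]
    # Python 3: x is still "outer" here.
    # Python 2: x would have leaked out of the comprehension as 3.
    return x, squares


x_after, squares = value_after_comprehension()
\end{lstlisting}

Code that relies on the leaked value silently changes meaning under Python 3,
which is why the suggested fix is to assign the variable explicitly before use.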

\section{Case Study}

To demonstrate the significance of TMVs, we provide a case study and
discuss what it shows about the degree of automation achieved in reducing
migration friction. For this case study, we port Twisted, an event-driven
networking engine written in Python that allows easy prototyping of custom
network applications.

The main upstream fork of Twisted was ported to Python 3 about
5 years ago, so we searched back as far as May 2016 for an unported
revision of the project on which to carry out our porting work. In terms of
complexity, the revision we port contains the 19 modules supported by Twisted,
and had about 2663 tests that were skipped because they failed when the code
ran on Python 3.

Porting is done by first running the project's tests on \pygratetwo, identifying
and fixing any warnings, and then running the tests on \pygratethree to identify
any remaining warnings. This methodology assumes that the tests cover all
code paths in the project. Thorough test coverage is in any case the usual
recommendation before starting migration work, since it is needed for a
realistic port. Table~\ref{} gives a breakdown of the modifications
made to the code base for the major modules of the project.

\begin{table}
\begin{center}
\begin{tabular}{lrrr}
    \toprule
    Module & Warnings issued & Impacted tests & Total tests \\ \midrule
    conch & 218 & 83 & 336 \\
    mail & 128 & 79 & 436 \\
    words & 70 & 102 & 162 \\
    logger & 182 & 8 & 182 \\ \bottomrule
\end{tabular}
\end{center}
\end{table}
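
Per-module warning counts of this kind could be gathered along the following
lines. This is only a sketch of the idea, not the tooling used for the case
study: the file names are made up, \lstinline{tally_by_module} is a hypothetical
helper, and \lstinline{Py3xWarning} is declared in Python as a stand-in for the
TMV category.

\begin{lstlisting}[language=Python]
import collections
import warnings


class Py3xWarning(Warning):
    """Hypothetical stand-in for the TMV warning category."""


def tally_by_module(recorded, category):
    # Count recorded warnings of the given category, grouped by the
    # first component of the file path in which they were raised.
    counts = collections.Counter()
    for w in recorded:
        if issubclass(w.category, category):
            counts[w.filename.split("/")[0]] += 1
    return counts


# Made-up warning locations, standing in for a real test run.
locations = [("conch/ssh.py", 10), ("conch/ssh.py", 42), ("mail/smtp.py", 7)]

with warnings.catch_warnings(record=True) as recorded:
    warnings.simplefilter("always")
    for filename, lineno in locations:
        warnings.warn_explicit(
            "<> not supported in 3.x: use !=", Py3xWarning, filename, lineno
        )

counts = tally_by_module(recorded, Py3xWarning)
\end{lstlisting}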

\subsection{Discussion}

Much of the effort in porting Twisted lies in dealing with the complexities of
strings, which we did not manage to fully automate. In addition, we discuss other
major areas of porting for which it was not trivial to engineer warnings.

\paragraph*{Strings and Bytes}


\section{Related work}

\subsection{Program Pattern Analysis}

Existing work includes pattern languages such as TXL~\cite{cordy06txl} and frameworks such as
SCRUPLE~\cite{paul94framework}, which identify certain patterns in the program to be ported and
transform the program according to specified transformation rules.

Pattern analysis involves viewing, or converting, the source code in an intermediate form that
eases pattern identification. The two common approaches are either to convert the program into
a parse tree and feed the modified parse tree to a compiler for the target language or version,
or to view the source code as a stream of strings~\cite{kontogiannis10code}.

Pattern identification for program migration is mostly achieved by matching strings and
regular expressions against the original source code~\cite{kontogiannis10code},
to identify statements that must be modified to fit the target syntax. Traditional pattern
matching can be extended with parameterized strings and parameterized suffix
trees~\cite{baker93theory, baker95parameterized} to, for example, identify duplicates in the code.

Boyer-Moore-type algorithms~\cite{watson03boyer} have been used to
perform efficient string matching. They compare a pattern with the text, and use shift rules
to move the pattern along the text, from left to right or vice versa,
before or after a comparison~\cite{baker95parameterized}. The original
Boyer-Moore algorithm has been modified many times, producing variations that run
faster~\cite{baker95parameterized, watson03boyer, waga16boyer}.
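
As an illustration of a shift rule, the following sketch implements the
Horspool simplification of Boyer-Moore, which uses a single shift table. It is
a textbook variant chosen for brevity rather than any of the specific
algorithms in the cited work, and \lstinline{horspool_search} is a name
introduced here.

\begin{lstlisting}[language=Python]
def horspool_search(text, pattern):
    """Return the index of the first match of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Shift table: for each character of the pattern except the last,
    # the distance from its last occurrence to the end of the pattern.
    shift = {ch: m - i - 1 for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos + m <= n:
        # Compare right to left within the current window.
        i = m - 1
        while i >= 0 and text[pos + i] == pattern[i]:
            i -= 1
        if i < 0:
            return pos
        # Shift according to the last character of the window.
        pos += shift.get(text[pos + m - 1], m)
    return -1


index = horspool_search("x = cmp(2, 3)", "cmp(")
\end{lstlisting}

Because characters that do not occur in the pattern allow a shift of the full
pattern length, large parts of the text are never examined.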

Patterns can also be identified through hashing algorithms that use fingerprint computations
to search for source code changes, or similarities, between the source and target
user programs~\cite{watson03boyer}.

Patterns can equally be identified by maintaining a program database for the user
application and querying it for information about the code's files, functions and
variables~\cite{paul94framework}.


\subsection{Source Code Transformation}

For large code bases, the main aim of code transformation
tooling~\cite{guo03unique, kontogiannis10code}
is to make migration of extension modules to a new API a feasible
alternative, with strong syntactic guarantees. Semantic guarantees
are harder to provide, and thorough testing is required to verify the
correctness of the generated code.

We achieve code transformations through a specification using \emph{rules}, 
in the following format: 

\lstinline{[pattern : action]} 

The first part of a rule is a pattern to be matched in the code. The second part gives
the steps that must be taken when the pattern matches. For example,
when transforming a block of code, the operation that manages the
conversion of a \lstinline{Statement} is invoked. A matching procedure identifies
the kind of \lstinline{Statement} to be transformed, and the relevant conversion
function is then invoked.
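
A rule of this shape can be sketched with Python's \lstinline{ast} module, with
the pattern as a predicate on a syntax tree node and the action as a function
that rewrites it. The rule below, which turns calls to \lstinline{file} into
calls to \lstinline{open}, and all the names in the sketch are illustrative
assumptions. It requires Python 3.9 or later for \lstinline{ast.unparse}.

\begin{lstlisting}[language=Python]
import ast


def is_file_call(node):
    # Pattern: a call whose callee is the bare name "file".
    return (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "file"
    )


def use_open(node):
    # Action: rename the callee, keeping the arguments unchanged.
    new_name = ast.Name(id="open", ctx=ast.Load())
    node.func = ast.copy_location(new_name, node.func)
    return node


RULES = [(is_file_call, use_open)]


class RuleApplier(ast.NodeTransformer):
    def generic_visit(self, node):
        # Transform children first, then try each rule on this node.
        node = super().generic_visit(node)
        for pattern, action in RULES:
            if pattern(node):
                node = action(node)
        return node


def transform(source):
    tree = RuleApplier().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))


rewritten = transform("f = file('log.txt')")
\end{lstlisting}

Like any purely syntactic rule, this one would also rewrite calls to a
user-defined function that happens to be named \lstinline{file}.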

\subsection{Patch Backporting}

When languages release new versions, it is common to maintain older versions in a branch
for use by programs that cannot yet move to the new version. Automated or manual
interventions are used to port critical changes, such as security patches, from the new version
to the old version~\cite{shariffdeen21automated}.

Backporting does not modify the user program: instead, the language is modified. However,
depending on how compatible the ported patch is, the program may also need to be modified.


\bibliographystyle{plain}
\bibliography{bib}

\end{document}