
\section{Beyond Interactivity}
\label{sec:discussion}
% Intro

GoTcha is a good first step toward the interactive debugging of distributed applications. By relying on version control of objects, and exposing the version history at every node, we meet the minimum requirements for observing all forms of state changes in distributed applications. GoTcha fills a gap in the tools available for debugging distributed applications: interactivity. Interactivity has been an elusive piece in this ecosystem, and not much is known about how it can be used in a distributed context. With GoTcha, we see both potential and challenges in the future development of interactive debuggers for distributed applications. Moreover, we envision the development of powerful tools by combining GoTcha with existing concepts related to distributed debugging. In this section, we expand on the potential and challenges of this vision.


\subsection{Scalability of Interactivity}
% What property of interactive debuggers depends on scalability
An inherent property of traditional interactive debuggers is their reliance on the user to explore the possible paths of failures. This is no less true for GoTcha. However, in large systems with a large number of possibilities to observe, this interactivity can overwhelm the user. Interactive debuggers for single-threaded systems ignore this complexity by design, hoping that the advantages offered by the live exploration of the execution of code compensate for the disadvantage of not being able to explore every path and fix every issue. This exploration space is much larger in distributed systems than in single-threaded systems because there are more types of state changes that have to be considered. Therefore, in GoTcha, the advantages offered by live exploration of the execution depend heavily on the scalability of both the debugging system and the user interface as the number of GoT nodes in the application being debugged increases.

% What is the problem with scaling the debugging system
\subsubsection{Scaling the Debugging System}
In an application with a small number of nodes, the exploration of the execution can be easily visualized and followed, and the centralized approach of GoTcha does not hinder the debugging process. However, in applications with a large number of nodes, which is common in a distributed setting, the centralized approach can be a bottleneck. Since every primitive of the dataframe involved in reading or writing to the version history has to be rerouted through the GCN, execution of the application through the debugger is much slower. It would also take longer to hit the conditional breakpoints, potentially making the whole debugging process tedious. An easy solution, and one that works with GoTcha as it is, would be to reduce the number of GoT nodes in the application during debugging. The user could debug issues in this much smaller application, fix the problems, and then scale the number of nodes back up. However, this approach might not always be possible and, therefore, changes to the debugging architecture might be needed to solve this problem.

% Looking at what the problem is.
To understand the difficulties involved, let us look at the flow of interactive debugging. GoTcha essentially executes in two modes. The first is a ``free-run'' mode, where the application executes as it would under normal conditions until it hits a breakpoint. The second is the slower and more deliberate ``step-by-step'' mode, which activates when the free-run mode matches a breakpoint, or when the user is exploring execution paths. In the second mode, the user has control over the execution and is observing a small and very specific part of the entire application. The design of GoTcha is tailored towards the step-by-step mode. The central GCN helps coordinate these steps, and visualizations are created with this mode in mind. However, the same central GCN that enables total control during the step-by-step mode is a bottleneck in the free-run mode when the number of GoT nodes becomes too large. A distributed approach to debugging would be as scalable as the distributed application it is debugging, but only during the free-run mode. Such an approach would again have a hard time scaling with the number of nodes when the debugger has to control the application in the step-by-step mode. This mismatch between the designs and the modes that interactive debuggers work in is the underlying reason why the advantages of interactivity in debugging distributed systems diminish with scale.

% how to mitigate it
A possible solution is to change the architecture of GoTcha in each of the modes, matching each design with the mode that it works best in. In the free-run mode, nodes could communicate directly with each other and log their activity with the GCN. Each node would also receive the list of breakpoint conditions. When a breakpoint is hit, the GCN would receive this information and then instruct every other node to switch to the step-by-step mode, taking control of the application and handing it over to the user.
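
The mode switch described above can be illustrated with a minimal, single-process Python sketch. It is not part of GoTcha: the names \texttt{Mode}, \texttt{Coordinator} (standing in for the GCN), and \texttt{Node} are hypothetical, breakpoints are assumed to be predicates over a node's local state, and the instruction to pause is delivered instantly, so the sketch does not model nodes that keep running while that instruction is in transit.

\begin{verbatim}
from enum import Enum


class Mode(Enum):
    FREE_RUN = "free-run"
    STEP_BY_STEP = "step-by-step"


class Coordinator:
    """Hypothetical stand-in for the GCN (not the GoTcha API)."""

    def __init__(self, breakpoints):
        # Assumption: a breakpoint is a predicate over a node's state.
        self.breakpoints = list(breakpoints)
        self.nodes = []
        self.log = []

    def register(self, node):
        # Every node receives the list of breakpoint conditions.
        node.breakpoints = list(self.breakpoints)
        self.nodes.append(node)

    def record(self, name, state):
        # Free-run activity is only logged, not routed, here.
        self.log.append((name, dict(state)))

    def breakpoint_hit(self, name):
        # Instruct every node to hand control over to the user.
        for node in self.nodes:
            node.mode = Mode.STEP_BY_STEP


class Node:
    """Hypothetical application node."""

    def __init__(self, name, coordinator):
        self.name = name
        self.state = {}
        self.mode = Mode.FREE_RUN
        self.breakpoints = []
        self.coordinator = coordinator
        coordinator.register(self)

    def apply(self, key, value):
        """Apply a local write; return False if the node is paused."""
        if self.mode is Mode.STEP_BY_STEP:
            return False
        self.state[key] = value
        self.coordinator.record(self.name, self.state)
        if any(bp(self.state) for bp in self.breakpoints):
            self.coordinator.breakpoint_hit(self.name)
        return True
\end{verbatim}

In this sketch, a write that satisfies a breakpoint is still applied and logged, after which every registered node, including the one that hit the breakpoint, refuses further writes until the user steps it.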

% What are the problems with this approach.
% The problem is this approach is that, other nodes might go ahead and interact with each other while the instruction to stop execution is being transmitted, leading to an execution trace that can possibly never happen under normal conditions. Since the GCN has the logs of all activities, the GCN could attempt to roll back the execution states of each Node that went ahead. This system is quite complex, but could potentially be required when debugging large scale distributed applications.

% UI scaling.
\subsubsection{Scaling the User Interface}
With a large number of nodes, we also encounter the problem of the user interface showing too much information. The network topology becomes hard to read, which makes it difficult for the user to take a deep dive into the GoT nodes and explore specific execution paths. Conditional breakpoints become the only way to explore the execution meaningfully, restricting the debugger to finding bugs whose symptoms are already known. A possible solution is to use grouping algorithms to show the topology of the application concisely. Alternatively, algorithms like PageRank can be used to show only the nodes that are heavily connected.
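
As an illustration of the second option, the following Python sketch ranks nodes with a plain power-iteration PageRank and keeps only the top $k$ for display. It is a sketch under stated assumptions, not a feature of GoTcha: the topology is assumed to be available as a list of directed (source, destination) pairs, and the function names \texttt{pagerank} and \texttt{top\_nodes} are hypothetical.

\begin{verbatim}
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                new_rank[dst] += share
        rank = new_rank
    return rank


def top_nodes(edges, k):
    """Return the k highest-ranked nodes (ties broken by name)."""
    rank = pagerank(edges)
    return sorted(rank, key=lambda n: (-rank[n], n))[:k]
\end{verbatim}

For a topology in which many WordCounter nodes all send to a single Grouper node, \texttt{top\_nodes(edges, 1)} would keep only the Grouper node, since it is the only node with incoming links.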

The view of the version history at each node can hold a lot of information when there is significant interaction with the node. For example, a view of the version history at a Grouper node working with a thousand WordCounter nodes could have a thousand steps pending at Grouper, waiting to be stepped through. To ease navigation of the execution, the user interface could allow users to attach breakpoints to the end of pending steps, allowing the user to skip large batches of steps without having to artificially promote the specific step that they wish to see to the top of the pending list. Such a breakpoint would be a close match to the non-conditional breakpoints that exist in traditional debuggers.
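
A minimal Python sketch of such a breakpoint is given below. It is an illustration only: the class \texttt{PendingSteps} and its methods are hypothetical, steps are assumed to be hashable identifiers applied in list order, and ``applying'' a step is reduced to moving it from the pending list to the applied list.

\begin{verbatim}
from collections import deque


class PendingSteps:
    """Hypothetical queue of steps waiting at a single node."""

    def __init__(self, steps):
        self.pending = deque(steps)
        self.applied = []
        self.break_after = set()

    def attach_breakpoint(self, step):
        """Mark a pending step; execution pauses once it is applied."""
        if step not in self.pending:
            raise ValueError("step is not pending: %r" % (step,))
        self.break_after.add(step)

    def resume(self):
        """Apply pending steps in order until a marked step is done.

        Returns the marked step, or None if the list was exhausted.
        """
        while self.pending:
            step = self.pending.popleft()
            self.applied.append(step)
            if step in self.break_after:
                self.break_after.discard(step)
                return step
        return None
\end{verbatim}

With a thousand pending steps, attaching a breakpoint to the 750th step and calling \texttt{resume()} once would apply the first 750 steps in their existing order and leave the remaining 250 pending, without reordering the list.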

\subsection{Integration with Alternate Debugging Concepts}
% GoTcha with traditional interactive debuggers
GoTcha is built as a standalone system over the GoT model that helps debug the application by exposing the changes to the version history at each node. State changes within the application are observed from a version control point of view, in broad strokes that span several lines of code. The bugs to be found are, however, in these lines of code, and integration with traditional interactive debuggers can help find them. GoTcha does not interfere with the use of traditional interactive debuggers at a single node. A single-threaded interactive debugger can break down the state changes due to local execution and allow the user to debug the lines of code, while also being sure that the state cannot change in unpredictable ways from one line to the next.

% GoTcha as record and replay tool.
Integration with non-interactive forms of distributed debugging is also possible. For example, GoTcha, during the free-run mode, is similar to a record and replay tool. When a conditional breakpoint is hit, it would be possible for the user to cycle through the previous steps and observe the previous states of the application along with the interactions that occurred between the nodes. Cycling through the previous steps is important because conditional breakpoints are usually used to find the execution point where the symptom of the error manifests. This may not always be the point where the error is. The user can find these errors by observing previous states of the application. If the user does not set a conditional breakpoint and executes the entire application in free-run mode, the entire execution is recorded and can be replayed. Existing tools and research on record and replay can add value to the free-run mode of GoTcha and make it a more powerful tool. Integrating non-interactive debugging tools would strengthen error finding during free-run, as most of these tools deal with postmortem analysis of execution, while the interactivity of GoTcha would allow the user to observe errors during the live exploration of the application's execution.

% \subsection{Limitations}
% % not satisfying the second goal.
% While GoTcha meets our first goal to expose every form of state changes that occurs in a distribute system correctly, we have only partially satisfied the second goal of not interfering with the execution flow. We mitigate the bigger concern of the usefulness of GoTcha, by allowing the developer to reorder and interleave tasks to explore realistic scenarios, but it cannot solve all the problems. As discussed, there might be too many orderings possible for the user to observe and find the errors they were looking for. More research has to be done in order to solve this problem.

% % Breakpoint expressiveness
% Another limitation is in the expressiveness of the breakpoints. In traditional interactive debuggers, breakpoints are placed on the execution path, and augmented using conditions. In our debugger, breakpoints can only be predicates for the state of data. The future flow of data is non deterministic and therefore, a breakpoint cannot be placed in the same manner as in traditional interactive debuggers. A possible solution, that is yet to be implemented, is to provide a dataframe level API to pause GoTcha. This API call would effectively be equivalent to a breakpoint in the flow of execution.

% % Scalability
% A third limitation is in the scale. As discussed, while the tool can observe state changes in an application with few nodes, scaling to applications with a large number of nodes requires architectural changes that compromise on other aspects of the system. However, the developer does have the option to reduce the number of nodes, explore the communication and state changes in the application with reduce nodes, fix bugs, and then scale back to a large number of nodes.

% % Generalization
% Finally, while we discuss the notion of building an interactive debugger over any distributed model, we implemented our prototype on the GoT model using the Spacetime implementation. We still do not know if the generalizations we make from this implementation are applicable to other distributed computing models and is left for the future.