Merge branch 'master' of github.com:bruinfish/219ProjProposal

bruinfish · bruinfish · commit f68c26b8b9f9 · 2012-06-08T13:56:21.000-07:00
diff --git a/pgr_nectar.tex b/pgr_nectar.tex
@@ -1,81 +1,86 @@
-\section{Nectar: IP-based Solution}
-(The name of this section is tentative)
-This section
-briefly describe Nectar: a IP-based system designed to manage the intermediate
-results. With Nectar, the intermediate results can be shared across multiple
-programs or be used in the future computation incrementally.  Nectar realizes
-the interchangeability of data and computation by the following two elements:
-\emph{a)} a client-side library that rewrites user programs; \emph{b)} a
-cluster-wide cache server that manages the intermediate results.  Nectar client
-side library takes a \emph{DryadLINQ} program $P$ as input, consulting the
-cache service to rewrite it to an equivalent but more efficient program $P'$ in
-the following steps:
+\section{Nectar's solutions}
 
-\begin{itemize} 
+This section briefly describes Nectar's solution. Nectar names every piece of intermediate
+result by the Rabin fingerprints of the sub-program and the input
+data that generated it.  The intermediate results are cached
+in a distributed file system and managed by a centralized cache
+service, and all access to the intermediate results must go through
+this cache service.  Nectar then adopts a rewrite-based approach at
+the compiler level to hook the program up with the intermediate results.
+Before a program is submitted to the computation cluster, a
+client-side program rewriter identifies the intermediate data that
+will be useful to this program and rewrites the program so that, when executed, it
+retrieves these intermediate results from the cache storage and merges
+them with the new results. The rewriter also decides, based on how frequently
+each sub-program's result is requested, whether to cache
+that sub-program's result.  Finally, because all the intermediate results
+are managed by a centralized service, Nectar implements a
+cost-benefit algorithm to garbage collect obsolete intermediate
+results.
 
-\item Decompose program $P$ into a set of sub-expressions ${P1, P2, P3 ...}$
-equivalent to $P$ and probe the cache service for all cache hits for each
-\emph{prefix sub-expression} of $P$.
+While we agree that Nectar is heading in the right direction to reduce the
+computation cost in data centers, we found that Nectar's rewriter-based
+solution imposes many unnecessary constraints on the use of intermediate results.
+First, in order for the rewriter to rewrite programs, the
+programs must be written in a language that the rewriter
+understands. In fact, Nectar supports only programs
+written in C\#, while many programs run in data centers do not
+use C\# or Microsoft's DryadLINQ. Also, as mentioned in the Nectar paper,
+even if a program is written in C\#, if it invokes an external library written in
+another language, those parts of the computation cannot benefit
+from the intermediate results because the rewriter cannot understand them. 
 
-\item Recursively applying the maximum-independent-set algorithm from the
-largest to the shortest prefix sub-expression, finding a subset of intermediate
-results which are disjoint on input data and provide the most saving on
-execution time.
+We argue that a more flexible design is needed, so that the idea
+of reusing intermediate results becomes widely applicable and programs written in different
+languages can all benefit from the intermediate results. 
 
-%For each prefix sub-expression, apply the maximum-independent-sets algorithm
-%to find a subset of intermediate results which are disjoint on input data and
-%provide the most saving on execution time.  \item Pick a set of intermediate
-%results of a prefix sub-expressions that can provide the most saving. 
+Second, because Nectar performs rewriting at compilation time, the
+flexibility of using intermediate results is largely limited. The
+following example demonstrates this problem. Consider a program that
+consists of four sub-computations: $\{p_1, p_2, p_3, p_4\}$. With Nectar, this program can only
+benefit from the intermediate results that were generated by its
+prefix sub-programs, i.e., $\{p_1, p_2, p_3, p_4\}$, $\{p_1, p_2, p_3\}$, $\{p_1, p_2\}$, or
+$\{p_1\}$, because when Nectar rewrites the program, it does not know the
+input data of the computations $p_2$, $p_3$, or $p_4$ and so cannot identify the
+intermediate results that were generated by sub-programs that
+start with these computations. Also, a rewrite-based solution can only
+modify a program based on the state of the cache at compilation time. If
+new intermediate data that could benefit the program becomes available after
+compilation, the program cannot use it. We argue
+that our more flexible design can better exploit the
+potential benefit of intermediate results in the above-mentioned
+cases.
 
-\item Rewrite the program $P$ to $'P$ in a way that the program can benefit
-from the chosen subset of intermediate results. 
+Finally, Nectar also suffers from static granularity problems in both
+program decomposition and the intermediate data cache. First, there is no room for
+program developers to decide how their programs are decomposed. Some
+programs need finer-grained decomposition because each small step
+might be useful to other programs. Meanwhile, other programs might
+not need such fine-grained decomposition. But with Nectar, all programs
+are decomposed in the same way, which could either waste
+storage by caching unnecessary intermediate results when the decomposition is
+too fine-grained, or fail to exploit
+the potential of intermediate results when the decomposition is too
+coarse-grained. 
 
-\end{itemize}
+Second, because the granularity of Nectar's intermediate
+result cache is static, two pieces of intermediate results
+whose input data ranges overlap cannot be merged.
+For example, suppose one intermediate result was generated from
+input data ranging from 0 to 100 and another from input
+data ranging from 90 to 190. A program can use only one of them,
+which forfeits a considerable part of the benefit of intermediate results.
 
-We found that, although the Nectar realizes the interchangeability of data and
-computation in a way transparent to programmers, its rewrite-based solution
-largely limits the flexibility of the use of intermediate results, and its
-IP-based caching system complicates the identifying and retrieving procedure of
-the intermediate results.  We argue that our NDN-based system can provide a
-more flexible, simple solution.  The following sections describe the problems
-of Nectar.
+In this project, we consider designing an NDN-based distributed
+computing system that can utilize the intermediate results. The
+reasons for adopting NDN are two-fold. First, in NDN, every piece of data
+inherently has a name and is transmitted in the network by this name.
+Therefore, with NDN, we do not need a complicated and inflexible rewriter and a centralized cache
+service to hardwire programs to intermediate results; we can solve the
+problems more naturally and elegantly. Second, we can benefit from other
+features of NDN. For example, NDN allows routers
+to cache data, reducing network traffic. Therefore, with careful design, we
+can make a system that uses intermediate results not only more
+flexible, but also more efficient than an IP-based solution such as
+Nectar.
 
-\subsection{Inflexibility of rewrite-based solution} Nectar rewrites users'
-programs written in \emph{LINQ} into a new program which can benefit from
-intermediate results. Although, such a rewrite-based solution has the advantage
-of transparency, it largely limits the flexibility of caching system in two
-ways: \emph{a)} Nectar only decomposes and considers caching the computations
-it can recognize while failing to cache and benefit from the intermediate
-results belonging to the computations not recognizable to it.  \emph{b)} On the
-other hand, Nectar could wrongly decompose a computation into sub-expressions
-of which intermediate results are not worth to cache, and still try to cache
-them if they are inquired frequently.
-
-To design such a rewrite algorithm could be tricky and error-prone. Any corner
-case the rewriter fails to consider could cause data inconsistency without
-warnings. Meanwhile, such rewrite-based solution might not be applicable to
-other languages used in data centers.
-
-Nectar adopts such a rewrite-based solution while sacrificing the flexibility,
-because they assume it is too much effort for programmers to modify the
-programs. However, we argue that if we can provide a simple enough programing
-model to programmers, who know the programs best, such as with
-\emph{Map-Reduce}, the programmers can best tune the caching of intermediate
-result without too much effort.
-
-\subsection{Complication of identifying and retrieving intermediate results}
-Identifying and retrieving the intermediate results are extremely complicated
-in Nectar.  Each entry of intermediate results is indexed by the fingerprint of
-the sub-expression that have generated it, as well as, the fingerprint pairs of
-the first and last extents of the input dataset. 
-
-To identify the intermediate results, the rewriter has to compute the
-fingerprints of each prefix sub-expression, querying the cache service,
-recursively searching the best-saving subset of intermediate results, and
-rewriting the program. Then, the program retrieve the chosen subset of
-intermediate results from the specific location on a distributed storage given
-by the cache service.
-
-We found this procedure of identifying and retrieving the intermediate results
-can be largely simplified in a NDN-based network with a naming technique we
-will describe later.