Commit f68c26b

Merge branch 'master' of github.com:bruinfish/219ProjProposal
2 parents b2f5fae + 8c37fee commit f68c26b

1 file changed

+78
-73
lines changed

pgr_nectar.tex

@@ -1,81 +1,86 @@
1-
\section{Nectar: IP-based Solution}
2-
(The name of this section is tentative)
3-
This section
4-
briefly describes Nectar: an IP-based system designed to manage the intermediate
5-
results. With Nectar, the intermediate results can be shared across multiple
6-
programs or be used in the future computation incrementally. Nectar realizes
7-
the interchangeability of data and computation by the following two elements:
8-
\emph{a)} a client-side library that rewrites user programs; \emph{b)} a
9-
cluster-wide cache server that manages the intermediate results. Nectar client
10-
side library takes a \emph{DryadLINQ} program $P$ as input, consulting the
11-
cache service to rewrite it to an equivalent but more efficient program $P'$ in
12-
the following steps:
1+
\section{Nectar's solutions}
132

14-
\begin{itemize}
3+
This section briefly describes Nectar's solutions. Nectar names every
4+
intermediate result by the Rabin fingerprints of the sub-program and the input
5+
data that generated it. The intermediate results are cached
6+
in a distributed file system and managed by a centralized cache
7+
service, and all access to the intermediate results must go through
8+
this cache service. Nectar adopts a rewrite-based approach at
9+
the compiler level to hook the program up to the intermediate results.
10+
Before a program is submitted to the computation cluster, a
11+
client-side program rewriter identifies the intermediate data that
12+
is useful to this program and rewrites the program so that, when executed, it
13+
retrieves these intermediate results from the cache storage and merges
14+
them with the new results. The rewriter also decides, based on how
15+
frequently each sub-program's result is requested, whether to cache
16+
that sub-program's result. Finally, because all the intermediate results
17+
are managed by a centralized service, Nectar implements a
18+
cost-benefit algorithm to garbage-collect obsolete intermediate
19+
results.
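
The naming scheme described in this paragraph can be sketched roughly as follows. This is only an illustration, not Nectar's implementation: SHA-256 from Python's standard library stands in for the Rabin fingerprint, and the function names `fingerprint` and `cache_key` and the sample sub-program strings are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Stand-in fingerprint (SHA-256); Nectar itself uses Rabin fingerprints."""
    return hashlib.sha256(data).hexdigest()


def cache_key(subprogram_source: str, input_data: bytes) -> str:
    """Name an intermediate result by the sub-program and the input that produced it."""
    program_fp = fingerprint(subprogram_source.encode("utf-8"))
    input_fp = fingerprint(input_data)
    # Combine both fingerprints so that changing either one changes the name.
    return fingerprint((program_fp + ":" + input_fp).encode("utf-8"))


# Same sub-program and same input give the same name; a different input does not.
key_a = cache_key("select(x => x * 2)", b"1,2,3")
key_b = cache_key("select(x => x * 2)", b"1,2,3")
key_c = cache_key("select(x => x * 2)", b"1,2,4")
```

Because the name depends only on the sub-program and its input, any program that repeats the same computation on the same data derives the same name and can look the result up.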
1520

16-
\item Decompose program $P$ into a set of sub-expressions $\{P_1, P_2, P_3, \ldots\}$
17-
equivalent to $P$ and probe the cache service for all cache hits for each
18-
\emph{prefix sub-expression} of $P$.
21+
While we agree that Nectar is heading in the right direction to reduce the
22+
computation cost in data centers, we found that Nectar's rewriter-based
23+
solution imposes many unnecessary constraints on the use of intermediate results.
24+
First, for the rewriter to rewrite programs, the
25+
programs must be written in a language that
26+
the rewriter understands. In fact, Nectar supports only programs
27+
written in C\#, while many programs run in data centers do not
28+
use C\# or Microsoft's DryadLINQ. Also, as mentioned in the Nectar paper,
29+
even if a program is written in C\#, if it invokes any external library written in
30+
another language, those parts of the computation cannot benefit
31+
from the intermediate results because the rewriter cannot understand them.
1932

20-
\item Recursively apply the maximum-independent-set algorithm from the
21-
largest to the shortest prefix sub-expression, finding a subset of intermediate
22-
results which are disjoint on input data and provide the most saving on
23-
execution time.
33+
We argue that a more flexible design is needed, so that the idea
34+
of using intermediate results can be widely applicable, and programs written in different
35+
languages can all benefit from the intermediate results.
2436

25-
%For each prefix sub-expression, apply the maximum-independent-sets algorithm
26-
%to find a subset of intermediate results which are disjoint on input data and
27-
%provide the most saving on execution time. \item Pick a set of intermediate
28-
%results of a prefix sub-expressions that can provide the most saving.
37+
Second, in Nectar, because rewriting is performed at compilation time, the
38+
flexibility of using intermediate results is largely limited. The
39+
following example demonstrates this problem. Consider a program that
40+
contains four sub-computations: \{p1, p2, p3, p4\}. With Nectar, this program can only
41+
benefit from the intermediate results that were generated by the
42+
prefix sub-programs, i.e., \{p1, p2, p3, p4\}, \{p1, p2, p3\}, \{p1, p2\}, or
43+
\{p1\}, because when Nectar rewrites the program, it doesn't know the
44+
input data of the computations p2, p3, or p4 and so cannot identify the
45+
intermediate results that were generated by the sub-programs that
46+
start with these computations. Also, a rewrite-based solution can only
47+
modify a program based on the state of the cache at compilation time. If
48+
new intermediate data that could benefit the program appears after
49+
compilation, that data cannot be used by the program. We argue
50+
that our more flexible design can better exploit the
51+
potential benefit of intermediate results in the above-mentioned
52+
cases.
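
To make the prefix limitation in this paragraph concrete, the sketch below enumerates, for the four-step example, the sub-programs a compile-time rewriter can look up (the prefixes) and the contiguous sub-programs it cannot. The helper names `prefix_subprograms` and `all_contiguous_subprograms` are hypothetical and are not part of Nectar.

```python
def prefix_subprograms(steps):
    """Return every prefix of the pipeline, longest first."""
    return [tuple(steps[:n]) for n in range(len(steps), 0, -1)]


def all_contiguous_subprograms(steps):
    """Return every contiguous run of steps, including runs that start after the first step."""
    return [
        tuple(steps[i:j])
        for i in range(len(steps))
        for j in range(i + 1, len(steps) + 1)
    ]


steps = ["p1", "p2", "p3", "p4"]
prefixes = prefix_subprograms(steps)

# Sub-programs whose input is only known at run time, so a compile-time
# rewriter cannot compute their names and cannot probe the cache for them.
missed = [s for s in all_contiguous_subprograms(steps) if s not in prefixes]
```

In this example four of the ten contiguous sub-programs are prefixes; the remaining six, such as (p2, p3), start at an intermediate step and are out of reach for a rewriter that runs before execution.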
2953

30-
\item Rewrite the program $P$ to $P'$ so that the program can benefit
31-
from the chosen subset of intermediate results.
54+
Finally, Nectar also suffers from static granularity problems in both
55+
program decomposition and the intermediate data cache. First, there is no room for
56+
program developers to decide how their programs are decomposed. Some
57+
programs need finer-grained decomposition because each small step
58+
might be useful to other programs. Meanwhile, other programs might
59+
not need such fine-grained decomposition. But with Nectar, all programs
60+
are decomposed in the same way, which could either waste
61+
storage by caching unnecessary intermediate results when the
62+
decomposition is too fine-grained, or fail to exploit
63+
the potential of intermediate results when the decomposition is too
64+
coarse-grained.
3265

33-
\end{itemize}
66+
Second, with the static granularity of Nectar's intermediate
67+
result cache, two intermediate results that
68+
overlap in their input data ranges cannot be merged.
69+
For example, consider two intermediate results: one generated from
70+
input data ranging from 0 to 100, the other from input
71+
data ranging from 90 to 190. A program can use only one of them,
72+
which forfeits a considerable part of the benefit of intermediate results.
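
The cost of not merging can be worked through for the ranges in this example. The sketch below assumes a program that needs the whole range 0 to 190 and treats the ranges as continuous intervals. The function names are hypothetical, and whether two results can really be combined depends on the computation involved.

```python
def best_single(cached, wanted):
    """Largest overlap with `wanted` obtainable from a single cached range."""
    lo, hi = wanted
    return max((max(0, min(hi, b) - max(lo, a)) for a, b in cached), default=0)


def merged_coverage(cached, wanted):
    """Overlap with `wanted` when overlapping cached ranges may be combined."""
    lo, hi = wanted
    covered = 0
    end = lo  # right edge of what has been covered so far
    for a, b in sorted(cached):
        a, b = max(a, end), min(b, hi)  # count only the not-yet-covered part
        if b > a:
            covered += b - a
            end = b
    return covered


cached = [(0, 100), (90, 190)]
wanted = (0, 190)
single = best_single(cached, wanted)      # one cached result only
merged = merged_coverage(cached, wanted)  # both results combined
```

Under these assumptions a single cached result covers 100 of the 190 units of input, while combining the two would cover all 190, so roughly half of the reusable work is lost when merging is not possible.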
3473

35-
We found that, although the Nectar realizes the interchangeability of data and
36-
computation in a way transparent to programmers, its rewrite-based solution
37-
largely limits the flexibility of the use of intermediate results, and its
38-
IP-based caching system complicates the identifying and retrieving procedure of
39-
the intermediate results. We argue that our NDN-based system can provide a
40-
more flexible, simple solution. The following sections describe the problems
41-
of Nectar.
74+
In this project, we consider designing an NDN-based distributed
75+
computing system that can utilize the intermediate results. The
76+
reasons for adopting NDN are two-fold. First, in NDN, every piece of data
77+
inherently has a name and is transmitted in the network by this name.
78+
Therefore, with NDN, we don't need a complicated and inflexible rewriter and a centralized cache
79+
service to hardwire programs to intermediate results. We can solve the
80+
problems more naturally and elegantly. Second, NDN has useful
81+
features we can benefit from. For example, NDN allows routers
82+
to cache data, reducing network traffic. Therefore, by careful design, we
83+
can make a system that uses intermediate results not only more
84+
flexible, but also more efficient than an IP-based solution such as
85+
Nectar.
4286

43-
\subsection{Inflexibility of rewrite-based solution} Nectar rewrites users'
44-
programs written in \emph{LINQ} into a new program which can benefit from
45-
intermediate results. Although, such a rewrite-based solution has the advantage
46-
of transparency, it largely limits the flexibility of caching system in two
47-
ways: \emph{a)} Nectar only decomposes and considers caching the computations
48-
it can recognize while failing to cache and benefit from the intermediate
49-
results belonging to the computations not recognizable to it. \emph{b)} On the
50-
other hand, Nectar could wrongly decompose a computation into sub-expressions
51-
of which intermediate results are not worth to cache, and still try to cache
52-
them if they are inquired frequently.
53-
54-
To design such a rewrite algorithm could be tricky and error-prone. Any corner
55-
case the rewriter fails to consider could cause data inconsistency without
56-
warnings. Meanwhile, such rewrite-based solution might not be applicable to
57-
other languages used in data centers.
58-
59-
Nectar adopts such a rewrite-based solution while sacrificing the flexibility,
60-
because they assume it is too much effort for programmers to modify the
61-
programs. However, we argue that if we can provide a simple enough programming
62-
model to programmers, who know the programs best, such as with
63-
\emph{Map-Reduce}, the programmers can best tune the caching of intermediate
64-
result without too much effort.
65-
66-
\subsection{Complication of identifying and retrieving intermediate results}
67-
Identifying and retrieving the intermediate results are extremely complicated
68-
in Nectar. Each entry of intermediate results is indexed by the fingerprint of
69-
the sub-expression that generated it, as well as the fingerprint pairs of
70-
the first and last extents of the input dataset.
71-
72-
To identify the intermediate results, the rewriter has to compute the
73-
fingerprints of each prefix sub-expression, query the cache service,
74-
recursively search for the best-saving subset of intermediate results, and
75-
rewrite the program. Then, the program retrieves the chosen subset of
76-
intermediate results from the specific location on a distributed storage given
77-
by the cache service.
78-
79-
We found this procedure of identifying and retrieving the intermediate results
80-
can be largely simplified in an NDN-based network with a naming technique we
81-
will describe later.
