|
1 |
| -\section{Nectar: IP-based Solution} |
2 |
| -(The name of this section is tentative) |
3 |
| -This section |
4 |
| -briefly describes Nectar, an IP-based system designed to manage the intermediate
5 |
| -results. With Nectar, the intermediate results can be shared across multiple |
6 |
| -programs or be used in the future computation incrementally. Nectar realizes |
7 |
| -the interchangeability of data and computation by the following two elements: |
8 |
| -\emph{a)} a client-side library that rewrites user programs; \emph{b)} a |
9 |
| -cluster-wide cache server that manages the intermediate results. Nectar client |
10 |
| -side library takes a \emph{DryadLINQ} program $P$ as input, consulting the |
11 |
| -cache service to rewrite it to an equivalent but more efficient program $P'$ in |
12 |
| -the following steps: |
| 1 | +\section{Nectar's solutions} |
13 | 2 |
|
14 |
| -\begin{itemize} |
| 3 | +This section briefly describes Nectar's solution. Nectar names every intermediate
| 4 | +result by the Rabin fingerprints of the sub-program and the input
| 5 | +data that generated it. The intermediate results are cached
| 6 | +in a distributed file system and managed by a centralized cache
| 7 | +service, and all access to the intermediate results must go through
| 8 | +this cache service. Nectar adopts a rewrite-based approach at
| 9 | +the compiler level to connect a program with the intermediate results.
| 10 | +Before a program is submitted to the compute cluster, a
| 11 | +client-side program rewriter identifies the intermediate data that
| 12 | +is useful to the program and rewrites the program so that, when executed, it
| 13 | +retrieves these intermediate results from the cache storage and merges
| 14 | +them with the new results. The rewriter also decides, based on how often
| 15 | +each sub-program's result is requested, whether to cache
| 16 | +that sub-program's result. Finally, because all the intermediate results
| 17 | +are managed by a centralized service, Nectar implements a
| 18 | +cost-benefit algorithm to garbage collect obsolete intermediate
| 19 | +results.
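
The naming scheme described above can be sketched as follows. This is only an illustration under our own assumptions: SHA-256 from the Python standard library stands in for Rabin fingerprints, and the function name, argument names, and key layout are hypothetical rather than taken from Nectar.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Stand-in fingerprint. Nectar uses Rabin fingerprints; SHA-256 is
    used here only because it ships with the standard library."""
    return hashlib.sha256(data).hexdigest()


def cache_key(subprogram_source: str, input_data: bytes) -> str:
    """Name an intermediate result by the fingerprints of the sub-program
    and of the input data that produced it (hypothetical key layout)."""
    program_fp = fingerprint(subprogram_source.encode("utf-8"))
    input_fp = fingerprint(input_data)
    return program_fp + ":" + input_fp


# The same sub-program applied to the same input always maps to the same name,
# so a later program can find the cached result by recomputing the key.
key_a = cache_key("select(x => x.clicks > 10)", b"log-2010-10-01")
key_b = cache_key("select(x => x.clicks > 10)", b"log-2010-10-01")
key_c = cache_key("select(x => x.clicks > 10)", b"log-2010-10-02")
```

Here key_a and key_b are identical, while key_c differs because the input data differs.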
15 | 20 |
|
16 |
| -\item Decompose program $P$ into a set of sub-expressions $\{P_1, P_2, P_3, \ldots\}$
17 |
| -equivalent to $P$ and probe the cache service for all cache hits for each |
18 |
| -\emph{prefix sub-expression} of $P$. |
| 21 | +While we agree that Nectar is a step in the right direction for reducing the
| 22 | +computation cost in data centers, we found that Nectar's rewrite-based
| 23 | +solution imposes many unnecessary constraints on the use of intermediate results.
| 24 | +First, for the rewriter to rewrite a program, the
| 25 | +program must be written in a language that the
| 26 | +rewriter understands. In fact, Nectar only supports programs
| 27 | +written in C\#, while many programs run in data centers do not
| 28 | +use C\# or Microsoft's DryadLINQ. Also, as mentioned in the Nectar paper,
| 29 | +even if a program is written in C\#, any part that invokes an external library written in
| 30 | +another language cannot benefit
| 31 | +from the intermediate results, because the rewriter cannot understand it.
19 | 32 |
|
20 |
| -\item Recursively apply the maximum-independent-set algorithm from the
21 |
| -largest to the shortest prefix sub-expression, finding a subset of intermediate |
22 |
| -results which are disjoint on input data and provide the most saving on |
23 |
| -execution time. |
| 33 | +We argue that a more flexible design is needed, so that the idea
| 34 | +of using intermediate results can be widely applicable, and programs written in different
| 35 | +languages can all benefit from the intermediate results.
24 | 36 |
|
25 |
| -%For each prefix sub-expression, apply the maximum-independent-sets algorithm |
26 |
| -%to find a subset of intermediate results which are disjoint on input data and |
27 |
| -%provide the most saving on execution time. \item Pick a set of intermediate |
28 |
| -%results of a prefix sub-expressions that can provide the most saving. |
| 37 | +Second, because Nectar performs rewriting at compilation time, the
| 38 | +flexibility of using intermediate results is largely limited. The
| 39 | +following example demonstrates this problem. Consider a program that
| 40 | +contains four sub-computations: \{p1, p2, p3, p4\}. With Nectar, this program can only
| 41 | +benefit from the intermediate results that were generated by its
| 42 | +prefix sub-programs, i.e., \{p1, p2, p3, p4\}, \{p1, p2, p3\}, \{p1, p2\}, or
| 43 | +\{p1\}, because when Nectar rewrites the program, it does not know the
| 44 | +input data of the computations p2, p3, or p4 and so cannot identify the
| 45 | +intermediate results that were generated by sub-programs that
| 46 | +start with these computations. Also, a rewrite-based solution can only
| 47 | +modify a program based on what is cached at compilation time. If
| 48 | +new intermediate data that could benefit the program appears after
| 49 | +compilation, this data cannot be used by the program. We argue
| 50 | +that our more flexible design can better exploit the
| 51 | +potential benefit of intermediate results in the above-mentioned
| 52 | +cases.
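
To make the prefix restriction concrete, the short sketch below enumerates the sub-programs a compile-time rewriter can match against the cache and compares them with all contiguous sub-programs of the same four-step program. The function names are hypothetical, and the code only counts candidates; it is not Nectar's rewriting algorithm.

```python
from typing import List, Sequence, Tuple


def prefix_subprograms(steps: Sequence[str]) -> List[Tuple[str, ...]]:
    """Sub-programs that start at the first step, longest first."""
    return [tuple(steps[:n]) for n in range(len(steps), 0, -1)]


def contiguous_subprograms(steps: Sequence[str]) -> List[Tuple[str, ...]]:
    """Every contiguous run of steps, starting at any position."""
    return [
        tuple(steps[i:j])
        for i in range(len(steps))
        for j in range(i + 1, len(steps) + 1)
    ]


program = ["p1", "p2", "p3", "p4"]
prefixes = prefix_subprograms(program)
all_runs = contiguous_subprograms(program)

# Sub-programs whose cached results a prefix-only rewriter cannot reuse,
# for example ("p2", "p3"), because their input is unknown at compile time.
missed = [run for run in all_runs if run not in prefixes]
```

For this program, the prefix-only view contains 4 candidates out of 10 contiguous sub-programs, so the results of the remaining 6 can never be matched at compilation time.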
29 | 53 |
|
30 |
| -\item Rewrite the program $P$ to $P'$ in a way that the program can benefit
31 |
| -from the chosen subset of intermediate results. |
| 54 | +Finally, Nectar also suffers from static granularity, both in
| 55 | +program decomposition and in the intermediate data cache. First, there is no room for
| 56 | +program developers to decide how their programs are decomposed. Some
| 57 | +programs need finer-grained decomposition because each small step
| 58 | +might be useful to other programs. Meanwhile, other programs might
| 59 | +not need such fine-grained decomposition. But with Nectar, all programs
| 60 | +are decomposed in the same way, which can either waste
| 61 | +storage by caching unnecessary intermediate results when the
| 62 | +decomposition is too fine-grained, or fail to exploit
| 63 | +the potential of intermediate results when the decomposition is too
| 64 | +coarse-grained.
32 | 65 |
|
33 |
| -\end{itemize} |
| 66 | +Second, because the granularity of Nectar's intermediate
| 67 | +result cache is static, two intermediate results whose
| 68 | +input data ranges overlap cannot be merged.
| 69 | +For example, suppose one intermediate result is generated from
| 70 | +input data ranging from 0 to 100 and another from input
| 71 | +data ranging from 90 to 190. A program can use only one of them,
| 72 | +not both, which loses a considerable part of the benefit of intermediate results.
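
The loss in this example can be quantified with a small sketch. We assume, for illustration only, that input ranges are half-open intervals [start, end) and that the cached computation can be merged across overlapping ranges; the function names are hypothetical.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open interval [start, end), an assumption


def covered_by_best_single(ranges: List[Range]) -> int:
    """Input covered when a program may pick only one cached result."""
    return max(end - start for start, end in ranges)


def covered_by_union(ranges: List[Range]) -> int:
    """Input covered if overlapping cached results could be merged."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(ranges):
        if current_end is None or start > current_end:
            # Disjoint from the running interval: close it and start anew.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the running interval.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


cached = [(0, 100), (90, 190)]
single = covered_by_best_single(cached)
merged = covered_by_union(cached)
```

With the two ranges from the example, choosing a single result covers 100 units of input, whereas merging would cover 190, so nearly half of the cached work goes unused.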
34 | 73 |
|
35 |
| -We found that, although the Nectar realizes the interchangeability of data and |
36 |
| -computation in a way transparent to programmers, its rewrite-based solution |
37 |
| -largely limits the flexibility of the use of intermediate results, and its |
38 |
| -IP-based caching system complicates the identifying and retrieving procedure of |
39 |
| -the intermediate results. We argue that our NDN-based system can provide a |
40 |
| -more flexible, simple solution. The following sections describe the problems |
41 |
| -of Nectar. |
| 74 | +In this project, we consider designing an NDN-based distributed
| 75 | +computing system that can utilize intermediate results. The
| 76 | +reasons for adopting NDN are twofold. First, in NDN, every piece of data
| 77 | +inherently has a name and is transmitted in the network by this name.
| 78 | +Therefore, with NDN, we do not need a complicated and inflexible rewriter or a centralized cache
| 79 | +service to hardwire programs to intermediate results. We can solve these
| 80 | +problems more naturally and elegantly. Second, NDN has other
| 81 | +features we can benefit from. For example, NDN allows routers
| 82 | +to cache data, reducing network traffic. Therefore, with careful design, we
| 83 | +can make a system that uses intermediate results not only more
| 84 | +flexible but also more efficient than an IP-based solution such as
| 85 | +Nectar.
42 | 86 |
|
43 |
| -\subsection{Inflexibility of rewrite-based solution} Nectar rewrites users' |
44 |
| -programs written in \emph{LINQ} into a new program which can benefit from |
45 |
| -intermediate results. Although, such a rewrite-based solution has the advantage |
46 |
| -of transparency, it largely limits the flexibility of caching system in two |
47 |
| -ways: \emph{a)} Nectar only decomposes and considers caching the computations |
48 |
| -it can recognize while failing to cache and benefit from the intermediate |
49 |
| -results belonging to the computations not recognizable to it. \emph{b)} On the |
50 |
| -other hand, Nectar could wrongly decompose a computation into sub-expressions |
51 |
| -whose intermediate results are not worth caching, and still try to cache
52 |
| -them if they are inquired frequently. |
53 |
| - |
54 |
| -To design such a rewrite algorithm could be tricky and error-prone. Any corner |
55 |
| -case the rewriter fails to consider could cause data inconsistency without |
56 |
| -warnings. Meanwhile, such rewrite-based solution might not be applicable to |
57 |
| -other languages used in data centers. |
58 |
| - |
59 |
| -Nectar adopts such a rewrite-based solution while sacrificing the flexibility, |
60 |
| -because they assume it is too much effort for programmers to modify the |
61 |
| -programs. However, we argue that if we can provide a simple enough programming
62 |
| -model to programmers, who know the programs best, such as with |
63 |
| -\emph{Map-Reduce}, the programmers can best tune the caching of intermediate |
64 |
| -result without too much effort. |
65 |
| - |
66 |
| -\subsection{Complication of identifying and retrieving intermediate results} |
67 |
| -Identifying and retrieving the intermediate results are extremely complicated |
68 |
| -in Nectar. Each entry of intermediate results is indexed by the fingerprint of |
69 |
| -the sub-expression that generated it, as well as the fingerprint pairs of
70 |
| -the first and last extents of the input dataset. |
71 |
| - |
72 |
| -To identify the intermediate results, the rewriter has to compute the |
73 |
| -fingerprints of each prefix sub-expression, querying the cache service, |
74 |
| -recursively searching the best-saving subset of intermediate results, and |
75 |
| -rewriting the program. Then, the program retrieves the chosen subset of
76 |
| -intermediate results from the specific location on a distributed storage given |
77 |
| -by the cache service. |
78 |
| - |
79 |
| -We found this procedure of identifying and retrieving the intermediate results |
80 |
| -can be largely simplified in an NDN-based network with a naming technique we
81 |
| -will describe later. |
|