proposal.tex

\documentclass[journal]{IEEEtran}

\hyphenation{op-tical net-works semi-conduc-tor}


\begin{document}

\title{Distribtued Computation over NDN-based Data Center Network}

\author{Zhiyang Wang,
        Cheng-Kang Hsieh,
        and Yingdi Yu}

\maketitle

%\begin{abstract}
%The abstract goes here.
%\end{abstract}

%\begin{IEEEkeywords}
%Data Center Network, NDN, Distributed Computation
%\end{IEEEkeywords}

%\IEEEpeerreviewmaketitle


\section{Introduction}
\IEEEPARstart{D}{ata}-intensive distributed computing has been a major challenge
in data centers for a long time.  Without well-managed data and computation, a
large portion of computations in a data center have to be unnecessarily
repeated.  In order to avoid redundant computation, Pradeep, {\it et al.}
proposed Nectar \cite{gunda2010nectar}, a system in which intermediate computation
results can be cached so that they can be reused by following computation.
Evaluation results suggested that caching intermediate results can save at most
99\% redundant computations in some cases \cite{gunda2010nectar}.

However, there are still several unresolved issues in caching intermediate
computation results in current data center network.  First, a centralized cache
server, as proposed by Nectar, may become a hot spot when computing
significantly relies on cached intermediate results.  Second, when several
computing servers need to re-use the same cached intermediate result, they have
to fetch the cached data individually from the cache server.  Such a data
communication mode may work in some small-volume data-exchanging and highly
distributed system (e.g. DNS), but may become inefficient in data center network
where data exchanged are much more bigger and computing servers are relatively
close to each other in terms of network topology.

Named-Data Network \cite{jacobson2009networking}, a novel network architecture
which can cache data inside networks instead of servers, appears to be a
potential solution to the problems mentioned above.  In NDN, data are named and
fetch by their names, rather than through a end-to-end connection between a
requester and a data provider.  With a specific name, a data packet can be
cached along the path (more specifically speaking, NDN routers) it has traveled
through.  When some other requesters ask for the same data, the request can be
satisfied by data cached in NDN routers without reaching the corresponding data
provider.  In this way, NDN intrinsically provides a distributed caching
service, and avoids end-to-end connections. 

Although NDN has many attractive features, it is not very feasible to provide
intermediate computation results caching service in NDN.  There are two open
questions: 1) how to name intermediate computation results to maximize the
caching efficiency and 2) what is the best network topology for NDN-based data
center network.  These two issues will be discussed in detail in Section
\ref{sec:problem_statement}.

In this project, we plan to answer the two questions above by building a
prototype of NDN-based data center network.  In order to evaluate the
performance of the prototype, we will provide two specific applications
supported by the prototype: 1) simple SQL query in a distributed database and 2)
distributed word occurance analysis.  As we will describe in Section
\ref{sec:problem_statement}, each application represents a different type of
distributed data computing: incremental computing and sub-computing
respectively.  We will also compare the performance of our prototype with the
existing IP-based solution, Nectar, and try to derive some general conclusions
which can be used for future data center network designs.

\section{Problem Statement}\label{sec:problem_statement}
To make the problem discussion more clear, we first describe two applications
that will be implemented over our prototype. And then two related issues will be
discussed in detail.

The first application is a simple SQL query in a distributed database.  In this
database, data are stored on multiple servers in a distributed way, and queries
are processed in a MapReduce style.  When a query is issued, related entries
will be grouped by mapers, and then further processed by reducers.  When new
entries have been injected into the database, all related entries have to be
re-grouped if intermediate results caching is not used.  When the grouped
entries can be cached as intermediate results, mapers only need to group
recently injected entries which can be combined with cached results for reducer
to process.  Such a computing mode is called as incremental computing.  

The second application is a distributed word occurance analysis.  Unlike the
previous application, data set is persistent in this application.  In order to
get the final results, the computation can be divided into several steps.  Some
steps may require the computation results of some other steps.  And the same
computation results may be re-used by more than one computing servers.  Such a
computing mode is called as sub-computing.

\subsection{Name Intermediate Results}
We use the first application to discuss the naming issue in NDN-based network.
In traditional IP network, a maper will transfer data to a reducer, while in
NDN, a reducer is notified a task, and then fetches data needed for the task.
The reducer issues an {\sc Interest} message which contains the name of
requested data.  The name should be able to represent the timeliness of the
task.  For example, if a reducer needs all the grouped entries before a
timestamp $t_i$, and there are several pieces of entries injected at timestamps
$t_{i-3}$, $t_{i-2}$, $t_{i-1}$ before $t_i$, then all intermediate grouped
entries, if cached, should be able to be matched with a name containing the
timestamp $t_i$.  However, the sequential nature of timestamp conflicts with the
hierarchical structure of NDN's naming mechanism which determines caching
efficiency to a large extent. Therefore, we have to find a compatible naming
mechanism for incremental computing.

\subsection{Network Topology}
DCN over NDN should have a specific-designed topology. Current DCNs usually adopt
a hierarchical topology, such as fat-tree, to achieve scalability, fault tolerance,
and high capacity. Such a design assumes that the communication is end-to-end-based, 
but this is not the case for NDN. In NDN, a data requester is not necessary to 
retrieve data from the data producer but can retrieve the data from a router that 
have cached the data. If the network is based on a hierarchical topology, most traffic
will go through the core routers. For NDN, this means that the core routers should provide
larger cache to achieve higher cache hit rate. But, increasing the cache size also increases
the time for NDN routers to perform the longest-prefix matching routing procedure and makes
these upper-level routers become the bottleneck of the communications.

\subsection{Comparison With IP-based Solution}
The contribution and value of our work largely rely on whether our NDN based
solution could actually provide better performance than the traditional IP-based
centralized server solution, although conceptually the data-centric NDN network
should be suitable for such data-centric use case in data center. In the current
stage, there are a few advantages we assume: First, as decentralized, our
solution could be better in robustness and scalability. Second, as most of data
are cached in local router, the latency will be shorter for data fetching since
in some cases where two servers are under the same router or neighbor
routers. Third, the network topology and infrastructure is simpler since there’s
no extra special server which all outer servers must connect to. By leveraging
its job toward routers who by nature will do caching in NDN network, the whole
solution could be less costly. But since there’s no existing real well accepted
NDN product in the market, this point is hard to verify.  However, as
centralized, the centralized special server has all the data from every server,
and therefore could enable large-scale aggregation or analysis algorithm. We
need to verify whether it is important under our scenario and whether our NDN
solution could also provide such function in other efficient ways.  Moreover,
the advantages and disadvantages in this project could also provide an insight
to the more general comparison between these two solutions in other scenario,
such as data collection in participate sensing or other data center related
topics such as leverage the master’s task toward decentralized routers.

\section{Solutions}
In this section, we propose a set of solutions to deal with problems stated
above, demonstrate them in the two applications.
\subsection{Incremental Computation}
We consider a specific application here.  A distributed database is running in a
data center.  In this database, there is a table with the schema <{\sc index,
  name, score, timestamp}>.  Entries in this table are distributedly stored on
different servers in the data center.  New entries will be injected into the
table along the time as indicated by the {\sc timestamp} field.
%\begin{figure}
%\begin{center}
%\includegraphics{Fig/inc_cmpt.pdf}
%\end{center}
%\end{figure}

In order to keep
%\subsection{}

\section{Related Work}
In data center related work under traditional IP network, there are quite a few
existing work focusing on reducing redundant computations via caching, like
DryadInc \cite{Isard:2007:DDD:1272996.1273005}, Coment
\cite{He:2010:CBS:1807128.1807139}, the stateful bulk processing system
\cite{Logothetis:2010:SBP:1807128.1807138} and Nectar
\cite{gunda2010nectar}. And among them we choose Nectar as our main reference
and comparison target since it attempts to provide a more comprehensive solution
to the problem of automatic management of data and computation in data
center. In order to make direct and straightforward comparison, we borrowed the
data center use cases as incremental computing and sub-computation directly from
Nectar, attempting to prove that NDN solution could provide better solution than
Nectar, which to us, by somehow represents the existing most centralized special
server based solution under traditional IP network.  In NDN network, or even
content centric network as a wider scope, currently no existing working
published aiming at such solution for data center network scenario. The most
related work, if we just consider the appearance data center network scenario in
CCN related research, is in \cite{lee2010greening} where the author use data
center as one CCN use case to illustrate the energy saving performance of
CCN. Obviously, from the motivation to the evaluation, our work is totally
different from their work.  Currently the main concern of the community is still
in some fundamental problems such as general naming mechanism, data security,
routing, rather than such quite specific problem in specific scenario as data
center here. Good part of this is that our work could be relatively fresh in
ideas, but the negative part is that since plenty of more fundamental problem
existing to address, as well as the potential and future of NDN is unknown, our
work which focus on the quite particular data center intermediate result sharing
problem is like a building built on the unstable ground. Therefore the
contribution and value of our work quite largely depends on the success of
NDN. However, even through NDN were finally proved as unsuitable for the general
whole internet, for the individual data center, which is relatively small in
scale and centralized controlled by each companies or organizations, it still
could be promising if we could prove our NDN solution with attractive
improvement of performance compared with traditional IP – special servers based
solution. Since previous examples like optical circuit switches are getting more
popular and accepted, we do have the reason to believe that our work could have
quite much contribution if the solution could be enough promising, even if NDN
lost the bigger campaign.  About content centric network naming, which is one of
the key issues we want to address under the scenario, there’s quite a few
related works talking about the structures and rules of naming, such as
\cite{ghodsi2011naming} or \cite{primes}, but their main motivation is for
security or scalability under the scenario of the whole internet, which is very
different with ours.  There are also some existing studies on CCN caching. For
example, in \cite{carofiglio2011modeling}, the author developed an analytical
model for the performance evaluation of content transfer in CCN that allows
explicit characterization of steady state dynamics. Our work could draw
inspiration from this work in the network topology design part and therefore
design a better topology and caching strategy.

\section{Timeline}
Week 4-6: Paper review and system \& algorithm design 
Week 7:  System \& algorithm formulation 
Week 8: Experiments design, System \& algorithm implementation and 
evaluation 
Week 9: Final presentation and report preparation 


\bibliographystyle{abbrv}
\bibliography{proposal}

%\begin{thebibliography}{1}
%\bibitem{2010osdiGunda}
%Nectar
%\end{thebibliography}
\end{document}