Optimistic Synchronization
Optimistic synchronization is based on the notion of speculative processing. This technique, which has been effectively employed in a large variety of areas, such as branch prediction in pipelined processing units, file prefetching, and concurrency control in transactional systems, is based on the idea that whenever a computing resource is available, it should be used to perform a task which is likely correct and/or useful. Performing this work before it is known whether it is actually needed provides a benefit if it is later discovered that it was indeed required. If that piece of work is a posteriori found to be incorrect or not useful, its result is simply discarded. The advantage comes from the fact that waiting to know whether a task is necessary might delay the actual execution of the task itself. If the speculative processing is correct, i.e. the guess on the work to perform was right, this delay can even be reduced to zero. In case of a wrong guess, the system pays the same delay as if it were not speculative, plus the cost of discarding the incorrect piece of work carried out.
In PDES, this approach was presented (under the name of optimistic synchronization) in the seminal paper where the Time Warp mechanism was introduced. Unlike the conservative approach, optimistic synchronization selects, at each simulation kernel instance and at each worker thread, the next event to be processed among the local ones independently of its safety. This clearly negates the guarantees provided by the conservative synchronization scheme, so that the system can fall into an inconsistent simulation state by following a simulation trajectory which is affected by a causality violation, as previously depicted in [causality-violation].
Rollback Operation
It is therefore necessary to support a mechanism which is able to detect a posteriori the arrival of a straggler message and the occurrence of a causality violation, to temporarily halt the execution of the simulation model, to restore the simulation state to a previous (consistent) point in ST, and then to restart the forward execution of the model by taking into account the straggler message. This operation, known as the rollback operation, is depicted in Figure Rollback Operation.

In the example provided by the picture we see that the rollback operation involves an additional problem. In particular, \(LP_j\) is at \(LVT_j = 20\) when it receives a straggler message \(e_{straggler}\) associated with timestamp \(T_{straggler} = 12\). Therefore, it rolls the execution back to ST \(LVT_j = 10\), i.e. it discards the state changes operated by the events at timestamps \(T = 15\) and \(T = 20\). This is sufficient for restoring a consistent state for \(LP_j\), i.e. the subportion \(S_j\) of the global state is correctly restored to time \(T = 10\). Nevertheless, we notice that during the execution of the event at timestamp \(T = 15\), \(LP_j\) sent a message to \(LP_k\) scheduled for ST \(T = 17\). This message was the result of a wrong simulation trajectory, as \(LP_j\) reached \(LVT_j = 15\) without taking into account the piece of information stored in event \(e_{straggler}\). As \(LP_j\) has now rolled back to ST \(LVT_j = 10\), it is possible that, upon the execution of \(e_{straggler}\), the simulation will follow a different trajectory, so that either the message scheduled at ST \(T = 17\) should not have been sent, or the information associated with it might change. Therefore, we have that:
-
\(LP_i\) has sent a straggler message \(e_{straggler}\) to \(LP_j\);
-
\(LP_j\) rolls back to the ST \(LVT_j\) such that \(T_{straggler} > LVT_j \geq T_{e_{max}}\), where \(T_{e_{max}} = \max \{ T_e \mid e \text{ is already processed} \wedge T_e < T_{straggler} \}\);
-
This involves undoing an event processed by \(LP_j\) which caused the generation of a new event scheduled for \(LP_k\).
Due to the rollback operation involving \(LP_j\), \(LP_k\) has reached an inconsistent simulation state as well. Therefore, we must support a mechanism which notifies \(LP_k\) that an event it has received belongs to an inconsistent portion of the simulation trajectory. This is done by relying on antimessages. An antimessage \(\bar{e}\) is a negative copy of an already-sent (positive) message \(e\), sent by some LP to any other one.
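As a minimal sketch of how this can be realized (the msg_t layout and field names below are illustrative assumptions, not an actual kernel API), a message and its antimessage can be represented as identical records differing only in a sign flag, so that matching copies can be recognized and annihilated:

#include <stdbool.h>

/* Hypothetical message layout: an antimessage is an exact copy of the
 * positive message with the sign flag flipped, so the two copies can be
 * matched (and annihilated) by comparing the remaining fields. */
typedef struct {
    int    sender;      /* id of the sending LP                        */
    int    receiver;    /* id of the destination LP                    */
    double send_time;   /* LVT of the sender when the message was sent */
    double timestamp;   /* ST at which the event must be processed     */
    bool   positive;    /* true: message, false: antimessage           */
    /* ... model-specific payload ... */
} msg_t;

/* Build the antimessage of an already-sent positive message. */
static msg_t make_antimessage(msg_t e) {
    e.positive = false;
    return e;
}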
During the rollback operation, \(LP_j\) checks whether some positive messages were sent during the execution of the events the timestamps of which fall in the interval \((T_{rollback}, LVT_j]\), where \(T_{rollback}\) is the ST the execution is restored to and \(LVT_j\) is the local virtual time reached before the rollback. For each such message, the corresponding antimessage is sent to its destination.
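A sketch of this send-side step, reusing the msg_t layout above and assuming each LP keeps an output queue of the positive messages it has sent (out_queue, out_count, and send_msg() are hypothetical helpers):

extern void send_msg(msg_t m);   /* assumed communication primitive */

/* Scan the output queue and emit antimessages for every positive message
 * generated by an event whose timestamp was undone by the rollback. */
void rollback_send_antimessages(msg_t *out_queue, int *out_count,
                                double t_rollback, double lvt_old) {
    int kept = 0;
    for (int i = 0; i < *out_count; i++) {
        msg_t m = out_queue[i];
        if (m.send_time > t_rollback && m.send_time <= lvt_old)
            send_msg(make_antimessage(m));  /* sent by an undone event */
        else
            out_queue[kept++] = m;  /* still on a consistent trajectory */
    }
    *out_count = kept;
}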
When \(LP_k\) receives an antimessage, two situations might arise:
-
the antimessage \(\bar{e}\) is associated with a timestamp \(T_{\bar{e}} > LVT_k\). This means that the positive message has not been processed yet. Then, the effect of the antimessage \(\bar{e}\) is to simply annihilate the positive message \(e\), i.e. \(e\) is removed from the event queue of \(LP_k\);
-
the antimessage \(\bar{e}\) is associated with a timestamp \(T_{\bar{e}} \leq LVT_k\). In this case, \(LP_k\) has already processed the event \(e\) which belongs to an inconsistent portion of the simulation trajectory, and it has reached an inconsistent simulation state as well.
In the second case, \(LP_k\) must roll back to a consistent simulation state as well (as in the example provided in Rollback Operation). This phenomenon is known as cascading rollback, and can in turn affect more than one LP, if the same situation (i.e., during the rollback some events which generated new events destined to different LPs are undone) arises.
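The receive-side counterpart of the two cases above can be sketched as follows (lp_t, remove_matching(), and do_rollback() are again illustrative placeholders, not an actual API):

typedef struct { double lvt; /* ... queues, state, ... */ } lp_t;

extern void remove_matching(lp_t *lp, msg_t anti); /* annihilate e in the input queue */
extern void do_rollback(lp_t *lp, double t);       /* restore a state with LVT < t    */

void on_antimessage(lp_t *lp, msg_t anti) {
    if (anti.timestamp > lp->lvt) {
        /* Case 1: the positive copy is still unprocessed; the two
         * messages simply annihilate each other. */
        remove_matching(lp, anti);
    } else {
        /* Case 2: the positive copy was already processed; the LP must
         * roll back past it, possibly triggering a cascading rollback. */
        do_rollback(lp, anti.timestamp);
        remove_matching(lp, anti);
    }
}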
It is important to mention that the rollback operation can be handled completely by the simulation kernel, provided that the implementation is able to know where the simulation states \(S_i\) of the various LPs are located in memory. If the kernel is able to do so, then no modification to the model's implementation is required. Otherwise, the programmer must explicitly tell the kernel where the memory buffers storing the state are. This will be discussed more thoroughly in Chapter [dymelor], where we will approach transparency towards scattered-memory states and incremental state saving.
In order to support rollback operations, two main approaches have been proposed in the literature. One relies on the notion of state save & restore, and was the original proposal accompanying Time Warp. The other relies on reverse computation of simulation events, which assumes that events can be coupled with negative events, i.e. events that execute the same operations as the corresponding events, but in reverse order. It is important to make a clear distinction between negative events and antimessages, as they refer to completely different logical operations. The former undoes the effect of processing one event on the simulation state, while the latter completely annihilates one event, which might or might not have been processed already. The two strategies are discussed later in this section.
Global Virtual Time (GVT)
While the rollback operation is easy to prove correct, it is harder to reason about the progress condition of the whole system among all the rollback operations being performed. Additionally, since events being processed might belong to a portion of the simulation trajectory that will later be discovered to be inconsistent, how is it possible to check whether the ending condition has been reached? And how is it possible to handle error conditions or I/O operations? Although the problem of I/O operations will be thoroughly discussed in Chapter [io], the original Time Warp proposal provided a global control mechanism based on the concept of Global Virtual Time (GVT). GVT is a property of an instantaneous global snapshot of the system at WCT \(t\), and is defined as follows:
[def:GVT] GVT(\(t\)) is defined as the minimum timestamp of any unprocessed message or antimessage flowing in the system at WCT \(t\).
Definition [def:GVT] tells us that the value of the GVT at WCT \(t\) can be computed by inspecting the timestamps associated with all the messages in the system that are not yet processed, but it does not say where these messages are likely to be found. In particular, while it is easy to inspect the timestamps stored in the event queues of the various simulation kernels, if a message has been sent from \(LP_i\) to \(LP_j\) but has not been received yet by \(LP_j\) (due, e.g., to network latency in the case of remote simulation kernels), then the timestamp of that message (which is referred to as an in-transit message) must be taken into account as well.
There are several other definitions of GVT, and differentiated implementations which address different simulation scenarios and different computing architectures. Nevertheless, computing the GVT value is an essential component of (distributed) PDES, as it allows defining the commitment horizon of the simulation. In fact, we recall that the original DES model's definition states that during the execution of an event \(e\) a new event \(e'\) can be generated only if it is associated with a timestamp \(T_{e'} \geq T_e\). Then, at any instant \(t\) of WCT, since the value of \(GVT(t)\) is the minimum timestamp among all the not-yet-processed events in the whole system, it is clear that no event executed by any LP in the system can generate a new event \(e'\) associated with a timestamp \(T_{e'} < GVT(t)\).
Thus, this property states that at WCT \(t\), no rollback operation can bring any LP \(i\) to a ST \(LVT_i < GVT(t)\). All the executed events associated with a timestamp \(T < GVT(t)\) are therefore committed, and can be used to verify, e.g., the simulation’s ending condition.
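A naive sketch of the GVT computation, reusing the msg_t layout above and under the (strong) simplifying assumption that the whole system can be paused so that both queued and in-transit (anti)messages can be inspected; real GVT algorithms avoid such a global pause:

#include <float.h>

/* Minimum timestamp over all unprocessed and in-transit messages.
 * The two arrays are a hypothetical flattened view of the system. */
double compute_gvt(const msg_t *pending, int n_pending,
                   const msg_t *in_transit, int n_transit) {
    double gvt = DBL_MAX;
    for (int i = 0; i < n_pending; i++)     /* queued at some kernel */
        if (pending[i].timestamp < gvt) gvt = pending[i].timestamp;
    for (int i = 0; i < n_transit; i++)     /* still on the wire     */
        if (in_transit[i].timestamp < gvt) gvt = in_transit[i].timestamp;
    return gvt;
}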
Rollback Supports: State Save & Restore
As mentioned before, one of the two main approaches to support the rollback operation is state save & restore, first proposed in conjunction with Time Warp. The technique is based on the idea that a snapshot of the simulation state of an LP can be taken (i.e., the memory region(s) containing the private state variables of \(S_i\) are copied into a separate buffer) and associated with a timestamp (namely, the timestamp of the last executed event[1]). Again, depending on the programming model supported by the simulation kernel, this operation can either be done transparently, or it requires the user to explicitly specify the memory regions containing the LP's simulation state \(S_i\).
Whenever a rollback operation occurs, the ST \(T_{rollback}\) is determined, and the state restore operation can simply be executed by retrieving the state snapshot \(s\) associated with timestamp \(T_s = T_{rollback}\) and restoring the values of the state variables back in place. In this way, if a straggler message is received and a rollback operation must be executed, the system can select the simulation snapshot associated with the highest timestamp among the ones lower than the straggler's[2].
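A minimal sketch of this scheme, assuming for simplicity a contiguous, fixed-size state buffer (real states may be scattered in memory, as discussed in Chapter [dymelor]); snapshot_t, STATE_SIZE, and the log layout are illustrative:

#include <string.h>

#define STATE_SIZE 4096                /* assumed fixed state size */

typedef struct {
    double timestamp;                  /* ST of the last executed event */
    unsigned char data[STATE_SIZE];
} snapshot_t;

/* Copy the LP state into the snapshot log after the event at time t. */
void save_state(snapshot_t *log, int *n, const void *state, double t) {
    log[*n].timestamp = t;
    memcpy(log[*n].data, state, STATE_SIZE);
    (*n)++;
}

/* Restore the snapshot with the highest timestamp lower than the
 * straggler's; returns the new LVT. One such snapshot is assumed to exist. */
double restore_state(const snapshot_t *log, int n, void *state,
                     double t_straggler) {
    int best = -1;
    for (int i = 0; i < n; i++)
        if (log[i].timestamp < t_straggler &&
            (best < 0 || log[i].timestamp > log[best].timestamp))
            best = i;
    memcpy(state, log[best].data, STATE_SIZE);
    return log[best].timestamp;
}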
It is interesting to note that taking a state snapshot is a costly operation, both in terms of execution time and memory usage. To cope with memory usage, we can rely on the notion of GVT as presented in Definition [def:GVT]. In fact, since state snapshots are only required to support rollback operations, and given that at any WCT instant \(t\) it is not possible to execute a rollback operation to a ST \(T < GVT(t)\), all the state snapshots \(s\) associated with a timestamp \(T_s < GVT(t)\) can be discarded, and the corresponding memory buffers can be reclaimed for future usage. This operation, known as fossil collection, can be performed periodically. The frequency according to which the value \(GVT(t)\) is computed (and therefore the fossil collection operation is performed) can be either manually specified at simulation start-up, or might depend on the underlying hardware architecture. Of course, executing the fossil collection too often might affect the overall simulation throughput, as more computing power is used to compute the GVT reduction and to release old (committed) buffers.
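Over the snapshot log sketched above, fossil collection then reduces to discarding every snapshot older than the last computed GVT value:

/* Reclaim all snapshots with timestamp < GVT: none of them can ever be
 * the target of a rollback anymore. Entries beyond 'kept' become reusable. */
void fossil_collect(snapshot_t *log, int *n, double gvt) {
    int kept = 0;
    for (int i = 0; i < *n; i++)
        if (log[i].timestamp >= gvt)
            log[kept++] = log[i];
    *n = kept;
}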
The issue of the execution time for taking snapshots is, on the other hand, tackled by modifying the frequency of the state saving operation (which indirectly tackles memory usage as well), and by tuning the amount of information stored in each snapshot (namely, a snapshot can be either full or incremental).
Rollback Supports: Reverse Computation
A completely different approach to support the rollback operation is reverse computation, which essentially relies on the notion of reverse events to restore a previous simulation state. By relying on compiler techniques, this approach automatically generates reverse events which are able, by executing the same operations as the regular events but in reverse order, to undo the effects of one event on the simulation state. A rollback operation is therefore supported by executing at the rolling-back LP \(i\) all the reverse events in backwards order, starting from ST \(LVT_i\) down to \(T_{rollback}\).
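As a sketch of the mechanism (event_t and the handler pointers are illustrative, not the actual instrumentation output), the processed events can be kept in a chain ordered by timestamp, each carrying its automatically-generated reverse handler:

typedef struct event {
    double ts;                                     /* event timestamp  */
    void (*forward)(void *state, struct event *e); /* regular handler  */
    void (*reverse)(void *state, struct event *e); /* undoes forward() */
    struct event *prev;                            /* previously processed event */
} event_t;

/* Undo processed events, newest first, until T_rollback is reached. */
void reverse_to(event_t *last_processed, void *state, double t_rollback) {
    for (event_t *e = last_processed; e != NULL && e->ts > t_rollback; e = e->prev)
        e->reverse(state, e);
}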
A reverse event is essentially a copy of a regular event which executes the same operations but in reverse order. Consider, for example, this sample regular event, which models a cell transition in an ATM multiplexer model:
if(qlen > 0) { qlen--; sent++; }
The reverse event would be:
if(qlen "was" > 0) { sent--; qlen++; }
We note that while reversing arithmetic operations can be straightforward, reversing the branching condition is not, as the reverse event must check an "old" value of the state variables, which is no longer available when processing it.
Thus, reverse computation relies on the modification of regular events by adding bit variables, i.e. transparently-added state variables which record whether a particular branch was taken during the forward execution phase. The regular event presented in the example above would therefore become:
if(qlen > 0) { b = 1; qlen--; sent++; }
and the reverse event can now rely on a (bit) state variable which was set after the branch was taken:
if(b == 1) { sent--; qlen++; }
This approach increases the size of the simulation state, but relying on bit variables keeps the increase negligible, as it amounts to \(\log_2(\#branches)\) bits per branching construct. This approach can be used to keep track of the execution of \(n\)-way if statements and switch/case constructs as well.
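For instance, a hypothetical 3-way variant of the event above can be instrumented with a 2-bit variable recording which arm ran forward, so that the reverse event inverts exactly that arm (the throttling logic is made up for illustration):

unsigned branch_bits;   /* transparently-added state: 2 bits suffice for 3 arms */

void forward_event(int *qlen, int *sent, int *dropped) {
    if (*qlen > 0 && *sent < 100) { branch_bits = 0; (*qlen)--; (*sent)++;    }
    else if (*qlen > 0)           { branch_bits = 1; (*qlen)--; (*dropped)++; } /* throttled */
    else                          { branch_bits = 2; (*dropped)++;            } /* queue empty */
}

void reverse_event(int *qlen, int *sent, int *dropped) {
    switch (branch_bits) {          /* invert exactly the arm that was taken */
    case 0: (*sent)--;    (*qlen)++; break;
    case 1: (*dropped)--; (*qlen)++; break;
    case 2: (*dropped)--;            break;
    }
}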
In reverse computation, nevertheless, we note that not all operations are reversible. For example, destructive operations like = or %= must be handled by relying on state saving techniques. Nevertheless, since these operations have a very fine granularity, and considering that the model's executable is instrumented to alter the original logic so as to support the execution/generation of reverse events, this can be done using a technique similar to the incremental state saving mentioned above, where memory updates are tracked and a fine-grain word-based log is generated for each memory update.
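A sketch of such a word-based log (layout and names are assumptions): the instrumented code replaces each destructive assignment with a call that saves the word about to be overwritten, and reversal replays the log backwards:

typedef struct { long *addr; long old; } word_log_t;

static word_log_t wlog[1024];   /* assumed fixed-size log */
static int wlog_top = 0;

/* Instrumented destructive assignment: 'x = v' becomes log_assign(&x, v). */
static void log_assign(long *x, long v) {
    wlog[wlog_top].addr = x;
    wlog[wlog_top].old  = *x;   /* save the value that is about to be lost */
    wlog_top++;
    *x = v;
}

/* Undo all assignments logged after 'saved_top', newest first. */
static void undo_assignments(int saved_top) {
    while (wlog_top > saved_top) {
        wlog_top--;
        *wlog[wlog_top].addr = wlog[wlog_top].old;
    }
}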
Loops can be handled by simply taking note of the number of iterations \(n\), which is extremely important in the case of a variable number of iterations, as with the while statement. The reverse loop is then executed \(n\) times.
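A sketch with a made-up queue-draining loop: the forward event counts the iterations of the data-dependent while, and the reverse event repeats the inverted body exactly that many times (n_iter is an illustrative, transparently-added state variable):

unsigned n_iter;   /* transparently-added iteration counter */

void forward_drain(int *qlen, int *sent) {
    n_iter = 0;
    while (*qlen > 0) { (*qlen)--; (*sent)++; n_iter++; }
}

void reverse_drain(int *qlen, int *sent) {
    for (unsigned i = 0; i < n_iter; i++) { (*sent)--; (*qlen)++; }
}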
On the other hand, handling jump instructions (i.e., goto, break, and continue) or function calls requires some bit variables to store the actual flow of the forward events. The reverse events can then rely on a set of (automatically-generated) switch/case constructs to reproduce the actual (reverse) execution flow. This can generate a non-negligible state increase, depending on the complexity of the code and on the actual runtime execution flow of the model.
An important aspect in reverse computation is that of (pseudo-)random number generators. This problem, which is similar to the one described in the case of state saving, requires that repeated calls to random number generators belonging to the same (logical) invocation return the same result, thus allowing a piece-wise deterministic execution of the model. This can be achieved by applying the same rules used to generate reverse events to the random number generator's code, provided that it does not rely on lossy floating point operations, which are in any case hard to support in the generation of reverse events.
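As an example of a generator that satisfies this requirement without any state saving, a 64-bit linear congruential generator is exactly invertible, since its (odd) multiplier has a modular inverse mod 2^64 (the constants below are Knuth's MMIX parameters; the function names are illustrative):

#include <stdint.h>

static const uint64_t A = 6364136223846793005ULL;  /* multiplier */
static const uint64_t C = 1442695040888963407ULL;  /* increment  */

/* Modular inverse of an odd 64-bit number via Newton/Hensel iteration:
 * each step doubles the number of correct low-order bits. */
static uint64_t inverse(uint64_t a) {
    uint64_t x = a;                    /* correct to 3 bits for odd a */
    for (int i = 0; i < 5; i++)
        x *= 2 - a * x;
    return x;
}

uint64_t rng_next(uint64_t *s) { return *s = A * *s + C; }       /* forward step */
void     rng_undo(uint64_t *s) { *s = inverse(A) * (*s - C); }   /* reverse step */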
Overall, reverse computation adds a limited overhead to the forward execution of the simulation model, and can provide a significant reduction in the delay of rollback operations, provided that the rollback length is not very high. On the other hand, if the simulation model has a large number of destructive operations, it can substantially fall back to incremental state saving (ISS), but with an overall performance lower than that of ISS.