Resolve review comments

sophimao · sophimao · commit ba0ad4e07ea6 · 2023-09-25T06:24:26.000-07:00
diff --git a/docs/eager_progress.md b/docs/eager_progress.md
@@ -91,13 +91,13 @@ The device op queue accepts and schedules device operations based on resource av
     </tbody>
 </table>
 
-As mentioned before, a single user level event might be decomposed into several device ops, which can be in either a proposed state or a committed state when live. When a device op is created, an free device op will be populated and added to the live queue, which changes its state to proposed. If any of the subsequent device operations generated from the same user level event fail to submit, then this device operation will be revoked and added back to the free queue. If all of the device operations generated from the same user level event are successfully created and submitted, then the runtime will change the state of those device operations to committed by adding them to the committed queue. After the device operation is done, its state will be changed back to free and added back to the free list.
+As mentioned before, a single user-level event might be decomposed into several device ops, which can be in either a proposed state or a committed state when live. When a device op is created, a free device op will be populated and added to the live queue, which changes its state to proposed. If any of the subsequent device operations generated from the same user-level event fail to submit, then this device operation will be revoked and added back to the free queue. If all of the device operations generated from the same user-level event are successfully created and submitted, then the runtime will change the state of those device operations to committed by adding them to the committed queue. After the device operation is done, its state will be changed back to free and it will be added back to the free list.
 
 |<img src="images/eager_progress/doq.jpg" alt="doq" width="500"/>|
 |:--:|
 |Figure 2:  Device operation queue|
 
-Each device op keeps track of two status, `status` and `execution_status`, the former is the status known by the device op queue whereas the later is the status known by the actual device, as a result `execution_status` should always be ahead of `status`. When `status` and `execution_status` do not match, that signifies a device update has been made for the first time to the runtime The runtime will propagate the device op status to its owning event only in this case. The following table details the four statuses and their update mechanisms only for `execution_status`, but device op status and event execution status can also take values from the listed four statuses (update mechanisms vary).
+Each device op keeps track of two statuses, `status` and `execution_status`. The former is the status known by the device op queue, whereas the latter is the status known by the actual device; as a result, `execution_status` should always be ahead of `status`. When `status` and `execution_status` do not match, that signifies a device update has been made for the first time to the runtime. The runtime will propagate the device op status to its owning event only in this case. The following table details the four statuses and their update mechanisms only for `execution_status`, but device op status and event execution status can also take values from the listed four statuses (update mechanisms vary).
 | device op execution status | Updated by runtime | Updated by device interrupt |
 | --- | --- | --- |
 | `CL_QUEUED` | x | |
@@ -122,7 +122,7 @@ The hung happens when a long-running kernel is launched multiple times, and then
 |:--:|
 |Figure 3:  Runtime hung when launching the same kernel for multiple times, without fask kernel relaunch|
 
-This scenario is mitigated by the introduction of the fast kernel relaunch feature, which enables submission of the kernel device op generated by the launch of the same kernel by `fast_launch_depth` number of time, where `fast_launch_depth` is the depth of the kernel argument preload buffer. (For more details on the Fast Kernel Relaunch feature, please find it in the [Fast Kernel Relaunch FD](https://github.com/intel-innersource/applications.fpga.oneapi.products.acl-docs/blob/986fe42ff647c2a264ed9a440bec3d27dcec3a05/FDs/runtime/opencl_fast_kernel_relaunch_fd.docx).) With this feature, there can be `fast_launch_depth`+1 number of kernel device ops in either a `CL_SUBMITTED` or `CL_RUNNING` status, which would resolve the above example illustrated in Figure 3 as now the second kernel device op can be launched without the first kernel device op getting completed.
+This scenario is mitigated by the introduction of the fast kernel relaunch feature, which enables submission of the kernel device ops generated by launching the same kernel `fast_launch_depth` number of times, where `fast_launch_depth` is the depth of the kernel argument preload buffer. (For more details on the Fast Kernel Relaunch feature, see the [Simple Host-Device Streaming page, Lower Bounds on Latency section](https://www.intel.com/content/www/us/en/docs/oneapi-fpga-add-on/optimization-guide/2023-2/simple-host-device-streaming.html).) With this feature, there can be `fast_launch_depth`+1 kernel device ops in either a `CL_SUBMITTED` or `CL_RUNNING` status, which resolves the example illustrated in Figure 3, because the second kernel device op can now be launched before the first kernel device op completes.
 
 |<img src="images/eager_progress/good_fkr.png" alt="good fkr" width=600/>|
 |:--:|
@@ -213,7 +213,7 @@ At the proposal phase of this feature, the first attempted locking scheme is as
 |:--:|
 |Figure 7:  Initial new locking scheme proposal for the multi-producer single consumer model|
 
-On the producer side, not much will be changed, all the producer functions will continue to hold the global lock and have a pseudo-single-threaded behaviour. Only only change is that the functions that submits new device ops to the device op queue, `acl_propose_device_op`, `acl_forget_proposed_device_ops`, and `acl_commit_proposed_device_ops` will need to hold the DoQ lock as they directly modifies structure of the device op queue. Note that when the producer threads calls these functions, the global lock will continue to be locked regardless of the locking of the DoQ lock. This is because we may have multiple producer threads that attempt to propose and commit to the device op queue at the same time. Without holding the global lock, one producer can just finish proposing and another producer can come in, propose, and forget device ops on error, then the second producer will forget not only the device ops proposed by itself, but also the device ops proposed by the first producer. Making the producers hold the global lock while submitting the device ops ensures that at any point when the global lock is released, there should not be any uncommitted device ops.
+On the producer side, not much will change: all the producer functions will continue to hold the global lock and have pseudo-single-threaded behaviour. The only change is that the functions that submit new device ops to the device op queue, `acl_propose_device_op`, `acl_forget_proposed_device_ops`, and `acl_commit_proposed_device_ops`, will need to hold the DoQ lock, as they directly modify the structure of the device op queue. Note that when the producer threads call these functions, the global lock will continue to be held regardless of the state of the DoQ lock. This is because we may have multiple producer threads that attempt to propose and commit to the device op queue at the same time. Without the global lock, one producer could have just finished proposing when another producer comes in, proposes, and forgets device ops on error; the second producer would then forget not only the device ops it proposed itself, but also the device ops proposed by the first producer. Making the producers hold the global lock while submitting the device ops ensures that at any point when the global lock is released, there should not be any uncommitted device ops.
 
 On the consumer side, we will look at the three sub-functions separately. The first sub-function, `l_attempt_submit_single_device_op`, holds the DoQ lock and loops through the committed subqueue of the device op queue to check if there is any device ops that can be submitted to the device. If any of the device ops can be submitted, it will break the loop, release the DoQ lock (to avoid potential deadlock, more on this later) and submit that single device op. The goal is to not hold the global lock while looping through the committed queue and only acquire it during the actual submission of the device op[^2], since many of the device submission functions modifies broader runtime constructs, for example, fields under `cl_device` may be changed by device submission function `acl_program_device`. <!-- TODO: releasing the DoQ lock in the middle of the course might lead to lost wake-up. Say the consumer went through submit stage and nothing needs to be submitted, then it releases the lock. At this point, the producer comes in and commit something to the device op queue, then wake up the consumer, but the consumer is not waiting so this wake up will do nothing. The consumer continue to update and prune, find no more update, goes to wait, but it actually shouldn't. --> <!-- Note regarding my TODO: this might not be the case if our synchronization is semaphore  based -->