[UR][L0v2] add support for batched queue submissions #19769
base: sycl
Conversation
Have you been able to run the SubmitKernel benchmarks? If so, can you please share results?
Outdated review threads (resolved):
- unified-runtime/source/adapters/level_zero/v2/command_list_manager.cpp (2 threads)
- unified-runtime/source/adapters/level_zero/v2/queue_batched.cpp (4 threads)
Batched queues enable submission of operations to the driver in batches, reducing the overhead of submitting every operation individually. Similarly to command buffers in L0v2, they use regular command lists (referred to below as 'batches'). Operations enqueued on regular command lists are not executed immediately, but only after the regular command list is enqueued on an immediate command list. In contrast to command buffers, however, batched queues also handle submission of the batches (regular command lists) themselves, using an internal immediate command list, instead of only collecting enqueued operations.

Batched queues introduce:
- batch_manager, which stores the current batch, a command list manager with an immediate command list for batch submissions, the vector of submitted batches, and the generation number of the current batch.
- The current batch is a command list manager with a regular command list; operations requested by users are enqueued on it. The current batch may be submitted for execution on the immediate command list, replaced by a new regular command list, and stored in the vector of submitted batches until execution completes.
- The number of regular command lists stored for execution is limited.
- The generation number of the current batch is assigned to events associated with operations enqueued on that batch, and is incremented on every replacement of the current batch. When an event created by a batched queue appears in an eventWaitList, the batch assigned to that event might not have been executed yet, so the event might never be signalled. Comparing generation numbers determines whether the current batch must be submitted for execution: if the generation number of the current batch is higher than the number assigned to the event, the batch associated with the event has already been submitted, and no additional submission of the current batch is needed.
- Regular command lists use the regular pool cache type, whereas immediate command lists use the immediate pool cache type. Since user-requested operations are enqueued on regular command lists and immediate command lists are only used internally by the batched queue implementation, events are not created for immediate command lists.
- wait_list_view is modified. Previously, it only stored the waitlist (as a ze_event_handle buffer created from events) and the corresponding event count in a single container that could be passed as an argument to the driver API. Now, the constructor also ensures that all associated operations will eventually be executed: since regular command lists are not executed immediately, but only after being enqueued on immediate lists, it is necessary to enqueue the regular command list associated with the given event; otherwise, the event would never be signalled.

Additionally, support for UR_QUEUE_INFO_FLAGS in urQueueGetInfo has been added for Native CPU, which is required by the enqueueTimestampRecording tests. enqueueTimestampRecording is currently not supported by batched queues.

Batched queues can be enabled by setting UR_QUEUE_FLAG_SUBMISSION_BATCHED in ur_queue_flags_t, or globally through the environment variable UR_L0_FORCE_BATCHED=1.

Benchmark results for default in-order queues (sycl branch, commit hash: b76f12e) and batched queues:
- api_overhead_benchmark_ur SubmitKernel in order: 20.839 μs
- api_overhead_benchmark_ur SubmitKernel batched: 12.183 μs
```cpp
// immediately, but only after enqueueing on immediate lists, it is necessary to
// enqueue the regular command list associated with the given event. Otherwise,
// the event would never be signalled. The enqueueing is performed in
// onWaitListView().
```
It's called onWaitListUse in your code.
```cpp
}

// At most one additional event might be added after creating the given waitlist
void wait_list_view::addEvent(ur_event_handle_t Event) {
```
I'm guessing that this additional event is the primary event related to the given wait list; in that case, appendPrimaryEvent or something along those lines would be more expressive.
Also, lowercase event or phEvent.
```cpp
// At most one additional event might be added after creating the given waitlist
void wait_list_view::addEvent(ur_event_handle_t Event) {
  if (Event) {
```
```cpp
if (!event) {
  return;
}
```
Less nesting.
```cpp
UR_CALL(appendGenericCommandListsExp(1, &commandBufferCommandList, phEvent,
                                     waitListView,
                                     UR_COMMAND_ENQUEUE_COMMAND_BUFFER_EXP,
                                     /* already synchronized */ nullptr));
```
Before your change, even though the event was already synchronized (with zeEventHostSynchronize) if non-null, it was still passed here. Now, it looks like appendGenericCommandListsExp is always being called with additionalWaitEvent param as nullptr.
For tests in CI, batched queues are enabled by default, which may cause failures in tests dedicated to other types of queues, e.g. in-order ones. This will be reverted after the CI runs.