Skip to content

oneAPI backend update: kernel and layer optimizations #1246

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

jmitrevs
Copy link
Contributor

Description

This is a replacement of #1218, moving the branch to the main repository for easier contribution by others.

Type of change

  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

This PR introduces improvements to the oneAPI inference backend, focusing on:

  • Utilizing sideband signals (sop and eop) in StreamingBeat for multi-kernel synchronization.
  • Refactoring core layers (Dense & ReLU) to employ always-run kernels and non-blocking I/O.
  • Introducing compile-time type extraction utilities for streamlined template handling.
  • Adding DMA-based data movement for generic execution.
  • Automated code generation.

Sideband Signal Support

  • Added start-of-packet (sop) and end-of-packet (eop) signals for kernel synchronization.
  • The following using-directive is generated per inter-kernel pipe and hostpipe. This ensures multiple kernels can operate in sync.
    using InputBeatT = sycl::ext::intel::experimental::StreamingBeat<
        data_T, // Data type
        true,    // Enable start-of-packet
        true>;   // Enable end-of-packet

Updated Dense and ReLU Layer for Always-Running Execution

  • Uses sop/eop sideband signals for synchronization.
  • Implements non-blocking reads for seamless streaming.
  • Utilizes while loop for always-on kernel execution.

Added DMA Kernels for Hardware Execution

  • DMA-based data movement for improved memory transfer:
    • DMA_convert_data and DMA_convert_data_back move data between host and FPGA efficiently.
    template <class srcType, class dest_pipe, size_t num_iterations> struct DMA_convert_data {};
    template <class src_pipe, class dstType, size_t num_iterations> struct DMA_convert_data_back {};
  • Modification to the way that testbench starts
    q.single_task(DMA_convert_data<float, Conv1DInputPipe, num_iterations>{vals_ptr});
    q.single_task(Myproject{});
    q.single_task(DMA_convert_data_back<Layer4OutPipe, float, num_iterations>{output_ptr}).wait();

Utility Functions for Compile-Time Type Extraction

  • Added helper structs to extract data types from pipes and StreamingBeat:

Tests

Tested the updated layers in emulation, simulation, and hardware run. Tests conducted by generating the project file using the oneAPI backend code generator, and compiling for the binary using cmake.

Test Configuration:

  • Configure the Quartus Prime Pro software with environment variables correctly setup (needed for simulation and bitstream generation.)
  • Configure the oneAPI environment with the extension Environment Configurator for oneAPI Toolkits.
  • Source the setvars script.

Checklist

  • I have read the guidelines for contributing.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have installed and run pre-commit on the files I edited or added.
  • I have added tests that prove my fix is effective or that my feature works.

@jmitrevs jmitrevs changed the title oneAPI backend update: kernel and layer optimizations (replace #1218) oneAPI backend update: kernel and layer optimizations Mar 26, 2025
@jmitrevs jmitrevs marked this pull request as draft March 26, 2025 19:28
@haoyanwa haoyanwa mentioned this pull request Apr 1, 2025
// and send to the sink. Adaptive to SYCL HLS and hardware acceleration flow.
template <class src_T, class dest_pipe> struct DMA_convert_data {
#if !defined(IS_BSP)
// When targeting a device family, we instantiate an Avalon Memory Mapped Host for
Copy link
Contributor Author

@jmitrevs jmitrevs Apr 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if all the DMA_convert_data things should be moved to a different file. In the SYCL HLS style they are effectively part of the testbench, so I think should be in a different file. In the accelerator flow, they still are different kernels, utility kernels in a way, so I think they should be separate.

@@ -13,22 +13,28 @@
namespace nnet {

template <class srcType, class dest_pipe, size_t SIZE> void convert_data(sycl::queue &q, srcType *src) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should discuss what happens with this function vs the new DMA versions of these.

@jmitrevs
Copy link
Contributor Author

jmitrevs commented May 2, 2025

I noticed, by the way, that ReLU uses blocking reads, and all the components use blocking writes. Is there a requirement to use nonblocking reads and writes? Note, we do need to handle back-pressure, which is much more natural to do with blocking I/O.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants