
Coyote accelerator backend #1347


Open

bo3z wants to merge 8 commits into main

Conversation

bo3z
Contributor

@bo3z bo3z commented Jul 28, 2025

Description

📝 This PR introduces a new accelerator backend, CoyoteAccelerator, which leverages the open-source Coyote shell for deploying models on a PCI-attached FPGA.

Compared to other shells, Coyote offers several advantages, including:

  • Networking support, so the backend can easily be extended to distributed inference; this is also interesting for in-network ML.
  • GPU–FPGA integration, so models can be executed across a combination of hardware.
  • Dynamic reconfiguration, which could allow run-time reconfiguration of models.
  • Multi-tenancy, so multiple models could be deployed concurrently.

The backend is briefly described in Section 9.7 of the paper: https://arxiv.org/pdf/2504.21538.

Type of change

  • New feature (non-breaking change which adds functionality)
  • A new research paper code implementation

Tests

This backend was compared against a modified version of the VivadoAccelerator backend: the backend was modified to run HLS synthesis with Vitis instead of Vivado (also using the Vitis templates and optimizers), while the rest of the backend infrastructure (drivers, data movers) remained the same, since it also works in newer versions of Vivado. Results are attached below and clearly indicate an advantage for Coyote, for two reasons: (1) optimised data movement that bypasses card memory, and (2) an optimised host-side library (Python, C++).

In principle, the correct test would be to compare against VitisAccelerator (#991), but only after the io_parallel issues are resolved. However, the expectation is that the results will remain largely the same, since the underlying platform requires a data copy between host and card memory.

More results will be added, including an io_stream CNN and comparisons to VitisAccelerator.

[Screenshot: comparison of CoyoteAccelerator with the modified VivadoAccelerator backend for the UNSW-NB15 dataset in io_parallel.]

Checklist

  • I have read the guidelines for contributing.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have installed and run pre-commit on the files I edited or added.
  • I have added tests that prove my fix is effective or that my feature works.

// Polling loop with a short sleep between completion checks:
while (coyote_thread.checkCompleted(coyote::CoyoteOper::LOCAL_TRANSFER) != batch_size) {
    std::this_thread::sleep_for(std::chrono::nanoseconds(50));
}

// Pure busy-wait polling loop:
while (coyote_thread.checkCompleted(coyote::CoyoteOper::LOCAL_TRANSFER) != batch_size) {}
Contributor
Wouldn't this cause 100% CPU usage while the program is polling?

Contributor Author
On one of the cores, yes.

But sleeping for less than 50us is not well-defined on most Linux platforms. Hence, the measured latency can range from ~4us to >50us even though the "true" execution latency is still ~4us.
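
For illustration, here is a minimal sketch of a hybrid alternative: spin for a bounded number of iterations to keep the measured latency tight, then back off to a sleep so the core is not held indefinitely. The checkCompleted() call is taken from the snippet above; the wrapper function wait_for_completion, the spin budget, and the back-off interval are hypothetical and not part of this PR.

```cpp
// Illustrative sketch only. Assumes the Coyote headers providing
// coyote::CoyoteOper are included and that the thread object exposes
// checkCompleted() as in the snippet above.
#include <chrono>
#include <cstdint>
#include <thread>

template <typename CoyoteThread>
void wait_for_completion(CoyoteThread& coyote_thread, uint32_t batch_size) {
    constexpr unsigned kSpinBudget = 100000;  // hypothetical spin budget before backing off
    unsigned spins = 0;
    while (coyote_thread.checkCompleted(coyote::CoyoteOper::LOCAL_TRANSFER) != batch_size) {
        if (++spins < kSpinBudget) {
            continue;  // busy-wait: lowest, most deterministic latency, but pins one core
        }
        // Past the budget, release the core; sub-50us sleeps are not reliably
        // honoured by the Linux scheduler, so this trades latency for CPU time.
        std::this_thread::sleep_for(std::chrono::microseconds(50));
    }
}
```

Whether the extra branch in the hot path is acceptable depends on how tight the latency measurements need to be; for pure benchmarking, the plain busy-wait remains the most deterministic option.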
