Optimised USB receive code - interrupts, no locking, block transfers,… #34

Open: wants to merge 8 commits into master (base branch)
101 changes: 101 additions & 0 deletions README.md
@@ -0,0 +1,101 @@
Optimised host <-> Arduino Due streaming data transfer over native USB port
============================================================================

Streaming data to the Arduino Due over the native USB port was disappointingly slow, about 60 kB/s, yet transmission from the Arduino to the host computer could run at MB/s rates, suggesting that the USB connection itself was not the limiting factor. Inspection of the code showed that data was received byte by byte through a chain of functions, with an overhead of massively redundant checks.

This repository is a fork of the official code and contains optimisations of the USB code. With them, it was possible to perform round-trip streaming of data to and from the Arduino Due at 2.5 MB/s.

Only 3 files have been changed. If you don't wish to deal with the whole git repository, the files CDC.cpp, USBAPI.h and USBCore.cpp in the ArduinoCore-sam/cores/arduino/USB/ directory can be copied into your local board package, which on my Linux system is found here:
~/.arduino15/packages/arduino/hardware/sam/<version>/cores/arduino/USB/

A pull request has been submitted but not acted upon.

Changes to the arduino libraries
================================

A non-blocking overloaded read function that takes a buffer and a size as parameters is now provided as a member of the SerialUSB class. If neither of the read functions is used, for instance in a DMA application, the user may need to call "SerialUSB.accept()" periodically. This matters when the buffer may at times be too full to accept a full FIFO (512 bytes) of data when the interrupt fires, because reception then blocks. CDC_SERIAL_BUFFER_SIZE can be increased (in the library code) from the original 512 bytes to reduce this risk.
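As a rough illustration of how the block read might be used, the sketch below echoes whatever is buffered in one pass. The helper `echoOnce` and the host-side stand-in `FakePort` are hypothetical names introduced here; the only assumption about the port is that it offers `read(uint8_t *, size_t)` and `write(const uint8_t *, size_t)`, as declared for `Serial_` in USBAPI.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: one pass of an echo loop built on a block read.
// Port is any type with read(uint8_t*, size_t) -> int and
// write(const uint8_t*, size_t) -> size_t.
template <class Port>
size_t echoOnce(Port &port, uint8_t *scratch, size_t scratchSize)
{
    assert(scratch != nullptr && scratchSize > 0);
    // Non-blocking: returns 0 when nothing is buffered.
    int n = port.read(scratch, scratchSize);
    if (n <= 0) return 0;
    return port.write(scratch, static_cast<size_t>(n));
}

// Host-side stand-in for SerialUSB (an assumption, for off-target testing).
struct FakePort {
    uint8_t rx[16];
    size_t rxLen = 0;
    uint8_t tx[16];
    size_t txLen = 0;

    // Hand over up to s pending bytes and drop them from the queue.
    int read(uint8_t *d, size_t s) {
        size_t n = s < rxLen ? s : rxLen;
        std::memcpy(d, rx, n);
        std::memmove(rx, rx + n, rxLen - n);
        rxLen -= n;
        return static_cast<int>(n);
    }
    // Append up to s bytes to the transmit log.
    size_t write(const uint8_t *b, size_t s) {
        size_t room = sizeof(tx) - txLen;
        size_t n = s < room ? s : room;
        std::memcpy(tx + txLen, b, n);
        txLen += n;
        return n;
    }
};
```

On the Due with this fork, `loop()` could presumably call `echoOnce(SerialUSB, buf, sizeof buf)` with a statically allocated `buf`.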

The SerialUSB (Serial_) class has been modified to remove all mention of the RingBuffer used elsewhere in the Arduino code but NOT here. This was confusing at best.

The ring_buffer that IS used in SerialUSB has been made a member of the class. This enables access to it during DMA applications (e.g. streaming to DAC), eliminating needless copy operations.

The implementation of the ring_buffer has been altered slightly to facilitate DMA applications; head and tail are now ever-increasing 64-bit unsigned integers, reduced modulo the buffer size only when addressing the buffer.
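The idea of ever-increasing indices can be illustrated with a small stand-alone sketch. `MonotonicRing` and its members are hypothetical names, and the 8-byte size is chosen only to keep the example short; this is not the library's `ring_buffer`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative ring buffer whose indices only ever grow.
struct MonotonicRing {
    static const uint32_t kSize = 8;  // power of 2, so % can compile to &
    uint8_t data[kSize] = {};
    uint64_t head = 0;  // total bytes ever written (touched by the producer only)
    uint64_t tail = 0;  // total bytes ever read (touched by the consumer only)

    // Occupancy is a plain difference; no modulus is needed.
    uint32_t used() const { return static_cast<uint32_t>(head - tail); }
    uint32_t space() const { return kSize - used(); }

    bool push(uint8_t c) {
        if (space() == 0) return false;  // full: head == tail + kSize
        data[head % kSize] = c;          // modulus only when addressing
        ++head;
        return true;
    }
    int pop() {
        if (used() == 0) return -1;      // empty: head == tail
        uint8_t c = data[tail % kSize];
        ++tail;
        return c;
    }
};
```

Because empty (`head == tail`) and full (`head == tail + kSize`) are distinct states, all `kSize` slots can be used, and a consumer that overruns the producer shows up as `tail > head`.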

The Arduino Due code uses poor man's interrupts, scheduling them between loop iterations. Proper USB interrupts can now be enabled via the new member functions enableInterrupts() and disableInterrupts(), and interrupt-driven code can handle most, and in some scenarios all, data reception.

The code has been reworked to remove the need for locking even with interrupts enabled. This is achieved by using the FIFO signals for synchronisation.

Block transfers are now used throughout the reception chain and their overhead has been minimised. The accept function has been rewritten.
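A block transfer into a ring buffer needs at most two contiguous copies: one up to the end of the storage and, if the data wraps, one from the start. The sketch below shows that splitting with `memcpy`; in the library the copy comes from the USB FIFO via `UDD_Recv` instead. The function name `ringWriteBlock` and its signature are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: store up to n bytes from src in a ring of `cap` bytes.
// head and tail are ever-increasing byte counts; returns the bytes stored.
size_t ringWriteBlock(uint8_t *ring, size_t cap, uint64_t &head, uint64_t tail,
                      const uint8_t *src, size_t n)
{
    assert(cap > 0 && head - tail <= cap);
    size_t space = cap - static_cast<size_t>(head - tail);
    size_t remaining = std::min(space, n);
    const size_t stored = remaining;
    while (remaining) {
        // The first pass may only reach the end of the storage array.
        size_t h = static_cast<size_t>(head % cap);
        size_t chunk = std::min(remaining, cap - h);
        std::memcpy(ring + h, src, chunk);
        src += chunk;
        head += chunk;
        remaining -= chunk;
    }
    return stored;
}
```

The loop body runs once when the data fits before the end of the array and twice when it wraps, so the per-byte cost of the original byte-by-byte path is avoided.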

The changes to the code are under the same licences as the original files.

Examples
========

Speed test
----------

This is a simple speed test. The Arduino sketch just reads available data on the native USB serial port using the new block read member and sends it back, in a loop. On the host computer, a large array is written to and read back from the serial tty by a short C++ program making use of the "select" call for efficient sequencing of the i/o operations. The port is specified as a parameter ("0" in the example).
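The host-side sequencing described above might look roughly like the following. This is a minimal sketch, not the contents of speed_test.cpp: `roundTrip` is a hypothetical function, and it takes separate read and write descriptors so that it can be tried on a pipe; for a serial port both would be the same descriptor.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/select.h>
#include <unistd.h>

// Hypothetical helper: stream n bytes out through wfd while collecting the
// n bytes that come back on rfd. Returns true once everything has returned.
bool roundTrip(int rfd, int wfd, const uint8_t *out, uint8_t *in, size_t n)
{
    assert(out != nullptr && in != nullptr);
    // Non-blocking mode lets write() take a partial block instead of stalling.
    fcntl(rfd, F_SETFL, fcntl(rfd, F_GETFL) | O_NONBLOCK);
    fcntl(wfd, F_SETFL, fcntl(wfd, F_GETFL) | O_NONBLOCK);

    size_t sent = 0, received = 0;
    const int nfds = (rfd > wfd ? rfd : wfd) + 1;
    while (received < n) {
        fd_set rset, wset;
        FD_ZERO(&rset);
        FD_ZERO(&wset);
        FD_SET(rfd, &rset);
        if (sent < n) FD_SET(wfd, &wset);  // stop polling for write when done

        // Sleep until progress is possible in at least one direction.
        if (select(nfds, &rset, &wset, nullptr, nullptr) < 0) {
            if (errno == EINTR) continue;
            return false;
        }
        if (FD_ISSET(rfd, &rset)) {
            ssize_t r = read(rfd, in + received, n - received);
            if (r > 0) received += static_cast<size_t>(r);
            else if (r == 0) return false;  // other end closed
            else if (errno != EAGAIN && errno != EINTR) return false;
        }
        if (sent < n && FD_ISSET(wfd, &wset)) {
            ssize_t w = write(wfd, out + sent, n - sent);
            if (w > 0) sent += static_cast<size_t>(w);
            else if (w < 0 && errno != EAGAIN && errno != EINTR) return false;
        }
    }
    return true;
}
```

Reading and writing inside one `select` loop keeps both directions moving, so the incoming side is drained while later blocks are still being sent.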

$ g++ -O3 -o speed_test speed_test.cpp

$ time ./speed_test 0

Test round-trip streaming with 100000000 bytes.

/dev/ttyACM0

Arrays equal!


real 0m37.852s

user 0m0.288s

sys 0m1.576s

100 MB in ~40 s is 2.5 MB/s.
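The arithmetic can be checked with a few lines of code. `throughputMBps` is a hypothetical helper, and 1 MB is taken to mean 10^6 bytes.

```cpp
#include <cassert>

// Throughput in MB/s (1 MB = 10^6 bytes) for `bytes` moved in `seconds`.
double throughputMBps(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1e6;
}
```

With the measured wall-clock time, `throughputMBps(100000000.0, 37.852)` comes to a little over 2.6 MB/s; rounding the time up to 40 s gives the 2.5 MB/s quoted above. Since the array is written and then read back, that rate applies to each direction.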

Bidirectional streaming with DAC and ADC DMA
--------------------------------------------

This example is much more involved, but reflects the motivation for this project. An array is streamed from the host computer to the Arduino, where it is transferred to the DAC by DMA. At the same time, two ADC channels are acquired at the same total frequency and streamed back to the host. Here the speed is limited by the maximum ADC rate of 1 MHz, corresponding to an Arduino -> host data rate of 2 MB/s (with 1 MB/s flowing in the opposite direction). A timer library is included from https://github.com/OliviliK/DueTC.
The Python script generates a file that contains a few control parameters in the header and data for the DAC, as well as space for the ADC data to be acquired, an error flag and a timestamp. The file is memory-mapped to enable simultaneous I/O. The path to the data file and the tty port are given as parameters.
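Memory-mapping a file on the host can be sketched as follows. `mapFile` and `unmapFile` are hypothetical helpers; the actual file layout used by bidi.cpp is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: map `size` bytes of the file at `path` for shared
// read/write access, growing the file if needed. Returns nullptr on failure.
uint8_t *mapFile(const char *path, size_t size)
{
    assert(path != nullptr && size > 0);
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;

    struct stat st;
    bool ok = fstat(fd, &st) == 0;
    if (ok && static_cast<size_t>(st.st_size) < size)
        ok = ftruncate(fd, static_cast<off_t>(size)) == 0;
    if (!ok) {
        close(fd);
        return nullptr;
    }
    // MAP_SHARED makes stores through the pointer reach the file itself.
    void *p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    return p == MAP_FAILED ? nullptr : static_cast<uint8_t *>(p);
}

// Flush pending changes to disk and release the mapping.
bool unmapFile(uint8_t *base, size_t size)
{
    return msync(base, size, MS_SYNC) == 0 && munmap(base, size) == 0;
}
```

With such a mapping, one region of the file can be read as the source of outgoing data while incoming data is stored in another region, without explicit file reads or writes in the streaming loop.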

Connect DAC0 to A0 and GND to A1 (for instance).

$ python genfile.py

$ g++ -O3 -o bidi bidi.cpp

$ time ./bidi test.dat /dev/ttyACM0

test.dat

1000000 42000000 42 2

2019-03-05_12:26:33

real 0m2.011s

user 0m0.144s

sys 0m1.802s

$ python display.py

header (1000000, 42000000, 42, 2)

error 0

timestamp 2019-03-04_01:05:30

<plot of input and uninterleaved outputs>

If you find the streaming to be unreliable (an error is raised), there is probably a bottleneck somewhere. Things to try:
- use an SSD instead of a normal disk
- modify the host code to work only in memory
- increase the sizes of the buffers in the Arduino library code, or of the DAC and ADC buffers.


The examples are released into the public domain.
117 changes: 70 additions & 47 deletions cores/arduino/USB/CDC.cpp
@@ -1,4 +1,5 @@
/* Copyright (c) 2011, Peter Barrett
** Copyright (c) 2017, Boris Barbour
**
** Permission to use, copy, modify, and/or distribute this software for
** any purpose with or without fee is hereby granted, provided that the
@@ -21,23 +22,12 @@

#ifdef CDC_ENABLED

#define CDC_SERIAL_BUFFER_SIZE 512

/* For information purpose only since RTS is not always handled by the terminal application */
#define CDC_LINESTATE_DTR 0x01 // Data Terminal Ready
#define CDC_LINESTATE_RTS 0x02 // Ready to Send

#define CDC_LINESTATE_READY (CDC_LINESTATE_RTS | CDC_LINESTATE_DTR)

struct ring_buffer
{
uint8_t buffer[CDC_SERIAL_BUFFER_SIZE];
volatile uint32_t head;
volatile uint32_t tail;
};

ring_buffer cdc_rx_buffer = { { 0 }, 0, 0};

typedef struct
{
uint32_t dwDTERate;
@@ -173,45 +163,45 @@ void Serial_::end(void)

void Serial_::accept(void)
{
static uint32_t guard = 0;

// synchronized access to guard
do {
if (__LDREXW(&guard) != 0) {
__CLREX();
return; // busy
}
} while (__STREXW(1, &guard) != 0); // retry until write succeed

// Use fifocon to synchronise. Leave if there is no data.
if (!Is_udd_fifocon(CDC_RX)) return;
// This rearms interrupt, but FIFO must be released before it
// can retrigger. Moved here from the interrupt service
// routine because we may come to this function directly.
if (Is_udd_out_received(CDC_RX)) udd_ack_out_received(CDC_RX);
ring_buffer *buffer = &cdc_rx_buffer;
uint32_t i = (uint32_t)(buffer->head+1) % CDC_SERIAL_BUFFER_SIZE;

// if we should be storing the received character into the location
// just before the tail (meaning that the head would advance to the
// current location of the tail), we're about to overflow the buffer
// and so we don't write the character or advance the head.
while (i != buffer->tail) {
uint32_t c;
if (!USBD_Available(CDC_RX)) {
udd_ack_fifocon(CDC_RX);
break;
}
c = USBD_Recv(CDC_RX);
// c = UDD_Recv8(CDC_RX & 0xF);
buffer->buffer[buffer->head] = c;
buffer->head = i;

i = (i + 1) % CDC_SERIAL_BUFFER_SIZE;
uint32_t b = CDC_SERIAL_BUFFER_SIZE;
uint32_t u = UDD_FifoByteCount(CDC_RX);
uint32_t s = b - (uint32_t)(buffer->head - buffer->tail);
uint32_t r = min(s, u);
while(r) {
// May only be able to fill to the end of the buffer in first call.
uint32_t h = (buffer->head)%b;
uint32_t g = min(r, b-h);
UDD_Recv(CDC_RX, &(buffer->buffer[h]), g);
r -= g;
buffer->head += g;
}
// Don't release FIFO if not all data was transferred.
if (!UDD_FifoByteCount(CDC_RX)) UDD_ReleaseRX(CDC_RX);
}

// release the guard
guard = 0;
void Serial_::enableInterrupts()
{
udd_enable_out_received_interrupt(CDC_RX);
udd_enable_endpoint_interrupt(CDC_RX);
}

void Serial_::disableInterrupts()
{
udd_disable_out_received_interrupt(CDC_RX);
udd_disable_endpoint_interrupt(CDC_RX);
}

int Serial_::available(void)
{
ring_buffer *buffer = &cdc_rx_buffer;
return (unsigned int)(CDC_SERIAL_BUFFER_SIZE + buffer->head - buffer->tail) % CDC_SERIAL_BUFFER_SIZE;
return (unsigned int)(buffer->head - buffer->tail);
}

int Serial_::availableForWrite(void)
@@ -231,29 +221,62 @@ int Serial_::peek(void)
}
else
{
return buffer->buffer[buffer->tail];
uint32_t b = CDC_SERIAL_BUFFER_SIZE;
return buffer->buffer[(buffer->tail)%b];
}
}

int Serial_::read(void)
{
ring_buffer *buffer = &cdc_rx_buffer;

// Give "accept" a chance to catch up if data is ready.
// Interrupt shouldn't be able to fire in this condition.
//if (Is_udd_fifocon(CDC_RX))
accept();

// if the head isn't ahead of the tail, we don't have any characters
if (buffer->head == buffer->tail)
{
return -1;
}
else
{
unsigned char c = buffer->buffer[buffer->tail];
buffer->tail = (unsigned int)(buffer->tail + 1) % CDC_SERIAL_BUFFER_SIZE;
if (USBD_Available(CDC_RX))
accept();
uint32_t b = CDC_SERIAL_BUFFER_SIZE;
unsigned char c = buffer->buffer[(buffer->tail)%b];
buffer->tail++;
return c;
}
}

int Serial_::read(uint8_t *d, size_t s)
{
ring_buffer *buffer = &cdc_rx_buffer;
uint32_t b = CDC_SERIAL_BUFFER_SIZE;
uint32_t a = (uint32_t) (buffer->head - buffer->tail);
// Number of bytes to read is the smaller of those available and those requested.
uint32_t r = min(a, s);
uint32_t k = r;
// May reach end of buffer before completing transfer.
while(r) {
uint32_t tm = (buffer->tail)%b;
uint32_t g = min(r, b-tm);
for (int i = 0 ; i < g; i++) {
d[i] = buffer->buffer[tm + i];
}
d += g;
r -= g;
buffer->tail += g;
}
// Give "accept" a chance to catch up if data is ready.
// Interrupt shouldn't be able to fire in this condition.
// if (Is_udd_fifocon(CDC_RX)) {
if ((a-k) < b) accept();
//}
return k;
}


void Serial_::flush(void)
{
USBD_Flush(CDC_TX);
48 changes: 44 additions & 4 deletions cores/arduino/USB/USBAPI.h
@@ -1,5 +1,7 @@
/*
Copyright (c) 2012 Arduino. All right reserved.
Copyright (c) 2012 Arduino.
Copyright (c) 2017 Boris Barbour
All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
@@ -21,7 +23,6 @@

#if defined __cplusplus

#include "RingBuffer.h"
#include "Stream.h"

//================================================================================
@@ -40,15 +41,53 @@ class USBDevice_
};
extern USBDevice_ USBDevice;

// Best to use a power of 2 to enable simplified modulo calculations
// (the compiler will automatically use &). USB2 packets can contain
// up to 1kb (isochronous mode), but usually less. The FIFOs seem to
// be 512 bytes on the Due/SAM.
#define CDC_SERIAL_BUFFER_SIZE 512

// This could go into a separate header and file, but it's small and
// users may need access to the size definition. Note that this is
// (confusingly) distinct from the general RingBuffer declared in
// RingBuffer.h. The implementation has been changed. Instead of
// continuously taking the modulus of head and tail, we now have
// ever-increasing longs, whose modulus is taken only to address the
// buffer. This may add a small overhead, but ensures that the tail
// overrunning the head can be detected even with interrupt and DMA
// applications. As a minor side-effect, the buffer can hold one more
// byte, since head==tail (empty) can now be distinguished from
// head==tail+size (full). The use of 64-bit uints ensures that they
// never overflow in the lifetime of the universe; 32-bit uints might
// do so in a matter of hours or days at top rates.
struct ring_buffer
{
uint8_t buffer[CDC_SERIAL_BUFFER_SIZE];
volatile uint64_t head;
volatile uint64_t tail;
};

//================================================================================
//================================================================================
// Serial over CDC (Serial1 is the physical port)

class Serial_ : public Stream
{
private:
RingBuffer *_cdc_rx_buffer;
public:
// The ring buffer implementation is public to allow user DMA access.
ring_buffer cdc_rx_buffer = { { 0 }, 0, 0};
// Standard arduino only schedules interrupts between "loop"
// iterations, so this is the default and the user will be
// responsible for scheduling reception of data via "accept"
// directly or indirectly by calling one of the "read"
// functions, which call "accept" (the read functions are not
// used during DMA applications). Even when interrupts are
// enabled, if the receive buffer doesn't have space when the
// interrupt service routine is called, it can be necessary to
// call "accept" manually to complete the transfer and prevent
// blocking.
void enableInterrupts();
void disableInterrupts();
void begin(uint32_t baud_count);
void begin(uint32_t baud_count, uint8_t config);
void end(void);
@@ -58,6 +97,7 @@ class Serial_ : public Stream
virtual void accept(void);
virtual int peek(void);
virtual int read(void);
virtual int read(uint8_t *d, size_t t);
virtual void flush(void);
virtual size_t write(uint8_t);
virtual size_t write(const uint8_t *buffer, size_t size);