4 changes: 4 additions & 0 deletions src/data/nav/aitransport.ts
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,10 @@ export default {
name: 'Token streaming limits',
link: '/docs/ai-transport/token-streaming/token-rate-limits',
},
{
name: 'Publish from your server',
link: '/docs/ai-transport/token-streaming/server-publishing',
},
],
},
{
Expand Down
71 changes: 71 additions & 0 deletions src/pages/docs/ai-transport/token-streaming/server-publishing.mdx
@@ -0,0 +1,71 @@
---
title: Publish from your server
meta_description: "Learn how to publish AI response tokens from your server over a Realtime WebSocket connection, covering ordering, channel limits, and concurrent streams."
---

When streaming AI responses with [message per response](/docs/ai-transport/token-streaming/message-per-response) or [message per token](/docs/ai-transport/token-streaming/message-per-token), your server should publish tokens to Ably channels using a Realtime client. Realtime clients maintain persistent WebSocket connections to the Ably service, which provide the low-latency, ordered delivery needed for token streaming.

## Realtime connections <a id="realtime"/>

Use a Realtime client for server-side publishing with `message.append` or `message.create`. A Realtime client maintains a WebSocket connection to the Ably service, which provides low-latency publishing and guarantees that messages published on the same connection are delivered to subscribers in the order they were published. For more information, see [Realtime and REST](/docs/basics#realtime-and-rest).

While `publish()` and `appendMessage()` are available on both [REST](/docs/api/rest-sdk) and Realtime clients, REST does not guarantee [message ordering](/docs/platform/architecture/message-ordering) at high publish rates. Use a Realtime client when publishing at the rates typical of LLM token streaming.

Create a Realtime client on your server:

<Code>
```javascript
const ably = new Ably.Realtime({
key: '{{API_KEY}}',
echoMessages: false
});
```
```python
ably = AblyRealtime(
key='{{API_KEY}}',
echo_messages=False
)
```
```java
ClientOptions options = new ClientOptions();
options.key = "{{API_KEY}}";
options.echoMessages = false;
AblyRealtime ably = new AblyRealtime(options);
```
</Code>

<Aside data-type="note">
Set [`echoMessages`](/docs/api/realtime-sdk/types#client-options) to `false` on server-side clients to prevent the server from receiving its own published messages. This avoids unnecessary message delivery and billing for [echoed messages](/docs/pub-sub/advanced#echo).
</Aside>

## Message ordering <a id="ordering"/>

Ably guarantees that messages published on a single connection are delivered to subscribers in the order they were published. This ordering guarantee is essential for token streaming, because tokens must arrive in sequence for the final message to be correct.

This guarantee is per-connection. If you use [multiple clients](#multiple-streams) to handle higher concurrency, route all operations for a given channel through the same client. This ensures that all `publish()` and `appendMessage()` calls for a given response maintain their order.

For more detail on how Ably preserves message order across its globally distributed infrastructure, see [message ordering](/docs/platform/architecture/message-ordering).

## Transient publishing and channel limits <a id="transient"/>

In a typical AI application, your server publishes responses to many distinct channels, often one per user session. When your server publishes to a channel without attaching first, the SDK uses a [transient publish](/docs/pub-sub/advanced#transient-publish). Transient publishes do not count toward the limit on the [number of channels per connection](/docs/platform/pricing/limits#connection).

<Aside data-type="note">
If your server also needs to subscribe to a channel, it must attach to that channel first. Publishes to an attached channel are not transient, so the channel counts toward the per-connection channel limit.
</Aside>

Because transient publishing removes the channel limit as a constraint, the limiting factor becomes the [per-connection inbound message rate](/docs/platform/pricing/limits#connection). Typically, each channel represents a conversation between a user and an agent, so a single connection publishes to multiple channels simultaneously. The per-connection and per-channel inbound rate limits are the same value, but the connection limit is shared across all channels on that connection, making it the constraint you will reach first. See [per-connection rate limits](#rate-limits) for how this interacts with each token streaming pattern.

## Per-connection rate limits <a id="rate-limits"/>

Each connection has an [inbound message rate limit](/docs/platform/pricing/limits#connection) that caps how many messages per second can be published on that connection. How this limit interacts with your publish rate depends on the token streaming pattern you use:

- With [message per response](/docs/ai-transport/token-streaming/message-per-response), Ably rolls up multiple appends into fewer published messages. The rollup window determines how many concurrent streams a single connection can support. See [token streaming limits](/docs/ai-transport/token-streaming/token-rate-limits#rollup) for rollup configuration and concurrent stream capacity.
- With [message per token](/docs/ai-transport/token-streaming/message-per-token), each token is a separate publish with no rollup. Every publish counts directly against the connection's rate limit. See [token streaming limits](/docs/ai-transport/token-streaming/token-rate-limits#per-token) for strategies to stay within limits.
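The difference in connection budget between the two patterns can be sketched with some rough arithmetic. The token rate and connection limit below are illustrative assumptions for this sketch, not quoted Ably limits:

```javascript
// Rough capacity check for one connection. Both inputs are assumptions
// for illustration; check your package's actual limits.
const connectionInboundLimit = 50; // msg/s per connection (assumed)
const tokenRate = 100;             // model output, tokens/s (assumed)

// Message per token: every token is its own publish.
const perTokenRate = tokenRate;
const perTokenStreams = Math.floor(connectionInboundLimit / perTokenRate);

// Message per response: appends roll up, default 40ms window.
const perResponseRate = Math.min(1000 / 40, tokenRate);
const perResponseStreams = Math.floor(connectionInboundLimit / perResponseRate);

console.log(perTokenStreams, perResponseStreams); // 0 2
```

Under these assumed numbers, a single connection cannot sustain even one per-token stream at full model output rate, while the rolled-up pattern supports two.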

## Multiple concurrent streams <a id="multiple-streams"/>

When your server handles more concurrent AI response streams than a single connection supports, create additional Realtime clients. Each client uses its own connection with its own message rate budget, so throughput scales linearly with the number of clients.

Route all operations for a given channel through the same client to preserve [message ordering](#ordering). Use a hash of the channel name to deterministically select a client, ensuring that all `publish()` and `appendMessage()` calls for a given response go through the same connection.
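One way to implement this routing is to hash the channel name into a pool index. A minimal sketch follows; the pool size and the FNV-1a hash are illustrative choices, not part of the Ably SDK:

```javascript
// Pick a stable client index from the channel name so that every
// publish and append for that channel uses the same connection.
// POOL_SIZE and the hash function are illustrative choices.
const POOL_SIZE = 4;

// FNV-1a hash over the channel name (any stable hash works).
function channelHash(name) {
  let h = 0x811c9dc5;
  for (let i = 0; i < name.length; i++) {
    h ^= name.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

function clientIndexFor(channelName) {
  return channelHash(channelName) % POOL_SIZE;
}

// In practice the pool would hold Realtime clients (assumed setup):
// const pool = Array.from({ length: POOL_SIZE },
//   () => new Ably.Realtime({ key: '...', echoMessages: false }));
// const client = pool[clientIndexFor('chat:session-123')];
```

Because the index depends only on the channel name, every server process that uses the same hash and pool size routes a given channel to the same client.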

32 changes: 21 additions & 11 deletions src/pages/docs/ai-transport/token-streaming/token-rate-limits.mdx
Expand Up @@ -16,26 +16,36 @@ The limits in the second category, however, cannot be increased arbitrarily and

## Message-per-response <a id="per-response"/>

The [message-per-response](/docs/ai-transport/token-streaming/message-per-response) pattern includes automatic rate limit protection. AI Transport prevents a single response stream from reaching the message rate limit for a connection by rolling up multiple appends into a single published message:
The [message-per-response](/docs/ai-transport/token-streaming/message-per-response) pattern includes automatic rate limit protection. AI Transport rolls up multiple appends into a single published message before applying [connection](/docs/platform/pricing/limits#connection) and [channel](/docs/platform/pricing/limits#channel) rate limits. This means the rate limits apply to the rolled-up messages, not to individual `appendMessage()` calls. The rollup works as follows:

1. Your agent streams tokens to the channel at the model's output rate
2. Ably publishes the first token immediately, then automatically rolls up subsequent tokens on receipt
3. Clients receive the same content, delivered in fewer discrete messages
2. Ably publishes the first token immediately, then automatically rolls up subsequent tokens as they are received
3. Clients receive the same content, delivered in fewer discrete messages and as larger contiguous chunks

By default, Ably delivers a single response stream at 25 messages per second or the model output rate, whichever is lower. This means you can publish two simultaneous response streams on the same channel or connection with any [Ably package](/docs/platform/pricing#packages), because each stream uses half of the [connection inbound message rate](/docs/platform/pricing/limits#connection). Ably charges for the number of published messages, not for the number of streamed tokens.
By default, Ably delivers a single response stream at 25 messages per second or the model output rate, whichever is lower. This means you can publish two simultaneous response streams on the same channel or connection with any [Ably package](/docs/platform/pricing#packages), because each stream uses half of the [connection inbound message rate](/docs/platform/pricing/limits#connection). Ably charges for the number of published messages after rollup, not for the number of streamed tokens.

### Configure rollup behaviour <a id="rollup"/>

Ably concatenates all appends for a single response that are received during the rollup window into one published message. You can specify the rollup window for a particular connection by setting the `appendRollupWindow` [transport parameter](/docs/api/realtime-sdk#client-options). This allows you to determine how much of the connection message rate can be consumed by a single response stream and control your consumption costs.

For example, if your server publishes appends at 100 tokens per second with a 40ms rollup window:

| `appendRollupWindow` | Maximum message rate for a single response |
|---|---|
| 0ms | Model output rate |
| 20ms | 50 messages/s |
| 40ms *(default)* | 25 messages/s |
| 100ms | 10 messages/s |
| 500ms *(max)* | 2 messages/s |
- 40ms window = 25 windows per second (1000ms / 40ms)
- Ably produces 25 rolled-up messages per second, each containing approximately 4 tokens
- The 25 msg/s rolled-up rate is what counts against the connection's inbound message rate limit
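This arithmetic generalizes to any window size. A minimal sketch, assuming model output of at least 50 tokens per second and an illustrative 50 msg/s per-connection inbound limit (an assumption for this sketch, not a quoted limit):

```javascript
// Rolled-up message rate and concurrent stream capacity per rollup
// window, assuming model output is fast enough that every window
// produces a message and an assumed 50 msg/s connection inbound limit.
const connectionLimit = 50; // msg/s (assumed)

function deliveryRate(windowMs) {
  return 1000 / windowMs; // one rolled-up message per window
}

function streamsPerConnection(windowMs) {
  return Math.floor(connectionLimit / deliveryRate(windowMs));
}

console.log(deliveryRate(40), streamsPerConnection(40)); // 25 2
```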

The following table shows how different rollup windows affect the rate of messages received by subscribers and the number of concurrent token streams a single connection can support, assuming the model output is 50 tokens per second or greater:

| `appendRollupWindow` | Subscriber delivery rate | Concurrent streams per connection |
|---|---|---|
| 20ms | 50 msg/s | 1 |
| 40ms *(default)* | 25 msg/s | 2 |
| 100ms | 10 msg/s | 5 |
| 500ms *(max)* | 2 msg/s | 25 |

_Concurrent streams per connection is calculated based on the [connection inbound message rate limit](/docs/platform/pricing/limits#connection)._

A longer rollup window allows more concurrent streams per connection, but subscribers receive tokens in larger, less frequent batches. The default 40ms window provides smooth delivery for up to two concurrent streams. Increasing the window to 100ms or 200ms accommodates more streams with a modest reduction in delivery granularity.

The following example code demonstrates establishing a connection to Ably with `appendRollupWindow` set to 100ms:

Expand Down