[Bug]: response.body() returns mojibake (double-encoded UTF-8) for SSE streaming responses #3023

@RFC2109

Version

1.54.0

Steps to reproduce

Dependencies

pip install flask playwright
playwright install chromium

server.py

from flask import Flask, Response
import time

app = Flask(__name__)

@app.route("/")
def index():
    return """
<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>SSE Test</title></head>
<body>
    <h1>SSE Test Page</h1>
    <button id="btn">Start SSE</button>
    <script>
        document.getElementById('btn').addEventListener('click', function() {
            const evtSource = new EventSource('/sse');
            evtSource.onmessage = function(event) { console.log(event.data); };
            evtSource.onerror = function() { evtSource.close(); };
        });
    </script>
</body>
</html>
"""

@app.route("/sse")
def sse():
    def generate():
        messages = ["你好,这是第一条消息", "测试中文:😀🎉"]
        for msg in messages:
            yield f"data: {msg}\n\n".encode('utf-8')
            time.sleep(0.3)
    return Response(generate(), headers={
        "Content-Type": "text/event-stream; charset=utf-8",
        "Cache-Control": "no-cache",
    })

if __name__ == "__main__":
    app.run(port=5000)

client.py

from playwright.sync_api import sync_playwright

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()

        # Method 1: route.fetch() - WORKS CORRECTLY
        def handle_route(route):
            response = route.fetch()
            body = response.body()
            print("\n[route.fetch()] - CORRECT")
            print(f"  Raw bytes: {body!r}")
            print(f"  Decoded:   {body.decode('utf-8')!r}")
            route.fulfill(response=response)

        page.route("**/sse", handle_route)

        # Method 2: response event - BUG
        def on_response(response):
            if "/sse" in response.url:
                body = response.body()
                print("\n[response.body()] - BUG")
                print(f"  Raw bytes: {body!r}")

        page.on("response", on_response)

        page.goto("http://localhost:5000")
        page.click("#btn")
        page.wait_for_timeout(3000)
        browser.close()

if __name__ == "__main__":
    main()

Run

  1. Start the server: python server.py
  2. Run the client: python client.py

Expected behavior

response.body() should return the raw UTF-8 bytes as sent by the server:

[route.fetch()] - CORRECT
  Raw bytes: b'data: \xe4\xbd\xa0\xe5\xa5\xbd...'
  Decoded:   'data: 你好,这是第一条消息\n\ndata: 测试中文:😀🎉\n\n'

[response.body()] - CORRECT
  Raw bytes: b'data: \xe4\xbd\xa0\xe5\xa5\xbd...'

Actual behavior

response.body() returns double-encoded (mojibake) bytes:

[route.fetch()] - CORRECT
  Raw bytes: b'data: \xe4\xbd\xa0\xe5\xa5\xbd...'
  Decoded:   'data: 你好,这是第一条消息\n\ndata: 测试中文:😀🎉\n\n'

[response.body()] - BUG
  Raw bytes: b'data: \xc3\xa4\xc2\xbd\xc2\xa0\xc3\xa5\xc2\xa5\xc2\xbd...'

This is the classic mojibake pattern: the original UTF-8 bytes are decoded as Latin-1, and the resulting string is then re-encoded as UTF-8.

The double-encoding can be verified:

correct = "你好".encode('utf-8')  # b'\xe4\xbd\xa0\xe5\xa5\xbd'
mojibake = correct.decode('latin-1').encode('utf-8')  # b'\xc3\xa4\xc2\xbd\xc2\xa0\xc3\xa5\xc2\xa5\xc2\xbd'

The response.body() output matches the mojibake pattern exactly.
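The same check can be run in the other direction to confirm that a captured body fits this pattern. The sketch below is illustrative only: `undo_latin1_double_encoding` is a hypothetical helper, not part of Playwright, and it assumes the intermediate decode was strictly Latin-1 (not cp1252). It returns the input unchanged when the bytes do not fit the pattern. Some ambiguous inputs can be misclassified, so treat it as a diagnostic aid rather than a fix.

```python
def undo_latin1_double_encoding(body: bytes) -> bytes:
    """Reverse UTF-8 -> Latin-1 decode -> UTF-8 encode, if the body fits that pattern.

    Hypothetical helper for diagnosis; assumes the wrong decode step was Latin-1.
    """
    try:
        # Step back: mojibake bytes -> mojibake string -> one byte per character.
        candidate = body.decode("utf-8").encode("latin-1")
        # Only accept the result if it is itself valid UTF-8.
        candidate.decode("utf-8")
    except (UnicodeDecodeError, UnicodeEncodeError):
        # Not double-encoded (or not recoverable this way): leave untouched.
        return body
    return candidate


if __name__ == "__main__":
    original = "data: 你好\n\n".encode("utf-8")
    mojibake = original.decode("latin-1").encode("utf-8")
    print(undo_latin1_double_encoding(mojibake) == original)
```

Correctly encoded bodies containing Chinese text or emoji pass through unchanged, because their characters cannot be encoded as Latin-1.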

Additional context

  • The browser DevTools Network tab shows the correct response
  • curl also returns the correct bytes
  • Only response.body() and CDP Network.getResponseBody have this issue
  • Tested with both Python and JavaScript bindings - same bug occurs

Root cause analysis: Likely a CDP (Chrome DevTools Protocol) issue

I tested calling CDP Network.getResponseBody directly:

| Method | Returns | Result |
|---|---|---|
| route.fetch() | bytes | ✅ Correct \xe4\xbd\xa0 |
| CDP Network.getResponseBody | str | ❌ Mojibake (already decoded incorrectly) |
| response.body() | bytes | ❌ Mojibake (derived from CDP) |

Key finding: CDP Network.getResponseBody returns the body as a string (with base64Encoded: False), not bytes, and that string is already mojibake. The incorrect decoding therefore happens at the CDP layer, not in Playwright.

CDP test code (test_cdp.py):

from playwright.sync_api import sync_playwright

def test_cdp():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        
        client = page.context.new_cdp_session(page)
        client.send("Network.enable")
        
        responses = {}
        
        def on_response_received(params):
            if "/sse" in params.get("response", {}).get("url", ""):
                responses[params["requestId"]] = params["response"]["url"]
        
        def on_loading_finished(params):
            if params["requestId"] in responses:
                result = client.send("Network.getResponseBody", {"requestId": params["requestId"]})
                print("CDP Network.getResponseBody:")
                print(f"  base64Encoded: {result.get('base64Encoded')}")
                print(f"  body type: {type(result.get('body'))}")  # <class 'str'> !!!
                print(f"  body: {result.get('body')[:50]!r}...")  # Already mojibake
        
        client.on("Network.responseReceived", on_response_received)
        client.on("Network.loadingFinished", on_loading_finished)
        
        page.goto("http://localhost:5000")
        page.click("#btn")
        page.wait_for_timeout(3000)
        browser.close()

test_cdp()

CDP test output:

CDP Network.getResponseBody:
  base64Encoded: False
  body type: <class 'str'>
  body: 'data: ä½\xa0好,这是第一æ\x9d¡æ¶ˆæ\x81¯...'  # Already mojibake!
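To illustrate why a string result from CDP leads to mojibake bytes, here is a minimal simulation. It assumes that a client turns the `Network.getResponseBody` result into bytes by base64-decoding when `base64Encoded` is true and by UTF-8-encoding the string otherwise. `cdp_body_to_bytes` is a hypothetical helper written for this sketch, not Playwright's actual code, and the two result dicts are simulated rather than captured from a browser.

```python
import base64


def cdp_body_to_bytes(result: dict) -> bytes:
    """Convert a Network.getResponseBody-style result into bytes.

    Assumed conversion: base64 when flagged, otherwise UTF-8-encode the string.
    """
    body = result["body"]
    if result.get("base64Encoded"):
        return base64.b64decode(body)
    return body.encode("utf-8")


raw = "data: 你好\n\n".encode("utf-8")  # bytes as sent by the server

# Simulated faulty result: the raw bytes were turned into a string
# one byte per character (Latin-1), as the output above suggests.
faulty = {"body": raw.decode("latin-1"), "base64Encoded": False}

# Simulated lossless result: the raw bytes travel as base64.
lossless = {"body": base64.b64encode(raw).decode("ascii"), "base64Encoded": True}

print(cdp_body_to_bytes(faulty))    # double-encoded bytes
print(cdp_body_to_bytes(lossless))  # original bytes
```

Under these assumptions, once the string is wrong no client-side conversion can return the original bytes without knowing that the Latin-1 step occurred.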

JavaScript test (client.js):

const { chromium } = require('playwright');
(async () => {
    const browser = await chromium.launch({ headless: false });
    const page = await browser.newPage();
    
    await page.route('**/sse', async route => {
        const res = await route.fetch();
        console.log('[route.fetch()]', (await res.body()));
        await route.fulfill({ response: res });
    });
    
    page.on('response', async res => {
        if (res.url().includes('/sse'))
            console.log('[response.body()]', (await res.body()));
    });
    
    await page.goto('http://localhost:5000');
    await page.click('#btn');
    await page.waitForTimeout(3000);
    await browser.close();
})();

JavaScript output:

[route.fetch()] <Buffer 64 61 74 61 3a 20 e4 bd a0 e5 a5 bd ...>  ✅ Correct
[response.body()] <Buffer 64 61 74 61 3a 20 c3 a4 c2 bd c2 a0 ...>  ❌ Mojibake

Why this is a blocking issue (no viable workaround)

While route.fetch() returns correct bytes, it cannot be used as a workaround for real-world SSE streams:

  1. SSE streams can last for minutes (e.g., LLM streaming responses, real-time data feeds)
  2. route.fetch() blocks until the entire response is complete
  3. route.fulfill() can only be called after route.fetch() returns
  4. This means the browser receives no data until the stream ends (minutes later)

My use case: I'm testing an AI chat application where SSE responses stream for 2-5 minutes. I need to capture the response content for automated testing, but:

  • response.body() gives mojibake
  • route.fetch() blocks for minutes, making the test useless

The only remaining option is to use page.expose_function() and capture data via JavaScript in the browser, which is a hacky workaround that shouldn't be necessary.

Environment

- Operating System: Windows 10 Pro (10.0.19045)
- CPU: Intel Core i5-10500 @ 3.10GHz
- Browser: Chrome 143.0.7499.170
- Python Version: 3.10.16
- Node.js Version: 18.13.0
- Other info: Tested with both Python and JavaScript bindings
