
Scalability and Parallelism

This guide covers the scalability features of erlang_python, including execution modes, rate limiting, and parallel execution.

Execution Modes

erlang_python automatically detects the optimal execution mode based on your Python version:

%% Check current execution mode
py:execution_mode().
%% => free_threaded | subinterp | multi_executor

%% Check number of executor threads
py:num_executors().
%% => 4 (default)

Mode Comparison

Mode            Python Version       Parallelism     GIL Behavior         Best For
free_threaded   3.13+ (nogil build)  True N-way      None                 Maximum throughput
subinterp       3.12+                True N-way      Per-interpreter      CPU-bound, isolation
multi_executor  Any                  GIL contention  Shared, round-robin  I/O-bound, compatibility

Free-Threaded Mode (Python 3.13+)

When running on a free-threaded Python build (compiled with --disable-gil), erlang_python executes Python calls directly without any executor routing. This provides maximum parallelism for CPU-bound workloads.
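As a minimal sketch of what this enables (the fan-out pattern and the choice of math.factorial are illustrative, not part of the library), CPU-bound calls issued from separate Erlang processes can run on separate cores:

%% Illustrative sketch: fan out CPU-bound Python calls from separate
%% Erlang processes; on a free-threaded build they can run in parallel.
Self = self(),
Pids = [spawn(fun() ->
            Self ! {self(), py:call(math, factorial, [20000])}
        end) || _ <- lists:seq(1, 8)],
Results = [receive {Pid, R} -> R end || Pid <- Pids].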

Sub-interpreter Mode (Python 3.12+)

Uses Python's sub-interpreter feature with a per-interpreter GIL (PyInterpreterConfig_OWN_GIL). Each sub-interpreter runs in its own dedicated thread with its own GIL, enabling true parallel execution across interpreters.

Architecture:

  • Thread pool manages N subinterpreters (default: number of schedulers)
  • Each subinterpreter has its own thread, GIL, and Python state
  • Requests are routed to subinterpreters via py_context_router
  • 25-30% faster cast operations compared to worker mode

Note: Each sub-interpreter has isolated state. Use the Shared State API to share data between workers.

Explicit Context Selection:

%% Get a specific context by index (1-based)
Ctx = py:context(1),
{ok, Result} = py:call(Ctx, math, sqrt, [16]).

%% Or use automatic scheduler-affinity routing
{ok, Result} = py:call(math, sqrt, [16]).

Multi-Executor Mode (Python < 3.12)

Runs N executor threads that share the GIL. Requests are distributed round-robin across executors. Good for I/O-bound workloads where Python releases the GIL during I/O operations.
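A quick way to see this behavior (illustrative sketch; uses Python's time.sleep, which releases the GIL):

%% Illustrative: four concurrent 1-second sleeps under multi_executor mode.
%% Because time.sleep releases the GIL, the calls overlap across executors
%% and the batch completes in roughly 1 second rather than 4.
Self = self(),
[spawn(fun() -> Self ! py:call(time, sleep, [1]) end) || _ <- lists:seq(1, 4)],
[receive _ -> ok end || _ <- lists:seq(1, 4)].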

Choosing the Right Mode

Mode Comparison

Aspect                Free-Threaded   Subinterpreter       Multi-Executor
Parallelism           True N-way      True N-way           GIL contention
State Isolation       Shared          Isolated             Shared
Memory Overhead       Low             Higher (per-interp)  Low
Module Compatibility  Limited         Most modules         All modules
Python Version        3.13+ (nogil)   3.12+                Any

When to Use Each Mode

Use Free-Threaded (Python 3.13t) when:

  • You need maximum parallelism with shared state
  • Your libraries are GIL-free compatible
  • You're running CPU-bound workloads
  • Memory efficiency is important

Use Subinterpreters (Python 3.12+) when:

  • You need parallelism with state isolation
  • You want crash isolation between contexts
  • You're running untrusted or unstable code
  • You need predictable per-request state

Use Multi-Executor (Python < 3.12) when:

  • Running on older Python versions
  • Your workload is I/O-bound (GIL released during I/O)
  • You need compatibility with all Python modules
  • Shared state between workers is required

Pros and Cons

Subinterpreter Mode Pros:

  • True parallelism without GIL contention
  • Complete isolation (crashes don't affect other contexts)
  • Each context has clean namespace (no state bleed)
  • 25-30% faster cast operations vs worker mode

Subinterpreter Mode Cons:

  • Higher memory usage (each interpreter loads modules separately)
  • Some C extensions don't support subinterpreters
  • No shared state between contexts (use Shared State API)
  • asyncio event loop integration requires main interpreter

Free-Threaded Mode Pros:

  • True parallelism with shared state
  • Lower memory overhead than subinterpreters
  • Simplest mental model (like regular threading)

Free-Threaded Mode Cons:

  • Requires Python 3.13+ built with --disable-gil
  • Many C extensions not yet compatible
  • Shared state requires careful synchronization
  • Still experimental

Subinterpreter Architecture

Design Overview

┌─────────────────────────────────────────────────────────────────┐
│                     Erlang VM (BEAM)                            │
├─────────────────────────────────────────────────────────────────┤
│  py_context_router                                              │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  Scheduler 1 ──► Context 1 (pid)                        │   │
│  │  Scheduler 2 ──► Context 2 (pid)                        │   │
│  │  Scheduler N ──► Context N (pid)                        │   │
│  └─────────────────────────────────────────────────────────┘   │
│         │              │              │                         │
│         ▼              ▼              ▼                         │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐                   │
│  │ Context  │   │ Context  │   │ Context  │                   │
│  │ Process  │   │ Process  │   │ Process  │                   │
│  │ (gen_srv)│   │ (gen_srv)│   │ (gen_srv)│                   │
│  └────┬─────┘   └────┬─────┘   └────┬─────┘                   │
└───────┼──────────────┼──────────────┼───────────────────────────┘
        │              │              │
        ▼              ▼              ▼
┌─────────────────────────────────────────────────────────────────┐
│                  Subinterpreter Thread Pool                     │
├─────────────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐         │
│  │   Thread 1   │  │   Thread 2   │  │   Thread N   │         │
│  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │         │
│  │ │  Interp  │ │  │ │  Interp  │ │  │ │  Interp  │ │         │
│  │ │  (GIL 1) │ │  │ │  (GIL 2) │ │  │ │  (GIL N) │ │         │
│  │ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │         │
│  └──────────────┘  └──────────────┘  └──────────────┘         │
│                                                                 │
│  Each thread owns its interpreter's GIL (Py_GIL_OWN)           │
│  No GIL contention between threads                              │
└─────────────────────────────────────────────────────────────────┘

Key Components

py_context_router: Routes requests to context processes based on scheduler affinity or explicit binding.

py_context_process: A gen_server that owns a Python context reference and handles call/eval/exec operations.

Subinterpreter Thread Pool (C): Manages N threads, each with its own Python subinterpreter created via Py_NewInterpreterFromConfig() with a per-interpreter GIL (PyInterpreterConfig_OWN_GIL).

Request Flow

  1. Erlang process calls py:call(Module, Func, Args)
  2. py_context_router selects context based on scheduler ID
  3. Request sent to py_context_process gen_server
  4. Gen_server calls NIF which executes on subinterpreter's thread
  5. Result returned through gen_server to caller
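A minimal sketch of the scheduler-affinity selection in step 2 (illustrative only; the real py_context_router internals may differ):

%% Illustrative: map the calling scheduler to a context index.
select_context() ->
    SchedulerId = erlang:system_info(scheduler_id),
    N = py_context_router:num_contexts(),
    Index = ((SchedulerId - 1) rem N) + 1,
    py_context_router:get_context(Index).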

Pool Size

The subinterpreter pool size is configured at two levels:

Level                       Default                          Max
Erlang (py_context_router)  erlang:system_info(schedulers)   configurable
C pool (py_subinterp_pool)  32                               64

On a typical 8-core machine, 8 context processes are started, each with one subinterpreter slot.

Configuration via sys.config:

{erlang_python, [
    {num_contexts, 16}  %% Override scheduler count
]}

Configuration at runtime:

%% Start with custom pool size
py_context_router:start(#{contexts => 16}).

Thread Safety

  • Each subinterpreter has its own GIL (no cross-interpreter contention)
  • NIF calls are serialized per-context via gen_server
  • Erlang message passing provides synchronization
  • C code uses atomics for cross-thread state (thread_running flag)

Rate Limiting

All Python calls pass through an ETS-based counting semaphore that prevents overload:

%% Check semaphore status
py_semaphore:max_concurrent().  %% => 29 (schedulers * 2 + 1)
py_semaphore:current().         %% => 0 (currently running)

%% Dynamically adjust limit
py_semaphore:set_max_concurrent(50).

How It Works

┌─────────────────────────────────────────────────────────────┐
│                      py_semaphore                           │
│                                                             │
│  ┌─────────┐    ┌─────────────────────────────────────┐    │
│  │ Counter │◄───│  ets:update_counter (atomic)        │    │
│  │  [29]   │    │  {write_concurrency, true}          │    │
│  └─────────┘    └─────────────────────────────────────┘    │
│                                                             │
│  acquire(Timeout) ──► increment ──► check ≤ max?           │
│                       │                 │                   │
│                       │             yes │ no                │
│                       │                 │  │                │
│                       │              ok │  └──► backoff     │
│                       │                 │       loop        │
│  release() ──────────►└──── decrement ──┘                   │
└─────────────────────────────────────────────────────────────┘
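The following is a minimal sketch of such an ETS-based counting semaphore (illustrative only; the real py_semaphore internals may differ):

%% Illustrative counting semaphore on ETS. acquire/0 either admits the
%% caller or reports overload; release/0 must be called after the work.
init(Max) ->
    ets:new(py_sem, [named_table, public, {write_concurrency, true}]),
    ets:insert(py_sem, [{count, 0}, {max, Max}]).

acquire() ->
    [{max, Max}] = ets:lookup(py_sem, max),
    case ets:update_counter(py_sem, count, 1) of
        N when N =< Max -> ok;
        N ->
            ets:update_counter(py_sem, count, -1),
            {error, {overloaded, N - 1, Max}}
    end.

release() ->
    ets:update_counter(py_sem, count, -1),
    ok.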

Overload Protection

When the semaphore is exhausted, py:call returns an overload error instead of blocking forever:

{error, {overloaded, Current, Max}} = py:call(module, func, []).

This allows your application to implement backpressure or shed load gracefully.
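For example, a simple retry-with-backoff wrapper around the documented overload error (retry count and delays are illustrative):

%% Illustrative backpressure: retry with exponential backoff on overload.
call_with_backoff(M, F, A) ->
    call_with_backoff(M, F, A, 5, 10).

call_with_backoff(_M, _F, _A, 0, _Delay) ->
    {error, overloaded};
call_with_backoff(M, F, A, Retries, Delay) ->
    case py:call(M, F, A) of
        {error, {overloaded, _Current, _Max}} ->
            timer:sleep(Delay),
            call_with_backoff(M, F, A, Retries - 1, Delay * 2);
        Result ->
            Result
    end.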

Configuration

%% sys.config
[
    {erlang_python, [
        %% Maximum concurrent Python operations (semaphore limit)
        %% Default: erlang:system_info(schedulers) * 2 + 1
        {max_concurrent, 50},

        %% Number of executor threads (multi_executor mode only)
        %% Default: 4
        {num_executors, 8},

        %% Worker pool sizes
        {num_workers, 4},
        {num_async_workers, 2},
        {num_subinterp_workers, 4}
    ]}
].

Parallel Execution with Sub-interpreters

For CPU-bound workloads on Python 3.12+, erlang_python provides true parallelism via sub-interpreters that each own their GIL (PyInterpreterConfig_OWN_GIL).

Check Support

%% Check if subinterpreters are supported (Python 3.12+)
true = py:subinterp_supported().

%% Check current execution mode
subinterp = py:execution_mode().

Using the Context Router

The context router automatically distributes calls across subinterpreters:

%% Start contexts (usually done by application startup)
{ok, _} = py:start_contexts().

%% Calls are automatically routed to subinterpreters
{ok, 4.0} = py:call(math, sqrt, [16]).
{ok, 6} = py:eval(<<"2 + 4">>).
ok = py:exec(<<"x = 42">>).

Explicit Context Selection

For fine-grained control, use explicit context selection:

%% Get a specific context by index (1-based)
Ctx = py:context(1),

%% All operations on this context share state
ok = py:exec(Ctx, <<"my_var = 'hello'">>),
{ok, <<"hello">>} = py:eval(Ctx, <<"my_var">>),
{ok, 4.0} = py:call(Ctx, math, sqrt, [16]).

%% Different context has isolated state
Ctx2 = py:context(2),
{error, _} = py:eval(Ctx2, <<"my_var">>).  %% Not defined in Ctx2

Context Router API

%% Start router with default number of contexts (scheduler count)
{ok, Contexts} = py_context_router:start().

%% Start with custom number of contexts
{ok, Contexts} = py_context_router:start(#{contexts => 8}).

%% Get context for current scheduler (automatic affinity)
Ctx = py_context_router:get_context().

%% Get specific context by index
Ctx = py_context_router:get_context(1).

%% Bind current process to a specific context
ok = py_context_router:bind_context(Ctx).

%% Unbind (return to scheduler-based routing)
ok = py_context_router:unbind_context().

%% Get number of active contexts
N = py_context_router:num_contexts().

%% Stop all contexts
ok = py_context_router:stop().

Parallel Execution

Execute multiple calls in parallel across subinterpreters:

%% Execute multiple calls in parallel
{ok, Results} = py:parallel([
    {math, sqrt, [16]},
    {math, sqrt, [25]},
    {math, sqrt, [36]}
]).
%% => {ok, [{ok, 4.0}, {ok, 5.0}, {ok, 6.0}]}

Each call runs in its own sub-interpreter with its own GIL, enabling true parallelism.
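A rough way to observe the effect (timings are illustrative and depend on core count and execution mode):

%% Illustrative timing comparison: sequential calls vs py:parallel/1.
Work = [{math, factorial, [20000]} || _ <- lists:seq(1, 8)],
{SeqUs, _} = timer:tc(fun() -> [py:call(M, F, A) || {M, F, A} <- Work] end),
{ParUs, _} = timer:tc(fun() -> py:parallel(Work) end),
io:format("sequential: ~p ms, parallel: ~p ms~n",
          [SeqUs div 1000, ParUs div 1000]).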

Testing with Free-Threading

To test with a free-threaded Python build:

1. Install Python 3.13+ with Free-Threading

# macOS/Windows: the python.org 3.13 installer offers an optional
# free-threaded build (installs python3.13t)

# Or build from source
./configure --disable-gil
make && make install

# Or use pyenv
PYTHON_CONFIGURE_OPTS="--disable-gil" pyenv install 3.13.0

2. Verify Free-Threading is Enabled

python3 -c "import sys; print('GIL disabled:', hasattr(sys, '_is_gil_enabled') and not sys._is_gil_enabled())"

3. Rebuild erlang_python

# Clean and rebuild with free-threaded Python
rebar3 clean
PYTHON_CONFIG=/path/to/python3.13-config rebar3 compile

4. Verify Mode

1> application:ensure_all_started(erlang_python).
2> py:execution_mode().
free_threaded

Performance Tuning

For CPU-Bound Workloads

  • Use py:parallel/1 with sub-interpreters (Python 3.12+)
  • Or use free-threaded Python (3.13+)
  • Increase max_concurrent to match available CPU cores

For I/O-Bound Workloads

  • Multi-executor mode works well (GIL released during I/O)
  • Increase num_executors to handle more concurrent I/O
  • Use asyncio integration for async I/O

For Mixed Workloads

  • Balance max_concurrent based on memory constraints
  • Monitor py_semaphore:current() for load metrics
  • Implement application-level backpressure based on overload errors

Monitoring

%% Current load
Load = py_semaphore:current(),
Max = py_semaphore:max_concurrent(),
Utilization = Load / Max * 100,
io:format("Python load: ~.1f%~n", [Utilization]).

%% Execution mode info
Mode = py:execution_mode(),
Executors = py:num_executors(),
io:format("Mode: ~p, Executors: ~p~n", [Mode, Executors]).

%% Memory stats
{ok, Stats} = py:memory_stats(),
io:format("GC stats: ~p~n", [maps:get(gc_stats, Stats)]).

Shared State

Since workers (and sub-interpreters) have isolated namespaces, erlang_python provides ETS-backed shared state accessible from both Python and Erlang:

from erlang import state_set, state_get, state_incr, state_decr

# Share configuration across workers
config = state_get('app_config')

# Thread-safe metrics
state_incr('requests_total')
state_incr('bytes_processed', len(data))
The same state from the Erlang side:

%% Set config that all workers can read
py:state_store(<<"app_config">>, #{model => <<"gpt-4">>, timeout => 30000}).

%% Read metrics
{ok, Total} = py:state_fetch(<<"requests_total">>).

The state is backed by ETS with {write_concurrency, true}, making atomic counter operations fast and lock-free. See Getting Started for the full API.

Reentrant Callbacks

erlang_python supports reentrant callbacks where Python code calls Erlang functions that themselves call back into Python. This is handled without deadlocking through a suspension/resume mechanism:

%% Register an Erlang function that calls Python
py:register_function(compute_via_python, fun([X]) ->
    {ok, Result} = py:call('__main__', complex_compute, [X]),
    Result * 2  %% Erlang post-processing
end).

%% Python code that uses the callback
py:exec(<<"
def process(x):
    from erlang import call
    # Calls Erlang, which calls Python's complex_compute
    result = call('compute_via_python', x)
    return result + 1
">>).

How Reentrant Callbacks Work

┌─────────────────────────────────────────────────────────────────┐
│                     Reentrant Callback Flow                      │
│                                                                 │
│  1. Python calls erlang.call('func', args)                      │
│     └──► Returns suspension marker, frees dirty scheduler       │
│                                                                 │
│  2. Erlang executes the registered callback                     │
│     └──► May call py:call() to run Python (on different worker) │
│                                                                 │
│  3. Erlang calls resume_callback with result                    │
│     └──► Schedules dirty NIF to return result to Python         │
│                                                                 │
│  4. Python continues with the callback result                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Benefits

  • No Deadlocks: Dirty schedulers are freed during callback execution
  • Nested Callbacks: Multiple levels of Python→Erlang→Python→... are supported
  • Transparent: From Python's perspective, erlang.call() appears synchronous
  • No Configuration: Works automatically with all execution modes

Performance Considerations

  • Reentrant callbacks have slightly higher overhead due to suspension/resume
  • For tight loops, consider batching operations to reduce callback overhead
  • Concurrent reentrant calls are fully supported and scale well
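A sketch of the batching idea from the list above (double_all and process_batch are hypothetical names, not part of the library):

%% Illustrative: one callback round-trip for a whole batch amortizes the
%% per-call suspend/resume overhead of reentrant calls.
py:register_function(double_all, fun([Xs]) ->
    [X * 2 || X <- Xs]
end).

py:exec(<<"
def process_batch(xs):
    from erlang import call
    # One Erlang round-trip for the whole list, not one per element
    return call('double_all', xs)
">>).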

Example: Nested Callbacks

%% Each level alternates between Erlang and Python
py:register_function(level, fun([N, Max]) ->
    case N >= Max of
        true -> N;
        false ->
            {ok, Result} = py:call('__main__', next_level, [N + 1, Max]),
            Result
    end
end).

py:exec(<<"
def next_level(n, max):
    from erlang import call
    return call('level', n, max)

def start(max):
    from erlang import call
    return call('level', 1, max)
">>).

%% Test 10 levels of nesting
{ok, 10} = py:call('__main__', start, [10]).

Example

See examples/reentrant_demo.erl and examples/reentrant_demo.py for a complete demonstration including:

  • Basic reentrant calls with arithmetic expressions
  • Fibonacci with Erlang memoization
  • Deeply nested callbacks (10+ levels)
  • OOP-style class method callbacks

# Run the demo
rebar3 shell
1> reentrant_demo:start().
2> reentrant_demo:demo_all().

Building for Performance

Standard Build

rebar3 compile

Uses -O2 optimization and standard compiler flags.

Performance Build

For production deployments where maximum performance is needed:

# Clean and rebuild with aggressive optimizations
rm -rf _build/cmake
mkdir -p _build/cmake && cd _build/cmake
cmake ../../c_src -DPERF_BUILD=ON
cmake --build . -j$(nproc)

The PERF_BUILD option enables:

Flag            Effect
-O3             Aggressive optimization level
-flto           Link-Time Optimization
-march=native   CPU-specific instruction set
-ffast-math     Relaxed floating-point math
-funroll-loops  Loop unrolling

Caveats:

  • Binaries are not portable (tied to build machine's CPU)
  • Build time increases due to LTO
  • -ffast-math may affect floating-point precision

Verifying the Build

%% Check that the NIF loaded successfully
1> application:ensure_all_started(erlang_python).
{ok, [erlang_python]}

%% Run basic verification
2> py:eval(<<"1 + 1">>).
{ok, 2}

See Also