
Scalability and Parallelism

This guide covers the scalability features of erlang_python, including execution modes, rate limiting, and parallel execution.

Execution Modes

erlang_python automatically detects the optimal execution mode based on your Python version:

%% Check current execution mode
py:execution_mode().
%% => free_threaded | subinterp | multi_executor

%% Check number of executor threads
py:num_executors().
%% => 4 (default)

Mode Comparison

Mode            Python Version       Parallelism     GIL Behavior         Best For
free_threaded   3.13+ (nogil build)  True N-way      None                 Maximum throughput
subinterp       3.12+                True N-way      Per-interpreter      CPU-bound, isolation
multi_executor  Any                  GIL contention  Shared, round-robin  I/O-bound, compatibility

Free-Threaded Mode (Python 3.13+)

When running on a free-threaded Python build (compiled with --disable-gil), erlang_python executes Python calls directly without any executor routing. This provides maximum parallelism for CPU-bound workloads.
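As a minimal sketch of what this enables (the fan-out pattern and the choice of math.factorial are illustrative, not part of the library), CPU-bound calls issued from separate Erlang processes can run on separate cores:

%% Illustrative sketch: fan out CPU-bound Python calls from separate
%% Erlang processes; on a free-threaded build they can run in parallel.
Self = self(),
Pids = [spawn(fun() ->
            Self ! {self(), py:call(math, factorial, [20000])}
        end) || _ <- lists:seq(1, 8)],
Results = [receive {Pid, R} -> R end || Pid <- Pids].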

Sub-interpreter Mode (Python 3.12+)

Uses Python's sub-interpreter feature with a per-interpreter GIL (PyInterpreterConfig_OWN_GIL). Each sub-interpreter runs in its own dedicated thread with its own GIL, enabling true parallel execution across interpreters.

Architecture:

  • Thread pool manages N subinterpreters (default: number of schedulers)
  • Each subinterpreter has its own thread, GIL, and Python state
  • Requests are routed to subinterpreters via py_context_router
  • 25-30% faster cast operations compared to worker mode

Note: Each sub-interpreter has isolated state. Use the Shared State API to share data between workers.

Explicit Context Selection:

%% Get a specific context by index (1-based)
Ctx = py:context(1),
{ok, Result} = py:call(Ctx, math, sqrt, [16]).

%% Or use automatic scheduler-affinity routing
{ok, Result} = py:call(math, sqrt, [16]).

Multi-Executor Mode (Python < 3.12)

Runs N executor threads that share the GIL. Requests are distributed round-robin across executors. Good for I/O-bound workloads where Python releases the GIL during I/O operations.
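A quick way to see this behavior (illustrative sketch; uses Python's time.sleep, which releases the GIL):

%% Illustrative: four concurrent 1-second sleeps under multi_executor mode.
%% Because time.sleep releases the GIL, the calls overlap across executors
%% and the batch completes in roughly 1 second rather than 4.
Self = self(),
[spawn(fun() -> Self ! py:call(time, sleep, [1]) end) || _ <- lists:seq(1, 4)],
[receive _ -> ok end || _ <- lists:seq(1, 4)].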

Choosing the Right Mode

Mode Comparison

Aspect                Free-Threaded   Subinterpreter       Multi-Executor
Parallelism           True N-way      True N-way           GIL contention
State Isolation       Shared          Isolated             Shared
Memory Overhead       Low             Higher (per-interp)  Low
Module Compatibility  Limited         Most modules         All modules
Python Version        3.13+ (nogil)   3.12+                Any

When to Use Each Mode

Use Free-Threaded (Python 3.13t) when:

  • You need maximum parallelism with shared state
  • Your libraries are GIL-free compatible
  • You're running CPU-bound workloads
  • Memory efficiency is important

Use Subinterpreters (Python 3.12+) when:

  • You need parallelism with state isolation
  • You want crash isolation between contexts
  • You're running untrusted or unstable code
  • You need predictable per-request state

Use Multi-Executor (Python < 3.12) when:

  • Running on older Python versions
  • Your workload is I/O-bound (GIL released during I/O)
  • You need compatibility with all Python modules
  • Shared state between workers is required

Pros and Cons

Subinterpreter Mode Pros:

  • True parallelism without GIL contention
  • Complete isolation (crashes don't affect other contexts)
  • Each context has clean namespace (no state bleed)
  • 25-30% faster cast operations vs worker mode

Subinterpreter Mode Cons:

  • Higher memory usage (each interpreter loads modules separately)
  • Some C extensions don't support subinterpreters
  • No shared state between contexts (use Shared State API)
  • asyncio event loop integration requires main interpreter

Free-Threaded Mode Pros:

  • True parallelism with shared state
  • Lower memory overhead than subinterpreters
  • Simplest mental model (like regular threading)

Free-Threaded Mode Cons:

  • Requires Python 3.13+ built with --disable-gil
  • Many C extensions not yet compatible
  • Shared state requires careful synchronization
  • Still experimental

Subinterpreter Architecture

Design Overview

┌─────────────────────────────────────────────────────────────────┐
│                     Erlang VM (BEAM)                            │
├─────────────────────────────────────────────────────────────────┤
│  py_context_router                                              │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  Scheduler 1 ──► Context 1 (pid)                        │   │
│  │  Scheduler 2 ──► Context 2 (pid)                        │   │
│  │  Scheduler N ──► Context N (pid)                        │   │
│  └─────────────────────────────────────────────────────────┘   │
│         │              │              │                         │
│         ▼              ▼              ▼                         │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐                   │
│  │ Context  │   │ Context  │   │ Context  │                   │
│  │ Process  │   │ Process  │   │ Process  │                   │
│  │ (gen_srv)│   │ (gen_srv)│   │ (gen_srv)│                   │
│  └────┬─────┘   └────┬─────┘   └────┬─────┘                   │
└───────┼──────────────┼──────────────┼───────────────────────────┘
        │              │              │
        ▼              ▼              ▼
┌─────────────────────────────────────────────────────────────────┐
│                  Subinterpreter Thread Pool                     │
├─────────────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐         │
│  │   Thread 1   │  │   Thread 2   │  │   Thread N   │         │
│  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │         │
│  │ │  Interp  │ │  │ │  Interp  │ │  │ │  Interp  │ │         │
│  │ │  (GIL 1) │ │  │ │  (GIL 2) │ │  │ │  (GIL N) │ │         │
│  │ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │         │
│  └──────────────┘  └──────────────┘  └──────────────┘         │
│                                                                 │
│  Each thread owns its interpreter's GIL (Py_GIL_OWN)           │
│  No GIL contention between threads                              │
└─────────────────────────────────────────────────────────────────┘

Key Components

py_context_router: Routes requests to context processes based on scheduler affinity or explicit binding.

py_context_process: A gen_server that owns a Python context reference and handles call/eval/exec operations.

Subinterpreter Thread Pool (C): Manages N threads, each with its own Python subinterpreter created via Py_NewInterpreterFromConfig() with a per-interpreter GIL (PyInterpreterConfig_OWN_GIL).

Request Flow

  1. Erlang process calls py:call(Module, Func, Args)
  2. py_context_router selects context based on scheduler ID
  3. Request sent to py_context_process gen_server
  4. Gen_server calls NIF which executes on subinterpreter's thread
  5. Result returned through gen_server to caller
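A minimal sketch of the scheduler-affinity selection in step 2 (illustrative only; the real py_context_router internals may differ):

%% Illustrative: map the calling scheduler to a context index.
select_context() ->
    SchedulerId = erlang:system_info(scheduler_id),
    N = py_context_router:num_contexts(),
    Index = ((SchedulerId - 1) rem N) + 1,
    py_context_router:get_context(Index).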

Pool Size

The subinterpreter pool size is configured at two levels:

Level                       Default                          Max
Erlang (py_context_router)  erlang:system_info(schedulers)   configurable
C pool (py_subinterp_pool)  32                               64

On a typical 8-core machine, 8 context processes are started, each with one subinterpreter slot.

Configuration via sys.config:

{erlang_python, [
    {num_contexts, 16}  %% Override scheduler count
]}

Configuration at runtime:

%% Start with custom pool size
py_context_router:start(#{contexts => 16}).

Thread Safety

  • Each subinterpreter has its own GIL (no cross-interpreter contention)
  • NIF calls are serialized per-context via gen_server
  • Erlang message passing provides synchronization
  • C code uses atomics for cross-thread state (thread_running flag)

Rate Limiting

All Python calls pass through an ETS-based counting semaphore that prevents overload:

%% Check semaphore status
py_semaphore:max_concurrent().  %% => 29 (schedulers * 2 + 1)
py_semaphore:current().         %% => 0 (currently running)

%% Dynamically adjust limit
py_semaphore:set_max_concurrent(50).

How It Works

┌─────────────────────────────────────────────────────────────┐
│                      py_semaphore                           │
│                                                             │
│  ┌─────────┐    ┌─────────────────────────────────────┐    │
│  │ Counter │◄───│  ets:update_counter (atomic)        │    │
│  │  [29]   │    │  {write_concurrency, true}          │    │
│  └─────────┘    └─────────────────────────────────────┘    │
│                                                             │
│  acquire(Timeout) ──► increment ──► check ≤ max?           │
│                       │                 │                   │
│                       │             yes │ no                │
│                       │                 │  │                │
│                       │              ok │  └──► backoff     │
│                       │                 │       loop        │
│  release() ──────────►└──── decrement ──┘                   │
└─────────────────────────────────────────────────────────────┘
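The following is a minimal sketch of such an ETS-based counting semaphore (illustrative only; the real py_semaphore internals may differ):

%% Illustrative counting semaphore on ETS. acquire/0 either admits the
%% caller or reports overload; release/0 must be called after the work.
init(Max) ->
    ets:new(py_sem, [named_table, public, {write_concurrency, true}]),
    ets:insert(py_sem, [{count, 0}, {max, Max}]).

acquire() ->
    [{max, Max}] = ets:lookup(py_sem, max),
    case ets:update_counter(py_sem, count, 1) of
        N when N =< Max -> ok;
        N ->
            ets:update_counter(py_sem, count, -1),
            {error, {overloaded, N - 1, Max}}
    end.

release() ->
    ets:update_counter(py_sem, count, -1),
    ok.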

Overload Protection

When the semaphore is exhausted, py:call returns an overload error instead of blocking forever:

{error, {overloaded, Current, Max}} = py:call(module, func, []).

This allows your application to implement backpressure or shed load gracefully.
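For example, a simple retry-with-backoff wrapper around the documented overload error (retry count and delays are illustrative):

%% Illustrative backpressure: retry with exponential backoff on overload.
call_with_backoff(M, F, A) ->
    call_with_backoff(M, F, A, 5, 10).

call_with_backoff(_M, _F, _A, 0, _Delay) ->
    {error, overloaded};
call_with_backoff(M, F, A, Retries, Delay) ->
    case py:call(M, F, A) of
        {error, {overloaded, _Current, _Max}} ->
            timer:sleep(Delay),
            call_with_backoff(M, F, A, Retries - 1, Delay * 2);
        Result ->
            Result
    end.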

Configuration

%% sys.config
[
    {erlang_python, [
        %% Maximum concurrent Python operations (semaphore limit)
        %% Default: erlang:system_info(schedulers) * 2 + 1
        {max_concurrent, 50},

        %% Number of executor threads (multi_executor mode only)
        %% Default: 4
        {num_executors, 8},

        %% Worker pool sizes
        {num_workers, 4},
        {num_async_workers, 2},
        {num_subinterp_workers, 4}
    ]}
].

Parallel Execution with Sub-interpreters

For CPU-bound workloads on Python 3.12+, erlang_python provides true parallelism via sub-interpreters that each own their GIL (PyInterpreterConfig_OWN_GIL).

Check Support

%% Check if subinterpreters are supported (Python 3.12+)
true = py:subinterp_supported().

%% Check current execution mode
subinterp = py:execution_mode().

Using the Context Router

The context router automatically distributes calls across subinterpreters:

%% Start contexts (usually done by application startup)
{ok, _} = py:start_contexts().

%% Calls are automatically routed to subinterpreters
{ok, 4.0} = py:call(math, sqrt, [16]).
{ok, 6} = py:eval(<<"2 + 4">>).
ok = py:exec(<<"x = 42">>).

Explicit Context Selection

For fine-grained control, use explicit context selection:

%% Get a specific context by index (1-based)
Ctx = py:context(1),

%% All operations on this context share state
ok = py:exec(Ctx, <<"my_var = 'hello'">>),
{ok, <<"hello">>} = py:eval(Ctx, <<"my_var">>),
{ok, 4.0} = py:call(Ctx, math, sqrt, [16]).

%% Different context has isolated state
Ctx2 = py:context(2),
{error, _} = py:eval(Ctx2, <<"my_var">>).  %% Not defined in Ctx2

Context Router API

%% Start router with default number of contexts (scheduler count)
{ok, Contexts} = py_context_router:start().

%% Start with custom number of contexts
{ok, Contexts} = py_context_router:start(#{contexts => 8}).

%% Get context for current scheduler (automatic affinity)
Ctx = py_context_router:get_context().

%% Get specific context by index
Ctx = py_context_router:get_context(1).

%% Bind current process to a specific context
ok = py_context_router:bind_context(Ctx).

%% Unbind (return to scheduler-based routing)
ok = py_context_router:unbind_context().

%% Get number of active contexts
N = py_context_router:num_contexts().

%% Stop all contexts
ok = py_context_router:stop().

Parallel Execution

Execute multiple calls in parallel across subinterpreters:

%% Execute multiple calls in parallel
{ok, Results} = py:parallel([
    {math, sqrt, [16]},
    {math, sqrt, [25]},
    {math, sqrt, [36]}
]).
%% => {ok, [{ok, 4.0}, {ok, 5.0}, {ok, 6.0}]}

Each call runs in its own sub-interpreter with its own GIL, enabling true parallelism.
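A rough way to observe the effect (timings are illustrative and depend on core count and execution mode):

%% Illustrative timing comparison: sequential calls vs py:parallel/1.
Work = [{math, factorial, [20000]} || _ <- lists:seq(1, 8)],
{SeqUs, _} = timer:tc(fun() -> [py:call(M, F, A) || {M, F, A} <- Work] end),
{ParUs, _} = timer:tc(fun() -> py:parallel(Work) end),
io:format("sequential: ~p ms, parallel: ~p ms~n",
          [SeqUs div 1000, ParUs div 1000]).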

Testing with Free-Threading

To test with a free-threaded Python build:

1. Install Python 3.13+ with Free-Threading

# macOS/Windows: the python.org 3.13 installer offers an optional
# free-threaded build (installs python3.13t)

# Or build from source
./configure --disable-gil
make && make install

# Or use pyenv
PYTHON_CONFIGURE_OPTS="--disable-gil" pyenv install 3.13.0

2. Verify Free-Threading is Enabled

python3 -c "import sys; print('GIL disabled:', hasattr(sys, '_is_gil_enabled') and not sys._is_gil_enabled())"

3. Rebuild erlang_python

# Clean and rebuild with free-threaded Python
rebar3 clean
PYTHON_CONFIG=/path/to/python3.13-config rebar3 compile

4. Verify Mode

1> application:ensure_all_started(erlang_python).
2> py:execution_mode().
free_threaded

Performance Tuning

For CPU-Bound Workloads

  • Use py:parallel/1 with sub-interpreters (Python 3.12+)
  • Or use free-threaded Python (3.13+)
  • Increase max_concurrent to match available CPU cores

For I/O-Bound Workloads

  • Multi-executor mode works well (GIL released during I/O)
  • Increase num_executors to handle more concurrent I/O
  • Use asyncio integration for async I/O

For Mixed Workloads

  • Balance max_concurrent based on memory constraints
  • Monitor py_semaphore:current() for load metrics
  • Implement application-level backpressure based on overload errors

Monitoring

%% Current load
Load = py_semaphore:current(),
Max = py_semaphore:max_concurrent(),
Utilization = Load / Max * 100,
io:format("Python load: ~.1f%~n", [Utilization]).

%% Execution mode info
Mode = py:execution_mode(),
Executors = py:num_executors(),
io:format("Mode: ~p, Executors: ~p~n", [Mode, Executors]).

%% Memory stats
{ok, Stats} = py:memory_stats(),
io:format("GC stats: ~p~n", [maps:get(gc_stats, Stats)]).

Shared State

Since workers (and sub-interpreters) have isolated namespaces, erlang_python provides ETS-backed shared state accessible from both Python and Erlang:

from erlang import state_set, state_get, state_incr, state_decr

# Share configuration across workers
config = state_get('app_config')

# Thread-safe metrics
state_incr('requests_total')
state_incr('bytes_processed', len(data))
The same state from the Erlang side:

%% Set config that all workers can read
py:state_store(<<"app_config">>, #{model => <<"gpt-4">>, timeout => 30000}).

%% Read metrics
{ok, Total} = py:state_fetch(<<"requests_total">>).

The state is backed by ETS with {write_concurrency, true}, making atomic counter operations fast and lock-free. See Getting Started for the full API.

Reentrant Callbacks

erlang_python supports reentrant callbacks where Python code calls Erlang functions that themselves call back into Python. This is handled without deadlocking through a suspension/resume mechanism:

%% Register an Erlang function that calls Python
py:register_function(compute_via_python, fun([X]) ->
    {ok, Result} = py:call('__main__', complex_compute, [X]),
    Result * 2  %% Erlang post-processing
end).

%% Python code that uses the callback
py:exec(<<"
def process(x):
    from erlang import call
    # Calls Erlang, which calls Python's complex_compute
    result = call('compute_via_python', x)
    return result + 1
">>).

How Reentrant Callbacks Work

┌─────────────────────────────────────────────────────────────────┐
│                     Reentrant Callback Flow                      │
│                                                                 │
│  1. Python calls erlang.call('func', args)                      │
│     └──► Returns suspension marker, frees dirty scheduler       │
│                                                                 │
│  2. Erlang executes the registered callback                     │
│     └──► May call py:call() to run Python (on different worker) │
│                                                                 │
│  3. Erlang calls resume_callback with result                    │
│     └──► Schedules dirty NIF to return result to Python         │
│                                                                 │
│  4. Python continues with the callback result                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Benefits

  • No Deadlocks: Dirty schedulers are freed during callback execution
  • Nested Callbacks: Multiple levels of Python→Erlang→Python→... are supported
  • Transparent: From Python's perspective, erlang.call() appears synchronous
  • No Configuration: Works automatically with all execution modes

Performance Considerations

  • Reentrant callbacks have slightly higher overhead due to suspension/resume
  • For tight loops, consider batching operations to reduce callback overhead
  • Concurrent reentrant calls are fully supported and scale well
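A sketch of the batching idea from the list above (double_all and process_batch are hypothetical names, not part of the library):

%% Illustrative: one callback round-trip for a whole batch amortizes the
%% per-call suspend/resume overhead of reentrant calls.
py:register_function(double_all, fun([Xs]) ->
    [X * 2 || X <- Xs]
end).

py:exec(<<"
def process_batch(xs):
    from erlang import call
    # One Erlang round-trip for the whole list, not one per element
    return call('double_all', xs)
">>).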

Example: Nested Callbacks

%% Each level alternates between Erlang and Python
py:register_function(level, fun([N, Max]) ->
    case N >= Max of
        true -> N;
        false ->
            {ok, Result} = py:call('__main__', next_level, [N + 1, Max]),
            Result
    end
end).

py:exec(<<"
def next_level(n, max):
    from erlang import call
    return call('level', n, max)

def start(max):
    from erlang import call
    return call('level', 1, max)
">>).

%% Test 10 levels of nesting
{ok, 10} = py:call('__main__', start, [10]).

Example

See examples/reentrant_demo.erl and examples/reentrant_demo.py for a complete demonstration including:

  • Basic reentrant calls with arithmetic expressions
  • Fibonacci with Erlang memoization
  • Deeply nested callbacks (10+ levels)
  • OOP-style class method callbacks

# Run the demo
rebar3 shell
1> reentrant_demo:start().
2> reentrant_demo:demo_all().

Building for Performance

Standard Build

rebar3 compile

Uses -O2 optimization and standard compiler flags.

Performance Build

For production deployments where maximum performance is needed:

# Clean and rebuild with aggressive optimizations
rm -rf _build/cmake
mkdir -p _build/cmake && cd _build/cmake
cmake ../../c_src -DPERF_BUILD=ON
cmake --build . -j$(nproc)

The PERF_BUILD option enables:

Flag            Effect
-O3             Aggressive optimization level
-flto           Link-Time Optimization
-march=native   CPU-specific instruction set
-ffast-math     Relaxed floating-point math
-funroll-loops  Loop unrolling

Caveats:

  • Binaries are not portable (tied to build machine's CPU)
  • Build time increases due to LTO
  • -ffast-math may affect floating-point precision

Verifying the Build

%% Check that the NIF loaded successfully
1> application:ensure_all_started(erlang_python).
{ok, [erlang_python]}

%% Run basic verification
2> py:eval(<<"1 + 1">>).
{ok, 2}

See Also