Pathfinder fails on GPU for moderate-sized models #188

Open
@fonnesbeck


The Pathfinder implementation fails for any model of moderate size when run on a GPU, quickly running out of GPU memory:

XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 4464000000 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
             parameter allocation:    4.16GiB
              constant allocation:         0B
        maybe_live_out allocation:    4.16GiB
     preallocated temp allocation:         0B
                 total allocation:    8.31GiB
              total fragmentation:         0B (0.00%)
Peak buffers:
	Buffer 1:
		Size: 4.16GiB
		Entry Parameter Subshape: f64[31,200,300,300]
		==========================

	Buffer 2:
		Size: 4.16GiB
		Operator: op_name="jit(fn)/jit(main)/mul" source_file="/usr/local/lib/python3.10/dist-packages/pytensor/link/jax/dispatch/scalar.py" source_line=103
		XLA Label: fusion
		Shape: f64[31,200,300,300]
		==========================

	Buffer 3:
		Size: 8B
		Entry Parameter Subshape: s64[]
		==========================
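
As a sanity check on the numbers in the log, the failed allocation matches the byte size of a single dense f64 array with the reported shape. The sketch below uses only the shape and dtype printed above. The helper name `buffer_bytes` is made up for illustration, and the float32 figure is only an assumption about what the same shape would cost at half the item size, not something tested against Pathfinder itself.

```python
from math import prod


def buffer_bytes(shape, itemsize=8):
    """Bytes needed for one dense array of `shape` (itemsize=8 for f64)."""
    return prod(shape) * itemsize


# Shape taken from "Entry Parameter Subshape: f64[31,200,300,300]" in the log.
shape = (31, 200, 300, 300)

f64_bytes = buffer_bytes(shape)
f64_gib = f64_bytes / 2**30

print(f64_bytes)              # 4464000000, the size XLA failed to allocate
print(round(f64_gib, 2))      # 4.16, matches "Size: 4.16GiB"
print(round(2 * f64_gib, 2))  # 8.31, input buffer plus output buffer

# Hypothetical: the same shape stored as float32 would need half the bytes.
f32_bytes = buffer_bytes(shape, itemsize=4)
print(f32_bytes)
```

Since Buffer 1 and Buffer 2 are each one array of this size, both have to be resident at once, which accounts for the 8.31GiB total in the stats.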

I've tried setting some appropriate XLA environment variables as follows:

import os

# Set before JAX initializes its GPU backend (safest: before importing jax)
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = ".10"
os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"

But this has no effect on the behavior. For a reproducible example, try running the classification model in the latent GP example notebook:

with pm.Model() as model:
    ell = pm.InverseGamma("ell", mu=1.0, sigma=0.5)
    eta = pm.Exponential("eta", lam=1.0)
    cov = eta**2 * pm.gp.cov.ExpQuad(1, ell)

    gp = pm.gp.Latent(cov_func=cov)
    f = gp.prior("f", X=x[:, None])

    # logit link and Bernoulli likelihood
    p = pm.Deterministic("p", pm.math.invlogit(f))
    y_ = pm.Bernoulli("y", p=p, observed=y)

    # idata = pm.sample(1000, chains=2, cores=2, nuts_sampler="numpyro")
    idata = pmx.fit(method='pathfinder')

I've yet to run Pathfinder successfully on anything but toy models. Given that VI is primarily used as an approximation for large models that cannot be sampled quickly, this severely limits Pathfinder's usefulness.

Labels: bug, major
