Skip to content

Conversation

@DerGuteMoritz
Copy link
Collaborator

Manifold has a debug mechanism which instruments deferreds so that a warning is logged when they are in an error state with no error handler attached at garbage collection time. Such errors are referred to as "dropped errors" and the instrumented deferreds are called "leak-aware".

The dropped error logging feature is enabled by default but for performance reasons, only every 1024th deferred is made leak-aware (see manifold.deferred/deferred). However, deferreds constructed via manifold.deferred/error-deferred are always leak-aware.

This patch introduces a new option manifold.debug/*leak-aware-deferred-rate* which allows changing the rate of leak-aware deferreds being instantiated, with default still being 1024. When this option is set to to 1 (= every deferred will be leak-aware), dropped errors can be detected more reliably e.g. during testing.

To that end, the patch also includes a test util for instrumenting all tests in a namespace so that they fail when a dropped error was detected. All test namespaces are now instrumented like this (except for :benchmark and :stress tests), which already uncovered a few latent dropped errors. These will be fixed in separate patches.

This was prompted by investigating the causes of clj-commons/aleph#766. One of these was found to originate in Manifold itself which will be addressed in a follow-up PR.

This PR is equivalent to #243 but the source branch lives in this here repository which allows us to create stacked PRs with the fixes for it.

@DerGuteMoritz
Copy link
Collaborator Author

One more patch coming up to fix the remaining dropped error in manifold.go-off but gotta step away for a while now.

@KingMob
Copy link
Collaborator

KingMob commented Sep 21, 2025

@DerGuteMoritz Sorry to say this, but I no longer remember Manifold well enough to review these, I think. Nor do I really have the time atm. I would suggest soliciting some helpers from the #manifold slack channel.

[f handle-dropped-errors]
(assert (nil? dropped-errors) "with-dropped-error-detection may not be nested")
;; Flush out any pending dropped errors from before
(System/gc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was told before that System/gc is more of a suggestion to the VM than an actual call

Did you observe this working reliably?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, there is no guarantee that System/gc actually triggers a GC, let alone whether it reclaims all garbage. Quoting from the docs

Calling the gc method suggests that the Java Virtual Machine expend effort toward recycling unused objects in order to make the memory they currently occupy available for reuse by the Java Virtual Machine. When control returns from the method call, the Java Virtual Machine has made a best effort to reclaim space from all unused objects. There is no guarantee that this effort will recycle any particular number of unused objects, reclaim any particular amount of space, or complete at any particular time, if at all, before the method returns or ever. There is also no guarantee that this effort will determine the change of reachability in any particular number of objects, or that any particular number of Reference objects will be cleared and enqueued.

However, it does work pretty well in practice, see for example the failing tests for this PR which uses OpenJDK 8. Locally, I've also successfully used it with OpenJDK 21. So yeah, it's not deterministic but still quite useful as evidenced by the issues it uncovered 🙂

@DerGuteMoritz
Copy link
Collaborator Author

#249 is the aforementioned patch for manifold.go-off (it wasn't auto-linked here because it's based on #248).

@KingMob KingMob removed their request for review September 25, 2025 06:19
Manifold has a debug mechanism which instruments deferreds so that a warning is logged when they are
in an error state with no error handler attached at garbage collection time. Such errors are
referred to as "dropped errors" and the instrumented deferreds are called "leak-aware".

The dropped error logging feature is enabled by default but for performance reasons, only every
1024th deferred is made leak-aware (see `manifold.deferred/deferred`). However, deferreds
constructed via `manifold.deferred/error-deferred` are always leak-aware.

This patch introduces a new option `manifold.debug/*leak-aware-deferred-rate*` which allows changing
the rate of leak-aware deferreds being instantiated, with default still being 1024. When this option
is set to to 1 (= every deferred will be leak-aware), dropped errors can be detected more reliably
e.g. during testing.

To that end, the patch also includes a test util for instrumenting all tests in a namespace so that
they fail when a dropped error was detected. All test namespaces are now instrumented like
this (except for `:benchmark` and `:stress` tests), which already uncovered a few latent dropped
errors. These will be fixed in separate patches.
Also, make test-alt a bit more reliable and make the exception messages distinct so that a dropped
error can easily be matched up with the deferred.
Involving `d/future` makes these tests needlessly flaky. By decoupling the `d/alt` invocation from
dereferencing its result, we can deterministically control the order of events.
The result deferred `d` may end up in an error state when `(f)` throws an exception or returns an
error deferred. Now since the cancellation callback was only attached via `d/chain` for
side-effect (i.e. the resulting deferred is not returned), the dropped error detection would be
triggered in this case. The fix is simply to also attach an error handler with `d/on-realized`.

Furthermore, putting the result deferred in an error state would not have cancelled the timer
because there was no error handler attached to it. This is now also fixed and the tests are extended
to cover the cancellation API, too.

The latter uncovered an oversight in the mock clock implementation where the `in` method didn't
return a cancellation function, resulting in an NPE. This is now also fixed.
This was a mistake, this metadata flag is supposed to go on the test *var* only.
When putting the result deferred `d` in an error state, deferred returned by `chain` would be
detected as a dropped error because it's only used to attach a side-effecting callback for
cancelling the timeout. The fix then is to use `on-realized` instead to attach a handler for both
cases. As a consequence, timeouts are now also properly cancelled when the result deferred is put in
an error state, freeing up resources in a more timely fashion.

This fixes one of the causes of clj-commons/aleph#766, as well as 2 of the
3 dropped errors in `manifold.go-off-test`.

Also, add dedicated tests for `d/timeout!`.
When `in` is closed and `out` is also closed by the time the chained handler runs, closing `s'`
would fail because it's `::none`. This non-deterministic case was occasionally observed with
`manifold.stream-test/test-cleanup` failing due to a leaked error deferred like so:

    WARNING: unconsumed deferred in error state, make sure you're using `catch`.
    java.lang.ClassCastException: clojure.lang.Keyword cannot be cast to manifold.stream.core.IEventStream
            at manifold.stream$concat$this__3189__auto____14958$fn__14959$fn__14960.invoke(stream.clj:819)
            at manifold.deferred$eval3018$chain_SINGLEQUOTE____3039.invoke(deferred.clj:904)
            at manifold.stream$concat$this__3189__auto____14958$fn__14959.invoke(stream.clj:816)
            at manifold.stream$concat$this__3189__auto____14958.invoke(stream.clj:815)
            at clojure.lang.AFn.applyToHelper(AFn.java:154)
            at clojure.lang.AFn.applyTo(AFn.java:144)
            at clojure.core$apply.invokeStatic(core.clj:669)
            at clojure.core$apply.invoke(core.clj:662)
            at manifold.stream$concat$this__3189__auto____14958$fn__14984.invoke(stream.clj:815)
Timeout tests are changed so that every timeout exception has a uniquely identifiable message which
makes it easier to match up the dropped error warning with the originating timeout.
This saves a bit of overhead relative to `d/chain` + `d/catch`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants