mikeash commented Oct 21, 2025

This is currently disabled by default. Building the client library can be enabled with the CMake option SWIFT_BUILD_CLIENT_RETAIN_RELEASE, and using the library can be enabled with the flags -Xfrontend -enable-client-retain-release.

To improve retain/release performance, we build a static library containing optimized implementations of the fast paths of swift_retain, swift_release, and the corresponding bridgeObject functions. This avoids going through a stub to make a cross-library call.

IRGen gains awareness of these new functions and emits calls to them when the functionality is enabled and the target supports them. Two options are added to force use of them on or off: -enable-client-retain-release and -disable-client-retain-release. When enabled, the compiler auto-links the static library containing the implementations.

The new calls also use LLVM's preserve_most calling convention. Since retain/release doesn't need a large number of scratch registers, this is mostly harmless for the implementation, while allowing callers to improve code size and performance by spilling fewer registers around refcounting calls. (Experiments with an even more aggressive calling convention preserving x2 and up showed an insignificant savings in code size, so preserve_most seems to be a good middle ground.)
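
For illustration, a minimal C sketch of how a preserve_most function is declared in Clang (the function here is hypothetical, not the actual runtime entry point):

#include <stdio.h>

// Hypothetical function marked with LLVM's preserve_most convention. The
// callee must preserve nearly all registers, so callers can keep live values
// in registers across the call instead of spilling them.
__attribute__((preserve_most))
static void example_retain(void *object) {
  (void)object; // the real fast path would adjust the refcount here
}

int main(void) {
  int local = 42; // can stay in a register across the call below
  example_retain(&local);
  printf("%d\n", local);
  return 0;
}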

Since the implementations are embedded into client binaries, any change in the runtime's refcounting implementation needs to stay compatible with this new fast path implementation. This is ensured by having the implementation use a runtime-provided mask to check whether it can proceed into its fast path. The mask is provided as the address of the absolute symbol _swift_retainRelease_slowpath_mask_v1. If that mask ANDed with the object's current refcount field is non-zero, then we take the slow path. A future runtime that changes the refcounting implementation can adjust this mask to match, or set the mask to all 1s to disable the old embedded fast path entirely (as long as the new representation never uses 0 as a valid refcount field value).
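
A simplified C rendering of that check might look like the following; the refcount field layout is illustrative, but the symbol name matches the one described above:

#include <stdbool.h>
#include <stdint.h>

// Absolute symbol provided by the runtime; its *address* is the mask value,
// so reading it costs no load instruction.
extern char _swift_retainRelease_slowpath_mask_v1;

static bool canUseFastPath(uint64_t refcountField) {
  uint64_t mask = (uint64_t)(uintptr_t)&_swift_retainRelease_slowpath_mask_v1;
  // Any bit selected by the runtime-provided mask forces the slow path.
  return (refcountField & mask) == 0;
}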

As part of this work, the overall approach for bridgeObjectRetain is changed slightly. Previously, it would mask off the spare bits from the native pointer and then call through to swift_retain. This either lost the spare bits in the return value (when tail calling swift_retain), which is problematic since it's supposed to return its parameter, or it required pushing a stack frame, which is inefficient. Now, swift_retain takes on the responsibility of masking off spare bits from the parameter and preserving them in the return value. This is a trivial addition to the fast path (just a quick mask and an extra register for saving the original value) and makes a correct bridgeObjectRetain, one that returns the exact value it was passed, quite a bit more efficient.
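
A C sketch of the new division of labor (the mask name and helpers are stand-ins for this sketch; the real code is assembly):

#include <stdint.h>

// Illustrative name for the pointer-bits mask; the same constant appears in
// the assembly excerpts later in this thread.
#define POINTER_BITS_MASK 0x0ffffffffffffff8ULL

static void retain_object(void *object) {
  (void)object; // stand-in for the actual refcount increment
}

// swift_retain-style entry: strips spare bits before touching the object,
// but returns the caller's original value with spare bits intact. This lets
// bridgeObjectRetain tail call it without losing its return value.
static void *example_retain(void *value) {
  void *object = (void *)((uintptr_t)value & POINTER_BITS_MASK);
  retain_object(object);
  return value;
}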

The runtime's implementations of swift_retain/release are now also marked as preserve_most so that they can be tail called from the client library. preserve_most is compatible with callers expecting the standard calling convention so this doesn't break any existing clients. Some ugly tricks were needed to prevent the compiler from creating unnecessary stack frames with the new calling convention. Avert your eyes.

To allow back deployment, the runtime now has aliases for these functions called swift_retain_preservemost and swift_release_preservemost. The client library brings weak definitions of these functions that save the extra registers and call through to swift_retain/release. This allows them to work correctly on older runtimes, with a small performance penalty, while still running at full speed on runtimes that have the new preservemost symbols.
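
A rough C analogue of that linkage arrangement (the real shims are assembly and must actually save the extra preserve_most registers; everything here other than the alias name is a stand-in):

// Stand-in for the runtime's standard swift_retain entry point.
static void *retain_impl(void *object) {
  return object;
}

// Weak definition shipped in the client library. When the running runtime
// exports a strong swift_retain_preservemost, it wins and runs at full
// speed; on older runtimes this fallback is used instead, paying a small
// penalty to forward through the standard convention.
__attribute__((weak, preserve_most))
void *swift_retain_preservemost(void *object) {
  return retain_impl(object);
}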

Although this is only supported on Darwin at the moment, it shouldn't be too much work to adapt it to other ARM64 targets. We need to ensure the assembly plays nice with the other platforms' assemblers, and make sure the implementation is correct for the non-ObjC-interop case.

rdar://122595871

ARGS(RefCountedPtrTy),
ATTRS(NoUnwind, FirstParamReturned, WillReturn),
EFFECT(RuntimeEffect::RefCounting),
UNKNOWN_MEMEFFECTS)
Contributor

I assume this is the same as it used to be, but remind me why retain has unknown memeffects? I see why release would

Contributor Author

I asked my team and it seems to be inadvertent or left over.


// Load-exclusive of the current value in the refcount field when using LLSC.
// stxr does not update x16 like cas does, so this load must be inside the loop.
// ldxr/stxr are not guaranteed to make forward progress if there are memory
Contributor

I really appreciate the comment here because I was wondering about exactly this

casl x16, x17, [x1]

// The previous value of the refcount field is now in x16. We succeeded if that
// value is the same as the old value we had before. If we failed, retry.
Contributor

Idle speculation, please feel free to ignore: I wonder if there's a way to shrink the fast path further by assuming it never fails the CAS, since contended atomics are slow enough that falling back to the slow path probably wouldn't hurt much.

Contributor Author

I like the idea, but I don't think there's anything that could be removed. We'd basically change the cbnz below to jump to Lslowpath instead, no real difference.
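
In C11-atomics terms, the idea under discussion amounts to a one-shot attempt like the sketch below (field layout and increment are illustrative), though as noted it wouldn't remove any instructions from the fast path:

#include <stdatomic.h>
#include <stdint.h>

static void slow_path_retain(void *object) {
  (void)object; // the full runtime implementation would go here
}

// Try the compare-and-swap exactly once; on any failure (contention or an
// unexpected refcount state), punt to the slow path rather than looping.
static void retain_one_shot(_Atomic uint64_t *refcount, void *object) {
  uint64_t old = atomic_load_explicit(refcount, memory_order_relaxed);
  uint64_t desired = old + 1; // illustrative increment
  if (!atomic_compare_exchange_strong_explicit(refcount, &old, desired,
                                               memory_order_relaxed,
                                               memory_order_relaxed)) {
    slow_path_retain(object);
  }
}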

#define CALL_IMPL_SWIFT_REFCOUNT_CC(name, args) \
do { \
if (SWIFT_UNLIKELY( \
_swift_enableSwizzlingOfAllocationAndRefCountingFunctions_forInstrumentsOnly \
Contributor

FWIW Instruments never ended up adopting this, in case that's useful info. They pointed out that exposing a C++ atomic this way risks ABI issues, and asked for a function to call instead.

mikeash force-pushed the emit-into-client-retain-release branch 2 times, most recently from a148e8a to cc1b0f5 on October 21, 2025 at 20:45

#elif USE_LDX_STX
// Try to store the updated value.
stxr w16, x17, [x1]
Contributor

maybe naive question, but why is this using x17 when x3 is used in the CAS case?

Contributor Author

It's really quite simple. I changed the CAS case and forgot to update the non-CAS case. Pushed a fix now. Thanks for pointing that out, well spotted.


mikeash commented Oct 22, 2025

@swift-ci please test

mikeash force-pushed the emit-into-client-retain-release branch from cc1b0f5 to 64f4fd8 on October 22, 2025 at 22:44

mikeash commented Oct 22, 2025

@swift-ci please test


// Save or load all of the registers that we promise to preserve that aren't
// preserved by the standard calling convention. The macro parameter is either
// step or ldp to save or load.
Contributor

Suggested change
// step or ldp to save or load.
// stp or ldp to save or load.

Comment on lines 67 to 87
.macro SAVE_LOAD_REGS inst, pushStack
.if \pushStack
\inst x2, x3, [sp, #-0x70]!
.else
\inst x2, x3, [sp, #0x0]
.endif
\inst x4, x5, [sp, #0x10]
\inst x6, x7, [sp, #0x20]
\inst x8, x9, [sp, #0x30]
\inst x10, x11,[sp, #0x40]
\inst x12, x13,[sp, #0x50]
\inst x14, x15,[sp, #0x60]
.endmacro

.macro SAVE_REGS
SAVE_LOAD_REGS stp, 1
.endmacro

.macro LOAD_REGS
SAVE_LOAD_REGS ldp, 0
.endmacro
Contributor

We don't seem to use these, which is probably a good thing because in the non-pushStack case we aren't actually popping the stack.

Suggested change
.macro SAVE_LOAD_REGS inst, pushStack
.if \pushStack
\inst x2, x3, [sp, #-0x70]!
.else
\inst x2, x3, [sp, #0x0]
.endif
\inst x4, x5, [sp, #0x10]
\inst x6, x7, [sp, #0x20]
\inst x8, x9, [sp, #0x30]
\inst x10, x11,[sp, #0x40]
\inst x12, x13,[sp, #0x50]
\inst x14, x15,[sp, #0x60]
.endmacro
.macro SAVE_REGS
SAVE_LOAD_REGS stp, 1
.endmacro
.macro LOAD_REGS
SAVE_LOAD_REGS ldp, 0
.endmacro

Contributor Author

Yeah, they were left over from earlier versions. In the end I decided I preferred to write it out in each place rather than macroize it. Deleted now.

add fp, sp, #0x40

// Clear the unused bits from the pointer
and x0, x0, #0x0ffffffffffffff8
Contributor

Do we want a constant for 0x0ffffffffffffff8? It appears in a few places, and it'd avoid having e.g. the wrong number of fs somewhere by mistake, as well as making it easier to update in future.

Contributor Author

Good idea, done.

Comment on lines 118 to 121
CONDITIONAL PTRAUTH, \
retab
CONDITIONAL !PTRAUTH, \
ret
Contributor

Personally I'd be inclined to do something like

.macro ret_maybe_ab
.if PTRAUTH
  retab
.else
  ret
.endif
.endmacro

(or equivalent using the C preprocessor, since that's where PTRAUTH is coming from anyway).

and just replace this with

Suggested change
CONDITIONAL PTRAUTH, \
retab
CONDITIONAL !PTRAUTH, \
ret
ret_maybe_ab

Contributor Author

Looks good, done (and also added maybe_pacibsp for the start).

Comment on lines +112 to +114
// WASM says yes to __has_attribute(musttail) but doesn't support using it, so
// exclude WASM from SWIFT_MUSTTAIL.
#if __has_attribute(musttail) && !defined(__wasm__)
Contributor

Have we filed a bug report about that?

Contributor Author

I filed one internally, rdar://162366004.

Contributor

External version: llvm/llvm-project#163256
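
For reference, a guard like the one in the excerpt above might be used along these lines; the expansion of SWIFT_MUSTTAIL shown here is an assumption, and the helper functions are hypothetical:

// WASM claims support via __has_attribute but can't actually use musttail,
// hence the extra exclusion.
#if __has_attribute(musttail) && !defined(__wasm__)
#define SWIFT_MUSTTAIL __attribute__((musttail))
#else
#define SWIFT_MUSTTAIL
#endif

static long slow_path(long x) {
  return x + 1;
}

static long fast_path(long x) {
  if (x < 0) {
    // Guaranteed tail call where supported; an ordinary call elsewhere.
    SWIFT_MUSTTAIL return slow_path(x);
  }
  return x;
}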

// contents, to eliminate one load instruction when using it. This is imported
// weakly, which makes its address zero when running against older runtimes.
// ClientRetainRelease references it using an addend of 0x8000000000000000,
// whicrh produces the appropriate mask in that case. Since the mask is still
Contributor

Suggested change
// whicrh produces the appropriate mask in that case. Since the mask is still
// which produces the appropriate mask in that case. Since the mask is still

mikeash force-pushed the emit-into-client-retain-release branch from 64f4fd8 to 55893f9 on October 24, 2025 at 21:21

mikeash commented Oct 24, 2025

@swift-ci please test

