Even before the introduction of `std::function` in C++11, callbacks had been extensively used in C++. Most well-known libraries and frameworks also have their own implementations, including Qt's Signals and Slots, the superseded Boost.Function, and ETLCpp's `delegate`. C++23 and C++26 are also going to introduce more variants: `move_only_function` and `function_ref`.
Having observed much confusion around the trade-offs between these classes, I've written this article to discuss them thoroughly. This article will cover the problem domain, the design trade-offs, and their impact on performance. Finally, I will implement these trade-offs in a separate policy-based callback class and measure its performance.
What is a callback function? Intuitively speaking, it needs to support two operations: construction from an invocable object, and invocation via `operator()`. The invocable object could be a lambda, a function pointer, a member function pointer, and even more.

In the following paragraphs, I will use the following notation for the sake of brevity:

- `cb` denotes the callback instance.
- `obj` denotes the invocable object.

The above two requirements roughly map to:

1. `Callback cb{obj};` is valid.
2. `cb(args...)` should result in the same behavior as `obj(args...)`.

Apparently, the above definition leaves many unexplained bits about the semantics of each operation. For now, let's put these aside and make a few more assumptions to keep this article short.
Most callback implementations assume that each `Callback` has an associated function signature. For example, `std::function<double(int)> cb;` can only be created from an `obj` that has an `obj(int) -> double` interface, and passing a `string` to the interface results in a compile-time error. This approach is very easy to reason about, as enforcing signature checks prevents potential argument-mismatch errors, which is considered one of the biggest benefits of statically-typed languages. We will limit our discussion to it in the remaining paragraphs.
A tiny fraction of callback implementations allow the signature to be determined at runtime, which means (1) functions could take an `AnyCallback anyCB` parameter, (2) any callable could be stored in `AnyCallback`, and (3) invoking it with arguments that do not match the stored callable's parameters leads to a runtime error.
Some signal-slot implementations allow users to register many receivers to a slot. Calling `cb(args...)` will notify all registered (also called connected) receivers. This pattern is typical in GUI programming, as in GTK or Qt. This article will not discuss it, as multi-receiver support can be considered an additional subscriber pattern layered over simple callbacks.
Before diving into real-world implementations, let's discuss the approach of passing the callback type as a template parameter, akin to standard-library-style callbacks:
This method has its advantages and disadvantages:

- The algorithm (e.g. `transform`) becomes specialized against the concrete callback type (`unary_op`), potentially allowing inlining of the function body to enhance performance.
- The overhead of passing `unary_op` directly into this function is virtually nonexistent.
- However, the `Compare` type is manifested in the type of `std::set`, potentially leading to further template bloat when users introduce various callback types.

Verdict:
The forthcoming `function_ref` in C++26, among other implementations, stores only a reference to the callable object. This results in a fixed callback size: only a trampoline function pointer plus the object (either a pointer to a callable object or a member function pointer) needs to be stored. On the x86-64 LP64 platform, this equates to 24 bytes (8 for the trampoline pointer plus 16 for the object), offering savings compared to `std::function`'s 32 bytes. Additionally, it eliminates the virtual dispatch seen in `std::function`, generally improving performance.
When accepting a `function_ref` as a parameter, it's crucial that the reference isn't stored beyond the duration of the call, to prevent dangling references. A typical use case is illustrated below:
It's important to note that performance enhancements aren't solely attributed to taking a reference. In theory, passing a raw function pointer or a `std::ref` to a plain callback implementation should yield equivalent outcomes. This will be further explored in subsequent sections.
Following our discussion, we arrive at a type-erased callback interface:

`std::function`, or its contemporary counterpart `std::copyable_function`, mandates that the callable be copyable, whereas `std::move_only_function` requires the callable to be movable.
### `operator()` Interface

In C++, a callable's call operator may have various qualifiers. `std::copyable_function` permits the user to specify these in the callback signature, applying them to `operator()`:
Comparing it to its predecessor, `std::function`:
As proposed here, including const and other qualifiers in the callback interface can help prevent unintended modifications to the callback itself:
After exploring various interface designs and their impacts on users, we now turn our attention to the implementation of these interfaces. We'll assess performance across three dimensions:
We posit that many prior analyses failed to adequately distinguish between distinct subdomains. Our discussion hinges on two critical choices:
The tests were performed on my personal laptop, equipped with an AMD 6800H processor. Frequency scaling was disabled, and the CPU governor was set to `performance`. The system runs Linux 6.7.4 and glibc 2.39. Performance metrics were gathered using Catch2's benchmarking defaults, which execute 100 runs with 100,000 resamples. Compilation was done with gcc 13.2.1 using the `-O2` optimization flag:
The question arises: which operations require dynamic dispatch? We identify four candidates:

- Invocation, which every callback needs.
- Destruction (not needed for `function_ref` (including function pointers and member function pointers) or trivially-destructible objects).
- Copying (not needed for `function_ref`, non-copyable, or trivially-copyable objects).
- Moving (not needed for `function_ref`, non-moveable, or trivially-moveable objects).

The most versatile approach is undoubtedly VIRTCALL, which involves declaring all required methods as virtual within a base class and creating a wrapper for each callable object to implement these virtual methods.
This methodology is employed by `std::function` and numerous other implementations that accommodate non-trivial destructors, copy constructors, and move constructors. Regarding overhead, each operation (invocation, copying, moving, and destruction) necessitates a virtual call. Additionally, the object incorporates an 8-byte vptr to enable these virtual calls.
Is it feasible to eliminate the virtual call? Remember, when only the polymorphic call operator is necessary, we can store a function pointer to a trampoline function as shown below:
This approach eliminates one level of indirection in the call operation. The storage overhead remains equivalent to VIRTCALL, at 8 bytes. However, since a separate trampoline pointer is needed for each dynamic method, this strategy is best when there's only one virtual method. Additionally, as of 2024, compilers do not typically optimize trampoline pointers for devirtualization, so the missed optimization opportunity may be a significant factor to consider.
This technique was initially introduced in Impossibly Fast C++ Delegates and further explored in its sequel. It is particularly suitable for delegates as they do not own the objects they reference, thus eliminating the need for polymorphic move/copy/destructor features.
While the methods mentioned above are widely utilized, performance can be further enhanced by leveraging platform-specific implementations.
### `void*` Pointer at the End of Trampoline's Parameter List

Most calling conventions assign arguments to registers or stack locations based on their order. By placing the additional object pointer at the end of the argument list, we can minimize the overhead associated with shifting `Args... args`. Although these register moves typically cost less than a cycle due to speculative execution and move elimination, this adjustment still contributes to reduced code size:

`void*` at start:

`void*` at end:
Member pointers occupy 16 bytes and typically entail slower dereferencing than direct calls via lambdas, especially when the target function is inline-able. We noted a 130% slowdown from member pointers on an inline-able method versus its lambda equivalent:
For methods that cannot be inlined, the performance degradation is less pronounced, with timings of 8.53ms versus 7.15ms (a 19% difference).
The underlying reasons for this slowdown are threefold: (1) member pointers are 16 bytes in size, while a plain lambda occupies only 1 byte (for an empty lambda); (2) the use of a member pointer, a form of type-erasure, prevents the compiler from inlining; (3) invoking a member function pointer necessitates additional offset arithmetic to account for potential virtual inheritance. Thus, avoiding member function pointers is generally advisable.
Although various techniques exist for optimizing around member pointers by examining ABI-specific internals (with one example here), practical experience suggests that converting these member pointers into appropriate lambda wrappers is often a simpler and more effective strategy.
### `invoke`

Currently, we employ `ReturnT invokeTrampoline(Args&&... args, void* obj)` as the trampoline function signature. When a `Callback<void(int)> cb{ /* function pointer of void(int) */ }` is specified, the trampoline function receives an `int&&`, compelling the callback class to store the passed `int` on the stack, transfer its address into `%rdi`, and then dereference it within the trampoline to invoke the underlying `void(int)` function. This conversion process could be bypassed by directly passing the `int` by value to the trampoline. A similar approach is utilized in abseil's implementation.
When constructing a callback like this:
The precise function address becomes obscured as soon as its address is taken, leading to a situation where trampoline or virtual call wrappers, which are specialized based on the `Callable` type, must rely on an indirect call via the function pointer to access the underlying callback. This approach limits compiler optimization opportunities and increases the runtime size of the callback.
An alternative is for users to pass a strongly-typed lambda as an argument:
This enables compilers to more aggressively inline within the trampoline implementation, as illustrated below:
While gcc may automatically optimize this through function cloning and constant propagation in some instances, other situations may necessitate a manual lambda wrapper. It's important to note that this can lead to additional specialization and potentially increase template bloat, so this strategy should be reserved for performance-critical paths.
This approach is exemplified by Matt's delegate implementation, which includes a `.bind<&func>()` method. This method effectively transforms each unique global function into a distinct type, mirroring the benefits of using strongly-typed lambdas.
This section delves into several additional optimization ideas that remain largely unexplored in production environments. Many of these concepts are highly dependent on specific platforms and may not integrate well with compiler optimizations. Furthermore, they are entirely untested, so it is advisable to approach them with caution.
One intriguing possibility involves leveraging self-modifying code to achieve polymorphism in the call method, rather than relying on a virtual pointer or trampoline function pointer. In the simplest scenario of storing a function pointer, the indirect `callq *(%rax)` instruction could be replaced with a relative `callq {target_offset}` instruction, with the `{target_offset}` field being patched in the code itself. This approach might enhance performance by facilitating speculative execution. Moreover, when the callback's signature precisely aligns with the target's signature and employs the same calling convention, it becomes feasible to transform the body of the callback into a `jmp {target_offset}` instruction, effectively eliminating all overhead associated with parameter passing during the conversion from `T` to `T&&`.
However, it is important to note that self-modifying code tends to disrupt branch predictors following any modification. Consequently, the potential benefits may only justify the complexities involved if a callback is expected to be invoked numerous times.
Considering the trampoline implementation discussed earlier, when invoking a function pointer, the callback class must first call the trampoline function `ReturnT trampoline(Args&&... args, void* obj)`, which then makes the actual call to `obj(forward<Args>(args)...)`. An alternative strategy substitutes the stored trampoline pointer with the target function pointer itself, thereby removing one level of indirection. However, this simplification is subject to certain constraints:

- The target's signature and calling convention must be compatible with the trampoline's.
- The extra trailing `void*` argument should not interfere with the passing conventions or stack layout of the existing `args`.

These ideas represent the frontier of callback optimization techniques, where the balance between innovation and practical utility must be carefully navigated.
The utilization of Small Buffer Optimization (SBO) is a common practice among callback implementations, where a small buffer is allocated on the stack for objects that can fit within this space. While numerous online articles detail the implementation of SBO (one example), this discussion will focus on the overhead introduced by SBO and various implementation nuances rather than reiterating well-covered material. To provide context, a typical SBO implementation for callbacks can be outlined as follows:
On x86-64 LP64 platforms, the size of the object is calculated as `STACK_SIZE + sizeof(unique_ptr)`, which equals `STACK_SIZE + 8` bytes. This formula highlights that allowing the buffer to expand onto the heap introduces an 8-byte overhead. Additionally, this design necessitates a conditional branch in `getActiveStorage` and some extra logic in `resizeToContain`.
A notable pitfall with this SBO approach when storing callable objects involves the direct use of `reinterpret_cast<Callable*>(dataPtr)`. This can lead to undefined behavior, as type punning from `unsigned char[]` to another object type is generally prohibited; the callable must instead be created in the buffer via placement-new, and `std::launder` should be used when casting the pointer back, to avoid potential optimization issues.
Furthermore, specifying `alignas(max_align_t)` for the SBO buffer may be necessary to ensure proper alignment for callable objects. While some implementations forego specifying alignment to leverage x86-64's forgiving treatment of unaligned access, such 'optimizations' are generally discouraged in callback implementations, since stored callables may carry stricter alignment requirements.
Reflecting on our earlier assertion, we identified that performance hinges on two pivotal design choices:
Drawing from the discussions, here's a comparative table showcasing various features and their associated performance overheads:
Feature | Overhead |
---|---|
Base dynamic dispatch (virtcall or trampoline) | 8 bytes |
Supports polymorphic destruction | -* |
Supports polymorphic move | -* |
Supports polymorphic copy | -* |
Base SBO storage size | STACK_SIZE bytes |
Need for SBO to expand onto heap | +8 bytes, plus some additional overhead |

\* When any of these three is necessary, VIRTCALL is implemented to enable dynamic dispatch, adding an extra layer of indirection but not affecting callback size.
With this framework, we can delineate common examples into different Policies based on these features:

- `std::function`: incorporates all features, equipped with a 16-byte SBO buffer, culminating in a total size of 16 + 8 + 8 = 32 bytes.
- `absl::FunctionRef`: exclusively stores a reference to a callable or a function pointer, employing an 8-byte SBO. It does not support polymorphic destruction, resulting in a size of 8 + 8 = 16 bytes, and is realized through a trampoline mechanism.

This overview allows us to discern the trade-offs between different callback implementations, illustrating how the combination of dispatch method and storage strategy influences the overall efficiency and size of a callback mechanism.
Based on the previous discussion, I implemented a highly experimental policy-based callback library, which supports the following interfaces:
Additionally, I used it to replicate many popular callback choices to evaluate their performance:
I benchmarked various scenarios:
The results are as follows:
In this test case, the callable is an 8-byte global function pointer. We tested both without the strongly-typed function technique (plain function pointers) and with a strongly-typed lambda wrapper.
Overall, for invocation, function pointers are slightly faster than the alternatives. Using a strongly-typed function as input eliminated the difference. As for the copy and call scenario, all implementations over 16 bytes are the slowest, followed by 16-byte implementations, and then the 8-byte raw function pointer. The virtual dispatch and trampoline approach displayed similar overhead in calling, while dynamic heap allocation considerably impacted copy performance.
When it comes to a 16-byte callable object, the results demonstrated similar behavior to the previous case. All implementations have similar call performance, while the 16-byte implementations demonstrated much better copy-and-call performance. The difference might be explained by accumulated cache pressure.
In this article, we discussed the design choices for the interface and implementation of C++ callback classes. We claim that performance is impacted by both the dispatch and the storage implementation, and we covered many approaches to optimizing them. In the end, we implemented these trade-offs via a `PolicyCB` class and discussed its performance.