Even before the introduction of `std::function` in C++11, callbacks had been extensively used in C++. Most well-known libraries and frameworks also have their own implementations, including Qt's Signals and Slots, the superseded Boost.Function, and ETLCpp's `delegate`. C++23 and C++26 are also going to introduce more variants: `move_only_function` and `function_ref`.
Having observed much confusion around the trade-offs between these classes, I've written this article to discuss them thoroughly. This article will cover the problem domain, the design trade-offs, and their impact on performance. Finally, I will implement these trade-offs in a separate policy-based callback class and measure its performance.
What is a callback function? Intuitively speaking, it needs to support two operations: construction from an invocable object, and invocation via `operator()`. The invocable object could be a lambda, a function pointer, a member function pointer, and even more.

In the following paragraphs, I will use the following notation for the sake of brevity:

- `cb` denotes the callback instance.
- `obj` denotes the invocable object.

The above two requirements roughly map to:

1. `Callback cb{obj};` is valid.
2. `cb(args...)` should result in the same behavior as `obj(args...)`.

Apparently, the above definition leaves many unexplained bits about the semantics of each operation. For now, let's put these aside and make a few more assumptions to keep this article short.
Most callback implementations assume that each `Callback` has an associated function signature. For example, `std::function<double(int)> cb;` can only be created from an `obj` that has an `obj(int) -> double` interface, and passing a `string` to the interface results in a compile-time error. This approach is very easy to reason about, as enforcing signature checks prevents potential argument-mismatch errors, which is considered one of the biggest benefits of statically-typed languages. We will limit our discussion to it in the remaining paragraphs.
A tiny fraction of callback implementations allow the signature to be determined at runtime, which means (1) functions could take an `AnyCallback anyCB` parameter, (2) any callable could be stored in `AnyCallback`, and (3) invoking it with arguments that do not match the stored callable's parameters leads to a runtime error.
Some signal-slot implementations allow users to register many receivers to a slot. Calling `cb(args...)` will notify all registered (also called connected) receivers. This pattern is typical in GUI programming, as in GTK or Qt. This article will not discuss it, as multi-receiver support can be considered an additional subscriber pattern layered over simple callbacks.
Before diving into real-world implementations, let's discuss the approach of passing the callback type as a template parameter, akin to standard-library-style callbacks:
This method has its advantages and disadvantages:

- The algorithm (e.g. `transform`) becomes specialized against the concrete callback type (`unary_op`), potentially allowing inlining of the function body to enhance performance.
- The overhead of passing `unary_op` directly into this function is virtually nonexistent.
- However, the `Compare` type is manifested in the type of `std::set`, potentially leading to further template bloat when users introduce various callback types.

Verdict:
The forthcoming `function_ref` in C++26, among other implementations, stores only a reference to the callable object. This results in a fixed callback size: only a trampoline function pointer plus the object (either a pointer to a callable object or a member function pointer) needs to be stored. On the x86-64 LP64 platform, this equates to 24 bytes (8 for the trampoline pointer plus 16 for the object), offering savings compared to `std::function`'s 32 bytes. Additionally, it eliminates the virtual dispatch seen in `std::function`, generally improving performance.
When accepting a `function_ref` as a parameter, it's crucial that the reference isn't stored beyond the duration of the call, to prevent dangling references. A typical use case is illustrated below:
It's important to note that performance enhancements aren't solely attributed to taking a reference. In theory, passing a raw function pointer or a `std::ref` to a plain callback implementation should yield equivalent outcomes. This will be further explored in subsequent sections.
Following our discussion, we arrive at a type-erased callback interface:

`std::function`, or its contemporary counterpart `std::copyable_function`, mandates that the callable be copyable, whereas `std::move_only_function` requires the callable to be movable.
### `operator()` Interface

In C++, a callable's call operator may have various qualifiers. `std::copyable_function` permits the user to specify these in the callback signature, applying them to `operator()`:
Comparing it to its predecessor, `std::function`:
As proposed here, including const and other qualifiers in the callback interface can help prevent unintended modifications to the callback itself:
After exploring various interface designs and their impacts on users, we now turn our attention to the implementation of these interfaces. We'll assess performance across three dimensions:
We posit that many prior analyses failed to adequately distinguish between distinct subdomains. Our discussion hinges on two critical choices:
The tests were performed on my personal laptop, equipped with an AMD 6800H processor. Frequency scaling was disabled, and the CPU governor was set to `performance`. The system runs Linux 6.7.4 and glibc 2.39. Performance metrics were gathered using Catch2's benchmarking defaults, which execute 100 runs with 100,000 resamples. Compilation was done with gcc 13.2.1 using the `-O2` optimization flag:
The question arises: which operations require dynamic dispatch? We identify four candidates:

- Invocation, which every callback needs.
- Destruction (not needed for `function_ref` (including function pointers and member function pointers) or trivially-destructible objects).
- Copying (not needed for `function_ref`, non-copyable, or trivially-copyable objects).
- Moving (not needed for `function_ref`, non-moveable, or trivially-moveable objects).

The most versatile approach is undoubtedly VIRTCALL, which involves declaring all required methods as virtual within a base class and creating a wrapper for each callable object to implement these virtual methods.
This methodology is employed by `std::function` and numerous other implementations that accommodate non-trivial destructors, copy constructors, and move constructors. Regarding overhead, each operation (invocation, copying, moving, and destruction) necessitates a virtual call. Additionally, the object incorporates an 8-byte vptr to enable these virtual calls.
Is it feasible to eliminate the virtual call? Remember, when only the polymorphic call operator is necessary, we can store a function pointer to a trampoline function as shown below:
This approach eliminates one level of indirection in the call operation. The storage overhead remains equivalent to VIRTCALL, at 8 bytes. However, since a separate trampoline pointer is needed for each dynamic method, this strategy is best when there's only one virtual method. Additionally, as of 2024, compilers do not typically optimize trampoline pointers for devirtualization, so the missed optimization opportunity may be a significant factor to consider.
This technique was initially introduced in Impossibly Fast C++ Delegates and further explored in its sequel. It is particularly suitable for delegates as they do not own the objects they reference, thus eliminating the need for polymorphic move/copy/destructor features.
While the methods mentioned above are widely utilized, performance can be further enhanced by leveraging platform-specific implementations.
### `void*` Pointer at the End of Trampoline's Parameter List

Most calling conventions assign arguments to registers or stack locations based on their order. By placing the additional object pointer at the end of the argument list, we can minimize the overhead associated with shifting `Args... args`. Although these register moves typically cost less than a cycle due to speculative execution and move elimination, this adjustment still contributes to reduced code size:

`void*` at start:

`void*` at end:
Member pointers occupy 16 bytes and typically entail slower dereferencing than direct calls via lambdas, especially when the target function is inline-able. We noted a 130% slowdown from member pointers on an inline-able method versus its lambda equivalent:
For methods that cannot be inlined, the performance degradation is less pronounced, with timings of 8.53ms versus 7.15ms (a 19% difference).
The underlying reasons for this slowdown are threefold: (1) member pointers are 16 bytes in size, while a plain lambda occupies only 1 byte (for an empty lambda); (2) the use of a member pointer, a form of type-erasure, prevents the compiler from inlining; (3) invoking a member function pointer necessitates additional offset arithmetic to account for potential virtual inheritance. Thus, avoiding member function pointers is generally advisable.
Although various techniques exist for optimizing around member pointers by examining ABI-specific internals (with one example here), practical experience suggests that converting these member pointers into appropriate lambda wrappers is often a simpler and more effective strategy.
### `invoke`

Currently, we employ `ReturnT invokeTrampoline(Args&&... args, void* obj)` as the trampoline function signature. When a `Callback<void(int)> cb{ /* function pointer of void(int) */ }` is specified, the trampoline function receives an `int&&`, compelling the callback class to store the passed `int` on the stack, transfer its address into `%rdi`, and then dereference it within the trampoline to invoke the underlying `void(int)` function. This conversion process could be bypassed by directly passing the `int` by value to the trampoline. A similar approach is utilized in abseil's implementation.
When constructing a callback like this:
The precise function address becomes obscured as soon as its address is taken, leading to a situation where trampoline or virtual call wrappers, which are specialized based on the `Callable` type, must rely on an indirect call via the function pointer to access the underlying callback. This approach limits compiler optimization opportunities and increases the runtime size of the callback.
An alternative is for users to pass a strongly-typed lambda as an argument:
This enables compilers to more aggressively inline within the trampoline implementation, as illustrated below:
While gcc may automatically optimize this through function cloning and constant propagation in some instances, other situations may necessitate a manual lambda wrapper. It's important to note that this can lead to additional specialization and potentially increase template bloat, so this strategy should be reserved for performance-critical paths.
This approach is exemplified by Matt's delegate implementation, which includes a `.bind<&func>()` method. This method effectively transforms each unique global function into a distinct type, mirroring the benefits of using strongly-typed lambdas.
This section delves into several additional optimization ideas that remain largely unexplored in production environments. Many of these concepts are highly dependent on specific platforms and may not integrate well with compiler optimizations. Furthermore, they are entirely untested, so it is advisable to approach them with caution.
One intriguing possibility involves leveraging self-modifying code to achieve polymorphism in the call method, rather than relying on a virtual pointer or trampoline function pointer. In the simplest scenario of storing a function pointer, the indirect `callq *(%rax)` instruction could be replaced with a relative `callq {target_offset}` instruction, with the `{target_offset}` field being patched in the code itself. This approach might enhance performance by facilitating speculative execution. Moreover, when the callback's signature precisely aligns with the target's signature and employs the same calling convention, it becomes feasible to transform the body of the callback into a `jmp {target_offset}` instruction, effectively eliminating all overhead associated with parameter passing during the conversion from `T` to `T&&`.
However, it is important to note that self-modifying code tends to disrupt branch predictors following any modification. Consequently, the potential benefits may only justify the complexities involved if a callback is expected to be invoked numerous times.
Considering the trampoline implementation discussed earlier, when invoking a function pointer, the callback class must first call the trampoline function `ReturnT trampoline(Args&&... args, void* obj)`, which then makes the actual call to `obj(forward<Args>(args)...)`. An alternative strategy substitutes the stored trampoline pointer with the target function pointer itself, thereby removing one level of indirection. However, this simplification is subject to certain constraints:

- The target's signature and calling convention must be compatible with the trampoline's.
- The extra trailing `void*` argument should not interfere with the passing conventions or stack layout of the existing `args`.

These ideas represent the frontier of callback optimization techniques, where the balance between innovation and practical utility must be carefully navigated.
The utilization of Small Buffer Optimization (SBO) is a common practice among callback implementations, where a small buffer is allocated on the stack for objects that can fit within this space. While numerous online articles detail the implementation of SBO (one example), this discussion will focus on the overhead introduced by SBO and various implementation nuances rather than reiterating well-covered material. To provide context, a typical SBO implementation for callbacks can be outlined as follows:
On x86-64 LP64 platforms, the size of the object is calculated as `STACK_SIZE + sizeof(unique_ptr)`, which equals `STACK_SIZE + 8` bytes. This formula highlights that allowing the buffer to expand onto the heap introduces an 8-byte overhead. Additionally, this design necessitates a conditional branch in `getActiveStorage` and some extra logic in `resizeToContain`.
A notable pitfall with this SBO approach when storing callable objects involves the direct use of `reinterpret_cast<Callable*>(dataPtr)`. This can lead to undefined behavior, as type punning from `unsigned char[]` to another object type is generally prohibited; the callable must instead be created in the buffer via placement-new, and `std::launder` should be used when casting the pointer back, to avoid potential optimization issues.
Furthermore, specifying `alignas(max_align_t)` for the SBO buffer may be necessary to ensure proper alignment for callable objects. While some implementations forego specifying alignment to leverage x86-64's forgiving treatment of unaligned access, such 'optimizations' are generally discouraged in callback implementations, since stored callables may carry stricter alignment requirements.
Reflecting on our earlier assertion, we identified that performance hinges on two pivotal design choices:
Drawing from the discussions, here's a comparative table showcasing various features and their associated performance overheads:
Feature | Overhead |
---|---|
Base dynamic dispatch (virtcall or trampoline) | 8 bytes |
Supports polymorphic destruction | -* |
Supports polymorphic move | -* |
Supports polymorphic copy | -* |
Base SBO storage size | STACK_SIZE bytes |
Need for SBO to expand onto heap | +8 bytes, plus some additional overhead |

\* When any of these three is necessary, VIRTCALL is implemented to enable dynamic dispatch, adding an extra layer of indirection but not affecting callback size.
With this framework, we can delineate common examples into different Policies based on these features:

- `std::function`: incorporates all features, equipped with a 16-byte SBO buffer, culminating in a total size of 16 + 8 + 8 = 32 bytes.
- `absl::FunctionRef`: exclusively stores a reference to a callable or a function pointer, employing an 8-byte SBO. It does not support polymorphic destruction, resulting in a size of 8 + 8 = 16 bytes, and is realized through a trampoline mechanism.

This overview allows us to discern the trade-offs between different callback implementations, illustrating how the combination of dispatch method and storage strategy influences the overall efficiency and size of a callback mechanism.
Based on the previous discussion, I implemented a highly experimental policy-based callback library, which supports the following interfaces:
Additionally, I used it to replicate many popular callback choices to evaluate their performance:
I benchmarked various scenarios:
The results are as follows:
In this test case, the callable is an 8-byte global function pointer. We tested both without the strongly-typed function technique (plain function pointers) and with a strongly-typed lambda wrapper.
Overall, for invocation, function pointers are slightly faster than the alternatives. Using a strongly-typed function as input eliminated the difference. As for the copy and call scenario, all implementations over 16 bytes are the slowest, followed by 16-byte implementations, and then the 8-byte raw function pointer. The virtual dispatch and trampoline approach displayed similar overhead in calling, while dynamic heap allocation considerably impacted copy performance.
When it comes to a 16-byte callable object, the results demonstrated similar behavior to the previous case. All implementations have similar call performance, while the 16-byte implementations demonstrated much better copy-and-call performance. The difference might be explained by accumulated cache pressure.
In this article, we discussed the design choices for the interface and implementation of C++ callback classes. We claim that performance is impacted by both the dispatch and the storage implementation, and we covered many approaches to optimizing them. In the end, we implemented these trade-offs via a `PolicyCB` class and discussed its performance.