Vic Luo
May 18, 2025
A backtrace is a very helpful debugging tool in native programming because it gives the source location at each call level. Unfortunately, capturing one is expensive. The common libunwind implementation first tries to read the large DWARF data in the .eh_frame section to recover frames from the stack, and falls back to the frame pointer-based approach when that fails. Even with optimizations such as Google's work to compile DWARF into native code, or the recent trend of re-enabling frame pointers, the full process still takes considerable time. One of the reasons is that the stack layout is not optimized for reading all frame pointers at once:
As we can see above, the saved RBP of every frame, together with the current value of RBP, forms a linked list. A program can walk this list and retrieve the return address of each frame by reading the value stored next to each saved RBP. The problem is that walking a linked list is pointer chasing, which is hostile to CPU caches. Moreover, the stack is often very large because it stores a lot of other data, so it frequently does not fit into the fastest levels of the CPU cache. These performance constraints make people reluctant to include a backtrace in common operations such as warning messages.
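The walk described above can be sketched as follows. This is a minimal model, not libunwind's implementation: `struct frame_record` and `walk_frame_pointers` are hypothetical names, and the sketch assumes that the outermost frame stores a null saved RBP to terminate the chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout of one frame record on x86-64 when frame pointers are enabled:
 * the caller's saved RBP sits at [rbp], the return address at [rbp + 8]. */
struct frame_record {
    const struct frame_record *saved_rbp; /* link to the caller's record */
    uintptr_t return_address;             /* value pushed by CALL */
};

/* Walk the chain starting at `rbp` and store up to `max` return addresses
 * in `out`, innermost frame first. Each iteration depends on a load issued
 * by the previous one, which is the pointer chasing that makes this slow
 * when the records are spread over a large stack. */
static size_t walk_frame_pointers(const struct frame_record *rbp,
                                  uintptr_t *out, size_t max)
{
    assert(out != NULL);
    size_t n = 0;
    while (rbp != NULL && n < max) {
        out[n++] = rbp->return_address;
        rbp = rbp->saved_rbp; /* assumption: NULL marks the outermost frame */
    }
    return n;
}
```

In a real program the starting pointer could come from `__builtin_frame_address(0)`, provided every function on the stack was compiled with `-fno-omit-frame-pointer`.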
The shadow stack was proposed mainly as a security feature that defends against stack buffer overflows overwriting return addresses on the stack. The idea is simple: create a parallel stack region that (1) stores only return addresses and nothing else, and (2) can be written only by a few instructions such as CALL and RET. On CALL, the return address (the instruction pointer after the CALL) is pushed to the shadow stack as well as to the normal stack. On RET, the on-stack return address is compared with the top shadow stack entry; if they match, the shadow stack entry is popped, and if they do not, an exception is raised.
The description above covers only near calls and returns within the same privilege level and segment. In other cases the CPU can push more than the return address to the shadow stack, but for simplicity, and because this article concerns userspace only, we skip them.
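The CALL and RET rules above can be modeled in a few lines of software. The sketch below is a toy model under simplifying assumptions, not how the hardware is implemented: the normal stack holds nothing but return addresses, both stacks share one depth counter, and `struct model_cpu`, `model_call`, and `model_ret` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_DEPTH 64

/* Toy machine state: a normal stack that software may overwrite, and a
 * shadow stack that only model_call() and model_ret() touch. */
struct model_cpu {
    uintptr_t stack[MODEL_DEPTH];
    uintptr_t shadow[MODEL_DEPTH];
    size_t depth;
};

/* CALL: push the return address onto both stacks. */
static void model_call(struct model_cpu *cpu, uintptr_t return_address)
{
    assert(cpu->depth < MODEL_DEPTH);
    cpu->stack[cpu->depth] = return_address;
    cpu->shadow[cpu->depth] = return_address;
    cpu->depth++;
}

/* RET: compare both copies. On a match, pop both and report the target
 * through *target. On a mismatch, return false and leave the state
 * untouched; this stands in for the exception the hardware would raise. */
static bool model_ret(struct model_cpu *cpu, uintptr_t *target)
{
    assert(cpu->depth > 0);
    size_t top = cpu->depth - 1;
    if (cpu->stack[top] != cpu->shadow[top])
        return false;
    *target = cpu->stack[top];
    cpu->depth = top;
    return true;
}
```

Overwriting an entry of `stack` between a call and its return makes `model_ret` fail, which is the situation the security feature is designed to catch.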
Historically, shadow stacks were implemented in software, as in LLVM's -fsanitize=shadow-call-stack. Recently, both Intel and AMD announced hardware support:
Microsoft published a good overview of shadow stacks from a security perspective in 2020.
Support for userspace shadow stacks was only added in Linux 6.6. The kernel needs to be configured with CONFIG_X86_USER_SHADOW_STACK=y at compile time. However, average developers are not expected to use it by calling kernel APIs directly. Instead, when users pass -fcf-protection=return or -fcf-protection=full to GCC, GCC marks the binary with the x86 feature SHSTK in .note.gnu.property:
At run time, the dynamic linker of glibc (2.28 or later), /lib/ld-linux-x86-64.so.2, first detects whether the hardware supports shadow stacks and then behaves differently depending on the value of the glibc.cpu.x86_shstk tunable:
The glibc.cpu.x86_shstk tunable allows the user to control how the shadow stack (SHSTK) should be enabled. Accepted values are on, off, and permissive. on always turns on SHSTK regardless of whether SHSTK is enabled in the executable and its dependent shared libraries. off always turns off SHSTK regardless of whether SHSTK is enabled in the executable and its dependent shared libraries. permissive changes how dlopen works on non-CET shared libraries. By default, when SHSTK is enabled, dlopening a non-CET shared library returns an error. With permissive, it turns off SHSTK instead.
It is worth noting that as of today (May 2025), the hardware detection logic does not work well on my Zen 3 laptop. You have to tell glibc manually that the feature is supported via GLIBC_TUNABLES='glibc.cpu.hwcaps=SHSTK'. Also, if the other dynamic libraries loaded by the program are not compiled with -fcf-protection=full (Debian is one notable example), the shadow stack will not work at run time. The next section discusses how to bypass this constraint.
Below is a minimal example:
Compiling it with gcc test1.c -fcf-protection=full -g and running it with GLIBC_TUNABLES='glibc.cpu.hwcaps=SHSTK' ./a.out gives this output:
To interpret these addresses, I have also dumped /proc/self/maps, which shows the ASLR base of the program image. It can be used to recover file names and line numbers after the fact:
The memory region of the shadow stack can be inferred from /proc/self/maps as well. On current systems, its size defaults to the stack size limit reported by ulimit -s.
This way, a program can capture the return addresses of all frames with a single memcpy from the shadow stack, bypassing the slower frame-pointer chasing logic. With return addresses this cheap to obtain, it becomes practical to include them in the log output and to dump /proc/self/maps once at the start of the process. We can write a Python script to convert these addresses into file names and line numbers:
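The address arithmetic behind such a conversion is small. As a rough sketch in C rather than Python, the code below parses one line of a maps dump and rebases a captured address against the image's load base. `struct map_entry`, `parse_maps_line`, and `image_relative` are hypothetical names. The sketch assumes a position-independent executable whose first mapping has file offset 0 and whose ELF virtual addresses start at 0, and it assumes paths without spaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One parsed line of /proc/<pid>/maps. */
struct map_entry {
    uint64_t start;  /* first address of the mapping */
    uint64_t end;    /* one past the last address */
    uint64_t offset; /* offset into the backing file */
    char perms[5];   /* e.g. "r-xp" */
    char path[256];  /* backing file, empty for anonymous mappings */
};

/* Parse "start-end perms offset dev inode [path]". Returns false when the
 * line does not contain at least the four leading fields. */
static bool parse_maps_line(const char *line, struct map_entry *e)
{
    assert(line != NULL && e != NULL);
    unsigned long long start, end, offset;
    e->path[0] = '\0';
    int n = sscanf(line, "%llx-%llx %4s %llx %*s %*s %255s",
                   &start, &end, e->perms, &offset, e->path);
    if (n < 4)
        return false;
    e->start = start;
    e->end = end;
    e->offset = offset;
    return true;
}

/* Rebase a runtime address against the image's first mapping (the one
 * with file offset 0), undoing the ASLR shift. */
static uint64_t image_relative(uint64_t addr, const struct map_entry *base)
{
    assert(base->offset == 0 && addr >= base->start);
    return addr - base->start;
}
```

Under the assumptions above, the rebased value is what a symbolizer expects, for example `addr2line -e a.out 0x1149`.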
Users can further tune dump_shstk to read only the most recent 512 bytes (64 frames at 8 bytes per entry) when the shadow stack is too large. This should further limit the performance impact.
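A capped copy of this kind might look like the sketch below. It is not the article's dump_shstk; `copy_recent_entries` and `SHSTK_MAX_ENTRIES` are hypothetical names, and the caller is assumed to supply the current shadow stack pointer and the top of the shadow stack region. Because the shadow stack grows downward, the most recent entries start at the shadow stack pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SHSTK_MAX_ENTRIES 64 /* 64 entries * 8 bytes = 512 bytes */

/* Copy at most SHSTK_MAX_ENTRIES return addresses from [ssp, top) into
 * `out`, innermost frame first, and return how many were copied.
 * `ssp` is the current shadow stack pointer and `top` is one past the
 * oldest entry; both are supplied by the caller. */
static size_t copy_recent_entries(const uint64_t *ssp, const uint64_t *top,
                                  uint64_t *out)
{
    assert(ssp != NULL && top != NULL && out != NULL);
    assert(ssp <= top);
    size_t n = (size_t)(top - ssp);
    if (n > SHSTK_MAX_ENTRIES)
        n = SHSTK_MAX_ENTRIES;
    memcpy(out, ssp, n * sizeof(uint64_t)); /* one contiguous read */
    return n;
}
```

The cap bounds the cost of each capture at 512 bytes regardless of how deep the call stack is, at the price of truncating very deep backtraces.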
As discussed above, glibc's shadow stack support has a few caveats:
It turns out that we can enable the shadow stack without glibc's support by using the underlying syscall directly. The kernel's selftest gives us an example:
With this, developers could enable or disable shadow stack at runtime:
Developers must disable the shadow stack in the same function in which ENABLE_SHSTK() is called. Otherwise, the RET instruction will find no matching shadow stack entry on return and crash the program.
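Since the shadow stack may now be on or off at different points in the program, code that reads it can check the state first. The sketch below is one way to do that, with assumptions worth stating: the constants are copied from the kernel's arch/x86/include/uapi/asm/prctl.h, `shstk_is_enabled` and `read_ssp` are hypothetical names, and the sketch relies on RDSSPQ acting as a no-op that leaves its operand unchanged while the shadow stack is inactive.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Values from the kernel's uapi prctl.h, defined here because older libc
 * headers do not provide them. */
#ifndef ARCH_SHSTK_STATUS
#define ARCH_SHSTK_STATUS 0x5005
#endif
#ifndef ARCH_SHSTK_SHSTK
#define ARCH_SHSTK_SHSTK (1ULL << 0)
#endif

/* Returns 1 if the shadow stack is active for this thread, 0 if it is
 * not, and -1 if the kernel or platform cannot answer the query. */
static int shstk_is_enabled(void)
{
#if defined(__x86_64__) && defined(__linux__)
    unsigned long long features = 0;
    if (syscall(SYS_arch_prctl, ARCH_SHSTK_STATUS, &features) != 0)
        return -1; /* e.g. kernel built without user shadow stack support */
    return (features & ARCH_SHSTK_SHSTK) ? 1 : 0;
#else
    return -1;
#endif
}

/* Returns the current shadow stack pointer, or 0 when the shadow stack is
 * inactive: RDSSPQ then leaves the zero-initialized register untouched. */
static uint64_t read_ssp(void)
{
    uint64_t ssp = 0;
#if defined(__x86_64__)
    __asm__ volatile("rdsspq %0" : "+r"(ssp));
#endif
    return ssp;
}
```

A backtrace helper could return early when `read_ssp()` yields 0 instead of dereferencing an invalid pointer.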
It's worth noting that this may break context-switching code that is unaware of the shadow stack, such as glibc's makecontext(), swapcontext(), and longjmp(), as well as Boost.Context. Enabling the shadow stack requires a library to save and restore the shadow stack pointer when switching context, and since this adds overhead, the feature is usually opt-in. These libraries are often used in coroutine or fiber implementations, so you need to double check whether they support this use case. Specifically, glibc handles the shadow stack correctly only when glibc itself is compiled with -fcf-protection=return or -fcf-protection=full, and Boost.Context enables the feature only when the user's program is compiled with one of those options.
The hardware-supported shadow stack is a powerful feature available on recent CPUs. It can be used to capture backtraces at unprecedented speed, which makes backtraces practical in a much wider range of places. Unfortunately, the software support for (ab)using it to capture backtraces is still immature and requires some hacks to make it work. Hopefully this article will raise awareness of this usage and help improve upstream support.