Backtrace is finally cheap by abusing x86/linux's shadow stack

Vic Luo
May 18, 2025

A backtrace is a very helpful debugging tool in native programming: it gives the source location at each call level. Unfortunately, capturing one is expensive. The common libunwind implementation reads the massive DWARF data in .eh_frame to recover frames from the stack, and falls back to the frame pointer-based approach when that fails. Even with optimizations like Google's work to compile DWARF into native code, or the recent trend of re-enabling frame pointers, the full process still takes considerable time. One reason is that the stack layout is not optimized for reading all frame pointers at once:

Stack: From High Address to Low Address:

[ 
    Return Addr to Frame 0 (caller of Frame 1),
    Saved RBP of Frame 0,
    Stack variables of Frame 1,
],
[ 
    Return Addr to Frame 1,
    Saved RBP of Frame 1,
    Stack variables of Frame 2,
],
[ 
    Return Addr to Frame 2,
    Saved RBP of Frame 2,
    Stack variables of Frame 3,
],
[ 
    Return Addr to Frame 3,
    Saved RBP of Frame 3,    <-- RBP
    Stack variables of Frame 4, <-- RSP
    [RED ZONE below current RSP]
]

As shown above, the saved RBPs of all frames, together with the current value in RBP, form a linked list. A program can walk the list and retrieve each frame's return address by reading the value adjacent to each saved RBP. The problem is that walking a linked list means chasing pointers, which is hostile to CPU caches. Moreover, the stack is often large because it also stores a lot of data, so it often does not fit entirely in the upper levels of the CPU cache. These performance constraints make people reluctant to capture backtraces in common operations such as logging warnings.
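The pointer chasing described above can be sketched in a few lines of C. This is only an illustration: the names frame_record and walk_frame_pointers are hypothetical, and the sketch assumes the code was built with frame pointers kept and that the outermost frame's saved RBP is NULL.

```c
#include <assert.h>
#include <stddef.h>

/* One frame record as laid out when frame pointers are kept:
 * [rbp] holds the caller's saved RBP, [rbp + 8] the return address.
 * (Hypothetical type name, for illustration only.) */
struct frame_record {
    const struct frame_record *caller_rbp;
    void *return_addr;
};

/* Follow the saved-RBP chain starting at fp and collect up to max return
 * addresses. Every iteration is a dependent load: the address of the next
 * record is only known after the current one has been read. */
size_t walk_frame_pointers(const struct frame_record *fp,
                           void **out, size_t max)
{
    assert(out != NULL || max == 0);
    size_t n = 0;
    while (fp != NULL && n < max) {
        out[n++] = fp->return_addr;
        fp = fp->caller_rbp;   /* pointer chase into a different stack region */
    }
    return n;
}
```

In a real program the starting record would come from something like __builtin_frame_address(0), and a robust walker would also have to validate each pointer before dereferencing it.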

Shadow stack

Shadow stack is mostly proposed as a security feature to defend against stack buffer overflows that overwrite return addresses on the stack. The idea is simple: create a parallel stack region that (1) stores only return addresses and nothing else, and (2) can only be written by a few instructions such as CALL and RET. On CALL, the return address (the address of the instruction following the CALL) is pushed to the shadow stack in addition to the regular stack. On RET, the on-stack return address is compared with the top shadow stack entry. If they match, the entry is popped; on a mismatch, an exception is raised.

The above description only covers near calls and returns within the same privilege level and segment. The CPU can technically push more than the return address to the shadow stack, but for simplicity and userspace-only usage we will skip those cases in this article.
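The CALL/RET rules described above can be modelled in software. The following is a toy model only, not how the hardware is implemented; cpu_model, model_call and model_ret are hypothetical names, and a false return from model_ret stands in for the exception the CPU would raise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_DEPTH 64

/* Toy model: the regular stack would also hold locals and saved registers
 * (omitted here); the shadow stack holds return addresses only. */
struct cpu_model {
    uintptr_t stack[MODEL_DEPTH];
    size_t sp;
    uintptr_t shadow[MODEL_DEPTH];
    size_t ssp;
};

/* CALL: push the return address onto both stacks. */
bool model_call(struct cpu_model *cpu, uintptr_t return_addr)
{
    assert(cpu != NULL);
    if (cpu->sp == MODEL_DEPTH || cpu->ssp == MODEL_DEPTH)
        return false;
    cpu->stack[cpu->sp++] = return_addr;
    cpu->shadow[cpu->ssp++] = return_addr;
    return true;
}

/* RET: take the top entry of each stack and compare them. Returning false
 * models the fault raised when the two copies disagree. */
bool model_ret(struct cpu_model *cpu, uintptr_t *return_addr)
{
    assert(cpu != NULL && return_addr != NULL);
    if (cpu->sp == 0 || cpu->ssp == 0)
        return false;
    uintptr_t from_stack = cpu->stack[--cpu->sp];
    uintptr_t from_shadow = cpu->shadow[--cpu->ssp];
    if (from_stack != from_shadow)
        return false;          /* return address was tampered with */
    *return_addr = from_stack;
    return true;
}
```

Overwriting an entry in stack[] between a model_call and the matching model_ret makes model_ret fail, which is the protection the hardware feature provides. The property this article relies on is the other one: shadow[] is a dense array of return addresses.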

Historically, shadow stack was implemented in software like in LLVM's -fsanitize=shadow-call-stack. Recently, both Intel and AMD announced hardware support:

  • For Intel, it's branded as Control-flow Enforcement Technology (CET) and has been included since Alder Lake
    CPUs in 2021
  • For AMD, the support only matured with Zen 3

Microsoft published a pretty good overview of shadow stacks in 2020 from a security perspective.

Shadow stack on x86-64 linux

Support for userspace shadow stack was only added in Linux 6.6. The kernel needs to be configured with CONFIG_X86_USER_SHADOW_STACK=y at compile time. However, average developers are not expected to use it by calling kernel APIs. Instead, when users pass -fcf-protection=return or =full to GCC, the resulting binary is marked with the x86 feature SHSTK in .note.gnu.property:

$ readelf -n /usr/bin/ls
Displaying notes found in: .note.gnu.property
  Owner                Data size 	Description
  GNU                  0x00000040	NT_GNU_PROPERTY_TYPE_0
      Properties: x86 feature: IBT, SHSTK
	x86 ISA needed: x86-64-baseline
	x86 feature used: x86, x87, XMM
	x86 ISA used: x86-64-baseline

At runtime, the dynamic linker of glibc (2.28+ required), /lib/ld-linux-x86-64.so.2, first detects whether the hardware supports shadow stacks, and then behaves differently based on the value of the glibc.cpu.x86_shstk tunable:

The glibc.cpu.x86_shstk tunable allows the user to control how the shadow stack (SHSTK) should be enabled. Accepted values are on, off, and permissive. on always turns on SHSTK regardless of whether SHSTK is enabled in the executable and its dependent shared libraries. off always turns off SHSTK regardless of whether SHSTK is enabled in the executable and its dependent shared libraries. permissive changes how dlopen works on non-CET shared libraries. By default, when SHSTK is enabled, dlopening a non-CET shared library returns an error. With permissive, it turns off SHSTK instead.

It is worth noting that as of today (May 2025), the hardware detection logic does not work well on my Zen 3 laptop. You have to manually tell glibc that shadow stack is supported via GLIBC_TUNABLES='glibc.cpu.hwcaps=SHSTK'. Also, if the other dynamic libraries you load are not compiled with -fcf-protection=full (one notable example is Debian), the shadow stack will not work at run time. The next section discusses how to bypass this constraint.

Below is a minimal example:

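#include <stdint.h>
#include <stdio.h>

#ifndef PAGE_SIZE
#define PAGE_SIZE 4096 /* x86-64 base page size */
#endif

/* Read the shadow stack pointer. RDSSPQ acts as a NOP when shadow stack is
 * disabled, so the register is zeroed first to return 0 in that case. */
static inline unsigned long __attribute__((always_inline)) get_ssp(void)
{
	unsigned long ret = 0;
	__asm__ volatile("xor %0, %0; rdsspq %0" : "=r" (ret));
	return ret;
}
// Crude code to dump shadow stack. Assuming that it's only 1 page
void dump_shstk(void) {
	uintptr_t ssp = (uintptr_t)get_ssp();
	if (ssp == 0) {
		printf("No shadow stack\n");
		return;
	}
	// Round ssp up to the next page boundary (unchanged if already aligned)
	uintptr_t ssp_end = (ssp & ((uintptr_t)PAGE_SIZE - 1U)) == 0 ? ssp : (ssp & ~((uintptr_t)PAGE_SIZE - 1U)) + PAGE_SIZE;
	printf("Current ssp: %p ssp_end: %p\n", (void*)ssp, (void*)ssp_end);

	// Entries are 8 bytes each; the entry at ssp belongs to the innermost call
	printf("shstks (from bottom to top): ");
	for (uintptr_t p = ssp; p < ssp_end; p += 8) {
		printf("%p ", *(void**)p);
	}
	printf("\n");
}

void fun(void) {
	dump_shstk();
}

int main(void)
{
	printf("Now ssp: %p\n", (void*)get_ssp());
	fun();
	return 0;
}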

Compiling it with gcc test1.c -fcf-protection=full -g and running it with GLIBC_TUNABLES='glibc.cpu.hwcaps=SHSTK' ./a.out prints:

Now ssp: 0x7546ec5fffe8
Current ssp: 0x7546ec5fffd8 ssp_end: 0x7546ec600000
shstks (from bottom to top): 0x558426fbe665 0x558426fbe6bd 0x7546ec7bf6b5 0x7546ec7bf769 0x558426fbe155

To interpret these return addresses, I also dumped /proc/self/maps, which shows the ASLR base of the program image. It can be used to recover file names and line numbers after the fact:

$ eu-addr2line -a -C -i --pretty-print  -M /tmp/proc_map_.out.18332
0x558426fbe665
0x0000558426fbe665: /tmp/test1.c:154:1
0x558426fbe6bd
0x0000558426fbe6bd: /tmp/test1.c:164:3
0x7546ec7bf6b5
0x00007546ec7bf6b5: /usr/src/debug/glibc/glibc/csu/../sysdeps/nptl/libc_start_call_main.h:74:3
0x7546ec7bf769
0x00007546ec7bf769: /usr/src/debug/glibc/glibc/csu/../csu/libc-start.c:128:20
 (inlined by) /usr/src/debug/glibc/glibc/csu/../csu/libc-start.c:347:5
0x558426fbe155
0x0000558426fbe155: ??:0

The memory region of the shadow stack can be found in /proc/self/maps as well. On my system, its size defaults to the stack size limit reported by ulimit -s.

558426fbd000-558426fbe000 r--p 00000000 00:23 1072                       /tmp/a.out
558426fbe000-558426fbf000 r-xp 00001000 00:23 1072                       /tmp/a.out
558426fbf000-558426fc0000 r--p 00002000 00:23 1072                       /tmp/a.out
558426fc0000-558426fc1000 r--p 00002000 00:23 1072                       /tmp/a.out
558426fc1000-558426fc2000 rw-p 00003000 00:23 1072                       /tmp/a.out
5584314c5000-5584314e6000 rw-p 00000000 00:00 0                          [heap]
7546ebe00000-7546ec600000 rw-p 00000000 00:00 0                          # <- this is shadow stack. 8M matches ulimit -s output
7546ec795000-7546ec798000 rw-p 00000000 00:00 0 
7546ec798000-7546ec7bc000 r--p 00000000 103:01 9178494                   /usr/lib/libc.so.6
7546ec7bc000-7546ec92c000 r-xp 00024000 103:01 9178494                   /usr/lib/libc.so.6
7546ec92c000-7546ec97a000 r--p 00194000 103:01 9178494                   /usr/lib/libc.so.6
7546ec97a000-7546ec97e000 r--p 001e1000 103:01 9178494                   /usr/lib/libc.so.6
7546ec97e000-7546ec980000 rw-p 001e5000 103:01 9178494                   /usr/lib/libc.so.6
7546ec980000-7546ec98a000 rw-p 00000000 00:00 0 
7546ec9bd000-7546ec9bf000 r--p 00000000 00:00 0                          [vvar]
7546ec9bf000-7546ec9c1000 r--p 00000000 00:00 0                          [vvar_vclock]
7546ec9c1000-7546ec9c3000 r-xp 00000000 00:00 0                          [vdso]
7546ec9c3000-7546ec9c4000 r--p 00000000 103:01 9178455                   /usr/lib/ld-linux-x86-64.so.2
7546ec9c4000-7546ec9ed000 r-xp 00001000 103:01 9178455                   /usr/lib/ld-linux-x86-64.so.2
7546ec9ed000-7546ec9f8000 r--p 0002a000 103:01 9178455                   /usr/lib/ld-linux-x86-64.so.2
7546ec9f8000-7546ec9fa000 r--p 00034000 103:01 9178455                   /usr/lib/ld-linux-x86-64.so.2
7546ec9fa000-7546ec9fb000 rw-p 00036000 103:01 9178455                   /usr/lib/ld-linux-x86-64.so.2
7546ec9fb000-7546ec9fc000 rw-p 00000000 00:00 0 
7ffee3b21000-7ffee3b42000 rw-p 00000000 00:00 0                          [stack]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]

This way, users can capture return addresses by directly memcpy-ing from the shadow stack, bypassing the slower frame pointer chasing logic. With return addresses this cheap to capture, it becomes practical to include them in the log output and to dump /proc/self/maps once at the start of the process. A Python script can then convert the addresses into file names and line numbers:

# App.log
# Start:
/proc/self/maps dump
MAP DUMP END
...
17:42:01.224411 SOME WARNING HAPPENED [!SHSTK 0x558426fbe665 0x558426fbe6bd 0x7546ec7bf6b5 0x7546ec7bf769 0x558426fbe155]

---
Later:
analyze-dump.py App.log outputs:

17:42:01.224411 SOME WARNING HAPPENED [!Backtrace /tmp/test1.c:154:1 /tmp/test1.c:164:3 libc_start_call_main.h:74:3 libc-start.c:128:20 /usr/src/debug/glibc/glibc/csu/../csu/libc-start.c:347:5 0x0000558426fbe155]
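The first step of such a conversion is finding which mapping an address belongs to and turning it into an ASLR-independent file offset. A minimal sketch of that lookup in C follows; map_entry, parse_maps_line and addr_to_file_offset are hypothetical names, and only the fields needed here are parsed from each maps line.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One parsed line of a /proc/<pid>/maps dump (subset of the fields). */
struct map_entry {
    uint64_t start, end, file_offset;
    char perms[5];
    char path[256];
};

/* Parse "start-end perms offset dev inode [path]". The path is optional,
 * so four converted fields are enough for success. */
bool parse_maps_line(const char *line, struct map_entry *e)
{
    assert(line != NULL && e != NULL);
    e->path[0] = '\0';
    int n = sscanf(line,
                   "%" SCNx64 "-%" SCNx64 " %4s %" SCNx64 " %*s %*s %255s",
                   &e->start, &e->end, e->perms, &e->file_offset, e->path);
    return n >= 4;
}

/* Translate a runtime address into an offset within the mapped file.
 * Returns false when the address lies outside this mapping. */
bool addr_to_file_offset(const struct map_entry *e, uint64_t addr,
                         uint64_t *offset)
{
    assert(e != NULL && offset != NULL);
    if (addr < e->start || addr >= e->end)
        return false;
    *offset = addr - e->start + e->file_offset;
    return true;
}
```

For the r-xp mapping of /tmp/a.out shown earlier, the address 0x558426fbe665 translates to file offset 0x1665. A symbolizer still has to map that file offset to an ELF virtual address and look it up in the debug information; tools such as eu-addr2line with -M do all of this in one step.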

Users could further tune dump_shstk to read only the most recent 512 bytes (64 frames at 8 bytes per entry) when the shadow stack is deep. This further limits the performance impact.
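A capped capture could look like the sketch below. The name capture_shstk and the constant MAX_CAPTURED_FRAMES are hypothetical, and the function takes the shadow stack bounds as parameters so that it does not assume how they were obtained: in real use, ssp would come from get_ssp() and ssp_end from the end of the shadow stack mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CAPTURED_FRAMES 64   /* 64 entries * 8 bytes = 512 bytes */

/* Copy at most max entries from the range [ssp, ssp_end) into out.
 * out[0] is the innermost return address because the shadow stack grows
 * down and ssp points at the most recently pushed entry. */
size_t capture_shstk(const uint64_t *ssp, const uint64_t *ssp_end,
                     uint64_t *out, size_t max)
{
    assert(ssp != NULL && ssp_end != NULL && ssp <= ssp_end);
    size_t avail = (size_t)(ssp_end - ssp);
    size_t n = avail < max ? avail : max;
    if (n > 0)
        memcpy(out, ssp, n * sizeof *out);   /* one contiguous copy */
    return n;
}
```

The copy touches at most eight consecutive 64-byte cache lines, in contrast to the frame pointer walk, where each step may land on a different line.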

Shadow stack without glibc

As discussed above, glibc's shadow stack support has a few caveats:

  1. The hardware detection logic still doesn't support Zen 3 as of today
  2. It requires all loaded libraries to be compiled with shadow stack enabled. Otherwise it will either refuse to load them or silently turn off the shadow stack
  3. It can only be configured at application startup via environment variables

It turns out that we can enable shadow stack without glibc's support by using the underlying syscall directly. The kernel's selftest gives us an example:

#include <sys/syscall.h>	/* __NR_arch_prctl */

/*
 * For use in inline enablement of shadow stack.
 *
 * This has to be a macro, as the program can't return
 * from the point where shadow stack gets enabled
 * because there will be no address on the shadow stack. So it can't use
 * syscall() for enablement, since it is a function.
 *
 */
#define ARCH_PRCTL(arg1, arg2)					\
({								\
	long _ret;						\
	register long _num  __asm__("eax") = __NR_arch_prctl;	\
	register long _arg1 __asm__("rdi") = (long)(arg1);		\
	register long _arg2 __asm__("rsi") = (long)(arg2);		\
								\
	__asm__ volatile (						\
		"syscall\n"					\
		: "=a"(_ret)					\
		: "r"(_arg1), "r"(_arg2),			\
		  "0"(_num)					\
		: "rcx", "r11", "memory", "cc"			\
	);							\
	_ret;							\
})

#define ARCH_SHSTK_ENABLE	0x5001
#define ARCH_SHSTK_DISABLE	0x5002
#define ARCH_SHSTK_SHSTK	(1ULL <<  0)

#define ENABLE_SHSTK() ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK)
#define DISABLE_SHSTK() ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK)

With this, developers could enable or disable shadow stack at runtime:

// FLAG_SET() and flag_enable_shstk stand in for the application's own flag handling
int main(int argc, char* argv[]) {
    if (FLAG_SET(flag_enable_shstk)) {
        ENABLE_SHSTK();
    }
    // doWork()
    if (FLAG_SET(flag_enable_shstk)) {
        DISABLE_SHSTK();
    }
    return 0;
}

Developers must disable shadow stack before returning from the function in which ENABLE_SHSTK() was called. Otherwise, that function's RET finds no matching entry on the shadow stack, and the program crashes.
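Because enabling can fail (for example on an older kernel or CPU), it may help to query the current state before trusting shadow stack contents. Unlike enabling, querying can go through the ordinary syscall() function. The sketch below assumes the ARCH_SHSTK_STATUS request (0x5005 in the kernel's asm/prctl.h), which is not part of the selftest excerpt above; shstk_enabled is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#if defined(__linux__) && defined(__x86_64__)
#include <sys/syscall.h>
#include <unistd.h>
#endif

/* Assumed values, taken from the kernel's asm/prctl.h. */
#define ARCH_SHSTK_STATUS 0x5005
#define ARCH_SHSTK_SHSTK  (1ULL << 0)

/* Returns 1 if shadow stack is enabled for the calling thread, 0 if it is
 * not, and -1 if the kernel or platform cannot answer the question. */
int shstk_enabled(void)
{
#if defined(__linux__) && defined(__x86_64__)
    unsigned long features = 0;
    /* The kernel writes the bitmask of enabled features to &features. */
    if (syscall(SYS_arch_prctl, ARCH_SHSTK_STATUS, &features) != 0)
        return -1;
    return (features & ARCH_SHSTK_SHSTK) ? 1 : 0;
#else
    return -1;
#endif
}
```

Calling shstk_enabled() right after ENABLE_SHSTK() is safe, since the call and its return both happen while shadow stack is already on.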

It's worth noting that this may break context-switching code that is unaware of shadow stacks, such as glibc's makecontext(), swapcontext() and longjmp(), or Boost.Context. With shadow stack enabled, a library has to save and restore the shadow stack pointer when switching context, and since this adds overhead, the feature is usually opt-in. These libraries are often used in coroutine or fiber implementations, so you need to double check whether they support this use case. Specifically, glibc only handles shadow stack correctly when glibc itself is compiled with -fcf-protection=return/full, and Boost.Context enables the feature only when the user's program is compiled with -fcf-protection=return/full.

Conclusion

Hardware-supported shadow stack is a powerful feature available on recent CPUs. It can be used to capture backtraces at unprecedented speed, which makes backtraces practical in a much wider range of places. Unfortunately, the software support for (ab)using it to capture backtraces is still immature and requires some hacks to make it work. Hopefully this article will raise awareness of this usage and help improve upstream support.