2017-02-22



alvinashcraft
shared this story
from Nick Fitzgerald.

I wrote a Rust crate to demangle C++ symbols: cpp_demangle. Find it
on crates.io and github.

C++ symbols are mangled? Why?

Yes.

Linkers only support C identifiers for symbol names. They don’t have any
knowledge of C++’s namespaces, monomorphization of a template function,
overloaded functions, etc. That means that the C++ compiler needs to generate C
identifier compatible symbols for C++ constructs. This process is called
“mangling”, the resulting symbol is a “mangled symbol”, and reconstructing the
original C++ name is “demangling”.

Let’s take an overloaded function as a concrete example. Say there are two
overloads: one that takes a char* and another that takes an int:

Although overloaded is a valid C identifier that the linker could understand,
we can’t name both versions of the function overloaded because we need to
differentiate them from each other for the linker.

We can see mangling in action if we run nm on the object file:

There is some prefixed gunk and then the types of parameters got mangled
themselves and appended to the end of the function name: Pc for a pointer to a
char, and i for int.

Here is a more complex example:

This time the mangled symbols are a bit less human readable, and its a bit less
obvious where some of the mangled bits even came from:

Do compilers have to agree on the mangling format?

If they want code compiled from one compiler to inter-operate with code compiled
from another compiler (or even different versions of the same compiler), then
yes, they need to agree on names.

These days, almost every C++ compiler uses
the Itanium C++ ABI’s name mangling rules. The notable exception is
MSVC, which uses a completely different format.

We’ve been looking at the Itanium C++ ABI style mangled names, and that’s what
cpp_demangle supports.

Why demangle C++ symbols in Rust?

A few reasons:

I wanted to demangle some C++ symbols
for gimli’s addr2line clone. But I also didn’t want to
introduce complexity into the build system for some old C code, nor a
dependency on a system library outside of the Rust ecosystem, nor spawn a
c++filt subprocess.

Tom Tromey, a GNU hacker and buddy of mine, mentioned that
historically, the canonical C++ demangler in libiberty (used by c++filt
and gdb) has had tons of classic C bugs: use-after-free, out-of-bounds array
accesses, etc, and that it falls over immediately when faced with a fuzzer. In
fact, there were so many of these issues that gdb went so far as to install
a signal handler to catch SIGSEGVs during demangling. It “recovered” from
the segfaults by longjmping out of the signal handler and printing a warning
message before moving along and pretending that nothing happened. My ears
perked up. Those are the exact kinds of things Rust protects us from at
compile time! A robust alternative might actually be a boon not just for the
Rust community, but everybody who wants to demangle C++ symbols.

Finally, I was looking for something to kill a little bit of free time.

What I didn’t anticipate was how complicated parsing mangled C++ symbols
actually is! The second bullet point should have been a hint. I hadn’t looked to
deeply into how C++ symbols are mangled, and I expected simple replacement
rules. But if that was the case, why would the canonical demangler have had so
many bugs?

What makes demangling C++ symbols difficult?

Maybe I’ve oversold how complicated it is a little bit: it isn’t terribly
difficult, its just more difficult than I expected it was going to be. That
said, here are some things I didn’t expect.

The grammar is pretty huge, and has to allow encoding all of C++’s
expressions, e.g. because of decltype. And its not just the grammar that’s
huge, the symbols themselves are too. Here is a pretty big mangled C++ symbol
from SpiderMonkey, Firefox’s JavaScript engine:

That’s 355 bytes!

And here is its demangled form – even bigger, with a whopping 909 bytes!

The mangled symbol is smaller than the demangled form because the name mangling
algorithm also specifies a symbol compression algorithm:

To minimize the length of external names, we use two mechanisms, a
substitution encoding to eliminate repetition of name components, and
abbreviations for certain common names. Each non-terminal in the grammar above
for which <substitution> appears on the right-hand side is both a source of
future substitutions and a candidate for being substituted.

After the first appearance of a type in the mangled symbol, all subsequent
references to it must actually be back references pointing to the first
appearance. The back references are encoded as S i _ where
i is the i + 1th substitutable AST node in an in-order walk of
the AST (and S_ is the 0th).

We have to be careful not to mess up the order in which we populate the
substitutions table. If we got it wrong, then all the back references will point
to the wrong component, and the final output of demangling will be
garbage. cpp_demangle uses a handwritten recursive descent parser. This makes
it easy to build the substitutions table as we parse the mangled symbol,
avoiding a second back reference resolution pass over the AST after
parsing. Except there’s a catch: the grammar contains left-recursion, which will
cause naive recursive descent parsers to infinitely recurse and blow the
stack. Can we transform the grammar to remove the left-recursion?
Nope. Although we would parse the exact same set of mangled symbols, the
resulting parse tree would be different, invalidating our substitutions table
and back references! Instead, we resign ourselves to peeking ahead a character
and switching on it.

The compression also means that the demangled symbol can be exponentially larger
than the mangled symbol in the worst case! Here is an example that makes it
plain:

The mangled lets_get_exponential symbol is 91 bytes long, while the demangled
symbol is 5902 bytes long.

The final thing to watch out for with compression and back references is
malicious input that creates cycles with the references. If we aren’t checking
for and defending against cycles, we could hang indefinitely in the best case,
or more likely infinitely recurse and blow the stack.

What’s the implementation status?

The cpp_demangle crate can parse every mangled C++ symbol I’ve thrown at it,
and it can demangle them too. However, it doesn’t always match libiberty’s C++
demangler’s formatting character-for-character. I’m currently in the process of
getting all of libiberty’s C++ demangling tests passing.

Additionally, I’ve been running American Fuzzy Lop (with afl.rs) on
cpp_demangle overnight. It found a panic involving unhandled integer overflow,
which I fixed. Since then, AFL hasn’t triggered any panics, and its never been
able to find a crash (thanks Rust!) so I think cpp_demangle is fairly solid
and robust.

If you aren’t too particular about formatting, then cpp_demangle is ready for
you right now. If you do care more about formatting, it will be ready for you
soon.

What’s next?

Finish getting all the tests in libiberty’s test suite passing, so
cpp_demangle is character-for-character identical with libiberty’s C++
demangler.

Add a C API, so that non-Rust projects can use cpp_demangle.

Add benchmarks and compare performance with libiberty’s C++ demangler. Make
sure cpp_demangle is faster ;-)

A couple other things listed in the github issues.

That’s it! See you later, and happy demangling!

Thanks to Tom Tromey for sharing GNU war stories with me, and providing
feedback on an early draft of this blog post.

Show more