alvinashcraft
shared this story
from Joe Duffy's Blog.
In my first Midori post, I described how safety was the
foundation of everything we did. I mentioned that we built an operating system out of safe code, and yet stayed
competetive with operating systems like Windows and Linux written in C and C++. In many ways, system architecture
played a key role, and I will continue discussing how in future posts. But, at the foundation, an optimizing compiler
that often eeked out native code performance from otherwise "managed", type- and memory-safe code, was one of our most
important weapons. In this post, I'll describe some key insights and techniques that were essential to our success.
Overview
When people think of C#, Java, and related languages, they usually think of Just-In-Time (JIT) compilation. Especially back in the mid-2000s when Midori began. But
Midori was different, using more C++-like Ahead-Of-Time (AOT) compilation from the outset.
AOT compiling managed, garbage collected code
presents some unique challenges compared to C and C++. As a result, many AOT efforts don't achieve parity with their
native counterparts. .NET's NGEN technology is a good example of this. In fact, most efforts in .NET have exclusively
targeted startup time; this is clearly a key metric, but when you're building an operating system and everything on top,
startup time just barely scratches the surface.
Over the course of 8 years, we were able to significantly narrow the gap between our version of C# and classical C/C++
systems, to the point where basic code quality, in both size of speed dimensions, was seldom the deciding factor when
comparing Midori's performance to existing workloads. In fact, something counter-intuitive happened. The ability to
co-design the language, runtime, frameworks, operating system, and the compiler -- making tradeoffs in one area to gain
advantages in other areas -- gave the compiler far more symbolic information than it ever had before about the program's
semantics and, so, I dare say, was able to exceed C and C++ performance in a non-trivial number of situations.
Before diving deep, I have to put in a reminder. The architectural decisions -- like Async Everywhere and Zero-Copy IO (coming soon) -- had more to do with us
narrowing the gap at a "whole system" level. Especially the less GC-hungry way we wrote systems code. But the
foundation of a highly optimizing compiler, that knew about and took advantage of safety, was essential to our results.
I would also be remiss if I didn't point out that the world has made considerable inroads in this area alongside us.
Go has straddled an elegant line between systems performance and safety. Rust is just plain awesome. The .NET Native and, related, Android Runtime projects have brought a nice taste of AOT to C# and Java in a more
limited way, as a "silent" optimization technique to avoid mobile application lag caused by JITting. Lately, we've
been working on bringing AOT to a broader .NET setting with the CoreRT project.
Through that effort I hope we can bring some of the lessons learned below to a real-world setting. Due to the
delicate balance around breaking changes it remains to be seen how far we can go. It took us years to get
everything working harmoniously, measured in man-decades, however, so this transfer of knowledge will take time.
First thing's first. Let's quickly recap: What's the difference between native and managed code, anyway?
What's the same
I despise the false dichotomy "native and managed," so I must apologize for using it. After reading this article, I
hope to have convinced you that it's a continuum. C++ is safer these days than ever before, and likewise, C# performant. It's amusing how many of these lessons apply
directly to the work my team is doing on Safe C++ these days.
So let's begin by considering what's the same.
All the basic dragon book topics apply to managed as
much as they do native code.
In general, compiling code is a balancing act between, on one hand emitting the most efficient instruction sequences
for the target architecture, to execute the program quickly; and on the other hand emitting the smallest encoding of
instructions for the target architecture, to store the program compactly and effectively use the memory system on the
target device. Countless knobs exist on your favorite compiler to dial between the two based on your scenario. On
mobile, you probably want smaller code, whereas on a multimedia workstation, you probably want the fastest.
The choice of managed code doesn't change any of this. You still want the same flexibility. And the techniques you'd
use to achieve this in a C or C++ compiler are by and large the same as what you use for a safe language.
You need a great inliner. You want common subexpression
elimination (CSE), constant propagation and folding, strength reduction,
and an excellent loop optimizer. These days, you probably want to
use static single assignment form (SSA), and some unique
SSA optimizations like global value numbering (although you need
to be careful about working set and compiler throughput when using SSA everywhere). You will need specialized machine
dependent optimizers for the target architectures that are important to you, including register allocators. You'll eventually want a global analyzer that does interprocedural
optimizations, link-time code-generation to extend those interprocedural optimizations across passes, a vectorizer for modern processors (SSE, NEON, AVX, etc.), and most definitely
profile guided optimizations (PGO) to inform all of the
above based on real-world scenarios.
Although having a safe language can throw some interesting curveballs your way that are unique and interesting -- which
I'll cover below -- you'll need all of the standard optimizing compiler things.
I hate to say it, but doing great at all of these things is "table stakes." Back in the mid-2000s, we had to write
everything by hand. Thankfully, these days you can get an awesome off-the-shell optimizing compiler like LLVM that has most of these things already battle tested, ready to go, and ready for you to help improve.
What's different
But, of course, there are differences. Many. This article wouldn't be very interesting otherwise.
The differences are more about what "shapes" you can expect to be different in the code and data structures thrown at
the optimizer. These shapes come in the form of different instruction sequences, logical operations in the code that
wouldn't exist in the C++ equivalent (like more bounds checking), data structure layout differences (like extra object
headers or interface tables), and, in most cases, a larger quantity of supporting runtime data structures.
Objects have "more to them" in most managed languages, compared to frugal data types in, say, C. (Note that C++ data
structures are not nearly as frugal as you might imagine, and are probably closer to C# than your gut tells you.) In
Java, every object has a vtable pointer in its header. In C#, most do, although structs do not. The GC can impose
extra layout restrictions, such as padding and a couple words to do its book-keeping. Note that none of this is really
specific to managed languages -- C and C++ allocators can inject their own words too, and of course, many C++ objects
also carry vtables -- however it's fair to say that most C and C++ implementations tend to be more economical in these
areas. In most cases, for cultural reasons more than hard technical ones. Add up a few thousand objects in a heap,
especially when your system is built of many small processes with isolated heaps, like Midori, and it adds up quickly.
In Java, you've got a lot more virtual dispatch, because methods are virtual by default. In C#, thankfully, methods
are non-virtual by default. (We even made classes sealed by default.) Too much virtual dispatch can totally screw
inlining which is a critical optimization to have for small functions. In managed languages you tend to have more
small functions for two reasons: 1) properties, and 2) higher level programmers tend to over-use abstraction.
Although it's seldom described this formally, there's an "ABI" (Application Binary Interface) that governs interactions between code and the runtime.
The ABI is where the rubber meets the road. It's where things like calling conventions, exception handling, and, most
notably, the GC manifest in machine code. This is not unique to managed code! C++ has a "runtime" and therfore an
ABI too. It's just that it's primarily composed of headers, libraries like allocators, and so on, that are more
transparently linked into a program than with classical C# and Java virtual machines, where a runtime is non-negotiable
(and in the JIT case, fairly heavy-handed). Thinking of it this way has been helpful to me, because the isomorphisms
with C+ suddenly become immediately apparent.
The real biggie is array bounds checks. A traditional approach is to check that the index is within the bounds of an
array before accessing it, either for laoding or storing. That's an extra field fetch, compare, and conditional
branch. Branch prediction these days is quite good, however it's just
plain physics that if you do more work, you're going to pay for it. Interestingly, the work we're doing with C++'s
array_view<T> incurs all these same costs.
Related to this, there can be null checks where they didn't exist in C++. If you perform a method dispatch on a null
object pointer in C++, for example, you end up running the function anyway. If that function tries to access this,
it's bound to AV, but in Java and .NET, the compiler is required
(per specification) to explicitly check and throw an exception in these cases, before the call even occurs. These
little branches can add up too. We eradicated such checks in favor of C++ semantics in optimized builds.
In Midori, we compiled with overflow checking on by default. This is different from stock C#, where you must explicitly
pass the /checked flag for this behavior. In our experience, the number of surprising overflows that were caught,
and unintended, was well worth the inconvenience and cost. But it did mean that our compiler needed to get really good
at understanding how to eliminate unnecessary ones.
Static variables are very expensive in Java and .NET. Way more than you'd expect. They are mutable and so cannot be
stored in the readonly segment of an image where they are shared across processes. And my goodness, the amount of
lazy-initialization checking that gets injected into the resulting source code is beyond belief. Switching from
preciseinit to beforefieldinit semantics in .NET helps a little bit, since the checks needn't happen on every
access to a static member -- just accesses to the static variable in question -- but it's still disgusting compared to
a carefully crafted C program with a mixture of constant and intentional global initialization.
The final major area is specific to .NET: structs. Although structs help to alleviate GC pressure and hence are a
good thing for most programs, they also carry some subtle problems. The CLI specifies surprising behavior around their
initialization, for example. Namely if an exception happens during construction, the struct slot must remain zero-
initialized. The result is that most compilers make defensive copies. Another example is that the compiler must
make a defensive copy anytime you call a function on a readonly struct. It's pretty common for structs to be copied
all over the place which, when you're counting cycles, hurts, especially since it often means time spent in memcpy.
We had a lot of techniques for addressing this and, funny enough, I'm pretty sure when all was said and done, our
code quality here was better than C++'s, given all of its RAII, copy constructor, destructor, and so on, penalties.
Compilation Architecture
Our architecture involved three major components:
C# Compiler: Performs lexing, parsing, and semantic analysis. Ultimately
translates from C# textual source code into a CIL-based
intermediate representation (IR).
Bartok: Takes in said IR, does high-level MSIL-based analysis,
transformations, and optimizations, and finally lowers this IR to something a bit closer to a more concrete machine
representation. For example, generics are gone by the time Bartok is done with the IR.
Phoenix: Takes in this lowered IR, and goes to town on
it. This is where the bulk of the "pedal to the metal" optimizations happen. The output is machine code.
The similarities here with Swift's compiler design, particularly SIL, are evident. The .NET Native project also
mirrors this architecture somewhat. Frankly, most AOT compilers for high level languages do.
In most places, the compiler's internal representation leveraged static single assignment form (SSA). SSA was preserved until very late in the compilation.
This facilitated and improved the use of many of the classical compiler optimizations mentioned earlier.
The goals of this architecture included:
Facilitate rapid prototyping and experimentation.
Produce high-quality machine code on par with commerical C/C++ compilers.
Support debugging optimized machine code for improved productivity.
Facilitate profile-guided optimizations based on sampling and/or instrumenting code.
Suitable for self-host:
The resulting compiled compiler is fast enough.
It is fast enough that the compiler developers enjoy using it.
It is easy to debug problems when the compiler goes astray.
Finally, a brief warning. We tried lots of stuff. I can't remember it all. Both Bartok and Phoenix existed for years
before I even got involved in them. Bartok was a hotbed of research on managed languages -- ranging from optimizations
to GC to software transactional memory -- and Phoenix was meant to replace the shipping Visual C++ compiler. So,
anyway, there's no way I can tell the full story. But I'll do my best.
Optimizations
Let's go deep on a few specific areas of classical compiler optimizations, extended to cover safe code.
Bounds check elimination
C# arrays are bounds checked. So were ours. Although it is important to eliminate superfluous bounds checks in regular
C# code, it was even more so in our case, because even the lowest layers of the system used bounds checked arrays. For
example, where in the bowels of the Windows or Linux kernel you'd see an int*, in Midori you'd see an int[].
To see what a bounds check looks like, consider a simple example:
Here's is an example of the resulting machine code for the inner loop array access, with a bounds check:
If you're doing this bookkeeping on every loop iteration, you won't get very tight loop code. And you're certianly not
going to have any hope of vectorizing it. So, we spent a lot of time and energy trying to eliminate such checks.
In the above example, it's obvious to a human that no bounds checking is necessary. To a compiler, however, the
analysis isn't quite so simple. It needs to prove all sorts of facts about ranges. It also needs to know that a
isn't aliased and somehow modified during the loop body. It's surprising how hard this problem quickly becomes.
Our system had multiple layers of bounds check eliminations.
First it's important to note that CIL severely constraints an optimizer by being precise in certain areas. For example,
accessing an array out of bounds throws an IndexOutOfRangeException, similar to Java's ArrayOutOfBoundsException.
And the CIL specifies that it shall do so at precisely the exception that threw it. As we will see later on, our
error model was more relaxed. It was based fail-fast and permitted code motion that led to inevitable failures
happening "sooner" than they would have otherwise. Without this, our hands would have been tied for much of what I'm
about to discuss.
At the highest level, in Bartok, the IR is still relatively close to the program input. So, some simple patterns could
be matched and eliminated. Before lowering further, the ABCD algorithm -- a straightforward value range analysis based on
SSA -- then ran to eliminate even more common patterns using a more principled approach than pattern matching. We were
also able to leverage ABCD in the global analysis phase too, thanks to inter-procedural length and control flow fact
propagation.
Next up, the Phoenix Loop Optimizer got its hands on things. This layer did all sorts of loop optimizations and, most
relevant to this section, range analysis. For example:
Loop materialization: this analysis actually creates loops. It recognizes repeated patterns of code that would be
more ideally represented as loops, and, when profitable, rewrites them as such. This includes unrolling hand-rolled
loops so that a vectorizer can get its hands on them, even if they might be re-unrolled later on.
Loop cloning, unrolling, and versioning: this analysis creates copies of loops for purposes of specialization. That
includes loop unrolling, creating architectural-specific versions of a vectorized loop, and so on.
Induction range optimization: this is the phase we are most
concerned with in this section. It uses induction range analysis to remove unnecessary checks, in addition to doing
classical induction variable optimizations such as widening. As a byproduct of this phase, bounds checks were
eliminated and coalesced by hoisting them outside of loops.
This sort of principled analysis was more capable than what was shown earlier. For example, there are ways to write
the earlier loop that can easily "trick" the more basic techniques discussed earlier:
You get the point. Clearly at some point you can screw the optimizer's ability to do anything, especially if you
start doing virtual dispatch inside the loop body, where aliasing information is lost. And obviously, things get more
difficult when the array length isn't known statically, as in the above example of 100. All is not lost, however,
if you can prove relationships between the loop bounds and the array. Much of this analysis requires special knowledge
of the fact that array lengths in C# are immutable.
At the end of the day, doing a good job at optimizing here is the difference between this:
And the following, completely optimized, bounds check free, loop:
It's amusing that I'm now suffering deja vu as we go through this same exercise with C++'s new array_view<T> type.
Sometimes I joke with my ex-Midori colleagues that we're destined to repeat ourselves, slowly and patiently, over the
course of the next 10 years. I know that sounds arrogant. But I have this feeling on almost a daily basis.
Overflow checking
As mentioned earlier, in Midori we compiled with checked arithmetic by default (by way of C#'s /checked flag). This
eliminated classes of errors where developers didn't anticipate, and therefore code correctly for, overflows. Of
course, we kept the explicit checked and unchecked scoping constructs, to override the defaults when appropriate,
but this was preferable because a programmer declared her intent.
Anyway, as you might expect, this can reduce code quality too.
For comparison, imagine we're adding two variables:
Now imagine x is in ECX and y is in EDX. Here is a standard unchecked add operation:
Or, if you want to get fancy, one that uses the LEA instruction to also store the result in the EAX register using
a single instruction, as many modern compilers might do:
Well, here's the equivalent code with a bounds check inserted into it:
More of those damn conditional jumps (JO) with error handling routines (CALL 2028).
It turns out a lot of the analysis mentioned earlier that goes into proving bounds checks redundant also apply to
proving that overflow checks are redundant. It's all about proving facts about ranges. For example, if you can prove
that some check is dominated by some earlier check, and that
furthermore that earlier check is a superset of the later check, then the later check is unnecessary. If the opposite
is true -- that is, the earlier check is a subset of the later check, then if the subsequent block postdominates the
earlier one, you might move the stronger check to earlier in the program.
Another common pattern is that the same, or similar, arithmetic operation happens multiple times near one another:
It is obvious that, if the p assignment didn't overflow, then the q one won't either.
There's another magical phenomenon that happens in real world code a lot. It's common to have bounds checks and
arithmetic checks in the same neighborhood. Imagine some code that reads a bunch of values from an array:
Well C# arrays cannot have negative bounds. If a compiler knows that DATA_SIZE is sufficiently small that an
overflowed computation won't wrap around past 0, then it can eliminate the range check in favor of the bounds check.
There are many other patterns and special cases you can cover. But the above demonstrates the power of a really good
range optimizer that is integrated with loops optimization. It can cover a wide array of scenarios, array bounds and
arithmetic operations included. It takes a lot of work, but it's worth it in the end.
Inlining
For the most part, inlining is the same as with true native code. And
just as important. Often more important, due to C# developers' tendency to write lots of little methods (like property
accessors). Because of many of the topics throughout this article, getting small code can be more difficult than in
C++ -- more branches, more checks, etc. -- and so, in practice, most managed code compilers inline a lot less than
native code compilers, or at least need to be tuned very differently. This can actually make or break performance.
There are also areas of habitual bloat. The way lambdas are encoded in MSIL is unintelligable to a naive backend
compiler, unless it reverse engineers that fact. For example, we had an optimization that took this code:
and, after inlining, was able to turn B into just:
That Action argument to A is a lambda and, if you know how the C# compiler encodes lambdas in MSIL, you'll
appreciate how difficult this trick was. For example, here is the code for B:
To get the magic result required constant propagating the ldftn, recognizing how delegate construction works
(IL_0017), leveraging that information to inline B and eliminate the lambda/delegate altogether, and then, again
mostly through constant propagation, folding the arithmetic into the constant 42 initialization of x. I always
found it elegant that this "fell out" of a natural composition of multiple optimizations with separate concerns.
As with native code, profile guided optimization made our inlining decisions far more effective.
Structs
CLI structs are almost just like C structs. Except they're not. The CLI imposes some semantics that incur overheads.
These overheads almost always manifest as excessive copying. Even worse, these copies are usually hidden from your
program. It's worth noting, because of copy constructors and destructors, C++ also has some real issues here, often
even worse than what I'm about to describe.
Perhaps the most annoying is that initializing a struct the CLI way requires a defensive copy. For example, consider
this program, where the initialzer for S throws an exception:
The program behavior here has to be that the value 0 is written to the console. In practice, that means that the
assignment operation s = new S(42) must first create a new S-typed slot on the stack, construct it, and then and
only then copy the value back over the s variable. For single-int structs like this one, that's not a huge deal.
For large structs, that means resorting to memcpy. In Midori, we knew what methods could throw, and which could not,
thanks to our error model (more later), which meant we could avoid this overhead in nearly all cases.
Another annoying one is the following:
Every single time we read from s.Value:
we are going to get a local copy. This one's actually visible in the MSIL. This is without readonly:
And this is with it:
Notice that the compiler elected to use ldsfld followed by lodloca.s, rather than loading the address directly,
by way of ldsflda in the first example. The resulting machine code is even nastier. I also can't pass the struct
around by-reference which, as I mention later on, requires copying it and again can be problematic.
We solved this in Midori because our compiler knew about methods that didn't mutate members. All statics were immutable
to begin with, so the above s wouldn't need defensive copies. Alternatively, or in addition to this, the struct could
have beem declared as immutable, as follows:
Or because all static values were immutable anyway. Alternatively, the properties or methods in question could have
been annotated as readable meaning that they couldn't trigger mutations and hence didn't require defensive copies.
I mentioned by-reference passing. In C++, developers know to pass large structures by-reference, either using * or
&, to avoid excessive copying. We got in the habit of doing the same. For example, we had in parameters, as so:
I'll admit we probably took this to an extreme, to the point where our APIs suffered. If I could do it all over again,
I'd go back and eliminate the fundamental distinction between class and struct in C#. It turns out, pointers aren't
that bad after all, and for systems code you really do want to deeply understand the distinction between "near" (value)
and "far" (pointer). We did implement what amounted to C++ references in C#, which helped, but not enough. More on
this in my upcoming deep dive on our programming language.
Code size
We pushed hard on code size. Even more than some C++ compilers I know.
A generic instantiation is just a fancy copy-and-paste of code with some substitutions. Quite simply, that means an
explosion of code for the compiler to process, compared to what the developer actually wrote. I've covered many of the
performance challenges with generics in the past. A major problem there is the
transitive closure problem. .NET's straightforward-looking List<T> class actually creates 28 types in its transitive
closure! And that's not even speaking to all the methods in each type. Generics are a quick way to explode code size.
I never forgot the day I refactored our LINQ implementation. Unlike in .NET, which uses extension methods, we made all
LINQ operations instance methods on the base-most class in our collection type hierarchy. That meant 100-ish nested
classes, one for each LINQ operation, for every single collection instantiated! Refactoring this was an easy way for
me to save over 100MB of code size across the entire Midori "workstation" operating system image. Yes, 100MB!
We learned to be more thoughtful about our use of generics. For example, types nested inside an outer generic are
usually not good ideas. We also aggressively shared generic instantiations, even more than what the CLR does. Namely, we shared value type generics, where the
GC pointers were at the same locations. So, for example, given a struct S:
we would share the same code representation of List<int> with List<S>. And, similarly, given:
we would share instantiations between List<S> and List<T>.
You might not realize this, but C# emits IL that ensures structs have sequential layout:
As a result, we couldn't share List<S> and List<T> with some hypothetical List<U>:
For this, among other reasons -- like giving the compiler more flexibility around packing, cache alignment, and so on
-- we made structs auto by default in our language. Really, sequential only matters if you're doing unsafe code,
which, in our programming model, wasn't even legal.
We did not support reflection in Midori. In principle, we had plans to do it eventually, as a purely opt-in feature.
In practice, we never needed it. What we found is that code generation was always a more suitable solution. We shaved
off at least 30% of the best case C# image size by doing this. Significantly more if you factor in systems where the
full MSIL is retained, as is usually the case, even for NGen and .NET AOT solutions.
In fact, we removed significant pieces of System.Type too. No Assembly, no BaseType, and yes, even no FullName.
The .NET Framework's mscorlib.dll contains about 100KB of just type names. Sure, names are useful, but our eventing
framework leveraged code generation to produce just those you actually needed to be around at runtime.
At some point, we realized 40% of our image sizes were vtables.
We kept pounding on this one relentlessly, and, after all of that, we still had plenty of headroom for improvements.
Each vtable consumes image space to hold pointers to the virtual functions used in dispatch, and of course has a runtime
representation. Each object with a vtable also has a vtable pointer embedded within it. So, if you care about size
(both image and runtime), you are going to care about vtables.
In C++, you only get a vtable if you use virtual inheritance. In languages like C# and Java, you get them even if you
didn't want them. In C#, at least, you can use a struct type to elide them. I actually love this aspect of Go,
where you get a virtual dispatch-like thing, via interfaces, without needing to pay for vtables on every type; you only
pay for what you use, at the point of coercing something to an interface.
Another vtable problem in C# is that all objects inherit three virtuals from System.Object: Equals, GetHashCode,
and ToString. Besides the point that these generally don't do the right thing in the right way anyways -- Equals
requires reflection to work on value types, GetHashCode is nondeterministic and stamps the object header (or sync-
block; more on that later), and ToString doesn't offer formatting and localization controls -- they also bloat every
vtable by three slots. This may not sound like much, but it's certainly more than C++ which has no such overhead.
The main source of our remaining woes here was the assumption in C#, and frankly most OOP languages like C++ and Java,
that RTTI is always available for downcasts. This was
particularly painful with generics, for all of the above reasons. Although we aggressively shared instantiations, we
could never quite fully fold together the type structures for these guys, even though disparate instantiations tended
to be identical, or at least extraordinarily similar. If I could do it all over agan, I'd banish RTTI. In 90% of the
cases, type discriminated unions or pattern matching are more appropriate solutions anyway.
Profile guided optimizations (PGO)
I've mentioned profile guided optimization (PGO) already.
This was a critical element to "go that last mile" after mostly everything else in this article had been made
competetive. This gave our browser program boosts in the neighborhood of 30-40% on benchmarks like SunSpider and Octane.
Most of what went into PGO was similar to classical native profilers, with two big differences.
First, we tought PGO about many of the unique optimizations listed throughout this article, such as asynchronous stack
probing, generics instantiations, lambdas, and more. As with many things, we could have gone on forever here.
Second, we experimented with sample profiling, in addition to the ordinary instrumented profiling. This is much nicer
from a developer perspective -- they don't need two builds -- and also lets you collect counts from real, live running
systems in the data center. A good example of what's possible is outlined in this Google-Wide Profiling (GWP) paper.
System Architecture
The basics described above were all important. But a number of even more impactful areas required deeper architectural
co-design and co-evolution with the language, runtime, framework, and operating system itself. I've written about the
immense benefits of this sort of "whole system" approach before. It was kind of magical.
GC
Midori was garbage collected through-and-through. This was a key element of our overall model's safety and
productivity. In fact, at one point, we had 11 distinct collectors, each with its own unique characteristics. (For
instance, see this study.) We
had some ways to combat the usual problems, like long pause times. I'll go through those in a future post, however.
For now, let's stick to the realm of code quality.
The first top-level decision is: conservative or precise? A conserative collector is easier to wedge into an
existing system, however it can cause troubles in certain areas. It often needs to scan more of the heap to get the
same job done. And it can falsely keep objects alive. We felt both were unacceptable for a systems programming
environment. It was an easy, quick decision: we sought precision.
Precision costs you something in the code generators, however. A precise collector needs to get instructions where to
find its root set. That root set includes field offsets in data structures in the heap, and also places on the stack
or, even in some cases, registers. It needs to find these so that it doesn't miss an object and erroneously collect it
or fail to adjust a pointer during a relocation, both of which would lead to memory safety problems. There was no magic
trick to making this efficient other than close integration between runtime and code generator, and being thoughtful.
This brings up the topic of cooperative versus preemptive, and the notion of GC safe-points. A GC operating in
cooperative mode will only collect when threads have reached so-called "safe-points." A GC operating in preemptive
mode, on the other hand, is free to stop threads in their tracks, through preemption and thread suspension, so that it
may force a collection. In general, preemptive requires more bookkeeping, because the roots must be identifiable at
more places, including things that have spilled into registers. It also makes certain low-level code difficult to
write, of the ilk you'll probably find in an operating system's kernel, because objects are subject to movement between
arbitrary instructions. It's difficult to reason about. (See this file, and its associated uses in the CLR codebase, if you
don't believe me.) As a result, we used cooperative mode as our default. We experimented with automatic safe-point
probes inserted by the compiler, for example on loop back-edges, but opted to bank the code quality instead. It did
mean GC "livelock" was possible, but in practice we seldom ran into this.
We used a generational collector. This has the advantage of reducing pause times because less of the heap needs to be
inspected upon a given collection. It does come with one disadvantage from the code generator's perspective, which is
the need to insert write barriers into the code. If an older generation object ever points back at a younger generation
object, then the collector -- which would have normally preferred to limit its scope to younger generations -- must know
to look at the older ones too. Otherwise, it might miss something.
Write barriers show up as extra instructions after certain writes; e.g., note the call:
That barrier simply updates an entry in the card table, so the GC knows to look at that segment the next time it scans
the heap. Most of the time this ends up as inlined assembly code, however it depends on the particulars of the
situation. See this code for an
example of what this looks like for the CLR on x64.
It's difficult for the compiler to optimize these away because the need for write barriers is "temporal" in nature. We
did aggressively eliminate them for stack allocated objects, however. And it's possible to write, or transform code,
into less barrier hungry styles. For example, consider two ways of writing the same API:
In the resulting Test method body, you will find a write barrier in the former, but not the latter. Why? Because the
former is writing a heap object reference (of type object), and the compiler has no idea, when analyzing this method
in isolation, whether that write is to another heap object. It must be conservative in its analysis and assume the
worst. The latter, of course, has no such problem, because a bool isn't something the GC needs to scan.
Another aspect of GC that impacts code quality is the optional presence of more heavyweight concurrent read and write
barriers, when using concurrent collection. A concurrent GC does some collection activities concurrent with the user
program making forward progress. This is often a good use of multicore processors and it can reduce pause times and
help user code make more forward progress over a given period of time.
There are many challenges with building a concurrent GC, however one is that the cost of the resulting barriers is high.
The original concurrent GC by Henry Baker was a copying GC and had the notion
of "old" versus "new" space. All reads and writes had to be checked and, anything operation against the old space had
to be forwarded to the new space. Subsequent research for the DEC Firefly used hardware memory protection to reduce the
cost, but the faulting cases were still exceedingly expensive. And, worst of all, access times to the heap were
unpredictable. There has been a lot of good research into solving this problem, however we abandoned copying.
Instead, we used a concurrent mark-sweep compacting collector. This means only write barriers are needed under normal
program execution, however some code was cloned so that read barriers were present when programs ran in the presence of
object movement. Our primary GC guy's research was published, so you can read all about it. The
CLR also has a concurrent collector, but it's not quite as good. It uses copying to collect the youngest generation,
mark-sweep for the older ones, and the mark phase is parallelized. There are unfortunately a few conditions that can
lead to sequential pauses (think of this like a big "lock"), sometimes over 10 milliseconds: 1) all threads must be
halted and scanned, an operation that is bounded only by the number of threads and the size of their stacks; 2) copying
the youngest generation is bounded only by the size of that generation (thankfully, in normal configurations, this is
small); and 3) under worst case conditions, compaction and defragmentation, even of the oldest generation, can happen.
Separate compilation
The basic model to start with is static linking. In this model, you compile everything into a single executable. The
benefits of this are obvious: it's simple, easy to comprehend, conceptually straightforward to service, and less work
for the entire compiler toolchain. Honestly, given the move to Docker containers as the unit of servicing, this model
makes more and more sense by the day. But at some point, for an entire operating system, you'll want separate
compilation. Not just because compile times can get quite long when statically linking an entire operating system, but
also because the working set and footprint of the resulting processes will be bloated with significant duplication.
Separately compiling object oriented APIs is hard. To be honest, few people have actually gotten it to work. Problems
include the fragile base class problem, which is a real killer for
version resilient libraries. As a result, most real systems use a dumbed down "C ABI" at the boundary between components. This is why Windows,
for example, has historically used flat C Win32 APIs and, even in the shift to more object orientation via WinRT, uses
COM underneath it all. At some runtime expense, the ObjectiveC runtime addressed this challenge. As with most things
in computer science, virtually all problems can be solved with an extra level of indirection; this one can be too.
The design pivot we took in Midori was that whole processes were sealed. There was no dynamic loading, so nothing that
looked like classical DLLs or SOs. For those scenarios, we used the Asynchronous Everything programming model, which made it easy to dynamically
connect to and use separately compiled and versioned processes.
We did, however, want separately compiled binaries, purely as a developer productivity and code sharing (working set)
play. Well, I lied. What we ended up with was incrementally compiled binaries, where a change in a root node triggered
a cascading recompilation of its dependencies. But for leaf nodes, such as applications, life was beautiful. Over
time, we got smarter in the toolchain by understanding precisely which sorts of changes could trigger cascading
invaliation of images. A function that was known to never have been inlined across modules, for example, could have its
implementation -- but not its signature -- changed, without needing to trigger a rebuild. This is similar to the
distinction between headers and objects in a classical C/C++ compilation model.
Our compilation model was very similar to C++'s, in that there was static and dynamic linking. The runtime model, of
course, was quite different. We also had the notion of "library groups," which let us cluster multiple logically
distinct, but related, libraries into a single physical binary. This let us do more aggressive inter-module
optimizations like inlining, devirtualization, async stack optimizations, and more.
Parametric polymorphism (a.k.a., generics)
That brings me to generics. They throw a wrench into everything.
The problem is, unless you implement an erasure model -- which stinks performance-wise due to boxing allocations,
indirections, or both -- there's no way for you to possible pre-instantiate all possible versions of the code. For
example, say you're providing a List<T>. How do you know whether folks using your library will want a List<int>,
List<string>, or List<SomeStructYouveNeverHeardOf>?
Solutions abound:
Do not specialize. Erase everything.
Specialize only a subset of instantiations, and create an erased instantiation for the rest.
Specialize everything. This gives the best performance, but at some complexity.
Java uses #1 (in fact, erasure is baked into the language). Many ML compilers use #2. As with everything in Midori,
we picked the hardest path, with the most upside, which meant #3. Actually I'm being a little glib; we had several ML
compiler legends on the team, and #2 is fraught with peril; just dig a little into some papers on how hard this can get, since it's
difficult to know a priori which instantiations are going to be performance critical to a program.
Anyway, Midori's approach turned out to be harder than it sounded at first.
Imagine you have a diamond. Library A exports a List<T> type, and libraries B and C both instantiate List<int>. A
program D then consumes both B and C and maybe even passes List<T> objects returned from one to the other. How do we
ensure that the versions of List<int> are compatible?
We called this problem the potentially multiply instantiated, or PMI for short, problem.
The CLR handles this problem by unifying the instantiations at runtime. All RTTI data structures, vtables, and whatnot,
are built and/or aggressively patched at runtime. In Midori, on the other hand, we wanted all such data structures to
be in readonly data segments and hence shareable across processes, wherever possible.
Again, everything can be solved with an indirection. But unlike solution #2 above, solution #3 permits you to stick
instantiations only in the rare places where you need them. And for purposes of this one, that meant RTTI and accessing
static variables of just those generic types that might have been subject to PMI. First, that affected a vast subset of
code (versus #2 which generally affects even loading of instance fields). Second, it could be optimized away for
instantiations that were known not to be PMI, by attaching state and operations to the existing generic dictionary that
was gets passed around as a hidden argument already. And finally, because of all of this, it was pay for play.
But damn was it complex.
It's funny, but C++ RTTI for template instantiations actually suffers from many of the same problems. In fact, the
Microsoft Visual C++ compiler resorts to a strcmp of the type names, to resolve diamond issues! (Thankfully there are
well-known, more efficient ways to do this, which we are actively pursuing for the next release.)
Virtual dispatch
Although I felt differently when first switching from Java to C#, Midori made me love that C# made methods non-virtual
by default. I'm sure we would have had to change this otherwise. In fact, we went even further and made classes
sealed by default, requiring that you explicitly mark them virtual if you wanted to facilitate subclasses.
Aggressive devirtualization, however, was key to good performance. Each virtual means an indirection. And more
impactfully, a lost opportunity to inline (which for small functions is essential). We of course did global
intra-module analysis to devirtualize, but also extended this across modules, using whole program compilation, when
multiple binaries were grouped together into a library group.
Although our defaults were right, m experience with C# developers is that they go a little hog-wild with virtuals and
overly abstract code. I think the ecosystem of APIs that exploded around highly polymorphic abstractions, like LINQ and
Reactive Extensions, encouraged this and instilled lots of bad behavior. As you can guess, there wasn't very much of
that floating around our codebase. A strong culture around identifying and trimming excessive fat helped keep this in
check, via code reviews, benchmarks, and aggressive static analysis checking.
Interfaces were a challenge.
There are just some poorly designed, inefficient patterns in the .NET Framework. IEnumerator<T> requires two
interface dispatches simply to extract the next item! Compare that to C++ iterators which can compile down a pointer
increment plus dereference. Many of these problems could be addressed simply with better library designs. (Our final
design for enumeration didn't even invole interfaces at all.)
Plus invoking a C# interface is tricky. Existing systems do not use pointer adjustment like
C++ does so usually an interface dispatch requires a table search. First a level of indirection to get to the vtable,
then another level to find the interface table for the interface in question. Some systems attempt to do callsite
caching for monomorphic invocations; that is, caching the latest invocation in the hope that the same object kind passes
through that callsite time and time again. This requires mutable stubs, however, not to mention an incredibly complex
system of thunks and whatnot. In Midori, we
never ever ever violated W^X; and we avoided mutable runtime data structures,
because they inhibit sharing, both in terms of working set, but also amortizing TLB and data cache pressure.
Our solution took advantage of the memory ordering model earlier. We used so-called "fat" interface pointers. A fat
interface pointer was two words: the first, a pointer to the object itself; the second, a pointer to the interface
vtable for that object. This made conversion to interfaces slightly slower -- because the interface vtable lookup had
to happen -- but for cases where you are invoking it one or more times, it came out a wash or ahead. Usually,
significantly. Go does something like this, but it's slightly different for two reasons. First, they generate the
interface tables on the fly, because interfaces are duck typed. Second, fat interface pointers are subject to tearing
and hence can violate memory safety in Go, unlike Midori thanks to our strong concurrency model.
The finally challenge in this category was generic virtual methods, or GVMs. To cut to the chase, we banned them.
Even if you NGen an, image in .NET, all it takes is a call to the LINQ query a.Where(...).Select(...), and you're
pulling in the JIT compiler. Even in .NET Native, there is considerable runtime data structure creation, lazily, when
this happens. In short, there is no known way to AOT compile GVMs in a way that is efficient at runtime. So, we didn't
even bother offering them. This was a slightly annoying limitation on the programming model but I'd have done it all
over again thanks to the efficiencies that it bought us. It really is surprising how many GVMs are lurking in .NET.
Statics
I was astonished the day I learned that 10% of our code size was spent on static initialization checks.
Many people probably don't realize that the CLI specification offers two static initialization modes. There
is the default mode and beforefieldinit. The default mode is the same as Java's. And it's horrible. The static
initializer will be run just prior to accessing any static field on that type, any static method on that type, any
instance or virtual method on that type (if it's a value type), or any constructor on that type. The "when" part
doesn't matter as much as what it takes to make this happen; all of those places now need to be guarded with explicit
lazy initialization checks in the resulting machine code!
The beforefieldinit relaxation is weaker. It guarantees the initializer will run sometime before actually accessing
a static field on that type. This gives the compiler a lot of leeway in deciding on this placement. Thankfully the
C# compiler will pick beforefieldinit automatically for you should you stick to using field initializers only. Most
people don't realize the incredible cost of choosing instead to use a static constructor, however, especially for value
types where suddenly all method calls now incur initialization guards. It's just the difference between:
and:
Now imagine the struct has a property:
Here's the machine code for Property if S has no static initializer, or uses <cod