Woboq.com

Woboq Code Browser: under the hood

2016-10-24

A few years ago, I introduced the code browser
and code.woboq.org.

Since then, we have implemented a few new features and made the
source code available, which you can
browse from the code browser itself.

In this article, I give more technical information on how this is all done.

What/Why a code browser?

As a developer, I spend more time reading code than writing it, and I found reading code online a rather poor experience.
Hence, I built a way to publish code so that it can be read just like in a good IDE, with links, tooltips, and semantic highlighting.
Read more about it in the original blog post.

Architecture Overview

The idea is that there is a generator that generates all the .html pages ahead of time. This is a bit like a compilation.
In the process we also fill up a database containing all the symbols and where
they are defined or used, as well as other information such as the documentation.

The database

Because NoSQL is trendy, we use a well-known NoSQL database directly included in the kernel: the file system ☺.
There is a huge directory containing one file per global symbol. Each file contains all the information you see in the tooltip:
the type of the variable, the list of its uses, and its documentation. The HTML files have tags referencing the symbol, and the JavaScript can then do
an AJAX request on that file in order to render the tooltip contents.
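
To make the "one file per global symbol" idea concrete, here is a minimal sketch. The `refs` directory name, the `Entry` structure, and the line-based text format are all assumptions made for illustration; they are not the actual on-disk format used by the code browser.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Kinds of entries a symbol file holds, following the description in the text.
enum class EntryKind { Declaration, Definition, Use };

struct Entry {
    EntryKind kind;
    std::string file;  // source file containing the entry
    int line;          // 1-based line number
};

// One file per global symbol, named after the symbol's mangled name.
// The "refs" directory name is an assumption made for this sketch.
std::string symbolFilePath(const std::string &outputRoot, const std::string &mangledName) {
    assert(!mangledName.empty());
    return outputRoot + "/refs/" + mangledName;
}

// Serialize the entries one per line, in a made-up text format.
std::string serializeEntries(const std::vector<Entry> &entries) {
    std::string out;
    for (const Entry &e : entries) {
        switch (e.kind) {
        case EntryKind::Declaration: out += "dec "; break;
        case EntryKind::Definition:  out += "def "; break;
        case EntryKind::Use:         out += "use "; break;
        }
        out += e.file + ":" + std::to_string(e.line) + "\n";
    }
    return out;
}
```

Because the file name is derived only from the mangled name, the JavaScript side needs nothing more than the symbol reference found in the HTML to build the URL it requests.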

For example, the file for QQmlEngine::removeImageProvider looks like this:

You can see that there is one entry for the declaration, the definition, the documentation, and one for each usage.
This is the information displayed in the tooltip.

We also store information about inherited classes or methods.

See the symbol page that shows information from the tooltip and more.

This way, the whole generated browsable code is just a set of files that can be served by any simple web server.
The whole thing is maybe three times as big as the original source code, which still amounts to several GB given how much source code we host.
However, it is highly compressible. To save space and ease upload,
we use squashfs images. That way
we even have atomic updates ☺

Using Clang to parse C/C++

Here is the interesting part: how does the generator work?

Clang is more than just a compiler: it is really a library to parse C and C++.

Clang provides all the tools required. All I have to do is to create a clang::ASTConsumer.
Once the parser has finished its job, we can then visit the full AST of the translation unit with the clang::RecursiveASTVisitor,
as explained in this tutorial.

We then visit all the declarations and usage nodes. We know the source location of the node,
so we know if we are in a file that we should generate. In particular, we do not generate header files twice, so if
the header file has already been parsed, we ignore that node.
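
The "do not generate header files twice" rule can be sketched without any Clang dependency. In the toy model below, `Node` stands in for an AST node with a known source location, and `FileFilter` is a hypothetical helper: both names are assumptions for illustration, and the real generator works on Clang's AST and source locations instead.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// A stand-in for an AST node: only its source location and symbol matter here.
struct Node {
    std::string file;
    int line;  // 1-based line number
    std::string symbol;
};

class FileFilter {
public:
    // Returns the nodes of one translation unit that still need to be emitted.
    // Nodes located in files completed by an earlier translation unit are dropped.
    std::vector<Node> filter(const std::vector<Node> &translationUnit) {
        std::vector<Node> kept;
        std::unordered_set<std::string> seenNow;
        for (const Node &n : translationUnit) {
            assert(n.line > 0);
            if (done_.count(n.file))
                continue;  // file already generated: ignore this node
            seenNow.insert(n.file);
            kept.push_back(n);
        }
        // Every file touched by this translation unit is now considered generated.
        done_.insert(seenNow.begin(), seenNow.end());
        return kept;
    }

private:
    std::unordered_set<std::string> done_;  // files generated so far
};
```

Files only become "done" at the end of a translation unit, so a header included by the unit currently being processed is still generated in full the first time it is seen.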

Knowing the location of the node, we can register an HTML tag for it. We give the tag a data-ref
attribute with the mangled name of the symbol so the
JavaScript will be able to show the tooltip, and we give it proper classes so it can be semantically styled.
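
As an illustration, the markup for one token could be produced roughly as follows. The `<span>` element, the class name passed in, and both function names are assumptions for this sketch; only the data-ref reference carrying the mangled name comes from the description above.

```cpp
#include <cassert>
#include <string>

// Escape the characters that are special in HTML text and attribute values.
std::string escapeHtml(const std::string &in) {
    std::string out;
    for (char c : in) {
        switch (c) {
        case '&': out += "&amp;"; break;
        case '<': out += "&lt;"; break;
        case '>': out += "&gt;"; break;
        case '"': out += "&quot;"; break;
        default:  out += c;
        }
    }
    return out;
}

// Wrap a token in an element carrying the symbol reference and a style class.
// The <span> element and the class names are assumptions for this sketch.
std::string annotateToken(const std::string &token,
                          const std::string &mangledName,
                          const std::string &cssClass) {
    assert(!token.empty());
    return "<span class=\"" + escapeHtml(cssClass) + "\" data-ref=\"" +
           escapeHtml(mangledName) + "\">" + escapeHtml(token) + "</span>";
}
```

Escaping matters because C++ tokens such as `operator<` or `&` would otherwise break the generated page.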

We also register the use in the database.

Macro Expansions

We also show the expansion of a macro in the macro tooltip.
Macros do not appear as nodes in the AST because they are expanded earlier,
in the pre-processing phase.
We use clang::PPCallbacks to be notified each time a macro is expanded.
Unfortunately, getting the actual expansion is far from easy since the expansion never appears as
such in memory. What happens is that the pre-processor just provides tokens to the parser.
We have to
pre-process the macro again and write the token strings into the tooltip.

Documentation Comments

Comments are ignored by the parser so they are not part of the AST.
We do another pass over each file, in which we find comments and keywords for the
basic syntax coloration of things that are not in the AST, and color these elements
appropriately. We try to associate each comment with
the closest declaration or definition, so it can go in the database (and in the tooltip).
We also recognize some Doxygen commands such as \fn, which associate the
comment with a different declaration.
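
The two association rules can be sketched as plain string and line-number logic. Everything here is an assumption for illustration: the function names, the `maxGap` threshold, and the choice to look only at declarations starting on or after the comment's last line. The real generator's heuristics may differ.

```cpp
#include <cassert>
#include <string>
#include <vector>

// If the comment contains a Doxygen \fn (or @fn) command, return the text
// that follows it on the same line; otherwise return an empty string.
std::string fnCommandTarget(const std::string &comment) {
    std::size_t pos = comment.find("\\fn ");
    if (pos == std::string::npos)
        pos = comment.find("@fn ");
    if (pos == std::string::npos)
        return std::string();
    std::size_t start = pos + 4;  // skip the command and the space after it
    std::size_t end = comment.find('\n', start);
    std::string target = comment.substr(
        start, end == std::string::npos ? std::string::npos : end - start);
    // Trim surrounding blanks.
    const char *blanks = " \t\r";
    std::size_t first = target.find_first_not_of(blanks);
    if (first == std::string::npos)
        return std::string();
    std::size_t last = target.find_last_not_of(blanks);
    return target.substr(first, last - first + 1);
}

// Pick the declaration that starts closest to the end of the comment.
// Returns -1 if no declaration starts within maxGap lines after it.
int closestDeclaration(int commentEndLine, const std::vector<int> &declLines, int maxGap) {
    assert(maxGap >= 0);
    int best = -1;
    for (int line : declLines) {
        if (line < commentEndLine || line - commentEndLine > maxGap)
            continue;
        if (best == -1 || line < best)
            best = line;
    }
    return best;
}
```

A comment with a `\fn` target would be attached to the named declaration; any other comment falls back to the proximity rule.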

Qt SIGNAL and SLOT

We detect a few Qt extensions. We recognize calls to QObject::connect, QMetaObject::activate and
similar calls like QTimer::singleShot (see the full list of recognized functions). Since SIGNAL and SLOT
are macros that transform their argument into a string, we can easily extract the string literal and parse it in order to
find which method it refers to. We know in which class to look because we know the type of the QObject subclass of the receiver parameter.
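
Parsing the string literal is straightforward because Qt's METHOD, SLOT and SIGNAL macros prefix the stringified signature with the character '0', '1' or '2' respectively. The sketch below splits such a string into its parts. The `ParsedConnectArg` type and the function name are hypothetical, and looking the method up in the receiver's class is left out.

```cpp
#include <cassert>
#include <string>

enum class MethodKind { Invalid, Method, Slot, Signal };

struct ParsedConnectArg {
    MethodKind kind = MethodKind::Invalid;
    std::string name;       // method name, e.g. "valueChanged"
    std::string arguments;  // text between the parentheses, e.g. "int"
};

// Parse the string produced by Qt's METHOD(), SLOT() or SIGNAL() macros.
// Returns a result with kind == Invalid if the string is not recognized.
ParsedConnectArg parseConnectArg(const std::string &literal) {
    ParsedConnectArg result;
    if (literal.size() < 2)
        return result;
    MethodKind kind;
    switch (literal[0]) {
    case '0': kind = MethodKind::Method; break;
    case '1': kind = MethodKind::Slot; break;
    case '2': kind = MethodKind::Signal; break;
    default: return result;
    }
    std::size_t open = literal.find('(');
    std::size_t close = literal.rfind(')');
    // Require a non-empty name followed by a parenthesized argument list.
    if (open == std::string::npos || close == std::string::npos || close < open || open == 1)
        return result;
    result.kind = kind;
    result.name = literal.substr(1, open - 1);
    result.arguments = literal.substr(open + 1, close - open - 1);
    assert(!result.name.empty());
    return result;
}
```

For instance, `parseConnectArg("2valueChanged(int)")` yields a signal named `valueChanged` with the argument text `int`, which is enough to search the methods of the receiver's class.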

Usage classification

When looking at the AST, we see how the variables are used. We can classify whether the
variable was simply read or modified. We add a little letter next to the
uses in the tooltip. If you click on the little