Today's story is about the "XKeyScore source code" leak. As an expert, I'm going to read through the code line by line and comment on it.
Let's assume, for the moment, that somebody has taken an open-source deep-packet-inspection project like Snort and written a language on top of it to satisfy XKeyScore needs. Let's look at the gap between what Snort can do now and what this code wants to produce.
Take the first fingerprint in the file as an example:
The keyword fingerprint declares a new fingerprint (like an IDS "signature") that will trigger on the underlying TCP session. We don't see the definitions of the global variables $tor_authority or $tor_directory. Missing bits like this show that we are missing lots of code -- this isn't an original source file. One possible reason is that, instead of being statically defined lists like the other variables in this source file, they may be generated from a dynamic source that's constantly updating.
I don't know what preappid means, but I presume it tests whether a fingerprint name with that prefix is already associated with the session. Below in the code we see fire_fingerprint statements with that string as a prefix, so I presume they are supposed to match.
Snort fires fingerprints on packets instead of sessions. Snort can match on completely named identifiers associated with sessions, but not prefixes of names. This hints to me that the underlying code isn't actually Snort.
The following code just specifies a global variable:
This appears not to be a "variable" as such, but a "preprocessor macro". The only system that vaguely uses preprocessor macros like this is Snort, though the specifications within the macro aren't something Snort can directly consume.
Apparently, there are many versions of this scripting language:
This line hints, first of all, that these Tor fingerprints were pulled from many different source files and combined into this one file for publication. Normally, such a line would appear only at the top of a file.
That we are already on version 5 hints that this comes from an operational device rather than a prototype -- or at least, the prototype is rapidly evolving.
The above fingerprints trigger simply on IP addresses and ports. The following fingerprint triggers on decoded information:
The Snort IDS is unusually lacking in protocol decodes, and does not have an X.509 decoder. Other IDSs, such as Bro, Palo Alto, Proventia, and so on have such decoders. On the other hand, it's not too difficult to add an X.509 preprocessor to Snort if you wanted to get this information.
A good example of how to write a decoder for X.509 to grab this information is in my masscan project. I mention this because whatever X.509 decoder it is they are using now is probably wrong. The masscan decoder is the correct way to write such things.
The following fingerprint includes C++ source. Instead of posting the entire fingerprint here, I'm just pasting the first few lines:
Update: It looks like the closing parenthesis is missing in the email_body keyword, but in fact (as a comment from cyphunk points out) it is at the end of the C++ closing braces many lines below. This is a common functional idiom these days (JavaScript, Swift, etc.).
The main question here is exactly how this C++ code gets included into the main code. Is the entire program recompiled? Or does the code get compiled into a dynamic library that gets loaded (and unloaded) at runtime? Can this software be updated with new 'rules' while running? Or must the existing service be shut down, the rules replaced, and a new process started?
The string is parsed with a regular expression:
We see there that a C++ array is created by parsing "content" with a regex looking for one or more strings in the email of the form:
bridge 192.168.0.1:443
The PCRE regex is using capture groups. That's the purpose of the parentheses ( ) in the regex. There are two capture groups, one for the IP address, and one for the port number.
One interesting thing to note about the port number is that it captures the first non-digit character after the number as well. This is obviously a bug, but since that character is usually whitespace, it's one that doesn't impact the system.
In the 'main' function, the variable 'bridges' is passed in, and then used to fill in a database. This is interesting, because some IDSs (e.g. Proventia) have the ability to pass this along directly as IDS events. Other IDSs (e.g. Snort) can't, and thus have to use another mechanism to store this information, such as shown here:
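To make the capture-group behavior concrete, here is a minimal sketch in Python rather than the fingerprint's own C++. Only the port portion of the pattern, `:?([0-9]{2,4}?[^0-9])`, is quoted in this post; the `bridge\s` prefix, the IP-address group, and the names `BRIDGE_RE`, `FIXED_RE`, and `extract_bridges` are my assumptions for illustration, not the leaked code.

```python
import re

# Assumed reconstruction: only the port part ":?([0-9]{2,4}?[^0-9])" is quoted
# in the post. The "bridge\s" prefix and the IP group are hypothetical.
BRIDGE_RE = re.compile(
    r"bridge\s([0-9]{1,3}(?:\.[0-9]{1,3}){3}):?([0-9]{2,4}?[^0-9])"
)

# One possible fix (also hypothetical): keep the non-digit out of the capture
# group by using a negative lookahead instead of consuming the character.
FIXED_RE = re.compile(
    r"bridge\s([0-9]{1,3}(?:\.[0-9]{1,3}){3}):([0-9]{2,5})(?![0-9])"
)

def extract_bridges(content, pattern=BRIDGE_RE):
    """Return a list of (ip, port) string pairs found in an email body."""
    return pattern.findall(content)

if __name__ == "__main__":
    body = "bridge 192.168.0.1:443\nbridge 10.0.0.2:9001\n"
    print(extract_bridges(body))            # ports carry the trailing newline
    print(extract_bridges(body, FIXED_RE))  # ports are digits only
```

With the quoted port pattern, the second group comes back as "443" plus the newline that follows it. A side effect of requiring that trailing character is that a bridge line at the very end of the input, with nothing after the port, would not match at all under this reconstruction.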
Nonetheless, the system still produces an IDS-style event by manually firing one:
The idea of "firing" an event is suggestive of IDS-style deep-packet-inspection, as opposed to other forms.
The following fingerprint is something really only the NSA would care about:
It triggers whenever somebody accesses the Tor Project website, except when they come from the "five eyes" countries.
That's fascinating because it implies that the system has to have an up-to-date geolocation database that can associate IP addresses with country. That's something rare in systems that monitor traffic. Though, the xff_cc clause may mean something else, such as triggering off the cookie information, identifying a user from that, and then identifying the country.
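As a rough sketch of what such a country-exclusion check involves, the following Python uses a toy IP-to-country table. Everything here is hypothetical: the table contents (reserved documentation address ranges), the names `GEO_TABLE`, `country_of`, and `should_fire`, and the choice to fire when the country is unknown. It assumes the plain geolocation reading of the rule, not the cookie-based alternative.

```python
import ipaddress

# The "five eyes" countries as ISO 3166 two-letter codes.
FIVE_EYES = {"US", "GB", "CA", "AU", "NZ"}

# Toy geolocation table built from reserved documentation ranges.
# A real system would need a large, frequently updated database.
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "US"),
    (ipaddress.ip_network("198.51.100.0/24"), "DE"),
    (ipaddress.ip_network("203.0.113.0/24"), "NZ"),
]

def country_of(ip):
    """Return the country code for an IP address, or None if unknown."""
    addr = ipaddress.ip_address(ip)
    for network, country in GEO_TABLE:
        if addr in network:
            return country
    return None

def should_fire(host, client_ip):
    """Fire on visits to the Tor Project site unless the client is in a
    five-eyes country. Unknown countries fire (an assumption)."""
    if host != "www.torproject.org":
        return False
    return country_of(client_ip) not in FIVE_EYES

if __name__ == "__main__":
    print(should_fire("www.torproject.org", "198.51.100.7"))  # non-five-eyes
    print(should_fire("www.torproject.org", "192.0.2.1"))     # five-eyes
```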
It's also interesting that the other fingerprints DON'T have this. Consider how the previous rule builds up a list of Tor bridges. It seems like it will extract this information from emails belonging to people the NSA knows are within the United States. However, since the NSA doesn't care about who sent/received the email, just the contents so that it can build a list of bridges, it doesn't care where it got the information. Also, whereas HTTP goes from user-to-server, they can reasonably assume that the IP address identifies the nationality of the person. On the other hand, email goes from server-to-server, so the location of the servers tells little about the location of the users.
Consider the following comment:
This doesn't sound right. In an operational system, there will be a process whereby analysts request that new information be tracked by the system. That request is handed off to some sort of project manager to track the new requirement, and is eventually given to the engineers to add to a file like this. This sort of comment might appear in the requirements doc, but it's odd to see it appear in the code as a comment. You see this in Snort/EmergingThreats rules: the rules themselves do not have extensive comments like this explaining them.
This is especially true for an organization like the NSA with strict OPSEC and "need to know" requirements. All an engineer need know is the strings to search for, not why.
Thus, this hints to me that this isn't an operational system. The NSA has many more prototypes than operational systems. This sort of comment would be expected in a prototype.
Of course, I may just be reading too much into things. Maybe analysts are closer to the code than I thought, able to write their own fingerprints and send them to the engineers for inclusion without a lot of process in between.
This fingerprint hints at a lot of functionality:
Firstly, there is the short string "ct_mo". I don't know what this means. The longer fingerprint strings elsewhere in the system hint that the information isn't of particular use to human analysts, that they are used more for automated processes and indexing. This short fingerprint name here suggests otherwise, that it's actually something an analyst might be interested in.
The fingerprint 'documents/comsec/tails_doc' is nowhere defined in this file. The word "doc" suggests that it comes from some other subsystem that processes "documents" instead of network data.
The web_search keyword is interesting because it implies an entire subsystem built on top of the HTTP parser that focuses just on search strings found in Google, Bing, etc. This further implies that the 'fingerprint' keyword may be tied to more than just the TCP session information, and may also be tied to user information identified by session cookies.
The url keyword is similar to Snort, but Snort doesn't have an html_title parser.
It's clear what the following code does, but I find the term "map reduce" confusing in this context:
Apparently, the goal is to search all TCP traffic on all ports for Tor hidden service URLs that look like:
http://o87asgd2435fyuil.onion:443/
This further implies some things about the underlying regex system. That regex would be extremely expensive to run with PCRE (the popular regex library) against all network traffic. This implies that they are using either a software or possibly even a hardware accelerator. In software, the regex can be converted to a fast DFA form, which would allow 5 Gbps of network throughput per core but would limit the complexity of the regexes. In hardware, such as in network processors by Cavium, it'd be a little bit slower, but would allow more complicated regexes. Or, it could mean that the system is just slow because the programmers are stupid and don't know how to regex better. XKeyScore is described as slower than Turbulence because it does greater "depth" of analysis -- but there's nothing in this file that can't be done at Turbulence speeds.
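To show what "DFA form" means structurally, here is a small Python sketch of a table-driven matcher. It is deliberately limited: it compiles only a literal byte string (I use `.onion`) into a DFA using the textbook KMP automaton construction, not a full regex, and Python will come nowhere near the speeds mentioned above. The function names `build_dfa` and `find_all` are mine. The point is the inner loop: one table lookup per input byte, with no backtracking.

```python
def build_dfa(pattern):
    """Build a KMP-style DFA for a literal byte pattern.

    dfa[state][byte] gives the next state; state len(pattern) is accepting.
    """
    m = len(pattern)
    dfa = [[0] * 256 for _ in range(m + 1)]
    dfa[0][pattern[0]] = 1
    restart = 0  # state the DFA would be in after a mismatch at this point
    for state in range(1, m + 1):
        # On a mismatch, behave as the restart state would.
        dfa[state] = dfa[restart][:]
        if state < m:
            # On a match, advance to the next state.
            dfa[state][pattern[state]] = state + 1
            restart = dfa[restart][pattern[state]]
    return dfa

def find_all(dfa, data):
    """Scan data once and return the start offset of every match."""
    accept = len(dfa) - 1
    state = 0
    hits = []
    for offset, byte in enumerate(data):
        state = dfa[state][byte]  # exactly one lookup per input byte
        if state == accept:
            hits.append(offset - accept + 1)
    return hits

if __name__ == "__main__":
    traffic = b"GET http://o87asgd2435fyuil.onion:443/ HTTP/1.1"
    print(find_all(build_dfa(b".onion"), traffic))
```

A real DFA-based regex engine generalizes the same idea to character classes and repetition, which is where the limits on regex complexity come from: the state table can grow very large.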
Consider the following code:
This implies a multi-threaded message-passing system. The network processing code has to run as lean-and-mean as possible to keep up with network traffic. More expensive operations, such as database insertions, have to run on different threads. Update: As comments point out, this message format matches Google's protocol buffers specification, meaning there's a good chance the 'reduce' code runs in another process or on another machine. Protocol buffers are a faster way of doing something like this than JSON or XML.
Thus, the "map" operation is where this code generates a message on the fast, real-time network processing thread, which then sends the message over to slower threads, or possibly even to other machines, that can insert the data into the database.
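The split described here can be sketched with a queue between two threads. This is only an illustration of the pattern under my own assumptions: the message is a plain Python dict rather than a protocol buffer, the "database" is an in-memory dict, and the names `map_stage`, `reduce_stage`, `run`, and the `onion_address` field are hypothetical.

```python
import queue
import threading

def map_stage(payloads, out_queue):
    """Fast path: do minimal work per payload and emit small messages."""
    for payload in payloads:
        for token in payload.split():
            if token.endswith(".onion"):
                out_queue.put({"onion_address": token})
    out_queue.put(None)  # sentinel: no more messages

def reduce_stage(in_queue, database):
    """Slow path: consume messages and do the expensive insertions."""
    while True:
        message = in_queue.get()
        if message is None:
            break
        key = message["onion_address"]
        database[key] = database.get(key, 0) + 1

def run(payloads):
    """Wire the two stages together and return the resulting 'database'."""
    messages = queue.Queue()
    database = {}
    worker = threading.Thread(target=reduce_stage, args=(messages, database))
    worker.start()
    map_stage(payloads, messages)
    worker.join()
    return database

if __name__ == "__main__":
    print(run(["visit abc.onion now", "abc.onion and def.onion"]))
```

Replacing the in-process queue with a network socket and a serialized message format is what would let the consuming side run in another process or on another machine.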
Why this isn't used above for the Tor bridges I don't know. I suspect that this map-reduce feature is a newer feature of the system, and they have yet to update old fingerprints.
By the way, this regex uses a very different way of parsing an optional port number in a URL than the method used to parse bridge addresses. Here, the code is (?::(\d+)){0,1} whereas for Tor bridges it is :?([0-9]{2,4}?[^0-9]). This hints at different authors writing these rules, or that these regexes were copied from sources on the Internet (like Snort rules).
The last item in the file is an "appid" rather than a "fingerprint". I'm assuming it's actually the same thing, but that, along with the fingerprint information, it contains additional data describing how to view the information.
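Here is a short Python sketch of how the `(?::(\d+)){0,1}` construct behaves. Only that port fragment is quoted from the file; the address part of the pattern and the names `ONION_RE` and `parse_onion` are my assumptions. Real 16-character onion addresses are base32 (a-z and 2-7), but I use `[a-z0-9]` so that the example URL in this post, which contains an 8, still matches.

```python
import re

# Assumed pattern: only "(?::(\d+)){0,1}" is quoted from the file.
# The outer (?: ... ) groups without capturing; the inner (\d+) captures
# just the digits, so the colon never ends up in the captured port.
ONION_RE = re.compile(r"([a-z0-9]{16})\.onion(?::(\d+)){0,1}")

def parse_onion(text):
    """Return (address, port) for the first onion URL found, else None.

    The port is None when the URL does not specify one.
    """
    match = ONION_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)

if __name__ == "__main__":
    print(parse_onion("http://o87asgd2435fyuil.onion:443/"))
    print(parse_onion("http://o87asgd2435fyuil.onion/"))
```

Unlike the bridge pattern, this form captures only digits and handles a missing port cleanly by leaving the group unset.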
I'm assuming that when the analyst wants to look at the data in this TCP session, that the console will automatically pull up an "ASCII viewer" that allows the analyst to look at raw text. I assume other viewers will be hexdump viewers, HTML, JPEG, and so forth.
Most of the previous fingerprints seem not to care about "full capture". They either wanted the metadata from the IP addresses, or they extracted the information they wanted themselves (like Tor bridge addresses or Onion addresses). In this case, we see that the analysts want to capture the full data of the session.
Also, I'm curious about the 'mixminion' hostname. The previous fingerprints hinted they were interested in exact matches, like "www.torproject.org", and not "www.torproject.org.robertgraham.com". This string suggests the opposite, that these strings are all partial matches. That means we can jam the system by including spurious data in URLs, hostnames, and emails.
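The difference between the two matching styles is easy to show in a few lines of Python. The function names are mine, and I'm only guessing that the real system does something like the substring version.

```python
def partial_match(needle, hostname):
    """Substring match: fires if the needle appears anywhere in the hostname."""
    return needle.lower() in hostname.lower()

def exact_match(needle, hostname):
    """Exact match: fires only if the whole hostname equals the needle."""
    return hostname.lower().rstrip(".") == needle.lower()

if __name__ == "__main__":
    decoy = "www.torproject.org.robertgraham.com"
    print(partial_match("www.torproject.org", decoy))  # decoy triggers
    print(exact_match("www.torproject.org", decoy))    # decoy does not trigger
```

If the system uses the substring style, anybody who controls a domain can generate hits for a fingerprint just by embedding the target string in a hostname.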
Conclusion
I'm an expert who doesn't have access to the full documents. This post is about speculation from what I've seen in the source, as somebody who has written numerous DPI applications. The source definitely seems like something the NSA would use to monitor network traffic, but at the same time, seems fairly limited in scope.