Our thanks to Telvis Calhoun, Zach Hanif, and Jason Trost of Endgame for the guest post below about their BinaryPig application for large-scale malware analysis on Apache Hadoop. Endgame uses data science to bring clarity to the digital domain, allowing its federal and commercial partners to sense, discover, and act in real time.
Over the past three years, Endgame received 40 million samples of malware equating to roughly 19TB of binary data. In this, we’re not alone. McAfee reports that it currently receives roughly 100,000 malware samples per day and received roughly 10 million samples in the last quarter of 2012. Its total corpus is estimated to be about 100 million samples. VirusTotal receives between 300,000 and 600,000 unique files per day, and of those roughly one-third to half are positively identified as malware (as of April 9, 2013).
This huge volume of malware offers both challenges and opportunities for security research, especially applied machine learning. Endgame performs static analysis on malware in order to extract feature sets used for performing large-scale machine learning. Since malware research has traditionally been the domain of reverse engineers, most existing malware analysis tools were designed to process single binaries or multiple binaries on a single computer and are unprepared to confront terabytes of malware simultaneously. There is no easy way for security researchers to apply static analysis techniques at scale; companies and individuals that want to pursue this path are forced to create their own solutions.
Our early attempts to process this data did not scale well with the increasing flood of samples. As the size of our malware collection increased, the system became unwieldy and hard to manage, especially in the face of hardware failures. Over the past two years we refined this system into a dedicated framework based on Hadoop so that our large-scale studies are easier to perform and are more repeatable over an expanding dataset.
To address this problem, we created an open source framework, BinaryPig, built on Hadoop and Apache Pig (utilizing CDH, Cloudera’s distribution of Hadoop and related projects) and Python. It addresses many issues of scalable malware processing, including dealing with increasingly large data sizes, improving workflow development speed, and enabling parallel processing of binary files with most pre-existing tools. It is also modular and extensible, in the hope that it will aid security researchers and academics in handling ever-larger amounts of malware.
Major Design Points
One of the most significant design principles of BinaryPig is the ability to ingest, store, and process large numbers of small files without affecting performance – an area of potential improvement for Hadoop. BinaryPig’s ingest tool combines thousands of small binary files into one or more large Hadoop SequenceFiles (each file typically ranging from 10GB to 100GB in size), an internal data storage format that enables Hadoop to take advantage of sequential disk read speeds as well as split these files for parallel processing. The SequenceFiles created by BinaryPig are a collection of key/value pairs where the key represents the ID of the binary file (typically MD5 checksum) and the value is the contents of the file represented as raw bytes. The malware data is properly distributed across computing nodes in the cluster, and is resilient to failures, via HDFS.
BinaryPig’s data processing architecture
Within Pig, analysis scripts and their supporting libraries are uploaded to Hadoop’s DistributedCache before job execution and destroyed from the processing nodes afterward. This functionality allows users to quickly publish analytical scripts and libraries to the processing nodes without the general concerns of ensuring deployment operated as expected or the time associated with a manual deployment.
We also developed a series of Pig loader functions that read SequenceFiles, copy binaries to the local filesystem, and enable scripts to process them. These load functions were designed to allow these analysis scripts to read the files dropped onto the local filesystem. Thus BinaryPig users can utilize pre-existing malware analysis scripts and tools. We have built in support for hashing (md5, sha1, sha256, etc), pehash, yara, clamav, and generic script execution. We have used this with some open source tools such as peframe as well as several internally developed malware analysis scripts.
In essence, we are using HDFS to distribute our malware binaries across our cluster and using MapReduce to push the processing to the node where the data is stored. This approach prevents needless data copying as seen in many legacy cluster computing environments. Data emitted from the system is also stored into HDFS for additional batch processing as needed.
We also typically load analysis results into ElasticSearch indices. This distributed index is used for manual exploration by individual users, as well as automated extraction by analytical processes.
BinaryPig has enabled us to conduct several research studies across our entire malware dataset that were either not possible or too time consuming to be useful before. Some of these studies include:
Leveraging our Hadoop cluster to perform virus scanning of all our malware samples periodically as virus signature files are updated. This is useful since most of the malware we receive is not detected by commercial AV systems for several months.
Extracting ICO image files from all our malware in order to cluster malware by embedded icon files.
Creating the ability to perform a repeatable malware census and comparisons at scale in just a matter of hours.
Scanning our entire malware set with newly developed/released Yara signatures in order to determine if any of our old malware matches new signatures. Frequently the open source community releases Yara signatures for interesting malware before AV vendors release signatures so we can leverage these cutting edge signatures to detect threats faster.
A sampling of real icons extracted from our malware set
Conclusion
As you can see, scalable malware analysis is another good example of a Hadoop use case premised on the ingestion, storage, and processing of huge amounts of data in parallel (in this case, even for data in the form of numerous small files – traditionally, not a strong point for Hadoop).
For more details about BinaryPig’s architecture and design, read our paper from Black Hat USA 2013 or check out our presentation slides. BinaryPig is an open source project under the Apache 2.0 License, and all code is available on Github.