2014-03-16

I’ve been involved in several “big data” projects recently, and one of the catchphrases I’m hearing increasingly often is “Exascale Computing”.  Apparently there is a lot of enthusiasm in the tech industry for building gratuitously huge computers on the theory that they might somehow improve our lives or solve problems for us on a proportionate scale.  I’ve written many riveting articles on the subject of the “uncomputability” of certain kinds of problems regardless of how large or powerful a computer you attempt to apply to them.  While everybody is excited about the rapid rate at which Moore’s law is increasing our computing power, relatively little consideration seems to be given to how dramatically the way we write code needs to change to make any practical use of this power.  The general assumption seems to be that given a big enough computer, even our primitive code will somehow become more productive.  As I sit in my little garage laboratory and program my meager Tesla K40 supercomputer, I have quickly come to realize that nothing could be farther from the truth.  In fact, almost everything we think we know about software development, data structures and algorithms is rapidly becoming obsolete in the face of computing power that simply cannot be effectively harnessed by traditional means.

To try to express this realization in terms that a non-computer scientist can understand, let’s consider the way we evaluate the computational efficiency of various algorithms for computing a solution to a given problem.  In computer science we use a rough technique called Big-O notation to mathematically estimate the efficiency or scalability of a given algorithm for solving some class of computational problems.  Here’s a graph of several different types of algorithms and how their efficiency declines with scale.

[Graph: growth of running time for common Big-O complexity classes as the volume of data increases]

This graph illustrates how the computational efficiency of various algorithms declines in proportion to the volume of data they have to operate on.  Now for the sake of discussion, suppose we have an extremely fast supercomputer that needs to complete a calculation on an enormous amount of data.  Which of these algorithms will effectively never finish?  The exponential ones.  Any computational problem that we only know how to solve with exponential algorithms will receive rapidly diminishing returns from Moore’s law; even a thousandfold increase in raw computing power barely nudges the feasible problem size for an exponential algorithm.  That example was a little abstract, but I used it to illustrate as simply as possible the limited computability of certain kinds of problems.  Parallelizing these algorithms is analogous to just using a faster computer on them; their efficiency doesn’t change when you throw more processors at them.  Any algorithm that doesn’t scale with O(n^k) efficiency for k <= 1.0 (linear or better) will suffer diminishing performance returns no matter how much computing power is thrown at the problem.
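To make the point about diminishing returns concrete, here is a minimal back-of-the-envelope sketch (the 10^9-operation time budget and the 1000X speedup are arbitrary illustrative assumptions, not measurements of any real machine) that asks how much larger a problem a vastly faster computer can actually solve in the same amount of time for a few common complexity classes:

```cpp
// Rough back-of-the-envelope sketch: if a machine becomes 1000x faster, how
// much larger a problem can it solve in the same time budget?  The "budget"
// is measured in abstract operations; all numbers are illustrative only.
#include <cmath>
#include <cstdio>
#include <functional>

// Largest n such that cost(n) <= budget, found by doubling then bisection.
double max_n(const std::function<double(double)>& cost, double budget) {
    double lo = 1.0, hi = 1.0;
    while (cost(hi) <= budget) hi *= 2.0;     // bracket the answer
    for (int i = 0; i < 200; ++i) {           // bisect to convergence
        double mid = 0.5 * (lo + hi);
        (cost(mid) <= budget ? lo : hi) = mid;
    }
    return lo;
}

void row(const char* name, const std::function<double(double)>& cost) {
    const double budget = 1e9;                        // affordable "today"
    double now = max_n(cost, budget);
    double later = max_n(cost, budget * 1000.0);      // a 1000x faster machine
    std::printf("%-12s %-14.4g %-14.4g %.2fx\n", name, now, later, later / now);
}

int main() {
    std::printf("%-12s %-14s %-14s %s\n", "complexity", "today", "1000x faster", "gain");
    row("O(n)",       [](double n) { return n; });
    row("O(n log n)", [](double n) { return n * std::log2(n); });
    row("O(n^2)",     [](double n) { return n * n; });
    row("O(2^n)",     [](double n) { return std::pow(2.0, n); });
    // Typical result: the linear algorithm solves a roughly 1000x bigger
    // problem, the quadratic one about 32x bigger, and the exponential one
    // gains only about 10 more input items from a thousandfold faster machine.
}
```

The linear algorithm gets essentially the full benefit of the faster hardware while the exponential one gets almost none, which is exactly why the first principle below insists on linear-or-better scaling.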

In our current world of “little” computers operating on “little” data, the mathematical differences between the performance properties of the various algorithms we use seldom severely impact our productivity, but as computing moves towards faster computers and bigger data the “nuances” between different algorithms matter more and more.

Algorithms aren’t the only problem, however; there are several other previously ignorable “laws of computing” that start to influence the problems we can solve in ominous ways.  In this context I have been considering several “properties” of exascale software design that become increasingly important to understand as computing challenges get bigger.

Exascale algorithms must operate with O(n^k) efficiency with k <= 1.0

Data structures must be kept highly compact in memory

Algorithms and data structures must operate on highly coalesced memory

Data structures must be able to self-analyze and “rebalance” continuously in real-time

Algorithms must have minimal reliance on foreknowledge about the structure and size of the data they will be operating on

Algorithms must be highly parallelizable

Algorithms must be provably stable

Let’s briefly examine the reasons for each principle.

Exascale algorithms must operate with O(n^k) efficiency with k <= 1.0:

As described above, any algorithm that doesn’t meet this criterion will see diminishing performance returns at scale, and consequently accelerating hardware costs to compensate for the limits of the algorithm.  Here’s a handy cheat sheet that illustrates the Big-O performance of the most commonly used algorithms and data structures in modern computer science.

http://bigocheatsheet.com/

As you can see, most of the popular algorithms appear to be fine for exascale computing according to the first principle, but wait until we compare them to the other 6 principles.

Data structures must be kept highly compact in memory:

This one is a little less obvious.  As computers get faster, the rate at which data can be transferred from storage media, over a network or across RAM is falling further and further behind the computational rate of the processor.  To illustrate this abstract relationship, consider that the PCIe bus had a bandwidth limit of 4GB/second when it was introduced in 2003.  Eleven years later the PCIe 3.0 standard supports a maximum bandwidth of 15.5GB/second, an increase of 3.875X over that period.  In the same time Intel CPUs grew from roughly 50,000,000 transistors to over 2.6 billion, or 52X.  In other words, the rate at which we are able to move data around is growing dramatically more slowly than the rate at which we can compute on it.  This leads to the conclusion that spending extra compute cycles operating on highly compressed data is increasingly a worthwhile and NECESSARY performance tradeoff.
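As a small, hedged illustration of that tradeoff (the record layout and field widths below are invented for the example, not taken from any real system), compare a “convenient” record with a packed version of the same information; the packed form costs a few extra instructions to decode but moves roughly three times as many records per byte of precious bandwidth:

```cpp
// Sketch of the "compact in memory" tradeoff: spend a few extra instructions
// packing and unpacking fields so that far more records fit per cache line
// and per PCIe/network transfer.  Field widths here are purely illustrative.
#include <cstdint>
#include <cstdio>

// A "convenient" record: full-width fields, padded by the compiler.
struct LooseSample {
    double   value;      // full precision
    int64_t  timestamp;  // nanoseconds
    int32_t  sensor_id;
    bool     valid;
};                       // typically 24 bytes after padding

// A packed record: quantized value, delta-encoded time, bit fields.
struct PackedSample {
    uint32_t value_q16;       // value quantized to 16.16 fixed point
    uint32_t dt_us : 24;      // microseconds since the previous sample
    uint32_t sensor_id : 7;   // up to 128 sensors per block
    uint32_t valid : 1;
};                            // typically 8 bytes

int main() {
    std::printf("loose:  %zu bytes per sample\n", sizeof(LooseSample));
    std::printf("packed: %zu bytes per sample\n", sizeof(PackedSample));
    // Roughly 3x more packed samples move across the bus per second, at the
    // cost of a few shifts and multiplies to decode each one.
}
```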

Algorithms and data structures must operate on highly coalesced memory:

This is a close cousin of the second principle.  As the ratio of storage bandwidth to processing power declines, the importance of moving data around in large contiguous batches increases dramatically, because only large contiguous batches of data can easily and quickly be moved asynchronously and in parallel with computation.  This principle is also critical to the efficient operation of highly parallel algorithms, which perform best when multiple cores can operate independently on contiguous chunks of information with minimal collisions between threads.  Note that this requirement can also be in conflict with principle 2, since most known compression algorithms compress different data by different, unpredictable amounts.  What is needed is a class of compression techniques that produce “constant” compression.  Yes I know, it sounds strange, but I have found that it actually makes sense with a little thought.  The idea is to take chunks of data that fit in a known cache size and compress them as much as possible within that size.
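Here is a minimal sketch of what coalescence looks like in code, using a hypothetical particle dataset: a structure-of-arrays layout keeps each field in its own contiguous block, so a pass over a single field streams sequentially through memory and divides cleanly among threads.  The same layout also pairs naturally with the fixed-size “constant compression” chunks described above, since each contiguous field block can be compressed independently into a known cache-sized budget.

```cpp
// Sketch of memory coalescence with assumed particle data: a structure-of-
// arrays (SoA) layout keeps each field in one contiguous block, so a pass
// that touches only one field streams sequentially through memory and splits
// cleanly into independent per-thread (or per-warp) chunks.
#include <vector>

// Array-of-structures: fields are interleaved, so summing mass alone still
// drags x, y and z through the cache (4 useful bytes out of every 16 loaded).
struct ParticleAoS { float x, y, z, mass; };

// Structure-of-arrays: each field is a separate contiguous block that can be
// transferred, compressed and processed independently of the others.
struct ParticlesSoA { std::vector<float> x, y, z, mass; };

float total_mass_aos(const std::vector<ParticleAoS>& p) {
    float sum = 0.0f;
    for (const ParticleAoS& q : p) sum += q.mass;   // strided access
    return sum;
}

float total_mass_soa(const ParticlesSoA& p) {
    float sum = 0.0f;
    for (float m : p.mass) sum += m;                // fully contiguous access
    return sum;
}

int main() {
    ParticlesSoA soa;
    soa.x.assign(1000, 0.0f); soa.y.assign(1000, 0.0f);
    soa.z.assign(1000, 0.0f); soa.mass.assign(1000, 1.0f);
    std::vector<ParticleAoS> aos(1000, ParticleAoS{0.0f, 0.0f, 0.0f, 1.0f});
    return (total_mass_soa(soa) == total_mass_aos(aos)) ? 0 : 1;
}
```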

Data structures must be able to self-analyze and “rebalance” continuously in real-time:

Many algorithms and data structures we commonly rely on degrade in efficiency with scale AND with use, hash tables and search trees being classic examples.  The problem is that in extremely huge distributed computing environments, the impact of having to STOP the show to rebalance or recompute a data structure can be catastrophic.  If the system is live, then it either has to be taken offline to recompute OR a duplicate hardware configuration is required to maintain the live system while rebalancing takes place on an offline system of equivalent scale.  Recomputing a massive volume of data across a highly distributed system can also require a substantial additional programming and management effort.  The need to recompute can double the hardware requirements of a given system and can dramatically increase the complexity of the problem.  In this context, data structures and algorithms that can adapt continuously in use are highly desirable.

Note that the same should be said of the ability to analyze a live system in real-time.  As data sizes and compute demands grow, it becomes increasingly impractical to shuttle data off to a separate data warehouse for later analysis.  In an exascale world there is no time and no spare supercomputer to store and analyze data that will often have real-time relevance to the problem being computed.  This observation is also consistent with principle 2: once a piece of data has reached the processor, it’s wise to do as much computation on it as possible before returning a result to memory.  This can include data decompression, decryption, processing, analysis, compression, re-encryption and writing back to memory.  Unfortunately, a practical consequence of efficiently using exascale computers to process exascale data is that many traditional principles of object-oriented program design appear to break down.  The way we want to write code is increasingly in conflict with how it needs to work.
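As a toy sketch of what “rebalancing continuously” can mean in practice, here is a deliberately simplified hash map that, once it grows too full, migrates a handful of old buckets on every subsequent operation instead of stopping the world to rehash everything at once (a strategy similar to the incremental rehashing used by some in-memory stores such as Redis).  The class name, load-factor threshold and migration step size are all arbitrary choices for illustration:

```cpp
// Toy hash map that rehashes incrementally: when the load factor gets high it
// allocates a larger table and moves a few old buckets per operation, so no
// single insert or lookup ever pays for a full stop-the-world rehash.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <utility>
#include <vector>

class IncrementalMap {
    using Bucket = std::vector<std::pair<uint64_t, uint64_t>>;
    std::vector<Bucket> live_, next_;   // next_ is non-empty while migrating
    size_t migrate_pos_ = 0, count_ = 0;

    static size_t slot(uint64_t key, size_t buckets) {
        return std::hash<uint64_t>{}(key) % buckets;
    }

    // Move a few old buckets into the new table; called from every operation.
    void step_migration(size_t steps = 4) {
        while (!next_.empty() && steps-- && migrate_pos_ < live_.size()) {
            for (auto& kv : live_[migrate_pos_])
                next_[slot(kv.first, next_.size())].push_back(kv);
            live_[migrate_pos_++].clear();
        }
        if (!next_.empty() && migrate_pos_ == live_.size()) {  // migration done
            live_.swap(next_);
            next_.clear();
            migrate_pos_ = 0;
        }
    }

public:
    IncrementalMap() : live_(8) {}

    void put(uint64_t key, uint64_t value) {   // assumes unique keys (sketch)
        step_migration();
        if (next_.empty() && count_ + 1 > live_.size() * 4)   // load factor > 4
            next_.assign(live_.size() * 2, Bucket{});         // begin migration
        auto& table = next_.empty() ? live_ : next_;
        table[slot(key, table.size())].emplace_back(key, value);
        ++count_;
    }

    bool get(uint64_t key, uint64_t& value) {
        step_migration();
        for (auto* table : {&live_, &next_}) {   // key may be in either table
            if (table->empty()) continue;
            for (auto& kv : (*table)[slot(key, table->size())])
                if (kv.first == key) { value = kv.second; return true; }
        }
        return false;
    }
};

int main() {
    IncrementalMap m;
    for (uint64_t i = 0; i < 100000; ++i) m.put(i, i * i);
    uint64_t v = 0;
    std::printf("lookup ok: %d, value: %llu\n",
                m.get(12345, v), (unsigned long long)v);
}
```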

Algorithms must have minimal reliance on foreknowledge about the structure and size of the data they will be operating on:

In an exascale computing environment you are increasingly likely to find problems that involve the continuous processing of an infinite stream of unstructured and error-prone data.  For most practical problems we will no longer have the luxury of knowing when the data will arrive, how much will arrive, how “accurate” it is, or how compressible it is.  In general, computing cycles will be much cheaper to waste than bandwidth, which means that applying a high level of algorithmic intelligence to data as early as possible will be far preferable to storing it or moving it around in any unqualified form.  This is yet another example of a situation in which smarter code will save a lot of money in hardware.
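A concrete example of an algorithm with no reliance on foreknowledge is an online, single-pass statistic.  The sketch below uses Welford’s method to maintain a running mean and variance over a stream of unknown and possibly unbounded length in constant memory, discarding malformed records as they arrive rather than staging the raw stream for later cleanup (reading numbers from standard input is just a stand-in for whatever the real data source would be):

```cpp
// Streaming computation with no foreknowledge of the data: Welford's online
// algorithm keeps a running mean and variance in O(1) memory over a stream
// of unknown length, and simply rejects malformed records as they arrive.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <string>

struct RunningStats {
    uint64_t n = 0;
    double mean = 0.0, m2 = 0.0;   // m2 = sum of squared deviations

    void push(double x) {
        ++n;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }
    double variance() const { return n > 1 ? m2 / (n - 1) : 0.0; }
};

int main() {
    RunningStats stats;
    std::string line;
    uint64_t rejected = 0;
    while (std::getline(std::cin, line)) {     // stream of unknown length
        char* end = nullptr;
        double x = std::strtod(line.c_str(), &end);
        if (end == line.c_str() || !std::isfinite(x)) { ++rejected; continue; }
        stats.push(x);
    }
    std::printf("n=%llu mean=%.6g variance=%.6g rejected=%llu\n",
                (unsigned long long)stats.n, stats.mean, stats.variance(),
                (unsigned long long)rejected);
}
```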

Algorithms must be highly parallelizable:

…obviously… but I needed 7 principles and you really can’t leave it out…
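For completeness, here is the obligatory minimal sketch: a reduction split into contiguous per-thread chunks so that each core streams through its own private block of memory, with sharing deferred to a single cheap combine step at the end (the thread count and data size are arbitrary illustrative choices):

```cpp
// Minimal sketch of a parallel reduction: contiguous per-thread chunks, no
// shared writes until the final combine, so threads never contend.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

double parallel_sum(const std::vector<double>& data, unsigned threads) {
    std::vector<double> partial(threads, 0.0);
    std::vector<std::thread> pool;
    size_t chunk = (data.size() + threads - 1) / threads;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            size_t lo = t * chunk;
            size_t hi = std::min(data.size(), lo + chunk);
            double s = 0.0;
            for (size_t i = lo; i < hi; ++i) s += data[i];  // contiguous, private
            partial[t] = s;
        });
    }
    for (auto& th : pool) th.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);  // combine step
}

int main() {
    std::vector<double> data(1 << 20, 1.0);
    unsigned threads = std::thread::hardware_concurrency();
    if (threads == 0) threads = 4;              // fallback if count is unknown
    std::printf("sum = %.1f\n", parallel_sum(data, threads));
}
```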

Algorithms must be provably stable and the system must be continuously self-testing:

This one is a little subtle, but I constantly run into it.  How will you KNOW when exascale software or data has a bug?  How will you know if a result is correct?  Some of the bugs that exascale software and data encounter are complex beyond the comprehension of mortal minds.  The consequences of bugs in these systems can also be devastating.  The environments in which these systems operate can be so complex and demanding that there is simply no way to STOP the system to analyze its state, let alone reproduce it.  The data that produced the result may be so large that it couldn’t be stored and was lost as soon as the bug occurred.  Can each developer working on the system have a computer as big as the entire system to develop and test on?

This introduces a whole new domain of software “proving” and live testing.  Exascale systems will have to have their test code built into them, monitoring them continuously.  Key algorithms will have to be provably reliable because they will not be practically testable in any real-world situation.  You won’t be able to “trust” any result from such a system without a concurrent set of verification tests running alongside any important calculation.  Testing goes hand in hand with principle 5, which includes the statistical certainty that complex software and/or data always contains bugs whether or not you are aware of them.  As pointed out earlier, compute cycles will be cheap relative to memory bandwidth, and spending some of them on continuous error checking will often be more efficient than getting a highly computed result and having to halt the system to try to reproduce it.  The only way to test, debug and verify results on an exascale system will be in real-time, on a live but hopefully highly fault-tolerant system.
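As a small sketch of what “test code built into the system” might look like, the example below spends a cheap O(n) verification pass on the result of a more expensive computation (a sort, standing in for something far bigger): it checks that the output is ordered and that an order-independent checksum of the data is unchanged.  A passing check doesn’t prove the result is correct, but a failing one proves something went wrong, and it costs a tiny fraction of the original computation:

```cpp
// Built-in verification sketch: after an expensive computation (a sort), run
// cheap O(n) invariant checks that any corrupted result would very likely
// violate: ordering, plus an order-independent checksum showing that no
// element was dropped or altered.  The mixing constant is arbitrary.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// Commutative fingerprint: insensitive to element order, sensitive to content.
uint64_t fingerprint(const std::vector<uint64_t>& v) {
    uint64_t h = 0;
    for (uint64_t x : v) h += x * 0x9E3779B97F4A7C15ull + 1;
    return h;
}

int main() {
    std::vector<uint64_t> data(1 << 20);
    std::mt19937_64 rng(42);
    for (auto& x : data) x = rng();

    uint64_t before = fingerprint(data);   // checksum before the computation
    std::sort(data.begin(), data.end());   // the "expensive" result

    // Continuous, cheap verification: O(n) checks against an O(n log n) result.
    bool ordered = std::is_sorted(data.begin(), data.end());
    bool checksum_ok = (fingerprint(data) == before);
    std::printf("ordered=%d checksum_ok=%d\n", ordered, checksum_ok);
}
```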

In my own work, I find that I often run into situations where I cannot write an automated unit test because there is no “right” answer to the problem I’m solving, or no second way of knowing whether the answer is correct.  Absent a rigid definition of “correct”, I find myself resorting to writing tests that verify the reproducibility of a result, the symmetry of a result, the stability of the algorithm that produced the result, and so on.  These tests are reassuring but hardly definitive.  I find it interesting that as software systems become larger and more complex, it becomes less practical to ask whether they are computing the correct result and more important to find ways to bound the circumstances in which they can be wrong.  It is often easier to write a test that proves a solution is wrong than it is to write a test that proves it is correct.
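In that spirit, here is a minimal sketch of a reproducibility-style test when there is no known “right” answer: compute the same reduction two different ways (two different summation orders) and require the results to agree within a tolerance.  Agreement doesn’t prove correctness, but disagreement proves something is wrong, which is often the most that can be verified on a live system; the tolerance used here is an arbitrary illustrative choice:

```cpp
// Reproducibility test without a ground truth: sum the same data in two
// different orders and require agreement within a tolerance.  A mismatch
// proves an error (or unacceptable numerical instability); a match merely
// bounds how wrong the result can be.
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

double forward_sum(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Pairwise (divide-and-conquer) summation: a different association order.
double pairwise_sum(const std::vector<double>& v, size_t lo, size_t hi) {
    if (hi - lo <= 64) {
        double s = 0.0;
        for (size_t i = lo; i < hi; ++i) s += v[i];
        return s;
    }
    size_t mid = lo + (hi - lo) / 2;
    return pairwise_sum(v, lo, mid) + pairwise_sum(v, mid, hi);
}

int main() {
    std::vector<double> data(1 << 20);
    std::mt19937_64 rng(7);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    for (auto& x : data) x = dist(rng);

    double a = forward_sum(data);
    double b = pairwise_sum(data, 0, data.size());
    double tolerance = 1e-9 * data.size();   // illustrative bound only
    std::printf("forward=%.12g pairwise=%.12g agree=%d\n",
                a, b, std::fabs(a - b) <= tolerance);
}
```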

For me the most interesting challenge associated with writing exascale software is that the classical principles of good OOP (Object Oriented Programming) software design seem to constantly conflict with the way software needs to be written to function efficiently at extremely large scales.  This leads me to conclude that our tools and the way we think about coding will undergo a major revolution as we are forced to find new ways to code productively under extremely complex and demanding new conditions.  The C++ OOP paradigm generally invites us to group data together with the functions that operate on it, and to isolate and abstract that data from unrelated operations.  The principles of exascale computing increasingly appear to be in conflict with that paradigm, as the need for extreme parallelism, data coalescence (keeping the same kind of data together in contiguous blocks of memory) and minimal memory transfer overhead all push us towards keeping data of the same type together and performing as much work on it as possible at once, even if the computations are “unrelated”.  Without claiming to see clearly what will come next, it “feels” as though the OOP paradigm is becoming obsolete, or at least the tools that implement it are losing relevance.

