Goparallel.sourceforge.net

Rewriting AI code for more parallelization yields huge performance boost

2016-09-21

Xeon Phi boosts image-recognition 50x, general purpose Xeon CPUs net over 25x increase in performance

Less than 5% of workloads run on CPU/GPU combinations today, but Intel has demonstrated how recoding for its Xeon Phi processor along with Xeon CPUs can yield dramatic performance boosts, as outlined in this blog by Slashdot Media Contributing Editor John O’Donnell.

Intel Corp. is backing up its claim to leadership in the market for machine-learning platforms with tests that show its Xeon Phi can boost the performance of machine-learning apps by more than 50 times.

Decision-makers will increasingly depend on more sophisticated analytics. These go so far beyond simple pattern matching in large data sets that the applications performing them will have to teach themselves to analyze data more efficiently in order to reach sufficiently deep insight.

“Artificial Intelligence (AI) is the next big wave of compute that will transform the way businesses operate and how people engage with the world,” according to an Aug. 23 blog by Jason Waxman, corporate vice president in Intel’s Data Center Group and general manager of the company’s Data Center Solutions Group.

Intel staked out its claim as the leader in the market for artificial intelligence/machine-learning platforms in August, following a series of announcements about design elements that will optimize future versions of Xeon Phi and other Intel products for high performance in compute-intensive AI, deep learning, neural networking and machine learning applications.

More than 97 percent of machine-learning workloads run on servers driven only by CPUs, not the GPU/CPU combination on which many leading supercomputers are based, Waxman wrote.

Xeon Phi, which has strengths of both a CPU and GPU — and a long history as a supercomputing co-processor — is a workhorse in the demanding market for high-performance computing (HPC) hardware. With optimizations designed to make it an even better-performing platform for AI and machine learning, Waxman wrote, Xeon Phi will become a key driver of the machine-learning/analytics-driven economy.

AI and machine-learning applications are notoriously compute-intensive, however, and, as it turns out, they are rarely designed in ways that get the best performance from the hardware on which they run, according to a study first published by Colfax Research in June 2016.

The Knights Landing version of Xeon Phi has high-performance capabilities that make it ideal as a platform for machine-learning applications, according to the report. Xeon Phi is especially strong in Basic Linear Algebra Subprograms (BLAS) routines such as general matrix-matrix multiplication (GEMM), and in other building blocks of machine-learning applications.
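To illustrate why GEMM counts as a building block, note that applying a fully connected layer to a batch of inputs is a single matrix-matrix product. The minimal Python sketch below compares a scalar triple loop with one call that NumPy hands to whichever BLAS library it was built against. NumPy, the function names and the array sizes are assumptions chosen for illustration; the Colfax study worked with Torch, not with this code.

```python
import numpy as np

def gemm_loops(a, b):
    """Reference GEMM: C = A x B, one scalar multiply-add at a time."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=np.float64)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i, p] * b[p, j]
            c[i, j] = acc
    return c

def gemm_blas(a, b):
    """Same product, delegated to NumPy's matmul (BLAS-backed for float arrays)."""
    return a @ b

# Hypothetical layer: a batch of 8 inputs with 16 features, mapped to 4 outputs.
rng = np.random.default_rng(0)
batch = rng.standard_normal((8, 16))
weights = rng.standard_normal((16, 4))

slow = gemm_loops(batch, weights)
fast = gemm_blas(batch, weights)
print("results agree:", np.allclose(slow, fast))
```

Both functions compute the same numbers. The difference is that the second form leaves the loop ordering, vector instructions and threading to the linked BLAS library, which is where an optimized implementation can pay off.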

To discover what it would take to customize machine-learning applications for the best performance on Xeon Phi chips, and to find ways to make those customizations easier, Colfax researchers tested an image-recognition application called NeuralTalk2.

NeuralTalk2 is open-source code that uses machine learning to analyze complex real-life photographs and produce verbal descriptions of the relationships among people and objects in those photographs.

NeuralTalk2 uses a convolutional neural network for image analysis, and a recurrent neural network of the long short-term memory (LSTM) type to create the captions.

It is written in Lua, a well-known and stable scripting language, on top of Torch, a computing framework that is designed to support many machine-learning algorithms but relies on the highly parallel computing of GPUs to keep its performance high.

The tested system also used Theano, a Python library designed to optimize the execution of complex calculations using multi-dimensional data arrays at nearly the speed of applications written in C. “It can also surpass C on a CPU by many orders of magnitude by taking advantage of recent GPUs,” according to the project source page.

Despite the sophistication of the software, however, “out-of-box performance of NeuralTalk2 on Intel architecture is sub-optimal due to inefficient usage of Intel Architecture capabilities by the Torch library,” according to the Colfax report.

Updating that code to get optimal performance meant teaching several layers of sophisticated analytics to make better use of the highly parallel processors, rebuilding Torch libraries with a C compiler from Intel, and linking the BLAS and Linear Algebra Package (LAPACK) calculations to Intel’s Math Kernel Library (Intel MKL).

Intel MKL is a library of mathematical functions in C, C++, Python and other languages that is optimized for high performance on Intel processors. It makes efficient use of highly parallel processing for the types of calculations that weigh heavily in machine-learning applications, including neural networks, vector math, statistics, linear algebra, fast Fourier transforms and other techniques. Intel distributes the MKL code at no cost to researchers.

Colfax researchers expanded the vectorization and parallel-processing capabilities of Torch by vectorizing loops in the neural network, LSTM and GEMM code, by replacing the initial array sorting with a higher-performing search algorithm, and by running several multithreaded instances of NeuralTalk2 simultaneously after pinning their processes to cores within the Xeon Phi processors.
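The idea of pinning several instances to separate groups of cores can be sketched in a few lines of Python. This is not the Colfax team's code: `partition_cores` and `pin_current_process` are hypothetical helper names, the choice of four instances is arbitrary, and the pinning call relies on `os.sched_setaffinity`, which is available on Linux only.

```python
import os

def partition_cores(cores, n_instances):
    """Split core IDs into n_instances contiguous, near-equal groups."""
    cores = list(cores)
    if n_instances < 1 or n_instances > len(cores):
        raise ValueError("need between 1 and len(cores) instances")
    base, extra = divmod(len(cores), n_instances)
    groups, start = [], 0
    for i in range(n_instances):
        size = base + (1 if i < extra else 0)  # first `extra` groups get one more core
        groups.append(cores[start:start + size])
        start += size
    return groups

def pin_current_process(core_ids):
    """Pin the calling process to core_ids where the OS supports it (Linux)."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # no affinity API on this platform
    os.sched_setaffinity(0, set(core_ids))
    return sorted(os.sched_getaffinity(0))

if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        available = sorted(os.sched_getaffinity(0))
    else:
        available = list(range(os.cpu_count() or 1))
    plan = partition_cores(available, min(4, len(available)))
    print("core plan:", plan)
    # Each launched instance would call pin_current_process(plan[i]) at startup.
    print("this process now runs on:", pin_current_process(plan[0]))
```

Keeping each instance on its own group of cores stops the operating system from migrating its threads, so instances do not compete for the same caches. How many instances and how many threads per instance work best would have to be measured on the target machine.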

The result was an increase in performance of more than 50 times compared to similar applications running on CPU/GPU setups rather than Xeon Phi.

The optimized code also showed a 25x boost in performance on general-purpose Broadwell family Intel Xeon processors.

Researchers noted in their conclusion that the NeuralTalk2 optimization worked on general-purpose as well as Xeon Phi processors, widening the potential benefit and highlighting the wide availability of highly parallel platforms. They predicted even greater increases in performance with improvements in parallelization of GEMM code, and, in the future, by using Intel’s implementation of the Message Passing Interface (MPI) to help scale the application across a cluster of Xeon Phi systems rather than just one system.

“Our industry needs breakthrough compute capability — capability that is both scalable and open — to enable innovation across the broad developer community,” Waxman wrote.

Intel’s announcements at the Intel Developer Forum in August will extend the same performance advantages highlighted by the Colfax test. They include improvements in the MKL, in Intel’s Deep Learning Neural Network and in the Intel Deep Learning SDK; disclosure of Knights Mill, the AI-optimized version of Xeon Phi; and the acquisition of machine-learning specialty company Nervana Systems.

“As data sets continue to scale, Intel’s strengths will shine,” Waxman wrote.

Source: Colfax Research

Machine Learning on 2nd Generation Intel® Xeon Phi™ Processors: Image Captioning with NeuralTalk2, Torch

The post Rewriting AI code for more parallelization yields huge performance boost appeared first on Go Parallel.