Gumdaeng.com

20160617

2016-06-17

성효진 박사님, DeNovo: A Software-Driven Rethinking of the Memory Hierarchy

Title: DeNovo: A Software-Driven Rethinking of the Memory Hierarchy

Abstract: As multicore systems become widespread, both software and hardware face a major challenge in efficiently exploiting and implementing parallelism. While shared-memory remains a popular programming model due to the global address space, it is plagued with undisciplined programming practices allowing implicit communication and unstructured non-determinism. Such “wild” shared-memory behavior not only makes it difficult to test and maintain software, but also complicates hardware, preventing it from scaling in a power-efficient manner. Recent research has proposed replacing the “wild” shared-memory programming models with a more disciplined approach. The DeNovo project asks the question: if software is more disciplined, can we build a more efficient memory hierarchy? Focusing on deterministic programs as a discipline to drive DeNovo, we first show that coherence and communication can be made much simpler and efficient than the current state-of-the-art. The resulting protocol is without transient states, invalidation traffic, directory sharer-lists, or false sharing – all significant sources of inefficiencies in existing protocol. Widening the software space further, we then show how DeNovo can support software with disciplined non-determinism without giving up its benefits for deterministic programs. Our software-driven approach does not stop there at simplifying coherence and consistency for deterministic and non-deterministic data accesses. We present a simple yet efficient hardware mechanism to support arbitrary synchronization, a big step towards our eventual goal of supporting legacy programs. In summary, DeNovo shows the potential for comprehensive and commercially viable shared-memory systems with higher complexity, performance, and energy-efficiency than today’s software-oblivious hardware.

Bio: Hyojin Sung is a research staff member at IBM TJ Watson research center. Her research interests are computer architecture, parallel computing, and compiler/runtime technologies. She received the Ph.D. in computer science from Illinois in 2015. She earned her bachelor’s in literature and computer science from Seoul National University and the M.S. in computer science from UC San Diego. She is a recipient of W.J. Poppelbaum memorial award in 2014 and a two time winner of the Feng Chen memorial award. Her research publications include a best paper awardee and a selection to Micro Top Picks journal.

Memory hierarchy를 어떻게 다시 디자인할 것인가. 소프트웨어, 하드웨어를 함께 다시 디자인한다면 하드웨어를 어느 정도까지 효율적으로 디자인할 수 있을 것인가? 하드웨어 디자인에서 에너지가 중요한 요소이다. 멀티코어를 사용하게 됨. 이러한 패러다임의 변화는 하드웨어 및 소프트웨어 모두에 영향을 주고 있음. 하드웨어 측면에서는 어떻게 메모리 계층을 에너지 효율적으로 만들 것인지, 소프트웨어 측면에서는 어떻게 병렬 프로그램을 만들 것인지. Motivation: 현재 SW, HW 모두 공유 메모리 구조를 선호함. 공유 메모리 SW에서는 implicit communication, data races, non-determinism이 문제이다. 공유 메모리 HW에서는 이러한 문제를 해결하기 위해 complex coherence / consistency, unnecessary traffic이 발생. SW / HW 모든 측면에서 문제가 되고 있는 상황이다. Scalability가 떨어짐. SW / HW 사이에 적절한 조율이 이루어지지 않고 있다는 것이 근본적인 문제이다.

이러한 문제를 해결하기 위해 disciplined parallelism을 위한 전체 스택을 새롭게 설계하고자 함. 공유 메모리 소프트웨어에서는 구조화된 병렬 제어(아무 곳에서나 thread를 만드는 것이 아니라, 정해진 곳에서 만들도록), side-effect 명시를 수행 (이 루틴에서 읽고 쓰는 변수를 명시 – 이렇게 함으로써 side-effect를 제어할 수 있음). 이처럼 disciplined parallelism을 구현할 때, shared memory hardware를 간단하게 구현할 수 있고, 최적화할 수 있다는 점이 이 연구의 핵심. 그리고 이렇게 하드웨어를 디자인했을 때, 소프트웨어는 어떻게 디자인되어야 하는지도 연구의 목표. Software-driven hardware design을 수행하고자 함. deterministic data access / non-deterministic data access / racy synchronization access로 구분하고, 가장 간단한 deterministic data access를 대상으로 우선 하드웨어 구현. 그리고 그 뒤로 나머지 data access를 다루기 위한 메커니즘을 추가한다. DeNovo(PACT’11), DeNovoND(ASPLOS’13)… 등등에서 발표.

이 프로젝트는 소프트웨어에 대한 가정을 갖고 시작함. 소프트웨어들이 undisciplined라면 coherence protocol이 필요할 수밖에 없음. Disciplined determinism이 보장된다면 복잡한 하드웨어가 필요하지 않음. Disciplined determinism은 1) structured parallel control, 2) metadata about read and write accesses, 3) Data-race freedom guarantee로 이루어짐. Fork-join parallelism을 기본으로 한다. Heap을 region으로 나누고, 각 thread가 접근하는 region을 metadata로 기록한다. 그리고 이 정보를 기반으로 data-race freedom을 확인할 수 있다. 이같은 구조에서는 determinism을 보장할 수 있다. 어떤 변수를 읽을 때, 그 변수는 같은 쓰레드에서 쓰인 값이거나, 이전 parallel phase에서 쓰인 값이다.

Cache coherence를 유지하기 위해서는 1) local cache에 있는 local copy가 stale한지 알아야 한다. 2) 그리고 최신 copy가 어디에 있는지 알 수 있어야 한다. Compiler가 writeable region을 알기 때문에 coherence message 없이, 자기 자신의 write할 수 없는 region에 대해 self invalidate할 수 있다(해당 region은 다른 thread에 의해 write되었을 것이므로). 이처럼 DeNovo coherence protocol을 구현하면 directory를 구현할 필요가 없다 (space overhead 없음). 그리고 transient state도 필요없다. 기존의 coherence protocol은 읽는 중에 갑자기 쓰기 연산이 들어올 수 있으므로 transient state가 필요함. 하지만 software에 race가 없다고 가정한다면 transient state가 필요하지 않다. Self invalidation을 수행하므로, invalidation traffic이 필요하지 않다. False sharing도 없음. 프로토콜은 locality 위해 line-based로 구현.

DeNovo는 결정적인 프로그램을 위해 디자인되었다. conflicting concurrent access를 허용하지 않는다. Barrier synchronization만을 허용함. Side-effects는 모두 미리 알고 있음. non-deterministic program에 대해서는 DeNovo를 사용할 수 없음. Non-deterministic program이 deterministic program보다 더 효율적이기 때문에 문제가 됨. Non-deterministic program을 위해 DeNovoND를 연구. DeNovoND에서는 서로 다른 쓰레드가 같은 region을 공유하는 것을 허용한다 (lock으로 보호). 그렇다면 읽기 전에 local copy를 invalidate해야 하는데, 언제 invalidate하는가? Store 명령 이후 load 발생 전에 invalidate하면 된다. 무엇을 invalidate하는가? Bloom filter를 사용해 추적한다. atomic effects를 갖는 접근에 대해서만 추적하면 된다. Bloom filter는 write 연산이 발생할 때에 업데이트되고, read 연산을 할 때 Bloom filter를 보고 invalidate할지 여부를 결정한다. Bloom filter는 core별로 가지며, lock transfer할 때 Bloom filter가 함께 transfer된다. Bloom filter는 phase가 끝날 때마다 reset한다. Region이 많다면 Bloom filer가 saturate될 수 있고, 이 때에는 주기적으로 Bloom filter를 reset한다. 하지만 실제 실험에서 Bloom filter가 saturate 되는 경우는 거의 없었다.

DeNovo와 MESI를 Murphi model checker로 검증했을 때, 15배 더 적은 상태가 발생함. 20배 더 짧은 검증 시간. 성능은 Simics + GEMS + Garnet으로 검증. 64개 코어, in-order core model에서 실험. 실행 시간에 있어서는 DeNovo / DeNovoND와 MESI가 큰 차이 없음. False sharing이 많은 워크로드라면 DeNovo / DeNovoND가 더 빠름. DeNovo는 word 단위로 coherence state를 추적하기 때문이다. Network traffic은 DeNovo / DeNovoND가 MESI보다 35~36% network traffic이 적다.

현재 컴퓨터 구조에서 coherence protocol이 복잡하고, 네트워크 트래픽이 너무 많이 발생한다. DeNovo는 transient state가 없어 protocol이 단순하고, 검증이 쉽다. 네트워크 트래픽이 적게 발생한다. Directory overhead 또한 없으며, region based coherence로 인해 false sharing, invalidation / ack message가 불필요하다. DeNovo의 검증이 MESI보다 20배 빠르다. 하드웨어적으로는 cache coherence 검증이 쉬워졌으나, 소프트웨어적으로 coherence 검증은 어려워졌을 수 있다. 하지만 한편으로는 그렇게 프로그래밍을 하는 것이 우리가 앞으로 나아가야 할 방향일 수도 있다. 이렇게 coherence model을 바꿀 때 programming model의 표현력이 줄어들 수 있지 않나? Structured programming model만으로도 충분히 모든 논리를 표현할 수 있다는 것이 프로그래밍 언어 연구자들의 설명. 하지만 한편으로는 성능 측면에서는 문제가 될 수도 있음.

추가 연구는 설명을 끝까지 듣지 못함. Synchronization에 대한 이야기. Synchronization은 기본적으로 racy할 수밖에 없는데, 이에 대해서는 어떻게 할 것인가? DeNovoSync에서는 software에 대한 요구사항을 낮춤.