An overview of Cloudera’s contributions to YARN that help support management of multiple resources, from multi-resource scheduling to node-level enforcement
As Apache Hadoop becomes ubiquitous, it is increasingly common for users to run diverse sets of workloads on it, and those jobs are likely to have very different resource profiles. For example, a MapReduce distcp job or a Cloudera Impala query that does a simple scan on a large table may be heavily disk-bound and require little memory. In contrast, an Apache Spark (incubating) job executing an iterative machine-learning algorithm with complex updates may want to store the entire dataset in memory and use spurts of CPU to perform complex computation on it.
For that reason, the new YARN framework in Hadoop 2 allows workloads to share cluster resources dynamically between a variety of processing frameworks, including MapReduce, Impala, and Spark. YARN currently handles memory and CPU and will coordinate additional resources like disk and network I/O in the future.
Accounting for memory, CPU, and other resources separately confers several advantages:
It allows us to treat tenants on a Hadoop cluster more fairly by rationing the resources that are most heavily utilized at any given time.
It makes resource configuration more straightforward, because a single resource does not need to be used as a proxy for others.
It provides more predictable performance by not oversubscribing nodes, and protects higher-priority workloads with better isolation.
Finally, it can increase cluster utilization because all the above mean that resource needs and capacities can be configured less conservatively.
One of Cloudera’s top priorities in Cloudera Enterprise 5 (in beta at the time of writing) is to provide smooth and powerful resource management functionality on Hadoop via YARN. In this post, I’ll describe the work we’ve done recently to allow YARN to support multiple resources, from multi-resource scheduling with Dominant Resource Fairness in the Fair Scheduler to enforcement on the node level with cgroups. The changes discussed below are included in YARN/MR2 in CDH 4.4 and going forward.
Background
In Hadoop 1, a single dimension, the “slot”, represented resources on a cluster. Each node was configured with a number of slots, and each map or reduce task occupied a single slot regardless of how much memory or CPU it used.
This approach offered the benefit of simplicity but had a few disadvantages. Because of the coarse-grained abstraction, it was common for a node’s resources to be over- or under-subscribed. Initially, YARN improved this situation by switching to memory-based scheduling: in YARN, each node is configured with a set amount of memory, and applications (such as MapReduce) request containers for their tasks with configurable amounts of memory. More recently, YARN added CPU as a resource in the same manner: nodes are configured with a number of “virtual cores” (vcores), and applications specify a number of vcores when requesting a container. In almost all cases, a node’s vcore capacity should be set to the number of physical cores on the machine. CPU capacity is configured with the yarn.nodemanager.resource.cpu-vcores property, and the mapreduce.map.cpu.vcores and mapreduce.reduce.cpu.vcores properties can be used to change the CPU request for MapReduce tasks from the default of 1.
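For example, a node with eight physical cores might be configured like this (a minimal sketch; the property names are those mentioned above, and the values are purely illustrative):

```xml
<!-- yarn-site.xml: advertise the node's CPU capacity as 8 vcores
     (set this to the number of physical cores on the machine) -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>

<!-- mapred-site.xml: raise the per-task CPU request from the default of 1
     for CPU-heavy map and reduce tasks (illustrative values) -->
<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>2</value>
</property>
<property>
  <name>mapreduce.reduce.cpu.vcores</name>
  <value>2</value>
</property>
```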
Dominant Resource Fairness
Perhaps the most difficult challenge when managing multiple resources is deciding how to share them fairly. With a single resource, moving toward a fair allocation across a set of agents is pretty straightforward. When there’s space to place a container, we give it to the agent that has the least total memory already allocated. The Hadoop Fair Scheduler has been taking this approach for years, allocating containers to the agent (pool or application) with the smallest current allocation. (A recent blog post describes how we extended this model to a hierarchy of pools.)
But what if you want to schedule both memory and CPU, and you need them in possibly different and changing proportions? If you have 6GB and three cores, and I have 4GB and two cores, it’s pretty clear that I should get the next container. What if you have 6GB and three cores, but I have 4GB and four cores? Sure, you have a larger total number of units, but cores might be more valuable. To complicate things further, I might care about CPU more than you do.
To navigate this problem, YARN drew on recent research from Ghodsi et al. at UC Berkeley that presents a notion of fairness that works with multiple resources. The researchers chose this notion, called dominant resource fairness (DRF), over some other options because it provides several properties that are trivially satisfied in the single-resource case but more difficult in the multi-resource case. For example, it is strategy-proof, meaning that agents cannot increase the amount of resources they receive by “lying” about what they need, and it incentivizes sharing, meaning that agents with separate clusters of equal size can only stand to benefit by merging their clusters.
The basic idea is as follows: My share of a resource is the ratio between the amount of that resource allocated to me and the total capacity on the cluster of that resource. So if the cluster has 10GB and I have 4GB, then my share of memory is 40%. If the cluster has 20 cores and I have 5 of them, my share of CPU is 25%. My dominant share is simply the max of resource shares, in this case 40%.
With single resource fairness, we try to equalize the shares of that resource across agents. When it’s time to allocate a container, we give it to the agent with the lowest share. With DRF, we try to equalize the dominant shares across agents, where the dominant share can come from a different resource for each agent. If you have 1GB and 10 cores, I get the next container, because my 40% dominant share of memory is less than your 50% dominant share of CPU.
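Here’s a tiny illustrative sketch, not YARN’s own code, that reproduces the numbers from the example above:

```java
// Dominant share = max over resources of (allocated / cluster capacity).
// The next container goes to the agent with the smaller dominant share.
public class DrfExample {
  static double dominantShare(double memUsed, double vcoresUsed,
                              double clusterMem, double clusterVcores) {
    double memShare = memUsed / clusterMem;       // e.g. 4GB of 10GB = 40%
    double cpuShare = vcoresUsed / clusterVcores; // e.g. 5 of 20 cores = 25%
    return Math.max(memShare, cpuShare);
  }

  public static void main(String[] args) {
    double clusterMemGb = 10, clusterVcores = 20;

    double mine  = dominantShare(4, 5,  clusterMemGb, clusterVcores); // 0.40, memory-dominant
    double yours = dominantShare(1, 10, clusterMemGb, clusterVcores); // 0.50, CPU-dominant

    // Smaller dominant share wins: 40% < 50%, so I get the next container.
    System.out.println(mine < yours ? "I get the next container" : "You do");
  }
}
```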
We can use DRF when working with hierarchical queues similarly to how we do so with a single resource. We assign containers by starting at the root queue and traversing the queue tree. At each queue, we explore its child queues in the order given by our fairness policy, i.e. smallest dominant share first, as in the sketch below.
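To make the traversal concrete, here is a minimal sketch, assuming hypothetical Schedulable, AppSketch, and QueueSketch classes rather than the Fair Scheduler’s actual internals:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

abstract class Schedulable {
  // Dominant share of this queue or application; in the real scheduler this is
  // derived from aggregate resource usage rather than stored as a plain field.
  abstract double dominantShare();
  // Try to place one container in this subtree; return true if one was placed.
  abstract boolean assignContainer();
}

class AppSketch extends Schedulable {
  double share;
  int pending;                         // outstanding container requests
  AppSketch(double share, int pending) { this.share = share; this.pending = pending; }
  @Override double dominantShare() { return share; }
  @Override boolean assignContainer() {
    if (pending == 0) return false;
    pending--;                         // pretend a container was launched for this app
    return true;
  }
}

class QueueSketch extends Schedulable {
  double share;                        // dominant share of the queue's aggregate usage
  List<Schedulable> children;          // child queues and/or applications
  QueueSketch(double share, List<Schedulable> children) { this.share = share; this.children = children; }
  @Override double dominantShare() { return share; }
  @Override boolean assignContainer() {
    // Least-served child first: DRF ordering is applied at every level of the tree.
    children.sort(Comparator.comparingDouble(Schedulable::dominantShare));
    for (Schedulable child : children) {
      if (child.assignContainer()) return true;
    }
    return false;
  }

  public static void main(String[] args) {
    AppSketch busy  = new AppSketch(0.5, 3);   // 50% dominant share
    AppSketch light = new AppSketch(0.4, 3);   // 40% dominant share
    QueueSketch root = new QueueSketch(0.0, new ArrayList<>(List.of(busy, light)));
    root.assignContainer();
    System.out.println("light.pending = " + light.pending);  // 2: the 40% app got the container
  }
}
```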
Enforcing CPU Allocations with CGroups
So, DRF gives us a way to schedule multiple resources. But what about enforcing the allocations? Because memory is for the most part inelastic, we enforce memory limits by killing container processes when they go over their memory allocation. While we could do the same for CPU, this approach is a little bit harsh because the application doesn’t really have much say in the matter unless we expect it to insert a millisecond sleep after every five lines of code. What we really want is a way to control what proportion of CPU time is allotted to each process.
Luckily, the Linux kernel provides us exactly this in a feature called cgroups (control groups). With cgroups, you can place a YARN container process and all the threads it spawns in a control group. You then allocate a number of “CPU shares” to the process and place it in the cgroups hierarchy next to the other YARN container processes. Available CPU cycles will then be allotted to these processes in proportion to the number of shares given to them. Thus, if the node is not fully scheduled or containers are not using their full allotment, other containers will get to use the extra CPU.
cgroups also provides similar controls for disk and network I/O that we will probably use in the future for managing these resources.
We implemented cgroups CPU enforcement in YARN-3. To turn on cgroups CPU enforcement from Cloudera Manager:
Go to the configuration for the YARN service and check the boxes for "Use CGroups for Resource Management" and "Always use Linux Container Executor".
Go to the hosts configuration and check the box for "Enable CGroup-based Resource Management".
If you are not using Cloudera Manager:
Use the LinuxContainerExecutor. This requires setting yarn.nodemanager.container-executor.class to org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor in all NodeManager configs and setting certain permissions, as described here.
Set yarn.nodemanager.linux-container-executor.resources-handler.class to org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler, as in the sketch below.
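Put together, the relevant yarn-site.xml entries on each NodeManager would look something like this (a minimal sketch using only the two properties named in the steps above):

```xml
<!-- yarn-site.xml on each NodeManager: use the LinuxContainerExecutor... -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<!-- ...and have it enforce CPU allocations through cgroups -->
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
```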
Want to Learn More?
There are a ton of details we didn’t cover regarding how multiple resources interact with Fair Scheduler features like per-queue minimums and maximums, preemption, and fair share reporting. For more info, check out the Fair Scheduler documentation.
Sandy Ryza is a Software Engineer at Cloudera and a Hadoop Committer.