Serverphorums.com

[ANNOUNCE] Multiple run-queues for BFS (1 reply)

2012-12-16

I'm doing a CPU-Scheduler based on BFS by Con Kolivas with support for

multiple run-queues. BFS in itself uses only one run-queue for all

CPU's. This avoids the load-balancing overhead, but does not scale well.

One run-queue per CPU does scale well, but then the scheduler has

load-balancing overhead. The scheduler I'm developing supports every

possible run-queues configuration. You can have one single run-queue

like in BFS, or you can have one run-queue per CPU, or something

completely different like one run-queue every two CPU's. This, in theory

would allow the scheduler to be fine-tuned to the hardware and the

workload.

What state is it in?

Currently it is very unstable, CPU-Hotplug is broken, scheduling

statistics are broken, support for real-time tasks is broken. Load

balancing when having more than one run-queue is working, but is nothing

more than keeping the load on all run-queues equal. Associating a CPU

and a run-queue is currently done with a system call and there is no

access right checking. The source is in a very bad state.

Uni-processor build is broken.

It lacks proper Documentation.

Why allow the user to change the run-queue layout?

To optimize the scheduler to specific hardware and workloads.

You could use one run-queue for all CPU's if you want low latency and

low scheduling overhead.

You could use one run-queue per CPU if you want high scalability.

You could use one run-queue per n CPU's is these n CPU's share cache and

there is not much benefit in load balancing between them.

Benchmarks?

None, it is not stable enough to benchmark and the load balancing

algorithm that is currently used, delivers very bad performance.

What advantages does it have when compared to other schedulers?

It is more scalable than BFS.

It could in future have all features of BFS and of CFS, especially

throughput and low latency.

It has far less lines of code than CFS.

What disadvantages does it have when compared to other schedulers?

It is not stable.

It is not tested on anything else than kvm and more than 4 CPU's.

Many features are not yet working or not implemented at all (good load

balancing).

Implementation details:

All tasks that are runnable but not currently executing on a CPU, are

queued on one of the global run-queues. Every global run-queue has its

own spin-lock. When a task gets queued or dequeued this lock needs to be

taken. All global run-queues are protected by one global read-write

lock. When normal scheduling is done, this lock needs to be read_locked.

When any change to the layout of the global run-queues is done,

like adding new global run-queues or removing them, the global

read-write lock needs to be write-locked.

Fair time distribution among tasks is done via the deadline mechanism of

BFS.

Patch for linux-3.6.2:

diff -uprN linux-3.6.2/arch/powerpc/platforms/cell/spufs/sched.c

linux-3.6.2-bfs-multi-runqueue/arch/powerpc/platforms/cell/spufs/sched.c

--- linux-3.6.2/arch/powerpc/platforms/cell/spufs/sched.c 2012-10-12

22:50:59.000000000 +0200

+++

linux-3.6.2-bfs-multi-runqueue/arch/powerpc/platforms/cell/spufs/sched.c

2012-10-25 17:13:12.578060772 +0200

@@ -63,11 +63,6 @@ static struct timer_list spusched_timer;

static struct timer_list spuloadavg_timer;

/*

- * Priority of a normal, non-rt, non-niced'd process (aka nice level

0).

- */

-#define NORMAL_PRIO 120

-

-/*

* Frequency of the spu scheduler tick. By default we do one SPU

scheduler

* tick for every 10 CPU scheduler ticks.

*/

diff -uprN linux-3.6.2/arch/x86/Kconfig

linux-3.6.2-bfs-multi-runqueue/arch/x86/Kconfig

--- linux-3.6.2/arch/x86/Kconfig 2012-10-12 22:50:59.000000000 +0200

+++ linux-3.6.2-bfs-multi-runqueue/arch/x86/Kconfig 2012-10-25

17:13:12.597060777 +0200

@@ -797,15 +797,7 @@ config SCHED_MC

increased overhead in some places. If unsure say N here.

config IRQ_TIME_ACCOUNTING

- bool "Fine granularity task level IRQ time accounting"

- default n

- ---help---

- Select this option to enable fine granularity task irq time

- accounting. This is done by reading a timestamp on each

- transitions between softirq and hardirq state, so there can be a

- small performance impact.

-

- If in doubt, say N here.

+ def_bool y

source "kernel/Kconfig.preempt"

diff -uprN linux-3.6.2/arch/x86/syscalls/syscall_64.tbl

linux-3.6.2-bfs-multi-runqueue/arch/x86/syscalls/syscall_64.tbl

--- linux-3.6.2/arch/x86/syscalls/syscall_64.tbl 2012-10-12

22:50:59.000000000 +0200

+++ linux-3.6.2-bfs-multi-runqueue/arch/x86/syscalls/syscall_64.tbl

2012-12-07 19:32:05.307937117 +0100

@@ -319,6 +319,7 @@

310 64 process_vm_readv sys_process_vm_readv

311 64 process_vm_writev sys_process_vm_writev

312 common kcmp sys_kcmp

+313 common associate_cpu_grq sys_associate_cpu_grq

#

# x32-specific system call numbers start at 512 to avoid cache impact

diff -uprN linux-3.6.2/Documentation/scheduler/sched-BFS.txt

linux-3.6.2-bfs-multi-runqueue/Documentation/scheduler/sched-BFS.txt

--- linux-3.6.2/Documentation/scheduler/sched-BFS.txt 1970-01-01

01:00:00.000000000 +0100

+++ linux-3.6.2-bfs-multi-runqueue/Documentation/scheduler/sched-BFS.txt

2012-10-25 17:13:12.579060779 +0200

@@ -0,0 +1,347 @@

+BFS - The Brain Fuck Scheduler by Con Kolivas.

+

+Goals.

+

+The goal of the Brain Fuck Scheduler, referred to as BFS from here on,

is to

+completely do away with the complex designs of the past for the cpu

process

+scheduler and instead implement one that is very simple in basic

design.

+The main focus of BFS is to achieve excellent desktop interactivity and

+responsiveness without heuristics and tuning knobs that are difficult

to

+understand, impossible to model and predict the effect of, and when

tuned to

+one workload cause massive detriment to another.

+

+

+Design summary.

+

+BFS is best described as a single runqueue, O(n) lookup, earliest

effective

+virtual deadline first design, loosely based on EEVDF (earliest

eligible virtual

+deadline first) and my previous Staircase Deadline scheduler. Each

component

+shall be described in order to understand the significance of, and

reasoning for

+it. The codebase when the first stable version was released was

approximately

+9000 lines less code than the existing mainline linux kernel scheduler

(in

+2.6.31). This does not even take into account the removal of

documentation and

+the cgroups code that is not used.

+

+Design reasoning.

+

+The single runqueue refers to the queued but not running processes for

the

+entire system, regardless of the number of CPUs. The reason for going

back to

+a single runqueue design is that once multiple runqueues are

introduced,

+per-CPU or otherwise, there will be complex interactions as each

runqueue will

+be responsible for the scheduling latency and fairness of the tasks

only on its

+own runqueue, and to achieve fairness and low latency across multiple

CPUs, any

+advantage in throughput of having CPU local tasks causes other

disadvantages.

+This is due to requiring a very complex balancing system to at best

achieve some

+semblance of fairness across CPUs and can only maintain relatively low

latency

+for tasks bound to the same CPUs, not across them. To increase said

fairness

+and latency across CPUs, the advantage of local runqueue locking, which

makes

+for better scalability, is lost due to having to grab multiple locks.

+

+A significant feature of BFS is that all accounting is done purely

based on CPU

+used and nowhere is sleep time used in any way to determine entitlement

or

+interactivity. Interactivity "estimators" that use some kind of

sleep/run

+algorithm are doomed to fail to detect all interactive tasks, and to

falsely tag

+tasks that aren't interactive as being so. The reason for this is that

it is

+close to impossible to determine that when a task is sleeping, whether

it is

+doing it voluntarily, as in a userspace application waiting for input

in the

+form of a mouse click or otherwise, or involuntarily, because it is

waiting for

+another thread, process, I/O, kernel activity or whatever. Thus, such

an

+estimator will introduce corner cases, and more heuristics will be

required to

+cope with those corner cases, introducing more corner cases and failed

+interactivity detection and so on. Interactivity in BFS is built into

the design

+by virtue of the fact that tasks that are waking up have not used up

their quota

+of CPU time, and have earlier effective deadlines, thereby making it

very likely

+they will preempt any CPU bound task of equivalent nice level. See

below for

+more information on the virtual deadline mechanism. Even if they do not

preempt

+a running task, because the rr interval is guaranteed to have a bound

upper

+limit on how long a task will wait for, it will be scheduled within a

timeframe

+that will not cause visible interface jitter.

+

+

+Design details.

+

+Task insertion.

+

+BFS inserts tasks into each relevant queue as an O(1) insertion into a

double

+linked list. On insertion, *every* running queue is checked to see if

the newly

+queued task can run on any idle queue, or preempt the lowest running

task on the

+system. This is how the cross-CPU scheduling of BFS achieves

significantly lower

+latency per extra CPU the system has. In this case the lookup is, in

the worst

+case scenario, O(n) where n is the number of CPUs on the system.

+

+Data protection.

+

+BFS has one single lock protecting the process local data of every task

in the

+global queue. Thus every insertion, removal and modification of task

data in the

+global runqueue needs to grab the global lock. However, once a task is

taken by

+a CPU, the CPU has its own local data copy of the running process'

accounting

+information which only that CPU accesses and modifies (such as during a

+timer tick) thus allowing the accounting data to be updated lockless.

Once a

+CPU has taken a task to run, it removes it from the global queue. Thus

the

+global queue only ever has, at most,

+

+ (number of tasks requesting cpu time) - (number of logical CPUs) + 1

+

+tasks in the global queue. This value is relevant for the time taken to

look up

+tasks during scheduling. This will increase if many tasks with CPU

affinity set

+in their policy to limit which CPUs they're allowed to run on if they

outnumber

+the number of CPUs. The +1 is because when rescheduling a task, the

CPU's

+currently running task is put back on the queue. Lookup will be

described after

+the virtual deadline mechanism is explained.

+

+Virtual deadline.

+

+The key to achieving low latency, scheduling fairness, and "nice level"

+distribution in BFS is entirely in the virtual deadline mechanism. The

one

+tunable in BFS is the rr_interval, or "round robin interval". This is

the

+maximum time two SCHED_OTHER (or SCHED_NORMAL, the common scheduling

policy)

+tasks of the same nice level will be running for, or looking at it the

other

+way around, the longest duration two tasks of the same nice level will

be

+delayed for. When a task requests cpu time, it is given a quota

(time_slice)

+equal to the rr_interval and a virtual deadline. The virtual deadline

is

+offset from the current time in jiffies by this equation:

+

+ jiffies + (prio_ratio * rr_interval)

+

+The prio_ratio is determined as a ratio compared to the baseline of

nice -20

+and increases by 10% per nice level. The deadline is a virtual one only

in that

+no guarantee is placed that a task will actually be scheduled by this

time, but

+it is used to compare which task should go next. There are three

components to

+how a task is next chosen. First is time_slice expiration. If a task

runs out

+of its time_slice, it is descheduled, the time_slice is refilled, and

the

+deadline reset to that formula above. Second is sleep, where a task no

longer

+is requesting CPU for whatever reason. The time_slice and deadline are

_not_

+adjusted in this case and are just carried over for when the task is

next

+scheduled. Third is preemption, and that is when a newly waking task is

deemed

+higher priority than a currently running task on any cpu by virtue of

the fact

+that it has an earlier virtual deadline than the currently running

task. The

+earlier deadline is the key to which task is next chosen for the first

and

+second cases. Once a task is descheduled, it is put back on the queue,

and an

+O(n) lookup of all queued-but-not-running tasks is done to determine

which has

+the earliest deadline and that task is chosen to receive CPU next.

+

+The CPU proportion of different nice tasks works out to be

approximately the

+

+ (prio_ratio difference)^2

+

+The reason it is squared is that a task's deadline does not change

while it is

+running unless it runs out of time_slice. Thus, even if the time

actually

+passes the deadline of another task that is queued, it will not get CPU

time

+unless the current running task deschedules, and the time

"base" (jiffies) is

+constantly moving.

+

+Task lookup.

+

+BFS has 103 priority queues. 100 of these are dedicated to the static

priority

+of realtime tasks, and the remaining 3 are, in order of best to worst

priority,

+SCHED_ISO (isochronous), SCHED_NORMAL, and SCHED_IDLEPRIO (idle

priority

+scheduling). When a task of these priorities is queued, a bitmap of

running

+priorities is set showing which of these priorities has tasks waiting

for CPU

+time. When a CPU is made to reschedule, the lookup for the next task to

get

+CPU time is performed in the following way:

+

+First the bitmap is checked to see what static priority tasks are

queued. If

+any realtime priorities are found, the corresponding queue is checked

and the

+first task listed there is taken (provided CPU affinity is suitable)

and lookup

+is complete. If the priority corresponds to a SCHED_ISO task, they are

also

+taken in FIFO order (as they behave like SCHED_RR). If the priority

corresponds

+to either SCHED_NORMAL or SCHED_IDLEPRIO, then the lookup becomes O(n).

At this

+stage, every task in the runlist that corresponds to that priority is

checked

+to see which has the earliest set deadline, and (provided it has

suitable CPU

+affinity) it is taken off the runqueue and given the CPU. If a task has

an

+expired deadline, it is taken and the rest of the lookup aborted (as

they are

+chosen in FIFO order).

+

+Thus, the lookup is O(n) in the worst case only, where n is as

described

+earlier, as tasks may be chosen before the whole task list is looked

over.

+

+

+Scalability.

+

+The major limitations of BFS will be that of scalability, as the

separate

+runqueue designs will have less lock contention as the number of CPUs

rises.

+However they do not scale linearly even with separate runqueues as

multiple

+runqueues will need to be locked concurrently on such designs to be

able to

+achieve fair CPU balancing, to try and achieve some sort of nice-level

fairness

+across CPUs, and to achieve low enough latency for tasks on a busy CPU

when

+other CPUs would be more suited. BFS has the advantage that it requires

no

+balancing algorithm whatsoever, as balancing occurs by proxy simply

because

+all CPUs draw off the global runqueue, in priority and deadline order.

Despite

+the fact that scalability is _not_ the prime concern of BFS, it both

shows very

+good scalability to smaller numbers of CPUs and is likely a more

scalable design

+at these numbers of CPUs.

+

+It also has some very low overhead scalability features built into the

design

+when it has been deemed their overhead is so marginal that they're

worth adding.

+The first is the local copy of the running process' data to the CPU

it's running

+on to allow that data to be updated lockless where possible. Then there

is

+deference paid to the last CPU a task was running on, by trying that

CPU first

+when looking for an idle CPU to use the next time it's scheduled.

Finally there

+is the notion of "sticky" tasks that are flagged when they are

involuntarily

+descheduled, meaning they still want further CPU time. This sticky flag

is

+used to bias heavily against those tasks being scheduled on a different

CPU

+unless that CPU would be otherwise idle. When a cpu frequency governor

is used

+that scales with CPU load, such as ondemand, sticky tasks are not

scheduled

+on a different CPU at all, preferring instead to go idle. This means

the CPU

+they were bound to is more likely to increase its speed while the other

CPU

+will go idle, thus speeding up total task execution time and likely

decreasing

+power usage. This is the only scenario where BFS will allow a CPU to go

idle

+in preference to scheduling a task on the earliest available spare CPU.

+

+The real cost of migrating a task from one CPU to another is entirely

dependant

+on the cache footprint of the task, how cache intensive the task is,

how long

+it's been running on that CPU to take up the bulk of its cache, how big

the CPU

+cache is, how fast and how layered the CPU cache is, how fast a context

switch

+is... and so on. In other words, it's close to random in the real world

where we

+do more than just one sole workload. The only thing we can be sure of

is that

+it's not free. So BFS uses the principle that an idle CPU is a wasted

CPU and

+utilising idle CPUs is more important than cache locality, and cache

locality

+only plays a part after that.

+

+When choosing an idle CPU for a waking task, the cache locality is

determined

+according to where the task last ran and then idle CPUs are ranked from

best

+to worst to choose the most suitable idle CPU based on cache locality,

NUMA

+node locality and hyperthread sibling business. They are chosen in the

+following preference (if idle):

+

+* Same core, idle or busy cache, idle threads

+* Other core, same cache, idle or busy cache, idle threads.

+* Same node, other CPU, idle cache, idle threads.

+* Same node, other CPU, busy cache, idle threads.

+* Same core, busy threads.

+* Other core, same cache, busy threads.

+* Same node, other CPU, busy threads.

+* Other node, other CPU, idle cache, idle threads.

+* Other node, other CPU, busy cache, idle threads.

+* Other node, other CPU, busy threads.

+

+This shows the SMT or "hyperthread" awareness in the design as well

which will

+choose a real idle core first before a logical SMT sibling which

already has

+tasks on the physical CPU.

+

+Early benchmarking of BFS suggested scalability dropped off at the 16

CPU mark.

+However this benchmarking was performed on an earlier design that was

far less

+scalable than the current one so it's hard to know how scalable it is

in terms

+of both CPUs (due to the global runqueue) and heavily loaded machines

(due to

+O(n) lookup) at this stage. Note that in terms of scalability, the

number of

+_logical_ CPUs matters, not the number of _physical_ CPUs. Thus, a dual

(2x)

+quad core (4X) hyperthreaded (2X) machine is effectively a 16X. Newer

benchmark

+results are very promising indeed, without needing to tweak any knobs,

features

+or options. Benchmark contributions are most welcome.

+

+

+Features

+

+As the initial prime target audience for BFS was the average desktop

user, it

+was designed to not need tweaking, tuning or have features set to

obtain benefit

+from it. Thus the number of knobs and features has been kept to an

absolute

+minimum and should not require extra user input for the vast majority

of cases.

+There are precisely 2 tunables, and 2 extra scheduling policies. The

rr_interval

+and iso_cpu tunables, and the SCHED_ISO and SCHED_IDLEPRIO policies. In

addition

+to this, BFS also uses sub-tick accounting. What BFS does _not_ now

feature is

+support for CGROUPS. The average user should neither need to know what

these

+are, nor should they need to be using them to have good desktop

behaviour.

+

+rr_interval

+

+There is only one "scheduler" tunable, the round robin interval. This

can be

+accessed in

+

+ /proc/sys/kernel/rr_interval

+

+The value is in milliseconds, and the default value is set to 6ms.

Valid values

+are from 1 to 1000. Decreasing the value will decrease latencies at the

cost of

+decreasing throughput, while increasing it will improve throughput, but

at the

+cost of worsening latencies. The accuracy of the rr interval is limited

by HZ

+resolution of the kernel configuration. Thus, the worst case latencies

are

+usually slightly higher than this actual value. BFS uses "dithering" to

try and

+minimise the effect the Hz limitation has. The default value of 6 is

not an

+arbitrary one. It is based on the fact that humans can detect jitter at

+approximately 7ms, so aiming for much lower latencies is pointless

under most

+circumstances. It is worth noting this fact when comparing the latency

+performance of BFS to other schedulers. Worst case latencies being

higher than

+7ms are far worse than average latencies not being in the microsecond

range.

+Experimentation has shown that rr intervals being increased up to 300

can

+improve throughput but beyond that, scheduling noise from elsewhere

prevents

+further demonstrable throughput.

+

+Isochronous scheduling.

+

+Isochronous scheduling is a unique scheduling policy designed to

provide

+near-real-time performance to unprivileged (ie non-root) users without

the

+ability to starve the machine indefinitely. Isochronous tasks (which

means

+"same time") are set using, for example, the schedtool application like

so:

+

+ schedtool -I -e amarok

+

+This will start the audio application "amarok" as SCHED_ISO. How

SCHED_ISO works

+is that it has a priority level between true realtime tasks and

SCHED_NORMAL

+which would allow them to preempt all normal tasks, in a SCHED_RR

fashion (ie,

+if multiple SCHED_ISO tasks are running, they purely round robin at

rr_interval

+rate). However if ISO tasks run for more than a tunable finite amount

of time,

+they are then demoted back to SCHED_NORMAL scheduling. This finite

amount of

+time is the percentage of _total CPU_ available across the machine,

configurable

+as a percentage in the following "resource handling" tunable (as

opposed to a

+scheduler tunable):

+

+ /proc/sys/kernel/iso_cpu

+

+and is set to 70% by default. It is calculated over a rolling 5 second

average

+Because it is the total CPU available, it means that on a multi CPU

machine, it

+is possible to have an ISO task running as realtime scheduling

indefinitely on

+just one CPU, as the other CPUs will be available. Setting this to 100

is the

+equivalent of giving all users SCHED_RR access and setting it to 0

removes the

+ability to run any pseudo-realtime tasks.

+

+A feature of BFS is that it detects when an application tries to obtain

a

+realtime policy (SCHED_RR or SCHED_FIFO) and the caller does not have

the

+appropriate privileges to use those policies. When it detects this, it

will

+give the task SCHED_ISO policy instead. Thus it is transparent to the

user.

+Because some applications constantly set their policy as well as their

nice

+level, there is potential for them to undo the override specified by

the user

+on the command line of setting the policy to SCHED_ISO. To counter

this, once

+a task has been set to SCHED_ISO policy, it needs superuser privileges

to set

+it back to SCHED_NORMAL. This will ensure the task remains ISO and all

child

+processes and threads will also inherit the ISO policy.

+

+Idleprio scheduling.

+

+Idleprio scheduling is a scheduling policy designed to give out CPU to

a task

+_only_ when the CPU would be otherwise idle. The idea behind this is to

allow

+ultra low priority tasks to be run in the background that have

virtually no

+effect on the foreground tasks. This is ideally suited to distributed

computing

+clients (like setiathome, folding, mprime etc) but can also be used to

start

+a video encode or so on without any slowdown of other tasks. To avoid

this

+policy from grabbing shared resources and holding them indefinitely, if

it

+detects a state where the task is waiting on I/O, the machine is about

to

+suspend to ram and so on, it will transiently schedule them as

SCHED_NORMAL. As

+per the Isochronous task management, once a task has been scheduled as

IDLEPRIO,

+it cannot be put back to SCHED_NORMAL without superuser privileges.

Tasks can

+be set to start as SCHED_IDLEPRIO with the schedtool command like so:

+

+ schedtool -D -e ./mprime

+

+Subtick accounting.

+

+It is surprisingly difficult to get accurate CPU accounting, and in

many cases,

+the accounting is done by simply determining what is happening at the

precise

+moment a timer tick fires off. This becomes increasingly inaccurate as

the

+timer tick frequency (HZ) is lowered. It is possible to create an

application

+which uses almost 100% CPU, yet by being descheduled at the right time,

records

+zero CPU usage. While the main problem with this is that there are

possible

+security implications, it is also difficult to determine how much CPU a

task

+really does use. BFS tries to use the sub-tick accounting from the TSC

clock,

+where possible, to determine real CPU usage. This is not entirely

reliable, but

+is far more likely to produce accurate CPU usage data than the existing

designs

+and will not show tasks as consuming no CPU usage when they actually

are. Thus,

+the amount of CPU reported as being used by BFS will more accurately

represent

+how much CPU the task itself is using (as is shown for example by the

'time'

+application), so the reported values may be quite different to other

schedulers.

+Values reported as the 'load' are more prone to problems with this

design, but

+per process values are closer to real usage. When comparing throughput

of BFS

+to other designs, it is important to compare the actual completed work

in terms

+of total wall clock time taken and total work done, rather than the

reported

+"cpu usage".

+

+

+Con Kolivas
Tue, 5 Apr 2011

diff -uprN linux-3.6.2/Documentation/sysctl/kernel.txt

linux-3.6.2-bfs-multi-runqueue/Documentation/sysctl/kernel.txt

--- linux-3.6.2/Documentation/sysctl/kernel.txt 2012-10-12

22:50:59.000000000 +0200

+++ linux-3.6.2-bfs-multi-runqueue/Documentation/sysctl/kernel.txt

2012-10-25 17:13:12.584060777 +0200

@@ -33,6 +33,7 @@ show up in /proc/sys/kernel:

- domainname

- hostname

- hotplug

+- iso_cpu

- kptr_restrict

- kstack_depth_to_print [ X86 only ]

- l2cr [ PPC only ]

@@ -59,6 +60,7 @@ show up in /proc/sys/kernel:

- randomize_va_space

- real-root-dev ==> Documentation/initrd.txt

- reboot-cmd [ SPARC only ]

+- rr_interval

- rtsig-max

- rtsig-nr

- sem

@@ -301,6 +303,16 @@ kernel stack.

==============================================================

+iso_cpu: (BFS CPU scheduler only).

+

+This sets the percentage cpu that the unprivileged SCHED_ISO tasks can

+run effectively at realtime priority, averaged over a rolling five

+seconds over the -whole- system, meaning all cpus.

+

+Set to 70 (percent) by default.

+

+==============================================================

+

l2cr: (PPC only)

This flag controls the L2 cache of G3 processor boards. If

@@ -517,6 +529,20 @@ rebooting. ???

==============================================================

+rr_interval: (BFS CPU scheduler only)

+

+This is the smallest duration that any cpu process scheduling unit

+will run for. Increasing this value can increase throughput of cpu

+bound tasks substantially but at the expense of increased latencies

+overall. Conversely decreasing it will decrease average and maximum

+latencies but at the expense of throughput. This value is in

+milliseconds and the default value chosen depends on the number of

+cpus available at scheduler initialisation with a minimum of 6.

+

+Valid values are from 1-1000.

+

+==============================================================

+

rtsig-max & rtsig-nr:

The file rtsig-max can be used to tune the maximum number

diff -uprN linux-3.6.2/drivers/cpufreq/cpufreq.c

linux-3.6.2-bfs-multi-runqueue/drivers/cpufreq/cpufreq.c

--- linux-3.6.2/drivers/cpufreq/cpufreq.c 2012-10-12 22:50:59.000000000

+0200

+++ linux-3.6.2-bfs-multi-runqueue/drivers/cpufreq/cpufreq.c 2012-10-25

17:13:12.594060777 +0200

@@ -28,6 +28,7 @@

#include

#include

#include

+#include

#include

#include

@@ -1476,6 +1477,12 @@ int __cpufreq_driver_target(struct cpufr

target_freq, relation);

if (cpu_online(policy->cpu) && cpufreq_driver->target)

retval = cpufreq_driver->target(policy, target_freq, relation);

+ if (likely(retval != -EINVAL)) {

+ if (target_freq == policy->max)

+ cpu_nonscaling(policy->cpu);

+ else

+ cpu_scaling(policy->cpu);

+ }

return retval;

}

diff -uprN linux-3.6.2/drivers/cpufreq/cpufreq_conservative.c

linux-3.6.2-bfs-multi-runqueue/drivers/cpufreq/cpufreq_conservative.c

--- linux-3.6.2/drivers/cpufreq/cpufreq_conservative.c 2012-10-12

22:50:59.000000000 +0200

+++

linux-3.6.2-bfs-multi-runqueue/drivers/cpufreq/cpufreq_conservative.c

2012-10-25 17:13:12.596060777 +0200

@@ -29,8 +29,8 @@

* It helps to keep variable names smaller, simpler

*/

-#define DEF_FREQUENCY_UP_THRESHOLD (80)

-#define DEF_FREQUENCY_DOWN_THRESHOLD (20)

+#define DEF_FREQUENCY_UP_THRESHOLD (63)

+#define DEF_FREQUENCY_DOWN_THRESHOLD (26)

/*

* The polling frequency of this governor depends on the capability of

diff -uprN linux-3.6.2/drivers/cpufreq/cpufreq_ondemand.c

linux-3.6.2-bfs-multi-runqueue/drivers/cpufreq/cpufreq_ondemand.c

--- linux-3.6.2/drivers/cpufreq/cpufreq_ondemand.c 2012-10-12

22:50:59.000000000 +0200

+++ linux-3.6.2-bfs-multi-runqueue/drivers/cpufreq/cpufreq_ondemand.c

2012-10-25 17:13:12.595060777 +0200

@@ -28,8 +28,8 @@

* It helps to keep variable names smaller, simpler

*/

-#define DEF_FREQUENCY_DOWN_DIFFERENTIAL (10)

-#define DEF_FREQUENCY_UP_THRESHOLD (80)

+#define DEF_FREQUENCY_DOWN_DIFFERENTIAL (26)

+#define DEF_FREQUENCY_UP_THRESHOLD (63)

#define DEF_SAMPLING_DOWN_FACTOR (1)

#define MAX_SAMPLING_DOWN_FACTOR (100000)

#define MICRO_FREQUENCY_DOWN_DIFFERENTIAL (3)

@@ -472,10 +472,10 @@ static void dbs_check_cpu(struct cpu_dbs

/*

* Every sampling_rate, we check, if current idle time is less

- * than 20% (default), then we try to increase frequency

+ * than 37% (default), then we try to increase frequency

* Every sampling_rate, we look for a the lowest

* frequency which can sustain the load while keeping idle time over

- * 30%. If such a frequency exist, we try to decrease to this

frequency.

+ * 63%. If such a frequency exist, we try to decrease to this

frequency.

*

* Any frequency increase takes it to the maximum frequency.

* Frequency reduction happens at minimum steps of

diff -uprN linux-3.6.2/fs/proc/base.c

linux-3.6.2-bfs-multi-runqueue/fs/proc/base.c

--- linux-3.6.2/fs/proc/base.c 2012-10-12 22:50:59.000000000 +0200

+++ linux-3.6.2-bfs-multi-runqueue/fs/proc/base.c 2012-10-25

17:13:12.585060777 +0200

@@ -338,7 +338,7 @@ static int proc_pid_stack(struct seq_fil

static int proc_pid_schedstat(struct task_struct *task, char *buffer)

{

return sprintf(buffer, "%llu %llu %lu\n",

- (unsigned long long)task->se.sum_exec_runtime,

+ (unsigned long long)tsk_seruntime(task),

(unsigned long long)task->sched_info.run_delay,

task->sched_info.pcount);

}

diff -uprN linux-3.6.2/include/linux/init_task.h

linux-3.6.2-bfs-multi-runqueue/include/linux/init_task.h

--- linux-3.6.2/include/linux/init_task.h 2012-10-12 22:50:59.000000000

+0200

+++ linux-3.6.2-bfs-multi-runqueue/include/linux/init_task.h 2012-10-25

17:13:12.585060777 +0200

@@ -141,12 +141,70 @@ extern struct task_group root_task_group

# define INIT_PERF_EVENTS(tsk)

#endif

-#define INIT_TASK_COMM "swapper"

-

/*

* INIT_TASK is used to set up the first task table, touch at

* your own risk!. Base=0, limit=0x1fffff (=2MB)

*/

+#ifdef CONFIG_SCHED_BFS

+#define INIT_TASK_COMM "BFS"

+#define INIT_TASK(tsk) \

+{ \

+ .state = 0, \

+ .stack = &init_thread_info, \

+ .usage = ATOMIC_INIT(2), \

+ .flags = PF_KTHREAD, \

+ .prio = NORMAL_PRIO, \

+ .static_prio = MAX_PRIO-20, \

+ .normal_prio = NORMAL_PRIO, \

+ .deadline = 0, \

+ .policy = SCHED_NORMAL, \

+ .cpus_allowed = CPU_MASK_ALL, \

+ .mm = NULL, \

+ .active_mm = &init_mm, \

+ .run_list = LIST_HEAD_INIT(tsk.run_list), \

+ .time_slice = HZ, \

+ .tasks = LIST_HEAD_INIT(tsk.tasks), \

+ INIT_PUSHABLE_TASKS(tsk) \

+ .ptraced = LIST_HEAD_INIT(tsk.ptraced), \

+ .ptrace_entry = LIST_HEAD_INIT(tsk.ptrace_entry), \

+ .real_parent = &tsk, \

+ .parent = &tsk, \

+ .children = LIST_HEAD_INIT(tsk.children), \

+ .sibling = LIST_HEAD_INIT(tsk.sibling), \

+ .group_leader = &tsk, \

+ RCU_POINTER_INITIALIZER(real_cred, &init_cred), \

+ RCU_POINTER_INITIALIZER(cred, &init_cred), \

+ .comm = INIT_TASK_COMM, \

+ .thread = INIT_THREAD, \

+ .fs = &init_fs, \

+ .files = &init_files, \

+ .signal = &init_signals, \

+ .sighand = &init_sighand, \

+ .nsproxy = &init_nsproxy, \

+ .pending = { \

+ .list = LIST_HEAD_INIT(tsk.pending.list), \

+ .signal = {{0}}}, \

+ .blocked = {{0}}, \

+ .alloc_lock = __SPIN_LOCK_UNLOCKED(tsk.alloc_lock), \

+ .journal_info = NULL, \

+ .cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \

+ .pi_lock = __RAW_SPIN_LOCK_UNLOCKED(tsk.pi_lock), \

+ .timer_slack_ns = 50000, /* 50 usec default slack */ \

+ .pids = { \

+ [PIDTYPE_PID] = INIT_PID_LINK(PIDTYPE_PID), \

+ [PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID), \

+ [PIDTYPE_SID] = INIT_PID_LINK(PIDTYPE_SID), \

+ }, \

+ INIT_IDS \

+ INIT_PERF_EVENTS(tsk) \

+ INIT_TRACE_IRQFLAGS \

+ INIT_LOCKDEP \

+ INIT_FTRACE_GRAPH \

+ INIT_TRACE_RECURSION \

+ INIT_TASK_RCU_PREEMPT(tsk) \

+}

+#else /* CONFIG_SCHED_BFS */

+#define INIT_TASK_COMM "swapper"

#define INIT_TASK(tsk) \

{ \

.state = 0, \

@@ -211,7 +269,7 @@ extern struct task_group root_task_group

INIT_TASK_RCU_PREEMPT(tsk) \

INIT_CPUSET_SEQ \

}

-

+#endif /* CONFIG_SCHED_BFS */

#define INIT_CPU_TIMERS(cpu_timers) \

{ \

diff -uprN linux-3.6.2/include/linux/ioprio.h

linux-3.6.2-bfs-multi-runqueue/include/linux/ioprio.h

--- linux-3.6.2/include/linux/ioprio.h 2012-10-12 22:50:59.000000000

+0200

+++ linux-3.6.2-bfs-multi-runqueue/include/linux/ioprio.h 2012-10-25

17:13:12.585060777 +0200

@@ -52,6 +52,8 @@ enum {

*/

static inline int task_nice_ioprio(struct task_struct *task)

{

+ if (iso_task(task))

+ return 0;

return (task_nice(task) + 20) / 5;

}

diff -uprN linux-3.6.2/include/linux/jiffies.h

linux-3.6.2-bfs-multi-runqueue/include/linux/jiffies.h

--- linux-3.6.2/include/linux/jiffies.h 2012-10-12 22:50:59.000000000

+0200

+++ linux-3.6.2-bfs-multi-runqueue/include/linux/jiffies.h 2012-10-25

17:13:12.593060777 +0200

@@ -173,7 +173,7 @@ static inline u64 get_jiffies_64(void)

* Have the 32 bit jiffies value wrap 5 minutes after boot

* so jiffies wrap bugs show up earlier.

*/

-#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))

+#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-10*HZ))

/*

* Change timeval to jiffies, trying to avoid the

diff -uprN linux-3.6.2/include/linux/sched.h

linux-3.6.2-bfs-multi-runqueue/include/linux/sched.h

--- linux-3.6.2/include/linux/sched.h 2012-10-12 22:50:59.000000000

+0200

+++ linux-3.6.2-bfs-multi-runqueue/include/linux/sched.h 2012-10-25

17:13:12.587060777 +0200

@@ -37,8 +37,15 @@

#define SCHED_FIFO 1

#define SCHED_RR 2

#define SCHED_BATCH 3

-/* SCHED_ISO: reserved but not implemented yet */

+/* SCHED_ISO: Implemented on BFS only */

#define SCHED_IDLE 5

+#ifdef CONFIG_SCHED_BFS

+#define SCHED_ISO 4

+#define SCHED_IDLEPRIO SCHED_IDLE

+#define SCHED_MAX (SCHED_IDLEPRIO)

+#define SCHED_RANGE(policy) ((policy)
sched_time)

+#define tsk_rttimeout(t) ((t)->rt_timeout)

+

+static inline void tsk_cpus_current(struct task_struct *p)

+{

+}

+

+static inline int runqueue_is_locked(int cpu)

+{

+ return grunqueue_is_locked();

+}

+

+void print_scheduler_version(void);

+

+static inline bool iso_task(struct task_struct *p)

+{

+ return (p->policy == SCHED_ISO);

+}

+#else /* CFS */

+extern int runqueue_is_locked(int cpu);

+static inline void cpu_scaling(int cpu)

+{

+}

+

+static inline void cpu_nonscaling(int cpu)

+{

+}

+#define tsk_seruntime(t) ((t)->se.sum_exec_runtime)

+#define tsk_rttimeout(t) ((t)->rt.timeout)

+

+static inline void tsk_cpus_current(struct task_struct *p)

+{

+ p->nr_cpus_allowed = current->nr_cpus_allowed;

+}

+

+static inline void print_scheduler_version(void)

+{

+ printk(KERN_INFO"CFS CPU scheduler.\n");

+}

+

+static inline bool iso_task(struct task_struct *p)

+{

+ return false;

+}

+

+/* Anyone feel like implementing this? */

+static inline bool above_background_load(void)

+{

+ return false;

+}

+#endif /* CONFIG_SCHED_BFS */

+

/* Future-safe accessor for struct task_struct's cpus_allowed. */

#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)

@@ -1608,10 +1691,20 @@ struct task_struct {

*/

#define MAX_USER_RT_PRIO 100

-#define MAX_RT_PRIO MAX_USER_RT_PRIO

+#define MAX_RT_PRIO (MAX_USER_RT_PRIO + 1)

+#define DEFAULT_PRIO (MAX_RT_PRIO + 20)

+#ifdef CONFIG_SCHED_BFS

+#define PRIO_RANGE (40)

+#define MAX_PRIO (MAX_RT_PRIO + PRIO_RANGE)

+#define ISO_PRIO (MAX_RT_PRIO)

+#define NORMAL_PRIO (MAX_RT_PRIO + 1)

+#define IDLE_PRIO (MAX_RT_PRIO + 2)

+#define PRIO_LIMIT ((IDLE_PRIO) + 1)

+#else /* CONFIG_SCHED_BFS */

#define MAX_PRIO (MAX_RT_PRIO + 40)

-#define DEFAULT_PRIO (MAX_RT_PRIO + 20)

+#define NORMAL_PRIO DEFAULT_PRIO

+#endif /* CONFIG_SCHED_BFS */

static inline int rt_prio(int prio)

{

@@ -1989,7 +2082,7 @@ extern unsigned long long

task_sched_runtime(struct task_struct *task);

/* sched_exec is called by processes performing an exec */

-#ifdef CONFIG_SMP

+#if defined(CONFIG_SMP) && !defined(CONFIG_SCHED_BFS)

extern void sched_exec(void);

#else

#define sched_exec() {}

@@ -2705,7 +2798,7 @@ static inline unsigned int task_cpu(cons

return 0;

}

-static inline void set_task_cpu(struct task_struct *p, unsigned int

cpu)

+static inline void set_task_cpu(struct task_struct *p, int cpu)

{

}

diff -uprN linux-3.6.2/init/Kconfig

linux-3.6.2-bfs-multi-runqueue/init/Kconfig

--- linux-3.6.2/init/Kconfig 2012-10-12 22:50:59.000000000 +0200

+++ linux-3.6.2-bfs-multi-runqueue/init/Kconfig 2012-10-25

17:13:12.588060777 +0200

@@ -32,6 +32,19 @@ config BUILDTIME_EXTABLE_SORT

menu "General setup"

+config SCHED_BFS

+ bool "BFS cpu scheduler"

+ ---help---

+ The Brain Fuck CPU Scheduler for excellent interactivity and

+ responsiveness on the desktop and solid scalability on normal

+ hardware. Not recommended for 4096 CPUs.

+

+ Currently incompatible with the Group CPU scheduler, and RCU TORTURE

+ TEST so these options are disabled.

+

+ Say Y here.

+ default y

+

config EXPERIMENTAL

bool "Prompt for development and/or incomplete code/drivers"

---help---

@@ -676,6 +689,7 @@ config PROC_PID_CPUSET

config CGROUP_CPUACCT

bool "Simple CPU accounting cgroup subsystem"

+ depends on !SCHED_BFS

help

Provides a simple Resource Controller for monitoring the

total CPU consumed by the tasks in a cgroup.

@@ -778,6 +792,7 @@ config CGROUP_PERF

menuconfig CGROUP_SCHED

bool "Group CPU scheduler"

+ depends on !SCHED_BFS

default n

help

This feature lets CPU scheduler recognize task groups and control

CPU

@@ -1042,6 +1057,7 @@ config UIDGID_STRICT_TYPE_CHECKS

config SCHED_AUTOGROUP

bool "Automatic process group scheduling"

+ depends on !SCHED_BFS

select EVENTFD

select CGROUPS

select CGROUP_SCHED

@@ -1426,38 +1442,8 @@ config COMPAT_BRK

On non-ancient distros (post-2000 ones) N is usually a safe choice.

-choice

- prompt "Choose SLAB allocator"

- default SLUB

- help

- This option allows to select a slab allocator.

-

-config SLAB

- bool "SLAB"

- help

- The regular slab allocator that is established and known to work

- well in all environments. It organizes cache hot objects in

- per cpu and per node queues.

-

config SLUB

- bool "SLUB (Unqueued Allocator)"

- help

- SLUB is a slab allocator that minimizes cache line usage

- instead of managing queues of cached objects (SLAB approach).

- Per cpu caching is realized using slabs of objects instead

- of queues of objects. SLUB can use memory efficiently

- and has enhanced diagnostics. SLUB is the default choice for

- a slab allocator.

-

-config SLOB

- depends on EXPERT

- bool "SLOB (Simple Allocator)"

- help

- SLOB replaces the stock allocator with a drastically simpler

- allocator. SLOB is generally more space efficient but

- does not perform as well on large systems.

-

-endchoice

+ def_bool y

config MMAP_ALLOW_UNINITIALIZED

bool "Allow mmapped anonymous memory to be uninitialized"

diff -uprN linux-3.6.2/init/main.c

linux-3.6.2-bfs-multi-runqueue/init/main.c

--- linux-3.6.2/init/main.c 2012-10-12 22:50:59.000000000 +0200

+++ linux-3.6.2-bfs-multi-runqueue/init/main.c 2012-11-19

16:14:36.759575528 +0100

@@ -701,7 +701,6 @@ int __init_or_module do_one_initcall(ini

return ret;

}

-

extern initcall_t __initcall_start[];

extern initcall_t __initcall0_start[];

extern initcall_t __initcall1_start[];

@@ -809,6 +808,8 @@ static noinline int init_post(void)

current->signal->flags |= SIGNAL_UNKILLABLE;

flush_delayed_fput();

+ print_scheduler_version();

+

if (ramdisk_execute_command) {

run_init_process(ramdisk_execute_command);

printk(KERN_WARNING "Failed to execute %s\n",

@@ -857,10 +858,8 @@ static int __init kernel_init(void * unu

cad_pid = task_pid(current);

smp_prepare_cpus(setup_max_cpus);

-

do_pre_smp_initcalls();

lockup_detector_init();

-

smp_init();

sched_init_smp();

diff -uprN linux-3.6.2/kernel/cpu.c

linux-3.6.2-bfs-multi-runqueue/kernel/cpu.c

--- linux-3.6.2/kernel/cpu.c 2012-10-12 22:50:59.000000000 +0200

+++ linux-3.6.2-bfs-multi-runqueue/kernel/cpu.c 2012-11-22

18:10:05.921368389 +0100

@@ -285,11 +285,9 @@ static int __ref _cpu_down(unsigned int

if (err) {

/* CPU didn't die: tell everyone. Can't complain. */

cpu_notify_nofail(CPU_DOWN_FAILED | mod, hcpu);

-

goto out_release;

}

BUG_ON(cpu_online(cpu));

-

/*

* The migration_call() CPU_DYING callback will have removed all

* runnable tasks from the cpu, there's only the idle task left now

@@ -398,7 +396,6 @@ int __cpuinit cpu_up(unsigned int cpu)

#endif

return -EINVAL;

}

-

#ifdef CONFIG_MEMORY_HOTPLUG

nid = cpu_to_node(cpu);

if (!node_online(nid)) {

@@ -406,7 +403,6 @@ int __cpuinit cpu_up(unsigned int cpu)

if (err)

return err;

}

-

pgdat = NODE_DATA(nid);

if (!pgdat) {

printk(KERN_ERR

diff -uprN linux-3.6.2/kernel/delayacct.c

linux-3.6.2-bfs-multi-runqueue/kernel/delayacct.c

--- linux-3.6.2/kernel/delayacct.c 2012-10-12 22:50:59.000000000 +0200

+++ linux-3.6.2-bfs-multi-runqueue/kernel/delayacct.c 2012-10-25

17:13:12.589060777 +0200

@@ -130,7 +130,7 @@ int __delayacct_add_tsk(struct taskstats

*/

t1 = tsk->sched_info.pcount;

t2 = tsk->sched_info.run_delay;

- t3 = tsk->se.sum_exec_runtime;

+ t3 = tsk_seruntime(tsk);

d->cpu_count += t1;

diff -uprN linux-3.6.2/kernel/exit.c

linux-3.6.2-bfs-multi-runqueue/kernel/exit.c

--- linux-3.6.2/kernel/exit.c 2012-10-12 22:50:59.000000000 +0200

+++ linux-3.6.2-bfs-multi-runqueue/kernel/exit.c 2012-10-25

17:13:12.590060777 +0200

@@ -145,7 +145,7 @@ static void __exit_signal(struct task_st

sig->inblock += task_io_get_inblock(tsk);

sig->oublock += task_io_get_oublock(tsk);

task_io_accounting_add(&sig->ioac, &tsk->ioac);

- sig->sum_sched_runtime += tsk->se.sum_exec_runtime;

+ sig->sum_sched_runtime += tsk_seruntime(tsk);

}

sig->nr_threads--;

diff -uprN linux-3.6.2/kernel/posix-cpu-timers.c

linux-3.6.2-bfs-multi-runqueue/kernel/posix-cpu-timers.c

--- linux-3.6.2/kernel/posix-cpu-timers.c 2012-10-12 22:50:59.000000000

+0200

+++ linux-3.6.2-bfs-multi-runqueue/kernel/posix-cpu-timers.c 2012-10-25

17:13:12.591060777 +0200

@@ -495,7 +495,7 @@ static void cleanup_timers(struct list_h

void posix_cpu_timers_exit(struct task_struct *tsk)

{

cleanup_timers(tsk->cpu_timers,

- tsk->utime, tsk->stime, tsk->se.sum_exec_runtime);

+ tsk->utime, tsk->stime, tsk_seruntime(tsk));

}

void posix_cpu_timers_exit_group(struct task_struct *tsk)

@@ -504,7 +504,7 @@ void posix_cpu_timers_exit_group(struct

cleanup_timers(tsk->signal->cpu_timers,

tsk->utime + sig->utime, tsk->stime + sig->stime,

- tsk->se.sum_exec_runtime + sig->sum_sched_runtime);

+ tsk_seruntime(tsk) + sig->sum_sched_runtime);

}

static void clear_dead_task(struct k_itimer *timer, union

cpu_time_count now)

@@ -934,7 +934,7 @@ static void check_thread_timers(struct t

struct cpu_timer_list *t = list_first_entry(timers,

struct cpu_timer_list,

entry);

- if (!--maxfire || tsk->se.sum_exec_runtime
expires.sched) {

+ if (!--maxfire || tsk_seruntime(tsk)
expires.sched) {

tsk->cputime_expires.sched_exp = t->expires.sched;

break;

}

@@ -951,7 +951,7 @@ static void check_thread_timers(struct t

ACCESS_ONCE(sig->rlim[RLIMIT_RTTIME].rlim_max);

if (hard != RLIM_INFINITY &&

- tsk->rt.timeout > DIV_ROUND_UP(hard, USEC_PER_SEC/HZ)) {

+ tsk_rttimeout(tsk) > DIV_ROUND_UP(hard, USEC_PER_SEC/HZ)) {

/*

* At the hard limit, we just die.

* No need to calculate anything else now.

@@ -959,7 +959,7 @@ static void check_thread_timers(struct t

__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);

return;

}

- if (tsk->rt.timeout > DIV_ROUND_UP(soft, USEC_PER_SEC/HZ)) {

+ if (tsk_rttimeout(tsk) > DIV_ROUND_UP(soft, USEC_PER_SEC/HZ)) {

/*

* At the soft limit, send a SIGXCPU every second.

*/

@@ -1252,7 +1252,7 @@ static inline int fastpath_timer_check(s

struct task_cputime task_sample = {

.utime = tsk->utime,

.stime = tsk->stime,

- .sum_exec_runtime = tsk->se.sum_exec_runtime

+ .sum_exec_runtime = tsk_seruntime(tsk)

};

if (task_cputime_expired(&task_sample, &tsk->cputime_expires))

diff -uprN linux-3.6.2/kernel/sched/bfs.c

linux-3.6.2-bfs-multi-runqueue/kernel/sched/bfs.c

--- linux-3.6.2/kernel/sched/bfs.c 1970-01-01 01:00:00.000000000 +0100

+++ linux-3.6.2-bfs-multi-runqueue/kernel/sched/bfs.c 2012-12-15

16:25:58.651400017 +0100

@@ -0,0 +1,8156 @@

+/*

+ * kernel/sched_bfs.c, was sched.c

+ *

+ * Kernel scheduler and related syscalls

+ *

+ * Copyright (C) 1991-2002 Linus Torvalds

+ *

+ * 1996-12-23 Modified by Dave Grothe to fix bugs in semaphores and

+ * make semaphores SMP safe

+ * 1998-11-19 Implemented schedule_timeout() and related stuff

+ * by Andrea Arcangeli

+ * 2002-01-04 New ultra-scalable O(1) scheduler by Ingo Molnar:

+ * hybrid priority-list and round-robin design with

+ * an array-switch method of distributing timeslices

+ * and per-CPU runqueues. Cleanups and useful suggestions

+ * by Davide Libenzi, preemptible kernel bits by Robert Love.

+ * 2003-09-03 Interactivity tuning by Con Kolivas.

+ * 2004-04-02 Scheduler domains code by Nick Piggin

+ * 2007-04-15 Work begun on replacing all interactivity tuning with a

+ * fair scheduling design by Con Kolivas.

+ * 2007-05-05 Load balancing (smp-nice) and other improvements

+ * by Peter Williams

+ * 2007-05-06 Interactivity improvements to CFS by Mike Galbraith

+ * 2007-07-01 Group scheduling enhancements by Srivatsa Vaddagiri

+ * 2007-11-29 RT balancing improvements by Steven Rostedt, Gregory

Haskins,

+ * Thomas Gleixner, Mike Kravetz

+ * now Brainfuck deadline scheduling policy by Con Kolivas deletes

+ * a whole lot of those previous things.

+ * Support for multiple runqueues by Matthias Kohler

+ */

+

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+#include

+

+#include

+#include

+#include

+#include

+#ifdef CONFIG_PARAVIRT

+#include

+#endif

+

+#include "cpupri.h"

+#include "../workqueue_sched.h"

+#include "../smpboot.h"

+

+#define CREATE_TRACE_POINTS

+#include

+

+#define rt_prio(prio) unlikely((prio)
prio)

+#define rt_queue(rq) rt_prio((rq)->rq_prio)

+#define batch_task(p) (unlikely((p)->policy == SCHED_BATCH))

+#define is_rt_policy(policy) ((policy) == SCHED_FIFO || \

+ (policy) == SCHED_RR)

+#define has_rt_policy(p) unlikely(is_rt_policy((p)->policy))

+#define idleprio_task(p) unlikely((p)->policy == SCHED_IDLEPRIO)

+#define iso_task(p) unlikely((p)->policy == SCHED_ISO)

+#define iso_queue(rq) unlikely((rq)->rq_policy == SCHED_ISO)

+#define ISO_PERIOD ((5 * HZ * grq.noc) + 1)

+

+/*

+ * Convert user-nice values [ -20 ... 0 ... 19 ]

+ * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],

+ * and back.

+ */

+#define NICE_TO_PRIO(nice) (MAX_RT_PRIO + (nice) + 20)

+#define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20)

+#define TASK_NICE(p) PRIO_TO_NICE((p)->static_prio)

+

+/*

+ * 'User priority' is the nice value converted to something we

+ * can work with better when scaling various scheduler parameters,

+ * it's a [ 0 ... 39 ] range.

+ */

+#define USER_PRIO(p) ((p) - MAX_RT_PRIO)

+#define TASK_USER_PRIO(p) USER_PRIO((p)->static_prio)

+#define MAX_USER_PRIO (USER_PRIO(MAX_PRIO))

+#define SCHED_PRIO(p) ((p) + MAX_RT_PRIO)

+#define STOP_PRIO (MAX_RT_PRIO - 1)

+

+/*

+ * Some helpers for converting to/from various scales. Use shifts to

get

+ * approximate multiples of ten for less overhead.

+ */

+#define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / HZ))

+#define JIFFY_NS (1000000000 / HZ)

+#define HALF_JIFFY_NS (1000000000 / HZ / 2)

+#define HALF_JIFFY_US (1000000 / HZ / 2)

+#define MS_TO_NS(TIME) ((TIME)
> 20)

+#define NS_TO_US(TIME) ((TIME) >> 10)

+

+#define RESCHED_US (100) /* Reschedule if less than this many μs left

*/

+

+void print_scheduler_version(void)

+{

+}

+

+/*

+ * This is the time all tasks within the same priority round robin.

+ * Value is in ms and set to a minimum of 6ms. Scales with number of

cpus.

+ * Tunable via /proc interface.

+ */

+int rr_interval __read_mostly = 6;

+

+/*

+ * sched_iso_cpu - sysctl which determines the cpu percentage SCHED_ISO

tasks

+ * are allowed to run five seconds as real time tasks. This is the

total over

+ * all online cpus.

+ */

+int sched_iso_cpu __read_mostly = 70;

+

+/*

+ * The relative length of deadline for each priority(nice) level.

+ */

+static int prio_ratios[PRIO_RANGE] __read_mostly;

+

+/*

+ * The quota handed out to tasks of all priority levels when refilling

their

+ * time_slice.

+ */

+static inline int timeslice(void)

+{

+ return MS_TO_US(rr_interval);

+}

+

+/*

+ * The global runqueue data that all CPUs work off. Data is protected

either

+ * by the global grq lock, or the discrete lock that precedes the data

in this

+ * struct.

+ */

+

+struct global_rq {

+ raw_spinlock_t lock;

+ unsigned long nr_running;

+ unsigned long nr_uninterruptible;

+ unsigned long long nr_switches;

+ struct list_head queue[PRIO_LIMIT];

+ DECLARE_BITMAP(prio_bitmap, PRIO_LIMIT + 1);

+#ifdef CONFIG_SMP

+ unsigned long qnr; /* queued not running */

+ cpumask_t cpu_idle_map;

+ bool idle_cpus;

+ cpumask_t cpu_span;

+#endif

+ int noc; /* num_online_cpus stored and updated when it changes */

+ u64 niffies; /* Nanosecond jiffies */

+ unsigned long last_jiffy; /* Last jiffy we updated niffies */

+

+ raw_spinlock_t iso_lock;

+ int iso_ticks;

+ int iso_refractory;

+

+ /*

+ * for the initial global runqueue this is set to false

+ * to ensure it does not get freed

+ */

+

+ bool is_freeable;

+};

+

+static void init_a_grq(struct global_rq *const this_grq,

+ int const n_online_cpus,

+ struct cpumask const *this_span,

+ struct global_rq const *const old_grq)

+{

+ int i;

+

+ raw_spin_lock_init(&(this_grq->lock));

+ this_grq->nr_running =

+ this_grq->nr_uninterruptible =

+ this_grq->nr_switches = 0;

+ if (old_grq == NULL) {

+ this_grq->niffies = 0;

+ this_grq->last_jiffy = jiffies;

+ } else {

+ this_grq->niffies = old_grq->niffies;

+ this_grq->last_jiffy = old_grq->last_jiffy;

+ }

+ raw_spin_lock_init(&(this_grq->iso_lock));

+ this_grq->iso_ticks = this_grq->iso_refractory = 0;

+ this_grq->noc = n_online_cpus;

+#ifdef CONFIG_SMP

+ this_grq->qnr = this_grq->idle_cpus = 0;

+ cpumask_clear(&(this_grq->cpu_idle_map));

+ cpumask_copy(&this_grq->cpu_span, this_span);

+ this_grq->is_freeable = true;

+#endif

+

+ for (i = 0; i
queue + i);

+ /* delimiter for bitsearch */

+ __set_bit(PRIO_LIMIT, this_grq->prio_bitmap);

+}

+

+/*

+ * The initial global runqueue, this gets never freed

+ */

+

+#ifdef CONFIG_SMP

+

+/*

+ * We add the notion of a root-domain which will be used to define

per-domain

+ * variables. Each exclusive cpuset essentially defines an island

domain by

+ * fully partitioning the member cpus from any other cpuset. Whenever a

new

+ * exclusive cpuset is created, we also create and attach a new

root-domain

+ * object.

+ *

+ */

+struct root_domain {

+ atomic_t refcount;

+ atomic_t rto_count;

+ struct rcu_head rcu;

+ cpumask_var_t span;

+ cpumask_var_t online;

+

+ /*

+ * The "RT overload" flag: it gets set if a CPU has more than

+ * one runnable RT task.

+ */

+ cpumask_var_t rto_mask;

+ struct cpupri cpupri;

+};

+

+/*

+ * By default the system creates a single root-domain with all cpus as

+ * members (mimicking the global state we have today).

+ */

+static struct root_domain def_root_domain;

+

+#endif /* CONFIG_SMP */

+

+/* There can be only one */

+

+/*

+ * A per_cpu field for fast lookup, which gloabl_runqueue is

+ * associated with which cpu

+ */

+static DEFINE_PER_CPU(struct global_rq *, global_runqueues);

+

+/*

+ * This field tranlates global-runqeue numbers to pointers

+ * to them.

+ * if(per_cpu(global_runqueues, number_of_grq) == NULL) :

+ * The global runqueue with the number number_of_grq

+ * does _not_ exist.

+ * if(per_cpu(global_runqueues, number_of_grq) != NULL) :

+ * The global runqueue with the number number_of_grq

+ * does exist.

+ */

+static DEFINE_PER_CPU(struct global_rq *, grq_nr_lookup_table);

+

+/*

+ * 1 is the minimal value, because there has always to be grq.

+ * We set it here to one, when the first cpu is brought up,

+ * it is incremented.

+ */

+static int nr_grq;

+

+/*

+ * When scheduling is in progress, this read-write lock gets

+ * read_locked, to ensure no one is modifying the layout

+ * of the global runqueues. This allows many accesses to

+ * the global runqueues. Each global runqueue has its own

+ * spinlock which is also taken when scheduling is in progress.

+ * If the layout of the global runqueues is changed,

+ * this lock needs to be write_locked, this ensures that

+ * nothing is accesssing the global runqueues or scheduling

+ * taken place.

+ */

+static DEFINE_RWLOCK(grq_layout_rwlock);

+

+#define grq (*__get_cpu_var(global_runqueues))

+

+/*

+ * This is the main, per-CPU runqueue data structure.

+ * This data should only be modified by the local cpu.

+ */

+struct rq {

+#ifdef CONFIG_SMP

+#ifdef CONFIG_NO_HZ

+ u64 nohz_stamp;

+ unsigned char in_nohz_recently;

+#endif

+#endif

+

+ struct task_struct *curr, *idle, *stop;

+ struct mm_struct *prev_mm;

+

+ /* Stored data about rq->curr to work outside grq lock */

+ u64 rq_deadline;

+ unsigned int rq_policy;

+ int rq_time_slice;

+ u64 rq_last_ran;

+ int rq_prio;

+ bool rq_running; /* There is a task running */

+

+ /* Accurate timekeeping data */

+ u64 timekeep_clock;

+ unsi