Sunday, March 27, 2005

2.6.x O(1) scheduler

1.Linux 2.6.x Scheduler
Author: Saravanan S
e-mail: sar.kernel@gmail.com

1.0 Scheduler - Introduction
The scheduler is the basis of a multitasking operating system. Multitasking kernels allow many processes to execute at the same time, and each process runs under the impression that it is the only process on the system. The scheduler is the part of the kernel that chooses which process/thread to run next. It decides how to allocate CPU cycles to processes/threads to achieve "good performance". Good performance means different things depending on the needs of a system. A scheduler tries to optimise the following:

-CPU utilization: Fraction of time CPU is in use
-Throughput: average number of jobs completed per time unit
-Turnaround Time: average time between job submission and completion
-Waiting Time: average amount of time a process is ready but waiting
-Response Time: time until the system responds to a command
-Response Ratio: (Turnaround Time)/(Execution Time) -- long jobs should wait longer

Different systems require different things to be optimised. That is why "good performance" means different things: a high performance computing system wants high efficiency (throughput), whereas a desktop system wants fast response times.

1.1 CPU bound Vs IO bound Processes
IO bound processes are those which spend most of their time waiting for the user's response (or waiting for IO), e.g. a text editor. CPU bound processes are those which spend most of their time hogging the CPU, e.g. matrix multiplication or DNA sequencing. It is not always clear in advance whether a process is IO bound or CPU bound; this is determined while the process runs. If the process spends most of its time waiting, it is an IO bound process. A scheduler gives a higher dynamic priority to IO bound processes for the following reasons:

1. Human interactive processes get scheduled immediately on input from the user.
2. IO operations usually take a long time, so it is good to get them started soon; the CPU can then work on CPU bound processes while the IO completes.

1.2 Linux 2.6.x Scheduler challenges

1. Boost the priority of IO bound processes, thereby increasing the interactiveness of the system
2. Fairness to all processes (no process should starve)
3. SMP scheduling - it should be able to schedule processes across many processors
4. SMT scheduling - scheduling multiple threads on a single multithreaded chip, e.g. Intel's Hyper-Threading Technology
5. NUMA scheduling - running a single system image across many nodes
6. Real time scheduling - Linux supports soft real time scheduling

The linux 2.6.x scheduler addresses all of the above, with its core scheduling operations running in O(1) (constant) time.

The linux 2.6.x scheduler is a priority based, round robin, preemptive scheduler.

1.3 Design of the 2.6.x scheduler
1. 140 priority levels for processes (0-99 for real time tasks, 100-139 for normal tasks)
2. Real time tasks - two policies
i) SCHED_FIFO - once scheduled, they run for as long as they want to (no timeslices)
ii) SCHED_RR - similar to SCHED_FIFO but with timeslices; within a given priority they are scheduled in a round robin fashion. SCHED_FIFO tasks can preempt them.
3. Normal tasks
i) Initially prio = MAX_RT_PRIO + NICE + 20
ii) Assigned a timeslice according to the static priority; within the same priority, tasks are executed in a round robin fashion
iii) Ensures dynamic priority + fairness
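The priority mapping in (i) can be sketched as standalone C. This is a simplified model outside the kernel: nice_to_prio is a hypothetical helper mirroring the kernel's NICE_TO_PRIO macro, and the constants mirror the kernel's values.

```c
#define MAX_RT_PRIO 100   /* priorities 0..99 are for real time tasks */
#define MAX_PRIO    140   /* priorities 100..139 are for normal tasks */

/* Map a nice value (-20..19) onto a static priority (100..139),
 * as in prio = MAX_RT_PRIO + nice + 20. */
static int nice_to_prio(int nice)
{
    return MAX_RT_PRIO + nice + 20;
}
```

For example, the default nice value 0 maps to static priority 120, nice -20 to 100 (the best normal priority), and nice 19 to 139 (the worst).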

Let's look at the 2.6.x scheduler algorithm.

1.4 Data Structures related to the 2.6.x scheduler
1. runqueue - every processor has a runqueue. It keeps track of all the runnable processes on a CPU.
2. prio_array - there are two priority arrays per runqueue: an active array containing the tasks that still have timeslice left, and an expired array containing the tasks whose timeslice has expired.

The above two data structures form the basis of the O(1) scheduler. They are defined in the kernel/sched.c file as below:

struct runqueue {
	spinlock_t lock;

	/*
	 * nr_running and cpu_load should be in the same cacheline because
	 * remote CPUs use both these fields when doing load calculation.
	 */
	unsigned long nr_running;	/* number of runnable tasks */
	unsigned long cpu_load;	/* this CPU's load: average number of runnable processes per unit of time */
	unsigned long long nr_switches;	/* number of context switches */
	unsigned long expired_timestamp;	/* time of the last array swap */
	unsigned long nr_uninterruptible;	/* number of uninterruptible tasks in the queue */
	unsigned long long timestamp_last_tick;	/* timestamp of the last scheduler tick */
	task_t *curr, *idle;	/* this processor's current and idle tasks */
	struct mm_struct *prev_mm;	/* the last running task's mm_struct */
	prio_array_t *active, *expired, arrays[2];	/* the active and expired prio_arrays */
	int best_expired_prio;	/* highest priority present in the expired prio_array */
	atomic_t nr_iowait;	/* number of tasks in the queue waiting on IO */

	struct sched_domain *sd;	/* on SMP systems there can be different scheduler domains */

	/* for active balancing */
	int active_balance;
	int push_cpu;

	/* the migration thread for the processor this runqueue belongs to */
	task_t *migration_thread;
	struct list_head migration_queue;
};

struct prio_array {
	unsigned int nr_active;	/* number of runnable tasks in this prio_array */
	unsigned long bitmap[BITMAP_SIZE];	/* bitmap showing which priority levels contain tasks */
	struct list_head queue[MAX_PRIO];	/* an array of list heads, one per priority level */
};
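A simplified user-space model of how a task enters a prio_array helps show how the fields cooperate: set the bit for the task's priority level, append the task to that level's list, and bump nr_active. This sketch replaces the kernel's list_head machinery and bitmap helpers with plain arrays, purely for illustration.

```c
#define MAX_PRIO 140
#define MAX_TASKS_PER_PRIO 8   /* fixed bound, for the sketch only */

struct prio_array {
    unsigned int nr_active;                   /* runnable tasks in this array */
    unsigned char bitmap[MAX_PRIO];           /* 1 if tasks exist at that priority */
    int queue[MAX_PRIO][MAX_TASKS_PER_PRIO];  /* task ids per priority level */
    int queue_len[MAX_PRIO];                  /* tasks currently at each level */
};

/* Simplified enqueue: add the task to the tail of its priority list,
 * mark the level as populated, and bump nr_active. */
static void enqueue_task(struct prio_array *array, int task_id, int prio)
{
    array->queue[prio][array->queue_len[prio]++] = task_id;
    array->bitmap[prio] = 1;
    array->nr_active++;
}
```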

1.5 A simple description of the 2.6.x O(1) algorithm

The prio_array data structure is extremely important as it is what allows the Linux scheduling algorithm to perform in O(1) time.
The basic structure in the Linux scheduler is the runqueue, defined above. There is one runqueue per processor, and within that runqueue there are two structures of type prio_array. One is for tasks that have not used up their timeslice yet; the other is for tasks that have. The former are considered active, the latter expired. Note that active and expired have nothing to do with whether or not a task is runnable - active simply means that, since the last time timeslices were allocated, a given task in that queue has not used up its timeslice. A task in the active array still has time available on the CPU; tasks in the expired array have used up their timeslice.
The nr_active value stores the number of runnable tasks in the prio_array. The bitmap is a string of bits, one for each priority level on the system (140 by default), that indicates whether or not there are any tasks in the prio_array at a given priority level. The queue field is an array of list heads, one per priority level; each list holds all runnable tasks at that priority.
So if there is only one runnable task in the prio_array, nr_active will be equal to one. If that task is not RT and has a nice value of 0, its static priority is 120, so bit 120 of the bitmap will be set to indicate that there is a task in the prio_array at that priority level, and queue[120] will be a list with a single element, the task in question. This is very useful because, in order to determine the next task to run, the scheduler simply:
1) looks to see if there are any runnable tasks in its active prio_array (i.e. is nr_active > 0?)
2) if so, goes to step 3; otherwise goes to step 6
3) finds the first set bit in the active prio_array's bitmap. There must be one somewhere, since we know there is a task in the prio_array and it must have a priority level.
4) runs the first task in the list at that position in the prio_array's queue
5) when the task is done running for some reason, recalculates its new timeslice and moves it to the expired prio_array: nr_active is decremented in the active prio_array and incremented in the expired one. If the task was the last one at its priority, that priority's bit is cleared in the active prio_array and set in the expired prio_array. Steps 1-5 repeat until no tasks remain in the active prio_array.
6) when no tasks remain in the active prio_array, swaps the active and expired prio_arrays and starts over. Since timeslices are recalculated as each process is put onto the expired array, the swap of prio_arrays is fast (i.e. no sitting around recalculating a timeslice for every task).
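The steps above can be sketched in user-space C. This is a simplified model, not kernel code: the bitmap is a plain byte array, sched_find_first_set is a stand-in for the kernel's optimized bitmap search, and the arrays hold task ids rather than task_t lists.

```c
#define MAX_PRIO 140

struct prio_array {
    unsigned int nr_active;          /* runnable tasks in this array */
    unsigned char bitmap[MAX_PRIO];  /* 1 if any task exists at that priority */
    int first_task[MAX_PRIO];        /* id of the first queued task per level */
};

/* Step 3: find the first (highest-priority) populated level. */
static int sched_find_first_set(const struct prio_array *array)
{
    int prio;
    for (prio = 0; prio < MAX_PRIO; prio++)
        if (array->bitmap[prio])
            return prio;
    return -1;
}

/* Steps 1-6: pick the next task, swapping arrays when the active one drains.
 * Returns a task id, or -1 if both arrays are empty (run the idle task). */
static int pick_next_task(struct prio_array **active, struct prio_array **expired)
{
    int prio;

    if ((*active)->nr_active == 0) {   /* steps 1-2 and 6: swap the arrays */
        struct prio_array *tmp = *active;
        *active = *expired;
        *expired = tmp;
    }
    prio = sched_find_first_set(*active);   /* step 3 */
    if (prio < 0)
        return -1;                          /* nothing runnable anywhere */
    return (*active)->first_task[prio];     /* step 4 */
}
```

In the real kernel the 140-bit bitmap is searched with a handful of word-sized operations, which is what makes the lookup O(1): the work depends only on the fixed bitmap size, never on the number of queued tasks.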

1.6 Dynamic priorities
In order to boost IO bound processes, linux uses a heuristic to determine which processes are interactive and to boost their priority. Tasks that are IO bound tend to sleep on IO, so the metric linux uses is based on this. The metric is sleep_avg, whose value is increased depending on the amount of time a task spends sleeping and decreased depending on the amount of time it spends on the CPU. It is fairly accurate, since it depends not only on the time the task spends sleeping but also on the time the task runs on the CPU. So CPU bound processes which occasionally sleep for a long time will not get a priority boost, and IO bound processes which wait on IO and then suddenly start hogging the CPU will not get one either. In summary, depending on sleep_avg, IO bound processes get a maximum prio bonus of -5 and CPU bound processes get a penalty of up to +5.

1.7 Interactive credits

They help control the rise and fall of the interactive status of tasks. Tasks gain interactive credits when they sleep frequently and lose credits when they hog the CPU for a long time.
Highly interactive task: > 100 interactive credits
Low interactivity task: < -100 interactive credits

A CPU hog which occasionally waits on IO for a long time must not get a sleep_avg increase for that single wait, so such tasks are given low interactivity credit. Similarly, a highly interactive task must not get less run time just because it hogged the CPU once.

1.8 Main scheduler functions
A discussion of the important scheduler functions in sched.c follows.

schedule function

It starts by disabling preemption and checks whether it was called from kernel preemption; if not, uninterruptible tasks are removed from the runqueue and interruptible tasks that have received a signal become runnable. Then a check is made for tasks in the runqueue; if there are none, an attempt to rebalance is made. If balancing does not bring in any runnable tasks, the idle task is run. If there are runnable tasks but none in the active array, the active and expired arrays are swapped. At this point there is a runnable task in the active array, so the bitmap is checked to find the first set bit and thereby the highest priority runnable task. Next, a chance is given for dependent sleepers to run (which happens only on an SMT system): the current CPU runs the idle task so that the dependent sleeper can wake up and run. If there is no switch to the idle task at this point, a check is made whether the task chosen to run next is an RT task; if it is not an RT task and has just been woken up, it is given a higher sleep_avg (a bonus for interactive processes). Finally the task switch is made, if the previous task is not equal to the next.

The schedule function is invoked in the following cases:
1. When a task gives up the CPU voluntarily (the sched_yield syscall).
2. When the scheduler_tick function sets the TIF_NEED_RESCHED flag on a task because it has run out of time. scheduler_tick is executed on every timer interrupt.
3. When preemption is enabled. Preemption can occur as below:
i) User preemption
.When returning to user space from a system call
.When returning to user space from an interrupt handler
ii) Kernel preemption
.When returning to kernel space from an interrupt handler
.When kernel code becomes preemptible again
.If a task in the kernel explicitly calls schedule()
.If a task in the kernel blocks
In all of the above kernel preemption cases except the explicit call to schedule(), preempt_count is checked first; if preempt_count is non-zero, the current process holds a lock and the scheduler is not invoked.
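The timer-driven path in case 2 can be sketched as follows. This is a simplified model of what scheduler_tick does for a normal task, not the kernel function itself: need_resched stands in for the TIF_NEED_RESCHED flag.

```c
struct task {
    unsigned int time_slice;   /* remaining timeslice, in ticks */
    int need_resched;          /* stands in for TIF_NEED_RESCHED */
};

/* Called on every timer interrupt for the currently running task:
 * burn one tick, and flag a reschedule when the timeslice runs out. */
static void scheduler_tick(struct task *curr)
{
    if (curr->time_slice > 0 && --curr->time_slice == 0)
        curr->need_resched = 1;   /* out of time: schedule() will run soon */
}
```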

effective_prio function

The effective_prio function calculates a task's dynamic priority, using sleep_avg as the heuristic. The function first checks whether the task is an RT task; if so it returns immediately, as there is no dynamic priority for a real time task. Otherwise it calculates the dynamic priority of the task as follows:

	bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;

	prio = p->static_prio - bonus;
	if (prio < MAX_RT_PRIO)
		prio = MAX_RT_PRIO;
	if (prio > MAX_PRIO-1)
		prio = MAX_PRIO-1;
	return prio;

The CURRENT_BONUS macro maps sleep_avg onto the range 0 to MAX_BONUS (which is 10), so the bonus ranges from -5 to +5 as discussed above, and it is subtracted from the static priority to give the dynamic priority (prio). Finally the function checks whether the resulting priority falls within the non-RT range, and clamps it if it does not.

effective_prio is called from:
1. recalc_task_prio(), which is called by the main schedule function when a task is inserted into the expired array, and also when tasks are activated, i.e. when a task is moved into the runqueue.
2. The thread and process wakeup calls.
3. scheduler_tick(), which runs on every timer interrupt.
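The calculation can be tried out in user space. This sketch makes one simplifying assumption: current_bonus takes sleep_avg as a percentage of MAX_SLEEP_AVG instead of a nanosecond value, standing in for the kernel's CURRENT_BONUS macro.

```c
#define MAX_RT_PRIO 100
#define MAX_PRIO    140
#define MAX_BONUS   10

/* Map sleep_avg (given here as a percentage of MAX_SLEEP_AVG, an
 * assumption for this sketch) onto 0..MAX_BONUS. */
static int current_bonus(int sleep_avg_pct)
{
    return sleep_avg_pct * MAX_BONUS / 100;
}

/* Dynamic priority: static priority minus a bonus in -5..+5,
 * clamped to the non-RT range. */
static int effective_prio(int static_prio, int sleep_avg_pct)
{
    int bonus = current_bonus(sleep_avg_pct) - MAX_BONUS / 2;
    int prio = static_prio - bonus;
    if (prio < MAX_RT_PRIO)
        prio = MAX_RT_PRIO;
    if (prio > MAX_PRIO - 1)
        prio = MAX_PRIO - 1;
    return prio;
}
```

For example, a nice-0 task (static priority 120) that almost always sleeps gets the full -5 bonus and runs at dynamic priority 115, while one that never sleeps is penalised to 125.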

task_timeslice function

A task's timeslice is based entirely on its static priority: the better the task's static priority (i.e. the lower its nice value), the larger the task's timeslice. Essentially, the function calculates the timeslice as the minimum timeslice plus the task's static priority scaled onto the maximum possible range.

It is the interface used by the scheduler to find the timeslice of a process.
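That scaling can be sketched as below. The 10ms/200ms constants mirror common 2.6 defaults, but this is a simplified stand-in for the kernel's task_timeslice, not its exact code.

```c
#define MAX_RT_PRIO   100
#define MAX_PRIO      140
#define MAX_USER_PRIO (MAX_PRIO - MAX_RT_PRIO)   /* 40 normal priority levels */
#define MIN_TIMESLICE 10    /* ms */
#define MAX_TIMESLICE 200   /* ms */

/* Minimum timeslice plus the static priority scaled onto the remaining
 * range: the best priority (100) gets MAX_TIMESLICE, the worst (139)
 * gets MIN_TIMESLICE. */
static unsigned int task_timeslice(int static_prio)
{
    return MIN_TIMESLICE + (MAX_TIMESLICE - MIN_TIMESLICE) *
           (MAX_PRIO - 1 - static_prio) / (MAX_USER_PRIO - 1);
}
```

Under these constants a nice -20 task (static priority 100) gets 200ms, while a nice 19 task (static priority 139) gets only 10ms.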

load_balance function
In an SMP system, balancing among the runqueues of different processors is done by the load_balance function. Before proceeding further, let us look at what a scheduler domain is. Scheduler domains come into the picture on SMP and NUMA systems. In a NUMA system, a parent scheduling domain contains all the CPUs in the system (the CPUs of all nodes), each child scheduler domain contains all the CPUs of one node, and each child has one group per physical CPU (the group contains the virtual SMT CPUs). The load_balance function tries to find the busiest group in the domain; if there is no busiest group it exits. It then identifies the busiest runqueue within that group (a runqueue is considered busy if it has more than 25% more processes than the current runqueue). After identifying the busiest runqueue, it tries to pull tasks into the current runqueue, preferring high priority tasks from the expired array (expired tasks are chosen because they are the least likely to still be cache hot on their old CPU).

load_balance is invoked from:
1. rebalance_tick(), which runs on every timer tick and checks whether load balancing has been skipped for the interval balance_interval before invoking load_balance.
2. schedule(), which invokes it when there are no runnable processes in the current runqueue.
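The "more than 25% busier" rule above can be sketched as a simple predicate. This is a simplification of the imbalance check inside load_balance, written with integer arithmetic; it is not the kernel's exact code.

```c
/* A remote runqueue is worth stealing from when it has more than
 * 25% more runnable tasks than the current runqueue,
 * i.e. busiest_nr > this_nr * 1.25, done in integer arithmetic. */
static int worth_balancing(unsigned long busiest_nr, unsigned long this_nr)
{
    return busiest_nr * 4 > this_nr * 5;
}
```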

1.9 Scheduler tuning
Some of the parameters in sched.c that can be used for scheduler tuning are as follows:
1. MIN_TIMESLICE and MAX_TIMESLICE - determine the default timeslice. Increased timeslices improve efficiency because context switches decrease, but response time suffers.
2. PRIO_BONUS_RATIO - the parameter used to calculate how much penalty or bonus a task gets for being CPU bound or IO bound. It essentially controls the degree to which static_prio is effective: if it is high, static_prio is less effective and dynamic priority matters more, so IO bound processes are boosted further.
3. MAX_SLEEP_AVG - increasing this value means tasks have to sleep longer to be considered interactive. Efficiency might improve, as there might be fewer increases in dynamic priority.
4. STARVATION_LIMIT - this value controls fairness: if a task has not run for this limit, interactive tasks stop running so that the starved task can run. Decreasing the value increases fairness but decreases interactivity; increasing it improves interactivity at the expense of non-interactive tasks.
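The STARVATION_LIMIT check can be sketched as a predicate in the spirit of the kernel's EXPIRED_STARVING test: expired tasks are considered starving when the array swap is overdue by an amount that scales with the number of runnable tasks. The constant and the exact formula here are illustrative assumptions, not the kernel's code.

```c
#define STARVATION_LIMIT 500   /* in jiffies; an assumed default for the sketch */

/* Expired tasks are starving when the time since the first task was
 * moved to the expired array exceeds a limit that grows with the
 * number of runnable tasks. */
static int expired_starving(unsigned long jiffies_now,
                            unsigned long expired_timestamp,
                            unsigned long nr_running)
{
    if (expired_timestamp == 0)   /* nothing has expired yet */
        return 0;
    return jiffies_now - expired_timestamp >= STARVATION_LIMIT * nr_running;
}
```

When this predicate is true, the scheduler stops reinserting interactive tasks into the active array, so the expired tasks eventually get their turn.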

1.10 Future of 2.6.x scheduler

1. The 2.6.x scheduler is a generic scheduler which tries to perform well both in the server market (where efficiency is important) and in the desktop market (where response time is important). There have been proposals to design separate schedulers for each of these systems and to provide, perhaps, a proc interface to select the scheduler depending on the system. Ingo has rejected this kind of idea, arguing that it would decrease the amount of work put into a generic scheduler that performs well in all conditions.
2. Shared runqueues - SMT support in the kernel is not perfect; for example, if there are idle virtual processors, balancing will be done onto these virtual CPUs instead of onto a different physical CPU. The solution put forward by Robert Love is to have shared runqueues, i.e. runqueues shared among the virtual CPUs of an SMT package.

1.11 References
1. Josh Aas's documentation of the scheduler
2. Robert Love's Linux Kernel Development, first edition (the scheduler chapter is available online as a sample chapter)
3. Kernel source (the kernel/sched.c and include/linux/sched.h files)

