whatisthekernel

Monday, March 28, 2005

Introduction to Linux Device Driver Programming

Author: Gaurav Dhiman
Email: gauravd_linux@yahoo.com, gaurav4lkg@gmail.com


Introduction to Linux Device Drivers:

A Linux device driver is actually the piece of code which knows very well the device it is controlling. It knows the device's behaviour and has knowledge of the device internals. Device drivers in Linux can be part of the core kernel itself, or they can be developed as separate modules which can be attached to or detached from a running kernel at any time, giving the kernel the flexibility to support multiple devices in a dynamic environment.

In this article we will talk about writing a device driver as a kernel module. Before talking about device drivers, we should have some basic knowledge of kernel module programming in Linux. In the next few sections we will discuss the basics of kernel module programming before jumping into the driver intricacies.


Introduction to Kernel Module Programming:

A kernel module is a piece of code written, compiled and loaded separately from the core kernel but linked to the core kernel at load time. Kernel modules have a specific structure to follow. There are two standard functions which need to be implemented in any Linux kernel module; we will talk about them a bit later. As mentioned earlier, a kernel module is code which attaches to the core kernel at load time and is executed in kernel (privileged) mode. It is not a user program, which runs in restricted mode and has limited access to memory and other system resources. Being part of the kernel, a module can access any system resource. Due to this fact, kernel module programmers need to take care that their module does not create security loopholes in the kernel and must be well disciplined. Kernel modules should only do what they are supposed to do. One loosely written kernel module loaded into the core kernel can make the whole system vulnerable.

Module structure:
A module needs to have two standard functions which take care of initializing and cleaning up the resources used by the module. The standard format of a module is as follows:

#include <linux/module.h>
#include <linux/kernel.h>

/* standard function for initializing the resources used by the module */

int init_module(void){
    return 0;
}

/* other functions, which implement the functionality of the module */

func_1(){
……..
……..
}

func_2(){
………
………
}

func_3(){
………
……..
}

……..
…….
………

/* standard function for releasing the resources used by the module */

void cleanup_module(void){
}

In the init_module() function, we request and acquire the resources required by our module. For example, resources can be an IRQ line (the interrupt line on which our device is going to interrupt), an I/O memory region, a DMA channel, etc. We also register a few other things, like the major number used by our driver or the interrupt handlers for the interrupts generated by our device. The init_module() function is called at the time of loading a module into the core kernel, using the “insmod” or “modprobe” command.

In the cleanup_module() function, we release the resources acquired in the init_module() function. This function is called at the time of unloading a module from the core kernel, using the “rmmod” command.
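Just to make the above concrete, here is a rough sketch of what init_module() and cleanup_module() typically look like for a character driver that needs a major number and an IRQ line. The major number, IRQ number, the device name "my_device" and the my_fops / my_interrupt symbols are all made up for illustration; the calls themselves (register_chrdev(), request_irq() and friends) are the 2.4-style interfaces this article talks about.

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/fs.h>        /* register_chrdev(), unregister_chrdev() */
#include <linux/sched.h>
#include <linux/interrupt.h> /* request_irq(), free_irq() */

#define MY_MAJOR 254         /* hypothetical major number */
#define MY_IRQ   7           /* hypothetical interrupt line */

extern struct file_operations my_fops;   /* driver entry points (see later) */
extern void my_interrupt(int irq, void *dev_id, struct pt_regs *regs);

int init_module(void)
{
    /* register our driver entry points under our major number */
    if (register_chrdev(MY_MAJOR, "my_device", &my_fops) < 0)
        return -EIO;

    /* acquire the interrupt line our device will interrupt on */
    if (request_irq(MY_IRQ, my_interrupt, 0, "my_device", NULL)) {
        unregister_chrdev(MY_MAJOR, "my_device");
        return -EBUSY;
    }
    return 0;
}

void cleanup_module(void)
{
    /* release the resources acquired in init_module(), in reverse order */
    free_irq(MY_IRQ, NULL);
    unregister_chrdev(MY_MAJOR, "my_device");
}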


Presenting a real Kernel Module:
Let’s follow the tradition and write our first kernel module, “hello_world.c”:

#define __KERNEL__
#define MODULE

#include <linux/module.h>
#include <linux/kernel.h>

int init_module(void){
    printk("<1> My Module: Hello World !! \n");
    return 0;
}

void cleanup_module(void){
    printk("<1> My Module: Bye World ….. I am going !! \n");
}


We need to define the two macros mentioned above (__KERNEL__ and MODULE) for any module we write. The printk() function is a brother of the printf() function. The major difference between them is that printf() is a standard C library function which resides in user space, whereas printk() is a kernel function written from scratch, which writes its messages to the kernel log buffer.
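One small aside on the “<1>” prefix used above: it is a log level (message priority) that printk() understands. The kernel headers define macros such as KERN_ALERT for these prefixes, so the same hello_world is more commonly written like the sketch below (functionally the same, just using the macro):

#include <linux/kernel.h>
#include <linux/module.h>

int init_module(void)
{
    /* "<1>" is a log-level prefix telling the kernel how urgent the
     * message is; the KERN_ALERT macro expands to the same "<1>" string */
    printk("<1> My Module: Hello World !!\n");
    printk(KERN_ALERT "My Module: Hello World !!\n");
    return 0;
}

void cleanup_module(void)
{
}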


Now coming to the real thing – Device Drivers

Before going to the real code of a device driver, let’s first understand some basic fundamentals of what devices are in Linux. We will also discuss how they are accessed by user processes.


Device Files and Major Numbers:
In Linux, and in other Unix variants, devices are represented by files in the file system. These files are known as device files or device nodes. Device files are special files through which a user process (program) can communicate (open, read, write, close, etc.) with the underlying device. The process uses the standard system calls, like open(), read(), write(), ioctl(), etc., to interact with the device. With every device file two special numbers are associated, which are known as the major number and the minor number of the device. The major number is used by the kernel to direct a user process request to the right kernel driver, while the minor number is of no use to the kernel itself. The minor number is only used by the driver to identify the exact device which needs to be manipulated. The reason for the existence of the minor number is that, in a practical scenario, one driver can handle or control more than one device of the same type, or even of different types, so the driver needs some mechanism through which it can identify which device it needs to manipulate. Let’s take an example: if there is a driver “A” which actually controls three physical devices “B”, “C” and “D”, then there need to be three device files in the file system with the same major number but different minor numbers.

Device files can be created with “mknod” command. It has the following syntax.

mknod {device name} {device type} {major number} {minor number}

For help on this command, refer to its man page.
It is a convention that device files reside in the “/dev” directory of the root file system, so it’s always better to create our own device files there rather than somewhere else in the file system.
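For completeness, the same thing can also be done from a C program with the mknod() system call. The device name and major number below are made up for the example:

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* makedev() */
#include <stdio.h>

int main(void)
{
    /* create a character device node with major 254 and minor 0;
     * this is equivalent to:  mknod /dev/mydevice c 254 0 */
    if (mknod("/dev/mydevice", S_IFCHR | 0666, makedev(254, 0)) < 0) {
        perror("mknod");
        return 1;
    }
    return 0;
}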


Device Types:
Devices can mainly be categorized into three groups: character devices, block devices and network devices.

Character Devices: These are the devices to which a user process can write or read a single character at a time. That means the interaction between the device driver and the actual physical device is in terms of single characters. Examples: keyboard, serial ports, parallel ports.

Block Devices: These are the devices to which a user process can write or read a single block of data at a time. Reading and writing on these devices is done in terms of blocks of data. A block can be 512 bytes, 1024 bytes or so. Examples: hard disk, floppy disk.

Network Devices: These are asynchronous devices and are responsible for establishing a network connection to the outside world. The best example of this type of device is a NIC (network interface card).

To keep things simple, this article will only talk about character device drivers.


Opening a Device:
As mentioned earlier, a user process can communicate with the underlying device through its device file, so to interact with the device, the user process should first open the related device file using the open() system call. Once the device file is opened, the user process receives a file descriptor, which it can refer to in further file-manipulation system calls.
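For example, a small user program talking to a hypothetical device file “/dev/mydevice” could look like the sketch below; each system call it makes ends up in the corresponding function of the driver:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buf[64];
    int fd, n;

    /* open the device file; the kernel routes this to the driver's open() */
    fd = open("/dev/mydevice", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* read from the device; the kernel ends up calling the driver's read() */
    n = read(fd, buf, sizeof(buf));
    if (n > 0)
        write(1, buf, n);    /* dump whatever the device gave us to stdout */

    close(fd);               /* the driver's release() runs when the last reference goes away */
    return 0;
}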

Now we will see how things work in the kernel when a device file is opened by a user process. Before discussing that, let’s look at some related data structures.

task_struct: This data structure represents a process (task) in the kernel. The kernel uses this data structure to keep track of the processes in the system and the resources they are using. It is one of the main data structures in the kernel and contains a number of fields to track process-specific information.

files_struct: This structure contains the information related to the open files of a process. It keeps the information needed to track a process's open files. One of its elements is an array of pointers to “file” structures. This array points to different “file” structures, and the index into this array is returned to the process as the file descriptor when the open system call is made. It also keeps a count of the total number of open files for the process.

file: This structure represents an open file for a process. Do not confuse it with the physical file, which is represented by the “inode” structure in the kernel. This structure only remains in kernel memory while the file is open for a process. As soon as the process closes the file, exits or aborts, all the “file” structures representing its open files are destroyed, provided they are not referenced anymore by any other process.

Some of the elements of “file” structure:

- f_mode: This element tells in which mode the file has been opened by the process. A process can open a file in either read (FMODE_READ) or write (FMODE_WRITE) mode. This element is normally used by device drivers to check in which mode the device file has been opened.

- f_pos: This element holds the offset (in bytes) from where the next read or write in the file will happen.
- f_flags: This element also tells us in which mode the file has been opened (read or write), but it is always recommended to use the “f_mode” element to check the mode of the file. This element records one important thing which might be helpful to driver writers: whether or not the file has been opened in non-blocking mode. By default (if the O_NONBLOCK flag is not mentioned at open time) the file is opened in blocking mode. The driver checks this flag at the time of reading or writing the device. If the device file is opened in blocking mode (the default) and at read time there is no data to be read from the device, or at write time the driver-specific buffer is full, the driver puts the process to sleep on one of its local wait queues (we will soon see what wait queues are). On the other hand, if the device file has been opened in non-blocking mode, the driver does not put the process to sleep; instead, control returns back to the user process with an error.
- f_count: This element keeps track of how many references there are to this instance of the file. As we know, all files of a parent process are inherited by the child process if the file does not have the close_on_exec flag set. When the child process inherits the files from the parent process, the “f_count” element of each inherited file is incremented, so that the kernel can keep track of the number of processes referring to this file structure. A “file” structure (representing an open file) does not get destroyed on every close system call, as it might be shared by other processes as well. During the close system call the kernel checks the “f_count” element of the “file” structure, and only if it is zero is the “file” structure released and its memory freed.
- f_owner: This element tells which process is the owner of this open file. It contains the pid of the owner process. This element is used for sending the SIGIO signal to the right process in the case of asynchronous notification. A user process can change this element by using the fcntl() system call.
- f_op: This is an important element from the perspective of the device driver. It points to a structure of pointers to file operations. For a device file (represented by a “file” structure), this element points to the structure which contains pointers to the driver-specific functions. We will discuss in detail the structure (file_operations) to which this element points.


- file_operations: This is an important data structure for a device driver, as this is the structure through which the driver registers its functions with the kernel, so that the kernel can call them on different events, like opening a device file, reading/writing a device file, or sending ioctl commands to the device. In the case of a device file, this structure contains pointers to the different functions of the driver, through which the kernel invokes the driver (a short sketch of such a structure appears after the list below). Now we will briefly discuss the elements of this data structure.


Some of the elements of “file_operations” structure

- llseek: This is a pointer to the driver function which moves or sets the “f_pos” element of the device file (discussed earlier).
- read: This is a pointer to the driver function which actually physically reads data from the device.
- write: This is a pointer to the driver function which actually physically writes data to the device.
- poll: This function is called by either the “poll” or the “select” system call.
- ioctl: This function is called by the ioctl system call. This function in the driver is used to pass special commands to the device, such as formatting the device or setting the read/write head of the device, which are different from the normal read/write commands.
- mmap: This function of the driver is used to map the device memory area into the process's virtual address space.
- open: This function is used to open a file. In the case of a device file, this function of the driver initializes the device or initializes other book-keeping data structures.
- flush: This function flushes the driver buffer to the physical file. It should be implemented in the driver if the driver wants to provide a facility for the application to make sure that all the data has been physically put on the device.
- release: This function is called by the close system call, but it is not called for every close system call. As described earlier, one “file” data structure (which represents an open file in the kernel) can be referred to by more than one process if the processes are sharing the file (the best example is FIFO files); in that case the close system call does not release the “file” data structure and only decrements its “f_count” element. If, after decrementing, “f_count” turns out to be zero, the close system call calls the associated “release” function, which is a driver function in the case of device files. So the “release” function of the driver should clear up and release all the memory acquired. “release” is just the opposite of the “open” function.
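To tie the above elements together, here is a rough sketch (in the 2.4 style used throughout this article, with hypothetical my_* function names) of how a character driver declares its entry points and fills in a file_operations structure:

#include <linux/fs.h>

/* driver entry points -- the kernel calls these through the file's f_op pointer */
static int     my_open   (struct inode *inode, struct file *filp);
static int     my_release(struct inode *inode, struct file *filp);
static ssize_t my_read   (struct file *filp, char *buf, size_t count, loff_t *f_pos);
static ssize_t my_write  (struct file *filp, const char *buf, size_t count, loff_t *f_pos);

static struct file_operations my_fops = {
    open:    my_open,      /* called when the device file is opened */
    read:    my_read,      /* called for the read() system call */
    write:   my_write,     /* called for the write() system call */
    release: my_release,   /* called when the last reference to the open file is closed */
};

This my_fops structure is what the earlier init_module() sketch handed to register_chrdev(); any operation the driver leaves out is simply treated by the kernel as not supported.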


Few Important Kernel Mechanisms used in Drivers

Wait Queues:
A wait queue is a mechanism in the kernel through which kernel code can put a process to sleep. This is used in different parts of the kernel where the kernel decides to put a process to sleep. The kernel puts a process to sleep when the required event has not yet occurred (for example, some process wants to read from a device and there is no data to be read); in this case the kernel puts the process to sleep and takes back the processor from it by calling the schedule() function, which is the scheduler in the Linux kernel. The schedule() function schedules and dispatches another process.

Before discussing the functions related to putting a process to sleep, we should look at which data structures are used for implementing a wait queue in the kernel.

A wait queue is actually a linked list of “wait_queue_t” structures. The head of the wait queue is represented by the “wait_queue_head_t” structure, which contains a spin lock to synchronize access to the wait queue. The “wait_queue_head_t” structure also contains a pointer to the first element in the wait queue. Each element in the wait queue is represented by a “wait_queue_t” structure, which contains a pointer to a “task_struct” structure as well as a pointer to the next element in the wait queue. As “task_struct” represents a live process in the kernel, this mechanism lets a driver, or any other part of the kernel, keep track of the processes waiting for a specific event to occur.
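In practice a driver rarely touches these structures by hand; it just declares a wait queue head and initializes it, either statically or at run time. A small sketch (the names are hypothetical):

#include <linux/module.h>
#include <linux/wait.h>

/* a wait queue head declared and initialized at compile time */
static DECLARE_WAIT_QUEUE_HEAD(my_queue);

/* the same thing done at run time, e.g. from init_module() */
static wait_queue_head_t my_other_queue;

int init_module(void)
{
    init_waitqueue_head(&my_other_queue);
    return 0;
}

void cleanup_module(void)
{
}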


Putting a process to sleep:
A process can be put to sleep by using any of the following kernel functions. You can call these functions from anywhere in the kernel (drivers, modules or the core kernel) when you want to put the current process to sleep. Whenever kernel code executes on behalf of a system call made by a user process, it executes in the context of the process which made the system call. There is an exception to this rule: whenever an interrupt occurs, the kernel code (interrupt handler) does not execute in process context; it executes in an anonymous context. This is the reason we should be careful not to call, from an interrupt handler, any function which can put the execution thread to sleep. If we do so the kernel will hang, which means the system will hang.

Functions which can put a process to sleep:
- sleep_on(wait_queue_head_t * wait_queue)
- interruptible_sleep_on(wait_queue_head_t * wait_queue)
- sleep_on_timeout(wait_queue_head_t * wait_queue, long timeout)
- interruptible_sleep_on_timeout(wait_queue_head_t * wait_queue, long timeout)

In the above functions, “wait_queue” is the wait queue head and “timeout” is a value expressed in terms of jiffies. We will talk about jiffies very soon. Now we will see the differences between the above-mentioned functions.

- sleep_on: This function puts the process to sleep in TASK_UNINTERRUPTIBLE mode, which means the process will not be woken up if it receives a signal while it is asleep. The process will only be woken up when some other part of the kernel code wakes it up deliberately (normally on the occurrence of some event) by calling one of the wake-up functions (we will discuss the wake-up functions very soon). A process put to sleep with this function can sometimes cause problems. For example, if a process is put to sleep with this function and the event on which it needs to be woken up never occurs, the process will never come back to the execution stage. The process cannot even be killed by sending it a KILL signal, as a process sleeping in TASK_UNINTERRUPTIBLE mode ignores all signals. A process put to sleep with this function is woken up only if the following condition occurs:

o The process is deliberately woken up by some part of the kernel code on the occurrence of the event for which it was waiting.

- interruptible_sleep_on: This function exists in the kernel to avoid the problem caused by the “sleep_on” function. It puts the process to sleep in TASK_INTERRUPTIBLE mode. When a process sleeps in this mode, it can be woken up if any of the following conditions occurs:

o The process receives a signal, either from another process or from the kernel itself.
o The process is deliberately woken up by some part of the kernel code on the occurrence of the event for which it was waiting.

- sleep_on_timeout: This function is similar to the “sleep_on” function but is not as dangerous as “sleep_on”. A process put to sleep with this function is woken up if any of the following conditions occurs:

o The time mentioned in the timeout parameter has expired.
o The process is deliberately woken up by some part of the kernel code on the occurrence of the event for which it was waiting.

- interruptible_sleep_on_timeout: I hope by now you can easily guess what this function does. The process put to sleep with this function is woken up when any of the following conditions occurs:

o The process receives a signal, either from another process or from the kernel itself.
o The time mentioned in the timeout parameter has expired.
o The process is deliberately woken up by some part of the kernel code on the occurrence of the event for which it was waiting.

Waking up a process:
A process put to sleep must also be woken up by some kernel code, or it will never return to the execution state. If your driver puts a process to sleep, it is the responsibility of that same driver to wake up the sleeping processes when the required event occurs. For example, if your driver puts a reading process to sleep on its internal wait queue because there is nothing to read from the driver buffer (the buffer is empty), then the sleeping process should be woken up whenever new data arrives in the driver buffer (this happens when the device interrupts, so the interrupt handler is responsible for waking up the processes sleeping on the driver’s wait queue). A combined sketch of this sleep/wake pattern follows the list of wake-up functions below.

Functions which can be explicitly called to wake up the process:
- wake_up(wait_queue_head_t * wait_queue)
- wake_up_interruptible(wait_queue_head_t * wait_queue)
- wake_up_sync(wait_queue_head_t * wait_queue)
- wake_up_interruptible_sync(wait_queue_head_t * wait_queue)
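Here is the combined sleep/wake sketch promised above: a driver read function that sleeps on its private wait queue when there is no data (honouring O_NONBLOCK as discussed under f_flags), and a 2.4-style interrupt handler that wakes the readers up when data arrives. All names are hypothetical and the actual data transfer is left out for brevity:

#include <linux/module.h>
#include <linux/fs.h>
#include <linux/fcntl.h>
#include <linux/sched.h>
#include <linux/wait.h>
#include <linux/errno.h>

static DECLARE_WAIT_QUEUE_HEAD(my_read_queue);   /* driver's private wait queue */
static int data_ready;                           /* set by the interrupt handler */

static ssize_t my_read(struct file *filp, char *buf, size_t count, loff_t *f_pos)
{
    while (!data_ready) {
        /* device file opened in non-blocking mode: return immediately */
        if (filp->f_flags & O_NONBLOCK)
            return -EAGAIN;
        /* blocking mode (the default): sleep until the interrupt handler wakes us up */
        interruptible_sleep_on(&my_read_queue);
        /* if a signal woke us up instead of the device, let the caller retry */
        if (signal_pending(current))
            return -ERESTARTSYS;
    }
    /* ... copy the data to user space with copy_to_user() ... */
    data_ready = 0;
    return count;
}

/* interrupt handler: new data has arrived, wake up the sleeping readers */
static void my_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    data_ready = 1;
    wake_up_interruptible(&my_read_queue);
}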

__________________________________________________________________

That’s it for this time. In the next article I will cover the ‘ioctl’ interface of devices and the different timing mechanisms available in the Linux Kernel.

Sunday, March 27, 2005

2.6.x O(1) scheduler

1.Linux 2.6.x Scheduler
Author: Saravanan S
e-mail: sar.kernel@gmail.com

1.0 Scheduler - Introduction
-------------------------------------------------------------------
The scheduler is the basis of a multitasking operating system. Multitasking kernels allow many processes to execute at the same time, and each process runs under the impression that it is the only process on the system. The scheduler is the part of the kernel code which chooses which process/thread to run next. It decides how to allocate CPU cycles to processes/threads to achieve "good performance". Good performance means many things according to the needs of a system. A scheduler tries to optimise the following:

-CPU utilization: Fraction of time CPU is in use
-Throughput: average number of jobs completed per time unit
-Turnaround Time: average time between job submission and completion
-Waiting Time: average amount of time a process is ready but waiting
-Response Time: time until the system responds to a command
-Response Ratio: (Turnaround Time)/(Execution Time) -- long jobs should wait longer

Different systems require different things to be optimised. That is the reason why "good performance" means different things. A high-performance computing system likes to have high efficiency (throughput), whereas a desktop system likes to have a fast response time.

1.1 CPU bound Vs IO bound Processes
--------------------------------------------------------------------------
IO-bound processes are those processes which spend most of their time waiting for the user's response (or waiting for IO), e.g. a text editor. CPU-bound processes are those processes which spend most of their time hogging the CPU, e.g. matrix multiplication or sequencing DNA. It is not always clear whether a process is IO bound or CPU bound; this is determined while the process runs. If the process spends most of its time waiting then it is an IO-bound process. A scheduler gives a higher dynamic priority to IO-bound processes for the following reasons:

1. Human-interactive processes get scheduled immediately on an input from the user.
2. Also, as IO operations usually take a long time, it is good to get them started soon, so that the CPU can work on CPU-bound processes.

1.2 Linux 2.6.x Scheduler challenges
---------------------------------------------------------------------------

1.Boost the priority of IO bound processes thereby increasing the interactiveness of the system
2.Fairness to all processes ( no process should starve)
3.SMP scheduling - It should be able to schedule processes across many processors
4.SMT scheduling - Scheduling multiple threads on a single Multithreading chip. eg: Intel's Hyper Threading Technology
5.NUMA scheduling - Run a single system image across many nodes.
6.Real time scheduling - Linux supports soft real time scheduling

The Linux 2.6.x scheduler achieves all the above scheduling goals in O(1) (constant) time.

The Linux 2.6.x scheduler is a priority-based, round-robin, preemptive scheduler.

1.3 Design of the 2.6.x scheduler
-------------------------------------------------------------------------
1. 140 different priority levels for processes (0-99 for real-time processes, 100-139 for normal tasks)
2. Real time tasks - two policies
i) SCHED_FIFO - They are scheduled and run as long as they want to
ii)SCHED_RR - Similar to SCHED_FIFO but they have timeslices; within a certain priority they are scheduled in a round-robin fashion. SCHED_FIFO tasks can preempt them.
3. Normal tasks
i)Initially, prio = MAX_RT_PRIO + nice + 20 (nice ranges from -20 to 19, so this gives priorities 100-139)
ii)Assigned a time slice according to the static priority; within the same priority, tasks are executed in a round-robin fashion
iii)Ensures dynamic priority + fairness

Let's look at the 2.6.x scheduler algorithm.

1.4 Data Structures related to the 2.6.x scheduler
------------------------------------------------------------------------
1. runqueues - Every processor has a runqueue. It keeps track of all the runnable processes on a CPU.
2. prio_array - There are two priority arrays per runqueue: one active array, which contains the tasks whose timeslice is left, and one expired array, which contains the tasks whose timeslice has expired.

The above two data structures form the basis of the O(1) scheduler. They are defined in the kernel/sched.c file as below:

struct runqueue {
    spinlock_t lock;

    /*
     * nr_running and cpu_load should be in the same cacheline because
     * remote CPUs use both these fields when doing load calculation.
     */
    unsigned long nr_running;          /* number of runnable tasks */
#ifdef CONFIG_SMP
    unsigned long cpu_load;            /* this CPU's load - average no of processes on the CPU per unit of time */
#endif
    unsigned long long nr_switches;    /* number of context switches */
    unsigned long expired_timestamp,   /* time of last array swap */
                  nr_uninterruptible;  /* number of uninterruptible processes in the queue */
    unsigned long long timestamp_last_tick; /* timestamp of last scheduler tick */
    task_t *curr, *idle;               /* this processor's current and idle tasks */
    struct mm_struct *prev_mm;         /* the last running task's mm_struct */
    prio_array_t *active, *expired, arrays[2]; /* the active and expired prio_arrays */
    int best_expired_prio;             /* highest priority that exists in the expired prio_array */
    atomic_t nr_iowait;                /* number of tasks in the queue waiting on i/o */

#ifdef CONFIG_SMP
    struct sched_domain *sd;           /* in SMP systems there can be different scheduler domains */

    /* For active balancing */
    int active_balance;
    int push_cpu;
    /* the migration thread for the processor that this runqueue belongs to */
    task_t *migration_thread;
    struct list_head migration_queue;
#endif
};


struct prio_array {
    unsigned int nr_active;            /* number of runnable tasks in this prio_array */
    unsigned long bitmap[BITMAP_SIZE]; /* bitmap showing which priority levels contain tasks */
    struct list_head queue[MAX_PRIO];  /* one list head per priority level on the system */
};

1.5 A simple description of the 2.6.x O(1) algorithm
--------------------------------------------------------------------------

The prio_array data structure is extremely important as it is what allows the Linux scheduling algorithm to perform in O(1) time.
The basic structure in the Linux scheduler is the runqueue, defined above. There is one runqueue per processor, and within that runqueue there are two structures of type prio_array. One is for tasks that have not used up their timeslice yet, the other is for tasks that have used up their timeslice. The former are considered active, the latter expired. Note that active and expired have nothing to do with whether or not a task is runnable - active simply means that, since the last time timeslices were allocated, a given task in that queue has not used up its timeslice. A task in the active list still has time available on the CPU; tasks in the expired list have used up their timeslice.
The nr_active value stores the number of runnable tasks in the prio_array. The bitmap is a string of bits, one for each priority level on the system (140 by default), that indicates whether or not there are any tasks in the prio_array at a given priority level. The queue member is an array of list heads, one per priority level; each list holds all the tasks at that priority level.
So if there is only one runnable task in the prio_array, nr_active will be equal to one. If that task is not RT and has a nice value of 0 (that is, priority 120), there will be a one at position 120 of the bitmap to indicate that there is a task in the prio_array at that priority level, and the list at position 120 of the queue array will contain that single task. This is very useful, because in order to determine the next task to run, the scheduler simply (see the code sketch after this list):
1) looks to see if there are any runnable tasks in its active prio_array (i.e. is nr_active > 0)
2) if so, go to step 3, otherwise go to step 6
3) find the first 1 in the active prio_array's bitmap. There must be a 1 somewhere, since we know that there is a task in the prio_array and it must have a priority level.
4) run the first task in the list at the position in the prio_array's queue corresponding to the first 1 found in the bitmap.
5) when the task is done running for some reason, recalculate its new timeslice and put it in the expired prio_array. Decrement nr_active in the active prio_array and increment it in the expired prio_array. If the task was the last task at a given priority, clear the priority's bit in the active prio_array and make sure the priority's bit is set in the expired prio_array. Repeat steps 1-4 until no tasks exist in the active prio_array.
6) when no tasks exist in the active prio_array, swap the active and expired prio_arrays and start over again. Since timeslices are recalculated for each process when it is put onto the expired array, the swap of prio_arrays is fast (i.e. no sitting around recalculating a timeslice for every task).
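Here is the code sketch referred to above. It is only an illustration of the core lookup (the function name pick_next_task is invented for this sketch); the real schedule() in kernel/sched.c does the same thing inline, surrounded by runqueue locking, balancing and bookkeeping, using the runqueue and prio_array definitions listed earlier:

static task_t *pick_next_task(struct runqueue *rq)
{
    prio_array_t *array = rq->active;
    struct list_head *queue;
    int idx;

    if (!array->nr_active) {
        /* active array is empty: swap active and expired in O(1) */
        rq->active = rq->expired;
        rq->expired = array;
        array = rq->active;
    }

    /* the first set bit in the bitmap = the highest priority level
     * that currently has a runnable task */
    idx = sched_find_first_bit(array->bitmap);

    /* return the first task queued at that priority level */
    queue = array->queue + idx;
    return list_entry(queue->next, task_t, run_list);
}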

1.6 Dynamic priorities
-------------------------------------------------------------
In order to boost IO-bound processes, Linux uses a heuristic to determine which processes are interactive and boosts their priority. Tasks that are IO bound tend to sleep on IO, so the metric that Linux uses is based on this. The metric is sleep_avg, whose value is increased depending on the amount of time the task spends sleeping and decreased depending on the amount of time it spends on the CPU. It is fairly accurate, since it depends not only on the amount of time that the task spends sleeping but also on the amount of time that the task runs on the CPU. So CPU-bound processes which occasionally sleep for a long time will not get a priority boost, and IO-bound processes which wait on IO and then suddenly start hogging the CPU will also not get a priority boost. In summary, depending on sleep_avg, IO-bound processes get a maximum priority bonus of -5 and CPU-bound processes get a penalty of up to +5.

1.7 Interactive credits
---------------------------------------------------------------

They help to control the rise and fall of the interactive status of tasks. Tasks gain interactive credits when they sleep a large number of times and lose credits if they hog the CPU for a long time.
High interactivity task: > 100 interactive credits
Low interactivity task: < -100 interactive credits

A CPU hog which occasionally waits on IO for a long time must not get a sleep_avg increase, as it is still a CPU hog which waited on IO only once; such tasks are given low interactivity credit. Similarly, a highly interactive task must not get less run time just because it hogged the CPU for longer once.

1.8 Main scheduler functions
---------------------------------------------------------------------
A discussion of the important scheduler functions in sched.c follows.

schedule
--------
It starts by disabling preemption and checks whether it was called as a result of kernel preemption; then uninterruptible tasks are removed from the runqueue and interruptible tasks which have received a signal become runnable. Then a check is made whether there are any tasks in the runqueue; if there are no tasks, an attempt to rebalance is made. If balancing does not bring in any runnable tasks, the idle task is run. If there are runnable tasks but none in the active array, then the active and the expired arrays are swapped. At this point there is a runnable task in the active array, so the bitmap is checked to find the first bit that is set and thereby the highest-priority runnable task. At this point a chance is given for dependent sleepers to run (which happens only on an SMT system): the current CPU runs the idle task so that the dependent sleeper wakes up and runs. If there is no switch to the idle task at this point, a check is made whether the task chosen to run next is an RT task; if it is not an RT task and has just been woken up, it is given a higher sleep_avg (a bonus for interactive processes). Then the task switch is made, if the previous task is not equal to the next.

Invocation
The schedule function is invoked in the following cases:
1. When a task gives up the CPU voluntarily (sched_yield syscall)
2. The scheduler_tick function sets the TIF_NEED_RESCHED flag on a task because it has run out of time. scheduler_tick is executed on every timer interrupt.
3. When preemption is enabled. Preemption can occur as below:
i) User Preemption
.When returning to user space from a system call
.When returning to user space from an interrupt handler
ii) Kernel Preemption
.When returning to kernel space from an interrupt handler
.When kernel code becomes preemptible again
.If a task in the kernel explicitly calls schedule()
.If a task in the kernel blocks
In all the above cases except for the third one, a check on preempt_count is made: if preempt_count is not zero, a lock is held by the current process and the scheduler is not invoked.


effective_prio function
-----------------------

The effective_prio function calculates a task's dynamic priority. sleep_avg is the heuristic used to calculate the dynamic priority of a process. The function checks if the task is an RT task; if so, it returns immediately, as there is no dynamic priority for a real-time task. If the task is not a real-time task, the function calculates the dynamic priority of the task as follows:
bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
prio = p->static_prio - bonus;
if (prio < MAX_RT_PRIO)
    prio = MAX_RT_PRIO;
if (prio > MAX_PRIO-1)
    prio = MAX_PRIO-1;
return prio;
The CURRENT_BONUS macro maps the task's sleep_avg onto the range 0-MAX_BONUS (which is 10), so the bonus ranges from -5 to +5 as discussed above; it is then subtracted from the static priority to give the dynamic priority (prio), where a lower prio value means a higher priority. Finally, after the calculation of the dynamic priority, the function checks whether the priority falls within the non-RT range (MAX_RT_PRIO to MAX_PRIO-1, i.e. 100-139) and, if not, clamps prio to fall within that range.
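For reference, the bonus mapping described above is done by a macro in kernel/sched.c which looks roughly like this (quoted approximately; sleep_avg is kept in nanoseconds and scaled onto 0..MAX_BONUS):

/* approximate form of the mapping from sleep_avg to a 0..MAX_BONUS bonus */
#define CURRENT_BONUS(p) \
    (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / MAX_SLEEP_AVG)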

Invocation:
1.Called by recalc_task_prio(), which is called by the main schedule function when a task is inserted into the expired array and also when tasks are activated, i.e. when a task is moved into the runqueue.
2.The thread and process wakeup calls.
3.scheduler_tick() which runs during every timer interrupt.


task_timeslice function
-----------------------

A task's timeslice is based entirely on its static priority. The higher the task's static priority, the larger the task's timeslice. Essentially the function calculates the timeslice as the minimum timeslice plus the task's static priority scaled onto the maximum possible range.

Invocation
It is the interface used by the scheduler to know the timeslice of a process.
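A sketch of the calculation described above, modelled on 2.6.8-era kernel/sched.c (treat the exact macro and constants as approximate):

/* the timeslice is MIN_TIMESLICE plus the task's static priority scaled
 * onto the MIN_TIMESLICE..MAX_TIMESLICE range, so a higher static
 * priority (lower prio value) gives a larger slice */
#define BASE_TIMESLICE(p) (MIN_TIMESLICE + \
    ((MAX_TIMESLICE - MIN_TIMESLICE) * \
     (MAX_PRIO - 1 - (p)->static_prio) / (MAX_USER_PRIO - 1)))

static unsigned int task_timeslice(task_t *p)
{
    return BASE_TIMESLICE(p);
}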

load_balance function
---------------------
In an SMP system, balancing among the runqueues of different processors is done using the load_balance function. load_balance looks for the busiest group in the scheduler domain. Before proceeding further, let us see what a scheduler domain is. Scheduler domains come into the picture in the case of SMP and NUMA systems. In a NUMA system, a parent scheduling domain contains all the CPUs in the system (the CPUs of all nodes), the child scheduler domains contain all the CPUs in each node, and each child has one group per physical CPU (a group contains the virtual SMT CPUs). The load_balance function tries to find the busiest group in the domain; if there is no busiest group it exits. Then it identifies the busiest runqueue within the group (a runqueue is busy if it has more than 25% more processes than the current runqueue). After identifying the busiest runqueue it tries to pull tasks into the current runqueue, picking high-priority tasks from the expired array (expired, so as to maintain cache hotness).

Invocation
1.rebalance_tick() - runs on every timer tick; it checks whether load balancing has been done within the interval balance_interval before invoking load_balance.
2.schedule() - invoked when there are no runnable processes in the current runqueue.


1.9 Scheduler tuning
-----------------------------------------------------------
Some of the parameters in sched.c that can be used for scheduler tuning are as follows:
1.MIN_TIMESLICE and MAX_TIMESLICE - determine the default timeslice. Increased timeslices improve efficiency, as context switches decrease, but response time becomes slower.
2.PRIO_BONUS_RATIO - This parameter is used to calculate how much penalty or bonus must be given depending on whether the task is CPU bound or IO bound. It essentially controls to what degree the static priority is effective: if it is high, static_prio is less effective and the dynamic priority becomes more effective, so IO-bound processes are boosted further.
3.MAX_SLEEP_AVG - Increasing this value means tasks have to sleep longer to be considered interactive. Efficiency might improve, as there might be fewer changes in dynamic priority.
4.STARVATION_LIMIT - This value controls fairness: if a task has not run for this limit, interactive tasks stop running so that this task can run. Decreasing the value increases fairness but interactivity decreases; increasing the value improves interactivity at the expense of non-interactive tasks.

1.10 Future of 2.6.x scheduler
---------------------------------------------------------------------

1.The 2.6.x scheduler is a generic scheduler which tries to perform well both in the server market (where efficiency is important) and in the desktop market (where response time is important). There have been proposals to have schedulers designed specifically for each of these systems in the kernel and to provide, perhaps, a proc interface to select the scheduler depending on the system. Ingo rejected this kind of idea, saying that it would decrease the amount of work put into a generic scheduler which performs well in all conditions.
2. Shared runqueues - SMT support in the kernel is not perfect; for example, if there are idle virtual processors, balancing will be done onto these virtual CPUs instead of onto a different physical CPU. The solution put forward by Robert M. Love is to have shared runqueues, i.e. runqueues will be shared among the virtual CPUs in the case of SMT.

References
----------
1.Josh Aas's documentation of the 2.6.8.1 scheduler:
josh.trancesoftware.com/linux/linux_cpu_scheduler.pdf
2.Robert Love, Linux Kernel Development, First Edition.
The scheduler chapter of the first edition is available online as a sample chapter:
http://www.samspublishing.com/articles/article.asp?p=101760
3.Kernel source (kernel/sched.c and include/linux/sched.h)