<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-8801463</id><updated>2012-01-08T12:57:00.022-08:00</updated><title type='text'>whatisthekernel</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://whatisthekernel.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8801463/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://whatisthekernel.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Saravanan</name><uri>http://www.blogger.com/profile/12068489124788465950</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>8</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-8801463.post-113800083772230935</id><published>2006-01-22T23:18:00.000-08:00</published><updated>2006-01-22T23:20:38.106-08:00</updated><title type='text'>Interrupt Handling Internals in Linux Kernel</title><content type='html'>*************************************************&lt;br /&gt;Interrupt Handling Internals in Linux Kernel&lt;br /&gt;*************************************************&lt;br /&gt;&lt;br /&gt;===========================&lt;br /&gt;Author: Gaurav Dhiman&lt;br /&gt;Email : gauravd.chd@gmail.com&lt;br /&gt;        gaurav4lkg@gmail.com&lt;br /&gt;        gauravd_chd@yahoo.com&lt;br /&gt;===========================&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Index:&lt;br /&gt;=====&lt;br /&gt;- Introduction&lt;br /&gt;- CPU Support for Handling Interrupts&lt;br /&gt;        - Details of Programmable Interrupt Controller&lt;br /&gt;        - Hardware checks performed by CPU&lt;br /&gt;                - Details of Interrupt Descriptor Table&lt;br /&gt;                        - Task Gates&lt;br /&gt;                        - Trap Gates&lt;br /&gt;                        - Interrupt Gates&lt;br /&gt;                - Hardware checks for Interrupts&lt;br /&gt;- Kernel Support for Handling Interrupts&lt;br /&gt;        - Low Level Interrupt Stubs&lt;br /&gt;        - Details of do_IRQ() function, core of Inteuupt Handling&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Introduction&lt;br /&gt;==========&lt;br /&gt;&lt;br /&gt;This article talks about internal details of Interrupt Handling in Linux Kernel. This will discuss, the hardware prospective of interrupt handling from CPU, Linux Kernel's Interrupt Routing subsystem, Device Drivers's role in Interrupt handling.&lt;br /&gt;&lt;br /&gt;Term Interrupt is self defined, Interrupts are signals sent to CPU on an INTR bus (connected to CPU) whenever any device want to get attention of CPU. As soon as the interrupt signal occurs, CPU defer the current activity and service the interruptby executing the interrupt handler corresponding to that interrupt number (also know as IRQ number).&lt;br /&gt;&lt;br /&gt;One of the clasifications of Interrupts can be done as follows:&lt;br /&gt;- Synchronous Interrupts (also know on as software interrupts)&lt;br /&gt;- Asynchronous Interrupts (also know as hardware interrupts)&lt;br /&gt;&lt;br /&gt;Basic difference between these is that, synchronous interrupts are generated by CPU's control unit on facing some abnormal condition; these are also know as exception in Intel's termenology. These are interrupts whihc are generated by CPU itself either when CPU detects an abnormal condition or CPU executes some of the special instructions like 'int' or 'int3' etc. on other hand, asynchronous interupts are those, which actually are generated by outside world (devices connected to CPU), As these interrupts can occur at any point of time, these are known as asynchronous interrupts.&lt;br /&gt;&lt;br /&gt;Its important to note that both synchornous and asynchronous interrupts are handled by CPU on the completion of insturctionduring which the interrupt occur. Execution of a machine instruction is not done in one single CPU cycle, it take some cycles to complete. Any interrupt occurs in between the execution of instruction, will not be handled imediately, rather CPU will check of interrupts on the completion of instruction.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;CPU support for handling interrupts&lt;br /&gt;=============================&lt;br /&gt;&lt;br /&gt;For handling interrupts there are few of the things which we expect the CPU to do on occurence of every interrupt. Wheneveran interrupt occurs, CPU performs some of the hardware checks, which are very much needed to make the system secure. Beforeexplaining the hardware checks, we will understand how the interrupts are routed to the CPU from hardware devices.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Details of Programmable Interrupt Controller&lt;br /&gt;----------------------------------------------------------------&lt;br /&gt;&lt;br /&gt;On Intel architecture, system devices (device controllers) are connected to a special device known as PIC (Programmable Interrupt Controller). CPU have two lines for receiving interrupt signals (NMI and INTR). NMI line is to recieve non-maskable interrupts; the interrupts which can not be masked, means which can not be blocked at any cost. These interrupts are of hightest priority and are rarely used. INTR line is the line on which all the interrupts from system devices are received. These interrupts can be masked or blocked. As all the interrupt signals need to be multiplxed on single CPU line, we need some mechanisum through which interrupts from different device controllers can be routed to single line of CPU. This routing or multiplexing is done PIC (Programmable Interrupt Controller). PIC sits between system devices and CPU and have multiple input lines; each line connected to different divice contollers in system. On other hand IPC have only one output line which is connected to the CPU's INTR line on which it sends signal to CPU.&lt;br /&gt;&lt;br /&gt;There are two PIC controllers joined together and the output of second PIC controller is connected to the second input of first PCI. This setup allows maximum of 15 input lines on which different system device controllers can be connected. PIC have some programmable registers, through which CPU communicates with it (give command, mask/unmask interrup lines, read status). Both PICs have their own following registers:&lt;br /&gt;&lt;br /&gt;- Mask Register&lt;br /&gt;- Status Register&lt;br /&gt;&lt;br /&gt;Mask register is used to mask/unmask a specific interrupt line. CPU can ask the PIC to mask (block) the specific interrupt by setting the corresponding bit in mask register. Unmasking can be done by clearing that bit. When a particular interrupt is being masked, PIC do receive the interrupts on its corresponding line, but do not send the interrupt to CPU in whihc case tCPU keps on doing what it was doing. When an interrupts are being masked, they are not lost, rather PIC remembers those anddo send the interrupt to CPU when CPU unmasks that interrupt line. Masking is different from blocking all the interrupts toCPU. CPU can ignore all the interrupts coming on INTR line by clearing the IF flag in EFLAGS register of CPU. When this bitis cleared, interrupts coming on INTR line are simply ignored by CPU, we can consider it to be blocking of interrupts. So now we understand that masking is done at PIC level and individual interrupt lines can be masked or unmasked, where as blocking is done at CPU level and is done for all the interrupts except NMI (Non-Maskable Interrupt), which is received on NMI line of CPU and can not be blocked or ignored.&lt;br /&gt;&lt;br /&gt;Now days, interrupt architecture is not as simple as shown above. Now days machines uses the APIC (Advanced Programmable Interrupt Controller), which can support upto 256 interrupt lines. Along with APIC, every CPU also have inbuilt IO-APIC. We wont go into details of these right now (will be covered in future articles).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Hardware checks performed by CPU&lt;br /&gt;----------------------------------------------------&lt;br /&gt;&lt;br /&gt;Once the interrupt signal is received by CPU, CPU performs some hardware checks for which no software machine instructions are executed. Before looking into what these checks are, we need to understand some architecture spcific data structures maintained by kernel.&lt;br /&gt;&lt;br /&gt;Details of Interrupt Descriptor Table&lt;br /&gt;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~&lt;br /&gt;&lt;br /&gt;Kernel need to maintain one IDT (Interrupt Descriptor Table), which actually maps the interrupt line with the interrupt handler routine. This table is of 256 enteries and each entry is of 8 bytes. First 32 enteries of this table are used for exceptions and rest are used for hardware interrupts received from outer world. This table can contain three different type of enteries; these three different types are as follows:&lt;br /&gt;&lt;br /&gt;- Task Gates&lt;br /&gt;- Trap Gates&lt;br /&gt;- Interrupt Gates&lt;br /&gt;&lt;br /&gt;Lets see what these gates are where these are used.&lt;br /&gt;&lt;br /&gt;Task Gates&lt;br /&gt;##########&lt;br /&gt;&lt;br /&gt;Format of task gate entry is as follows:&lt;br /&gt;- 0-15 bits ---- reserved (not used)&lt;br /&gt;- 16-31 bits ---- points to the TSS (Task State Segment) entry of the process to which we need to switch.&lt;br /&gt;- 32-39 bits ---- these bits are reserved and are not currently used.&lt;br /&gt;- 40-43 bits ---- specify the type of entry (its value for task gate is 0101)&lt;br /&gt;- 44th bit ---- always 0, not used&lt;br /&gt;- 45-46 bits ---- this specifies the DPL (Decsriptor Previlege Level) level of gate entry.&lt;br /&gt;- 47th bit ---- specifies if this entry is valid or not (1 - valid, 0 - invalid)&lt;br /&gt;- 48-63 bits ---- reserved (not used)&lt;br /&gt;&lt;br /&gt;Basically the task gates are used in IDT, to allow the user processs to make a context switch with another process without requesting the kernel to do this. As soon as this gate is hit (interrupt received on line for which there is a task gate in IDT), CPU saves the context (state of processor registers) of currently running process to the TSS of current process, whoseaddress is saved in TR (Task Register) of CPU. After saving the context of current process, CPU sets the CPU registers withthe values stored in the TSS of new process, whose pointer is saved in the 16-31 bits of the task gate. Once the registers are set with these new values, processor gets the new process and the context switch is done. Linux do not use the task gates, it only uses the trap and interrupt gates in IDT. So I will not explain the task gates any more.&lt;br /&gt;&lt;br /&gt;Trap Gates&lt;br /&gt;##########&lt;br /&gt;&lt;br /&gt;Format of trap gates is as follows:&lt;br /&gt;- 0-15 bits ---- first 16 bits of a pointer to a kernel function which need to be invoked when this gate is hit&lt;br /&gt;- 16-31 bits ---- indicates the index of segment descriptor in GDT (Global Descriptor Table)&lt;br /&gt;- 32-36 bits ---- these bits are reserved and are not currently used.&lt;br /&gt;- 37-39 bits ---- always 000, not used&lt;br /&gt;- 40-43 bits ---- specify the type of entry (its value for trap gate is 1111)&lt;br /&gt;- 44th bit ---- always 0, not used&lt;br /&gt;- 45-46 bits ---- this specifies the DPL (Decsriptor Previlege Level) level of gate entry.&lt;br /&gt;- 47th bit ---- specifies if this entry is valid or not (1 - valid, 0 - invalid)&lt;br /&gt;- 48-63 bits ---- last 16 bits of a pointer to a kernel function which need to be invoked when this gate is hit&lt;br /&gt;&lt;br /&gt;Trap gates are basically used to handle exceptions generated by CPU. 0-15 bits and 48-63 bits together form the pointer (offset in segment identified by 16-31 bits of this entry) to a kernel function. The only difference between trap gates and interrupt gates is that, whenever an interrupt gate is hit, CPU automatically disables the interrupts by clearing the IF flag in CPU's EFLAG register, whereas in case of trap gate this is not done and interrupts remain enabled. As mentioned earlier trap gates are used for exceptions, so first 32 enteries in IDT are initialized with trap gates. In addition to this Linux Kernel also uses the trap gate for system call entry (entry 128 of IDT).&lt;br /&gt;&lt;br /&gt;Interrupt Gates&lt;br /&gt;#############&lt;br /&gt;&lt;br /&gt;Format of interrupt gates is same as trap gates explained above, expect the value of type field (40-43 bits). In case of trap gates this have a value 1111 and in case of interrupts its 1110.&lt;br /&gt;&lt;br /&gt;Format is as follows:&lt;br /&gt;- 0-15 bits ---- first 16 bits of a pointer to a kernel function which need to be invoked when this gate is hit&lt;br /&gt;- 16-31 bits ---- indicates the index of segment descriptor in GDT (Global Descriptor Table)&lt;br /&gt;- 32-36 bits ---- these bits are reserved and are not currently used.&lt;br /&gt;- 37-39 bits ---- always 000, not used&lt;br /&gt;- 40-43 bits ---- specify the type of entry (its value for interrupt gate is 1110)&lt;br /&gt;- 44th bit ---- always 0, not used&lt;br /&gt;- 45-46 bits ---- this specifies the DPL (Decsriptor Previlege Level) level of gate entry.&lt;br /&gt;- 47th bit ---- specifies if this entry is valid or not (1 - valid, 0 - invalid)&lt;br /&gt;- 48-63 bits ---- last 16 bits of a pointer to a kernel function which need to be invoked when this gate is hit&lt;br /&gt;&lt;br /&gt;Note: whenever the interrupt gate is hit, interrupts are disabled automatically.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Hardware Checks for Interrupts and Exceptions&lt;br /&gt;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~&lt;br /&gt;&lt;br /&gt;Whenever an exception or interrupt occurs, corresponding trap/interrupt gate is hit and CPU performs some checks with fields of these gates. Things done by CPU are as follows:&lt;br /&gt;&lt;br /&gt;1). get the ith entry from IDT (physical address and size of IDT is stored in IDTR register of CPU), here 'i' means the interrupt number.&lt;br /&gt;&lt;br /&gt;2). read the segment descriptor index from 16-31 bits of IDT entry, lets say this to be 'n'&lt;br /&gt;&lt;br /&gt;3). gets the segment descriptor from 'n'th entry in GDT (physical address and size of GDT is stored in GDTR register of CPU)&lt;br /&gt;&lt;br /&gt;4). DPL of the nth entry in the GDT should be less that equal to CPL (Current Previelge Level, specified in the read-only lowermost two bits of CS register). Incase DPL &gt; CPL, CPU will generate general protection exception. We will see ahead, whatdoes this check mean and why this is done. Simply saying:&lt;br /&gt;        a). DPL (of GDT entry) &lt;= CPL ----- ok, switches stack if DPL &lt; CPL&lt;br /&gt;        b). DPL (of GDT entry) &gt; CPL ----- general protection exception&lt;br /&gt;If DPL (of GDT entry) &lt; CPL, we are entering the higher previlege level (probably from user to kernel mode). In this case CPU switches the hardware stack (SS and ESP registers) from currently running process's user mode stack to its kernel mode stack. We will see ahead, how this stack switch is exactly done. Note: stack switching idea has been mentioned here, but it actually happens after the 5th step mentioned below.&lt;br /&gt;&lt;br /&gt;5). for software interrupts (generated by assembly instructions 'int'), one more check is done. This check is not performedfor hardware interrupts (interrupts generated by system devices and forwarded by PIC). Simply saying:&lt;br /&gt;        a). DPL (of IDT entry) &gt;= CPL ---- ok, we have permission to enter through this gate&lt;br /&gt;        b). DPL &lt; CPL ---- genreal protection exception&lt;br /&gt;&lt;br /&gt;6). switches the stack if DPL (of GDT entry) &lt; CPL. In addition to this mode of CPU (least significant two bits of CS) are also changed from CPL to DPL (of GDT entry)&lt;br /&gt;&lt;br /&gt;7). if the stack switch has taken place (SS and ESP registers reset to kernelstack), then pushes the old values of SS and ESP (pointing to user stack) on this new stack (kernel stack)&lt;br /&gt;&lt;br /&gt;8). pushes the EFALGS, CS and EIP registers on the stack (note: now we are working on kernel stack). This actually saves the pointer to user application instruction to which we need to return back after servicing the interrupt or exception&lt;br /&gt;&lt;br /&gt;9). In case of exceptions, if there is any harware code, processor pushes that also on kernel stack&lt;br /&gt;&lt;br /&gt;10). loads the CS with the value of GDT entry and EIP with the offset entry of IDT (0-15 bits + 48-63 bits)&lt;br /&gt;&lt;br /&gt;All the above action is done by CPU hardware without the execution of any software instruction. Checks performed at step 4th and 5th (mentioned above) are important.&lt;br /&gt;&lt;br /&gt;4th checks make sure that the code we are going to execute (Interrupt Service Routine) does not fall in a segment with lesser previlege. Obivously the ISR can not be in lesser previlege segment that what we are into. DPL or CPL can have 4 values (0,1,2 for kernel mode and 3 fo user mode). Out of these four only two are used, that is 0 (for kernel mode) and 3 (for user mode).&lt;br /&gt;&lt;br /&gt;5th check makes sure that application can enter the kernel mode through specific gaes only, in Linux only through 128th gate entry which is for system call invocation. If we set the DPL field of IDT entry to be 0,1 or 2, application programme (running with CPL 3) cannot enter through that gate entry. If it tries, CPU will generate general protection exception. This is the reason that in Linux, DPL fields of all the IDT enteries (except 128th entry used for system call) are initialized with value '0', this makes sure only kernel code can access these gates not application code. In Linux 128th entry (used for system call) is of trap gate type and its DPL value is initialized to 3, so that application code can enter through this gate byassembly instruction "int 0x80"&lt;br /&gt;&lt;br /&gt;Now lets see how does the stack switch happens when the DPL (of GDT entry) &lt; CPL. CPU have TR (Task Register) register, which actually points to the TSS (Task Sate Segment) od currently running process. TSS is an architecture defined data structure which contains the stae of processor registers whenever context switch of this process happens. TSS include three sets of ESS and ESP fields, one for each level of processor (0,1 and 2). These fields specifies the stack to be used whenevr we entert that processor level. Lets say the DPL value in GDT entry is 0, in this case, CPU will load the SS register with the value of SS field in TSS for 0 level and ESP register with the value of ESP field in TSS for 0 level. After loading the SS and ESP with these values, CPU starts pointing to the new kernel level stack o current process. Old values of SS and ESP (CPU remembers them somehow) are now pushed on this new kernel level stack; this is done as we need to return back to old stack oncewe service the interrupts, exception or system call. Prudent readers must be wondering, why there is no firld for level 3 stack in TSS. Well the reason for this is that we never use the CPU's stack switching mechanism to switch from higher CPU level (kernel mode - 0,1 and 2) to lower CPU level (user mode - 3). This is the reason that CPU while entering the higher level(kernel mode) saves the previously used lower level stack (user mode) on the kernel stack.&lt;br /&gt;&lt;br /&gt;Once all this CPU action is done, CPU's CS and EIP registers are pointing to the kernel functions written for handling interrupts or exceptions. CPU simply start executing the instructions at this point (now we are in kernel mode - level 0)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Kernel Support for Handling Interrupts&lt;br /&gt;================================&lt;br /&gt;&lt;br /&gt;In this section, we will be covering and walk through the kernel code executed in interrupt context. I will be reffering the the code as per 2.4.18 release of kernel.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Low Level Interrupt Stubs&lt;br /&gt;~~~~~~~~~~~~~~~~~~~~~&lt;br /&gt;&lt;br /&gt;Whenever an interrupt occurs, CPU performs the above mentioned hardware checks and start executing the following assembly instructions in kernel, whose pointer (offest in kernel code segment) is stored correstonding IDT entry.&lt;br /&gt;&lt;br /&gt;File: include/asm-i386/hw_irq.h&lt;br /&gt;###########################################################&lt;br /&gt;&lt;br /&gt;155 #define BUILD_COMMON_IRQ() 156 asmlinkage void call_do_IRQ(void); 157 __asm__( 158         "\n" __ALIGN_STR"\n" 159         "common_interrupt:\n\t" 160         SAVE_ALL 161         SYMBOL_NAME_STR(call_do_IRQ)":\n\t" 162         "call " SYMBOL_NAME_STR(do_IRQ) "\n\t" 163         "jmp ret_from_intr\n");&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;175 #define BUILD_IRQ(nr) 176 asmlinkage void IRQ_NAME(nr); 177 __asm__( 178 "\n"__ALIGN_STR"\n" 179 SYMBOL_NAME_STR(IRQ) #nr "_interrupt:\n\t" 180         "pushl $"#nr"-256\n\t" 181         "jmp common_interrupt");&lt;br /&gt;&lt;br /&gt;###########################################################&lt;br /&gt;&lt;br /&gt;This macros is used at the kernel initialization time to write out the lowest interrupt stubs, which can be called from IDTby saving there offsets (pointers) in IDT gates. Kernel maintains one global array of function pointers (name of array - interrupt) in which it stores the pointer of these stubs. Code related to creation of these stubs (using above mentioned BUILD_IRQ macro) and saving their pointers in the global array "interrupt[NR_IRQS]" can be seen in file "arch/x86_64/kernel/i8259.c". In this file you will see the usage of BUILD_IRQ macro to create the interrupt stubs as follows:&lt;br /&gt;&lt;br /&gt;File: arch/i386/kernel/i8259.c&lt;br /&gt;###########################################################&lt;br /&gt;&lt;br /&gt;40 #define BI(x,y) 41         BUILD_IRQ(x##y)&lt;br /&gt;42&lt;br /&gt;43 #define BUILD_16_IRQS(x) 44         BI(x,0) BI(x,1) BI(x,2) BI(x,3) 45         BI(x,4) BI(x,5) BI(x,6) BI(x,7) 46         BI(x,8) BI(x,9) BI(x,a) BI(x,b) 47         BI(x,c) BI(x,d) BI(x,e) BI(x,f)&lt;br /&gt;48&lt;br /&gt;49 /*&lt;br /&gt;50  * ISA PIC or low IO-APIC triggered (INTA-cycle or APIC) interrupts:&lt;br /&gt;51  * (these are usually mapped to vectors 0x20-0x2f)&lt;br /&gt;52  */&lt;br /&gt;53 BUILD_16_IRQS(0x0)&lt;br /&gt;54&lt;br /&gt;55 #ifdef CONFIG_X86_IO_APIC&lt;br /&gt;56 /*&lt;br /&gt;57  * The IO-APIC gives us many more interrupt sources. Most of these&lt;br /&gt;58  * are unused but an SMP system is supposed to have enough memory ...&lt;br /&gt;59  * sometimes (mostly wrt. hw bugs) we get corrupted vectors all&lt;br /&gt;60  * across the spectrum, so we really want to be prepared to get all&lt;br /&gt;61  * of these. Plus, more powerful systems might have more than 64&lt;br /&gt;62  * IO-APIC registers.&lt;br /&gt;63  *&lt;br /&gt;64  * (these are usually mapped into the 0x30-0xff vector range)&lt;br /&gt;65  */&lt;br /&gt;66                    BUILD_16_IRQS(0x1) BUILD_16_IRQS(0x2) BUILD_16_IRQS(0x3)&lt;br /&gt;67 BUILD_16_IRQS(0x4) BUILD_16_IRQS(0x5) BUILD_16_IRQS(0x6) BUILD_16_IRQS(0x7)&lt;br /&gt;68 BUILD_16_IRQS(0x8) BUILD_16_IRQS(0x9) BUILD_16_IRQS(0xa) BUILD_16_IRQS(0xb)&lt;br /&gt;69 BUILD_16_IRQS(0xc) BUILD_16_IRQS(0xd)&lt;br /&gt;70 #endif&lt;br /&gt;71&lt;br /&gt;72 #undef BUILD_16_IRQS&lt;br /&gt;73 #undef BI&lt;br /&gt;&lt;br /&gt;###########################################################&lt;br /&gt;&lt;br /&gt;Above code actually creates the interrupt stubs and do not place there pointers in interrupt[NR_IRQS] array. The code whichplaces the pointers of these stubs in global array is as follows and can be found in same file "arch/x86_64/kernel/i8259.c"&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;File: arch/i386/kernel/i8259.c&lt;br /&gt;###########################################################&lt;br /&gt;&lt;br /&gt;100 #define IRQ(x,y) 101         IRQ##x##y##_interrupt&lt;br /&gt;102&lt;br /&gt;103 #define IRQLIST_16(x) 104         IRQ(x,0), IRQ(x,1), IRQ(x,2), IRQ(x,3), 105         IRQ(x,4), IRQ(x,5), IRQ(x,6), IRQ(x,7), 106         IRQ(x,8), IRQ(x,9), IRQ(x,a), IRQ(x,b), 107         IRQ(x,c), IRQ(x,d), IRQ(x,e), IRQ(x,f)&lt;br /&gt;108&lt;br /&gt;109 void (*interrupt[NR_IRQS])(void) = {&lt;br /&gt;110         IRQLIST_16(0x0),&lt;br /&gt;111&lt;br /&gt;112 #ifdef CONFIG_X86_IO_APIC&lt;br /&gt;113                          IRQLIST_16(0x1), IRQLIST_16(0x2),&lt;br /&gt;IRQLIST_16(0x3),&lt;br /&gt;114         IRQLIST_16(0x4), IRQLIST_16(0x5), IRQLIST_16(0x6),&lt;br /&gt;IRQLIST_16(0x7),&lt;br /&gt;115         IRQLIST_16(0x8), IRQLIST_16(0x9), IRQLIST_16(0xa),&lt;br /&gt;IRQLIST_16(0xb),&lt;br /&gt;116         IRQLIST_16(0xc), IRQLIST_16(0xd)&lt;br /&gt;117 #endif&lt;br /&gt;118 };&lt;br /&gt;119&lt;br /&gt;120 #undef IRQ&lt;br /&gt;121 #undef IRQLIST_16&lt;br /&gt;&lt;br /&gt;###########################################################&lt;br /&gt;&lt;br /&gt;Above code actually filles the global array of function pointers (array name interrupt[NR_IRQS]). Once the global array is nitialized with the pointers to interrupt stubs, we initialize the IDT (Interrupt Descriptor Table) in function "init_IRQ()"using this global array as follows:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;File: arch/i386/kernel/i8259.c, Function: init_IRQ()&lt;br /&gt;###########################################################&lt;br /&gt;&lt;br /&gt;for (i = 0; i &lt; (NR_VECTORS - FIRST_EXTERNAL_VECTOR); i++) {&lt;br /&gt;       int vector = FIRST_EXTERNAL_VECTOR + i;&lt;br /&gt;       if (i &gt;= NR_IRQS)&lt;br /&gt;              break;&lt;br /&gt;       if (vector != IA32_SYSCALL_VECTOR &amp;&amp;amp; vector != KDB_VECTOR) {&lt;br /&gt;              set_intr_gate(vector, interrupt[i]);&lt;br /&gt;}&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;###########################################################&lt;br /&gt;&lt;br /&gt;In above loop, we loop over all the IDT enteries staring from "FIRST_EXTERNAL_VECTOR" (32, because first 32 enteries are for exception) and call "set_intr_gate()" function which actually set the interrupt gate descriptor. For entry 128, which is for system call invocation, interrupt gte is not set, for this rather trap gate is set and that is done in function trap_init(). In the same function init_IRQ(), after this looping, we initialize the IPI (Interprocessor Interrupts). These interruptsare sent from one CPU to another CPU in SMP machines.&lt;br /&gt;&lt;br /&gt;Now we can see once these IDT eneries are set, whenever an interrupt occurs, CPU directly jumps to the code given in BUILD_IRQ macro. Now lets analyse what this macro do. Following is the code for BUILD_IRQ macro:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;File: include/asm-i386/hw_irq.h&lt;br /&gt;###########################################################&lt;br /&gt;&lt;br /&gt;#define BUILD_IRQ(nr) asmlinkage void IRQ_NAME(nr); __asm__( "\n.p2align\n" "IRQ" #nr "_interrupt:\n\t"        "push $" #nr "-256 ; "        "jmp common_interrupt");&lt;br /&gt;&lt;br /&gt;###########################################################&lt;br /&gt;&lt;br /&gt;This assembly code first subtracts the IRQ number from 256 and pushes the result on kernel stack. After doing this it jumpsto "common_interrupt" assembly label, which simply saves the context of interrupted process (CPU resigters) on to kernel stack and then calls the C language function "do_IRQ()".&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Details of do_IRQ() function, core of Inteuupt Handling&lt;br /&gt;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~&lt;br /&gt;&lt;br /&gt;do_IRQ() is the common function to all hardware interrupts. This function is the most important to understand from the prespective of interrupt handling. We will first show the code of whole function and then explain it line by line in coming paragraphs with line refferences.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;File: arch/i386/kernel/irq.c&lt;br /&gt;###########################################################&lt;br /&gt;&lt;br /&gt;563 asmlinkage unsigned int do_IRQ(struct pt_regs regs)&lt;br /&gt;564 {&lt;br /&gt;565         /*&lt;br /&gt;566          * We ack quickly, we don't want the irq controller&lt;br /&gt;567          * thinking we're snobs just because some other CPU has&lt;br /&gt;568          * disabled global interrupts (we have already done the&lt;br /&gt;569          * INT_ACK cycles, it's too late to try to pretend to the&lt;br /&gt;570          * controller that we aren't taking the interrupt).&lt;br /&gt;571          *&lt;br /&gt;572          * 0 return value means that this irq is already being&lt;br /&gt;573          * handled by some other CPU. (or is disabled)&lt;br /&gt;574          */&lt;br /&gt;575         int irq = regs.orig_eax &amp; 0xff; /* high bits used in ret_from_&lt;br /&gt;code  */&lt;br /&gt;576         int cpu = smp_processor_id();&lt;br /&gt;577         irq_desc_t *desc = irq_desc + irq;&lt;br /&gt;578         struct irqaction * action;&lt;br /&gt;579         unsigned int status;&lt;br /&gt;580&lt;br /&gt;581         kstat.irqs[cpu][irq]++;&lt;br /&gt;582         spin_lock(&amp;desc-&gt;lock);&lt;br /&gt;583         desc-&gt;handler-&gt;ack(irq);&lt;br /&gt;584         /*&lt;br /&gt;585            REPLAY is when Linux resends an IRQ that was dropped earlier&lt;br /&gt;586            WAITING is used by probe to mark irqs that are being tested&lt;br /&gt;587            */&lt;br /&gt;588         status = desc-&gt;status &amp; ~(IRQ_REPLAY | IRQ_WAITING);&lt;br /&gt;589         status |= IRQ_PENDING; /* we _want_ to handle it */&lt;br /&gt;590&lt;br /&gt;591         /*&lt;br /&gt;592          * If the IRQ is disabled for whatever reason, we cannot&lt;br /&gt;593          * use the action we have.&lt;br /&gt;594          */&lt;br /&gt;595         action = NULL;&lt;br /&gt;596         if (!(status &amp; (IRQ_DISABLED | IRQ_INPROGRESS))) {&lt;br /&gt;597                 action = desc-&gt;action;&lt;br /&gt;598                 status &amp;= ~IRQ_PENDING; /* we commit to handling */&lt;br /&gt;599                 status |= IRQ_INPROGRESS; /* we are handling it */&lt;br /&gt;600         }&lt;br /&gt;601         desc-&gt;status = status;&lt;br /&gt;602&lt;br /&gt;603         /*&lt;br /&gt;604          * If there is no IRQ handler or it was disabled, exit early.&lt;br /&gt;605            Since we set PENDING, if another processor is handling&lt;br /&gt;606            a different instance of this same irq, the other processor&lt;br /&gt;607            will take care of it.&lt;br /&gt;608          */&lt;br /&gt;609         if (!action)&lt;br /&gt;610                 goto out;&lt;br /&gt;611&lt;br /&gt;612         /*&lt;br /&gt;613          * Edge triggered interrupts need to remember&lt;br /&gt;614          * pending events.&lt;br /&gt;615          * This applies to any hw interrupts that allow a second&lt;br /&gt;616          * instance of the same irq to arrive while we are in do_IRQ&lt;br /&gt;617          * or in the handler. But the code here only handles the _second_&lt;br /&gt;618          * instance of the irq, not the third or fourth. So it is mostly&lt;br /&gt;619          * useful for irq hardware that does not mask cleanly in an&lt;br /&gt;620          * SMP environment.&lt;br /&gt;621          */&lt;br /&gt;622         for (;;) {&lt;br /&gt;623                 spin_unlock(&amp;desc-&gt;lock);&lt;br /&gt;624                 handle_IRQ_event(irq, &amp;regs, action);&lt;br /&gt;625                 spin_lock(&amp;desc-&gt;lock);&lt;br /&gt;626&lt;br /&gt;627                 if (!(desc-&gt;status &amp; IRQ_PENDING))&lt;br /&gt;628                         break;&lt;br /&gt;629                 desc-&gt;status &amp;= ~IRQ_PENDING;&lt;br /&gt;630         }&lt;br /&gt;631         desc-&gt;status &amp;= ~IRQ_INPROGRESS;&lt;br /&gt;632 out:&lt;br /&gt;633         /*&lt;br /&gt;634          * The -&gt;end() handler has to deal with interrupts which got&lt;br /&gt;635          * disabled while the handler was running.&lt;br /&gt;636          */&lt;br /&gt;637         desc-&gt;handler-&gt;end(irq);&lt;br /&gt;638         spin_unlock(&amp;desc-&gt;lock);&lt;br /&gt;639&lt;br /&gt;640         if (softirq_pending(cpu))&lt;br /&gt;641                 do_softirq();&lt;br /&gt;642         return 1;&lt;br /&gt;643 }&lt;br /&gt;&lt;br /&gt;###########################################################&lt;br /&gt;&lt;br /&gt;Here is the detailed explaination of do_IRQ() function, this has been&lt;br /&gt;explained below line by line.&lt;br /&gt;&lt;br /&gt;Line - 575 to 577&lt;br /&gt;Get the number of the interrupt that got triggered. Its pushed on the kernel stack before pushing the context of the interrupted process. Get the processor or CPU id o which this code is being executed or in other means the CPU id of processor handling this interrupt. Get the pointer to the IRQ descriptor. IRQ descriptor is a kernel data structure which actually binds together the different ISRs (Interrupt Service Routines) registere by device drivers for same IRQ line. As mentioned earlieralso, same IRQ line can b shared between different devices, so their device drivers need to register their own ISRs to handle the interrupts genetated by these devices. IRQ descriptor data structure is defined as follows:&lt;br /&gt;&lt;br /&gt;typedef struct {&lt;br /&gt;        unsigned int status;&lt;br /&gt;        hw_irq_controller *handler;&lt;br /&gt;        struct irqaction *action;&lt;br /&gt;        unsigned int depth;&lt;br /&gt;        spinlock_t lock;&lt;br /&gt;} ____cacheline_aligned irq_desc_t;&lt;br /&gt;&lt;br /&gt;Following is the significance of different elements in this stucture:&lt;br /&gt;&lt;br /&gt;- status : Its a bit mask of different flags to identify the state of a particular IRQ line. We will see the use of differnet flags ahead in this article.&lt;br /&gt;&lt;br /&gt;- handler : This is the pointer to the structure, whose each element is the pointer to the function related to the handlingof physical PIC (programmable interrupt controller). These functions are used to mask/unmask particular interrup line in PIC or to acknowledge the interrupt to PIC. The definitions of these PIC related functions can be found in file "arch/i386/kernel/i8259.c"&lt;br /&gt;&lt;br /&gt;- action : This element is the pointer to the list o ISRs registered by different device drivers for this IRQ line. When a device driver registers its ISR to kernel using kernel function "irq_request()", the ISR is added to this list for that particular IRQ line.&lt;br /&gt;&lt;br /&gt;- lock : This is spinlock to handle the synchronization problem while accessing any element in IRQ descriptor. Kernel execution context access the different elements of IRQ descriptor, but before doing so they should acquire this spinlock so that the synchronization can be maintained.&lt;br /&gt;&lt;br /&gt;Line - 581 to 583&lt;br /&gt;Here we increment the interrupt count received by this CPU, this is maintained for accounting purpose. Hold the spinlock before accessing any element of the IRQ descriptor for our interrupt line. We also mask and acknowledge the interrupt to PIC using handler function of our IRQ descriptor.&lt;br /&gt;&lt;br /&gt;Line - 588 to 589&lt;br /&gt;Now we clear the IRQ_REPLAY and IRQ_WAITING flags from ou IRQ descriptor flag. As mentioned earlier this is used to maintain the status of an interrupt handling line. We clear these flags because now we are going to handle this interrupt will not be anymore in reply or waiting mode. Actually IRQ_WAITING flagis used by device drivers in conjunction with IRQ_AUTODETECT flag for auto-detecting the IRQ line to which their device is connected. Device drivers use the probe_irq_on() function, which actually sets he IRQ_AUTODETECT and IRQ_WAITING flag for all the IRQ descriptors for whome no ISR has yet been registered.After calling probe_irq_on() function, device driver instructs the device to trigger an interrupt and then calls probe_irq_off(0 function. probe_irq_off() function actually looks for those IRQ descriptors whose IRQ_AUTODETECT flag is still set butIRQ_WAITING flag has been cleared. and returns the IRQ line number to device driver.&lt;br /&gt;&lt;br /&gt;After clearing the IRQ_REPLAY and IRQ_WAITING flags in do_IRQ() function we set the IRQ_PENDING function. This is done, to indicate that we are planning to handle this interrupt if this interrupt is not disabled or not bein already handled by another CPU (in case of SMP machines). The use of setting IRQ_PENDING flag is explained in details in next few lines.&lt;br /&gt;&lt;br /&gt;As we have see the interrupt and want to handle it by calling the set of ISRs (Interrupt Serive Routines) registered by different device drivers. We set IRQ_PENDING flag because seeing an interrupt does not mean we will for sure handle it. IRQ_PENDING flag helps us in following two cases:&lt;br /&gt;&lt;br /&gt;- In case interrupt is disabled (set flag IRQ_DISABLED), we will not service the interrupt and will just keep it marked as pending (set flag IRQ_PENDING). Once the interrupt is again enabled (clear flag IRQ_DISABLED), ISRs will be called to service the interrupt. So IRQ_PENDING helps us to remember the intterupt which occured while that interrupt was disabled due to some reason.&lt;br /&gt;&lt;br /&gt;Note: Here disabling interrupt does not mean masking a particular line at PIC level or disabling all the interrupt at CPU level by clearing the IF flag of CPU EFLAG register. Disabling here means the kernel has been asked not to service the interrupt, but the hardware triggering of interrupt signal is not being stopped at all.&lt;br /&gt;&lt;br /&gt;- In case another CPU is already handling the previous interrupt requests on this IRQ line. In this case flag IRQ_INPROGRESS will already be set by that another CPU. Our role will be to just mark the interrupt as IRQ_PENDING and in away asks that other CPU to service this interrupt request also. When that CPU will finish its handling of previous interrupt, it will check this flag. Because of this flag being set by us, that CPU will again go and call all the ISRs once agian to service interrupt request we received on this IRQ line.&lt;br /&gt;&lt;br /&gt;Line - 595 to 601&lt;br /&gt;Now we check if this interrupt is not disabled (flag IRQ_DISABLED is clear) and at the same time is also not being handled by another CPU (flag IRQ_INPROGRESS is also clear), we go forward and clear the IRQ_PENDING flag and sets the IRQ_INPROGRESSflag to indicate that we take the responsibility of handling this interrupt request. Now while we are handling this interrupt request, lets sa another CPU receives an interrupt on same IRQ line, that CPU will simple mark the IRQ_PENDING flag and will transfer his responsibility to us and in that case we (CPU we are executing on) will be responsible to serve that interrupt request also.&lt;br /&gt;&lt;br /&gt;Line - 609 to 610&lt;br /&gt;If there is no registered ISR for this IRQ line, we simply return from interrupt context after releasig the lock we hold and serving the softirqs (if any pending).&lt;br /&gt;&lt;br /&gt;Line - 622 to 630&lt;br /&gt;Now we are al set to call the registered ISRs (device driver's functions), so that they can figure out which device connected to this IRQ line has actually triggered the interrupt and can serve it poperly. before calling the ISRs, we release the IRQ descriptor spinlock so that while we are executing the ISRs this spinlock can be acquired by another interrupt context, which may execute on another CPU for the same IRQ line. This interrupt context on another CPU will simply mark the IRQ_PENDING flag and return without handling the interrupt itself. In this infite loop we call the handle_IRQ_event() function which actualy calls all the ISRs registrered for this IRQ line one by one. After completing the list of ISRs, we again acquire the IRQdescriptor spinlock as we need to again check and update the flag element of IRQ descriptor. After acquiering the spinlock, we check is the IRQ_PENDING flag is clear, we break out of this infite loop, else we clear the IRQ_PENDING flag of our IRQ descriptor and again go into handle_IRQ_event() function to serve the new interrupt request as indicated by IRQ_PENDING flag.&lt;br /&gt;&lt;br /&gt;Line - 631&lt;br /&gt;Finally we come out of the above mentioned infite loop only if there is not pending request for thie IRQ line. Once we are out, we are done with the most of the part, so we clear the IRQ_INPROGRESS flag.&lt;br /&gt;&lt;br /&gt;Line - 637 to 638&lt;br /&gt;Now we call the end function of PIC related functions stored in handler element of our IRQ descriptor. This function take care of the situation where the interrupt we were handling got disabled while we were handling it. Lets sat while we were serving the interrupt by callings all the ISRs for it, the interrupt got disabled (flag IRQ_DISABLED is set) by code running onanother CPU, then in this case we should not unmask the interrupt line (which we masked by calling the PIC related ack() function, line 583). If the IRQ is not yet disabled, this function end() will simply unmask the interrupt line at PIC level and return. After this we go ahead and do serve the pending softirqs (is any marked). We will see in next section what are siftirqs. I will soon post the details of softirqs, tasklets and bottom halfs, so keep looking for that on my blog.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8801463-113800083772230935?l=whatisthekernel.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://whatisthekernel.blogspot.com/feeds/113800083772230935/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8801463&amp;postID=113800083772230935' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8801463/posts/default/113800083772230935'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8801463/posts/default/113800083772230935'/><link rel='alternate' type='text/html' href='http://whatisthekernel.blogspot.com/2006/01/interrupt-handling-internals-in-linux.html' title='Interrupt Handling Internals in Linux Kernel'/><author><name>Saravanan</name><uri>http://www.blogger.com/profile/12068489124788465950</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8801463.post-112557963463047297</id><published>2005-09-01T05:58:00.000-07:00</published><updated>2005-09-01T07:08:24.680-07:00</updated><title type='text'>Back Door Entry - Getting hold of Kernel</title><content type='html'>&lt;span style="font-family:courier new;font-size:130%;"&gt;&lt;br /&gt;Author: Gaurav Dhiman&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;span style="font-family:courier new;font-size:130%;"&gt;Email:  gaurav4lkg@gmail.com&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;This article talks about a way to break the kernel and getting hold of it for your use, in other words a way to hack a kernel. We will talk in respect to Linux Kernel. Well the same thing can be applied to other kernels as well.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;Overview:&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;---------&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;We all know about the paging mechanisum to implement the virtual memory . Virtual memory is actually implemented using page on demand concept. This means we do not load the whole program in memory in one go, rather we keep on loading the required part of program as requested. Whenever the program refers the code or data which is not in memory, system do page-faults. Page fault is an exception generated by the MMU (Memory Management Unit) of system. Whenever the page-fault exception occurs CPU starts executing the page-fault handler code pointed out by the page-fault entry of IDT (Interrupt Descriptor Table). I wont discuss details about IDT in this article, will be writting seperator article for that. In short IDT is the kernel data structure (an array of pointers to kernel functions which handle hardware interrupts and system generated exceptions). The IDT is pointed by the CPU's IDTR register, so CPU knows where the IDT in memory is, that is why whenever any hardware interrupt or exception occures, system automatically switches to the relevant code whose pointer is placed in IDT entry. Coming back to page-faults, as told earlier page-fault is an exception and can occur in system at any time (in user space as well as in kernel space).&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;Details about Page-Fault Handler:&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;---------------------------------&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;Page-fault handler handles following specific cases:&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;- When page fault occures in user space (user application code)&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt; - user programme accessed a virtual memory address out of its virtual user address space. In this case page fault handler will generate SEGV and the process will be terminated.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt; - user programme accessed a valid virtual memory address which is in user virtual address space but virtual page related to it is not in momory (page table is not set). In this case page-fault handler will swap-in the required virtual page into memory, set the required page table entry and will return back to the page faulted instruction.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;- When page fault occures in kernel space (kernel code)&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt; - Kernel access the user space virtual address and the page related to that address is not avaliable in memory. In this case related page will be swapped in and the kernel will access it.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt; - Kernel tries to access the user space virtual address, whcih does not fall in user address space. In this case, kernel should not generate SEGV, rather it should handle this situation gracefully. This is the case which we will mainly focus on in this article ahead.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;Entering a System Call:&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;-----------------------&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;Before going ahead, lets see some code related to system call invocation, it will help us in understanding this article in a better way.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;When we do any system call using the library function, CPU switches to kernel mode also CPU is set to use the kernel stack of the process. As soon the execution context enters the kernel mode, CPU jumps to the ollowing kernel function, which can be found in arch/i386/kernel/entry.S file in kernel sources&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;ENTRY(system_call)&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        pushl %eax                      # save orig_eax&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        SAVE_ALL&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        GET_CURRENT(%ebx)&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        testb $0x02,tsk_ptrace(%ebx)    # PT_TRACESYS&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        jne tracesys&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        cmpl $(NR_syscalls),%eax&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        jae badsys&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        call *SYMBOL_NAME(sys_call_table)(,%eax,4)&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        movl %eax,EAX(%esp)             # save the return value&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;    .....&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;    .....&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;    .....&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;In this function first of all processor context (state of all the registers in a processor) is saved on kernel specific stack and then GET_CURRENT macro is called to get the pointer of process descriptor (task_struct) of currently running process on the CPU. Finally the function related to the requested system call is called by picking the function pointer from sys_call_table (a global array of pointers in kernel - also known as system call table in general terms). From this point onwards the called function is responsible for serving the requested functionality to user space program.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;Handling of page-faults in kernel space:&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;----------------------------------------&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;Kernel access the user space provide address, when user space programme makes a system call to fetch/put some user data into kernel, for e.g. write/ read system calls, ioctl system call etc. In these system calls one of the parameters is the user space address. Kernel put/fetch some data from user space with the help of some special kernel function which handles the page faults gracefully, if they occur. Some of the kernel functions used to copy data from/to user space are as follows:&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;copy_to_user()&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;copy_from_user&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;get_user()&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;put_user()&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;Lets take a simple example of ioctl() system call. Use of ioctl() function in a user programme will be something like this&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;int    on = 1;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;ioctl(fd, FIONBIO, &amp;on); &lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;When ioctl system call is done, in kernel sys_ioctl() is called, which further down the line calls one of the above mentioned functions to fetch/put data in user provided buffer. Lets take an example of get_user() kernel function. This is implemented in kernel as a macro and you can find the implementation in include/asm/uaccess.h file of kernel sources.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;#define get_user(x,ptr)                                                 \&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;({      int __ret_gu,__val_gu;                                          \&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        switch(sizeof (*(ptr))) {                                       \&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        case 1:  __get_user_x(1,__ret_gu,__val_gu,ptr); break;          \&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        case 2:  __get_user_x(2,__ret_gu,__val_gu,ptr); break;          \&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        case 4:  __get_user_x(4,__ret_gu,__val_gu,ptr); break;          \&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        default: __get_user_x(X,__ret_gu,__val_gu,ptr); break;          \&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        }                                                               \&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        (x) = (__typeof__(*(ptr)))__val_gu;                             \&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        __ret_gu;                                                       \&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;})&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;In this macro "ptr" is the user provided address which can do a page fault. This macro further calls another macro "__get_user_x" depending upon the size of "ptr" pointer passed from user space to kernel. Size of this pointer tells kernel how much bytes need to be copied from/to user space accordingly the "__get_user_x" function is called.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;Implementation of "__get_user_x" macro is as follows in kernel, can be found in include/asm/uaccess.h file:&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;#define __get_user_x(size,ret,x,ptr) \&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        __asm__ __volatile__("call __get_user_" #size \&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;                :"=a" (ret),"=d" (x) \&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;                :"0" (ptr))&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;This function uses the assembly code to invoke one of the following functions written in assembly language:&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;__get_user_1()&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;__get_user_2()&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;__get_user_4()&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;We will only see the implementation of one of these functions to under stand it better. Lets analyse what "__get_user_4()" assembly function do. Implementation of this function is as follows in linux kernel, you can find its code in arch/i386/lib/getuser.S assembly file&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;.align 4&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;.globl __get_user_4&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;__get_user_4:&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        addl $3,%eax&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        movl %esp,%edx&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        jc bad_get_user&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        andl $0xffffe000,%edx&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        cmpl addr_limit(%edx),%eax&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        jae bad_get_user&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;3:      movl -3(%eax),%edx&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        xorl %eax,%eax&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        ret&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;  &lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;bad_get_user:&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        xorl %edx,%edx&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        movl $-14,%eax&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        ret&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;  &lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;.section __ex_table,"a"&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        .long 1b,bad_get_user&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        .long 2b,bad_get_user&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        .long 3b,bad_get_user&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;.previous&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;This is the actual code in kernel which actually gets the specific number of bytes (in this case 4 bytes are copied) from user space buffer to kernel buffer. In this while copying the data we might face a page-fault and as we are currently in kernel mode, we need to handle such page-faults gracefully. We will shortly discuss some kernel data structures which help in this.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;In this function following instruction actually copies 4 bytes from address poited by EAX (pointer given by user space program) to CPU's EDX register.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;3:      movl -3(%eax),%edx&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;We can face a page fault while executing this instruction,now lets see how we handle that. For this kernel maintains a two dimentional array of poiters, which is known as exception table (__ex_table). This table have number of enteries and each entry contains two elements or in other words if we literally see it as table, it have number of rows and two columns. First column contains the address of kernel instruction which can page fault while accessing the user space address (in our case it will be address of above mov instruction). Second column of this table contains the address of fix up codewhich need to be called when page fault occurs on instruction whose address is stored in first column. So this table looks like following:&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;Exception Table&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;+-----------------------------------------+&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;| page fault address 1 | fix up address 1 |&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;| page fault address 2 | fix up address 2 |&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;| page fault address 3 | fix up address 3 |&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;| page fault address 4 | fix up address 4 |&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;| page fault address 5 | fix up address 5 |&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;| page fault address 6 | fix up address 6 |&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;| page fault address 7 | fix up address 7 |&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;If we look at above assembly code of function __get_user_4(), we will find ".section" in it. This is a assembler directive whcih tells the assembler to place the following instruction in a specific section of executable. In above function we are dictating the assembler to place the address of faulting kernel instructions and there corresponding fix up codes in a __ex_table section of linux kernel binary.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;Following assembly instructions put the address of faulting instruction and there corresponding fix up codes in exception talble (__ex_table section of linux kernel image).&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;.long 1b,bad_get_user&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;.long 2b,bad_get_user&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;.long 3b,bad_get_user&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;1b means the first lable 1 in backward direction, which is actually the faulting instruction that is the mov instruction we discussed earlier. "bad_get_user" is another assembly lable which serves as the fix up code and will be executed if page fault occurs while executing the instruction at 1b, 2b or 3b instructions.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;This is all about the exception table and setting the enteries in it, but we must know how exception table is exactly used by page fault handler. All this is discussed in the following section.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;Use of Exception Table in Page Fault Handler:&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;---------------------------------------------&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;In Linux Kernel, page fault handler is the do_page_fault() function defined in arch/i386/mm/fault.c file of kernel sources.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;Lets assume that page fault occurs while copying the data from user space to kernel space. Immediately the page fault handler will be executed by CPU and page fault handler will check if the page faulting instruction falls in user space or in kernel space (this determines is the page fault occured in user space program or while executing the kernel instruction). If it occured in kernel, page fault handler (do_page_fault() function) will simple call fixup_exception() kernel function.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;if (fixup_exception(regs))&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;       return;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;Implementation of fixup_exception() function can be found in arch/i386/mm/extable.c file of kernel sources.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;int fixup_exception(struct pt_regs *regs)&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;{&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        const struct exception_table_entry *fixup;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        fixup = search_exception_tables(regs-&gt;eip);&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        if (fixup) {&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;                regs-&gt;eip = fixup-&gt;fixup;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;                return 1;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        }&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        return 0;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;}&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;fixup_exception() function looks for the faulting address (regs-&gt;eip) in the first column of exception table and if it find it, it sets the regs_eip to the address found in the second column (this is the address of fix up code). If we are not able to find the faulting address in exception table (this means that kernel access some wrong address for which kernel do not have any fixup code). In this case page fault handler must generate the OOPS (too famous in kernel world) and core dump the kernel image.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;This is all about page faulting in kernel and there handling in linux. Nows lets explore the possiblities to exploit the exception table to get an redirect the execution to our malicious code. This would be interesting and wil give you a free hand to do anything in kernel once its compromised ;-)&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;Hack kernel using exception table:&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;----------------------------------&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;As now we know what exception table is and what it contains, we can think of exploiting it for getting a back door entry into kernel. In simpler words, if we are able to replace the addresses in second column (addresss of fixup code) of exception table with our own function address, we can exceute our function just by generating a page fault in kernel and that is not too difficult (just pass a wrong address in ioctl or write/read system calls, thats it an you get control to your function). You must be thinking, it can not be that simple. Well, as now you know about page fault handler and exception table, it might seems an simple thing to you.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;Lets have some practicle linux kernel module for it, which can show us how we can expoit this option. Following Linux Kernel Module, will replace the addresses in exception table and then we can generate a page fault by a simple user program.&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;Linux Kernel Module Code:&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;-------------------------&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;#ifndef __KERNEL__&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;#define __KERNEL__&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;#endif&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;#ifndef MODULE&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;#define MODULE&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;#endif&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;#define __START___EX_TABLE 0xc0261e20&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;#define __END___EX_TABLE   0xc0264548&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;#define BAD_GET_USER       0xc022f39c&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;unsigned long start_ex_table = __START___EX_TABLE;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;unsigned long end_ex_table = __END___EX_TABLE;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;unsigned long bad_get_user = BAD_GET_USER;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;#include "linux/module.h"&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;#include "linux/kernel.h"&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;#include "linux/slab.h"&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;#ifdef FIXUP_DEBUG&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;#  define PDEBUG(fmt, args...) printk(KERN_DEBUG "[fixup] : " fmt, ##args)&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;#else&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;#  define PDEBUG(fmt, args...) do {} while(0)&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;#endif&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;MODULE_PARM(start_ex_table, "l");&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;MODULE_PARM(end_ex_table, "l");&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;MODULE_PARM(bad_get_user, "l");&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;MODULE_LICENSE("GPL");&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;struct old_ex_entry {&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        struct old_ex_entry     *next;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        unsigned long           address;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        unsigned long           insn;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        unsigned long           fixup;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;};&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;  &lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;struct old_ex_entry *ex_old_table;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;void hook(void) &lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;{&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        printk(KERN_INFO "You did a Page Fault ..... \n");&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;} &lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;void cleanup_module(void)&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;{&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        struct old_ex_entry     *entry = ex_old_table;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        struct old_ex_entry     *tmp;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        if (!entry)&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;                return;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        while (entry) {&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;                *(unsigned long *)entry-&gt;address = entry-&gt;insn;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;                *(unsigned long *)((entry-&gt;address) + sizeof(unsigned&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;long)) = entry-&gt;fixup;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;                tmp = entry-&gt;next;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;                kfree(entry);&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;                entry = tmp;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        }&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        return;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;}&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;int init_module(void)&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;{&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        unsigned long       insn = start_ex_table;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        unsigned long       fixup;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        struct old_ex_entry *entry, *last_entry;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        ex_old_table = NULL;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        PDEBUG(KERN_INFO "hook at address : %p\n", (void *)hook);&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:courier new;font-size:130%;"&gt;        for(; insn &lt;&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                fixup = insn + sizeof(unsigned long);&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                if (*(unsigned long *)fixup == BAD_GET_USER) {&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                        PDEBUG(KERN_INFO "address : %p insn: %lx fixup : %lx\n",&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                                        (void *)insn, *(unsigned long *)insn,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                                        *(unsigned long *)fixup);&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;        &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                        entry = (struct old_ex_entry *)kmalloc(GFP_ATOMIC,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                    sizeof(struct old_ex_entry));&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;        &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                        if (!entry){&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                if (ex_old_table) {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                    last_entry = ex_old_table;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                    ex_old_table = ex_old_table-&gt;next;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                    kfree(last_entry);&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                }&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                                return -1;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;            }&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                        entry-&gt;next = NULL;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                        entry-&gt;address = insn;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                        entry-&gt;insn = *(unsigned long *)insn;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                        entry-&gt;fixup = *(unsigned long *)fixup;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                        if (ex_old_table) {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                                last_entry = ex_old_table;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;        &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                            while(last_entry-&gt;next != NULL)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                                    last_entry = last_entry-&gt;next;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                                last_entry-&gt;next = entry;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                        } else&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                                ex_old_table = entry;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                        *(unsigned long *)fixup = (unsigned long)hook;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                        PDEBUG(KERN_INFO "address : %p insn: %lx fixup : %lx\n",&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                                        (void *)insn, *(unsigned long *)insn,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                                        *(unsigned long *)fixup);&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;                }&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;        }&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;        return 0;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;}&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;        &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;=====================================================&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;In above Linux Kernel Module (LKM), init_modulr function simply searches the exception table fora specific fixup function (bad_get_user() function) and whereever it finds the address of this function in exception table, it replaces it with our own function hook(). It saves the pointer to bad_get_user() function, so that we can reset the exception table to its original form while removing our kernel module.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Now a simple code which calls ioctl() with a bad argument.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;#include "stdio.h"&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;#include "sys/types.h"&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;#include "sys/stat.h"&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;#include "fcntl.h"&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;#include "unistd.h"&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;#include "errno.h"&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;#include "sys/ioctl.h"&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;int main()&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;{&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;        int     fd;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;        int     res;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;        fd = open("testfile", O_RDWR | O_CREAT, S_IRWXU);&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;        res = ioctl(fd, FIONBIO, NULL);&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;        printf("result = %d errno = %d\n", res, errno);&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;        return 0;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;}&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;=====================================================&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Now first load the LKM into system, then run the user program and see the /var/messages/log file, it will show you the string "You did a Page Fault ..... ". This string is printed by the hook() function of our module.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Now you can think what you can do with this, if in place you just printing the string in hook function, you do something important. You have the whole kerel world in front of you ;-)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;=====================================================&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Hope this article helps you in learning more about kernel. The intention of this article is not to hack the kernel, but rather to provide learning material for people who want to learn kernel programming.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;regards,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;-Gaurav&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Email: gaurav4lkg@gmail.com&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;---------------------------&lt;/span&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8801463-112557963463047297?l=whatisthekernel.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://whatisthekernel.blogspot.com/feeds/112557963463047297/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8801463&amp;postID=112557963463047297' title='13 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8801463/posts/default/112557963463047297'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8801463/posts/default/112557963463047297'/><link rel='alternate' type='text/html' href='http://whatisthekernel.blogspot.com/2005/09/back-door-entry-getting-hold-of-kernel_01.html' title='Back Door Entry - Getting hold of Kernel'/><author><name>Saravanan</name><uri>http://www.blogger.com/profile/12068489124788465950</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>13</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8801463.post-111207771762246030</id><published>2005-03-28T22:23:00.000-08:00</published><updated>2005-03-28T22:28:37.806-08:00</updated><title type='text'>Introduction to Linux Device Driver Programming</title><content type='html'>Author: Gaurav Dhiman&lt;br /&gt;Email: gauravd_linux@yahoo.com, gaurav4lkg@gmail.com&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Introduction to Linux Device Drivers:&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Linux Device Driver is actually the peace of code which very well knows the device it is controlling. It knows the behavior and has knowledge of device internals. Device Drivers in Linux can be a part of core kernel it self or it can even be developed as a separate module, which can be attached/detached from running kernel anytime, providing a flexibility in kernel to support multiple devices in dynamic environment.&lt;br /&gt;&lt;br /&gt;In this article we will talk about writing a device driver as a kernel module. Before talking about device drivers, we should have some basic knowledge of kernel module programming in Linux. In next few sections we will discuss the basics of kernel module programming before jumping to the driver intricacies.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Introduction to Kernel Module Programming:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Kernel module is a piece of code written, compiled and loaded separately from core kernel but linked to the core kernel at load time. Kernel modules have some specific structure to follow. There are two standard functions which need to be implemented in any Linux Kernel Module; we will talk about them bit later. As earlier also mentioned kernel module is a code which attaches to the core kernel at load time and is being executed in kernel or privileged mode. Its not a user program which runs in restricted mode and have limited access to memory and other system resources. Module being a part of the kernel can access any system resource. Due to this fact, kernel module programmers need to take care that their module does not create the security loopholes in kernel and must be well-disciplined. Kernel modules should only be doing what they are supposed to do. One loosely written kernel module loaded to the core kernel can make the whole system venerable.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Module structure:&lt;/span&gt;&lt;br /&gt;Module need to have two standard functions which take care of initializing and cleaning up of resources used by module. The standard format of module is as follows:&lt;br /&gt;&lt;br /&gt;#include&lt;br /&gt;#include&lt;br /&gt;&lt;br /&gt;/*standard function for initializing the resources used by module*/&lt;br /&gt;&lt;br /&gt;int init_module(void){&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;/*other function, which mplements the functionality of module*/&lt;br /&gt;&lt;br /&gt;func_1(){&lt;br /&gt;……..&lt;br /&gt;……..&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;func_2(){&lt;br /&gt;………&lt;br /&gt;………&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;func_3(){&lt;br /&gt;………&lt;br /&gt;……..&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;……..&lt;br /&gt;…….&lt;br /&gt;………&lt;br /&gt;&lt;br /&gt;/*standard function for initializing the resources used by module*/&lt;br /&gt;&lt;br /&gt;void cleanup_module(void){&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;In init_module() function, we request and acquire the resources required by our module. For e.g. resources can be an IRQ line (interrupt line on which our device is going to interrupt), I/O memory region, DMA channel etc. We also register few other things like major number used by our driver or the interrupt handlers for the interrupts generated by our device. Init_module() function is called at the time of loading a module to core kernel, using “insmod” or “modprobe” command.&lt;br /&gt;&lt;br /&gt;In cleanup_module() function, we release the resources acquired in init_module() function. This function is called at the time uloading a module from core kernel, using “rmmod” command.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Presenting real time Kernel Module:&lt;/span&gt;&lt;br /&gt;Let’s follow the tradition and write the first kernel module “hello_world.c”&lt;br /&gt;&lt;br /&gt;#define __KERNEL__&lt;br /&gt;#define MODULE&lt;br /&gt;&lt;br /&gt;#include&lt;br /&gt;#include&lt;br /&gt;&lt;br /&gt;int init_module(void){&lt;br /&gt;printk(&lt;1&gt; “My Module: Hello World !! \n”);&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;void cleanup_module(void ){&lt;br /&gt;printk(&lt;1&gt; “My Module: Bye World ….. I am going !! \n”);&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;We need to define above mentioned two macros (__KERNEL and MODULE) for any module we write. printk() function is a brother of printf() function. The major difference between them is that printf() is a standard C library function which resides in user space and printk() is kernel function written from scratch.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;&lt;br /&gt;Now coming the real thing – Device Drivers&lt;/span&gt;&lt;br /&gt;Before going to the real code of device driver, let’s first understand the some basic fundamentals of what devices are in Linux. We will also discuss about how they are accessed by user processes.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Device Files and Major Numbers:&lt;/span&gt;&lt;br /&gt;In Linux or and other Unix variant, devices re represented by files in file system. These files are known as device files or even nodes. Device files are special files through with the user process (program) can communicate (open, read, write, close etc) with the underlining device. Process uses the standard system calls, like open(), read(), write(), ioctl() etc to interact with device. With every device file a two special number are associated, which are known as a major number and minor number of a device. Major number is used by kernel to direct the user process request to right kernel driver and minor number is useless for kernel. Minor number is only used by the driver to identify the exact device which needs to be manipulated. The reason of existence of minor number is that, in practical scenario one driver can handle or control more than one device of same type or even of different types as well, so driver needs some mechanism through which it can identify which device it need to manipulate. Lets take an example, if there is a driver ”A”, which actually controls three physical devices “B”, “C” and “D”, then there need to be three device files in file system with same major number but different minor numbers.&lt;br /&gt;&lt;br /&gt;Device files can be created with “mknod” command. It has the following syntax.&lt;br /&gt;&lt;br /&gt;mknod {device name} {device type} {major number} {minor number}&lt;br /&gt;&lt;br /&gt;For help on this command, refer to man pages of it.&lt;br /&gt;It is a convention that device files reside in “/dev” directory of root file system, so it’s always better to create our own device files there rather than creating them somewhere else in file system.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Device Types:&lt;/span&gt;&lt;br /&gt;Devices can mainly be categorized in three groups: character devices, block devices and the network devices.&lt;br /&gt;&lt;br /&gt;Character Devices: These are the devices to which a user process can write or read a single character at a time. That means the interaction between device driver and actual physical device is in terms of single character. Example: keyboard, serial ports, parallel ports.&lt;br /&gt;&lt;br /&gt;Block Devices: These are the devices to which a user process can write or read a single block of data at a time. Reading and writing on these devices are done in terms of block data. A block can be of 512 bytes, 1024 bytes or so. Example: Hard Disk, Floppy Disk.&lt;br /&gt;&lt;br /&gt;Network Devices: These are asynchronous devices and are responsible for establishing a network connection to outside world. Best example of this type of device can be NIC card.&lt;br /&gt;&lt;br /&gt;For not complicating things, this article will only talk about the character device drivers.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Opening a Device:&lt;br /&gt;As mentioned earlier also that user process can communicate with the underlining device through device file, so for interacting with device, user process should first open the related device file, using open() system call. Once the device file is opened, user process receives the file descriptor, which it can refer in further file manipulation file system calls.&lt;br /&gt;&lt;br /&gt;Now we will see how things work in kernel when a device file is opened by a user process. Before discussing it, let’s discuss about some related data structures.&lt;br /&gt;&lt;br /&gt;task_struct: This data structure represents the process or task in kernel. Kernel uses this data structure to keep track of process in system and resources they are using. It is one of the main data structures in kernel and contains number of elements to track process specific information.&lt;br /&gt;&lt;br /&gt;files_struct: This structure contains the information related to the open files per process. It keeps information to track the open files of a process. One of the elements of this structure is an array of pointers to “file” structure. This array points to different “file” structures and the index of this array is returned to process as a file descriptor when open system call is made. It also keep a count of total number of open files for a process.&lt;br /&gt;&lt;br /&gt;file: This structure represents an open file for a process. Do not confuse it with physical file, which is represented by “inode” structure in kernel. This structure only remains in kernel memory till the file is open for a process. As soon as the process closes, exits or aborts, all the “file” structures (representing open files for that process) are destroyed, if those are not anymore pointed by any other process.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Some of the elements of “file” structure:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;- f_mode: This element tells in what mode the file has been opened by the process. Process can open a file in either read (FMODE_READ) or write (FMODE_WRITE) mode. This element is normally used by device drivers to check in what mode the device file has been opened.&lt;br /&gt;&lt;br /&gt;- f_pos: This element tells the offset (in bytes) from where to read and write in file.&lt;br /&gt;- f_flags: This element also tells us in which mode the file has been opened (read or write), but it’s always recommendable to use “f_mode” element to check the mode of file. This element remembers one important thing, which might be helpful to driver writers and that is if the file has been opened in non-blocking mode or not. By default (if O_NONBLOCK flag is not mentioned at opening time) the file is opened in blocking mode. Driver checks this flag at the time of reading or writing a device. If the device file is opened in blocking mode (default mode) and at read time there is no data to be read from device or at write time driver specific buffer is full, driver puts the process to sleep on one of its local wait queues (we will soon see what wait queue are). But on other hand if the device file has been opened in non-blocking mode then the driver does not put the process to sleep, rather control returns back to user process with error.&lt;br /&gt;- f_count: This element keeps track of how many processes are referring to this instance of file. As we know that all files of parent process are inherited by child process if file does not have close_on_exec element set. If the child process inherits the files from parent process, the “f_count” element of all inherited files is incremented, so that kernel can keep track of number of process this file structure. “file” structure (representing an open file) does not get destroyed on all close system calls, as it might be shared by other process also. During close system call kernel checks the “f_count” element of “file” structure and if it is zero then only “file” structure is released and its memory Is released.&lt;br /&gt;- f_owner: This element tells which process is the owner of this open file. It contains the pid of the owner process. This element is used for sending the SIGIO signal to the right process in case of asynchronous notification. User process can change this element by using fcntl() system call.&lt;br /&gt;- f_op: This is an important element from the perspective of device driver. This element actually points to the structure of pointers to file operations. For a device file (represented by “file” structure), this element points to the structure, which further contains pointers to driver specific functions. We will discuss in detail, the structure (file_operations) to which this element points.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;- file_operations: This is an important data structure for device driver, as this is the structure through which driver registers its functions with kernel, so that kernel can call them on different events, like opening a device file, reading/writing a device file or sending ioctl commands to device. In case of device file, this structure contains pointer to different functions of driver, through which kernel invokes the driver. Now we will briefly discuss the elements of this data structure.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Some of the elements of “file_operations” structure&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;- llseek: This is a pointer to driver function, which actually moves or sets the “f_pos” element of device file (discussed earlier).&lt;br /&gt;- read: This is a pointer to driver function, which actually physically reads data from device.&lt;br /&gt;- write: This is a pointer to driver function, which actually physically writes data to device.&lt;br /&gt;- poll: This function is called by either “poll” or “select” system calls.&lt;br /&gt;- ioctl: This function is called by ioctl system call. This function in drier is used to pass on the special commands to device, format the device or setting the read/write head of device, which are different from normal read / write commands.&lt;br /&gt;- mmap: This function of driver is used to map the device memory area to process virtual address space.&lt;br /&gt;- open: This function is used to open a file. Incase of device file, this function of driver initialize the device or initializes other book keeping data structures.&lt;br /&gt;- flush: This function flushes the driver buffer to physical file. This should be implemented in driver if driver wants to provide a facility to application to make sure that all the data is physically put on device.&lt;br /&gt;- release: This function is called by close system call, but it is not called for every close system call. As described earlier also that one “file” data structure (which represents an open file in kernel) can be referred by more than one process, if processes are sharing the file (best example is FIFO files), in that case close system file does not release the “file” data structure and only decrement the “f_count” element of “file” data structure. If after decrementing “f_count” element turns to be zero, close system call, calls the associated “release” function, which is a driver function in case of device files. So “release” function of driver should clear and release all the memory acquired. “release” is just an opposite of “open” function.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Few Important Kernel Mechanisms used in Drivers&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Wait Queues:&lt;br /&gt;Wait queue is a mechanism in kernel through which the kernel code can put the process to sleep. This is used in different parts of kernel where the kernel decides to put the process to sleep. Kernel puts the process to sleep in case the required event has not yet occurred (for e.g. some process wants to read from device and there is no data to be read), in this case kernel puts the process to sleep and gets back the processor from it by calling the schedule() function, which is a scheduler in Linux Kernel. schedule() function schedules and dispatches the other process.&lt;br /&gt;&lt;br /&gt;Before discussing the function related to sleeping a process, we should look what data structures are used for implementing a wait queue in kernel.&lt;br /&gt;&lt;br /&gt;Wait queue is actually a linked list of “wait_queue_t” type of structures. The head of wait queue is represented by “wait_queue_head_t” structure, which contains the spin lock to synchronize the access to wait queue. “wait_queue_head_t” structure also contains the pointer to the first element in wait queue. Each element in the wait queue is represented by “wait_queue_t” structure, which contains the pointer to the “task_struct” type of structure. It also contains the pointer to next element in the wait queue. “task_struct” represents the alive process in kernel. So with this mechanism of wait queue driver or any kernel part can keep track of process waiting for a specific event to occur.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Putting process to sleep:&lt;/span&gt;&lt;br /&gt;Process can be put to sleep by using any of the following kernel functions. You can call these functions from anywhere in the kernel (drivers, modules or the core kernel) in case you want to put your process to sleep. Whenever a kernel code is executed (when system call is made by the user process), kernel code executes in the context of process which has made a system call. But there is exception to this rule, whenever the interrupt occurs the kernel code (interrupt handler) does not execute in process context, it’s a anonymous context. This is the reason that we should be careful to not to call any function in interrupt handler which can put the execution thread to sleep. If we do so the kernel will hang, that means the system will hang.&lt;br /&gt;&lt;br /&gt;Functions which can put a process to sleep:&lt;br /&gt;- sleep_on(wait_queue_head_t * wait_queue)&lt;br /&gt;- interruptible_sleep_on(wait_queue_head_t * wait_queue)&lt;br /&gt;- sleep_on_timeout(wait_queue_head_t * wait_queue, long timeout)&lt;br /&gt;- interruptible_sleep_on_timeout(wait_queue_head_t * wait_queue, long timeout)&lt;br /&gt;&lt;br /&gt;In above functions, “wait_queue” is the wait_queue_head and “timeout” is the value mentioned in terms of jiffies. We will talk about jiffies very soon. Now we will see the difference between above mentioned functions.&lt;br /&gt;&lt;br /&gt;- sleep_on: This function puts the process to sleep in TASK_UNINTERRPTIBLE mode, which means the process will not be waked up in case process receives any signal while it was in sleep. The process will only be waked up any other part of kernel code wakes it up (normally on the occurrence of some event) deliberately by calling any of the waking function (we will be discussing the waking up functions very soon). Process put to sleep with this function can sometimes cause some problem. For e.g. if a process is put to sleep with this function and the event on which it need to be waked up does not occur then your process will not come back to the execution stage. That process can not even be killed by sending a KILL signal, as process in sleep in TASK_UNINTERRUPTIBLE mode ignores all signals. Process put to sleep with this function can be waked if any of the following conditions occur:&lt;br /&gt;&lt;br /&gt;o Process is deliberately waked up by some part of the kernel code on the occurrence of event for which it was waiting&lt;br /&gt;&lt;br /&gt;- interruptible_sleep_on: This function in kernel is written to avoid the problem caused by “sleep_on” function. This function puts the process to sleep in TASK_INTERRUPTIBLE mode. When a process sleeps in this mode, it can be waked up if any of the following condition occurs:&lt;br /&gt;&lt;br /&gt;o Process receives the signal either from any other process or kernel itself.&lt;br /&gt;o Process is deliberately waked up by some part of the kernel code on the occurrence of event for which it was waiting.&lt;br /&gt;&lt;br /&gt;- sleep_on_timeout: This function is similar to “sleep_on” function but is not that much dangerous as “sleep_on”. Process put to sleep with this function can be waked if any of the following conditions occurs:&lt;br /&gt;&lt;br /&gt;o Time mentioned in the timeout parameter has expired&lt;br /&gt;o Process is deliberately waked up by some part of the kernel code on the occurrence of event for which it was waiting.&lt;br /&gt;&lt;br /&gt;- interruptible_sleep_on_timeout: I hope by now you can easily guess what this function does. Well the process put to sleep with this function is waked up when any of the following conditions occurs:&lt;br /&gt;&lt;br /&gt;o Process receives the signal either from any other process or kernel itself.&lt;br /&gt;o Time mentioned in the timeout parameter has expired&lt;br /&gt;o Process is deliberately waked up by some part of the kernel code on the occurrence of event for which it was waiting.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Waking up a process:&lt;/span&gt;&lt;br /&gt;Process put to sleep should also be waked up by some kernel code else the process will never return to the execution state. If your driver is putting the process to sleep, it’s the responsibility of that driver itself to wake up the sleeping processes when the required event occurs for which those processes are waiting. For e.g. if your driver put the reading process to sleep on its internal waiting queue, if there is nothing to read from driver buffer (driver buffer empty) then the process put to sleep should also be waked up whoever new data arrives in driver buffer (this will occur when device interrupts, so interrupt handler will be responsible for waking up the process sleeping on the driver’s waiting queue).&lt;br /&gt;&lt;br /&gt;Functions which can be explicitly called to wake up the process:&lt;br /&gt;- wake_up(wait_queue_head_t * wait_queue)&lt;br /&gt;- wake_up_interruptible(wait_queue_head_t * wait_queue)&lt;br /&gt;- wake_up_sync(wait_queue_head_t * wait_queue)&lt;br /&gt;- wake_up_interruptible_sync(wait_queue_head_t * wait_queue)&lt;br /&gt;&lt;br /&gt;__________________________________________________________________&lt;br /&gt;&lt;br /&gt;That’s it for this time. In next article I will cover ‘ioctl’ interface of devices and different timing mechanisms available in Linux Kernel.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8801463-111207771762246030?l=whatisthekernel.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://whatisthekernel.blogspot.com/feeds/111207771762246030/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8801463&amp;postID=111207771762246030' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8801463/posts/default/111207771762246030'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8801463/posts/default/111207771762246030'/><link rel='alternate' type='text/html' href='http://whatisthekernel.blogspot.com/2005/03/introduction-to-linux-device-driver.html' title='Introduction to Linux Device Driver Programming'/><author><name>Saravanan</name><uri>http://www.blogger.com/profile/12068489124788465950</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8801463.post-111198608101619947</id><published>2005-03-27T20:50:00.000-08:00</published><updated>2005-03-29T21:34:28.680-08:00</updated><title type='text'>2.6.x O(1) scheduler</title><content type='html'>&lt;span style="font-weight:bold;"&gt;1.Linux 2.6.x Scheduler&lt;/span&gt;&lt;br /&gt;Author: Saravanan S&lt;br /&gt;e-mail: sar.kernel@gmail.com&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;1.0 Scheduler - Introduction&lt;/span&gt;&lt;br /&gt;-------------------------------------------------------------------&lt;br /&gt;Scheduler is the basis of a multitasking operating system. Multitasking kernels allow many processes to execute at the same time and each process runs under a impression that it is the only process on the system. It is the part of the kernel code which chooses which process/thread to run next. It decides how to allocate CPU cycles to processes/threads to acheive "good performance". Good performance means many things according to the needs of a system. A scheduler tries to optimise the following&lt;br /&gt;&lt;br /&gt;-CPU utilization: Fraction of time CPU is in use &lt;br /&gt;-Throughput: average number of jobs completed per time unit&lt;br /&gt;-Turnaround Time: average time between job submission and completion   &lt;br /&gt;-Waiting Time: average amount of time a process is ready but waiting&lt;br /&gt;-Response Time: time until the system responds to a command   &lt;br /&gt;-Response Ratio: (Turnaround Time)/(Execution Time) -- long jobs should wait longer&lt;br /&gt;&lt;br /&gt;Different systems require different things to be optimised. Thats the reason why "good performance" means different things. For a High performance computer system likes to have high efficiency(throughput) whereas a destop sytem likes to have a high response time.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;1.1 CPU bound Vs IO bound Processes&lt;/span&gt;&lt;br /&gt;--------------------------------------------------------------------------&lt;br /&gt;IO bound processes are those processes which spend most of the time waiting for the user's response(or waiting for IO)eg:Text editor. CPU bound processes are those processes which spend most of their hogging the CPU eg" Matrix Multiplication,Sequnecing DNA. It is not always clear whether a process is IO bound or CPU bound.It is detetermined when the processes run whether it is IO bound or CPU bound. If the process spends most of its time in waiting then it is a IO bound process. A scheduler gives higher dynamic priority to IO bound processes for the following reasons&lt;br /&gt;&lt;br /&gt;1. Human interactive processes get scheduled immediately on an input from the user&lt;br /&gt;2. Also as IO operations usually take a long time it is good to get them started soon,so that the CPU can work on CPU bound processes.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;1.2 Linux 2.6.x Scheduler challenges&lt;/span&gt;&lt;br /&gt;---------------------------------------------------------------------------&lt;br /&gt;&lt;br /&gt;1.Boost the priority of IO bound processes thereby increasing the interactiveness of the system&lt;br /&gt;2.Fairness to all processes ( no process should starve)&lt;br /&gt;3.SMP scheduling - It should be able to schedule processes acrosss many processes&lt;br /&gt;4.SMT scheduling - Scheduling multiple threads on a single Multithreading chip. eg: Intel's Hyper Threading Technology&lt;br /&gt;5.NUMA scheduling - Run a single system image across many nodes.&lt;br /&gt;6.Real time scheduling - Linux supports soft real time scheduling&lt;br /&gt;&lt;br /&gt;The linux 2.6.x scheduler achieves all the above scheduling perspectives in O(1)(constant) time. &lt;br /&gt;&lt;br /&gt;The linux 2.6.x scheduler is a priority based round robin premptive scheduler&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;1.3 Design of the 2.6.x scheduler&lt;/span&gt;&lt;br /&gt;-------------------------------------------------------------------------&lt;br /&gt;1. 140 different priorities to processes( 1 - 100 for real time processes, 101-140 for normal tasks)&lt;br /&gt;2. Real time tasks - two policies&lt;br /&gt; i) SCHED_FIFO - They are scheduled and run as long as they want to&lt;br /&gt; ii)SCHED_RR - Similaar to SCHED_FIFO but have timeslices, within in a certain priority they are scheduled in a round robin fashion. SCHED_FIFO tasks can prempt them.&lt;br /&gt;3. Normal tasks&lt;br /&gt; i)Initially prio= MAX_RT_PRIO + NICE + 20&lt;br /&gt; ii)Assigned a time slice according to the static priority, within the same priority executed in a round robin fashion&lt;br /&gt; iii)Ensures dynamic priority + fairness&lt;br /&gt; &lt;br /&gt;Lets looks at the 2.6.x scheduler algorithm&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;1.4 Data Structures related to the 2.6.x scheduler&lt;/span&gt;&lt;br /&gt;------------------------------------------------------------------------&lt;br /&gt;1. runqueues - Every processor has a runqueue. It keeps track of all the runnable processes in a CPU.&lt;br /&gt;2. prio_array - There are two priority arrays per runqueue- One active array which contains the tasks whose timeslice is left and another expired array which contains the tasks whose timeslice has expired&lt;br /&gt;&lt;br /&gt;The above 2 data structures form the basis of the O(1) scheduler. They are contained in the kernel/sched.c file as below&lt;br /&gt;&lt;br /&gt;struct runqueue {&lt;br /&gt;        spinlock_t lock;&lt;br /&gt;&lt;br /&gt;        /*&lt;br /&gt;         * nr_running and cpu_load should be in the same cacheline because&lt;br /&gt;         * remote CPUs use both these fields when doing load calculation.&lt;br /&gt;         */&lt;br /&gt;        unsigned long nr_running;/* number of runnable tasks */&lt;br /&gt;#ifdef CONFIG_SMP&lt;br /&gt;        unsigned long cpu_load;/* this CPU's load - average no of processes on the CPU per unit of time*/&lt;br /&gt;#endif&lt;br /&gt;       unsigned long long nr_switches; /* number of context switches */&lt;br /&gt;       unsigned long expired_timestamp, nr_uninterruptible; /* time of last array swap and number of&lt;br /&gt;                                                          uninterruptible processes in queue */&lt;br /&gt;       unsigned long long timestamp_last_tick; /* timestamp of last scheduler tick */&lt;br /&gt;       task_t *curr, *idle; /* this processors current and idle task */&lt;br /&gt;       struct mm_struct *prev_mm; /* the last running task's mm_struct */&lt;br /&gt;       prio_array_t *active, *expired, arrays[2]; /* the active and expired prio_arrays */&lt;br /&gt;       int best_expired_prio; /* highest priority that exists in the expired prio_array */&lt;br /&gt;       atomic_t nr_iowait; /* number of tasks in the queue waiting on i/o */&lt;br /&gt;&lt;br /&gt;#ifdef CONFIG_SMP&lt;br /&gt;        struct sched_domain *sd;/* in SMP systems there can be different scheduler domains */&lt;br /&gt;&lt;br /&gt;        /* For active balancing */&lt;br /&gt;        int active_balance;&lt;br /&gt;        int push_cpu;&lt;br /&gt;/* this migration thread for the processor that this runqueue belongs to */&lt;br /&gt;        task_t *migration_thread;&lt;br /&gt;        struct list_head migration_queue;&lt;br /&gt;#endif&lt;br /&gt;}&lt;br /&gt; &lt;br /&gt;&lt;br /&gt;struct prio_array {&lt;br /&gt;        unsigned int nr_active;/* number of runnable tasks in this prio_array */&lt;br /&gt;        unsigned long bitmap[BITMAP_SIZE];/* bitmap showing which priority levels contain tasks */&lt;br /&gt;        struct list_head queue[MAX_PRIO];/* a list of array heads, one for each priority on the system */&lt;br /&gt;};&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;1.5 A simple description of the 2.6.x O(1) algorithm&lt;/span&gt;&lt;br /&gt;--------------------------------------------------------------------------&lt;br /&gt;&lt;br /&gt;The prio_array data structure is extremely important as it is what allows the Linux scheduling algorithm to perform in O(1) time.&lt;br /&gt;The basic structure in the Linux scheduler is the runqueue, defined above.There is one runqueue per processor, and within that runqueue there are two structures of type prio_array. One is for tasks that have not used up their timeslice yet, the other is for tasks that have used up their timeslice. The former are considered active, the latter expired. Note that active and expired has nothing to do with whether or not a task is runnable - active simply means that since the last time timeslices were allocated, a given task in that queue has not used up its timeslice. A task in the active list still has time available on the CPU, tasks in the expired list have used up their timeslice.&lt;br /&gt;The nr_active value stores the number of runnable tasks in the prio_array. The bitmap is a string of bits, one for each priority level on the system (140 by default), that indicates whether or not there are any tasks in the prio_array at a given priority level. The queue value is an array of pointers to arrays that store all tasks at a given priority level.&lt;br /&gt;So if there is only one runnable task in the prio_array, nr_active will be equal to one. If that task is not RT, and it has a nice value of 20, there will be a one in the 119th position of the bitmap to indicate that there is a task in the prio_array at that priority level. The queue array would have a pointer at the 119th position pointing to an array of length 1, its single element being the task in question. This is very useful because in order to determine the next task to run, the scheduler simply&lt;br /&gt;1)  looks to see if there are any runnable tasks in its active prio_array (i.e. is nr_active &gt; 0)&lt;br /&gt;2)  if so, go to step 3 otherwise go to step 6&lt;br /&gt;3)  find the first 1 in the active prio_array's bitmap. There must be a 1 somewhere since we know that there is a task in the prio_array and it must have a priority level.&lt;br /&gt;4)  run the first task in the array at the position in the prio_array's queue equal to the first 1 found in the bitmap.&lt;br /&gt;5)  when the task is done running for some reason,  recalculate its new timeslice and put it in the expired prio_array. decrement nr_active in the active prio_array, and increment it in the expired prio_array. if the task was the last task at a given priority,clear the priority's bit in the active prio_array and make sure the priority's bit  is set in the expired prio_array. repeat steps 1-4 until no tasks exist in the active prio_array.&lt;br /&gt;6)  when no tasks exist in the active prio_array, swap the active and inactive prio_arrays and start over again. since timeslices are recalculated for each process when it is put onto the expired array, the swap of prio_arrays is fast (i.e. no&lt;br /&gt;sitting around recalculating a timeslice for every task)&lt;br /&gt;  &lt;br /&gt;&lt;span style="font-weight:bold;"&gt;1.6 Dynamic priorities&lt;/span&gt;&lt;br /&gt;-------------------------------------------------------------&lt;br /&gt;In order to boost IO bound processes, linux uses a heuristic to determine which processes are interactive and to boost their priority.Tasks that are IO bound tend to sleep on IO, so the metric that linux uses is based on this. The metric is sleep_avg whose value is increased depending on the amount of time it spends in sleeping and decreased depending on the amount of time that it spends in the CPU.It is fairly accurate, since it depends not only on the amount of time that the task spends in sleeping but also on the amount of time that the task runs in the CPU. So CPU bound processes which occasionaly sleep for a long time will not get a priority boost and also IO processes which wait on an IO and the suddenly tend to hog the CPU will also not get a priority boost. In summary depending on the sleep_avg IO bound processes get a maximum prio bonus of -5 and CPU bound processes get a penalty of +5.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;1.7 Interactive credits&lt;/span&gt;&lt;br /&gt;---------------------------------------------------------------&lt;br /&gt;&lt;br /&gt;They help to control the rise and fall of the interactive status of tasks.Tasks get an interactive credit when they sleep for a larger number of times and lose an credit if they hog the CPU for long time.&lt;br /&gt;High Interactive task &gt; 100 interactive credits&lt;br /&gt;Low Interactive task  &lt; -100 interactive credits&lt;br /&gt;&lt;br /&gt;A CPU hog which waits occasionaly on IO for a long time must not get an sleep_avg increase as it is a CPU hog which waits on IO for once.So they are given low interactivity credit. Similarly an high interactivity task must not get less run time since it hogged the CPU for longer once.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;1.8 Main scheduler functions&lt;/span&gt;&lt;br /&gt;---------------------------------------------------------------------&lt;br /&gt;A discussion of the important scheduler functions in sched,c follows&lt;br /&gt;&lt;br /&gt;schedule&lt;br /&gt;--------&lt;br /&gt;It starts by disabling preemption and checks if it called of kernel premption,then uniterruptible tasks are removed from runqueue and interruptible tasks with a signal sent becomes runnable. Then a check is made whether there are tasks in the runqueue, if there are no tasks an attempt to rebalance is made.It balancing does not bring in any runnable tasks then the idle task is run.It there are runnable tasks but not in the active array,then the active and the expired array are swappped.At this point there is a runnable task in the active array,then check the bitmap to find the first bit that is set and thereby the highest priority level runnable task.At this a chance is given for dependant sleepers to run(which happens only in a SMT system). The cuurent CPU runs the idle task so that dependant sleeper wakes up and runs.If there is no switch to idle task at this point a check is being made to check whether the task chosen to run next is an RT task, if it is not an RT task and is woken up, then it is given a higher sleep_avg( a bonus for interactive processes). Then the task switch is made if previous task is not equal to next.&lt;br /&gt;&lt;br /&gt;Invocation&lt;br /&gt;The schedule function is invoked in the following cases&lt;br /&gt;1.When a task gives up the CPU voluntarily(sched_yield syscall)&lt;br /&gt;2.schedule_tick function sets the TIF_NEED_RESCHED flag on a task because it has run out of time. The scheduler_tick is executed on every timer interrrupt.&lt;br /&gt;3.When preemption is enabled. Premption can occur as below&lt;br /&gt; i) User Preemption&lt;br /&gt;    .When returning to user space from a system call&lt;br /&gt;    .When returning to user space from a interrupt handler&lt;br /&gt; ii)Kernel Preemption&lt;br /&gt;    .When returning to kernel space from an interrupt handler&lt;br /&gt;    .When kernel code becomes preemptible again&lt;br /&gt;    .If a task in the kernel explicitly calls schedule()&lt;br /&gt;    .If a task in the kernel blocks&lt;br /&gt;    In all the above cases except for the third one a check on the preempt_count is made, if preempt_count is not zero a lock is held by the current process then scheduler is not invoked.&lt;br /&gt; &lt;br /&gt;&lt;br /&gt;effective_prio function&lt;br /&gt;-----------------------&lt;br /&gt;&lt;br /&gt;The effective prio function calcualtes a tasks dynamic priority. sleep_avg is the heuristic used to calculate the dynamic priority of a process.The function checks if the task is a RT task, if so it returns as there is no dynamic priority for a real time task.If the task is not a real time task, the function calculates the dynamic priority of the task as follows&lt;br /&gt; bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;&lt;br /&gt;        prio = p-&gt;static_prio - bonus;&lt;br /&gt;        if (prio &lt; MAX_RT_PRIO)&lt;br /&gt;                prio = MAX_RT_PRIO;&lt;br /&gt;        if (prio &gt; MAX_PRIO-1)&lt;br /&gt;                prio = MAX_PRIO-1;&lt;br /&gt;        return prio;&lt;br /&gt;The current bonus macro maps the sleep_avg on to the range 0-MAX_BONUS(which is 10). So the bonus ranges from -5 to +5 as discussed above which is then added to the dynamic priority(prio). Finally the function after the calculation of the dynamic priority checks whether the prioroty falls with in the non RT range. If so it adjusts the prio to fall within the range.&lt;br /&gt;&lt;br /&gt;Invocation:&lt;br /&gt;1.Called by recalc_task_prio() which is called by the main scheduled function when a task is inserted in to the expired array and also when tasks are activated i.e, when a task is moved in to the runqueue.&lt;br /&gt;2.The thread and process wakeup calls.  &lt;br /&gt;3.scheduler_tick() which runs during every timer interrupt.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;task_timeslice function&lt;br /&gt;-----------------------&lt;br /&gt;&lt;br /&gt;The tasks timeslice is completely based on static priority.The higher the task's static priority the larger the task's timeslice. Essentialy the function calculates the timeslice as minimum timeslice plus the task's static priority scaled on to the maximum possible range.&lt;br /&gt;&lt;br /&gt;Invocation&lt;br /&gt;It is the interface used by the schduler to know the timeslice of a process&lt;br /&gt;&lt;br /&gt;load_balance function&lt;br /&gt;---------------------&lt;br /&gt;In a SMP system balancing among the the runqueues of different processors is done using load_balance function.load_balance looks for the busiest group in the domain.Before proceeding further let us know what is a scheduler domain.Scheduler domains comes in to picture in case of SMP and NUMA systems. In a NUMA system, a parent scheduling domain contains all the CPUs in the system(CPU's of all nodes), the child scheduler domain contains all the CPU's in each node, and each chis has one group per physical CPU(group contains the virtual SMT CPUs).The load_balance function tries to find the busiest group(group for an SMT) in the domain,if there is no busiest group it exits.Then it identifies the busiest runqueue within the group(a runqueue is busy if it has more than 25% processes than the current runqueue).After identifying the busiest runqueue it tries to pull tasks in to the current runqueue by identifying high priority tasks in the expired array(expired so as to maintain cache hotness).&lt;br /&gt;&lt;br /&gt;Invocation&lt;br /&gt;1.rebalance_tick()- run on every timer tick,checks whether load balancing was not done for the time interval balance_inetrval before invoking load_balance.&lt;br /&gt;2.schedule() - invoked when there is not runnable processes in the current runqueue.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;1.9 Scheduler tuning&lt;/span&gt;&lt;br /&gt;-----------------------------------------------------------&lt;br /&gt;Some of tha parameters in sched.c than can be used for scheduler tuning are as follows&lt;br /&gt;1.MIN_TIMESLICE and MAX_TIMESLICE - determines the default timeslice. Increased timeslices improve efficieny as context switches decrease but response time will be slow.&lt;br /&gt;2.PRIO_BONUS_RATIO - This is parameter which is used to calculate how much penalty or bonus must be given if the task is CPU bound or IO bound. It essentially controls to which degree the static_prio is effective.If it is high static_prio is less effective and dynamice priority becomes more effective, thereby IO bound processes are boosted further.&lt;br /&gt;3.MAX_SLEEP_AVG - Increasing this value means tasks have to sleep longer to be considered interactive.Effiency might improve as there might be fewer increases in dynamic priority.&lt;br /&gt;4.STARVATION_LIMIT - This value controls fairness, if a task has not run for this limit then interactive tasks stop running for this task to run. Decreasing the value increases fairness but at the same time interactivity decreases.Increasing the value improves interactivity at the expense of non interactive tasks.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;1.10 Future of 2.6.x scheduler&lt;/span&gt;&lt;br /&gt;---------------------------------------------------------------------&lt;br /&gt;&lt;br /&gt;1.The 2.6.x scheduler is a generic scheduler which tries to perform well in server market(efficiency important),desktop market(response time important).There are considerations to have the schedulers designed specifically for each of these systems separately in the kernel and provide may be a proc interface to select the scheduler depending on the system.Ingo had rejections to such a kind of a idea saying that it will decrease the amount of work that will be put in to a generice scheduler which performs in all conditions.&lt;br /&gt;2. Shared runqueues - SMT support in the kernel is not perfect, for example if there are idle virtual processors the balance will be done in these virtual CPUs instead of on a differnt physical CPU. The solution out forward by Robert M love is to have shared runqueues, ie.., runqueues will be shared in the case of SMT for virtual CPUs.&lt;br /&gt;&lt;br /&gt;References&lt;br /&gt;----------&lt;br /&gt;1.John aas's documentation of the 2.6.8.1 scheduler&lt;br /&gt;josh.trancesoftware.com/linux/linux_cpu_scheduler.pdf&lt;br /&gt;2.Robert Love's Linux kernel development - First edition&lt;br /&gt;The scheduler chapter of the first edition is available online as a sample chapter&lt;br /&gt;http://www.samspublishing.com/articles/article.asp?p=101760&lt;br /&gt;3.Kernel source( kernel/sched.c file and include/linux/sched.h files)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8801463-111198608101619947?l=whatisthekernel.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://whatisthekernel.blogspot.com/feeds/111198608101619947/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8801463&amp;postID=111198608101619947' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8801463/posts/default/111198608101619947'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8801463/posts/default/111198608101619947'/><link rel='alternate' type='text/html' href='http://whatisthekernel.blogspot.com/2005/03/26x-o1-scheduler.html' title='2.6.x O(1) scheduler'/><author><name>Saravanan</name><uri>http://www.blogger.com/profile/12068489124788465950</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8801463.post-110534417501432568</id><published>2005-01-09T23:59:00.000-08:00</published><updated>2005-01-10T00:02:55.013-08:00</updated><title type='text'>Some fun with Linux Netfilter Hooks</title><content type='html'>|=------------=[ Some Tricks with Linux netfilter hooks ]=---------------=|&lt;br /&gt;|=-----------------------------------------------------------------------=|&lt;br /&gt;&lt;br /&gt;This document is based on my understanding of bioforge's "Hacking the Linux Kernel Network Stack" article in Phrack issue 61. Please do correct me if I am wrong somewhere.The sourcecodes provided here is compiled and tested in 2.4.28 kernel.&lt;br /&gt;&lt;br /&gt;Netfilter  is  a   subsystem  in the Linux 2.4  kernel. Netfilter provides an generic and abstract interface to the standard routing code.This is currently used in Linux kernel for packet filtering,mangling,NAT(network address  translation) and queuing packets to the userspace.Netfilter makes connection tracking possible through the use of various hooks in the kernel's network code. These hooks are places that kernel code, either statically  built  or  in  the form  of a loadable  module,  can register functions to be called for specific  network events. An example of such an event is the reception of a packet.&lt;br /&gt;&lt;br /&gt;Although  Linux 2.4 supports  hooks for IPv4,IPv6 and DECnet, only IPv4 will be discussed in this document.&lt;br /&gt;&lt;br /&gt;Netfilter defines five hooks for IPv4. The declaration of the symbols for these can be  found in  "linux/netfilter_ipv4.h". These hooks are displayed in the table below:&lt;br /&gt;&lt;br /&gt;Table 1: Available IPv4 hooks&lt;br /&gt;&lt;br /&gt;   Hook                 Called&lt;br /&gt;NF_IP_PRE_ROUTING   After sanity checks, before routing decisions.&lt;br /&gt;NF_IP_LOCAL_IN      After routing decisions if packet is for this host.&lt;br /&gt;NF_IP_FORWARD       If the packet is destined for another interface.&lt;br /&gt;NF_IP_LOCAL_OUT     For packets coming from local processes on&lt;br /&gt;		    their way out.&lt;br /&gt;NF_IP_POST_ROUTING  Just before outbound packets "hit the wire".&lt;br /&gt;&lt;br /&gt;The  NF_IP_PRE_ROUTING  hook  is called  as the first  hook after a packet has been received.  This is the hook that the module presented later  will utilise. Yes the other hooks are  very  useful  as  well, but  for now  we will  focus  only on NF_IP_PRE_ROUTING.&lt;br /&gt;&lt;br /&gt;After hook  functions have done  whatever  processing they need to do with a packet they must  return one of the  predefined  Netfilter return codes.&lt;br /&gt;These codes are:&lt;br /&gt;&lt;br /&gt;Table 2: Netfilter return codes&lt;br /&gt;&lt;br /&gt;Return Code          Meaning&lt;br /&gt;  NF_DROP        Discard the packet.&lt;br /&gt;  NF_ACCEPT      Keep the packet.&lt;br /&gt;  NF_STOLEN      Forget about the packet.&lt;br /&gt;  NF_QUEUE       Queue packet for userspace.&lt;br /&gt;  NF_REPEAT      Call this hook function again.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The   NF_DROP  return  code  means  that  this  packet  should  be  dropped completely  and  any  resources  allocated  for  it  should  be   released. NF_ACCEPT tells Netfilter  that so far the  packet is  still acceptable and that it  should move to the  next stage of the network stack.  NF_STOLEN is an interesting one because it tells Netfilter to "forget" about the packet.&lt;br /&gt;What this  tells Netfilter is  that the hook  function will take processing of this packet from  here and that Netfilter  should drop all processing of it.  This does  not  mean,  however, that  resources  for  the  packet  are released. The packet and it's respective sk_buff structure are still valid, it's just that  the hook  function has  taken  ownership of the packet away&lt;br /&gt;from  Netfilter. NF_REPEAT requests that Netfilter calls the hook  function again.&lt;br /&gt;&lt;br /&gt;Registration of a  hook  function is a very  simple  process  that revolvesaround the  nf_hook_ops structure,defined in "linux/netfilter.h".The definition of this structure is as follows:&lt;br /&gt;&lt;br /&gt;          struct nf_hook_ops {&lt;br /&gt;                  struct list_head list;&lt;br /&gt;&lt;br /&gt;                  /* User fills in from here down. */&lt;br /&gt;                  nf_hookfn *hook;&lt;br /&gt;                  int pf;&lt;br /&gt;                  int hooknum;&lt;br /&gt;                  /* Hooks are ordered in ascending priority. */&lt;br /&gt;                  int priority;&lt;br /&gt;          };&lt;br /&gt;&lt;br /&gt;The  list  member of  this  structure is  used to  maintain  the  lists  of Netfilter hooks and has no importance for hook registration as far as users are concerned.  hook is a  pointer to a  nf_hookfn  function.  This is  the function  that will be  called  for  the  hook.  nf_hookfn  is  defined  in "linux/netfilter.h" as well. The pf field specifies a protocol family.  Valid&lt;br /&gt;protocol families are available from "linux/socket.h" but for IPv4 we want to use PF_INET.  The  hooknum  field specifies the  particular hook to install this function  for and is one of the values listed in table 1. Finally, the priority field specifies where in the order of execution this hook function should be placed.For IPv4,acceptable values are defined in "linux/netfilter_ipv4.h" in the "nf_ip_hook_priorities" enumeration.For the purposes of demonstration modules we will be using NF_IP_PRI_FIRST.&lt;br /&gt;&lt;br /&gt;Registration  of a  Netfilter hook  requires using a  nf_hook_ops structure with the nf_register_hook() function.nf_register_hook() takes the address of an "nf_hook_ops" structure and returns an integer value.However,if you actually   look  at   the  code   for  the  nf_register_hook() function in "net/core/netfilter.c", you will notice that it only ever returns a  value of zero. Provided below is example code that simply registers a function  that&lt;br /&gt;will drop all  packets that  come  in.  This code  will also  show  how the&lt;br /&gt;Netfilter return values are interpreted.&lt;br /&gt;&lt;br /&gt;Listing 1. Registration of a Netfilter hook&lt;br /&gt;/* Sample code to install a Netfilter hook function that will&lt;br /&gt; * drop all incoming packets. */&lt;br /&gt;&lt;br /&gt;#define __KERNEL__&lt;br /&gt;#define MODULE&lt;br /&gt;&lt;br /&gt;#include &lt;linux/module.h&gt;&lt;br /&gt;#include &lt;linux/kernel.h&gt;&lt;br /&gt;#include &lt;linux/netfilter.h&gt;&lt;br /&gt;#include &lt;linux/netfilter_ipv4.h&gt;&lt;br /&gt;&lt;br /&gt;/* This is the structure we shall use to register our function */&lt;br /&gt;static struct nf_hook_ops nfho;&lt;br /&gt;&lt;br /&gt;/* This is the hook function itself */&lt;br /&gt;unsigned int hook_func(unsigned int hooknum,&lt;br /&gt;                       struct sk_buff **skb,&lt;br /&gt;                       const struct net_device *in,&lt;br /&gt;                       const struct net_device *out,&lt;br /&gt;                       int (*okfn)(struct sk_buff *))&lt;br /&gt;{&lt;br /&gt;    return NF_DROP;           /* Drop ALL packets */&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;/* Initialisation routine */&lt;br /&gt;int init_module()&lt;br /&gt;{&lt;br /&gt;    /* Fill in our hook structure */&lt;br /&gt;    nfho.hook = hook_func;         /* Handler function */&lt;br /&gt;    nfho.hooknum  = NF_IP_PRE_ROUTING; /* First hook for IPv4 */&lt;br /&gt;    nfho.pf       = PF_INET;&lt;br /&gt;    nfho.priority = NF_IP_PRI_FIRST;   /* Make our function first */&lt;br /&gt;&lt;br /&gt;    nf_register_hook(&amp;nfho);&lt;br /&gt;    &lt;br /&gt;    return 0;&lt;br /&gt;}&lt;br /&gt;	&lt;br /&gt;/* Cleanup routine */&lt;br /&gt;void cleanup_module()&lt;br /&gt;{&lt;br /&gt;    nf_unregister_hook(&amp;nfho);&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Now  its  time to  start looking  at what  data gets  passed  into  hook&lt;br /&gt;functions  and how that data  an be used to make filtering decisions. So&lt;br /&gt;let's look more  closely at  the prototype  for nf_hookfn functions. The&lt;br /&gt;prototype is given in linux/netfilter.h as follows:&lt;br /&gt;&lt;br /&gt;	  typedef unsigned int nf_hookfn(unsigned int hooknum,&lt;br /&gt;                                         struct sk_buff **skb,&lt;br /&gt;                                         const struct net_device *in,&lt;br /&gt;                                         const struct net_device *out,&lt;br /&gt;                                         int (*okfn)(struct sk_buff *));&lt;br /&gt;&lt;br /&gt;The first argument to  nf_hookfn  functions  is a value specifying one of the hook types given in table 1. The second argument is more interesting. It is a pointer to a pointer to a  sk_buff  structure, the structure used by  the network  stack to describe packets.  This structure is defined in "linux/skbuff.h" .&lt;br /&gt;&lt;br /&gt;Possibly the  most useful fields out of  sk_buff structures are the three unions that describe the transport header (ie. UDP, TCP, ICMP, SPX), the network header (ie. IPv4/6, IPX, RAW) and the link layer header (Ethernet or RAW). The names of these unions are h, nh and mac respectively.  These unions contain several structures, depending on what protocols are in use in a particular packet.  One should note that  the transport  header and network header may very well point to the same  location in memory.  This is  the  case for TCP packets where  h  and  nh  are both  considered  as pointers  to IP header structures.  This means  that attempting  to get a&lt;br /&gt;value from h-&gt;th thinking it's pointing to the TCP header will  result in false results because &lt;br /&gt;h-&gt;th will actually be pointing to the  IP  header,just like nh-&gt;iph.&lt;br /&gt;&lt;br /&gt;The  two  arguments  that  come  after   skb   are pointers to net_device structures.net_device  structures are  what the  Linux  kernel uses to describe network interfaces of all sorts.It is defined in "linux/netdevice.h". The first of these structures, in,  is  used  to  describe the  interface the  packet  arrived on.   Not surprisingly, the  out  structure  describes  the interface the packet is leaving on.  It is  important to  realise that  usually only one of these structures will be provided.For instance, in will only be provided for the NF_IP_PRE_ROUTING and NF_IP_LOCAL_IN hooks.out will only be provided for the NF_IP_LOCAL_OUT and NF_IP_POST_ROUTING  hooks. At this stage I haven't  tested  which  of  these  structures  are   available for the NF_IP_FORWARD hook but if you make sure the pointers  are non-NULL before attempting to  dereference  them you  should be fine.&lt;br /&gt;&lt;br /&gt;Finally,the last item passed into a hook function is a function pointer called  okfn  that takes a  sk_buff  structure as  its only  argument and returns an integer. I'm not too sure on what this function does.  Looking in "net/core/netfilter.c" there are two places where this okfn is called. These  two places are  in the functions  nf_hook_slow() and nf_reinject()&lt;br /&gt;where at a  certain place this  function is called on a  return value of NF_ACCEPT from a Netfilter hook. If anybody has more information on okfn please let me know.&lt;br /&gt;&lt;br /&gt;Now that we've looked at the most interesting and useful bits of information that our hook functions receive, it's time to look at how we can use that information to filter packets.&lt;br /&gt;&lt;br /&gt;We will build a module which filters packets based on their TCP destination port.This is only a bit more fiddly than checking IP addresses because  we need  to  create  a  pointer  to  the TCP header ourselves.  Remember what was discussed earlier about  transport  headers&lt;br /&gt;and  network  headers?  Getting a pointer  to the TCP header  is a simple matter of allocating a pointer to a struct tcphdr (define in linux/tcp.h) and  pointing after the IP header in our packet data.Perhaps an example would help.  Listing 2 presents code to check if the destination TCP port of a packet matches some  port we want to drop all  packets for.&lt;br /&gt;&lt;br /&gt;Listing 2. Checking the TCP destination port of a received packet&lt;br /&gt;          /* Sample code to install a Netfilter hook function that will&lt;br /&gt; 	   * drop all incoming packets to a particular TCP destination port.*/&lt;br /&gt;&lt;br /&gt;          #define __KERNEL__&lt;br /&gt;          #define MODULE&lt;br /&gt;&lt;br /&gt;          #include &lt;linux/module.h&gt;&lt;br /&gt;          #include &lt;linux/kernel.h&gt;&lt;br /&gt;          #include &lt;linux/skbuff.h&gt;&lt;br /&gt;          #include &lt;linux/ip.h&gt;                  /* For IP header */&lt;br /&gt;	  #include &lt;linux/tcp.h&gt;&lt;br /&gt;	  #include &lt;linux/in.h&gt;			 /* For IPPROTO_TCP */&lt;br /&gt;          #include &lt;linux/netfilter.h&gt;&lt;br /&gt;          #include &lt;linux/netfilter_ipv4.h&gt;&lt;br /&gt;&lt;br /&gt;          /* This is the structure we shall use to register our function */&lt;br /&gt;          static struct nf_hook_ops nfho;&lt;br /&gt;&lt;br /&gt;          /* Port no we want to drop packets going to, in NB order */&lt;br /&gt;          unsigned char *deny_port = "\x00\x19";   /* port 25 */&lt;br /&gt;&lt;br /&gt;          /* This is the hook function itself */&lt;br /&gt;          unsigned int hook_func(unsigned int hooknum,&lt;br /&gt;                                 struct sk_buff **skb,&lt;br /&gt;                                 const struct net_device *in,&lt;br /&gt;                                 const struct net_device *out,&lt;br /&gt;                                 int (*okfn)(struct sk_buff *))&lt;br /&gt;          {&lt;br /&gt;              struct sk_buff *sb = *skb;&lt;br /&gt;&lt;br /&gt;              struct tcphdr *thead;&lt;br /&gt;&lt;br /&gt;              /* We don't want any NULL pointers in the chain&lt;br /&gt;	       * to the IP header. */&lt;br /&gt;              if (!sb ) return NF_ACCEPT;&lt;br /&gt;              if (!(sb-&gt;nh.iph)) return NF_ACCEPT;&lt;br /&gt;&lt;br /&gt;              /* Be sure this is a TCP packet first */&lt;br /&gt;              if (sb-&gt;nh.iph-&gt;protocol != IPPROTO_TCP) {&lt;br /&gt;                  return NF_ACCEPT;&lt;br /&gt;              }&lt;br /&gt;&lt;br /&gt;              thead = (struct tcphdr *)(sb-&gt;data +&lt;br /&gt;                                       (sb-&gt;nh.iph-&gt;ihl * 4));&lt;br /&gt;&lt;br /&gt;              /* Now check the destination port */&lt;br /&gt;              if ((thead-&gt;dest) == *(unsigned short *)deny_port) {&lt;br /&gt;                  return NF_DROP;&lt;br /&gt;              }&lt;br /&gt;          &lt;br /&gt;	      return NF_ACCEPT;&lt;br /&gt;          }&lt;br /&gt;&lt;br /&gt;          /* Initialisation routine */&lt;br /&gt;          int init_module()&lt;br /&gt;          {&lt;br /&gt;              /* Fill in our hook structure */&lt;br /&gt;              nfho.hook     = hook_func;&lt;br /&gt;              /* Handler function */&lt;br /&gt;              nfho.hooknum  = NF_IP_PRE_ROUTING; /* First for IPv4 */&lt;br /&gt;              nfho.pf       = PF_INET;&lt;br /&gt;              nfho.priority = NF_IP_PRI_FIRST;   /* Make our func first */&lt;br /&gt;          &lt;br /&gt;              nf_register_hook(&amp;nfho);&lt;br /&gt;&lt;br /&gt;              return 0;&lt;br /&gt;          }&lt;br /&gt;          &lt;br /&gt;	  /* Cleanup routine */&lt;br /&gt;          void cleanup_module()&lt;br /&gt;          {&lt;br /&gt;              nf_unregister_hook(&amp;nfho);&lt;br /&gt;          }&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;So, as you can see, using Netfilter hooks we can create our own custom filters.There are a lot of interesting things using these hooks. i.e-&lt;br /&gt;&lt;br /&gt;1.Kernel level Firewall ( For an implementation plz check bioforge's article).&lt;br /&gt;2.Kernel level sniffer.&lt;br /&gt;3.Module to hide packets from libpcap so that user level sniffers can't see it. &lt;br /&gt;(For more details plz check bioforge's article).&lt;br /&gt;4.Kernel level Backdoor daemon which will allow users to upload files and execute commands remotely. (Sounds like a cool trojan. isn' it ? ) &lt;br /&gt;....&lt;br /&gt;....&lt;br /&gt;The list is only limited by your imagination :). With the powers made availble to a kernel level programmer, you can do whatever you want ( Just beaware of those pretty nasty kernel faults :-P).&lt;br /&gt;&lt;br /&gt;Happy experimenting with linux network stack.&lt;br /&gt; &lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8801463-110534417501432568?l=whatisthekernel.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://whatisthekernel.blogspot.com/feeds/110534417501432568/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8801463&amp;postID=110534417501432568' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8801463/posts/default/110534417501432568'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8801463/posts/default/110534417501432568'/><link rel='alternate' type='text/html' href='http://whatisthekernel.blogspot.com/2005/01/some-fun-with-linux-netfilter-hooks.html' title='Some fun with Linux Netfilter Hooks'/><author><name>Saravanan</name><uri>http://www.blogger.com/profile/12068489124788465950</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8801463.post-110474095987725778</id><published>2005-01-03T01:29:00.000-08:00</published><updated>2005-01-03T01:21:22.533-08:00</updated><title type='text'>Memory Management in Linux Kernel</title><content type='html'>Linux Memory Management: Chapter 02 : Segmentation&lt;br /&gt;-----------------------------------------------------------------------------------------------------------------------&lt;br /&gt;WRITTEN BY : Karthikeyan Raghuraman&lt;br /&gt;REFERENCE  : Understanding the Linux Kernel, O'Reilly Publications&lt;br /&gt;DISCLAIMER : The content below is my understanding of the Chapter 1 in the above mentioned book. I myself being a newbie  do not guarantee the authenticity of the following information.&lt;br /&gt;CONTACT    : karthikeyan dot raghuraman at gmail dot com&lt;br /&gt;-----------------------------------------------------------------------------------------------------------------------&lt;br /&gt;&lt;br /&gt;The OS is not forced to keep track of physical memory all by itself.todays Micro processors are built in with hardware which make the memory management effecient as well as robust against programming errors.&lt;br /&gt;&lt;br /&gt;**Memory Addresses&lt;br /&gt;------------------&lt;br /&gt;++This is specific to the more common X86 processors.++&lt;br /&gt;&lt;br /&gt;There are three kinds of addresses in the X86 processors.&lt;br /&gt;1.LOGICAL ADDRESS&lt;br /&gt;	This is the SEGMENT:OFFSET representation of the actual address.&lt;br /&gt;2.LINEAR ADDRESS&lt;br /&gt;	A single 32 bit integer that can be used to address till 4GB of space(2^32 bytes)&lt;I am not sure if the current 64 bit procesors have a 64 bit Address bus&gt;.&lt;br /&gt;3.PHYSICAL ADDRESS&lt;br /&gt;	Used to address the memory cells in the memory chips. They are the address sent in the Address Bus.&lt;br /&gt;&lt;br /&gt;The flow of Address Values when the address is represented in Logical Address format.&lt;br /&gt;&lt;br /&gt;Logical Address   |		    | Linear Address   |           |Physical Address&lt;br /&gt;-----------------&gt;|Segmentation Unit|-----------------&gt;|Paging Unit|------------------&gt; AddressBus&lt;br /&gt;		  |		    |		       |	   |	&lt;br /&gt;		  &lt;br /&gt;&lt;br /&gt;**Address Translation in the x86 processors [1]&lt;br /&gt;----------------------------------------------&lt;br /&gt;The address translation occurs in two different ways&lt;br /&gt;	1. REAL Mode.&lt;br /&gt;	2. PROTECTED Mode.&lt;br /&gt;&lt;br /&gt;REAL Mode:&lt;br /&gt;	This is used mainly for Backward compatibility of the processors.&lt;br /&gt;	BIOS uses real mode of addressing.A real mode address comprises of a segment and offset.The corresponding address is given by&lt;br /&gt;		&lt;br /&gt;		SEGMENT ADDRESS * 16 + OFFSET &lt;br /&gt;		say u have address &lt;br /&gt;		FFFF:0001 &lt;br /&gt;		then the addres is  FFFF1. &lt;br /&gt;		FFFF * 16 =  FFFF0&lt;br /&gt;		FFFF0 + 0001 = FFFF1&lt;br /&gt;		&lt;br /&gt;	If you could recollect the x86 architecture after x386 the Global Descriptor Table/Local Descriptor tables are used for calculating the physical address. But the initialization of these registers is done when the computer is booted and is done in the REAL mode.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;**PROTECTED MODE Address translation	&lt;br /&gt;------------------------------------&lt;br /&gt;As most of the operations in the processor occurs in this mode we might need to touch base with the basics of the processor a liitle bit before proceeding with the Address translation.&lt;br /&gt;&lt;br /&gt;* Segmentation Registers&lt;br /&gt;-------------------------&lt;br /&gt;&lt;br /&gt;We know that logical address consists of two parts. SEGMENT identifier:OFFSET&lt;br /&gt;&lt;br /&gt;SEGMENT identifier : 16 bit field for SEGMENT SELECTOR.&lt;br /&gt;OFFSET		   : 32 bit field.&lt;br /&gt;&lt;br /&gt;To retrieve segment selectors quickly we have the SEGMENT REGISTERS in the processor like. There are totally 6 of these registers.&lt;br /&gt;CS :Code Segment&lt;br /&gt;DS :Data Segment &lt;br /&gt;SS :oh no not the Reich army of Himmler;) it is Stack Segment&lt;br /&gt;ES :Extended Segment&lt;br /&gt;FS :&lt;br /&gt;GS :&lt;br /&gt;&lt;br /&gt;The CS register has a 2 bit field which specifies the Current Privilege level(CPL) of the CPU. 0 is highest priority and 3 is lowest priority&lt;br /&gt;&lt;br /&gt;++ Linux uses only two priority levels. 0 and 3 for Kernel and User mode respectively.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;* Segment Descriptor:&lt;br /&gt;---------------------&lt;br /&gt;Each segment is represented by an 8 byte segment descriptor. They are stored either in the Globat Descriptor Table(GDT) or the Local Table Descriptor(LDT).&lt;br /&gt;&lt;br /&gt;Generally only one GDT is defined and each process has its own LDT.&lt;br /&gt;&lt;br /&gt;&lt;-&gt; Address of GDT in Main memory: This is in the _gdtr_ processor register.&lt;br /&gt;&lt;-&gt; Address of LDT in Main memory: This is in the _ldtr_ processor register.&lt;br /&gt;&lt;br /&gt;What is in a Segment Descriptor&lt;br /&gt;-------------------------------&lt;br /&gt;1. BASE Field         : A 32 bit field that contains the linear address of the first byte in the segment.&lt;br /&gt;2. Granularity(G) bit : If set size of the segment in multiples of 4KB else in bytes&lt;br /&gt;3. LIMIT field        : 20 bit field. If G bit is 0 then the size may vary from 1b to 1MB else 4KB to 4GB&lt;br /&gt;4. System(S) bit      : If RESET indicates Kernel is using this else it is a normal segment.&lt;br /&gt;5. Type Field         : 4 bit field indicating the access rights for the descriptor and the segment.&lt;br /&gt;&lt;br /&gt;* List of Segment Descriptors&lt;br /&gt;------------------------------&lt;br /&gt;-CS Descriptor : It can be present in GDT/LDT and has the S flag set.&lt;br /&gt;-DS Descriptor : It can be present in GDT/LDT and has the S flag set. Stacks represented using Generic data segments.&lt;br /&gt;-TSS Descriptor: Task State Segment Descriptor. This segment is used to save the contents of the processor registers. &lt;br /&gt;		 It can appear only in the GDT. The TYPE field has a value 11/9&lt;1011,1001&gt;. S flag is cleared.&lt;br /&gt;-LDTD	       : Local Descriptor Table Descriptor.&lt;br /&gt;		 Indicates that the Segment Descriptor refers to a segment containing an LDT.&lt;br /&gt;		 Present in the GDT only.&lt;br /&gt;		 TYPE Flag : 2&lt;br /&gt;		 S flag	   : 0&lt;br /&gt;		 +Contents of this Descriptor&lt;br /&gt;		 ----------------------------&lt;br /&gt;		 1. DPL : Descriptor Privilege Level. A 2 bit field. Represents the minimum CPL&lt;which is present in CS&gt;   is required to access the Descriptor.&lt;br /&gt;		 2. Segment Present Flag: This is set when the segment is in the main memory else it is reset.&lt;br /&gt;		 3. Data/Code Flag : This flag says if Data is present or Code in the segment.&lt;I am not clear with this flag if anybody can help me with this wud be nice&gt;&lt;br /&gt;		 4. AVL flag :  used by different OS but is ignored by LINUX.&lt;wow one bit less is a lot worth&gt;&lt;br /&gt;		 &lt;br /&gt;&lt;br /&gt;** SEGMENT Selectors&lt;br /&gt;---------------------&lt;br /&gt;- Every segment register maps to a non programmable register in the processor which directs you to the Segment   descriptor.&lt;br /&gt;- Every time a segement register is loaded with the value in a segment selector, the corresponding segment descriptor address is stored in the non-programmable register.&lt;br /&gt;&lt;br /&gt;What does a Segment Selector contain ?&lt;br /&gt;--------------------------------------&lt;br /&gt;1. A 13 bit segment descriptor entry pointer, which identifies the entry in the GDT/LDT.&lt;br /&gt;2. A Table Indicator Flag which describes whether the entry is in GDT/LDT.&lt;reset : GDT, set : LDT&gt;&lt;br /&gt;3. An Request Privilege Level (RPL) which is the same as CPL in the code segment descriptor.&lt;br /&gt;&lt;br /&gt;So now we know that the value of the CPL is used in Code Segment, the LDTD&lt;DPL&gt; and the Segment selector&lt;RPL&gt;.&lt;br /&gt;&lt;br /&gt;How is a logical Address converted to a Linear Address&lt;br /&gt;------------------------------------------------------&lt;br /&gt;		 gdt/ldt&lt;br /&gt;		.-------.&lt;br /&gt;		|	|&lt;br /&gt;		|_______|&lt;br /&gt;	.-----&gt;	|descriptor		&lt;br /&gt;		|_______|-------------&gt; + -----&gt; Linear Address&lt;br /&gt;	|	|	|		|&lt;br /&gt;	|	|	|		|&lt;br /&gt;	|	`-------' 		|&lt;br /&gt;	|				|&lt;br /&gt;	+&lt;-----  gdtr/ldtr		|&lt;br /&gt;	|       ----------		|&lt;br /&gt;	|       ----------		|&lt;br /&gt;	X 8          ^ 			|&lt;br /&gt;	|            |			|&lt;br /&gt;       ------------------	---------------&lt;br /&gt;	Index    |   TI	    :	    Offset	&lt;br /&gt;       ------------------	---------------&lt;br /&gt;       Segment Selector&lt;br /&gt;       ------------------------------------------&lt;br /&gt;             	Logical Address&lt;br /&gt;&lt;br /&gt;Explanation&lt;br /&gt;------------&lt;br /&gt;1. The Segment Selector has the Fields Index&lt;13 bit&gt; and the Table Index Flag&lt;1bit&gt;&lt;br /&gt;2. Each Segment Descriptor is 8 byte long.&lt;br /&gt;3. If the Descriptor X is stored at address MM in the memory, the next Descriptor will be at MM+8.&lt;br /&gt;4. Hence to obtain the address of the descriptors, the Index is MULTIPLIED BY 8.&lt;br /&gt;5. The Table Index flag is used to determine if the table is in the LDT/GDT.&lt;br /&gt;6. If GDT then the GDTR&lt;gdt register&gt; value is added to the result of step 3 else the LDTR is added.&lt;br /&gt;7. Now the value in step 6 represents the address of the descriptor in the GDT/LDT.8. Access the Descriptor.&lt;br /&gt;9. Now from the Descriptor's Base Field&lt;32 bit field&gt; we get the first byte of the segment.&lt;br /&gt;10. Now to this value add the offset in the Logical address.&lt;br /&gt;11. The value in Step 10 is the Linear address.&lt;br /&gt;&lt;br /&gt;Gyan:&lt;br /&gt;----&lt;br /&gt;The first byte of the GDT/LDT is always zero. This is to ensure that a logical Address with a Null Segment selector is invalid.&lt;br /&gt;The Maximum number of segment descriptors that can be stored in GDT is (2^13 - 1 =8191)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;the above topics covered the basics of Segmentation in a x86 processor.&lt;br /&gt;Now for some Linux stuff&lt;br /&gt;&lt;br /&gt;**Segmentation in Linux&lt;br /&gt;-----------------------&lt;br /&gt;Does Linux use Segmentation ?? the fact is it uses it in the most limited way.&lt;br /&gt;Linux relies mainly on Paging than on Segmentation. This is because&lt;br /&gt;- Memory Management is easy when all the processes use the same segment register value, this boils down to &lt;br /&gt;  the processes using the same set of linear addresses&lt;br /&gt;- Linux is ported into many systems. The RISC architecture systems do not support/in a limited way provide segmentation.&lt;br /&gt;&lt;br /&gt;In this architecture it is possible to store all the segments in a single GDT.&lt;can some body throw some light on&lt;br /&gt;why this is not a limiting factor&gt;&lt;br /&gt;&lt;br /&gt;LDT are not used by the KERNEL, but an Interface is exposed which allows the process to create there own LDTs.&lt;br /&gt;&lt;br /&gt;Segments Used by LINUX&lt;br /&gt;-----------------------&lt;br /&gt;* The kernel code segment. The GDT entries for this are &lt;br /&gt;	- Base : 0x00000000 &lt;hex 8 zeroes&gt;&lt;br /&gt;	- Limit: 0xFFFFF &lt;br /&gt;	- G    : 1 &lt;in Pages&gt;&lt;br /&gt;	- S    : 1 &lt;for Codesegment&gt;&lt;br /&gt;	- Type : 0xA&lt;segment which can be read and executed&gt;&lt;br /&gt;	- DPL  : 0 &lt;kernel mode&gt;&lt;br /&gt;	- D/B  : 1 &lt;for 32 bit offset address&gt;&lt;br /&gt;	&lt;br /&gt;	The KERNEL mode segment selector is given by the macro __KERNEL_CS. To address this segment the kernel loads&lt;br /&gt;	the return value of this macro in to the CS register.&lt;br /&gt;&lt;br /&gt;* The Kernel Data Segment The GDT entries are &lt;br /&gt;	- Base : 0x00000000 &lt;hex 8 zeroes&gt;&lt;br /&gt;	- Limit: 0xFFFFF &lt;br /&gt;	- G    : 1 &lt;in Pages&gt;&lt;br /&gt;	- S    : 1 &lt;for Codesegment&gt;&lt;br /&gt;	- Type : 2&lt;segment which can be read and written&gt; ** this is the only variation with the above&lt;br /&gt;	- DPL  : 0 &lt;kernel mode&gt;&lt;br /&gt;	- D/B  : 1 &lt;for 32 bit offset address&gt;&lt;br /&gt;	&lt;br /&gt;	The KERNEL mode segment selector is given by the macro __KERNEL_DS.&lt;br /&gt;	&lt;br /&gt;* The User Code Segment. The GDT entries are&lt;br /&gt;	- Base : 0x00000000 &lt;hex 8 zeroes&gt;&lt;br /&gt;	- Limit: 0xFFFFF &lt;br /&gt;	- G    : 1 &lt;in Pages&gt;&lt;br /&gt;	- S    : 1 &lt;for Codesegment&gt;&lt;br /&gt;	- Type : 0xA&lt;segment which can be read and written&gt;&lt;br /&gt;	- DPL  : 3 &lt;User mode&gt;&lt;br /&gt;	- D/B  : 1 &lt;for 32 bit offset address&gt;&lt;br /&gt;	&lt;br /&gt;	This segment can be accessed in both the Kernel and the User modes.&lt;br /&gt;	The segment selector is defined by the Macro __USER_CS.&lt;br /&gt;&lt;br /&gt;* The User Data Segment. The GDT entries are&lt;br /&gt;	- Base : 0x00000000 &lt;hex 8 zeroes&gt;&lt;br /&gt;	- Limit: 0xFFFFF &lt;br /&gt;	- G    : 1 &lt;in Pages&gt;&lt;br /&gt;	- S    : 1 &lt;for Codesegment&gt;&lt;br /&gt;	- Type : 2&lt;segment which can be read and written&gt;.&lt;br /&gt;	- DPL  : 3 &lt;user mode&gt;&lt;br /&gt;	- D/B  : 1 &lt;for 32 bit offset address&gt;&lt;br /&gt;		&lt;br /&gt;	The KERNEL mode segment selector is given by the macro __USER_DS.&lt;br /&gt;	&lt;br /&gt;Segmentation was introduced for the logical distinction between different types of codes.What we find from the above &lt;br /&gt;descriptions of the GDT entries is, all the different actually overlap which each other which kind of defeats the &lt;br /&gt;purpose of segmentation introduced in the x86 processors.&lt;br /&gt;&lt;br /&gt;So we could say that LINUX uses segmentation in such a limited way, that it could actually do away with it.&lt;br /&gt;&lt;br /&gt;* Task State Segment.&lt;br /&gt;	The descriptors of these segments are stored in the GDT.&lt;br /&gt;	Base : this is associated with the tss field of each process&lt;br /&gt;	G    : this is cleared.&lt;br /&gt;	Limit: 0xEB&lt;TSS is 236 bytes long&gt;&lt;br /&gt;	Type : 9/11&lt;br /&gt;	DPL  : 0 &lt;only process in kernel mode can access the TSS&gt;&lt;br /&gt;&lt;br /&gt;* A Default LDT Segment&lt;br /&gt;	&lt;br /&gt;	This segment is shared by all processes.&lt;br /&gt;	&lt;br /&gt;	Stored in the default_ldt variable.&lt;br /&gt;	Includes a single null Segment Descriptor.&lt;br /&gt;	Each process has its own LDT segment descriptor&lt;which points to the common default LDT&gt;.&lt;br /&gt;	&lt;br /&gt;	Generally the values are set as&lt;br /&gt;	Base Field : default_ldt value.&lt;br /&gt;	Limit      : 7&lt;br /&gt;	&lt;br /&gt;	If a process requires an LDT then a new 4096Byte segment is created. The default LDT segment descriptor&lt;br /&gt;	for this process is replaced with the value of the current segment in the GDT.&lt;br /&gt;	&lt;br /&gt;GYAN:&lt;br /&gt;-----&lt;br /&gt;Now for each process the system has to maintain two descriptors, One for TSS and one for LDT.&lt;br /&gt;So totally 12 entries.&lt;br /&gt;No of entries that could be made in the GDT is 2^13 - 1 = 8191&lt;br /&gt;No of processes that can run in the system at any moment is (8191-12)/2 ~ 4090&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8801463-110474095987725778?l=whatisthekernel.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://whatisthekernel.blogspot.com/feeds/110474095987725778/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8801463&amp;postID=110474095987725778' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8801463/posts/default/110474095987725778'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8801463/posts/default/110474095987725778'/><link rel='alternate' type='text/html' href='http://whatisthekernel.blogspot.com/2005/01/memory-management-in-linux-kernel.html' title='Memory Management in Linux Kernel'/><author><name>Saravanan</name><uri>http://www.blogger.com/profile/12068489124788465950</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8801463.post-110439540573111406</id><published>2004-12-30T00:30:00.000-08:00</published><updated>2004-12-30T00:30:05.730-08:00</updated><title type='text'>Kernel : Basics</title><content type='html'>&lt;a href="http://whatisthekernel.blogspot.com/"&gt;Kernel Basics&lt;/a&gt;&lt;br /&gt;BASICS Of the Linux Kernel : Chapter 01.&lt;br /&gt;-----------------------------------------------------------------------------------------------------------------------&lt;br /&gt;WRITTEN BY : Karthikeyan Raghuraman&lt;br /&gt;REFERENCE  : Understanding the Linux Kernel, O'Reilly Publications&lt;br /&gt;DISCLAIMER : The content below is my understanding of the Chapter 1 in the above mentioned book. I myself being a newbie  do not guarantee the authenticity of the following information.&lt;br /&gt;CONTACT    : karthikeyan dot raghuraman at gmail dot com&lt;br /&gt;-----------------------------------------------------------------------------------------------------------------------&lt;br /&gt;&lt;br /&gt;An overview of the Unix/Linux Kernels&lt;br /&gt;---------------------------------------&lt;br /&gt;Kernels provide an environment for applications to run&lt;br /&gt;The kernel implements a lot of services and exposes the corresponding interfaces.&lt;br /&gt;Applications use these interfaces and do not interact directly with the Hardware&lt;br /&gt;&lt;br /&gt;**Process/Kernel Model&lt;br /&gt;--------------------&lt;br /&gt;A process can run in either of the two modes&lt;br /&gt;	- User Mode&lt;br /&gt;	- Kernel Mode&lt;br /&gt;	&lt;br /&gt;A program running in User mode can not access the Kernel programs/data structures, these restrictions do not apply when executing in Kernel Mode.&lt;br /&gt;A program executes most of the time in User Mode and switches to the Kernel Mode only when requesting a service provided by the kernel.&lt;br /&gt;The Linux systems apart from the user process a few privileged processes called Kernel Threads run...&lt;br /&gt;&lt;br /&gt;*Activating Kernel Routines&lt;br /&gt;---------------------------&lt;br /&gt;1. Invoke a system Call&lt;br /&gt;2. CPU signals an Exception&lt;br /&gt;3. Peripheral issues an Interrupt.&lt;br /&gt;&lt;br /&gt;** Process Implementation&lt;br /&gt;--------------------------&lt;br /&gt;- Each process has a process descriptor.&lt;br /&gt;- When the kernel stops executing a process it saves the status of registers into this process descriptor&lt;br /&gt;	. PC : Program Counter&lt;br /&gt;	. SP : Stack Pointer&lt;br /&gt;	. GPR: General Purpose Register&lt;br /&gt;	. FPR: Floating Point Registers&lt;br /&gt;	. PSW: Processor Status Word&lt;br /&gt;	. Memory Registers used to keep track of RAM access &lt;MAR/MBR&gt;&lt;br /&gt;	&lt;br /&gt;- When the kernel wants to invoke a previously running process, it uses the process's descriptor to load the registers&lt;br /&gt;- When a process waits for an event to occur, it is put in a queue.&lt;br /&gt;- There exists different queues each for a specific event.&lt;br /&gt;&lt;br /&gt;**Reentrant Kernels&lt;br /&gt;--------------------&lt;br /&gt;-A reentrant process consists of logically separate code and data segments and a private stack.&lt;br /&gt;-Many reentrant process can share the same code segment but each instance has its own data segment and stack.&lt;br /&gt;&lt;br /&gt;-Reentrant kernel means several processes could be executing at the same time in the kernel mode &lt;br /&gt;&lt;br /&gt;-KERNEL CONTROL PATH : the sequence of instructions executed by the kernel to handle a system call&lt;br /&gt;-the execution of the Kernel Control path is interleaved.&lt;br /&gt;&lt;br /&gt;**Process Address Space&lt;br /&gt;------------------------&lt;br /&gt;-Every process has its own address space.&lt;br /&gt;-User Mode Execution&lt;br /&gt;	. Private Stack,Data and Code areas.&lt;br /&gt;-Kernel Mode Execution&lt;br /&gt;	.Kernel Code and Data. Private Stack.&lt;br /&gt;-Address space sharing for Communication&lt;br /&gt;-mmap: memory map command. Allows a part of the file/memory to be mapped to the process address space.&lt;br /&gt;&lt;br /&gt;**Synchronization and Critical Regions&lt;br /&gt;--------------------------------------&lt;br /&gt;- Reentrant kernels require synchronization.&lt;br /&gt;- the Race and DeadLock conditions.&lt;br /&gt;- Atomic operations disallow such possibilities of Race/Deadlock.&lt;br /&gt;- Could be solved using&lt;br /&gt;	. Non preemptive kernels &lt;valid for uniprocessor systems&gt;&lt;br /&gt;	. Interrupt Disabling before entering Critical region.&lt;valid for uniprocessor systems&gt;&lt;br /&gt;	. Semaphores&lt;br /&gt;	. SpinLocks&lt;br /&gt;	&lt;br /&gt;**Signals and IPC&lt;br /&gt;------------------&lt;br /&gt;-System Events&lt;br /&gt;	.Asynchronous Notification &lt;SIGTERM on pressing ^c at the terminal&gt;&lt;br /&gt;	.Synchronous Errors/Exceptions &lt;SIGSEGV when a process accesses an illegal address&gt;&lt;br /&gt;- POSIX defines 20 signals : 2 User definable.&lt;br /&gt;- On reception of a Signal a process may&lt;br /&gt;	. handle the signal&lt;br /&gt;	. ignore it&lt;br /&gt;- If the process is not performing any of the above mentioned actions the kernel wud take one of the default action listed below.&lt;br /&gt;	.Terminate the process.&lt;br /&gt;	.Core Dump and terminate the process.&lt;Where is this address space dump put ?&gt;&lt;br /&gt;	.Ignore the signal.&lt;br /&gt;	.Suspend the process.&lt;br /&gt;	.Resume the process execution.&lt;br /&gt;- Shared Memory provides the fastest way for exchanging data between processes&lt;br /&gt;&lt;br /&gt;** I have skipped Process and Memory Management.&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8801463-110439540573111406?l=whatisthekernel.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://whatisthekernel.blogspot.com/feeds/110439540573111406/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8801463&amp;postID=110439540573111406' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8801463/posts/default/110439540573111406'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8801463/posts/default/110439540573111406'/><link rel='alternate' type='text/html' href='http://whatisthekernel.blogspot.com/2004/12/kernel-basics.html' title='Kernel : Basics'/><author><name>Saravanan</name><uri>http://www.blogger.com/profile/12068489124788465950</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8801463.post-109870772285935106</id><published>2004-10-25T05:28:00.000-07:00</published><updated>2004-11-02T01:42:38.856-08:00</updated><title type='text'>Overview of the linux kernel</title><content type='html'>&lt;span style="font-weight:bold;"&gt;Abstract&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;  This article discusses the basic kernel features of the linux kernel.It starts from the history of the linux kernel to basic operataing designs required for a kernel and how linux implements it in a nutshell.Then it concludes by promising future articles on different layers of the  linux kernel and also an articles on Linux kernel 2.6 features,Future of Linux which will be contributed by different members of the lkg_india group.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Audience&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I myself being a kernel newbie,has intended to write the article in such a fashion that it will be understood by anyone who has used the computer.It can also be read by experienced kernel developers so as to refresh their thoughts and also suggest/crticise the mistakes in the article.This article is dedicated for free for the benefit and education of all. That a person seeking knowledge should have the opportunity to find it.Thanks to every other document written with the same vein which has made this article possible.&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;&lt;br /&gt;&lt;br /&gt;Intoduction&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;To start off immediately what does kernel mean?Well thats what this blog is for,for us to understand what the kernel is.Hopefully future articles in the blog will make us understand the kernel completely and clearly.Well here is the literal meaning of the word "kernel" straight from dict.org&lt;br /&gt;&lt;br /&gt;"The central, substantial or essential part of anything;the gist; the core;"&lt;br /&gt;&lt;br /&gt;This is how the Online dictionary of computing defines the word kernel&lt;br /&gt;&lt;br /&gt;"The essential part of Unix or other operating systems, responsible for resource allocation,low-level hardware interfaces, security etc"&lt;br /&gt;&lt;br /&gt;Operating System&lt;br /&gt;&lt;br /&gt;Any computer system includes a basic set of programs called the operating system. The most important program in the set is called the kernel.The other programs are less crucial utilities; they can provide a wide variety of interactive experiences for the user as well as doing all the jobs the user bought the computer for but the essential shape and capabilities of the system are determined by the kernel.&lt;br /&gt;&lt;br /&gt;Is kernel the entire operating system?&lt;br /&gt;&lt;br /&gt;No,as discussed above.The kernel is the core component of the operating system.The operating system contains the Kernel plus other systen utilities which use the kernel to provide higher level house keeping tasks.Technically speaking,Linux is only the kernel,as it does not provide system utilities like file system utilities,compilers,editors,the graphical user interface which are provided by any other operating system.So linux users typically rely on commercial ditributions like Suse,Red Hat etc., to have the entire operating system.Having known the what a Operating system and what the kernel is,let us know something about the history of the Linux kernel.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Linux,the revolutionay open source kernel&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Linux was intially developed by Linus in Aug 1991.As a source of inspiration listed below is his mail on the famous comp.os.newsgroup &lt;br /&gt;&lt;br /&gt;"----- Message from "Linus Benedict Torvalds" &lt;torvalds@klaava.Helsinki.FI&gt;&lt;br /&gt;on Mon, 26 Aug 1991 02:27:08 +0530 -----&lt;br /&gt;                                                                            &lt;br /&gt;                                                                            &lt;br /&gt;      Subject: What would you like to see most in minix?                    &lt;br /&gt;                                                                            &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Hello everybody out there using minix-&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;I'm doing a (free) operating system (just a hobby, won't be big&lt;br /&gt;and professional like gnu) for 386(486) AT clones. This has&lt;br /&gt;been brewing since april, and is starting to get ready. I'd like&lt;br /&gt;any feedback on things people like/dislike in minix; as my OS&lt;br /&gt;resembles it somewhat (same physical layout of the file-sytem&lt;br /&gt;due to practical reasons)among other things.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;I've currently ported bash (1.08) an gcc (1.40), and things seem to work.&lt;br /&gt;This implies that i'll get something practical within a few months, and I'd&lt;br /&gt;like to know what features most people want. Any suggestions are welcome,&lt;br /&gt;but I won't promise I'll implement them :-)&lt;br /&gt;&lt;br /&gt;Linus Torvalds torvalds@kruuna.helsinki.fi"&lt;br /&gt;&lt;br /&gt;As seen from the mail there was an effort by the FSF(Free Software Foundation) headed by Richard Stallman to build a complete free professional Operating System. So lets talk about FSF before proceeding further.&lt;br /&gt;&lt;br /&gt;FSF - RICHARD STALLMAN&lt;br /&gt;&lt;br /&gt;The FSF was founded by Richard Stallman in 1984.Its also know as the GNU software project which was launched to built a complete free operating system.When richard started working for writing a free OS he felt that he should use a free editor to write programs for the Operating system,then he wrote the GNU emacs editor.Then they needed a free compiler to compile their C programs,thereby was born the GNU Compiler collection(gcc).Later when linus released Linux kernel free and it became popular,it was adopted by FSF as their kernel.So GNU utilities were used with the Linux kernel to make the complete Operating System.Now Linux remains as one of the most popular Open source operating system in the world.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Features of the Linux Kernel&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;I.Monolithic kernel with module support&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The linux kernel is monolithic with module suport.So now let us try to decipher the meaning of the previous sentence and why linux adopted a monolithic kernel with module strategy.&lt;br /&gt;&lt;br /&gt;Monolithic kernel &lt;br /&gt;A  monolithic kernel is a single large complex "do-it-yourself" kernel program which is composed of several different logical entities(kernel layers).Each kernel layer is integrated in to the large kernel program and runs in kernel mode(more on kernel mode later)on behalf of the current process.&lt;br /&gt;&lt;br /&gt;Microkernel &lt;br /&gt;&lt;br /&gt;A micro kernel consists of a small set of important fucntions in the kernel generally a simple scheduler,synchronisation primitives.Several System processes that run on top of the kernel to implement other OS layer functions like memory allocators,device drivers,system calls,file system etc., These kernel layers cordinate together by message passing between them.Therby because of the message passing the microkernel is slower than the monolithic kernel.&lt;br /&gt;&lt;br /&gt;Monolithic Vs Microkernel&lt;br /&gt;&lt;br /&gt;Monolithic Kernel is faster than microkernel as stated above.However,Microkernel has some theoritical advantages over the monolithic kernel.They are as follows,&lt;br /&gt;&lt;br /&gt;1.The microkernel occupies less RAM,since system processes(as disscussed above) that are not doing their functionalities are swapped out or destroyed.&lt;br /&gt;&lt;br /&gt;2.The architecture of the microkernel forces the programmers to adopt a modularized approach &lt;br /&gt;as  different layers of the kernel are independant of each other.Moreover,the different layers interact with each other through clear well defined software interfaces.&lt;br /&gt;&lt;br /&gt;3.Moreover, an existing microkernel operating system can be fairly easily ported to other architectures, since all hardware dependent components are generally encapsulated in the microkernel code.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Modules,the linux way&lt;br /&gt;&lt;br /&gt;Modules are a kernel feature that effectively achieves many of the theoretical advantages of microkernels without introducing performance penalties. A module is an object file whose code can be linked to (and unlinked from) the kernel at runtime. The object code usually consists of a set of functions that implements a filesystem, a device driver, or other features at the kernel's upper layer. The module, unlike the external layers of microkernel operating systems, does not run as a specific process. Instead, it is executed in Kernel Mode on behalf of the current process, like any other statically linked kernel function.&lt;br /&gt;&lt;br /&gt;Advantages provided by a monolithic kernel with modules&lt;br /&gt;&lt;br /&gt;1.Less main memory usage&lt;br /&gt;  A module is linked to the kernel when its functionality is needed and unlinked when its no longer used.This mechanism is done automatically by the kernel and is transparent to the user.&lt;br /&gt;&lt;br /&gt;2.Modularized approach&lt;br /&gt;  Modules force the programmers to intoduce well defined software interfaces for interaction.&lt;br /&gt;&lt;br /&gt;3.Platform independence&lt;br /&gt;  A module does not depend on a fixed hardware platform.A module like device driver is infact specfic to the device but not to the hardware platform(x86,sparc etc..,)&lt;br /&gt;&lt;br /&gt;4.Faster&lt;br /&gt;  Since the module is linked in to the kernel,it is faster like the monolithic kernel as there is no message passing as in microkernel.Infact there is small time we lose for linking and unlinking of the modules which is less than that of the time required for message passing in microkernels.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;II.Linux Filesystem&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Files are a basic abstraction provided by the operating system along with processes(more about processes later).A file is an information container structured as a sequence of bytes.From the user's point of view, files are organized in a tree-structured name space starting from /(parent) which is called the root directory.&lt;br /&gt;&lt;br /&gt;Different file types in linux&lt;br /&gt;&lt;br /&gt;The beauty of linux lies in the fact that it almost treats everything as files including devices. There are the following types of files&lt;br /&gt;&lt;br /&gt;1. Regular files&lt;br /&gt;2. Directory files&lt;br /&gt;3. Symbolic links&lt;br /&gt;4. Device files(character,block).&lt;br /&gt;5. Pipes&lt;br /&gt;6. Sockets&lt;br /&gt;&lt;br /&gt;The first three types are constituents of the linux filesystem.Device files are related to I/O devices and device drivers integrated into the kernel.Pipes and sockets are special files used for interprocess communication.&lt;br /&gt;&lt;br /&gt;Inode&lt;br /&gt;&lt;br /&gt;All information needed by the filesystem to handle a file in included in a data structure called the inode.Each file has its own inode which the filesystem uses to identify the file.The following information is kept in the inode&lt;br /&gt;1.File type(as discussed above)&lt;br /&gt;2.Number of hard links associated with the file(see next section)&lt;br /&gt;3.File length in bytes&lt;br /&gt;4.Inode number that identifies the file within the filesystem &lt;br /&gt;5.User ID of the file owner(discussed below)&lt;br /&gt;5.Group ID of the file(discussed below) &lt;br /&gt;6.The last modify time&lt;br /&gt;7.Access rights and file mode (discussed below)&lt;br /&gt;&lt;br /&gt;Hard links and symbolic links&lt;br /&gt;&lt;br /&gt;The same file may have several links included in the same directory or in different ones, thus several filenames.&lt;br /&gt;&lt;br /&gt;Hardlink&lt;br /&gt;&lt;br /&gt;A hardlink is just a different filename but points to the same inode on the disc.So the file information of the both the file and the hard link will coincide as they point to the same inode.The unix command used to create a hard link is&lt;br /&gt;$ln f1 f2&lt;br /&gt;&lt;br /&gt;f2 is the hard link for file f1&lt;br /&gt;&lt;br /&gt;Limitations of hard link&lt;br /&gt;&lt;br /&gt;1.Hard links cannot exist to directories as it might transform the tree structure in to a graph with cycles thus making ti impossible to locate a file according to its name.&lt;br /&gt;&lt;br /&gt;2.Hard links can be created only for files in the same filesystem as inode is the same.This is a serious limitation as Linux supports variuos other filesystems.&lt;br /&gt;&lt;br /&gt;Softlink&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Symbolic links are short files that contain an arbitrary pathname of another file. The pathname may refer to any file located in any filesystem,it may even refer to a  non exixtent file or a file in another filesystem.Thus symbolic links are short files than contain the pathname of the file that is linked to.So a symbolic link will have a separate inode for itself.The unix command to create a symbolic link is&lt;br /&gt;$ln -s f1 f2&lt;br /&gt;&lt;br /&gt;f2 is a symbolic link to f1.The linux filesystem will create a sym link f2 and will write in to it the pathname for f1.In this way f2 can refer to f1.&lt;br /&gt;&lt;br /&gt;Users and Groups&lt;br /&gt;&lt;br /&gt;A multiuser system is a computer that is able to concurrently and independently execute several applications belonging to two or more users. "Concurrently" means that applications can be active at the same time and contend for the various resources such as CPU, memory, hard disks, and so on. "Independently" means that each application can perform its task with no concern for what the applications of the other users are doing.&lt;br /&gt;&lt;br /&gt;In a multiuser system, each user has a private space on the machine: typically, he owns some quota of the disk space to store files, receives private mail messages, and so on. The operating system must ensure that the private portion of a user space is visible only to its owner. In particular, it must ensure that no user can exploit a system application for the purpose of&lt;br /&gt;of violating the private space of the another user.&lt;br /&gt;All users are identified by a unique number called the User ID , or UID. Usually only a restricted number of persons are allowed to make use of a computer system. When one of these users starts a working session, the operating system asks for a login name and a password.If the user does not input a valid pair, the system denies access. Since the password the assumed to be private the users privacy is maintained.In order to selectively share material with other users, each user is a member of one or more groups, which are identified by a unique number called a Group ID,or GID. Each file is associated with a UID and GID.&lt;br /&gt;&lt;br /&gt;Access rights and mode&lt;br /&gt;&lt;br /&gt;The users of a file fall in to one of this three classes&lt;br /&gt;&lt;br /&gt;1.The owner of the file&lt;br /&gt;2.The users who belong to the same groups as that of the owner&lt;br /&gt;3.All other users(others) in the system&lt;br /&gt;&lt;br /&gt;There are three types of access rights for each of these classes namely&lt;br /&gt;Read,write and execute.&lt;br /&gt;Thus the set of access rights associated with a file in linux consists of nine(3(for diff classes of users) * 3(for different access rights)) different flags.There are three additional flags which define the file mode which have a meaning when applied to executable files.They are&lt;br /&gt;1.SUID flag&lt;br /&gt;&lt;br /&gt;If the executable file has the SUID flag set, the process gets the UID of the file owner.&lt;br /&gt;&lt;br /&gt;2.SGID&lt;br /&gt;&lt;br /&gt;If the executable file has the SUID flag set, the process gets the GID of the file group.&lt;br /&gt;&lt;br /&gt;The suid and sgid programs are important programs to be protected in a secure system as they change the id's during execution.So an intruder might use these programs for executing something malicious on behalf of the user who is suid or sgid.&lt;br /&gt;&lt;br /&gt;3.sticky&lt;br /&gt;An executable file with the sticky flag set corresponds to a request to the kernel to keep the program in memory after its execution terminates.These flag is set when many processes share the same program for example vi might be used by many processes.So vi can have its sticky bit set.This flag has become obsolete as other approaches like copy-on-write are used now for sharing code pages between processes.&lt;br /&gt;&lt;br /&gt;chmod is the unix command used to allow the user to set access and mode flags for a file.&lt;br /&gt;&lt;br /&gt;Example&lt;br /&gt;$chmod 4777 f1&lt;br /&gt;&lt;br /&gt;4777 in binary is 100111111111.So there are 12 bits corresponding to User classes,access and mode permissions as below.&lt;br /&gt;&lt;br /&gt;100---will set the suid flag for f1(mode of the file)&lt;br /&gt;111---will give read,write,exec access to owner of f1&lt;br /&gt;111---will give read,write,exec access to other users of the group which the user belongs to.&lt;br /&gt;111---will give read,write,exec access to other remaining users in the system.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Virtual File System technology&lt;br /&gt;&lt;br /&gt;Linux follows a object oriented Virtual File System technology inspired from SVR4 and solaris.There linux supports most of the filesystems like DOS,FAT,ext3,resierfs,JFS etc..,Also porting a file system to Linux is very easy task because of the VFS technology that linux follows.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;III.Processes in Linux&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Process is another fundamental abstraction(apart from File) provided by the OS.Having seen a overview of the filesystem in linux let us see how linux handles processes.&lt;br /&gt;&lt;br /&gt;A process can be defined either as "an instance of a program in execution," or as the "execution context" of a running program. In traditional operating systems, a process executes a single sequence of instructions in an address space,the address space is the set of memory addresses that the process is allowed to reference.Operating systems like linux allow multiple execution flows,that is,multiple sequences of instructions executed in the same address space.&lt;br /&gt;&lt;br /&gt;Multiuser systems must allow different processes to be active concurrently and thereby contend for tbe resources,mainly the CPU.Such systems are said to be multitasking systems.The scheduler is the part of the kernel which decided which process will run on the CPU at a given instant of time.It is done in such a manner that every process feels that it is the only process running on the CPU.This concept is called virtualization of the processor.Another virtualisation that is provided by  linux(or any OS) to processes is the virtual memory where in a process feels that it has the entire memory on system available to it.Processes of a multiuser system must be preemptive ie., the scheduler of the kernel will decide how long each process can hold the CPU.Thereby if a higher priority process comes in to execution the scheduler will preempt the lower priority running process.&lt;br /&gt;&lt;br /&gt;Linux is a multiprocessing system with preemptive processes.&lt;br /&gt;&lt;br /&gt;Process management in Linux&lt;br /&gt;&lt;br /&gt;The fork( ) and exit( ) system calls are used respectively to create a new process and to terminate it.Linux maintains a clear distinction between the process and the program by using exec( )-like system call to load a new program.After exec has been done the process resumes execution  with a brand new address space containing the loaded program.&lt;br /&gt;&lt;br /&gt;Process Creation&lt;br /&gt;Processes in Linux follow a parent child relationship.The Process which invokes the fork is the parent and the process that is created is the child.The init process(created by the init.c of the kernel) is the root parent of all the other processes in linux.The task_struct is the data structure in the kernel that defines the process.Parents and children can easily find each other as the data structure contains information about the relationship.Linux implements fork using the copy-on-write approach which defers address space duplication of parent to the child on fork.The address space is copied only when a write is being made in to the address space.So till a write happens the parent and the child share only the Page tables and not the address space.The copy-on-write has the following advantage&lt;br /&gt;&lt;br /&gt;1.Most instances of fork are followed by an exec to a program as the child does some other functionailty when compared to the program.Thereby it is a waste(an overhead) to copy the address space of the parent to the child and then again immediately overwrite the address space.So this overhead is reduced by the copy-on-write approach.&lt;br /&gt;&lt;br /&gt;Process Termination&lt;br /&gt;The exit system call is used to terminate a process.The kernel handles this system call by releasing the resources owned by the process and sending the parent process a SIGCHLD signal, which is ignored by default.&lt;br /&gt;&lt;br /&gt;Zombie processes&lt;br /&gt;The parent enquires about the termination of the child process by the wait( ) system call which  allows a process to wait until one of its children terminates; it returns the process ID (PID) of the terminated child.A special zombie process state is introduced to represent terminated processes on which the parent has not issued a wait system call.The process descriptor of the child is released after the wait is executed by the parent and the child process goes to the stopped state.Now the question arises if the parent does not issue a wait call and if it terminates what happens to the child in Zombie state.There will be many zombies which will occupy useful memory.The solution lies in init process which takes over as the parent of child process whose parents have terminated.It routinely issues wait calls thereby getting rid of the zombies.&lt;br /&gt;&lt;br /&gt;IV.The process/Kernel model&lt;br /&gt;&lt;br /&gt;Having known about the kernel architecture and the basic abstractions(file,process) let us now see how the linux kernel is modelled.On multiuser/multitasking systems the operating system must hide all low level details concerning the physical organization of the computer from applications run by user.So when a user application wants to access a hardware,it request the kernel which evaluates the request and if it chooses to grant access,it interacts with the hardware on behalf of the process.This thereby enhances the security on multiuser systems which prevent user applications from damaging the system hardware resources.Probably,this is the reason why DOS does  not require such a model as it is single user system which allows the user to do anything with the systems hardware.&lt;br /&gt;&lt;br /&gt;Well how does linux implement the above mentioned feature.All modern operating systems,implement the above mentioned feature with the help of hardware specific features which forbids the user programs to interact directly with the hardware.For example,the x86 CPU provides four rings of execution from ring 0 to ring 3 in the order of descending privileges.Linux uses the two rings namely ring 0 and ring 3 for implementing the feature.User mode applications run in ring 3 of x86 and kernel runs in ring 0 of x86.Therefore these rings correspond to the User mode and kernel mode concept of the linux operating system.&lt;br /&gt;&lt;br /&gt;Note:Intel introduced protected mode(ring levels) starting from 80386.So Linux was developed for 80386 and above(commonly referred as i386/i686 architectures).&lt;br /&gt;&lt;br /&gt;When a program is executed in User Mode, it cannot directly access the kernel data structures or the kernel programs. When an application executes in Kernel Mode, however, these restrictions no longer apply. Each CPU model provides special instructions to switch from User Mode to Kernel Mode and vice versa. A program executes most of the time in User Mode and switches to Kernel Mode only when requesting a service provided by the kernel. When the kernel has satisfied the program's request, it puts the program back in User Mode.&lt;br /&gt;&lt;br /&gt;The kernel is a process manager and not a process by itself.The creation,deletion and handling of processes are done by a group of routines in the kernel.The process/kernel model in linux allows a process to request a kernel service through system calls.System calls thereby take the process from user mode to kernel mode,does the necessary request and puts the process back in the user mode.&lt;br /&gt;&lt;br /&gt;A switch to Kernel mode from user mode will happen by any one of these methods&lt;br /&gt;&lt;br /&gt;1.Through system call by a process as discussed above&lt;br /&gt;2.The CPU exceutes an exception.The exception has to be handled by the kernel on behalf of the process that caused the exception.For example,the kernel must handle the page fault exception.&lt;br /&gt;3.A device issues an interrupt to the CPU to notify the CPU an event.In this case the kernel executes the corresponding interrupt handler for the device.&lt;br /&gt;&lt;br /&gt;Linux also has some privileged kernel level processes.They are called kernel threads.Linux uses kernel threads in a limited way for certain functionalites.They execute in the kernel thread.keventd is an example of a kernel thread.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;V.Kernel Synchronisation&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;On multiuser/multitasking systems many processes must be handled by the kernel at a single instant of time as several processes may be executing in the kernel at the same time.For example ,a process executing on a CPU in a uniprocessor system might be waiting on some I/0.During this time the process in kernel mode is interleaved and another process is executed in the kernel.When the I/O interrupt is finished the process waiting for I/O is executed again.So the Linux kernel must be reentrant,ie..,multiple processes must be handled by the kernel maintaining the global data structures they use in a consistent state.&lt;br /&gt;Before proceeding further,we will see the definition of a kernel control path.A kernel control path is a sequence of instructions executed in the kernel mode on behalf of a process or on behalf of an interupt.&lt;br /&gt;&lt;br /&gt;How can Kernel reentrancy be achieved?&lt;br /&gt;  &lt;br /&gt;The following are the methods to achieve kernel reentrancy&lt;br /&gt;&lt;br /&gt;1.Reentrant functions&lt;br /&gt;2.Atomic operations&lt;br /&gt;3.Non Premptive kernel&lt;br /&gt;4.Interrupt disabling&lt;br /&gt;5.Locking mechanisms(semaphores and spinlocks)&lt;br /&gt;&lt;br /&gt;Reentrant functions&lt;br /&gt;Reentrant functions are those functions which operate only on local variable and not on global data structures.But the kernel cannot only be limited to reentrant functions.Also the kernel has a fixed stack which is small,so local variable must be less.&lt;br /&gt;&lt;br /&gt;Before looking at other mechanisms of achieving reentarncy let us see what a race condition is and what is a critical region.&lt;br /&gt;&lt;br /&gt;Assume that there is a global data structure(a resource) R1.If R1 is one the resource is free.Now a kernel control path KCP1 reads the value of R1 which is 1 i.e.., the resource is free.Now if KCP1 is interleaved from the kernel and if KCP2 reads the value of R1 which is still 1.It will take the resource and decrement the value to 0.Now if KCP1 resumes execution it still sees that the resource is free and decrements the value of R1 to -1,thereby taking the resource.So both the KCP's are using the resources leading to dangerous effects.This is a race condition.&lt;br /&gt;&lt;br /&gt;Atomic operations&lt;br /&gt;It is always safe to access a global variable with a single atomic uninteruptible instruction.For example in the previuos example if the two kernel paths have read the data and decremented the value of R1 in a singl operation there would have been no race condition.&lt;br /&gt;&lt;br /&gt;But its always not possible to access data structures in a single atomic operation.Any section of code that should be finished by a process befored another process is scheduled is called the critical region.Further mechanisms we will see how to protect critical regions.&lt;br /&gt;&lt;br /&gt;Non Premptive kernel&lt;br /&gt;A simple solution to synchronisation problems is to make the kernel non preemptive ie., when a process is in kernel mode it cannot be interleaved by any other process until it voluntarily relinquishes the CPU(in which case it makes sure that the Data structures are in a consistent state).Therefore in a non preemptive kernel all the global data structures except those that are used by interupts and exceptions are safe(as interleaving of a process in kernel mode happens when a interuppt or exception occurs).Non premptability is ineffective in Multiprocessor systems as two kernel control paths executing in different CPU's can access the same data structure.The linux kernel was non preemptable until 2.5 devt series and 2.6 stable series.Now the kernel is preemptible.Preemptible kernels are more suited for time critical real time processes.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Interrupt disabling&lt;br /&gt;Another method to achieve synchronisation is to disable interuppts before entering a critical region and enabling it after the critical region.This is a simple solution but is not optimal as large critical region will have the Hardware interrupts freezed for a long time leading to a freeze.Also it is ineffective in multiprocessor systems as the data structure might be accessed by another process running in a different CPU.&lt;br /&gt;&lt;br /&gt;Locking mechanisms&lt;br /&gt;The locking mechanisms lock the corresponding global data structure in question thereby making sure that when the lock is acquired only a single process can access the data structure.They are effective both in Uniprocessor and Multiprocessor systems.The locking mechanisms used by the linux kernel are&lt;br /&gt;&lt;br /&gt;1.Semaphores&lt;br /&gt;2.Spin Locks&lt;br /&gt;&lt;br /&gt;Semaphores&lt;br /&gt;A semaphore is simply a counter associated with a data structure; the semaphore is checked by all kernel threads before they try to access the data structure.Each semaphore may be viewed as an object composed of: &lt;br /&gt;&lt;br /&gt;a)An integer variable &lt;br /&gt;b)A list of waiting processes&lt;br /&gt;c)Two atomic methods: down() and up()&lt;br /&gt;&lt;br /&gt;The down() method decrements the value of the semaphore. If the new value is less than 0, the method adds the running process to the semaphore list and then blocks (i.e., invokes the scheduler). The up() method increments the value of the semaphore and, if its new value is greater than or equal to 0, reactivates one or more processes in the semaphore list. Each data structure to be protected has its own semaphore, which is initialized to 1. When a kernel control path wishes to access the data structure, it executes the down() method on the proper semaphore. If the value of the new semaphore isn't negative, access to the data structure is granted. Otherwise, the process that is executing the kernel control path is added to the semaphore list and blocked. When another process executes the up() method on that semaphore, one of the processes in the semaphore list is allowed to proceed.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;SpinLocks&lt;br /&gt;In multiprocessor systems, semaphores are not always the best solution to the synchronization problems. Some kernel data structures should be protected from being concurrently accessed by kernel control paths that run on different CPUs. In this case, if the time required to update the data structure is short, a semaphore could be very inefficient. To check a semaphore, the kernel must insert a process in the semaphore list and then suspend it. Since both operations are relatively expensive, in the time it takes to complete them, the other kernel control path could have already released the semaphore. In these cases, multiprocessor operating systems make use of spin locks. A spin lock is very similar to a semaphore, but it has no process list: when a process finds the lock closed by another process, it "spins" around repeatedly, executing a tight instruction loop until the lock becomes open. Of course, spin locks are useless in a uniprocessor environment. When a kernel control path tries to access a locked data structure, it starts an endless loop. Therefore, the kernel control path that is updating the protected data structure would not have a chance to continue the execution and release the spin lock. The final result is that the system hangs.&lt;br /&gt;&lt;br /&gt;Signals and interprocess communication&lt;br /&gt;The linux kernel implements signals for communicating to the process an event.For example,SIGKILL is sent to the process if it receives a terminate signal.The linux kernel implements 32 different posix signals.User processes can communicate with each other with the help of SYS V IPCs like sahred memory,pipes,fifos,semaphores and message queues.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;VI.Memory Management in Linux&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Memory management is the most complex(as it is architecture dependant) and important activity in the kernel.Linux supports Virtual memory management.Linux uses paging to implement virtual memory concept.Linux does not use segmentation.Memory management will be described in detail in a later article.For now the advantages a virtual memory offers are&lt;br /&gt;&lt;br /&gt;1.Several processes can be executed concurrently. &lt;br /&gt;2.It is possible to run applications whose memory needs are larger than the available physical memory.&lt;br /&gt;3.Processes can execute a program whose code is only partially loaded in memory. &lt;br /&gt;4.Each process is allowed to access a subset of the available physical memory.&lt;br /&gt;5.Processes can share a single memory image of a library or program&lt;br /&gt;6.Programs can be relocatable, that is, they can be placed anywhere in physical memory.&lt;br /&gt;7.Programmers can write machine-independent code, since they do not need to be concerned about physical memory organization. &lt;br /&gt;&lt;br /&gt;RAM USAGE&lt;br /&gt;&lt;br /&gt;The usage of memory is one very important thing that has to be taken care by the kernel.Linux clearly distiguishes between Memory that is dedicated to the kernel and the memory that can be used by the processes.The static kernel image is loaded from the 1st Megabyte of the RAM and is pinned(it is not swapped or paged out).The remaining part of the RAM is used for &lt;br /&gt;&lt;br /&gt;1.Kernel dynamic structures&lt;br /&gt;2.Memory for processes&lt;br /&gt;3.Caches for disks etc., to get better performance.&lt;br /&gt;&lt;br /&gt;How the memory is allocated for above mentioned three is very important and hence requires a separate article.In a few words, linux takes care of memory allocation problems like internal fragmentation,external fragmentation etc., by making use of buddy system algorithm and slab allocation mechanism.So Linux uses a Slab allocator on top of a buddy system algorithm.&lt;br /&gt;&lt;br /&gt;Virtual Adddress space of a process&lt;br /&gt;Every process in linux has a virtual address space(ranging from 0 to 4GB on a 32 bit intel CPU).&lt;br /&gt;A address space of a process contains all the virtual memory addresses  that the process can reference.The kernel usually stores a process virtual address space as a list of memory area descriptors(for example memory area descriptors to the code segment,stack segment,heap etc.,).Linux uses demand paging ie., the page is  allocated after a page fault happens.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;VII.Device drivers&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Having discussed most of kernel subsystems namely filesystem,process,synchronisation,interrupts,syscalls,meory management lets see in a nutshell about device drivers.The kernel interacts with I/O devices with help of device drivers.Device drivers are included in the kernel and consist of data structures and functions that control one or more devices like hard disks,keyboard,mouse etc.,Each driver interacts with the remaining part of the kernel (even with other drivers) through a specific interface.There the device driver layer can be seen as the last layer in the kernel interacting with other layers through well defined interfaces.This approach helps programmers to write device specfic code in a separate module without knowing about the kernel source code as well as the internal architecture.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;VIII.Linux implementation of threads&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Finally,it is mandatory to discuss how linux supports multithreaded application programming as it does in a unique manner.In linux threads are also processes which share the address space of the processes.Linux implements threads as processes because the process creation time in linux is much faster compared to other OS'es.A thread is therefore a process which can scheduled independently of the main process and it shares the same address space.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;SUMMARY&lt;/span&gt;&lt;br /&gt;In summary a linux kernel is monolithic with modular support and consists of the following subsystems,&lt;br /&gt;1.Filesystem&lt;br /&gt;2.Process management&lt;br /&gt;3.Memory management&lt;br /&gt;4.Synchronisation&lt;br /&gt;5.System calls&lt;br /&gt;6.Device drivers&lt;br /&gt;7.Interrupts and exceptions&lt;br /&gt;8.Signals&lt;br /&gt;The explanation and deciphering of the different kernel layers will be explained in future articles contirbuted by the various members of the lkg_india.I also promise two more articles which will be extension of this article namely &lt;br /&gt;1.Linux kernel 2.6 features&lt;br /&gt;2.Linux VS other Operating systems and the future of linux&lt;br /&gt;&lt;br /&gt;Colophon&lt;br /&gt;&lt;br /&gt;Thanks for the time the reader has devoted to reading the article.Any suggesttion/critcisms are welcome and i kindly request the readers to post in a comment after reading the article.Negavtive feedbacks are very much welcome as they stabilise the article.&lt;br /&gt;&lt;br /&gt;By &lt;br /&gt;Saravanan S(sar.kernel@gmail.com)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8801463-109870772285935106?l=whatisthekernel.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://whatisthekernel.blogspot.com/feeds/109870772285935106/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8801463&amp;postID=109870772285935106' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8801463/posts/default/109870772285935106'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8801463/posts/default/109870772285935106'/><link rel='alternate' type='text/html' href='http://whatisthekernel.blogspot.com/2004/10/overview-of-linux-kernel.html' title='Overview of the linux kernel'/><author><name>Saravanan</name><uri>http://www.blogger.com/profile/12068489124788465950</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry></feed>
