Sunday, July 3, 2016

PXE/Kickstart guide

Outline of the steps

* Obtain installation media
* Create Kickstart config file
* Setup NFS server
* Obtain PXE bootloader
* Create PXE config file
* Setup TFTP server
* Setup DHCP server


Installation Media

I was installing CentOS 5.5/x86_64 during this process, so I downloaded the two DVD images via torrent onto my NFS server. My BitTorrent client created the directory CentOS-5.5-x86_64-bin-DVD with the files:
CentOS-5.5-x86_64-bin-DVD-1of2.iso  md5sum.txt      sha1sum.txt      sha256sum.txt
CentOS-5.5-x86_64-bin-DVD-2of2.iso  md5sum.txt.asc  sha1sum.txt.asc  sha256sum.txt.asc
I moved this directory to /share/images to make it available via NFS.

Next I mounted the first ISO file as a loop image and copied the initrd and kernel to my DHCP server:

$ sudo mount /share/images/CentOS-5.5-x86_64-bin-DVD/CentOS-5.5-x86_64-bin-DVD-1of2.iso /mnt/dvd/ -t iso9660 -o loop
$ scp /mnt/dvd/images/pxeboot/*i* root@dhcp-server:/tftpboot

Kickstart File

I created the directory /share/kickstart for Kickstart config files on my NFS server.

I created the Kickstart file (test64-ks) using a previous CentOS install as a basis, editing it with snippets I found scattered around the Web.

# Kickstart file automatically generated by anaconda.
# Modified substantially by chort

install
nfs --server 10.25.0.129 --dir /share/images/CentOS-5.5-x86_64-bin-DVD/
#url --url http://mirror.centos.org/centos/5.4/os/x86_64
lang en_US.UTF-8
keyboard us

# don't define more NICs than you have, the install will bomb if you do
network --device eth0 --onboot yes --bootproto static --ip 10.25.42.139 --netmask 255.255.0.0 --gateway 10.25.0.1 --nameserver 10.25.0.5
#network --device eth1 --onboot no --bootproto dhcp
#network --device eth2 --onboot no --bootproto dhcp
#network --device eth3 --onboot no --bootproto dhcp

# grab the hash from an account in /etc/shadow that has the password you want to use
rootpw --iscrypted $1$fi0JeZ1p$Il0CxFxe0jqpNnkrOqC.0.
firewall --enabled --port=22:tcp
authconfig --enableshadow --enablemd5
selinux --disabled
timezone --utc America/Los_Angeles

bootloader --location=mbr --driveorder=sda
# The following is the partition information you requested
# Note that any partitions you deleted are not expressed
# here so unless you clear all partitions first, this is
# not guaranteed to work
clearpart --all --drives=sda
# 100MB /boot partition
part /boot --fstype ext3 --size=100 --ondisk=sda
# everything else goes to LVM
part pv.4 --size=0 --grow --ondisk=sda
volgroup VolGroup00 --pesize=32768 pv.4
# 2GB swap fs
logvol swap --fstype swap --name=LogVol01 --vgname=VolGroup00 --size=2048
# 5GB / fs
logvol / --fstype ext3 --name=LogVol00 --vgname=VolGroup00 --size=5120
# 10GB + remaining space for /opt fs
logvol /opt --fstype ext3 --name=LogVol02 --vgname=VolGroup00 --size=10240 --grow

%packages
@base
@core
@dialup
@editors
@text-internet
keyutils
trousers
fipscheck
device-mapper-multipath
bind
bind-chroot
bind-devel
caching-nameserver
compat-libstdc++-33
compat-glibc
gdb
ltrace
ntp
OpenIPMI-tools
screen
sendmail-cf
strace
sysstat
-bluez-utils

%post
/usr/bin/yum -y update >> /root/post_install.log 2>&1
/sbin/chkconfig --del bluetooth
/sbin/chkconfig --del cups
/sbin/chkconfig ntpd on
/sbin/chkconfig named on
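
As a convenience, the Kickstart file can be syntax-checked before it is used. This is just a sketch, assuming the pykickstart package (which provides ksvalidator) is available on some machine; it is not required for the install itself:

$ sudo yum install pykickstart
$ ksvalidator /share/kickstart/test64-ks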

NFS Server

Make sure NFS is enabled:
$ for i in nfs nfslock portmap ; do sudo chkconfig --list $i ; done
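
If any of those show up as off, a quick sketch of enabling and starting them (service names as on CentOS/RHEL 5):

$ for i in portmap nfslock nfs ; do sudo chkconfig $i on ; sudo service $i start ; done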

Edit /etc/exports to enable access to the share for the machines that will PXE boot:

# sample /etc/exports file
#/               master(rw) trusty(rw,no_root_squash)
#/projects       proj*.local.domain(rw)
#/usr            *.local.domain(ro) @trusted(rw)
#/home/joe       pc001(rw,all_squash,anonuid=150,anongid=100)
#/pub            (ro,insecure,all_squash)

/share  *.bkeefer.se.example.com(ro,no_root_squash)

I restart the nfs service after editing /etc/exports:

$ sudo service nfs restart
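
To confirm the share is actually being exported, something like the following should list /share (running exportfs -ra also re-reads /etc/exports without a full restart):

$ sudo exportfs -v
$ showmount -e 10.25.0.129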

Bootloader

Next, on the DHCP server, I grabbed the PXE bootloader from the syslinux package. You should be able to install this through yum:
$ sudo yum install syslinux

Copy the bootloader to the TFTP server directory:

$ sudo cp /usr/lib/syslinux/pxelinux.0 /tftpboot

Create the pxelinux.cfg directory in /tftpboot and edit the default file:

# You can have multiple kernels, if so name each with its version
# This configuration only has one possible kernel so I didn't rename it
default linux
label linux
  kernel vmlinuz
  append ksdevice=eth0 load_ramdisk=1 initrd=initrd.img network ks=nfs:10.25.0.129:/share/kickstart/test64-ks
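
For reference, a minimal sketch of creating that directory and file on the TFTP server (the paths match the /tftpboot layout used above):

$ sudo mkdir /tftpboot/pxelinux.cfg
$ sudo vi /tftpboot/pxelinux.cfg/default

PXELINUX also looks for config files named after the client's UUID, MAC address (prefixed with 01-), or IP address in hex before falling back to default, which is handy if you later want per-host configs.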

TFTP Server

Configure the TFTP server by editing the /etc/xinetd.d/tftp file:
# default: off
# description: The tftp server serves files using the trivial file transfer \
# protocol.  The tftp protocol is often used to boot diskless \
# workstations, download configuration files to network-aware printers, \
# and to start the installation process for some operating systems.
service tftp
{
 socket_type  = dgram
 protocol  = udp
 wait   = yes
 user   = root
 server   = /usr/sbin/in.tftpd
 server_args  = -vvs /tftpboot
 disable   = no
 per_source  = 11
 cps   = 100 2
 flags   = IPv4
}
I changed "disable = yes" -> "disable = no" and "server_args = -s /tftpboot" -> "server_args = -vvs /tftpboot". xinetd probably doesn't need to be restarted, but I did any way:
$ sudo service xinetd restart
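
A quick sanity check that TFTP is actually serving files, assuming a tftp client is installed somewhere on the network (10.25.0.5 is the address the next-server directive points to in the DHCP config below):

$ tftp 10.25.0.5
tftp> get pxelinux.0
tftp> quit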

DHCP Server

I had only a single machine to boot, so I used a fixed IP based on the Ethernet address. Make sure you edit /var/lib/dhcp.lease* to erase any references to the MAC and restart dhcpd. Here's the /etc/dhcpd.conf:

shared-network SE-NET {

 subnet 10.25.42.0 netmask 255.255.255.0 {
  authoritative;
  allow booting;
  option routers   10.25.0.1;
  option subnet-mask  255.255.0.0;
  option domain-name  "bkeefer.se.example.com";
  option domain-name-servers 10.25.0.5;
  option time-offset  -28800;
  option ntp-servers  ntp.example.com;

  host test64 {
   hardware ethernet 00:0c:29:b3:81:99;
   fixed-address 10.25.42.139;
   next-server 10.25.0.5;
   filename "pxelinux.0";
  }
 }
}

I haven't had any luck with restarting dhcpd, so I do stop followed by start:

$ sudo service dhcpd stop && sudo service dhcpd start
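
While the client boots, I keep an eye on syslog on the DHCP/TFTP server; dhcpd and in.tftpd both log to /var/log/messages by default on CentOS 5, so you can see the DHCPDISCOVER/DHCPOFFER exchange and the TFTP file requests go by:

$ sudo tail -f /var/log/messages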

Note that there are also forward and reverse DNS entries to match 10.25.42.139 to test64.bkeefer.se.example.com.


Final Step

At this point you should be able to edit the BIOS for the machine you're booting to make sure the network card is in the boot order (as long as there's no OS installed, it should boot off the NIC no matter where it is in the order).

Thursday, June 9, 2016

Veritas Cluster Server (VCS) - Basics

Basics

What are the different service group types ?

Service groups can be one of three types :
1. Failover – Service group runs on one system at a time.
2. Parallel – Service group runs on multiple systems simultaneously (see the main.cf sketch below).
3. Hybrid – Used in replicated data clusters (disaster recovery setups). The SG behaves as Failover within the local cluster and as Parallel for the remote cluster.
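
As an illustration of the first two types, here is a minimal, hypothetical main.cf fragment (group and node names are made up); a failover group simply lists the systems it can run on, while a parallel group additionally sets Parallel = 1:

group failover_sg (
        SystemList = { node01 = 0, node02 = 1 }
        AutoStartList = { node01 }
        )

group parallel_sg (
        SystemList = { node01 = 0, node02 = 1 }
        Parallel = 1
        AutoStartList = { node01, node02 }
        )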

Where is the VCS main configuration file located ?

The main.cf file contains the configuration of the entire cluster and is located in the directory /etc/VRTSvcs/conf/config.

How to set VCS configuration file (main.cf) ro/rw ?

To set the configuration file in read-only/read-write :

# haconf -dump -makero     (Dumps in memory configuration to main.cf and makes it read-only)
# haconf -makerw           (Makes configuration writable)

Where is the VCS engine log file located ?

The VCS cluster engine log is located at /var/VRTSvcs/log/engine_A.log. We can either view this file directly or use the command line to view it :
# hamsg engine_A

How to check the complete status of the cluster

To check the status of the entire cluster :
# hastatus -sum

How to verify the syntax of the main.cf file

To verify the syntax of the main.cf file just mention the absolute directory path to the main.cf file :
# hacf -verify /etc/VRTSvcs/conf/config

What are the different resource types ?

1. Persistent : VCS can only monitor these resources; it cannot take them offline or bring them online.
2. On-Off : VCS can start and stop On-Off resources. Most resources fall in this category.
3. On-Only : VCS starts On-Only resources but does not stop them. An example would be the NFS daemon. VCS can start the NFS daemon if required, but cannot take it offline when the associated service group is taken offline.

Explain the steps involved in Offline VCS configuration

1. Save and close the configuration :
# haconf -dump -makero
2. Stop VCS on all nodes in the cluster :
# hastop -all
3. Edit the configuration file after taking the backup and do the changes :
# cp -p /etc/VRTSvcs/conf/config/main.cf /etc/VRTSvcs/conf/config/main.cf_17march
# vi /etc/VRTSvcs/conf/config/main.cf
4. Verify the configuration file syntax :
# hacf -verify /etc/VRTSvcs/conf/config/
5. Start VCS on the system with the modified main.cf file :
# hastart
6. Start VCS on the other nodes in the cluster.
Note : This can also be done by just stopping VCS and leaving the services running to minimize downtime (hastop -all -force).

GAB, LLT and HAD

What is GAB, LLT and HAD and whats their functionalities ?

GAB, LLT and HAD form the basic building blocks of VCS functionality.
LLT (low latency transport protocol) – LLT transmits the heartbeats over the interconnects. It is also used to distribute the inter system communication traffic equally among all the interconnects.
GAB (Group membership services and atomic broadcast) – The group membership service part of GAB maintains the overall cluster membership information by tracking the heartbeats sent over LLT interconnects. The atomic broadcast of cluster membership ensures that every node in the cluster has same information about every resource and service group in the cluster.
HAD (High Availability daemon) – the main VCS engine which manages the agents and service groups. It is in turn monitored by a daemon named hashadow.

What are the various GAB ports and their functionalities ?

a  -->     gab driver
b  -->     I/O fencing (to ensure data integrity)
d  -->     ODM (Oracle Disk Manager)
f  -->     CFS (Cluster File System)
h  -->     VCS (VERITAS Cluster Server: high availability daemon, HAD)
o  -->     VCSMM driver (kernel module needed for Oracle and VCS interface)
q  -->     QuickLog daemon
v  -->     CVM (Cluster Volume Manager)
w  -->     vxconfigd (module for cvm)

How to check the status of various GAB ports on the cluster nodes

To check the status of GAB ports on various nodes :
# gabconfig -a

What's the maximum number of LLT links (including high and low priority) a cluster can have ?

A cluster can have a maximum of 8 LLT links including high and low priority LLT links.

How to check the detailed status of LLT links ?

The command to check detailed LLT status is :
# lltstat -nvv

What are the various LLT configuration files and their function ?

LLT uses /etc/llttab to set the configuration of the LLT interconnects.
# cat /etc/llttab
set-node node01
set-cluster 02
link nxge1 /dev/nxge1 - ether - -
link nxge2 /dev/nxge2 - ether - -
link-lowpri nxge0 /dev/nxge0 - ether - -
Here, set-cluster -> unique cluster number assigned to the entire cluster [ can have a value ranging from 0 to (64k – 1) ]. It should be unique across the organization.
set-node -> a unique number assigned to each node in the cluster. Here the name node01 has a corresponding unique node number in the file /etc/llthosts. It can range from 0 to 31.
Another configuration file used by LLT is – /etc/llthosts. It has the cluster-wide unique node number and nodename as follows:
# cat /etc/llthosts
0 node01
1 node02
LLT has another optional configuration file : /etc/VRTSvcs/conf/sysname. It contains the short name for VCS to refer to, and can be used by VCS to remove the dependency on OS hostnames.
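For completeness, the sysname file simply holds the local node's VCS system name, one entry per node, for example on the first node :
# cat /etc/VRTSvcs/conf/sysname
node01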

What are various GAB configuration files and their function ?

The file /etc/gabtab contains the command to start the GAB.
# cat /etc/gabtab
/sbin/gabconfig -c -n 4
Here -n 4 -> number of nodes that must be communicating in order to start VCS.

How to start/stop GAB

The commands to start and stop GAB are :
# gabconfig -c        (start GAB)
# gabconfig -U        (stop GAB)

How to start/stop LLT

The commands to stop and start LLT are :
# lltconfig -c       -> start LLT
# lltconfig -U       -> stop LLT (GAB needs to be stopped first)

What is GAB seeding and why is manual GAB seeding required ?

The GAB configuration file /etc/gabtab defines the minimum number of nodes that must be communicating for the cluster to start. This is called GAB seeding.
In case we don't have a sufficient number of nodes to start VCS [ maybe due to a maintenance activity ], but have to start it anyway, we do what is called manual seeding by firing the below command on each of the nodes.
# gabconfig -c -x

How to start HAD or VCS ?

To start HAD or VCS on all nodes in the cluster, the hastart command needs to be run on all nodes individually.
# hastart

What are the various ways to stop HAD or VCS cluster ?

The command hastop gives various ways to stop the cluster.
# hastop -local
# hastop -local -evacuate
# hastop -local -force
# hastop -all -force
# hastop -all
-local -> Stops service groups and VCS engine [HAD] on the node where it is fired
-local -evacuate -> Migrates the service groups running on the node where it is fired to the other nodes and stops HAD on that node only
-local -force -> Stops HAD leaving services running on the node where it is fired
-all -force -> Stops HAD on all the nodes of cluster leaving the services running
-all -> Stops HAD on all nodes in cluster and takes service groups offline

Resource Operations

How to list all the resource dependencies

To list the resource dependencies :
# hares -dep

How to enable/disable a resource ?

# hares -modify [resource_name] Enabled 1      (To enable a resource)
# hares -modify [resource_name] Enabled 0        (To disable a resource)

How to list the parameters of a resource

To list all the parameters of a resource :
# hares -display [resource]

Service group operations

How to add a service group(a general method) ?

In general, to add a service group named SG with 2 nodes (node01 and node02) :
haconf -makerw
hagrp -add SG
hagrp -modify SG SystemList node01 0 node02 1
hagrp -modify SG AutoStartList node02
haconf -dump -makero
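
An empty service group is not of much use until resources are added to it. Below is a hedged sketch of adding a single IP resource to the same SG from the command line; the resource name, NIC and address are made up for illustration:

haconf -makerw
hares -add SG_ip IP SG
hares -modify SG_ip Device eth0
hares -modify SG_ip Address 10.25.42.150
hares -modify SG_ip NetMask 255.255.0.0
hares -modify SG_ip Enabled 1
haconf -dump -makero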

How to check the configuration of a service group – SG ?

To see the service group configuration :
# hagrp -display SG

How to bring service group online/offline ?

To online/offline the service group on a particular node :
# hagrp -online [service-group] -sys [node]      (Online the SG on a particular node)
# hagrp -offline [service-group] -sys [node]        (Offline the SG on particular node)
The -any option, when used instead of the node name, brings the SG online/offline based on the SG's failover policy.
# hagrp -online [service-group] -any
# hagrp -offline [service-group] -any

How to switch service groups ?

The command to switch the service group to target node :
# hagrp -switch [service-group] -to [target-node]

How to freeze/unfreeze a service group and what happens when you do so ?

When you freeze a service group, VCS continues to monitor the service group, but does not allow it or the resources under it to be taken offline or brought online. Failover is also disabled even when a resource faults. When you unfreeze the SG, it starts behaving in the normal way again.
To freeze/unfreeze a Service Group temporarily :
# hagrp -freeze [service-group]
# hagrp -unfreeze [service-group]
To freeze/unfreeze a Service Group persistently (across reboots) :
# hagrp -freeze [service-group] -persistent
# hagrp -unfreeze [service-group] -persistent

Communication failures : Jeopardy, split brain

What's a Jeopardy membership in VCS clusters ?

When a node in the cluster has only the last LLT link intact, the node forms a regular membership with other nodes with which it has more than one LLT link active and a Jeopardy membership with the node with which it has only one LLT link active.
[Diagram: jeopardy membership in a VCS cluster]
Effects of jeopardy : (considering example in diagram above)
1. Jeopardy membership formed only for node03
2. Regular membership between node01, node02, node03
3. Service groups SG01, SG02, SG03 continue to run and other cluster functions remain unaffected.
4. If node03 faults or its last link breaks, SG03 is not started on node01 or node02. This is done to avoid data corruption: if the last link breaks, node01 and node02 may think that node03 is down and try to start SG03 themselves, which may lead to data corruption as the same service group may be online on 2 systems.
5. Failover due to resource fault or operator request would still work.

How to recover from a jeopardy membership ?

To recover from jeopardy, just fix the failed link(s); GAB automatically detects the restored link(s) and the jeopardy membership is removed from the node.

Whats a split brain condition ?

Split brain occurs when all the LLT links fail simultaneously. The systems in the cluster then fail to distinguish a system failure from an interconnect failure. Each mini-cluster thus formed thinks that it is the only cluster that's active at the moment and tries to start the service groups of the other mini-cluster, which it thinks is down. The same happens on the other mini-cluster, and this may lead to simultaneous access to the storage and cause data corruption.

What is I/O fencing and how it prevents split brain ?

VCS implements an I/O fencing mechanism to avoid a possible split-brain condition. It ensures data integrity and data protection. The I/O fencing driver uses SCSI-3 PGR (persistent group reservations) to fence off the data in case of a possible split-brain scenario.
[Diagram: I/O fencing in VCS]
In case of a possible split brain, as shown in the figure above, assume that node01 has key “A” and node02 has key “B”.
1. Both nodes think that the other node has failed and start racing to write their keys to the coordinator disks.
2. node01 manages to write the key to majority of disks i.e. 2 disks
3. node02 panics
4. node01 now has a perfect membership and hence Service groups from node02 can be started on node01

What's the difference between MultiNICA and MultiNICB resource types ?

MultiNICA and IPMultiNIC
– supports active/passive configuration.
– Requires only 1 base IP (test IP).
– Does not require to have all IPs in the same subnet.
MultiNICB and IPMultiNICB
– supports active/active configuration.
– Faster failover than the MultiNICA.
– Requires IP address for each interface.

Troubleshooting

How to flush a service group and when its required ?

Flushing a service group is required when the agents for the resources in the service group seem suspended, waiting for resources to be taken online/offline. Flushing a service group clears any internal wait states and stops VCS from attempting to bring the resources online.
To flush the service group SG on the cluster node, node01 :
# hagrp -flush [SG] -sys node01

How to clear resource faults ?

To clear a resource fault, we first have to fix the underlying problem.
1. For persistent resources :
Do not do anything and wait for the next OfflineMonitorInterval (default: 300 seconds) for the resource to become online.
2. For non-persistent resources :
Clear the fault and probe the resource on node01 :
# hares -clear [resource_name] -sys node01
# hares -probe [resource_name] -sys node01

How to clear resources with ADMIN_WAIT state ?

If the ManageFaults attribute of a service group is set to NONE, VCS does not take any automatic action when it detects a resource fault. VCS places the resource into the ADMIN_WAIT state and waits for administrative intervention.
1. To clear the resource in ADMIN_WAIT state without faulting service group :
# hares -probe [resource] -sys node01
2. To clear the resource in ADMIN_WAIT state by changing the status to OFFLINE|FAULTED :
# hagrp -clearadminwait -fault [SG] -sys node01

Setting up VERITAS Cluster I/O Fencing

Steps to configure I/O fencing

##### Using the installer script ######

1. Initialize disks for I/O fencing
The minimum number of disks required to configure I/O fencing is 3, and the number of fencing disks should always be an odd number. We'll be using 3 disks of around 500 MB each as we have a 2-node cluster. Initialize the disks to be used for the fencing disk group. We can also test whether the disks are SCSI3 PGR compliant by using the vxfentsthdw command.
# vxdisk -eo alldgs list
# vxdisksetup -i disk01
# vxdisksetup -i disk02
# vxdisksetup -i disk03
2. Run the installvcs script from the install media with fencing option
# cd /cdrom/VRTS/install
# ./installvcs -fencing
Cluster information verification: Cluster Name: geekdiary
Cluster ID Number: 3
Systems: node01 node02
Would you like to configure I/O fencing on the cluster? [y,n,q] y
3. Select disk based fencing
We will be doing disk based fencing rather than server based fencing, also called CP (coordination point) client based fencing.
Fencing configuration
1) Configure CP client based fencing 2) Configure disk based fencing
3) Configure fencing in disabled mode
Select the fencing mechanism to be configured in this Application Cluster:[1-3,q] 2
4. Create new disk group
You can create a new disk group or use an existing disk group for fencing. We will be using a new fencing DG, which is the preferred way of doing it.
Since you have selected to configure disk based fencing, you will be asked either to specify the disk group to be used as coordinator or to create a new disk group, and the fencing mechanism to be used.
Select one of the options below for fencing disk group: 1) Create a new disk group
2) Using an existing disk group
3) Back to previous menu
Press the choice for a disk group: [1-2,b,q] 1
5. Select disks to be used for the fencing DG
Select the disks which we initialized in step 1 to create our new disk group.
List of available disks to create a new disk group 1) 
     2) disk01
     3) disk02
     4) disk03
     ...
     b) Back to previous menu
Select an odd number of disks and at least three disks to form a disk group.
Enter the disk options, separated by spaces:
[1-4,b,q] 1 2 3
6. Enter the fencing disk group name, fendg
enter the new disk group name: [b] fendg
7. Select the fencing mechanism : raw/dmp(dynamic multipathing)
Enter fencing mechanism name (raw/dmp): [b,q,?] dmp
8. Confirm configuration and warnings
I/O fencing configuration verification Disk Group: fendg
Fencing mechanism: dmp
Is this information correct? [y,n,q] (y) y
Installer will stop VCS before applying fencing configuration. To make sure VCS shuts down successfully, unfreeze any frozen service groups in the cluster.
Are you ready to stop VCS on all nodes at this time? [y,n,q] (n) y

##### Using Command line ######

1. Initialize disks for I/O fencing
The first step is the same as in the above method. We'll initialize 3 disks of 500 MB each on one node :
# vxdisk -eo alldgs list
# vxdisksetup -i disk01
# vxdisksetup -i disk02
# vxdisksetup -i disk03
2. Create the fencing disk group fendg
# vxdg -o coordinator=on init fendg disk01
# vxdg -g fendg adddisk disk02
# vxdg -g fendg adddisk disk03
3. Create the /etc/vxfendg file
# vxdg deport fendg
# vxdg -t import fendg
# vxdg deport fendg
# echo "fendg" > /etc/vxfendg  (on both nodes)
4. Enabling fencing
# haconf -dump -makero
# hastop -all
# /etc/init.d/vxfen stop
# vi /etc/VRTSvcs/conf/config/main.cf    ( add SCSI3 entry )
cluster geekdiary (
UserNames = { admin = "ass76asishmHajsh9S." }
Administrators = { admin }
HacliUserLevel = COMMANDROOT
CounterInterval = 5
UseFence = SCSI3
)
5. Verify the configuration file syntax
# hacf -verify /etc/VRTSvcs/conf/config
6. Create the /etc/vxfenmode file
# cp /etc/vxfen.d/vxfenmode_scsi3_dmp /etc/vxfenmode    (if you are using dmp )
7. Start fencing
# /etc/init.d/vxfen start
# /opt/VRTS/bin/hastart

Testing the fencing configuration

1. Check status of fencing
# vxfenadm -d
Fencing Protocol Version: 201
 Fencing Mode: SCSI3
 Fencing Mechanism: dmp
 Cluster Members:
        * 0 (node01)
          1 (node02)
RSM State Information
     node 0 in state 8 (running)
     node 1 in state 8 (running)
2. Check GAB port “b” status
# gabconfig -a
GAB Port Memberships 
==================================
Port a gen 24ec03 membership 01
Port b gen 24ec06 membership 01
Port h gen 24ec09 membership 01
3. Check for configuration files
# grep SCSI3 /etc/VRTSvcs/conf/config/main.cf
UseFence = SCSI3
# cat /etc/vxfenmode
...
vxfen_mode=scsi3
...
scsi3_disk_policy=dmp
# cat /etc/vxfendg
fendg
# cat /etc/vxfentab
...
/dev/vx/rdmp/emc_dsc01
/dev/vx/rdmp/emc_dsc02
/dev/vx/rdmp/emc_dsc03
4. Check for SCSI reservation keys on all the coordinator disks
In my case I have 2 nodes and 2 paths per disk, so I should be able to see 4 keys per disk (one key per path for each node) in the output of the below command.
# vxfenadm -s all -f /etc/vxfentab
Reading SCSI Registration Keys...
Device Name: /dev/vx/rdmp/emc_dsc01 Total Number Of Keys: 4
key[0]:
[Numeric Format]: 32,74,92,78,21,28,12,65
        [Character Format]: VF000701
* [Node Format]: Cluster ID: 5 Node ID: 1 Node Name: node02
key[1]:
[Numeric Format]: 32,74,92,78,21,28,12,65 [Character Format]: VF000701
* [Node Format]: Cluster ID: 5 Node ID: 1 Node Name: node02
key[2]:
[Numeric Format]: 32,74,92,78,21,28,12,66 [Character Format]: VF000700
* [Node Format]: Cluster ID: 5 Node ID: 0 Node Name: node01
key[3]:
[Numeric Format]: 32,74,92,78,21,28,12,66 [Character Format]: VF000700
* [Node Format]: Cluster ID: 5 Node ID: 0 Node Name: node01

Monday, August 3, 2015

Linux Booting Procedure


The stages involved in Linux Booting Process are:
BIOS
Boot Loader

    - MBR
    - GRUB
Kernel
Init
Runlevel scripts

BIOS
·         This is the first thing which loads once you power on your machine.
·         When you press the power button of the machine, CPU looks out into ROM for further instruction.
·         The ROM contains a JUMP instruction which tells the CPU to bring up the BIOS
·         BIOS determines all the list of bootable devices available in the system.
·         Prompts to select bootable device which can be Hard Disk, CD/DVD-ROM, Floppy Drive, USB Flash Memory Stick etc (optional)
·         The system then tries to boot from the Hard Disk, where the MBR contains the primary boot loader.

Boot Loader 
To be very brief this phase includes loading of the boot loader (MBR and GRUB/LILO) into memory to bring up the kernel.

MBR (Master Boot Record)
·         It is the first sector of the Hard Disk with a size of 512 bytes.
·         The first 446 bytes hold the primary boot loader, the next 64 bytes the partition table, and the final 2 bytes the boot signature used for MBR validation.
NOTE: The MBR cannot load the kernel directly as it is unaware of the filesystem concept; it requires a boot loader with a file system driver for each supported file system, so that the file systems can be understood and accessed by the boot loader itself.

To overcome this situation GRUB is used, with the filesystem details in /boot/grub/grub.conf and its own file system drivers.

GRUB (GRand Unified Boot loader)

This loads the kernel in 3 stages

GRUB stage 1: 
·         The primary boot loader takes up less than 512 bytes of disk space in the MBR - too small a space to contain the instructions necessary to load a complex operating system. 
·         Instead the primary boot loader performs the function of loading either the stage 1.5 or stage 2 boot loader.
GRUB Stage 1.5: 
·         Stage 1 can load the stage 2 directly, but it is normally set up to load the stage 1.5. 
·         This can happen when the /boot partition is situated beyond the first 1024 cylinders of the hard drive.
·         GRUB Stage 1.5 is located in the first 30 KB of Hard Disk immediately after MBR and before the first partition.
·         This space is utilized to store file system drivers and modules.
·         This enables stage 1.5 to load stage 2 from any known location on the file system, i.e. /boot/grub

GRUB Stage 2:
·         This is responsible for loading the kernel and initrd listed in /boot/grub/grub.conf, along with any other modules needed
·         Loads a GUI interface, i.e. the splash image located at /grub/splash.xpm.gz, with the list of available kernels; you can manually select a kernel, or the default kernel boots after the timeout expires
The actual file is /boot/grub/grub.conf, of which /etc/grub.conf is a symlink

Sample /boot/grub/grub.conf
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title Red Hat Enterprise Linux Server (2.6.18-194.26.1.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-194.26.1.el5 ro root=/dev/VolGroup00/root clocksource=acpi_pm divisor=10
        initrd /initrd-2.6.18-194.26.1.el5.img
title Red Hat Enterprise Linux Server (2.6.18-194.11.4.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-194.11.4.el5 ro root=/dev/VolGroup00/root clocksource=acpi_pm divisor=10
        initrd /initrd-2.6.18-194.11.4.el5.img
title Red Hat Enterprise Linux Server (2.6.18-194.11.3.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-194.11.3.el5 ro root=/dev/VolGroup00/root clocksource=acpi_pm divisor=10
        initrd /initrd-2.6.18-194.11.3.el5.img



Kernel
This can be considered the heart of operating system responsible for handling all system processes.

Kernel is loaded in the following stages:
1.    As soon as the kernel is loaded, it configures the hardware and the memory allocated to the system.
2.    Next it uncompresses and mounts the initrd image (a gzip-compressed archive; zImage/bzImage are the compressed formats of the kernel image itself) and loads all the necessary drivers from it (see the example after this list).
3.    Loading and unloading of kernel modules is done with the help of programs like insmod and rmmod present in the initrd image.
4.    It looks for the hard disk type, be it LVM or RAID.
5.    Unmounts initrd image and frees up all the memory occupied by the disk image.
6.    Then kernel mounts the root partition as specified in grub.conf as read-only.
7.    Next it runs the init process
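
If you want to see what the initrd actually contains, on RHEL/CentOS 5 it is a gzip-compressed cpio archive, so it can be unpacked into a scratch directory (a sketch; the image name depends on the kernel version). The init file inside it is the nash script that loads the drivers and mounts the root filesystem:

# mkdir /tmp/initrd ; cd /tmp/initrd
# zcat /boot/initrd-$(uname -r).img | cpio -idmv
# cat init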

Init Process
·         Boots the system into the default runlevel specified in /etc/inittab
Sample output defining the default boot runlevel inside /etc/inittab
# Default runlevel. The runlevels used by RHS are:
#   0 - halt (Do NOT set initdefault to this)
#   1 - Single user mode
#   2 - Multiuser, without NFS (The same as 3, if you do not have networking)
#   3 - Full multiuser mode
#   4 - unused
#   5 - X11
#   6 - reboot (Do NOT set initdefault to this)
#
id:5:initdefault:
As per the above output, the system will boot into runlevel 5

You can check current runlevel details of your system using below command on the terminal
# who -r
         run-level 3  Jan 28 23:29                   last=S
·         Next, as per the fstab entries, file system integrity is checked and the root partition is re-mounted read-write (earlier it was mounted read-only).

Runlevel scripts
A number of runlevel scripts are defined inside /etc/rc.d/rcX.d
Runlevel  Directory
0 /etc/rc.d/rc0.d
1 /etc/rc.d/rc1.d
2 /etc/rc.d/rc2.d
3 /etc/rc.d/rc3.d
4 /etc/rc.d/rc4.d
5 /etc/rc.d/rc5.d
6 /etc/rc.d/rc6.d
·         Based on the selected runlevel, the init process then executes startup scripts located in subdirectories of the /etc/rc.d directory.
·         Scripts used for runlevels 0 to 6 are located in the subdirectories /etc/rc.d/rc0.d through /etc/rc.d/rc6.d, respectively (an example listing follows below).


·         Lastly, init runs whatever it finds in /etc/rc.d/rc.local (regardless of run level). rc.local is rather special in that it is executed every time you change run levels.
NOTE: rc.local is not used in all distros, for example Debian.
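
Each entry in these rcX.d directories is a symlink to a script in /etc/rc.d/init.d; links whose names start with K are stopped and links whose names start with S are started when entering that runlevel, in numeric order. You can see the links for a runlevel, and which runlevels a particular service is enabled for, with:

# ls -l /etc/rc.d/rc3.d/
# chkconfig --list sshd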

Next if everything goes fine you should be able to see the Login Screen on your system.