AIX: Problem Determination and Resolution

1)ping
-->ping
Determines the status of the network and of remote hosts. Used for:
-Tracking and isolating H/W & S/W problems
-Testing, measuring & managing networks

+Display the route buffer on the returned packets
-->ping -R server2

If you cannot reach other computers on the same subnetwork with ping, look for problems in your system's network configuration using arp & ifconfig.

2)arp
Displays and modifies the Internet-to-physical address (MAC address) translation tables used by ARP (Address Resolution Protocol). The arp command displays the current ARP entry for the host specified by the Hostname variable.
-->ping 9.3.5.193
No response
-->ping 9.3.5.196
Response
-->arp -a | grep 9.3.5.19
9.3.5.193 = no MAC address
9.3.5.196 = MAC 0:2:55:A8:00:dd
If a host has no MAC address in the ARP table, check its cable connections and H/W.
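
The check above can be sketched as a small filter. This is a minimal sketch, assuming that arp -a prints the address in parentheses and marks unresolved entries with the word (incomplete); the function name list_unresolved is hypothetical and not part of AIX.

```shell
# list_unresolved: print the IP of every ARP entry that has no MAC address.
# Assumption: `arp -a` prints unresolved entries like
#   ? (9.3.5.193) at (incomplete)
list_unresolved() {
    awk '/\(incomplete\)/ {
        ip = $2
        gsub(/[()]/, "", ip)   # strip the parentheses around the address
        print ip
    }'
}
```

It reads arp -a output on standard input, for example: arp -a | list_unresolved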

3)ifconfig
-->ifconfig -a -d
Shows only those interfaces that are down.
If an interface is down and you have problems reaching the subnet on which the interface is configured, run

-->errpt
to check whether any errors have been reported for the interface (for example, a duplicate IP address in the network)

-->diag
Runs diagnostics on the interface.
If the interfaces have no problems and are in the active state, but your system still cannot reach computers on the same subnetwork, check that the interface's subnet mask is correct.
For example, to change the subnet mask to 255.255.255.252 for the en1 interface
-->ifconfig en1 netmask 255.255.255.252 up
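
To see why a wrong mask breaks reachability, the comparison can be done by hand. The sketch below is illustrative only: the function names ip_to_int and same_subnet are hypothetical, and it assumes a shell with POSIX arithmetic expansion.

```shell
# ip_to_int: convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address on dots into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# same_subnet: succeed if two addresses share a network under the given mask.
# $1, $2: IPv4 addresses, $3: subnet mask
same_subnet() {
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    m=$(ip_to_int "$3")
    [ $(( a & m )) -eq $(( b & m )) ]
}
```

With the mask 255.255.255.0 the hosts 9.3.5.193 and 9.3.5.196 are on the same subnet; with 255.255.255.252 they are not, so a mask that is too narrow makes a local host look remote.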

4)traceroute (it creates load on the system, so do not use it on a production server)
-->traceroute
Traces the route of an IP packet. Used for network testing, measurement and management, primarily for manual fault isolation.

A-2) H/W problems
1)errpt
Generates an error report from entries in the error log, but does not perform error log analysis.
For analysis, use
-->diag
For a detailed report of each entry, use
-->errpt -a

+class -General source of the error
H-H/W
S-S/W
O-informational messages
U-Undetermined

+Type - Severity of the error that has occurred.
PEND-The loss of availability of a device or component is imminent
PERF-The performance of the device has degraded to below an acceptable level.
PERM-A condition that could not be recovered from.
-severe errors, defective H/W,S/W module
TEMP-A condition that was recovered from after a number of unusual attempts.
UNKN-not possible to determine the severity of the error.
INFO-Error log entry is informational and was not the result of an error.

+Resource Name-Name of the resource that has detected the error.
Location code-Path to the device: drawer, slot, connector, port
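
The class column can be used to pull the hardware entries out of the summary report. A minimal sketch, assuming the summary layout IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION with one header line; the function name hw_resources is hypothetical.

```shell
# hw_resources: list the resources that logged class H (hardware) errors.
# Assumption: summary lines look like
#   IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
# so the class is field 4 and the resource name is field 5.
hw_resources() {
    awk 'NR > 1 && $4 == "H" { print $5 }' | sort -u
}
```

It reads the summary on standard input, for example: errpt | hw_resources. errpt can also filter by itself: the -d flag selects classes and the -T flag selects types, so errpt -d H -T PERM should narrow the report to permanent hardware errors.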

2)diag
diag uses the error log to diagnose H/W problems.
The system deletes -H/W entries older than 90 days
                   -S/W entries older than 30 days
-->diag
Diagnostic Routines - System Verification - Problem Determination.

B)Reasons to monitor root mail
1)mail
-->mail
Most of the processes send a mail to the root account with detailed information
-->diagela
Diagnostic Automatic Error log Analysis
Provides the capability to do error log analysis whenever a permanent H/W error is logged.
It sends a message to your console and to all system groups. The message contains an SRN or a corrective action. diagela is enabled by default at BOS installation time.

2)crontab
cron mails the output of jobs in root's crontab to root

3)Other software packages, especially security-related ones, have the ability to specify the administrator.
For example, in case of a security breach, an illegal file permission change, or unauthorized passwd-file access, the system administrator receives a message.

C)System dump facility
System generates a system dump when a severe error occurs. System dumps can also be user-initiated by users with root user authority.
A dump creates a picture of your system's memory contents. Sysadmins and programmers can generate a dump & analyze its contents when debugging new applications.

a)Configuring a dump device
At installation time, the primary dump device is created (/dev/hd6 by default). The secondary dump device is /dev/sysdumpnull.
If your system has 4 GB or more of memory, the default dump device is /dev/lg_dumplv, which is a dedicated dump device.
A primary dump device is a dedicated dump device, while a secondary dump device is a shared dump device.
The dump device can be configured to either tape or a logical volume on the hard disk to store the system dump.
+To list the current dump destination
-->sysdumpdev -l
+Change primary dump device from /dev/hd6 to logical volume /dev/dumpdev
-->sysdumpdev -P -p /dev/dumpdev

+Info about previous dump
-->sysdumpdev -L

+Minimum size for the dump space can be determined by
-->sysdumpdev -e

+increase size of dump device
-->extendlv
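
Whether the dump device needs extending comes down to comparing the estimate with the size of the logical volume. A minimal sketch, assuming that sysdumpdev -e prints a line ending in the estimated size in bytes; the function name dump_fits is hypothetical, and the number of PPs and the PP size would be read from lslv for the dump logical volume.

```shell
# dump_fits: succeed if the estimated dump fits in the dump logical volume.
# stdin: output of `sysdumpdev -e`, assumed to look like
#   0453-041 Estimated dump size in bytes: 36909875
# $1: number of physical partitions in the LV, $2: PP size in megabytes
dump_fits() {
    est=$(awk '/[Ee]stimated dump size/ { print $NF }')
    [ -n "$est" ] || return 2      # no estimate found in the input
    lv_bytes=$(( $1 * $2 * 1024 * 1024 ))
    [ "$est" -le "$lv_bytes" ]
}
```

For example: sysdumpdev -e | dump_fits 3 16. If the check fails, extendlv can add the missing physical partitions.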

1+ Start a system dump
A dump can be system-initiated or user-initiated. If your system stops with the number 888 flashing in the operator panel display, the system has generated a dump and saved it to the primary dump device.

2+Understanding 888 error messages
It means either a H/W or S/W problem has been detected and a diagnostic message is ready to be read.
Record info contained in the 888 sequence message,
-888
-102-unexpected system halt
-mmm-cause of halt-crash code h/w,s/w
-ddd-Dump Status-Dump code
-888
When the system dump completes, the system either halts or reboots, depending on the setting of the autorestart attribute of sys0
-->lsattr -El sys0 -a autorestart
If autorestart is true, the system automatically reboots after a crash.
To change this setting
-->chdev -l sys0 -a autorestart=false
sys0 changed
-->lsattr -El sys0 -a autorestart

+ User initiated dump
-->sysdumpstart -p
write dump to the primary device
-->sysdumpstart -s
to secondary dump device

3+Copy a system dump
-->pax
Allows you to copy, create and modify files that are greater than 2 GB in size, such as system dumps, from one location to another. This is useful in migrating dumps, as the tar & cpio commands cannot handle files that are larger than 2 GB in size. pax can also view and modify files in the tar and cpio formats.
To view the contents of the tar file /tmp/test.tar
-->pax -vf /tmp/test.tar
To create a pax command archive on tape that contains two files
-->pax -x pax -wvf /dev/rmt0 /var/adm/ras/cfglog /var/adm/ras/nimlog

To untar the tar file /tmp/test.tar to the current directory
-->pax -rvf /tmp/test.tar

To copy the file run.pax to the /tmp directory
-->pax -rw run.pax /tmp

4+snap
Used to gather configuration information of the system. It is a method of sending lslpp & errpt output to your service center, for diagnosing problems.
Default directory for the output from the snap command
-->/tmp/ibmsupt
8MB of temporary disk space is required when executing snap.
To copy general system information, including file system, kernel parameters and dump information, to rmt0
-->/usr/sbin/snap -gfkD -o /dev/rmt0
You can also copy a test case of the problem into the /tmp/ibmsupt directory before running snap -o.

5+ Analysing system dumps
kdb -allows you to examine a system dump or running kernel

D) alog
--/var/adm/ras/bootlog
Boot log contains info generated by cfgmgr & rc.boot
To change the size of the boot log
--> echo " boot log resizing " | alog -t boot -s 8192

Display the bootlog
--> alog -t boot -o | more

E) Determine Appropriate actions for user problems -commands
1)usrck
Verifies the correctness of the user definitions in the user database files by checking the definitions for ALL the users or for the users specified by the User parameter.
This command checks
1>/etc/passwd
Duplicate names are reported and removed. Duplicate IDs are reported but not fixed.
An entry with fewer than six colon-separated fields is reported.
2>/etc/passwd - /etc/security/user, /etc/security/limits.
usrck verifies that each user name listed in the /etc/passwd file has a stanza in the /etc/security/user file. It also verifies that each group name listed in /etc/group has a stanza in the /etc/security/group file.
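
The /etc/passwd checks described above can be imitated in a few lines of awk. This is only a report-only sketch and not how usrck is implemented; the function name passwd_report is hypothetical.

```shell
# passwd_report: flag short entries, duplicate names and duplicate IDs.
# Reads passwd-format lines on standard input; it only reports, never edits.
passwd_report() {
    awk -F: '
        NF < 6     { print "short entry: " $0; next }
        $1 in name { print "duplicate name: " $1 }
        $3 in uid  { print "duplicate ID: " $3 " (" $1 ")" }
                   { name[$1] = 1; uid[$3] = 1 }   # remember what was seen
    '
}
```

For example: passwd_report < /etc/passwd. Unlike usrck -y, the sketch never removes a duplicate name; it only points at the second occurrence.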

To verify that all the users exist in the user database, and have any errors reported (but not fixed)
-->usrck -n ALL

To delete, from the user definitions, those users who are not in the user database files, and have any errors reported
-->usrck -y ALL
(-y fixes and reports errors)

2) grpck
Verifies the correctness of the group definitions in the user database files by checking the definitions for all the groups or for the groups specified by the Group parameter.
To verify that all the group members and admins exist in the user database ,and have any errors reported (but not fixed)
-->grpck -n ALL

To verify that all group members and admins exist in the user database & to have errors fixed, but not reported
-->grpck -p ALL

To verify the uniqueness of the group name & group ID defined for the abc group
-->grpck -n abc
Reports errors but does not correct them.
-->grpck -t abc
Reports errors and asks interactively whether to fix them.
-->grpck -y abc
Fixes errors and reports them.

3)pwdck
Verifies the correctness of password information.
To verify that all local users have valid passwords
-->pwdck -y ALL
This reports errors and fixes them.

Ensure that user joey has a valid stanza in /etc/security/passwd
-->pwdck -y joey
fixes errors and reports them

4)sysck
Checks file definitions against the files extracted from the installation and update media, and updates the SWVPD (Software Vital Product Data) database.
Used during installation & update of S/W products.
sysck updates the file name, product name, type, checksum and size of each file in the SWVPD database.
A product that uses the installp command to install has an inventory file in its image.
To add the definitions to the inventory database and check permissions, links and checksums
-->sysck -i -f smart.rte.inventory smart.rte

To remove any links to files for a product that has been removed from the system and remove the files from the inventory database
-->sysck -u -f smart.rte.inventory smart.rte

5)lsgroup & lsuser
-->lsgroup -f ALL >> /tmp/check
-->lsuser -f ALL >> /tmp/check
Appends the output to the file /tmp/check
-->lsuser joey
Used by root to display the attributes of a specific user

6)The user limits
The /etc/security/limits file specifies the process resource limits for each user.
-->mkuser
-->chuser
-->lsuser
-->rmuser

F)Identifying H/W problems
a)Replacing hot plug devices
-->lsslot -c pci
Displays the number, location and capabilities of hot plug PCI slots.
Before replacing a hot plug adapter or disk, you should unconfigure all other devices or interfaces that are dependent on the physical device you want to remove.
-->lsdev -C | grep sis
device in available state
The Hot Plug Task can be started with either SMIT or diagnostic (DIAG) tools menu.
-->diag
-Task Selection (Diagnostics, Advanced Diagnostics, Service Aids)
-Hot Plug Task -PCI HOT PLUG MANAGER
        -RAID HOT PLUG DEVICES
        -SCSI & SCSI RAID HOT PLUG MANAGER
-PCI HOT PLUG MANAGER
-Unconfigure a device-Device name -ent2
Go back to
-PCI HOT PLUG MANAGER MENU
-Replace/remove a PCI HOT PLUG Adapter.
After this option has been selected, the PCI slot will be put into a state that allows the PCI adapter to be removed.
A blinking attention light will identify the slot that contains the adapter that has been selected for replacement.
Change the adapter now.
Run cfgmgr to configure the new device.
Configure IP
-->smitty chinet
A repair action should be logged in the AIX error report against the ent2 device. This shows others that the error logged in the error report has been resolved.
To enter a repair action: diag - Task Selection - Log Repair Action - ent2 device.

G)Failed disk replacement
Reasons to replace a disk
-failed
-report i/o errors and you want to replace it.
-does not satisfy /meet your requirements.

Scenario1
If the disk you are going to replace is mirrored, then
1.Remove copies of all logical volumes that were residing on that disk using rmlvcopy or unmirrorvg
2.Remove the disk from the vg using reducevg
3.Remove the disk definition using rmdev
4.Physically remove the disk. If the disk is not hot-swappable, you may be required to reboot the system.
5.Make the replacement disk available. If the disk is hot-swappable, you can run cfgmgr; otherwise you need to reboot the system.
6.Include the newly added disk in the vg using extendvg.
7.Recreate & synchronize the copies for all lv using mklvcopy or mirrorvg.
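
The steps of Scenario 1 can be written down as a dry-run plan before touching the system. The sketch below only prints the commands and does not run them; the function name plan_mirrored_replace is hypothetical, and it assumes that the replacement disk is configured under the same hdisk name as the old one.

```shell
# plan_mirrored_replace: print (not run) the commands for Scenario 1.
# $1: volume group, $2: disk to replace.
# Assumption: the new disk comes back under the same hdisk name.
plan_mirrored_replace() {
    vg=$1
    disk=$2
    echo "unmirrorvg $vg $disk"
    echo "reducevg $vg $disk"
    echo "rmdev -dl $disk"
    echo "# physically replace the disk now (reboot if not hot-swappable)"
    echo "cfgmgr"
    echo "extendvg $vg $disk"
    echo "mirrorvg $vg $disk"
}
```

For example: plan_mirrored_replace datavg hdisk3. Reviewing the printed plan first makes it easier to catch a wrong disk name before reducevg or rmdev is run.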

Scenario2
If the disk you are going to replace is not mirrored and is still functional, then
1.Make the replacement disk available.
If the disk is hot-swappable, you can run cfgmgr; otherwise a reboot is required.
2.Include the newly added disk in the vg using extendvg
3.Migrate all partitions from the failing disk to the new disk using migratepv or migratelp.
If the disks are part of rootvg, consider the following
-If the old disk contains a copy of the BLV, you have to clear it using --> chpv -c hdiskn
-A new BLV must be created on the new disk using bosboot
-The boot list must be updated using bootlist.
-If the old disk contains a paging space or a primary dump device, you should disable them. After the migratepv command completes, you should reactivate them.
4.Remove the old disk from the vg using reducevg
5.Remove the old disk definition using rmdev
Scenario3
If the disk is not mirrored, has failed completely and there are other disks available in the vg, then
1.Identify all logical volumes that have at least one partition located on the failed disk
2.Close the lv and unmount all corresponding fs
3.Remove the file systems & logical volumes using rmfs
4.Remove the failing disk from the vg using reducevg
5.Remove the disk definition using rmdev
6.Physically remove the disk. If it is not hot-swappable, a reboot is required.
7.Make the replacement disk available. If it is hot-swappable, run cfgmgr; if not, a reboot is required.
8.Include the new disk in the vg using extendvg
9.Recreate all lv and fs using mklv and crfs.
10.If you have a backup of your data, restore your data from the backup.

Scenario4
If the disk is not mirrored, has failed completely, no other disks are available in the vg (the vg has only one disk or all pv failed simultaneously) & the vg is not rootvg, then
1-Export the vg definition from the system using exportvg
2-Ensure that /etc/filesystems does not contain any incorrect stanzas
3-Remove the disk definitions using the rmdev command
4-Physically remove the disk (cfgmgr or reboot)
5-Make the new replacement disk available (cfgmgr or reboot)
6-If you have a vg backup, restore it using restvg
7-If you do not have a vg backup, recreate the vg, lv and fs
8-If you have a backup of your data, restore your data from the backup
Scenario5
If the disk is not mirrored, has failed completely, no other disk is available in the vg & the vg is rootvg, then
-Replace the failing disk
-Boot in maintenance mode
-Restore the system from an mksysb image

I)Troubleshoot graphical problems

1)Full /home filesystem
-users will not be able to log in
-looks like hang
-go through command line

2)Name Resolution problems
-nslookup
-verify your systems network access
-server is up and running
-start and stop server-->smitty spnamerslv

3)DISPLAY and X server access
-->export DISPLAY=server3:2.0
On server3, grant server2 access to connect to the X server
-->xhost +server2
On server3, deny access
-->xhost -server2

+TTY display problems
-->clear
failed
-->smitty
failed
The TERM variable is not set to the correct value
-->export TERM=vt100

J)perfpmr

Monday, 26 May 2014
