Sunday, March 1, 2015

Oracle Linux -- Find Top Writers to Disk

Problem

On your Oracle Linux 5 box, you are experiencing a system slowdown. Response times are very high, and system performance has degraded to an unacceptable level.

You run vmstat to check the system state:

[root@wordpress ~]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  3      0 1433712  38344 382980    0    0     0  1416 1333  278  0  2  0 98  0
 0  3      0 1433712  38344 382980    0    0     0  1196 1278  269  0  2  0 97  0
 0  3      0 1433712  38344 382980    0    0     0  2132 1544  323  1  4  0 96  0
 0  3      0 1433712  38344 382980    0    0     0   128 1019  257  0  1  0 99  0
 0  2      0 1433712  38344 382980    0    0     0  2628 1522  336  0  3  0 98  0
 0  3      0 1433712  38344 382980    0    0     0   148 1057  339  1  1 46 53  0
 0  3      0 1433712  38344 382980    0    0     0  1108 1266  315  0  2  0 99  0
 


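I/O wait is excessive (the wa column is above 90% in almost every sample), and something is writing to disk (the bo column is non-zero throughout) while nothing is being read (the bi column stays at 0).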
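If you would rather not eyeball the columns, a small filter can flag the bad samples for you. This is only a sketch: high_iowait is a made-up helper name, and it assumes the vmstat header layout shown above (a column-name line that starts with "r b" and contains "bo" and "wa").

```shell
# Print the bo and wa values of every vmstat sample whose I/O wait exceeds a threshold.
# Reads vmstat output on stdin; the threshold (percent) is the first argument, default 90.
high_iowait() {
    awk -v limit="${1:-90}" '
        # Column-name header line: remember where "bo" and "wa" are.
        $1 == "r" && $2 == "b" {
            for (i = 1; i <= NF; i++) {
                if ($i == "bo") bo = i
                if ($i == "wa") wa = i
            }
            next
        }
        # Data lines start with a number, which also skips the dashed banner line.
        wa && $1 ~ /^[0-9]+$/ && $wa + 0 > limit { print "bo=" $bo, "wa=" $wa }
    '
}
```

A typical invocation would be: vmstat 1 10 | high_iowait 90. Keep in mind that the first sample vmstat prints is an average since boot, so it may not reflect the current load.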


Solution

The vmstat output shows that there are excessive writes to disk. The iotop utility makes it easy to identify the top writers. If iotop is not available on your system, you can use the manual approach below to identify the processes that write the most to disk.

The /proc/<PID>/io pseudo-file contains I/O statistics for the process identified by PID. For example, the process with PID 157 shows the following statistics:

[root@wordpress ~]# cat /proc/157/io
rchar: 0
wchar: 0
syscr: 0
syscw: 0
read_bytes: 0
write_bytes: 57344
cancelled_write_bytes: 0

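The write_bytes value is a cumulative count of the bytes the process has caused to be written to the storage layer. So, if you can find the processes whose write_bytes value is high and still growing, you have identified the top writers.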
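To check whether a single process is still writing, you can sample its counter twice. The following is a minimal sketch; write_bytes_of and write_delta are hypothetical helper names, and reading another user's io file generally requires root.

```shell
# Print the write_bytes counter from an io pseudo-file such as /proc/157/io.
write_bytes_of() {
    awk '$1 == "write_bytes:" { print $2 }' "$1"
}

# Sample one PID twice and print how many bytes it wrote in between.
# Usage: write_delta PID [SECONDS]   (the interval defaults to 5 seconds)
# The body runs in a subshell, so its variables do not leak into your session.
write_delta() (
    before=$(write_bytes_of "/proc/$1/io") || exit 1
    sleep "${2:-5}"
    after=$(write_bytes_of "/proc/$1/io") || exit 1
    echo $((after - before))
)
```

For example, write_delta 157 10 prints the number of bytes PID 157 wrote during a 10 second window. A result of 0 means the counter did not move in that interval.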

To find the top writers follow these steps:


1) Identify the io pseudo-files for all running processes:

[root@wordpress ~]# find /proc/ -name "io" | grep -v "task"
/proc/1/io
/proc/2/io
.
.
.
/proc/5327/io
/proc/5328/io

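The grep -v "task" filter drops the per-thread copies under /proc/<PID>/task/. Note that the list comes out sorted by PID, so two listings taken at different times line up when you compare them.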


2) Next, extract the lines containing the write_bytes statistic, optionally dropping zero-valued entries:

[root@wordpress ~]# find /proc/ -name "io" | grep -v "task" | xargs grep "^write_bytes" | grep -v "write_bytes: 0"
grep: /proc/5336/io: No such file or directory
grep: /proc/5337/io: No such file or directory
/proc/1/io:write_bytes: 9453568
/proc/157/io:write_bytes: 57344
/proc/417/io:write_bytes: 33939456
.
.
.
/proc/3970/io:write_bytes: 12288
/proc/4914/io:write_bytes: 4096
/proc/4917/io:write_bytes: 348160

Ignore any "No such file or directory" errors. Processes can start and exit between the stages of the pipeline, so some inconsistencies are expected and can safely be ignored. The process doing the sustained writing will still be alive.
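If the error messages bother you, the same listing can be produced with a shell glob instead of find. This is a sketch, not a drop-in replacement: list_writers is a made-up name, and the optional argument exists only so the function can be pointed at a directory other than /proc.

```shell
# List the non-zero write_bytes counters of all processes.
# Usage: list_writers [PROC_ROOT]   (PROC_ROOT defaults to /proc)
list_writers() {
    # [0-9]* matches only the per-process directories, so task/ is never visited.
    # -H forces the file name prefix even if only one file matches.
    # 2>/dev/null hides errors for processes that exited in the meantime.
    grep -H '^write_bytes' "${1:-/proc}"/[0-9]*/io 2>/dev/null |
        grep -v 'write_bytes: 0$'
}
```

Two things differ from the find version. The glob expands in lexical order (1, 10, 100, 2, ...) rather than numeric PID order, which is still consistent from one run to the next. And redirecting stderr also hides permission errors, so run it as root as in the examples above.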


3) Take two snapshots of the write statistics. Run the above command and redirect the output to a file:

find /proc/ -name "io" | grep -v "task" | xargs grep "^write_bytes" | grep -v "write_bytes: 0" > snap1

Wait a few seconds, and take the second snapshot:

find /proc/ -name "io" | grep -v "task" | xargs grep "^write_bytes" | grep -v "write_bytes: 0" > snap2


4) Now compare the two snapshots using the diff utility:

[root@wordpress ~]# diff snap1 snap2
2c2
< /proc/157/io:write_bytes: 53248
---
> /proc/157/io:write_bytes: 57344
40c40
< /proc/3684/io:write_bytes: 196755456
---
> /proc/3684/io:write_bytes: 196771840
41a42
> /proc/3737/io:write_bytes: 4096

The comparison shows that the process with PID 3684 has written roughly 197 MB so far and is still writing: its counter grew by 16384 bytes between the two snapshots.

Taking more snapshots and comparing them can provide more detailed information.
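Reading diff output gets tedious when many processes change at once. As a rough sketch, the per-process growth between two snapshot files can be computed and ranked with awk. The name snap_delta is hypothetical, and the field positions assume snapshot lines of the exact form /proc/<PID>/io:write_bytes: <N> as produced above.

```shell
# Print "BYTES_WRITTEN PID" for every process whose counter grew between
# two snapshot files, largest writer first.
# Usage: snap_delta snap1 snap2
snap_delta() {
    awk -F'[/: ]+' '
        # First file: remember the counter of each PID.
        # With this separator, $3 is the PID and the last field is the value.
        FILENAME == ARGV[1] { before[$3] = $NF; next }
        # Second file: a PID missing from the first snapshot counts from 0.
        {
            d = $NF - before[$3]
            if (d > 0) printf "%.0f %s\n", d, $3
        }
    ' "$1" "$2" | sort -rn
}
```

Run against the two snapshots above, this would put PID 3684 with 16384 bytes at the top of the list. A PID that appears only in the second snapshot is either a new process or one whose counter was still 0 when the first snapshot was taken, so counting it from 0 is a reasonable approximation.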


5) With the PIDs of the top writers in hand, identify the processes using the ps utility:

[root@wordpress ~]# ps -ef | grep 3684
root      3684  3318  3 10:32 ?        00:00:31 /usr/bin/python -tt /usr/libexec/yum-updatesd-helper --check --dbus
root      3925  3737  0 10:47 pts/1    00:00:00 grep 3684
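As the output shows, grepping the full process list also matches the grep command itself, and it would match any other line that happens to contain the number. A slightly more precise lookup is to ask ps for the PID directly. This is just a sketch; describe_pid is a made-up helper name.

```shell
# Show PID, parent PID, owner and full command line of a single process.
# Usage: describe_pid PID
describe_pid() {
    # The trailing "=" after each column name suppresses the header line.
    ps -o pid= -o ppid= -o user= -o args= -p "$1"
}
```

For the example above you would run describe_pid 3684. If the process has already exited, ps prints nothing and returns a non-zero exit status.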



