Tuesday, February 22, 2011

How can I confirm there was an I/O problem?

Last Friday afternoon, our production support team (this team manages application servers in my working environment) reported that around 2:18 PM there was application queuing and some jobs were timing out. Naturally, as a knee-jerk reaction, the problem was described as "the database is slow". Later, some "SCSI trans" errors were found somewhere, which indicated there was an I/O problem. So my question is: as a DBA, how can I confirm there was an I/O problem?

At first, I tried to look into the AWR data. I modified the awr_io_file.sql script by Karl Arao, hoping to report the average time per read for each data file during the problem period. However, the script ran for more than 20 minutes without returning results. I killed the job and decided to research this AWR-mining approach later. BTW, one drawback I am aware of is that even if I can get this data, it will be averaged over 15 minutes (our AWR snapshot interval), and averaging could hide the real problem.

After googling for a while, I found that the 'sar' utility on my Solaris platform can shed some light on I/O issues in the recent past.

If we do 'man sar', we find the following description of the -f option:

sar [-aAbcdgkmpqruvwy] [-e time] [-f filename]  [-i sec]  [-s time]

sar  extracts  data  from  a  previously  recorded filename,
either the one specified by the -f option  or,  by  default,
the    standard    system    activity    daily   data   file
/var/adm/sa/sadd for the current day dd.  The  starting  and
ending  times  of the report can be bounded using the -e and
-s arguments with time specified in the  form  hh[:mm[:ss]].
The  -i option selects records at sec second intervals.
Otherwise, all intervals found in the data file are reported.


This was the first time I became aware that there are 'standard system activity daily data files'
under /var/adm/sa/. As of now, I can see the following files:


myservnm:/var/adm/sa [orclsid] $ ls -l 
total 2177984
-rw-r--r--   1 sys      sys      241097472 Feb 18 23:55 sa18
-rw-r--r--   1 sys      sys      241097472 Feb 19 23:55 sa19
-rw-r--r--   1 sys      sys      241097472 Feb 20 23:55 sa20
-rw-r--r--   1 sys      sys      241097472 Feb 21 23:55 sa21
-rw-r--r--   1 sys      sys      78691536 Feb 22 07:45 sa22

So it looks like we have 5 days' worth of data.
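These daily files are written by the sadc collector, which Solaris normally drives from the sys user's crontab. From memory (so treat the exact schedule as an assumption, not something verified on this host), a stock Solaris 10 install ships entries like the following, commented out by default; the 5-minute samples seen below mean the schedule was enabled and tuned here:

```
# /var/spool/cron/crontabs/sys (stock Solaris 10; exact schedule may differ)
0 * * * 0-6 /usr/lib/sa/sa1
20,40 8-17 * * 1-5 /usr/lib/sa/sa1
5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 1200 -A
```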

What I did last Friday afternoon was:

sar -A  -e 14:28  -s 14:10   > ~/sar1410_1428.txt

If you don't specify -f, sar extracts from the current day's file, i.e. sa18 last Friday.
In the output file, we can see various statistics recorded at 5-minute snapshot intervals.
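The -A dump includes every section, which is a lot of output. If only disk activity matters, the report can be narrowed with -d; a hedged sketch (the readability guard is mine, since the daily file only exists on the Solaris host itself):

```shell
#!/bin/sh
# Hypothetical narrowing of the report: -d limits sar to the disk-activity
# section, and -f points it at Friday's daily file instead of today's.
# The guard is my addition: /var/adm/sa/sa18 only exists on the Solaris host.
SAFILE=/var/adm/sa/sa18
if command -v sar >/dev/null 2>&1 && [ -r "$SAFILE" ]; then
    report=$(sar -d -f "$SAFILE" -s 14:10 -e 14:28)
else
    report="sar or $SAFILE not readable on this host"
fi
printf '%s\n' "$report"
```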

For example, CPU utilization:

SunOS myservnm 5.10 Generic_142900-05 sun4u    02/18/2011
 
14:10:02    %usr    %sys    %wio   %idle
14:15:03      17      14       0      69
14:20:04      10      11       0      78
14:25:05      20      18       0      62
 
Average       16      15       0      70
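Note that %wio stayed at 0 throughout the window, so CPU wait-for-I/O alone did not flag the problem here. As a small illustration (my own helper, not part of sar), awk can scan such a report and flag any snapshot whose %wio crosses a threshold:

```shell
#!/bin/sh
# My own helper, not part of sar: scan a 'sar -u' style report and flag
# snapshots whose %wio (4th column) exceeds a threshold. The sample lines
# are the CPU report shown above, where %wio never left 0.
flagged=$(awk -v limit=10 '
    # data rows start with a hh:mm:ss timestamp and have a numeric %wio field
    $1 ~ /^[0-9]+:[0-9]+:[0-9]+$/ && $4 ~ /^[0-9]+$/ && $4 + 0 > limit {
        print $1, "%wio =", $4; n++
    }
    END { if (!n) print "no snapshot above", limit "% wio" }
' <<'EOF'
14:10:02    %usr    %sys    %wio   %idle
14:15:03      17      14       0      69
14:20:04      10      11       0      78
14:25:05      20      18       0      62
EOF
)
echo "$flagged"
```

With the data above it simply reports that no snapshot exceeded the limit, which matches the report: this incident shows up in the device section, not the CPU section.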


In the device activity section, I found that at the 14:20 snapshot some lines have very high avserv times,
while normally I see avserv < 30 ms on my system. For example:
14:10:02   device        %busy   avque   r+w/s  blks/s  avwait  avserv
 
14:15:03   sd7               1     0.0       1      12     0.0     9.0
           sd7,e             1     0.0       1      12     0.0     9.2
           sd8              32     0.3      46     426     0.0     7.4
....
14:20:04   sd7               1     0.0       1      16     0.0     9.1
           sd7,e             1     0.0       1      16     0.0     9.3
...          
           sd1133,e          1     0.0       1      26     0.0    12.9
           sd1134           44     0.8       3      73    78.6   238.1
           sd1134,e         44     0.8       3      73    79.1   239.6
           sd1135           42     0.4       1      20     0.0   485.3
           sd1135,e         42     0.4       1      20     0.0   494.7
           sd1136           42     0.4       0      25     0.0  1000.9
           sd1136,e         42     0.4       0      25     0.0  1042.2
           sd1142            1     0.0       1      40     0.0    13.0
...
           sd1204,e          4     0.0       3      81     0.0    16.0
           sd1205           43     0.4       1      22     0.0   440.3
           sd1205,e         43     0.4       1      22     0.0   448.0
           sd1206           43     0.8       2      60    85.3   254.3
           sd1206,e         43     0.8       2      60    85.9   256.1
           sd1207            1     0.0       1      25     0.0    12.7
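Eyeballing hundreds of device lines is tedious. As a sketch (my own helper, with the threshold and sample lines taken from the report above), awk can pull out just the devices whose avserv exceeds a limit:

```shell
#!/bin/sh
# My own helper, not part of sar: list only the devices whose avserv
# (the last column of a 'sar -d' report) exceeds a limit, here 100 ms.
# The sample lines are taken from the 14:20:04 snapshot above.
slow=$(awk -v limit=100 '
    $NF ~ /^[0-9.]+$/ && $NF + 0 > limit {
        # device name is field 1, or field 2 on lines starting with a timestamp
        dev = ($1 ~ /^[0-9]+:/) ? $2 : $1
        print dev, "avserv =", $NF
    }
' <<'EOF'
14:20:04   sd7               1     0.0       1      16     0.0     9.1
           sd1134           44     0.8       3      73    78.6   238.1
           sd1135           42     0.4       1      20     0.0   485.3
           sd1136           42     0.4       0      25     0.0  1000.9
           sd1142            1     0.0       1      40     0.0    13.0
EOF
)
echo "$slow"
```

Against the sample lines, this prints sd1134, sd1135, and sd1136, the same devices that stand out when reading the snapshot by hand.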



This webpage describes the meaning of some sar options and report fields.
This article called "Identifying disk I/O-bound systems" has information about interpreting sar reports.

I am happy to have another tool in my tool belt.

2 comments:

Anonymous said...

Hi Denis,

Yes, the awr_io_file.sql will really run slow if you are going to run through all the SNAP_IDs, especially if you have thousands of data samples.

I usually edit the script and filter on the time range I'm interested in. In your case it's 2:18 PM. There is a filter section in the bottom part of the script; you just have to uncomment it. If you want more granular latency info on your datafiles, you can use the scripts filemetric.sql (10-minute interval, coming from metric views) and filestat.sql (you can set it even at a 1-second interval, coming from v$ views) at http://karlarao.wordpress.com/scripts-resources/

And regarding SAR, yes I also do that... usually I correlate it with the Oracle performance data, and to have a faster analysis I feed those SAR files to kSar to get a graphical view of the performance data. And if you have, let's say, 7 days' worth of SAR data, you can just append them all into one file and you'll appreciate the time-series graph even more.


- Karl Arao

yds said...

Karl - I did try to edit the time range for that day. But your script has a union all for datafiles and tempfiles respectively, and I only edited the time for the tempfile part without being aware there was another part, so when the script had been running for more than 20 minutes, I gave up. I figured it out later. Anyway, thanks for sharing your experience and great scripts.