At first, I tried to look into the AWR data. I modified the awr_io_file.sql script by Karl Arao, hoping to report the average time per read for the data files during the problem period. However, the script ran for more than 20 minutes without producing results. I killed the job and decided to research this AWR-mining approach later. BTW, one drawback I am aware of is that even if I could get this data, it would be averaged over 15 minutes (our AWR snapshot interval), and averaging could hide the real problem.
After googling for a while, I found that the 'sar' utility on my Solaris platform can shed some light on I/O issues in the recent past.
If we do 'man sar', we find the following description of the -f option:
sar [-aAbcdgkmpqruvwy] [-e time] [-f filename] [-i sec] [-s time]

    sar extracts data from a previously recorded filename, either the
    one specified by the -f option or, by default, the standard system
    activity daily data file /var/adm/sa/sadd for the current day dd.
    The starting and ending times of the report can be bounded using
    the -e and -s arguments with time specified in the form
    hh[:mm[:ss]]. The -i option selects records at sec second
    intervals. Otherwise, all intervals found in the data file are
    reported.
This is the first time I became aware that there are 'standard system activity daily data files' at /var/adm/sa/. As of now, I can see the following files:
myservnm:/var/adm/sa [orclsid] $ ls -l
total 2177984
-rw-r--r--   1 sys      sys      241097472 Feb 18 23:55 sa18
-rw-r--r--   1 sys      sys      241097472 Feb 19 23:55 sa19
-rw-r--r--   1 sys      sys      241097472 Feb 20 23:55 sa20
-rw-r--r--   1 sys      sys      241097472 Feb 21 23:55 sa21
-rw-r--r--   1 sys      sys       78691536 Feb 22 07:45 sa22
So it looks like we have 5 days of data.
What I did last Friday afternoon was:
sar -A -e 14:28 -s 14:10 > ~/sar1410_1428.txt
If you don't specify -f, sar extracts from the current day's file, i.e. sa18 last Friday.
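To look back at an earlier day instead of today, you can point -f at one of the saDD files from the listing above. A minimal sketch (Solaris sar syntax), guarded so it degrades gracefully on a host where sar or that day's file is absent:

```shell
# Read Friday's standard daily data file explicitly with -f, bounding the
# report to the 14:10-14:28 window with -s/-e (device activity via -d).
# The fallback message is only for hosts without sar or without sa18.
rpt=$(sar -d -f /var/adm/sa/sa18 -s 14:10 -e 14:28 2>/dev/null) \
  || rpt="sar data for Feb 18 not available on this host"
echo "$rpt"
```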
In the output file, we can see various information recorded at 5-minute intervals.
For example, CPU utilization:
SunOS myservnm 5.10 Generic_142900-05 sun4u    02/18/2011

14:10:02    %usr    %sys    %wio   %idle
14:15:03      17      14       0      69
14:20:04      10      11       0      78
14:25:05      20      18       0      62

Average       16      15       0      70
In the device activity section, I found that at the 14:20 snapshot some devices had very high avserv times, whereas normally I see avserv < 30 ms on this system. For example:
14:10:02   device        %busy   avque   r+w/s  blks/s  avwait  avserv

14:15:03   sd7               1     0.0       1      12     0.0     9.0
           sd7,e             1     0.0       1      12     0.0     9.2
           sd8              32     0.3      46     426     0.0     7.4
           ....
14:20:04   sd7               1     0.0       1      16     0.0     9.1
           sd7,e             1     0.0       1      16     0.0     9.3
           ...
           sd1133,e          1     0.0       1      26     0.0    12.9
           sd1134           44     0.8       3      73    78.6   238.1
           sd1134,e         44     0.8       3      73    79.1   239.6
           sd1135           42     0.4       1      20     0.0   485.3
           sd1135,e         42     0.4       1      20     0.0   494.7
           sd1136           42     0.4       0      25     0.0  1000.9
           sd1136,e         42     0.4       0      25     0.0  1042.2
           sd1142            1     0.0       1      40     0.0    13.0
           ...
           sd1204,e          4     0.0       3      81     0.0    16.0
           sd1205           43     0.4       1      22     0.0   440.3
           sd1205,e         43     0.4       1      22     0.0   448.0
           sd1206           43     0.8       2      60    85.3   254.3
           sd1206,e         43     0.8       2      60    85.9   256.1
           sd1207            1     0.0       1      25     0.0    12.7
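Eyeballing a long device section for outliers is tedious, so a small awk filter can pull out just the slow devices. This is a hypothetical helper, not part of sar itself; the sample lines are copied from the report above, and the assumption is that avserv is the last column of each device line (true for this sar -d layout, but worth checking on your build):

```shell
# Flag devices whose avserv (last field) exceeds 30 ms in 'sar -d' style
# output. Sample device lines inlined for illustration.
cat > /tmp/sar_dev_sample.txt <<'EOF'
sd1133,e    1   0.0   1   26   0.0    12.9
sd1134     44   0.8   3   73  78.6   238.1
sd1136     42   0.4   0   25   0.0  1000.9
sd1142      1   0.0   1   40   0.0    13.0
EOF
# $NF is the last field (avserv); '+ 0' forces a numeric comparison.
slow=$(awk '$NF + 0 > 30 { print $1 }' /tmp/sar_dev_sample.txt)
echo "$slow"    # prints sd1134 and sd1136
```

On the real file you would replace the heredoc with the saved sar report, e.g. `awk '$NF + 0 > 30' ~/sar1410_1428.txt`, after skipping the non-device sections.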
This webpage describes the meaning of some sar options and report fields.
This article, called "Identifying disk I/O-bound systems", has information about interpreting sar reports.
I am happy to have another tool in my tool belt.
2 comments:
Hi Denis,
Yes, the awr_io_file.sql will really run slow if you run it through all the SNAP_IDs, especially if you have thousands of data samples.
I usually edit the script and filter on the time range I'm interested in. In your case it's 2:18 PM. There is a filter section on the bottom part of the script; you just have to uncomment it. If you want more granular latency info on your datafiles, you can use the scripts filemetric.sql (10-minute interval, coming from metric views) and filestat.sql (you can set it even at a 1-second interval, coming from v$ views) at http://karlarao.wordpress.com/scripts-resources/
And regarding SAR, yes, I also do that. Usually I correlate it with the Oracle performance data, and for faster analysis I feed those SAR files to kSar to get a graphical view of the performance data. And if you have, say, 7 days' worth of SAR data, you can just append them all into one file and you'll appreciate the time series graph even more.
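Karl's append-the-days idea can be sketched as a small loop. This assumes the default /var/adm/sa/saDD binary files for Feb 18-22 exist (the file names come from the listing earlier in the post); each day's binary file is rendered as text and appended into one report that kSar can graph:

```shell
# Combine several days of sar data into one text report for kSar.
# Days without a saDD file are silently skipped (|| true).
out=/tmp/sar_week.txt
: > "$out"                              # start with an empty report
for d in 18 19 20 21 22; do
  sar -A -f "/var/adm/sa/sa$d" >> "$out" 2>/dev/null || true
done
echo "combined report: $out"
```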
- Karl Arao
Karl - I did try to edit the time range for that day. But your script has a union all for the datafile and tempfile parts; I only edited the time for the tempfile part without being aware there was another part, so when the script had been running for more than 20 minutes, I gave up. I figured it out later. Anyway, thanks for sharing your experience and your great scripts.
Post a Comment