Tuesday, June 23, 2009

Remember everything is connected when troubleshooting database issue

Yesterday (Monday) users of a reporting database experienced slowness in extracting data from the database. They indicated that at the speed of current extracting, they would miss the business requirement to deliver the data to executives. I was called to check the database. I noticed their sessions were waiting for "buffer busy wait" event frequently, other than that nothing seemed abnormal in the database. Later, several sessions finished. "Buffer busy wait" become not outstanding. However, one of their major job was still running. Another job errored out with ORA-1555 snapshot too old eventually. They noticed the volume of data that they extracted was larger than usual.

Discussing for a while among users and DBAs, finally we realized that this was due to a large scrub that was carried out against the online production database by me per development team's request last Sunday night. Our online production database is replicated to the reporting database. In that scrub, one DATE type column of a table was updated to SYSDATE', whereas the reporting database users extract their data based on the value of that column. This caused the increased volume of data because more records now fall into the range of time that needs to be extracted.

I felt lucky secretly that I was able to provide the log file of that scrub with starting time and ending time information. This is of course not something I should be proud of. We should always have logs as detail as possible of whatever changes we make against a database, right?. The truth is in my current working environment, DBA is always handed over scripts without taking this into consideration. So if DBA does not take extra step to modify the script to generate useful log information , he could be in trouble if something goes wrong. The scrub took more than 8 hours from about 9:30 PM Saturday to 5:40 AM Sunday. So we knew what was the range of values that the DATE column has when it was updated to 'SYSDATE'. The users decided to exclude that range of data as a temporary fix for their Monday's report. Development team was requested to review their scrubs (i.e. Why updating the DATE column? Initial discussion gave me impression that this was not necessary), since we have even larger scrubs yet to complete of this kind.

No comments: