Thursday, April 28, 2011

DFS Lock Handle (How to solve the Wrong Problem)

We started our Data Warehouse quest adventure on Tru64 back in 2003. At that time the Oracle 9i RAC and Tru64 was the best fastest platform on TPCH tests.

Evidently, all great operating systems needs to be retired sooner or later. One side note here for Tru64: I really loved their idea of clustering. Network clustering, Storage clustering, Service Clustering. All part of the operating system. These days you need to pull together couple of products to make a kind of Tru64 environment.

Anyway, back to Oracle migration. We migrated from 9i to 9.2 then to 10.2.0.4 on Tru64. Since 10.2.0.4 was the terminal release on Tru64 we start building the new Data Warehouse on Red Hat Enterprise Linux 5 and 10.2.0.4. We wanted to speed up the migration so we used the transportable tablespace technique.

We used OCFS2 as file system on the newly 2 node RAC environment.

The migration went fine and the performance seems fine for most of the business users. One problem surfaced though: RMAN backups.

On the previous environment we used EMC as storage vendor. On the migrated platform we switched to NetApp. What do you do when you have backup issues? Evidently blame the storage groups !


On the Tru64 environment we were able to take full backups every Sunday and incrementals every day. On the new RHEL environment we were barely able to take a full backup Saturdays. We kind of find a solution on this backing up only the archive logs on week-days and full backups at end of the week.

Once in a while, the DWH will become really slow. That will make a lot of users angry and our lives really terrible. And to make things complicated that start up to build as a pattern. Every other Tuesday !!

When things will become really slow, the NetApp engineers will start capturing storage stats looking for problems. However, they will not find anything really broke, then a Wednesday will come and things will become nice and 'fast' again.

In the meantime we were trying to look at AWR reports and one event will show up constantly on top on 'Tuesdays'. That will be evidently the DFS lock handle. Netapp will blame Oracle, Oracle will blame Netapp.

How do you solve a deadlock? Well, Oracle is picking up a session and kill the transaction. We decided to upgrade to 10.2.0.5. After a relatively easy patch week-end we got on the other side with a better environment. That's what we believed for couple of days at least. We definitely killed the DFS lock handle event and got to severe contention on redo logs switches after only 3 days !

So, we are gathering our iostats again and send it to NetApp. They will gather their stats again and answer back there is no problem with the storage.

Out of desperation we are looking the Oracle Documentation and research on the asynchronous I/O. There are basically 2 parameters that controls the I/O subsystem:

1. DISK_ASYNC_IO : [TRUE]|FALSE

2. FILESYSTEMIO_OPTIONS: ASYNCH|DIRECTIO|SETALL|[NONE]

The default value for DISK_ASYNC_IO is set to TRUE. FILESYSTEMIO_OPTIONS has NONE as the default.

We quickly researched on metalink and find out that OCFS2 is indeed supported for both direct i/o and asynchronous i/o. After we set the FILESYSTEMIO_OPTIONS to SETALL we got a really nice surprise. All the problems went away !

One side note here. The asynchronous i/o is turned off by default on cooked file systems. On ASM the asynchronous i/o is enabled by default since Oracle 'knows' he has direct access to the raw devices.

Solving the real problem is most of the time an easy task. In our situation was as simple as changing one single parameter. Evidently, the real problem is finding the real problem !