Monday 16 February 2009

Analytic Workspace Manager Plugin Best Practices

The Analytic Workspace Manager (or AWM for short) is the tool to use when you want to quickly create OLAP structures in the Oracle OLAP engine embedded within the Oracle relational database. Clearly you can use OWB and the like, but AWM provides a lightweight, easy-to-use interface for quick demos, restructuring content, processing and so on. However, what AWM provides out of the box may well not be everything to everyone, and so it is possible to write your own plugins, in Java, to get it to do what you want.

The documentation around doing this is, I found, a little thin on the ground, so to speak. So I have tried to document my findings here to help those who may wish to follow suit. Before I start, though, I should point out that the following post is in no way endorsed by Oracle. These are merely my observations after coding my own plugin for the Analytic Workspace Manager.

So here goes...
Step one is to find the Developing_AWM_Plugins_11g.pdf document. When I started coding my plugin for the 11g version of AWM, I could only find the 10g document, which is not quite accurate for 11g. The OLAP API has changed somewhat in 11g and this is reflected in the interface between AWM and your plugin code. Looking at the document, the pertinent point is that the 11g version of AWM requires the plugin to be compiled with Java 5 SE (or version 1.5 in "old school" numbering; why did Sun have to confuse matters?). We'll come back to this, but suffice it to say that you'll need to remember that Java 5 brings with it support for generics.

Also in the document, if you have written 10g plugins, you may notice that the interface method signatures have changed a little. The AW object passed into the handle method is now an oracle.AWXML.AW object, and the params parameter to the handle method provides a map in which the DATAPROVIDER key gives you a way to obtain a connection to the underlying Oracle database without requiring username/password information.
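
To make that concrete, here is a minimal skeleton of the shape an 11g plugin class takes. The package names, method names and parameter order are as I read them in the PDF, and the class name and behaviour are placeholders of my own, so check the signatures against your copy of the document rather than treating this as gospel:

    import java.awt.Frame;
    import java.sql.Connection;
    import java.util.Map;
    import java.util.ResourceBundle;

    import oracle.AWXML.AW;
    import oracle.olap.awm.plugin.AWMPlugin;  // package as documented in the 11g plugin PDF

    public class MySkeletonPlugin implements AWMPlugin {

        public boolean isSupported(Connection conn, String type, Object obj, ResourceBundle rb) {
            return obj instanceof AW;          // only offer the plugin on analytic workspace nodes
        }

        public String getMenu(Connection conn, String type, Object obj, ResourceBundle rb) {
            return "My Skeleton Plugin...";    // text for the right-click menu entry in AWM
        }

        public void handle(Frame parent, Connection conn, String type, Object obj, ResourceBundle rb, Map params) {
            AW aw = (AW) obj;                  // in 11g this is an oracle.AWXML.AW instance
            Object dataProvider = params.get("DATAPROVIDER");
            // DATAPROVIDER gives access to the database session without a username/password;
            // cast it to the OLAP API type named in the PDF before using it.
        }

        public boolean refreshTree(Connection conn, String type, Object obj, ResourceBundle rb) {
            return false;                      // nothing in the navigator tree needs refreshing
        }
    }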

Step two, and arguably the most important step, is understanding that on startup the Analytic Workspace Manager loads all Java code in the plugin directory and looks for any class files that implement the AWMPlugin interface. This is also covered in the above-mentioned document; what is not made clear, however, is that the more classes and JAR files you add to the plugin directory, the bigger the impact on AWM's startup time. Also, if you use third-party JAR files you are at the mercy of their maintainers.

For example, placing the "jfreechart" JAR files in the plugin directory causes exceptions to be thrown in the AWM command window about missing Eclipse files. (I'm guessing that "jfreechart" is built using Eclipse and some code dependencies are left behind.) Allowing this to happen with a user-defined plugin is inelegant at best and most likely just downright rude. We should avoid that, and we can.

The best practice I have come up with is as follows:
  • The interface between AWM and your plugin should reside in its own JAR file and contain only the minimum number of class files needed to start the actual plugin. This keeps AWM's startup time to a minimum.
  • The actual plugin code should be packaged into its own JAR file and placed elsewhere in the file system. (I use a subdirectory of the plugin directory so that it can be found using relative paths from the plugin interface.)
  • Associated plugin dependencies should also be placed into a subdirectory (I use the same one as my own code as mentioned above) of the plugin directory.
Now, Oracle kindly provide a directory in which plugins are found and executed, but this clearly isn't scalable from a startup point of view if you require large dependencies or lots of plugins. Oracle unfortunately don't provide the equivalent of an "Ext" directory into which dependencies could be placed. So here's the dilemma... Do we customise the Windows batch file (or *nix shell script) to include our dependencies in the classpath, or do we place our dependencies in one of the directories already referenced by the Oracle-prescribed classpath?

My answer... neither! Customising the startup script is prone to failure if Oracle ever decide to update it, and it makes deploying the plugin a pain, since everyone's installation may be different and the script needs to take that into consideration. Hijacking the Oracle directories is also pretty "rude" and likely to cause issues down the line, not to mention, once again, the uniqueness of each installation. No, best practice is to use a URLClassLoader to dynamically load the dependencies at runtime and start the plugin. This unfortunately requires the use of reflection and is a bit more long-winded, but it is far more robust and will stand you in good stead during further development.
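
By way of illustration, here is a minimal sketch of that bootstrap idea: the lightweight plugin builds a URLClassLoader over every JAR in a subdirectory of the plugin directory and instantiates the real plugin reflectively. The subdirectory and class names below are hypothetical; substitute your own:

    import java.io.File;
    import java.io.FilenameFilter;
    import java.net.URL;
    import java.net.URLClassLoader;

    public final class PluginBootstrap {

        public static Object loadRealPlugin(File pluginDir) throws Exception {
            // Hypothetical subdirectory holding the real plugin JAR and its dependencies.
            File libDir = new File(pluginDir, "myplugin-lib");
            File[] jars = libDir.listFiles(new FilenameFilter() {
                public boolean accept(File dir, String name) {
                    return name.toLowerCase().endsWith(".jar");
                }
            });

            URL[] urls = new URL[jars.length];
            for (int i = 0; i < jars.length; i++) {
                urls[i] = jars[i].toURI().toURL();
            }

            // Parent is this class's loader, so the AWM and OLAP API classes remain visible.
            URLClassLoader loader = new URLClassLoader(urls, PluginBootstrap.class.getClassLoader());

            // Instantiate the real plugin reflectively; it only needs to exist on the new
            // loader's classpath, not on the classpath AWM itself was started with.
            Class realClass = loader.loadClass("com.example.awm.MyRealPlugin");  // hypothetical name
            return realClass.newInstance();
        }
    }

If the class loaded this way also implements AWMPlugin (which stays visible through the parent loader), the lightweight plugin can simply cast the returned object and delegate its own method calls to it, keeping the reflection to the instantiation step.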

My plugin used both "log4j" and "jfreechart", both of which had to be loaded onto the classpath, with "log4j" also needing to be instantiated through that classloader in order to log progress. I may blog about this in more detail in the future, as doing it can be tricky and I couldn't find a definitive guide on the web.
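
As one hedged example of the sort of thing involved, the snippet below configures log4j 1.2 through the same custom classloader via reflection, so the log4j JAR can live in the plugin subdirectory rather than on AWM's own classpath. The properties-file path is a placeholder:

    import java.lang.reflect.Method;

    public final class Log4jBootstrap {

        public static void configure(ClassLoader loader, String propertiesPath) throws Exception {
            Class configurator = loader.loadClass("org.apache.log4j.PropertyConfigurator");
            // PropertyConfigurator.configure(String) is a static method, hence the null target.
            Method configure = configurator.getMethod("configure", new Class[] { String.class });
            configure.invoke(null, new Object[] { propertiesPath });
        }
    }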

Finally, coming back to generics, note that the AWMPlugin interface needs to be backwards compatible and as such does not make use of generics, which tends to generate compiler warnings. The way I would recommend handling this is to annotate the interface methods with "@SuppressWarnings("unchecked")" to suppress those warnings, but only on the methods provided by Oracle. If you need to assign one of the variables from the interface parameter list to your own variable, you can use the generic wildcard, such as declaring "Map<?, ?>", which is a supertype of every parameterised Map.
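
For illustration, and assuming the handle signature sketched earlier, that boils down to something like this:

    import java.awt.Frame;
    import java.sql.Connection;
    import java.util.Map;
    import java.util.ResourceBundle;

    public class GenericsExample /* implements AWMPlugin */ {

        @SuppressWarnings("unchecked")   // suppress warnings only on the Oracle-defined method
        public void handle(Frame parent, Connection conn, String type, Object obj, ResourceBundle rb, Map params) {
            Map<?, ?> myParams = params;               // wildcard view of the raw map
            Object dataProvider = myParams.get("DATAPROVIDER");
        }
    }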

And that's all there is to it.

Monday 2 February 2009

Analysis Services 2005/2008 Heap Corruption

Recently I have been working for a financial institution that uses Analysis Services 2005 for its ad-hoc analysis. This particular client was experiencing issues with one of its Analysis Services servers producing memory mini-dump files during processing, causing service crashes and stalls, whilst the other servers were performing acceptably well.

The analysis of this issue started towards the end of 2007 and the cause has only now been diagnosed, with the able help of Microsoft support. The particular system setup for this client is quite complex: Oracle 10gR2 as the back-end relational database, with flashback used for referential integrity and consistency; the 11g (11.1.0.6.0) 64 bit OLE DB driver for connectivity from AS to Oracle; and SSIS for scheduled processing. The AS database also makes extensive use of linked measure groups in order to provide a single cube of data with mixed granularity.

In order to diagnose the cause of the heap corruption we had to jump through quite a series of hoops. We initially went down the route of sending the mini-dump files to Microsoft for analysis but these naturally only provided limited and static information. Sending full dump files proved very tricky in the short term as the information that the cubes contain is confidential and cannot be sent off-site. In the end, restarting AS and only processing the dimensions and cubes (i.e. no queries being sent to AS) proved adequate to be able to send the full memory dump files off-site, as it would be extremely hard to put this in-memory information together in any meaningful way.

Still, the full dump files are also static and thus proved inconclusive. It could be seen from the dump files that heap corruption was indeed occurring during processing, and there always appeared to be an Oracle error in the dump file as well. It seemed logical that the Oracle 11g driver could be the cause, as most clients tend to use the 10.2.0.3.0 Oracle OLE DB driver when connecting to Oracle 10.2.0.3.0 or later. In order to try and eliminate the 11g driver as the cause, we backed it out and installed the 10.2.0.3.0 driver. (Note that in true Oracle fashion, this proved to be an exercise in and of itself. There is no Oracle 10.2.0.3.0 64 bit client installation. Noooooooo, that's waaaaay too easy! You have to install 10.2.0.1.0 and then patch it to 10.2.0.3.0. Oh, but wait, there is a patch before you apply the patch in order to get the patch to work, etc. yadayadayada.) Anyhoo, I digress...

In the end, we found that using the Oracle 10.2.0.3.0 driver did not fix the problem. The next step then, was to divide and conquer so to speak...

Part one was to try and create a reduced test case by systematically removing complexity from the AS database. This was achieved by first removing the linked measure groups (I'm not aware of many people using this feature) and then removing cubes and dimensions until we had a simple AS database in which the issue could still be reproduced. The idea of this approach was to end up with a small AS database containing a tiny amount of non-confidential data so that Microsoft could perform in-house testing.

Part two was to use Time Travel Trace (TTT) testing, which effectively allows one to debug the issue by playing the process forward and rewinding it over a given time span. This would allow one to see the changes made to the heap and replay them to identify the point at which the heap was corrupted. Rather amusingly, though, there is a slight catch. This system was using AS2005 on Windows Server 2003, as I suspect most installations are. However, TTT only works on Windows Server 2003 on a 32 bit architecture, or on Windows Server 2008 on a 64 bit architecture, or, if virtualised, only on Hyper-V.
Right, so a few changes to the test case were needed then. As it happened, we used a 32 bit environment and, as luck would have it, we were actually able to reproduce the issue, though TTT failed to work. Initially it exhibited the same symptoms as seen when used on Vista, where it "watched" the incorrect "thread" for activity. This should have been fixed by setting the processor affinity for AS. It wasn't...

As it happens, the TTT testing wasn't required. The reduced test case was effective and we were able to remove all confidential data. Microsoft took this in-house and managed to find the answer. So, are you ready for this? Do you want to know why memory heap corruption was occurring during processing? Well, do you?

Rather bizarrely, the reason for the heap corruption was the use of the ".Properties" function when specifying the default member of a dimension. Yup, the use of a specific function when specifying the default dimension member caused heap corruption during processing, not querying but processing. So how lucky am I that the reduced test case I created happened to use one of the only two dimensions, out of 30-odd, that use that function to specify the default member? Now, I'm not sure why that function causes the issue, and neither are the AS development team. They are still looking into it and, from what I understand, were equally surprised. As it happens, the MDX that was being used to specify the default member wasn't particularly efficient in the first place (I must point out that it wasn't my code), and once the code was changed to use the "Exists" function instead, the problem went away.

Clearly your mileage may vary, but nonetheless please be aware that the use of the ".Properties" function when specifying the default dimension member can cause heap corruption during processing. This is true for both AS2005 CU11 (or SP3 CU1) and AS2008 CU2. I'm still not sure why this only occurred on one of the servers when the other servers have the exact same database build, but hey, removing the ".Properties" function solved the issue, so clearly that was the problem.

You have been warned...