Monday 2 February 2009

Analysis Services 2005/2008 Heap Corruption

Recently I have been working for a financial institution who are using Analysis Service 2005 for their ad-hoc analysis. This particular client was experiencing issues with one of the Analysis Services servers producing memory mini-dump files during processing and causing service crashes and stalls whilst other servers were performing acceptably well.

The analysis of this issue started towards the end of 2007 and the reason for this has only now been diagnosed with the able help of Microsoft support. The particular system setup for this client is quite complex, using Oracle 10gR2 as the back-end relational database using flashback for referential integrity and consistency, the 11g (11.1.0.6.0) 64 bit OLE DB driver for connectivity from AS to Oracle and SSIS for scheduled processing. The AS database also makes extensive use of linked measure groups in order to provide a single cube of data with mixed granularity.

In order to diagnose the cause of the heap corruption we had to jump through quite a series of hoops. We initially went down the route of sending the mini-dump files to Microsoft for analysis but these naturally only provided limited and static information. Sending full dump files proved very tricky in the short term as the information that the cubes contain is confidential and cannot be sent off-site. In the end, restarting AS and only processing the dimensions and cubes (i.e. no queries being sent to AS) proved adequate to be able to send the full memory dump files off-site, as it would be extremely hard to put this in-memory information together in any meaningful way.

Still, the full dump files are also static and thus proved inconclusive. It could be seen from the dump files that heap corruption was indeed occuring during processing and there always appeared to be an Oracle error in the dump file as well. It seemed logical that the Oracle 11g driver could be the cause, as most clients tend to use the 10.2.0.3.0 Oracle OLE DB driver when connecting to Oracle 10.2.0.3.0 or later. In order to try and eliminate the 11g driver as the cause, we backed out the 11g driver and installed the 10.2.0.3.0 driver. (Note that in true Oracle fashion, this proved to be an exercise in and of itself. There is no Oracle 10.2.0.3.0 64 bit client installation. Noooooooo, that's waaaaay too easy! You have to install 10.2.0.1.0 and then patch it to 10.2.0.3.0. Oh, but wait, there is a patch before you apply the patch in order to get the patch to work etc. yadayadayada.) Anyhoo, I digress...

In the end, we found that using the Oracle 10.2.0.3.0 driver did not fix the problem. The next step then, was to divide and conquer so to speak...

Part one was to try and create a reduced test case by systematically removing complexity from the AS database. This was achieved by first removing the linked measure groups (I'm not aware of many people using this feature) and then to remove cubes and dimensions until a simple AS database could be created in which the issue could still be reproduced. The idea of this approach was to be able to get a small AS database with a tiny amount of non-confidential data so that Microsoft could perform in-house testing.

Part two was to use Time Travel Trace testing (TTT) which effectively allows one to debug the issue by being able to play forward and rewind the process over a given time span. This would allow one to see the changes made to the heap and replay them to identify the point at which the heap was corrupted. Rather amusingly though, there is a slight catch. This system was using AS2005 on Windows Server 2003 as are most installations I would suspect. However, TTT only works on Windows Server 2003 on a 32 bit architecture or Windows Server 2008 on a 64 bit architecture or if virtualised, only on Hyper-V.
Right, so a few changes to the test case then. As it happened we used a 32 bit environment and as luck would have it we were actually able to reproduce the issue, though TTT failed to work. Initially it exhibited the same symptoms as seen when used on Vista, where it "watched" the incorrect "thread" for activity. This should have been fixed be setting the processor affinity for AS. It wasn't...

As it happens, the TTT testing wasn't required. The reduced test case was effective and we were able to remove all confidential data. Microsoft took this in-house and managed to find the answer. So, are you ready for this? Do you want to know why memory heap corruption was occurring during processing? Well, do you?

Rather bizarrely, the reason for the heap corruption was the use of the ".Properties" function when specifying the default member in a dimension. Yup, the use of a specific function when specifying the default dimension member caused heap corruption during processing, not querying but processing. So how lucky am I that the reduced test case that I created used one of the only two dimensions out of 30 odd using that function when specifying the default member? Now I'm not sure why that function causes the issue and nor are the AS development team. They are still looking into it and from what I understand, were equally surprised. As it happens, the MDX that was being used to specify the default member wasn't particularly efficient in the first place (I must point out that it wasn't my code) and that once the code was changed to use the "Exists" function, it solved the problem.

Clearly your mileage may vary but nonetheless please be aware that the use of the ".Properties" function when specifying the default dimension member can cause heap corruption during processing. This is true for both AS2005 CU11 (or SP3 CU1) and AS2008 CU2. I'm still not sure why this only occurred on one of the servers when the other servers have the exact same database build but hey, removing the ".Properties" function solved the issue so clearly that was the problem.

You have been warned...

4 comments:

  1. Pingback.

    Link to this post was added to our website in the [SSAS Articles]/[Connectivity] section:
    http://www.ssas-info.com/analysis-services-articles/63-connectivity/1368-analysis-services-20052008-heap-corruption

    ReplyDelete
  2. Hi, is there any chance you can advice me aboutn connecting Oracle 11 g 64 bits with Analysis Services 2008? . We are using 64 bit server and my client wants to use the Oracle Provider for OLE DB Connector because apparently is faster than other connectors. At this moment I can see the datasource view, and make a succesful connection. But at the moment I process any dimension it crashes with a ORA TNS: listner error. I checked TNS names and looks fine. I dont know what else to do. I know it works with 32 bit client I tested... but I cant figure out what to change from 32 to 64... can you help me?

    ReplyDelete
  3. Hi Adriana,

    No problem. Indeed the Oracle drivers are a lot faster than the Microsoft ones but make sure that you are using a stable version of the Oracle drivers.

    Okay so note that when developing in VS that it is a 32bit application and thus requires 32 bit drivers. AS on the other hand is 64bit and thus requires the 64bit drivers. So if you're developing on the server you need to install both drivers and configure TNSNames for both.

    Hope that gets you on your way.

    ReplyDelete