Tuesday 26 April 2011

Implementing Analysis Services Process Add Functionality – Part 1


I think most people who have used Analysis Services know about the ability to incrementally process dimensions using “ProcessAdd” and, let’s face it, without the ability to partition dimensions it’s about the only data-reducing optimisation we can perform on dimensions. Sure, we have “by table” versus “by attribute” processing, the ability to process attributes in parallel, and we can make sure that the relational layer produces results quickly by using either snowflaked tables or star-schema tables with appropriate indexes. However, we are still bound by the time taken to process a given set of rows, even when a subset of that information has already been processed.

To this end we can use incremental dimension processing, i.e. the “ProcessAdd” processing type, which aims to reduce the amount of time taken to process a dimension by only processing new information into the dimension. Clearly changes in hierarchy relationships will still require “ProcessUpdate” or “ProcessFull” as the entire dimension data set will need to be reprocessed in order to identify and reflect any hierarchy changes.
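
To make this concrete, the sketch below shows roughly what a dimension “ProcessAdd” command looks like in XMLA. The database and dimension IDs are made up for illustration and would need to match your own objects.

  <Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
    <!-- Object identifies the dimension to be processed incrementally -->
    <Object>
      <DatabaseID>MyOlapDatabase</DatabaseID>
      <DimensionID>Dim Customer</DimensionID>
    </Object>
    <!-- ProcessAdd adds new members only; existing members are left untouched -->
    <Type>ProcessAdd</Type>
  </Process>

When generated through SQL Server Management Studio the command will typically be wrapped in a Batch element, but the shape of the Process command itself is the same.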

However, there are some fundamental limitations with regard to how fast incremental dimension processing can go…
1.     Analysis Services performs transactional processing using files on the file system. As such, when starting a transaction its first task is to take a copy of all relevant files. It then performs the processing and, if processing is successful, the newly processed files are used. Based on this we can see that the time taken to process must include at least the time taken to clone all relevant files for the dimension, which can be non-trivial on large dimensions.
I guess an enhancement request could be to have Analysis Services clone existing dimension files using a background thread some time after processing. A per-dimension configuration would be ideal so that we don’t automatically double disk space and waste resources on small dimensions.
2.     In order to use incremental processing on dimensions there appear to be two general forms of implementation: the method that presents only the new rows to Analysis Services during processing (the intended, but clearly more arduous, approach), and the one that presents all rows but changes the error configuration to ignore duplicate key errors (a sketch of the latter appears after this list). I consider the latter option a bit of a hack, to be honest, and not in the true spirit of incremental processing.
3.     Even after providing only new rows, Analysis Services needs to ensure that any attribute whose ordering is defined by another attribute is re-ordered to reflect the newly added members. Clearly, if ordering has been defined on an attribute with many members, the act of Analysis Services revisiting that set could take a considerable amount of time.
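
For completeness, the “ignore duplicates” variant mentioned in point 2 looks roughly like the sketch below: the dimension’s source query still returns all rows, and the error configuration simply swallows the duplicate key errors raised for members that already exist. The object IDs are again made up for illustration.

  <Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
    <Parallel>
      <Process>
        <Object>
          <DatabaseID>MyOlapDatabase</DatabaseID>
          <DimensionID>Dim Customer</DimensionID>
        </Object>
        <Type>ProcessAdd</Type>
      </Process>
    </Parallel>
    <!-- Ignore the duplicate key errors raised for rows already present in the dimension -->
    <ErrorConfiguration>
      <KeyDuplicate>IgnoreError</KeyDuplicate>
    </ErrorConfiguration>
  </Batch>

It works, but it still pushes the full row set through processing, which rather defeats the purpose.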

From what I’ve seen, implementations of Analysis Services “ProcessAdd” on dimensions are somewhat diverse and I’m not aware of anyone documenting a definitive guide to implementing it. Now I certainly don’t intend these posts to be considered a definitive guide, but more of a starting point for discussion. To this end, I’ll try to cover off how I see “ProcessAdd” working with regard to the relationship between Analysis Services and SQL Server. Don’t get me wrong, others have most definitely blogged about “ProcessAdd”, and you can see Greg Galloway’s blog for more information on ProcessAdd performance metrics here: http://www.artisconsulting.com/blogs/greggalloway/Lists/Posts/Post.aspx?List=83c71d76-a4d8-4197-9257-38d6b857634f&ID=7. There are other posts like this out there too. However, they tend to focus more on the Analysis Services side, detailing the performance characteristics or the options available such as out-of-line DSVs, but not much on how to actually go about integrating the relational layer and Analysis Services with incremental processing of dimensions in mind.

So what are the challenges with integrating Analysis Services with SQL Server?
1.     Well, first we need to be able to identify new rows for every attribute. Note that it is insufficient to identify only new “key” rows, as each attribute in a dimension is processed separately. So it is entirely possible, and indeed likely, that a new “key” row will relate to attribute members that have already been processed.
2.     Once we can identify new rows for every attribute, how do we present only these new rows to Analysis Services when processing the dimension?

Delving into the second point a bit more, we should note that this tends to be a bit of a double-edged sword. We could use an out-of-line DSV to point Analysis Services at some object(s) that present only new rows, but the use of an out-of-line DSV requires a connection to be specified too. Now as soon as you specify a connection in an XMLA script you lose the option of allowing Analysis Services to process under its service account. So you either have to specify a username/password combination in the XMLA connection configuration (not nice), or use Kerberos and ensure that the account submitting the process request has sufficient privileges both in Analysis Services and in SQL Server. We could also opt to simply configure the base objects upon which the dimension is built to present only new rows, allowing the use of the service account during processing. However, if you do this, how do you then go about performing a “ProcessUpdate” or “ProcessFull” on an ad-hoc basis in order to fix a rigid relationship issue or to refresh the hierarchies?
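
To make the trade-off a little more tangible, the fragment below sketches where the connection information ends up when out-of-line objects are supplied with the Process command. Everything here (IDs, server, catalog, credentials) is invented for illustration, and the out-of-line DSV itself is omitted because its full definition is lengthy; the point is simply that the script now has to carry a connection of its own.

  <Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Object>
      <DatabaseID>MyOlapDatabase</DatabaseID>
      <DimensionID>Dim Customer</DimensionID>
    </Object>
    <Type>ProcessAdd</Type>
    <!-- Out-of-line data source: the connection now lives in the script, so the
         credentials (or a Kerberos-delegated identity) must come from the script
         rather than from the Analysis Services service account -->
    <DataSource xsi:type="RelationalDataSource">
      <ID>MyRelationalSource</ID>
      <Name>MyRelationalSource</Name>
      <ConnectionString>Provider=SQLNCLI10;Data Source=MYSERVER;Initial Catalog=MyDW;User ID=ssas_etl;Password=...</ConnectionString>
    </DataSource>
    <!-- An out-of-line DataSourceView would follow here, redefining the dimension's
         source table as a query that returns only the new rows (omitted for brevity) -->
  </Process>

This is what makes the approach feel like a double-edged sword: it gives you full control over the rows presented, but it drags connection and security concerns into the processing script.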

In my next post (here) I’ll try to cover off one solution to this problem. Clearly this is not going to be straightforward or risk-free but hopefully it’s a starting point.

Please find part 2 here.
