Thursday, 19 May 2011

Implementing Analysis Services Process Add Functionality – Part 2


In a previous post here I briefly mentioned some of the benefits and issues surrounding the implementation of incremental processing of Analysis Services dimensions. In this post I’ll go into a bit more detail with regard to configuring the SQL Server schema to support incremental processing of Analysis Services dimensions.

I should point out upfront that I am specifically referring to SQL Server as the relational database here but it could be any database really. The thing is, I’m going to use some SQL Server features as a shortcut both for performance reasons and ease of understanding. I know Oracle has similar functionality but I cannot document each and every RDBMS so I have just chosen SQL Server as the basis for this post.

So straight off, we have a choice to make. Should we use an out-of-line DSV when incrementally processing or not? The choice made here influences how you go about implementing incremental dimension processing, though in the solution presented here only the final steps would need to change. To get around the issues of having to configure Kerberos and of submitting a potentially large DSV every time we wish to process, I have chosen not to go the out-of-line DSV route. Although there is nothing that I could see in SQL Profiler, I would presume that there is some form of overhead to Analysis Services evaluating an out-of-line DSV, no matter how trivial.

It’s also worth mentioning that a nice side-effect of my solution is that we also overcome the race condition around attribute processing that occurs when dimensions are processed while the relational database is still loading dimensional data. For example, in the case where you load a new customer linked to a new customer group and Analysis Services has already processed the customer group attribute but not the customer attribute, dimension processing will fail. The reason is that Analysis Services will have captured the set of customer group members as it was before the new member arrived in the table. This means that processing of the customer attribute will fail when Analysis Services tries to link the customer to the customer group, as the related member will not exist in the processed set of members for the customer group attribute. This can be extremely annoying and there is no out-of-the-box way to force a consistent snapshot of all attributes for a dimension during processing.

One Solution:
So, looking at snapshots, it seems that we need to roll our own. Clearly one idea is to effectively snapshot all the rows we wish to load for the dimension across every attribute. We can then present only those rows for loading by Analysis Services at process time. Using this method we can ensure that we prevent the race-condition described earlier during a ProcessUpdate operation too. The trick is how to snapshot the rows quickly and cheaply. (i.e. We don’t want to have to copy an entire table or perform a join across the current dimension table and a previous snapshot if we can avoid it.)

So, in my proposed solution, we first of all need to ensure that every attribute that is to be loaded into the dimension is recorded in its own table. That is, the relational schema for the dimension needs to be snow-flaked. I know Kimball die-hard fans will be screaming right about now, but I haven’t said that you can’t also have a star-schema table. For large Analysis Services dimensions this is usually what I do. I ensure that I generate integer surrogate keys for every attribute using snow-flaked tables with identity columns, and I configure the Analysis Services DSV using those tables such that only a table scan of a small table is required for every attribute. However, I also ensure that the “key” table (or bottom level) of the dimension is a star-schema table for relational reporting and as such also contains the attributes from the related tables.

For example, given a customer dimension with customer as the bottom level, related to group and organisation, I would create an organisation table and a customer group table to capture the discrete entities and generate a surrogate key for each. The customer table, however, will be a star-schema table, generating a surrogate key for each customer but also containing the attributes of the related groups and organisations.

Customer Relationships
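To make that concrete, here is a minimal sketch of what such a schema might look like. All of the table and column names (DimOrganisation, DimCustomerGroup, DimCustomer and so on) are purely illustrative assumptions for this post, not taken from any particular system:

-- Snow-flaked attribute tables, one per related attribute, each generating an
-- integer surrogate key via an identity column.
CREATE TABLE dbo.DimOrganisation
(
    OrganisationKey   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    OrganisationName  nvarchar(100)     NOT NULL
);

CREATE TABLE dbo.DimCustomerGroup
(
    CustomerGroupKey   smallint IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerGroupName  nvarchar(100)          NOT NULL
);

-- The "key" (bottom level) table is a star-schema table: it has its own
-- identity surrogate key but also carries the related attributes so that
-- relational reporting can work from a single table.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey        int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerName       nvarchar(200)     NOT NULL,
    CustomerGroupKey   smallint          NOT NULL,
    CustomerGroupName  nvarchar(100)     NOT NULL,
    OrganisationKey    int               NOT NULL,
    OrganisationName   nvarchar(100)     NOT NULL
);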

The main “trick” here is to ensure that every snow-flaked table for each attribute uses an identity column for its surrogate key. You should also endeavour to choose an integer type that best matches the expected maximum number of entries, be it a smallint (2 byte signed integer), int (4 byte signed integer) or bigint (8 byte signed integer). Analysis Services is no longer partial to the tinyint data type though, so best steer clear of that.

This next part isn’t really required, though it will make your life easier in the long run as well as being good practice. Make sure that you have foreign keys set up between all of the snow-flaked tables, and between the star-schema table and the snow-flaked tables. Essentially “document” the relationship between all related attributes by means of foreign keys. The reason for this will become clear later.
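Continuing the illustrative schema above, the foreign keys might be declared along these lines:

-- "Document" the attribute relationships with foreign keys; a later procedure
-- will use these to discover all of the tables behind a dimension.
ALTER TABLE dbo.DimCustomer
    ADD CONSTRAINT FK_DimCustomer_DimCustomerGroup
        FOREIGN KEY (CustomerGroupKey)
        REFERENCES dbo.DimCustomerGroup (CustomerGroupKey);

ALTER TABLE dbo.DimCustomer
    ADD CONSTRAINT FK_DimCustomer_DimOrganisation
        FOREIGN KEY (OrganisationKey)
        REFERENCES dbo.DimOrganisation (OrganisationKey);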

So we now have our dimension configured such that we have integer surrogate keys for every attribute in our dimension and we have both star-schema and snow-flaked tables. We could go ahead and build an Analysis Services dimension from this right now, if we so wished, so we can consider the schema requirements component of the solution to be complete.

Now that we have our dimension schema I’ll attempt to explain how all of this is used. In part 3 I’ll present some actual code that can be used as a basis for a more substantial implementation, but for now I’ll just present a conceptual implementation plan, along with a few illustrative sketches of what each piece might look like. So, here goes…

1. First we need to capture the list of dimensions we are interested in and the associated key table for each dimension. Note that by “key table”, I’m referring to the table that contains the attribute to be used as the key in the dimension. Create a table that captures this information.
2. Next we need to create a table in which we will capture the names of all the tables used by a given dimension (i.e. the key table and all related tables), as well as the previous and current identity values for each table.
3. Populate the table created in point 1 above with the dimensions for which you wish to use Process Add and where the underlying relational schema has been configured as described in the main text. (Example definitions of both tables, and the registration insert, are sketched after this list.)
4. Create a stored procedure that accepts a given dimension name and key table name and, using the information schema (or catalogue) views, identifies all related tables by means of foreign key relationships. Using this information, populate the table created in point 2 above with an arbitrary minimum value and, using the IDENT_CURRENT('') function, populate the maximum value with the current identity value for the table. Note that you must ensure that you retrieve the values for the key table before any of the related tables; this is what provides protection against the race condition discussed earlier.
If you did not create foreign keys between all tables, you will need to statically capture all tables used by a given dimension and maintain this over time as tables are added or dropped.
5. Create another stored procedure that iterates through each entry in the table created in point 1 above and calls the stored procedure created in point 4 above. (A sketch of both procedures also follows the list.)
6. Create a user-defined table-valued function that accepts a dimension name, schema name and table name and returns the minimum and maximum values from the table created in point 2 above for that dimension, schema and table.
7. Update the views used to present data to Analysis Services so that they join to the function created in point 6 above, limiting the output to rows where the surrogate key value is between the minimum and maximum values. (The function and an example view are sketched after the list as well.)
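To illustrate points 1 to 3, the two metadata tables and the registration insert might look something like the following. The object names (ProcessAddDimension, ProcessAddWatermark) are my own inventions for this sketch:

-- Point 1: the dimensions to process incrementally and their key tables.
CREATE TABLE dbo.ProcessAddDimension
(
    DimensionName  sysname NOT NULL PRIMARY KEY,
    KeySchemaName  sysname NOT NULL,
    KeyTableName   sysname NOT NULL
);

-- Point 2: one row per table behind each dimension, holding the surrogate key
-- range (previous and current identity values) captured for it.
CREATE TABLE dbo.ProcessAddWatermark
(
    DimensionName    sysname NOT NULL,
    SchemaName       sysname NOT NULL,
    TableName        sysname NOT NULL,
    MinSurrogateKey  bigint  NOT NULL,
    MaxSurrogateKey  bigint  NOT NULL,
    CONSTRAINT PK_ProcessAddWatermark
        PRIMARY KEY (DimensionName, SchemaName, TableName)
);

-- Point 3: register the dimensions that should use Process Add.
INSERT INTO dbo.ProcessAddDimension (DimensionName, KeySchemaName, KeyTableName)
VALUES (N'Customer', N'dbo', N'DimCustomer');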
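Points 4 and 5 might then be sketched as below. This version walks the catalogue view sys.foreign_keys rather than the INFORMATION_SCHEMA views, only follows one level of foreign keys and assumes identity seeds of 1 for the arbitrary minimum, so treat it as a starting point rather than a finished implementation. It does, however, capture the key table’s identity value before those of the related tables, and it applies the “minimum = previous maximum + 1” rule described further down:

CREATE PROCEDURE dbo.CaptureDimensionWatermarks
    @DimensionName sysname,
    @KeySchemaName sysname,
    @KeyTableName  sysname
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @captured TABLE (SchemaName sysname, TableName sysname, CurrentIdentity bigint);

    -- Capture the key table's current identity value FIRST; this ordering is
    -- what protects against the race condition discussed earlier.
    INSERT INTO @captured (SchemaName, TableName, CurrentIdentity)
    SELECT @KeySchemaName,
           @KeyTableName,
           IDENT_CURRENT(QUOTENAME(@KeySchemaName) + N'.' + QUOTENAME(@KeyTableName));

    -- Then capture the tables the key table references via foreign keys
    -- (one level only here; a deeper snowflake would need a recursive walk).
    INSERT INTO @captured (SchemaName, TableName, CurrentIdentity)
    SELECT DISTINCT
           OBJECT_SCHEMA_NAME(fk.referenced_object_id),
           OBJECT_NAME(fk.referenced_object_id),
           IDENT_CURRENT(QUOTENAME(OBJECT_SCHEMA_NAME(fk.referenced_object_id))
                         + N'.' + QUOTENAME(OBJECT_NAME(fk.referenced_object_id)))
    FROM sys.foreign_keys AS fk
    WHERE fk.parent_object_id = OBJECT_ID(QUOTENAME(@KeySchemaName) + N'.' + QUOTENAME(@KeyTableName));

    -- First run: arbitrary minimum (1, assuming the identity seed is 1) up to
    -- the current identity value. Subsequent runs: previous maximum + 1 up to
    -- the current identity value.
    MERGE dbo.ProcessAddWatermark AS tgt
    USING (SELECT @DimensionName AS DimensionName, SchemaName, TableName, CurrentIdentity
           FROM @captured) AS src
       ON tgt.DimensionName = src.DimensionName
      AND tgt.SchemaName    = src.SchemaName
      AND tgt.TableName     = src.TableName
    WHEN MATCHED THEN
        UPDATE SET MinSurrogateKey = tgt.MaxSurrogateKey + 1,
                   MaxSurrogateKey = src.CurrentIdentity
    WHEN NOT MATCHED THEN
        INSERT (DimensionName, SchemaName, TableName, MinSurrogateKey, MaxSurrogateKey)
        VALUES (src.DimensionName, src.SchemaName, src.TableName, 1, src.CurrentIdentity);
END
GO

-- Point 5: iterate over the registered dimensions and capture each one.
CREATE PROCEDURE dbo.CaptureAllDimensionWatermarks
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @dimension sysname, @schema sysname, @table sysname;

    DECLARE dim_cursor CURSOR LOCAL FAST_FORWARD FOR
        SELECT DimensionName, KeySchemaName, KeyTableName
        FROM dbo.ProcessAddDimension;

    OPEN dim_cursor;
    FETCH NEXT FROM dim_cursor INTO @dimension, @schema, @table;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXEC dbo.CaptureDimensionWatermarks @dimension, @schema, @table;
        FETCH NEXT FROM dim_cursor INTO @dimension, @schema, @table;
    END

    CLOSE dim_cursor;
    DEALLOCATE dim_cursor;
END
GO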
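Finally, for points 6 and 7, the inline table-valued function and an example view over the illustrative DimCustomer table could look like this:

-- Point 6: return the captured surrogate key range for a given dimension/table.
CREATE FUNCTION dbo.fn_GetSurrogateKeyRange
(
    @DimensionName sysname,
    @SchemaName    sysname,
    @TableName     sysname
)
RETURNS TABLE
AS
RETURN
(
    SELECT MinSurrogateKey, MaxSurrogateKey
    FROM dbo.ProcessAddWatermark
    WHERE DimensionName = @DimensionName
      AND SchemaName    = @SchemaName
      AND TableName     = @TableName
);
GO

-- Point 7: the view Analysis Services is bound to only exposes the rows that
-- fall inside the most recently captured surrogate key range.
CREATE VIEW dbo.vwDimCustomer
AS
SELECT c.CustomerKey,
       c.CustomerName,
       c.CustomerGroupKey,
       c.OrganisationKey
FROM dbo.DimCustomer AS c
CROSS APPLY dbo.fn_GetSurrogateKeyRange(N'Customer', N'dbo', N'DimCustomer') AS r
WHERE c.CustomerKey BETWEEN r.MinSurrogateKey AND r.MaxSurrogateKey;
GO

You would create a similar view over each of the snow-flaked tables (DimCustomerGroup, DimOrganisation) so that every attribute in the DSV is constrained to the same snapshot.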

And that’s it really, or the basics of it anyway. Clearly, in point 4 above, there is some work required to set an arbitrary minimum value only the first time the table is populated and thereafter to set the minimum value to the previous maximum value + 1. Where the minimum value ends up greater than the maximum value, that is an indication that no rows have been added to the table since it was last processed.

The final step in the solution, of course, is to ensure that the stored procedure created in point 5 is always called before Analysis Services processing commences. In order to perform a full process of the database from SSMS, say, you just need to reset the table created in point 2 (e.g. truncate the table and re-populate it with the arbitrary minimum value and the current identity value for each table), as shown below.
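For example, assuming the illustrative objects from the earlier sketches, the reset could be as simple as:

-- Clear the captured ranges and re-capture; the insert branch of the capture
-- procedure then repopulates each table's entry with the arbitrary minimum
-- and the current identity value, so the views expose every row again.
TRUNCATE TABLE dbo.ProcessAddWatermark;
EXEC dbo.CaptureAllDimensionWatermarks;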

And there you go, easy-peasy eh?

Please find part 3 here.