Friday 22 July 2011

SQL Server - Identifying Control Characters


Recently a query against our Analysis Services database failed with an error about invalid XML characters and thus prompted me to start the investigation as to which members contained these. I know that in Analysis Services you can set how you want to handle invalid XML characters but as we shouldn’t receive these I'm using the defaults and I don’t want to use “replace” etc.

Interestingly though, during my investigation I came across some unexpected behaviour with regards the 0x00 control code. At that point I just needed to identify the culprit members and prevent it from entering the database in the future, so I left that peculiarity and wrote the following function:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

IF EXISTS (SELECT * FROM sys.objects WHERE [object_id] = OBJECT_ID(N'[dbo].[udf_string_contains_control_character]') AND [type] IN (N'FN', N'IF', N'TF', N'FS', N'FT'))
BEGIN
  DROP FUNCTION [dbo].[udf_string_contains_control_character];
END;
GO

CREATE FUNCTION [dbo].[udf_find_name_with_control_character] (@i_InputText NVARCHAR(MAX))
RETURNS INT
AS
BEGIN
  RETURN ( SELECT CASE WHEN (-- First control code set (i.e. 1 to 31)
                             PATINDEX('%[' + CHAR(0x01) + '-' + CHAR(0x1F) + ']%', @i_InputText COLLATE DATABASE_DEFAULT) +
                             -- Control character 127 (i.e. DEL)
                             PATINDEX('%[' + CHAR(0x7F) + '-' + CHAR(0x7F) + ']%', @i_InputText COLLATE DATABASE_DEFAULT) ) > 0 THEN
                    1
                  ELSE
                    0
                  END );

END;
GO

Clearly the function doesn’t handle every “dodgy” character but it handles some of the more common control characters and can be easily extended.

I then never found the time to get back to the issue with regards the 0x00 control character until now, thanks largely to a question posted on the Analysis Services forum which you can see here: http://social.msdn.microsoft.com/Forums/en/sqlanalysisservices/thread/619a0d73-455c-404f-99ad-bbebfab2cf18)
In this particular case, the error message was:
The server sent an unrecognizable response. _'.', hexadecimal value 0x00, is an invalid character. Line , position .

This prompted me to revisit why, when I used the 0x00 control character in the filter in the above function, the function didn’t behave as I initially expected. Clearly it appears that the control character 0x00 in SQL Server is a little problematic but I didn’t realise just how problematic until I started digging a little deeper.

So how hard can it be to find control characters in SQL? Answer, harder than you might think. So, in SSMS run this in grid view (i.e. CTRL-D) and then in test view (i.e. CTRL-T).

SELECT [test], LEN([test]) AS "len", DATALENGTH([test]) AS "data_length"
FROM ( SELECT 'foo' + CHAR(0x00) + 'bar' AS "test" ) a;

You should hopefully see the differences there. The grid view hides the underlying artefact, though the lengths clearly show something isn't right, but switching to text view shines a bit more light on the issue. So let’s try finding the row using LIKE…

SELECT [test], LEN([test]) AS "len", DATALENGTH([test]) AS "data_length"
FROM ( SELECT 'foo' + CHAR(0x00) + 'bar' AS "test" ) a
WHERE [test] LIKE '%' + CHAR(0x00) + '%';

Er, nope, that doesn’t work. Aha, maybe we need to escape it...

SELECT [test], LEN([test]) AS "len", DATALENGTH([test]) AS "data_length"
FROM ( SELECT 'foo' + CHAR(0x00) + 'bar' AS "test"
      UNION ALL
       SELECT 'bar' AS "test" ) a
WHERE [test] LIKE '%\' + CHAR(0x00) + '%' ESCAPE '\';

Groan, grumble, no that’s still not the right answer. Note how we get both rows. I suspect all of this is due to the fact that C Strings are “Nul” terminated (i.e. \0) so all rows match. I could go on for ages about trying to use REPLACE, PATINDEX, CHARINDEX etc. as I did, but I’ll save you the time, they don’t work.

For example, trying to use REPLACE:

SELECT REPLACE([test], CHAR(0x00), '') AS "test",
       LEN(REPLACE([test], CHAR(0x00), '')) AS "len",
       DATALENGTH(REPLACE([test], CHAR(0x00), '')) AS "data_length"
FROM ( SELECT 'foo' + CHAR(0x00) + 'bar' AS "test" ) a;

I then had one of those eureka moments whilst looking through the list of string functions in Books Online. What happens if I use the QUOTENAME function? I figured that that might do something interesting based on the fact that it is meant to delimit object names, and indeed it does do something interesting. If a column contains the 0x00 control character, wrapping that value with QUOTENAME returns NULL. So, to find rows that contain the 0x00 character in the data as is, I found the following works reliably:

SELECT [test], LEN([test]) AS "len", DATALENGTH([test]) AS "data_length"
FROM ( SELECT 'foo' + CHAR(0x00) + 'bar' AS "test"
      UNION ALL
       SELECT 'bar' AS "test" ) a
WHERE [test] IS NOT NULL AND
      QUOTENAME([test]) IS NULL;

However, this only identifies the culprits. We still haven’t discovered a way to fix the data by removing the control code. We know that we can’t use the standard string functions such as REPLACE etc but what happens if we use XML?

So, using XML we can identify the control character by looking for "�" within the xml fragment.  See what’s returned by the following code:

SELECT [test], LEN([test]) AS "len", DATALENGTH([test]) AS "data_length"
FROM ( SELECT 'foo' + CHAR(0x00) + 'bar' AS "test"
      UNION ALL
       SELECT 'bar' AS "test" ) a
FOR XML AUTO;

Okay, so now that we can identify the control character we should be able to remove it with the standard text functions. Something like the following should do it:

SELECT dta.[xmltest].value('(/a/@test)[1]', 'VARCHAR(20)') AS "test",
       LEN(dta.[xmltest].value('(/a/@test)[1]', 'VARCHAR(20)')) AS "len",
       DATALENGTH(dta.[xmltest].value('(/a/@test)[1]', 'VARCHAR(20)')) AS "data_length",
       dta.[xmltest].value('(/a/@len)[1]', 'INT') AS "original_len",
       dta.[xmltest].value('(/a/@data_length)[1]', 'INT')  AS "original_data_length"
FROM (
SELECT CAST(REPLACE([xmltest], '�', '') AS XML) AS "xmltest"
FROM ( -- Retrieve sample data with lengths etc.
       SELECT [test], LEN([test]) AS "len", DATALENGTH([test]) AS "data_length"
       FROM ( SELECT 'foo' + CHAR(0x00) + 'bar' AS "test"
             UNION ALL
              SELECT 'bar' AS "test" ) a
       -- Don't specify TYPE as it will fail due to the 0x00 character
       -- not allowed in XML. We'll convert it later after removing it.
       FOR XML AUTO ) xmled(xmltest) ) dta;

Clearly you could also write a function to iterate through each character within the string and remove the control character, but where’s the fun in that?

2 comments:

  1. Phillip, this is an excellent post and has helped me significantly.
    - Paul Novelli

    ReplyDelete