Historically AntiXSS has had problems with surrogates (go on, make the baby jokes, I’ll wait). Unicode surrogates are a way of combining two 16-bit characters so that UTF-16 can represent code points beyond 0xFFFF. Characters (or more accurately code points) between 0x0000 and 0xFFFF make up the Basic Multilingual Plane (BMP); however, the code points and tables within the BMP are pretty much all used up, so how do you get beyond this? Any code point beyond 0xFFFF is broken down into two characters, a high surrogate (which lies between 0xD800 and 0xDBFF) and a low surrogate (between 0xDC00 and 0xDFFF). In UTF-16 a high surrogate must always be followed by a low surrogate, and the values are then combined to map to a code point outside the BMP using the following formula:

0x10000 + (High Surrogate value - 0xD800) × 0x400 + (Low Surrogate value - 0xDC00)

For example, the high surrogate 0xD800 followed by the low surrogate 0xDC00 maps to the code point 0x10000, which is LINEAR B SYLLABLE B008 A.
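If you want to check the arithmetic yourself, here is a minimal sketch of the formula. It is written in Python purely for illustration (it is not AntiXSS code), it works on plain integer code units, and the function names `combine_surrogates` and `split_code_point` are made up for this example.

```python
HIGH_START, HIGH_END = 0xD800, 0xDBFF  # high (leading) surrogate range
LOW_START, LOW_END = 0xDC00, 0xDFFF    # low (trailing) surrogate range


def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair to the code point it represents."""
    if not HIGH_START <= high <= HIGH_END:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not LOW_START <= low <= LOW_END:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each high surrogate covers a block of 0x400 code points.
    return 0x10000 + (high - HIGH_START) * 0x400 + (low - LOW_START)


def split_code_point(code_point):
    """Inverse operation: break a code point above 0xFFFF into a pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError(f"outside the supplementary range: {code_point:#x}")
    offset = code_point - 0x10000
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


print(hex(combine_surrogates(0xD800, 0xDC00)))
print([hex(unit) for unit in split_code_point(0x1F600)])
```

The first call is the example from the text; the second goes the other way, splitting a code point outside the BMP back into its high and low surrogates.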

When AntiXSS is configured it creates an array of all the encoded values, after which it changes the array contents for each safe-listed value to null, so for basic UTF-16 we have a jagged char array, char[0xFFFF][]. The maximum length for the second dimension is 8 characters, to support the thetasym named entity. Obviously this has an impact on memory (roughly 1MB), but the gain in speed from only calculating once more than makes up for it. If I extended that array to support the full UTF-32 range, the memory footprint would suddenly leap to more than 25MB.
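To make the table idea concrete, here is a rough sketch of a precomputed lookup with safe-listed entries nulled out. It is Python rather than the library's C#, and it is not the actual AntiXSS implementation: `encode_entity`, `build_table` and `encode_bmp` are hypothetical names, the safe list is an arbitrary choice, and the sketch only emits numeric character references rather than named entities.

```python
import string


def encode_entity(code_point):
    """Hypothetical encoder: always emits a decimal numeric reference."""
    return f"&#{code_point};"


def build_table(safe_code_points):
    """Precompute the encoded form of every BMP code point once.

    Safe-listed entries are then set to None, meaning "emit unchanged".
    """
    table = [encode_entity(cp) for cp in range(0x10000)]
    for cp in safe_code_points:
        table[cp] = None
    return table


# Assumed safe list for the example: ASCII letters, digits and space.
SAFE = {ord(c) for c in string.ascii_letters + string.digits + " "}
TABLE = build_table(SAFE)


def encode_bmp(text):
    """Encode text made up of BMP characters only, using one lookup each."""
    out = []
    for ch in text:
        entry = TABLE[ord(ch)]  # IndexError if ch is outside the BMP
        out.append(ch if entry is None else entry)
    return "".join(out)


print(encode_bmp("a < b"))
```

The trade-off is the one described above: the table costs memory up front, but encoding a character afterwards is a single index operation.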

So what will happen in the next drop is that the code will combine the surrogate pairs and then calculate the character value each time, which is safer than just leaving the characters as they are. This does mean you cannot mark any of the code tables outside of the base plane as safe, as the memory impact would just be too high.

If AntiXSS encounters a high surrogate not followed by a low surrogate, or a low surrogate not preceded by a high surrogate, it will throw an InvalidSurrogatePairException. This does mean that if you’re blindly truncating or concatenating strings which could contain pairs and then encoding them, you need to be aware that you may chop or create an invalid surrogate pair.
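Here is a small sketch of how a blind truncation ends up producing an invalid pair. Again this is Python for illustration, not AntiXSS itself: `InvalidSurrogatePairError` is a stand-in for the exception named above, and `utf16_units` and `code_points` are hypothetical helpers. Because Python strings are not sequences of UTF-16 code units, the sketch converts to a list of 16-bit integers first.

```python
class InvalidSurrogatePairError(ValueError):
    """Stand-in for the InvalidSurrogatePairException described above."""


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def code_points(units):
    """Combine surrogate pairs, rejecting any unpaired surrogate."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise InvalidSurrogatePairError(f"unpaired high surrogate at {i}")
            low = units[i + 1]
            result.append(0x10000 + (unit - 0xD800) * 0x400 + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no high surrogate in front of it.
            raise InvalidSurrogatePairError(f"unpaired low surrogate at {i}")
        else:
            result.append(unit)
            i += 1
    return result


units = utf16_units("a\U00010000")   # three code units: 'a', high, low
print([hex(cp) for cp in code_points(units)])

try:
    code_points(units[:2])           # truncation chops the pair in half
except InvalidSurrogatePairError as error:
    print("rejected:", error)
```

The truncated list ends on a high surrogate with nothing after it, which is exactly the case that gets rejected. Concatenation can cause the mirror-image problem when the second string starts with a stray low surrogate.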

I’m tempted to refer to this as the James Hart release as it was him who expressed a need for all of this!
