I guess… I guess I am wrong…

Here is what happens when an API guesses which character encoding a text file uses:

  1. Open Notepad
  2. Type the text “this app can break” (without quotes)
  3. Save the file
  4. Re-open the file in Notepad. On affected Windows versions, you see unreadable characters instead of the text you typed.
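The effect can be reproduced without Notepad. The following is a minimal sketch, assuming the file was saved as plain single-byte ASCII with no byte order mark and was then read back as UTF-16 little-endian, which is the wrong guess described in this post. The variable names are only illustrative.

```python
# The 18 ASCII bytes that end up in the saved file.
text = "this app can break"
raw = text.encode("ascii")

# Assumption: the reader wrongly treats the bytes as UTF-16 little-endian,
# so every pair of ASCII bytes is fused into one 16-bit code unit.
misread = raw.decode("utf-16-le")

print(len(raw), "bytes become", len(misread), "characters")
print(misread)
print([hex(ord(ch)) for ch in misread])
```

Every second byte is a lower-case letter (0x61 to 0x7A), and that byte becomes the high byte of a 16-bit code unit, so all nine resulting characters fall into the CJK ideograph range. No data is lost: encoding the misread string back to UTF-16 little-endian returns the original bytes.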

And the explanation: Notepad uses a Windows API function called “IsTextUnicode()”. The MSDN Library article on this function says:

“This function uses various statistical and deterministic methods to make its determination, under the control of flags passed in the lpi parameter. When the function returns, the results of such tests are reported using the same parameter.

The IS_TEXT_UNICODE_STATISTICS and IS_TEXT_UNICODE_REVERSE_STATISTICS tests use statistical analysis. These tests are not foolproof. The statistical tests assume certain amounts of variation between low and high bytes in a string, and some ASCII strings can slip through. For example, if lpBuffer indicates the ASCII string 0x41, 0x0A, 0x0D, 0x1D (A\n\r^Z), the string passes the IS_TEXT_UNICODE_STATISTICS test, although failure would be preferable.”
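To see what the function says about a given buffer, it can be called directly. The following is a rough sketch using Python's ctypes, assuming a Windows machine where IsTextUnicode is exported by advapi32.dll. The wrapper name is_text_unicode is hypothetical, and the sketch passes NULL for the lpi parameter, which asks the function to run all of its tests. The verdict may differ between Windows versions, so no expected result is shown.

```python
import ctypes
import sys


def is_text_unicode(data: bytes):
    """Return IsTextUnicode's verdict for data, or None when not on Windows."""
    if sys.platform != "win32":
        return None  # the API only exists on Windows

    advapi32 = ctypes.WinDLL("advapi32")
    func = advapi32.IsTextUnicode
    # BOOL IsTextUnicode(const VOID *lpv, int iSize, LPINT lpiResult)
    func.argtypes = [ctypes.c_char_p, ctypes.c_int, ctypes.POINTER(ctypes.c_int)]
    func.restype = ctypes.c_int

    # Passing None sends NULL for lpiResult, so all available tests are used.
    return bool(func(data, len(data), None))


verdict = is_text_unicode(b"this app can break")
print("IsTextUnicode verdict:", verdict)
```

A result of True for these plain ASCII bytes would be the misdetection described here; None only means the sketch was run on a system without the API.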

Put simply, the function can fail to detect the correct character encoding. The blog Aftermarket Pipes tells us:

“It actually runs a couple of heuristics over the first 256 bytes of the data and provides its best guess. As it turns out, these tests aren’t terribly reliable for very short ASCII strings that contain an even number of lower-case letters, like “this app can break”, …”

Source 1: http://msdn.microsoft.com/library/en-us/intl/unicode_81np.asp?frame=true
Source 2: http://apipes.blogspot.com/2006/06/this-api-can-break.html
