binary – Why does the BOM consist of two bytes instead of one, for example in the UTF-16 encoding?

The BOM started out as an encoding trick. It was not part of Unicode, it was something that people discovered and cleverly (ab)used.

Basically, they found the U+FEFF Zero Width No-Break Space character. Now, what does a space character that has a width of zero and does not induce a line break do at the very beginning of a document? Well, absolutely nothing! Adding a U+FEFF ZWNBSP to the beginning of your document will not change anything about how that document is rendered.

And they also found that the code point U+FFFE (which is what you would decode U+FEFF as, if you decoded UTF-16 “the wrong way round”) was not assigned. (U+FFFE0000, which is what you would get from reading UTF-32 the wrong way round, is simply illegal: code points can be at most 21 bits long.)

So, what this means is that when you add U+FEFF to the beginning of your UTF-16 (or UTF-32) encoded document, then:

  • If you read it back with the correct Byte Order, it does nothing.
  • If you read it back with the wrong Byte Order, it is a non-existing character (or not a code point at all).

Therefore, it allows you to add a code point to the beginning of the document to detect the Byte Order in a way that works 100% of the time and does not alter the meaning of your document. It is also 100% backwards-compatible. In fact, it is more than backwards-compatible: it actually works as designed even with software that doesn’t know about this trick!
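
In code, the detection this enables is trivial. Here is a minimal TypeScript sketch (the detectUtf16ByteOrder helper is made up for illustration; it is not part of any standard API):

    // Peek at the first two octets of a stream that claims to be UTF-16.
    function detectUtf16ByteOrder(bytes: Uint8Array): "be" | "le" | "unknown" {
      if (bytes.length >= 2) {
        if (bytes[0] === 0xfe && bytes[1] === 0xff) return "be"; // U+FEFF read big-endian
        if (bytes[0] === 0xff && bytes[1] === 0xfe) return "le"; // U+FEFF read little-endian
      }
      return "unknown"; // no BOM: fall back to a default or to external metadata
    }

A reader that knows the trick branches on this result and skips the first code point; one that does not know it just sees a ZWNBSP at the start of the document, which is exactly the backwards-compatibility described above.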

It was only later, after this trick had been widely used for many years, that the Unicode Consortium made it official, in three ways:

  • They explicitly specified that the code point U+FEFF as the first code point of a document is a Byte Order Mark. When not at the beginning of the document, it is still a ZWNBSP.
  • They explicitly specified that U+FFFE will never be assigned, and is reserved for Byte Order Detection.
  • They deprecated U+FEFF ZWNBSP in favor of U+2060 Word Joiner. New documents that have a U+FEFF somewhere in the document other than as the first code point should no longer be created. U+2060 should be used instead.

So, the reason why the Byte Order Mark is a valid Unicode character and not some kind of special flag is that it was introduced with maximum backwards-compatibility as a clever hack: adding a BOM to a document will never change it, and you don’t need to do anything to add BOM detection to existing software. If it can open the document at all, the byte order is correct, and if the byte order is incorrect, it is guaranteed to fail.

If, on the other hand, you try to add a special one-octet signature, then all UTF-16 and UTF-32 reader software in the entire world has to be updated to recognize and process this signature. Because if the software does not know about this signature, it will simply try to decode the signature as the first octet of the first code point, the first octet of the first code point as the second octet of the first code point, and so on, decoding the entire document shifted by one octet. In other words: adding such a signature would completely destroy any document, unless every single piece of software in the entire world that deals with Unicode were updated before the first document with the signature was produced.
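
To make the shift concrete, here is a small TypeScript sketch (the one-octet signature 0xAA is invented purely for this example) showing what an unaware decoder would do:

    // "Hi" encoded as UTF-16LE: 48 00 69 00
    const doc = new Uint8Array([0x48, 0x00, 0x69, 0x00]);

    // Prepend a hypothetical one-octet signature.
    const signed = new Uint8Array([0xaa, ...doc]);

    // An unaware decoder now pairs the octets one position off:
    // AA 48, 00 69, 00 ...; every character in the document changes.
    console.log(new TextDecoder("utf-16le").decode(doc));    // "Hi"
    console.log(new TextDecoder("utf-16le").decode(signed)); // garbage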

However, going back to the very beginning, and to your original question:

Why does the BOM consist of two bytes

It seems that you have a fundamental misunderstanding here: the BOM does not consist of two bytes. It consists of one character.

It’s just that in UTF-16, code points in the Basic Multilingual Plane (which includes U+FEFF) get encoded as two octets. (To be fully precise: a byte does not have to be 8 bits wide, so we should talk about octets here, not bytes.) Note that in UTF-32, for example, the BOM is not 2 octets but 4 (00 00 FE FF or FF FE 00 00), again because that’s just how code points are encoded in UTF-32.
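
If you want to see this concretely, here is a small TypeScript sketch (bomOctets is a made-up helper) that serializes the single code point U+FEFF under each encoding; the number of octets is purely a property of the encoding, not of the BOM:

    // Serialize U+FEFF as UTF-16 or UTF-32 with a chosen byte order.
    function bomOctets(bits: 16 | 32, littleEndian: boolean): Uint8Array {
      const buffer = new ArrayBuffer(bits / 8);
      const view = new DataView(buffer);
      if (bits === 16) view.setUint16(0, 0xfeff, littleEndian);
      else view.setUint32(0, 0xfeff, littleEndian);
      return new Uint8Array(buffer);
    }

    // bomOctets(16, false) -> FE FF          bomOctets(16, true) -> FF FE
    // bomOctets(32, false) -> 00 00 FE FF    bomOctets(32, true) -> FF FE 00 00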

web browser – A runtime sometimes converts string arguments (or string returns) from WTF-16 to UTF-16 between functions in a call stack. Is this a security concern?

For example, suppose we have this code (in TypeScript syntax):

function one(str: string): string {
  // do something with the string
  return str
}

function two() {
  let s = getSomeString() // returns some unknown string that may contain surrogates
  s = one(s)
  // ...
}

two()

Now suppose that when passing the string s into the one(s) call, the runtime (not the implementation of one or two) will sometimes replace part of the string. In particular, this happens when the string contains WTF-16 “isolated surrogates” (unpaired surrogate code units), but the main idea here is that this is not common and most developers will not be aware that it happens… until it happens.
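
For a concrete picture of the kind of replacement I mean, here is a TypeScript sketch that emulates the sanitization with a UTF-8 round trip (this is only an approximation for illustration, not the mechanism the proposal would actually use): each isolated surrogate comes back as U+FFFD.

    const original = "abc\uD800def"; // contains an isolated (unpaired) high surrogate

    // TextEncoder/TextDecoder replace lone surrogates with U+FFFD on the way through UTF-8.
    const sanitized = new TextDecoder().decode(new TextEncoder().encode(original));

    console.log(original === sanitized); // false
    console.log(sanitized);              // "abc\uFFFDdef"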

Now, suppose this never happened before, but the runtime recently added this format conversion between function calls (converting function arguments or their return values without the implementation of any function having a choice in the matter).

In such a runtime, could there be a security issue after the runtime has switched from never changing strings to now doing it sometimes? If so, what could happen?


In particular, the runtime I am thinking about is JavaScript interfacing with WebAssembly in the browser, in the potentially near future, if the new “Interface Types” proposal passes a vote for string passing to have WTF-to-UTF sanitization.

What I could imagine, for example, is a third-party library updating the implementation of one or more of its functions from JavaScript to WebAssembly, which in turn causes strings passed from JS to WebAssembly (or vice versa) to be modified from their original form, causing unexpected errors or, in the worst case, a vulnerability.

Is there a potential problem here?

Problems with C# and UTF-16

The scenario is the following… I am exchanging data with a websocket, but for one of the messages I send I get no response.

The message contains the following, "\x02\x00", but when compiled it becomes "\u0002".

It seems the websocket does not understand this encoding. Is that possible?

Here is the complete code:

    private static void WsData_MessageReceived(object sender, MessageReceivedEventArgs e)
    {
        var msg = e.Message;

        Form1.Log("Mensagem Data Recebida: " + msg);

        if (msg.StartsWith("100"))
        {
            Thread.Sleep(200);
            EnviarMsg("x16x00" + "CONFIG_33_0,OVInPlay_33_0x01", "data");               
        }
        else if (msg.Substring(1,6) == "__time")
        {
            EnviarMsg("x02x00" + "commandx01nstx01" + authToken + "x02SPTBK", "data");
        }

        if (msg.Contains("AD"))
        {
            if (msg.Contains("CONFIG_"))
            {

            }
        }
       
    }

The message-sending code:

    public static void EnviarMsg(string msg, string data)
    {
        var sckt = socketData;
        if (data == "handshake") { sckt = socketHandshake; }

        if (sckt.State == WebSocket4Net.WebSocketState.Open)
        {
            
            sckt.Send(msg);
            Mensagens.Add(msg);
            Form1.Log("Mensagem Enviada: (" + data.ToString() + ")" + msg);
        }
        else
        {

        }
    }

It is worth mentioning that the same algorithm works in Python… because of the b' before the string, which makes the str be read as bytes, am I right? I have already exhausted my attempts in C# to get something similar. Can you shed some light?

c++ – Converting from UTF-16 to UTF-8 manually

I have created a function that converts from UTF-16 to UTF-8.
The function first converts from UTF-16 to a code point, then from the code point to UTF-8.

#include <iostream>

using std::cout;
using std::endl;

void ToUTF8(char16_t *str) {
    while (*str) {
        unsigned int codepoint = 0x0;

        //-------(1) UTF-16 to codepoint -------

        if (*str <= 0xD7FF || *str >= 0xE000) {
            // BMP code point encoded as a single code unit.
            codepoint = *str;
            str++;
        } else if (*str <= 0xDBFF) {
            // High surrogate; assumes a valid low surrogate follows.
            unsigned int highSurrogate = (*str - 0xD800) * 0x400;
            unsigned int lowSurrogate = *(str + 1) - 0xDC00;
            codepoint = (lowSurrogate | highSurrogate) + 0x10000;
            str += 2;
        } else {
            // Unpaired low surrogate: skip it so the loop keeps advancing.
            str++;
            continue;
        }

        //-------(2) Codepoint to UTF-8 -------

        if (codepoint <= 0x007F) {
            unsigned char hex[2] = { 0 };
            hex[0] = (char)codepoint;
            hex[1] = 0;
            cout << std::hex << std::uppercase << "(1Byte) " << (unsigned short)hex[0] << endl;
        } else if (codepoint <= 0x07FF) {
            unsigned char hex[3] = { 0 };
            hex[0] = ((codepoint >> 6) & 0x1F) | 0xC0;
            hex[1] = (codepoint & 0x3F) | 0x80;
            hex[2] = 0;
            cout << std::hex << std::uppercase << "(2Bytes) " << (unsigned short)hex[0] << "-" << (unsigned short)hex[1] << endl;
        } else if (codepoint <= 0xFFFF) {
            unsigned char hex[4] = { 0 };
            hex[0] = ((codepoint >> 12) & 0x0F) | 0xE0;
            hex[1] = ((codepoint >> 6) & 0x3F) | 0x80;
            hex[2] = ((codepoint) & 0x3F) | 0x80;
            hex[3] = 0;
            cout << std::hex << std::uppercase << "(3Bytes) " << (unsigned short)hex[0] << "-" << (unsigned short)hex[1] << "-" << (unsigned short)hex[2] << endl;
        } else if (codepoint <= 0x10FFFF) {
            unsigned char hex[5] = { 0 };
            hex[0] = ((codepoint >> 18) & 0x07) | 0xF0;
            hex[1] = ((codepoint >> 12) & 0x3F) | 0x80;
            hex[2] = ((codepoint >> 6) & 0x3F) | 0x80;
            hex[3] = ((codepoint) & 0x3F) | 0x80;
            hex[4] = 0;
            cout << std::hex << std::uppercase << "(4Bytes) " << (unsigned short)hex[0] << "-" << (unsigned short)hex[1] << "-" << (unsigned short)hex[2] << "-" << (unsigned short)hex[3] << endl;
        }
    }
}

Also, you can compile and test the code from here.

What do you think about this function in terms of performance and ease of use?

Encoding – Notepad++: save as a UTF-16 file without a byte order mark

Is there a way to save a file in Notepad++ using UTF-16 (little-endian) encoding, but without adding the byte order mark? For example, if a text file is saved in Notepad++ using the little-endian UTF-16 encoding (Encoding > UCS-2 LE BOM), it will have the bytes FF FE prepended to it, which I would like to eliminate without having to remove them manually.

If there is no way to do this by default, is there any way I can create an encoding for Notepad++ that is the same as the UCS-2 LE BOM option, only without the byte order mark?