The BOM started out as an encoding trick. It was not originally part of Unicode; it was something that people discovered and cleverly (ab)used.
Basically, they found the U+FEFF ZERO WIDTH NO-BREAK SPACE character. Now, what does a space character that has a width of zero and does not induce a line break do at the very beginning of a document? Well, absolutely nothing! Adding a U+FEFF ZWNBSP to the beginning of your document will not change anything about how that document is rendered.
And they also found that the code point U+FFFE (which is what you would get if you decoded the character's UTF-16 octets "the wrong way round") was not assigned. (The value 0xFFFE0000, which is what you would get from reading UTF-32 the wrong way round, is not even a code point: code points are at most 21 bits long, so they can never go beyond U+10FFFF.)
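A quick sketch in Python makes the "wrong way round" decode concrete (the codec names are Python's; this is just an illustration, not part of any standard):

```python
# Encode U+FEFF as UTF-16 little-endian, then decode those same two
# octets as big-endian -- i.e. read them "the wrong way round".
bom_bytes = "\ufeff".encode("utf-16-le")   # b'\xff\xfe'
wrong_way = bom_bytes.decode("utf-16-be")  # octet order now perceived reversed
print(hex(ord(wrong_way)))                 # 0xfffe -- permanently unassigned
```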
So, what this means is that when you add U+FEFF to the beginning of your UTF-16 (or UTF-32) encoded document, then:
- If you read it back with the correct Byte Order, it does nothing.
- If you read it back with the wrong Byte Order, it is an unassigned code point (or, in the UTF-32 case, not a code point at all).
Therefore, it allows you to add a code point to the beginning of the document to detect the Byte Order in a way that works 100% of the time and does not alter the meaning of your document. It is also 100% backwards-compatible. In fact, it is more than backwards-compatible: it actually works as designed even with software that doesn't know about this trick!
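This is exactly how BOM sniffing works in practice. Here is a minimal sketch (the function name `detect_byte_order` is my own invention, not from any standard library):

```python
def detect_byte_order(data: bytes) -> str:
    """Guess the encoding of a Unicode document from its leading BOM octets."""
    # UTF-32 must be checked first: its little-endian BOM begins with the
    # same two octets (FF FE) as the UTF-16 little-endian BOM.
    if data[:4] == b'\x00\x00\xfe\xff':
        return "utf-32-be"
    if data[:4] == b'\xff\xfe\x00\x00':
        return "utf-32-le"
    if data[:2] == b'\xfe\xff':
        return "utf-16-be"
    if data[:2] == b'\xff\xfe':
        return "utf-16-le"
    return "unknown"

doc = b'\xfe\xff' + "Hi".encode("utf-16-be")  # BOM + big-endian payload
print(detect_byte_order(doc))                 # utf-16-be
```

(As a sketch it ignores one known ambiguity: a BOM-less UTF-16-LE document whose first character is U+0000 looks like a UTF-32-LE BOM. Real detectors use heuristics or metadata for such corner cases.)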
It was only later, after this trick had been widely used for many years, that the Unicode Consortium made it official, in three ways:
- They explicitly specified that the code point U+FEFF as the first code point of a document is a Byte Order Mark. When not at the beginning of the document, it is still a ZWNBSP.
- They explicitly specified that U+FFFE will never be assigned, and is reserved for Byte Order Detection.
- They deprecated U+FEFF ZWNBSP in favor of U+2060 WORD JOINER. New documents that have a U+FEFF somewhere in the document other than as the first code point should no longer be created. U+2060 should be used instead.
So, the reason the Byte Order Mark is a valid Unicode character and not some kind of special flag is that it was introduced with maximum backwards-compatibility as a clever hack: adding a BOM to a document will never change it, and you don't need to do anything to add BOM detection to existing software. If the software can open the document at all, the byte order is correct, and if the byte order is incorrect, it is guaranteed to fail.
If, on the other hand, you tried to add a special one-octet signature, then all UTF-16 and UTF-32 reader software in the entire world would have to be updated to recognize and process it. Software that does not know about the signature will simply try to decode the signature as the first octet of the first code point, the first octet of the first code point as the second octet of the first code point, and so on, decoding the entire document shifted by one octet. In other words: adding such a signature would completely destroy any document, unless every single piece of software in the entire world that deals with Unicode were updated before the first document with the signature got produced.
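To illustrate the shift, here is a hypothetical sketch: prepend an invented one-octet signature `0xA5` to a UTF-16-BE document and decode it with a reader that knows nothing about the signature:

```python
text = "Hello"
encoded = text.encode("utf-16-be")  # 00 48 00 65 00 6c 00 6c 00 6f
signed = b'\xa5' + encoded          # hypothetical one-octet signature

# A signature-unaware reader pairs the octets starting at the signature,
# so every 16-bit code unit is shifted by one octet. (We drop the dangling
# final octet so the length stays even and the decode doesn't just error.)
garbled = signed[:-1].decode("utf-16-be")
print(garbled)  # five unrelated characters -- nothing like "Hello"
```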
However, going back to the very beginning, and to your original question:
> Why does the BOM consist of two bytes
It seems that you have a fundamental misunderstanding here: the BOM does not consist of two bytes. It consists of one character.
It's just that in UTF-16, each code point in the Basic Multilingual Plane gets encoded as two octets. (To be fully precise: a byte does not have to be 8 bits wide, so we should talk about octets here, not bytes.) Note that in UTF-32, for example, the BOM is not 2 octets but 4 (FF FE 00 00 in little-endian), again, because that's just how code points are encoded in UTF-32.
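Just to make the octet counts concrete (using Python's codec names; these explicit-endian codecs do not prepend a BOM of their own, so what you see is purely the encoding of U+FEFF itself):

```python
bom = "\ufeff"
print(bom.encode("utf-16-be"))  # b'\xfe\xff'          -- 2 octets
print(bom.encode("utf-16-le"))  # b'\xff\xfe'
print(bom.encode("utf-32-be"))  # b'\x00\x00\xfe\xff'  -- 4 octets
print(bom.encode("utf-32-le"))  # b'\xff\xfe\x00\x00'
print(bom.encode("utf-8"))      # b'\xef\xbb\xbf'      -- 3 octets in UTF-8
```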