binary – Why does the BOM consist of two bytes instead of one, for example in UTF-16?

The BOM started out as an encoding trick. It was not originally part of Unicode; it was something that people discovered and cleverly (ab)used.

Basically, they found the U+FEFF ZERO WIDTH NO-BREAK SPACE (ZWNBSP) character. Now, what does a space character that has a width of zero and does not induce a line break do at the very beginning of a document? Well, absolutely nothing! Adding a U+FEFF ZWNBSP to the beginning of your document will not change anything about how that document is rendered.

They also found that the code point U+FFFE (which is what you would read U+FEFF as if you decoded UTF-16 “the wrong way round”) was not assigned. (0xFFFE0000, which is what you would get from reading UTF-32 the wrong way round, is simply illegal: code points are at most 21 bits long, and the highest one is U+10FFFF.)

So, what this means is that when you add U+FEFF to the beginning of your UTF-16 (or UTF-32) encoded document, then:

  • If you read it back with the correct Byte Order, it does nothing.
  • If you read it back with the wrong Byte Order, it is a non-existing character (or not a code point at all).

Therefore, it allows you to add a code point to the beginning of the document to detect the Byte Order in a way that works 100% of the time and does not alter the meaning of your document. It is also 100% backwards-compatible. In fact, it is more than backwards-compatible: it actually works as designed even with software that doesn’t even know about this trick!
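The detection logic described above can be sketched in a few lines of Python. This is only an illustration, not a full decoder: the function name `detect_utf16_byte_order` is made up for this example, and it only looks at the first 16-bit code unit.

```python
from typing import Optional


def detect_utf16_byte_order(data: bytes) -> Optional[str]:
    """Guess the byte order of UTF-16 data from its first code unit.

    Returns "big", "little", or None when no BOM is present.
    (Hypothetical helper for illustration only.)
    """
    # Read the first two octets as if they were big-endian.
    first_unit = int.from_bytes(data[:2], "big")
    if first_unit == 0xFEFF:
        # Read the right way round: this is the BOM itself.
        return "big"
    if first_unit == 0xFFFE:
        # U+FFFE is never a character, so the octets must be swapped.
        return "little"
    return None


print(detect_utf16_byte_order("\ufeffhi".encode("utf-16-be")))
print(detect_utf16_byte_order("\ufeffhi".encode("utf-16-le")))
print(detect_utf16_byte_order("hi".encode("utf-16-be")))
```

The three calls cover a big-endian document with a BOM, a little-endian document with a BOM, and a document with no BOM at all.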

It was only later, after this trick had been widely used for many years, that the Unicode Consortium made it official, in three ways:

  • They explicitly specified that the code point U+FEFF as the first code point of a document is a Byte Order Mark. When not at the beginning of the document, it is still a ZWNBSP.
  • They explicitly specified that U+FFFE will never be assigned, and is reserved for Byte Order Detection.
  • They deprecated U+FEFF ZWNBSP in favor of U+2060 Word Joiner. New documents that have a U+FEFF somewhere in the document other than as the first code point should no longer be created. U+2060 should be used instead.

So, the reason why the Byte Order Mark is a valid Unicode character and not some kind of special flag, is that it was introduced with maximum backwards-compatibility as a clever hack: adding a BOM to a document will never change it, and you don’t need to do anything to add BOM detection to existing software: if it can open the document at all, the byte order is correct, and if the byte order is incorrect, it is guaranteed to fail.

If, on the other hand, you try to add a special one-octet signature, then all UTF-16 and UTF-32 reader software in the entire world has to be updated to recognize and process this signature. If the software does not know about this signature, it will simply try to decode the signature as the first octet of the first code point, and the first octet of the first code point as the second octet of the first code point, and go on to decode the entire document shifted by one octet. In other words: adding such a one-octet BOM would completely destroy any document, unless every single piece of software in the entire world that deals with Unicode is updated before the first document with such a BOM gets produced.
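To make the one-octet shift concrete, here is a small Python sketch. The signature octet `0x01` is invented for the example (no such signature exists), and `naive_utf16_be_units` is a hypothetical stand-in for a reader that knows nothing about signatures and simply groups octets into big-endian 16-bit units.

```python
from typing import List


def naive_utf16_be_units(data: bytes) -> List[int]:
    """Group octets into big-endian 16-bit code units, ignoring any
    trailing odd octet. Stand-in for a reader unaware of signatures."""
    usable = len(data) - (len(data) % 2)
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, usable, 2)]


payload = "AB".encode("utf-16-be")  # octets: 00 41 00 42

# Hypothetical one-octet signature: every following unit is misaligned.
with_one_octet_signature = b"\x01" + payload
print([hex(u) for u in naive_utf16_be_units(with_one_octet_signature)])

# Real BOM: it is an ordinary code unit, so the text stays aligned.
with_bom = "\ufeffAB".encode("utf-16-be")
print([hex(u) for u in naive_utf16_be_units(with_bom)])
```

With the invented signature, the reader sees the units 0x0100 and 0x4100 instead of 0x0041 and 0x0042, so the text is garbled. With the real BOM it sees 0xFEFF followed by the original units, unchanged.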

However, going back to the very beginning, and to your original question:

Why does the BOM consist of two bytes

It seems that you have a fundamental misunderstanding here: the BOM does not consist of two bytes. It consists of one character.

It’s just that in UTF-16, each code point in the Basic Multilingual Plane (which includes U+FEFF) gets encoded as two octets. (To be fully precise: a byte does not have to be 8 bits wide, so we should talk about octets here, not bytes.) Note that in UTF-32, for example, the BOM is not 2 octets but 4 (00 00 FE FF in big-endian order, or FF FE 00 00 in little-endian order), again because that’s just how code points are encoded in UTF-32.
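A quick way to check this is to encode the single character U+FEFF with Python's built-in codecs and look at the octets. The helper name `bom_octets` is made up for this sketch. The `-be` and `-le` codec variants are used because they do not add a BOM of their own.

```python
import codecs


def bom_octets(encoding: str) -> str:
    """Return the octets of U+FEFF in the given encoding, as a hex string.
    (Hypothetical helper for illustration only.)"""
    return "\ufeff".encode(encoding).hex()


for name in ("utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(name, bom_octets(name))

# The standard library exposes the same octet sequences as constants.
print(codecs.BOM_UTF16_BE == "\ufeff".encode("utf-16-be"))
print(codecs.BOM_UTF32_LE == "\ufeff".encode("utf-32-le"))
```

In every case there is one code point, and the encoding alone determines how many octets it takes.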

np complete – What commands would a P=NP algorithm consist of?

Take the example of the Boolean satisfiability problem. We might have something like:

$$(a \lor b \lor c \lor d)\land (a \lor d \lor e) \land (f \lor b \lor g)$$

The statement (that is, P ≠ NP) is that there is no algorithm acting on a general input that would give the answer in polynomial time with respect to the length of the input.
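For contrast with a polynomial-time algorithm, the obvious baseline is exhaustive search over all $2^k$ truth assignments of the $k$ variables. The Python sketch below applies it to the example formula. The clause representation (lists of `(variable, polarity)` pairs) and the names `FORMULA`, `satisfies` and `brute_force_sat` are assumptions made for this illustration.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

Lit = Tuple[str, bool]   # (variable name, True if the literal is positive)
Clause = List[Lit]

# The example formula: (a or b or c or d) and (a or d or e) and (f or b or g)
FORMULA: List[Clause] = [
    [("a", True), ("b", True), ("c", True), ("d", True)],
    [("a", True), ("d", True), ("e", True)],
    [("f", True), ("b", True), ("g", True)],
]


def satisfies(formula: List[Clause], assignment: Dict[str, bool]) -> bool:
    """True if every clause contains at least one literal that holds."""
    return all(
        any(assignment[var] == polarity for var, polarity in clause)
        for clause in formula
    )


def brute_force_sat(formula: List[Clause]) -> Optional[Dict[str, bool]]:
    """Try all 2^k assignments; return a satisfying one, or None."""
    variables = sorted({var for clause in formula for var, _ in clause})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if satisfies(formula, assignment):
            return assignment
    return None


print(brute_force_sat(FORMULA))
```

The loop runs up to $2^k$ times, which is exponential in the input length. The question is whether any algorithm, in whatever model of computation is used to define it, can avoid that blow-up in general.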

Do we assume that the algorithm is some kind of Turing machine that looks at the input character by character and moves the head one step left or one step right after each move updating some internal state, or is there another way to formulate how the algorithm should be expressed?

algorithms – Sorting $n^2$ numbers which consist of numbers from 1 to $n$

I wish to sort $n^2$ numbers which all come from the set $\{1,2,3,\ldots,n\}$, i.e., duplicates are allowed. I know I can just use merge sort, which has complexity $\mathcal{O}(n^2\log n)$, but I was wondering whether it is possible to do better, since I know all the numbers come from $\{1,2,3,\ldots,n\}$.

If there is a special name for this type of problem please let me know. Any references or answers are greatly appreciated.
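One standard technique that fits this setting is counting sort, which sorts integers from a known small range without comparisons. With $n^2$ inputs from $\{1,\ldots,n\}$ it does $\mathcal{O}(n^2 + n) = \mathcal{O}(n^2)$ work. Below is a minimal Python sketch; the function name `counting_sort_bounded` and its signature are chosen for this example.

```python
from typing import List


def counting_sort_bounded(values: List[int], n: int) -> List[int]:
    """Sort integers that all lie in {1, ..., n} by counting occurrences.

    Runs in O(len(values) + n) time and uses O(n) extra space.
    """
    counts = [0] * (n + 1)           # counts[v] = how often v occurs
    for v in values:
        if not 1 <= v <= n:
            raise ValueError(f"value {v} is outside 1..{n}")
        counts[v] += 1

    result: List[int] = []
    for v in range(1, n + 1):        # emit each value as often as it occurred
        result.extend([v] * counts[v])
    return result


# n = 3, so n^2 = 9 numbers drawn from {1, 2, 3}
print(counting_sort_bounded([3, 1, 2, 3, 1, 1, 2, 2, 3], 3))
```

This works because the numbers are plain integers. If they were keys attached to records, a stable variant based on prefix sums would be needed instead.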

Terminology: What typical processes does the Software Engineering discipline consist of?

I am reading the "Software Engineering Knowledge Guide" but I am very confused.

Does a typical process consist of implementation and change, definition, evaluation, and process and product measurement?

Or is it software requirements, design, construction, testing, and maintenance? I thought this was the answer to my question, but I'm starting to have second thoughts.

rt.representation theory – Does a global Arthur packet consist only of globally generic representations?

I would like to ask two very stupid questions to the experts.

I wonder whether each globally generic automorphic representation of a unitary group is contained in some global Arthur packet associated with some A-parameter.

Conversely, I also wonder whether a global generic A-packet consists only of globally generic automorphic representations.

If the answer to both questions is yes, are these consequences of the work of Mok and of Kaletha, Minguez, Shin, and White?

Thank you very much for sharing your knowledge.

Should variations of a color, within a color palette, consist only of tints and shades of that color?

I am designing an IDE with a dark UI. We have an existing color palette, but our (~ 10) grays seem muted when applied to the IDE in the large quantities required by the dark UI.

When analyzing each sample, I realized that they do not all belong to the same HTML color family. Shouldn't the variations of a color, within a good color palette, consist of tints and shades of the same color or color family?

To add to my confusion, both Material Design (https://material.io/resources/color/#!/?view.left=0&view.right=0&primary.color=263238) and Apple's Human Interface Guidelines (https://developer.apple.com/design/human-interface-guidelines/ios/visual-design/color/) contain gray palettes with grays from different HTML color families.

Does the P2SH unlock script consist only of operands?

I have a question about the P2SH unlock script. Can I put both operators and operands in the P2SH unlock script? And if so, how can I create such a script?

e.g.

unlock script: CHECKSIG

redeem script: HASH160

EQUALVERIFY CHECKSIG

Do low-end IoT devices contain a Root of Trust (RoT), e.g. a TPM?

I believe that low-end IoT devices / low-power devices are very constrained, with limited processing capabilities. I just want to know whether these types of devices have a RoT.

True or false: does your diet basically consist of cigarettes and beer?
