What is a BOM?
The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.[1]
Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving these encodings from arbitrary sources needs to know which byte order the integers are encoded in. The BOM gives the producer of the text a way to describe the text stream's endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself. Once the receiving computer has consumed the text stream, it presumably processes the characters in its own native byte order and no longer needs the BOM. Hence the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment.
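As an illustration of how a consumer might use the BOM to pick the byte order, here is a minimal Python sketch. The helper names `sniff_bom` and `decode_with_bom` are hypothetical, made up for this example; the BOM constants come from Python's standard `codecs` module. It assumes the whole text is available as bytes and falls back to UTF-8 when no BOM is found.

```python
import codecs

# Checked in this order because the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode with the encoding announced by the BOM, then drop the BOM."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or default)
```

For example, `decode_with_bom("\ufeffhi".encode("utf-16-be"))` returns `"hi"`. One known limitation of this kind of sniffing: a UTF-16 little-endian text whose first character after the BOM is U+0000 would be mistaken for UTF-32.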
If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space" (an invisible character that prevents a line break between the characters on either side of it). As of Unicode 3.2, this usage is deprecated in favour of the "Word Joiner" character, U+2060.[1] This allows U+FEFF to be used solely as a BOM.
The UTF-8 representation of the BOM is the byte sequence 0xEF, 0xBB, 0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display these bytes as the characters ï»¿.
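The byte sequence and its misreading can be reproduced in a few lines of Python. This is only a small demonstration using the standard string codecs; the variable names are arbitrary.

```python
# U+FEFF encoded as UTF-8 gives the three BOM bytes.
bom_bytes = "\ufeff".encode("utf-8")           # b'\xef\xbb\xbf'

# Misreading the same bytes in a single-byte legacy encoding
# yields three visible characters instead of an invisible mark.
as_cp1252 = bom_bytes.decode("cp1252")         # 'ï»¿'
as_latin1 = bom_bytes.decode("iso-8859-1")     # 'ï»¿'
```

Seeing ï»¿ at the very start of a file or web page is therefore a strong hint that a UTF-8 BOM is present and that the text is being read with the wrong encoding.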
The Unicode Standard does permit the BOM in UTF-8,[2] but does not require or recommend its use.[3] Byte order has no meaning in UTF-8[4] so in UTF-8 the BOM serves only to identify a text stream or file as UTF-8.
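Because the UTF-8 BOM is only a signature, a reader usually wants to drop it after recognising it. As a sketch, Python's standard `utf-8-sig` codec does this, while the plain `utf-8` codec keeps the BOM as a leading U+FEFF character. The sample bytes below are invented for the example.

```python
data = b"\xef\xbb\xbfid;title\n"

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
kept = data.decode("utf-8")                  # '\ufeffid;title\n'

# "utf-8-sig" treats a leading BOM as a signature and drops it.
stripped = data.decode("utf-8-sig")          # 'id;title\n'

# Input without a BOM decodes unchanged with "utf-8-sig".
no_bom = b"id;title\n".decode("utf-8-sig")   # 'id;title\n'
```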
One reason the UTF-8 BOM is not recommended is that pieces of software without Unicode support may accept UTF-8 bytes at certain points inside a text but not at the start of a text. For instance, the bytes of UTF-8 can be placed between the quotes of string constants in source files of many programming languages, and when executed the program will write the correct UTF-8 to a file or to a display, despite the language not knowing anything about UTF-8. This provides an easy migration path to convert systems to Unicode and to remove all legacy encodings, without simultaneously upgrading the programming language. The unexpected three bytes of the BOM break this, however, because they are located at the start of the source file, where a language that knows nothing about UTF-8 will typically reject them as a syntax error.
A leading BOM can also defeat software that uses pattern matching on the start of a text file, since it inserts 3 bytes before the pattern. Though commonly associated with the Unix shebang at the start of an interpreted script,[5] the problem is more widespread. For instance, in PHP, a BOM at the start of a file causes the page to begin output before the initial code is interpreted, which causes problems if the page tries to send custom HTTP headers (which must be set before output begins).
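The pattern-matching problem is easy to reproduce. The Python sketch below uses an invented two-line script and a hypothetical helper, `strip_utf8_bom`, to show a naive start-of-file check failing and a BOM-tolerant one succeeding. It only illustrates what your own tooling could do; it does not change how an operating system or PHP treats the file.

```python
import codecs

script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"

# A naive check on the first two bytes misses the shebang,
# because the file actually starts with EF BB BF.
naive_match = script.startswith(b"#!")        # False


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Removing the BOM first lets the same check succeed.
tolerant_match = strip_utf8_bom(script).startswith(b"#!")   # True
```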
Some common programs from Microsoft, such as Notepad and Visual C++,[6] add BOMs to UTF-8 files by default. Google Docs adds a BOM when a Microsoft Word document is downloaded as a .txt file.
Wikipedia link: http://en.wikipedia.org/wiki/Byte_order_mark
If you have a problem with a BOM in your catalogue, you only need to regenerate the catalogue and make sure that its BOM is the UTF-8 one.
Here is an example using Notepad++.
Note that a UTF-8 file can be generated with or without a BOM.
With BeezUP, the BOM is required.
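If the catalogue is produced by a script rather than saved by hand, the same check and fix can be automated. The Python sketch below is only an illustration: the helper names `has_utf8_bom` and `resave_with_utf8_bom` are hypothetical, and it assumes the source file is already valid UTF-8 (with or without a BOM).

```python
import codecs


def has_utf8_bom(path: str) -> bool:
    """Return True if the file at path starts with the UTF-8 BOM."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def resave_with_utf8_bom(src: str, dst: str) -> None:
    """Rewrite a UTF-8 file so that it starts with exactly one BOM."""
    # "utf-8-sig" drops an existing BOM on read and adds one on write;
    # newline="" leaves the original line endings untouched.
    with open(src, "r", encoding="utf-8-sig", newline="") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8-sig", newline="") as f:
        f.write(text)
```

For example, with a hypothetical file name, `resave_with_utf8_bom("catalogue.csv", "catalogue_bom.csv")` writes a copy that starts with the BOM, and `has_utf8_bom("catalogue_bom.csv")` then returns `True`.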
Another useful link: http://unicode.org/faq/utf_bom.html#bom1