Text File Encoding and End of Line Characters
Character Sets
Text files use character sets. A character set contains punctuation marks, numerals, uppercase and lowercase letters, and all other printable characters. Each element of a character set is identified by a number.
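As a minimal sketch of the idea that each character is identified by a number, the following Python snippet maps characters to their numeric codes and back with the built-in `ord` and `chr` functions. The helper name `char_codes` is hypothetical and used only for illustration.

```python
def char_codes(text: str) -> list:
    """Return the numeric code of each character in text (hypothetical helper)."""
    return [ord(ch) for ch in text]


codes = char_codes("A a1!")
print(codes)                           # [65, 32, 97, 49, 33]
print("".join(chr(c) for c in codes))  # A a1!
```

All five codes fall in the range 32 through 127, so they are the same in U.S. ASCII and in the character sets that extend it.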
Most character sets in use are supersets of the U.S. ASCII character set, which defines the 128 numeric values from 0 through 127. Values 32 through 126 are printable characters; the remaining values are control characters. The major groups of character sets are:
- Windows Character Set (single-byte coding scheme)
- The Windows character set is the most commonly used character set in Windows. It is essentially equivalent to the ANSI character set. It uses 8 bits to represent each character, so it can express at most 256 (2^8) characters. This is usually sufficient for Western languages, including the diacritical marks used in Czech, French, German, Spanish, and other languages, although different languages may require different code pages (for example, Windows-1252 for French, German, and Spanish, and Windows-1250 for Czech).
- Multibyte Character Set
- East Asian languages employ thousands of separate characters, which cannot be encoded with a single-byte coding scheme. With the proliferation of computer commerce, multibyte coding schemes were developed so that characters could be represented in 8-bit, 16-bit, 24-bit, or 32-bit sequences. This requires complicated parsing algorithms; even so, two computers that use different code sets can interpret the same bytes as entirely different text.
- Unicode Character Set
- To address the problem of multiple coding schemes, the Unicode standard for data representation was developed. Unicode was originally designed as a 16-bit coding scheme able to represent 65,536 (2^16) characters; the current standard defines a code space of 1,114,112 code points (U+0000 through U+10FFFF), which are stored using the UTF-8, UTF-16, or UTF-32 encoding forms. This is enough to include all languages in computer commerce today, as well as punctuation marks, mathematical symbols, and room for future expansion. Unicode establishes a unique code for every character so that character translation is unambiguous.
- OEM Character Set
- The OEM character set is typically used in full-screen MS-DOS sessions for screen display. Characters 32 through 127 are usually the same in the OEM, U.S. ASCII, and Windows character sets. The other characters in the OEM character set (0 through 31 and 128 through 255) correspond to the characters that can be displayed in a full-screen MS-DOS session. These characters are generally different from the Windows characters.
- Symbol Character Set
- The Symbol character set contains special characters typically used to represent mathematical and scientific formulas.
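The differences between these character sets can be illustrated with a short Python sketch. It decodes one byte value under three single-byte code pages and then reports how many bytes a character occupies in three Unicode encodings. The codec names `cp1252`, `cp1250`, and `cp437` are Python's names for a Western Windows code page, a Central European Windows code page, and the original U.S. OEM code page; choosing them as representatives is an assumption made for illustration, as are the helper names `decode_with` and `encoded_sizes`.

```python
def decode_with(raw: bytes, encodings) -> dict:
    """Decode the same bytes under several code pages (hypothetical helper)."""
    return {enc: raw.decode(enc) for enc in encodings}


def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes ch occupies in three Unicode encodings."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# The byte 0xE8 means something different in each single-byte character set.
for name, text in decode_with(bytes([0xE8]), ["cp1252", "cp1250", "cp437"]).items():
    print(name, repr(text), f"U+{ord(text):04X}")
# cp1252 -> U+00E8 (e with grave), cp1250 -> U+010D (c with caron),
# cp437  -> U+03A6 (Greek capital phi)

# Unicode gives each character one code point; the encodings differ in size.
print(encoded_sizes("A"))            # {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes("\u20ac"))       # euro sign: 3, 2, and 4 bytes
print(encoded_sizes("\U0001F600"))   # beyond U+FFFF: 4 bytes in every encoding
```

Because the byte alone does not say which code page produced it, a reader of a single-byte text file has to know or guess the character set; a Unicode encoding removes that ambiguity at the cost of variable or larger character sizes.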
End of Line Characters
MS-DOS (including Windows and OS/2), UNIX, and Macintosh operating systems use different characters to designate the end of a line within a text file. The Macintosh convention listed here applies to classic Mac OS; macOS is UNIX-based and uses LF.
Symbol | Meaning | Used by | Char | Hex |
---|---|---|---|---|
CR | Carriage Return | Macintosh | \r | 0x0d |
LF | Line Feed | UNIX | \n | 0x0a |
CR/LF | Carriage Return/Line Feed | MS-DOS, Windows, OS/2 | \r\n | 0x0d, 0x0a |
NULL | Null character | | \0 | 0x00 |
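The conventions in the table can be checked and converted programmatically. The following Python sketch counts each kind of line ending in raw file bytes and rewrites them to a single convention. The function names `detect_eol` and `normalize_eol` are hypothetical, and the byte-level scan assumes an ASCII-compatible encoding (a single-byte character set or UTF-8), not UTF-16 or UTF-32.

```python
def detect_eol(data: bytes) -> str:
    """Return 'CRLF', 'LF', 'CR', 'mixed', or 'none' for raw file bytes."""
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf  # bare CR, not followed by LF
    lf = data.count(b"\n") - crlf  # bare LF, not preceded by CR
    found = [name for name, n in (("CRLF", crlf), ("LF", lf), ("CR", cr)) if n]
    if not found:
        return "none"
    return found[0] if len(found) == 1 else "mixed"


def normalize_eol(data: bytes, eol: bytes = b"\n") -> bytes:
    """Rewrite every line ending in data as eol."""
    # Collapse CRLF first so its CR and LF are not converted separately.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", eol)


print(detect_eol(b"one\r\ntwo\r\n"))         # CRLF
print(detect_eol(b"one\ntwo\r\n"))           # mixed
print(normalize_eol(b"one\r\ntwo\rthree\n"))  # b'one\ntwo\nthree\n'
```

When reading such a file, open it in binary mode (for example `open(path, "rb")`) so that Python does not translate line endings before the bytes are inspected.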