Monday, November 28, 2016

C++ Compile, Link Process and Characters.

C++ compilation is a two-step process. First, the compiler translates each source file into an object file that contains the machine code equivalent of that source file. Second, the linker combines the object files for a program into a single file containing the complete executable program. In this second step, the linker also brings in any Standard Library functions the program uses.
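Here is a minimal sketch of the split between the two steps. The names add, sum_of_three and self_check are made up for illustration, and everything sits in one file only to keep the example short. In a real program the declaration would normally come from a header and the definition from a separate .cpp file.

```cpp
#include <cassert>

// Declaration only: enough for the compiler to type-check the calls below.
// In a multi-file program this line would usually come from a header.
int add(int a, int b);

// Compiles with just the declaration in view. If add() were defined in
// another source file, this call would stay an unresolved symbol in the
// object file until the linker matched it with the definition.
int sum_of_three(int a, int b, int c) {
    return add(add(a, b), c);
}

// Definition: this could live in its own .cpp file, be compiled in a
// separate compiler run, and be joined in at link time.
int add(int a, int b) {
    return a + b;
}

// Small self-check that a main() could call.
void self_check() {
    assert(sum_of_three(1, 2, 3) == 6);
}
```

With a GCC-style toolchain, and assuming hypothetical files main.cpp and add.cpp, the compile step would be something like `g++ -c main.cpp` and `g++ -c add.cpp`, and the link step `g++ main.o add.o -o app`.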

You can think of the intermediate object file produced from each .cpp source file as similar to a Java .class file, which you then run on the JVM. The difference is that the Java compiler translates the source code into bytecode that is OS and platform independent and, needless to say, bytecode is not machine code.

As in Java, you can compile each source file independently in a separate compiler run. This is convenient because coding is iterative: there will be typographical and other errors to fix along the way, and even code that compiles may contain logical errors that need revising.

Regarding Characters

Talking about computer characters, ASCII was defined in the 1960s as a 7-bit code, so there are 128 code values. ASCII values 0 to 31 represent non-printing control characters such as carriage return (0x0D) and line feed (0x0A). Code values 65 to 90 are the uppercase letters A to Z, and 97 to 122 (141 to 172 in octal) are the lowercase letters a to z. The codes for an uppercase letter and its lowercase counterpart differ only in the sixth bit (0x20).
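The ASCII values can be checked at compile time with a small sketch like the one below. It assumes an ASCII-compatible execution character set, and the names kCaseBit, is_ascii_letter and toggle_case are made up for this example.

```cpp
// Assumes an ASCII-compatible execution character set.
constexpr char kCaseBit = 0x20;  // the sixth bit, counting from 1

constexpr bool is_ascii_letter(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

// Flip the case of an ASCII letter by toggling bit 0x20.
// Any other character is returned unchanged.
constexpr char toggle_case(char c) {
    return is_ascii_letter(c) ? static_cast<char>(c ^ kCaseBit) : c;
}

static_assert('A' == 65 && 'Z' == 90, "uppercase range in decimal");
static_assert('a' == 97 && 'z' == 122, "lowercase range in decimal");
static_assert('a' == 0141 && 'z' == 0172, "lowercase range in octal");
static_assert('\r' == 0x0D && '\n' == 0x0A, "carriage return and line feed");
static_assert(('A' ^ 'a') == kCaseBit, "the two cases differ in one bit");
```

So toggle_case('A') gives 'a', toggle_case('z') gives 'Z', and a digit such as '1' comes back unchanged.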

The Universal Character Set (UCS) arrived in the 1990s to overcome the limitations of ASCII and to extend coverage to the characters of other languages. UCS was defined with code values of up to 32 bits.

However, it is very inefficient to use four bytes when one byte can do the job.

UCS defines a mapping between characters and integer code values, called "code points". A code point is not the same as an encoding. It is simply an integer, and it can be represented as bytes or words in a computer system in different ways.

Unicode is a standard that defines characters with the same code points as UCS. Remember, the same code point can have different encodings. Unicode divides its code points into 17 code planes, each of which contains 65,536 code values.

Code plane 0 contains the codes from 0x0 to 0xffff, and code plane 1 the codes from 0x10000 to 0x1ffff. Code plane 0 covers the characters of most languages in use today.
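Since each plane holds 0x10000 code values, the plane number is just the code point with its low 16 bits dropped. A small sketch, with helper names (is_code_point, plane_of) made up for this example:

```cpp
constexpr char32_t kMaxCodePoint = 0x10FFFF;  // last code point of plane 16
constexpr unsigned kPlaneCount = 17;
constexpr unsigned kPlaneSize = 0x10000;      // 65,536 code values per plane

// True if cp falls inside the 17 planes.
constexpr bool is_code_point(char32_t cp) {
    return cp <= kMaxCodePoint;
}

// The plane number is the code point shifted right by 16 bits.
constexpr unsigned plane_of(char32_t cp) {
    return static_cast<unsigned>(cp >> 16);
}

static_assert(kPlaneCount * kPlaneSize == kMaxCodePoint + 1,
              "17 planes cover 0x0 to 0x10FFFF");
```

For example, plane_of(0x0041) is 0 and plane_of(0x1F600) is 1.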

As mentioned, Unicode provides more than one encoding method. The most commonly used are UTF-8 and UTF-16.

UTF-8 represents a character as a variable-length sequence of 1 to 4 bytes. The ASCII character set appears in UTF-8 as single-byte codes.
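To make the 1 to 4 byte scheme concrete, here is a rough sketch of a UTF-8 encoder for a single code point. It assumes C++14 or later, and the Utf8Bytes struct and encode_utf8 function are made up for this example rather than taken from any library.

```cpp
// Result of encoding one code point; length 0 means "not encodable".
struct Utf8Bytes {
    unsigned char bytes[4];
    int length;
};

constexpr Utf8Bytes encode_utf8(char32_t cp) {
    Utf8Bytes out{{0, 0, 0, 0}, 0};
    if (cp <= 0x7F) {
        // ASCII range: one byte, identical to the ASCII code.
        out.bytes[0] = static_cast<unsigned char>(cp);
        out.length = 1;
    } else if (cp <= 0x7FF) {
        // Two bytes: 110xxxxx 10xxxxxx
        out.bytes[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));
        out.bytes[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        out.length = 2;
    } else if (cp <= 0xFFFF) {
        // 0xD800 to 0xDFFF are reserved for UTF-16 surrogates.
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.bytes[0] = static_cast<unsigned char>(0xE0 | (cp >> 12));
        out.bytes[1] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        out.bytes[2] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        out.length = 3;
    } else if (cp <= 0x10FFFF) {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.bytes[0] = static_cast<unsigned char>(0xF0 | (cp >> 18));
        out.bytes[1] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
        out.bytes[2] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        out.bytes[3] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        out.length = 4;
    }
    return out;  // length stays 0 for values above 0x10FFFF
}
```

With this sketch, 'A' (0x41) stays the single byte 0x41, while the euro sign (0x20AC) becomes the three bytes 0xE2 0x82 0xAC.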

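UTF-16 represents a character as one or two 16-bit values. UTF-16 and UTF-8 encode the same set of code points, but their byte sequences are different, so neither one contains the other.

Java uses UTF-16 to represent text internally.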
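The "one or two 16-bit values" rule can be sketched as follows: code points in plane 0 fit in a single 16-bit unit, and everything above is split into a pair of surrogate values. This assumes C++14 or later, and the Utf16Units struct and encode_utf16 function are made up for this example.

```cpp
// Result of encoding one code point; length 0 means "not encodable".
struct Utf16Units {
    char16_t units[2];
    int length;
};

constexpr Utf16Units encode_utf16(char32_t cp) {
    Utf16Units out{{0, 0}, 0};
    if (cp <= 0xFFFF) {
        // 0xD800 to 0xDFFF are reserved for surrogates, not characters.
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;
        // Plane 0: the code point is stored as is in one 16-bit unit.
        out.units[0] = static_cast<char16_t>(cp);
        out.length = 1;
    } else if (cp <= 0x10FFFF) {
        // Planes 1 to 16: subtract 0x10000, then split the remaining
        // 20 bits into a high and a low surrogate of 10 bits each.
        const char32_t v = cp - 0x10000;
        out.units[0] = static_cast<char16_t>(0xD800 + (v >> 10));
        out.units[1] = static_cast<char16_t>(0xDC00 + (v & 0x3FF));
        out.length = 2;
    }
    return out;  // length stays 0 for values above 0x10FFFF
}
```

With this sketch, the euro sign (0x20AC) is the single unit 0x20AC, while 0x1F600 from plane 1 becomes the pair 0xD83D 0xDE00.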

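In C++, a 'char' is one byte, which is 8 bits on typical platforms and large enough to hold an ASCII code. Whether a plain 'char' is signed or unsigned is up to the implementation, but you can declare a 'signed char' to get values from -128 to 127. You also have wchar_t, char16_t and char32_t to store Unicode characters.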
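A few compile-time checks show what the language promises for these types. This is only a sketch, assuming C++11 or later. The constant names are made up for the example, and the bounds are written as "at least" because the standard sets minimums, not exact sizes.

```cpp
#include <climits>      // CHAR_BIT
#include <limits>       // std::numeric_limits
#include <type_traits>  // std::is_signed

static_assert(sizeof(char) == 1, "char is one byte by definition");
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");
static_assert(std::numeric_limits<signed char>::min() <= -127,
              "signed char reaches at least -127 (-128 on typical platforms)");
static_assert(std::numeric_limits<signed char>::max() >= 127,
              "signed char reaches at least 127");
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "room for a UTF-16 unit");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "room for any code point");

// Implementation-defined: true on some platforms, false on others.
constexpr bool kPlainCharIsSigned = std::is_signed<char>::value;

// Character literals for each type.
constexpr char     kLetterA      = 'A';            // ASCII 65
constexpr char16_t kEuro         = u'\u20AC';      // plane 0, one 16-bit unit
constexpr char32_t kGrinningFace = U'\U0001F600';  // plane 1, needs char32_t
```

wchar_t is left out of the checks because its size varies between platforms.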

