
Feb 24, 2018

Representation of data types in memory - Part 3

You have probably written statements like this thousands of times, but do you know what's going on under the hood?
float foo = 1.21;
This is the continuation of my series about representation of data types in memory. In my previous post, I discussed the numerical integral types. Now it's time to dig into the wonderful world of floating types. There is a lot to discuss about floating types, so in this post I will focus on the Normalized form and its representation in memory.

Every C++ programmer has dealt with the data types float and double and we will soon see how they are represented in memory. But before doing that, we need to understand some basic concepts and formats, so let's discuss some theory before proceeding to the practical part.

My intention is not to give a complete description of the theory of floating points. I will give a very brief summary here (and I mean it!); the details can easily be found on the net.

Below is a very good description (from this site) of how a floating-point number is expressed when stored in memory.
A floating-point number is typically expressed in scientific notation, with a fraction (F) and an exponent (E) of a certain radix (r), in the form of F * r^E. Decimal numbers use a radix of 10 (F * 10^E), while binary numbers use a radix of 2 (F * 2^E).
Example:
Let's say we have 16.25 (decimal). By using scientific notation, it can be written as 16.25 * 10^0 or 1.625 * 10^1 and so forth. In binary, it can be written as 10000.01 * 2^0, 1000.001 * 2^1, 100.0001 * 2^2, 10.00001 * 2^3 or 1.000001 * 2^4 and so forth. The representation used for a floating-point number is 1.000001 * 2^4, which is called the Normalized form.
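If you want to reproduce this decomposition in code, here is a minimal sketch using std::frexp. Note that std::frexp returns a fraction in the range [0.5, 1), so doubling it and subtracting one from the exponent gives the Normalized form described above.

#include <cmath>
#include <cstdio>

int main()
{
   // std::frexp splits a value into fraction * 2^exponent with the
   // fraction in [0.5, 1). Doubling the fraction and subtracting one
   // from the exponent gives the 1.xxx * 2^e form described above.
   int exponent = 0;
   double fraction = std::frexp(16.25, &exponent);   // 0.5078125 and 5

   printf("16.25 = %g * 2^%d\n", fraction, exponent);                             // 0.5078125 * 2^5
   printf("16.25 = %g * 2^%d (Normalized form)\n", fraction * 2.0, exponent - 1); // 1.015625 * 2^4

   return 0;
}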

The IEEE standard 754 describes the single precision format and the double precision format. It is important to have a brief understanding of these formats, because the floating types in C++ are based on them.

According to MSDN, the following is stated (Visual Studio 2010 specific) for the data type float;
The float type is stored as a four-byte, single-precision, floating-point number. It represents a single-precision 32-bit IEEE 754 value.
According to MSDN, the following is stated (Visual Studio 2010 specific) for the data type double;
The double type is stored as an eight-byte, double-precision, floating-point number. It represents a double-precision 64-bit IEEE 754 value.
Voila! There we have the basic definition of the data types float and double.

The single precision format consists of 32 bits (4 bytes), where the Most Significant Bit (MSB) represents the sign (S) bit, the following 8 bits represent the exponent (E), and the 23 Least Significant Bits (LSBs) represent the fraction (F). Note that a bias is applied to the exponent (E) in order to represent both positive and negative exponents. The bias is 127 in single precision format, meaning exponent (E) = 0 is represented as 127, E = 1 is represented as 128 and so on.

The double precision format consists of 64 bits (8 bytes), where the MSB represents the sign (S) bit, the following 11 bits represent the exponent (E), and the 52 LSBs represent the fraction (F). Note that a bias is applied to the exponent (E) in order to represent both positive and negative exponents. The bias is 1023 in double precision format, meaning exponent (E) = 0 is represented as 1023, E = 1 is represented as 1024 and so on.
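To see these three fields for an actual float value, here is a minimal sketch. It assumes that float is a 32-bit IEEE 754 value and that unsigned int is 32 bits wide, which holds for the MSVC/x86 setup used in these posts.

#include <cstdio>
#include <cstring>

int main()
{
   float f = 16.25f;

   // Copy the raw bit pattern into an integer of the same size.
   unsigned int bits = 0;
   std::memcpy(&bits, &f, sizeof bits);

   unsigned int sign     = bits >> 31;            // 1 bit
   unsigned int exponent = (bits >> 23) & 0xFF;   // 8 bits, biased by 127
   unsigned int fraction = bits & 0x7FFFFF;       // 23 bits, implicit leading 1 not stored

   printf("sign = %u, exponent = %u (E = %d), fraction = 0x%06X\n",
          sign, exponent, (int)exponent - 127, fraction);

   return 0;
}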

Before proceeding to the practical part, just a few words about Normalized form.

As we have seen above, Normalized form means we have an implicit leading 1 to the left of the radix point in the fraction (F), for instance 1.000001 * 2^4. This leading 1 is not stored in the 32/64-bit format, but we know the leading 1 is there if Normalized form is used.

Let's look into a simple application. This is very similar to my previous simple application, except that the data type is float.
int main()
{
   float a = 1.0;
   float b = 2.0;
   float c = 3.0;
   float d = 4.0;

   return 0;
}
After starting WinDbg and executing all the initialization statements, we see the result below.
WinDbg - Memory view after initializations
As discussed in previous posts, in the Memory view we can see the blue rectangle, which indicates the inserted block of 0xCC that is typically done in Debug mode. The blue arrow shows the offset of the Memory view, which is where the insertion of 0xCC starts. Each float is also "guarded" by four bytes of 0xCC. In the Disassembly view, we can notice that special floating-point instructions are used. I will not discuss them in detail here; if you want more information, they are described in the Intel x86 reference manuals, more specifically in "Intel 64 and IA-32 Architectures Software Developer's Manual Volume 2A: Instruction Set Reference, A-M".

Binary representation of 1.0

To understand how a float is stored in the memory, I will work through a couple of examples. Let's start with the first statement in the simple application;
float a = 1.0;
In binary we have 1.0 (decimal) = 1.0 (binary), which is equal to 1.0 * 2^0 in Normalized form. Now let's see how this number 1.0 * 2^0 is stored in memory bit by bit.
Positive number -> Sign (S) bit = 0
Exponent: 0, i.e. the exponent (E) is represented as 127 (decimal) = 01111111 (binary)
Fraction (F): 0 = 00000000000000000000000 (23 bits)
Binary representation: 00111111 10000000 00000000 00000000
Hexadecimal representation: 0x3F800000

This number is stored in the memory by this instruction;
fstp    dword ptr[ebp-8h]
The fstp instruction (Store Floating Point Value) pops the value from the FPU register stack and stores it in memory in either single- or double-precision format (single in this case, since the type is float). Note that the fld1 instruction, which loads the constant +1.0, pushed this value onto the FPU register stack in the first place. 0x3F800000 will be stored at memory address ebp-0x8. Taking little-endian into account, each byte will be saved according to below;

ebp-0x8: 0x00
ebp-0x7: 0x00
ebp-0x6: 0x80
ebp-0x5: 0x3F
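If you do not want to fire up WinDbg, the byte order can also be checked with a few lines of code. The minimal sketch below assumes a little-endian machine (such as the x86 used here) and prints 00 00 80 3F, i.e. the LSB of 0x3F800000 at the lowest address.

#include <cstdio>

int main()
{
   float a = 1.0;

   // Walk over the float byte by byte, in increasing address order.
   const unsigned char* p = reinterpret_cast<const unsigned char*>(&a);
   for (unsigned i = 0; i < sizeof a; ++i)
      printf("%02X ", p[i]);
   printf("\n");

   return 0;
}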

Binary representation of 2.0
float b = 2.0;
Since 1.0 already was converted above, I will give a briefer explanation below.
Binary form: 10.0 (binary)
Normalized form: 1.0 * 2^1
Positive number -> Sign (S) bit = 0
Exponent: 1, i.e. the exponent (E) is represented as 128 (decimal) = 10000000 (binary)
Fraction (F): 0 = 00000000000000000000000 (23 bits)
Binary representation: 01000000 00000000 00000000 00000000
Hexadecimal representation: 0x40000000

This number is stored in the memory by this instruction;
fstp    dword ptr[ebp-14h]
Taking little-endian into account, each byte will be saved according to below;

ebp-0x14: 0x00
ebp-0x13: 0x00
ebp-0x12: 0x00
ebp-0x11: 0x40

Binary representation of 3.0
float c = 3.0;
Binary form: 11.0 (binary)
Normalized form: 1.1 * 2^1
Positive number -> Sign (S) bit = 0
Exponent: 1, i.e. the exponent (E) is represented as 128 (decimal) = 10000000 (binary)
Fraction (F): 1 = 10000000000000000000000 (23 bits)
Binary representation: 01000000 01000000 00000000 00000000
Hexadecimal representation: 0x40400000

This number is stored in the memory by this instruction;
fstp    dword ptr [ebp-20h]
Taking little-endian into account, each byte will be saved according to below;

ebp-0x20: 0x00
ebp-0x1F: 0x00
ebp-0x1E: 0x40
ebp-0x1D: 0x40

Binary representation of 4.0
float d = 4.0;
Binary form: 100.0 (binary)
Normalized form: 1.0 * 2^2
Positive number -> Sign (S) bit = 0
Exponent: 2, i.e. the exponent (E) is represented as 129 (decimal) = 10000001 (binary)
Fraction (F): 0 = 00000000000000000000000 (23 bits)
Binary representation: 01000000 10000000 00000000 00000000
Hexadecimal representation: 0x40800000

This number is stored in the memory by this instruction;
fstp    dword ptr[ebp-2Ch]
Taking little-endian into account, each byte will be saved according to below;

ebp-0x2C: 0x00
ebp-0x2B: 0x00
ebp-0x2A: 0x80
ebp-0x29: 0x40
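To double-check the four bit patterns above without a debugger, here is a small sketch that prints the raw 32-bit pattern of each value. It assumes float and unsigned int are both 32 bits wide.

#include <cstdio>
#include <cstring>

int main()
{
   float values[] = { 1.0f, 2.0f, 3.0f, 4.0f };

   for (int i = 0; i < 4; ++i)
   {
      unsigned int bits = 0;
      std::memcpy(&bits, &values[i], sizeof bits);

      // Expected output: 0x3F800000, 0x40000000, 0x40400000, 0x40800000
      printf("%.1f -> 0x%08X\n", values[i], bits);
   }

   return 0;
}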

The four examples above only dealt with numbers in Normalized form. Later, I will have a look at the Denormalized form.

You are welcome to leave comments, complaints or questions!

Jul 20, 2016

Representation of data types in memory - Part 2

This is the continuation of the series "Representation of data types in memory". In this part, I will investigate how signed numerical integers are stored in memory. I'm using Windows Vista 32 bit with Microsoft Visual C++ 2010 Express (in Debug Mode) and WinDbg. Note that I'm not considering C++11.

Well, in my previous post, we saw how the unsigned numerical integers were saved in little-endian format in a block of 0xCC. This is of course true for signed numerical integers as well, but signed integers also need to take the sign into account.

I will start this post in a similar way to my previous one, by using the same simple program, but with signed numerical integers.
int main()
{
   int a = -1;
   int b = -2;
   int c = -3;
   int d = -4;

   return 0;
}
After starting WinDbg and executing all the initialization statements, we see the result below.

WinDbg - Memory view after initializations
As discussed before, in the Memory view we can see the blue rectangle, which indicates the inserted block of 0xCC that typically is done in Debug mode. The blue arrow shows the offset of the Memory view, which is where the insertion of 0xCC starts. Each integer is also "guarded" by four bytes of 0xCC. Now we will focus on how the signed numerical integers are saved in memory. This is done by using the two's-complement representation. What is two's complement? You can read all about it on the net, but here is a short explanation cited from Wikipedia.
"Two's complement is a mathematical operation on binary numbers, as well as a binary signed number representation based on this operation. Its wide use in computing makes it the most important example of a radix complement."
There is a lot of in-depth information on the net about how the two's-complement conversion is done, but here is my short version;

Invert all bits of the (positive) number and add one (+1).

Let's use an actual value from our simple program as an example, for instance;
int a = -1;
Positive number: 1 (decimal)
It's an int, i.e. four bytes in size -> 1 = 0x00000001.
Then invert all bits: 0x00000001 -> 0xFFFFFFFE
and then add one (+1): 0xFFFFFFFE -> 0xFFFFFFFF

In other words, the number 0xFFFFFFFF represents -1 in two's-complement representation.

We can see this number in the Disassembly view, more specifically in this instruction;
mov     dword ptr [ebp-8],0FFFFFFFFh
It means that -1 is represented as 0xFFFFFFFF and will be stored at memory address ebp-0x8. Taking little-endian into account, each byte will be saved according to below;

ebp-0x8: 0xFF
ebp-0x7: 0xFF
ebp-0x6: 0xFF
ebp-0x5: 0xFF
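The "invert all bits and add one" rule is easy to verify in code. The sketch below assumes a 32-bit int with two's-complement representation (as on the x86 setup used here) and compares the computed pattern with the actual bit pattern of -1.

#include <cstdio>

int main()
{
   unsigned int positive = 1;

   // Invert all bits and add one.
   unsigned int twos_complement = ~positive + 1;   // 0xFFFFFFFE + 1 = 0xFFFFFFFF

   int a = -1;
   printf("~1 + 1            = 0x%08X\n", twos_complement);
   printf("bit pattern of -1 = 0x%08X\n", (unsigned int)a);

   return 0;
}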

Now let's continue with a similar program to the one in the first part of this series, this time with signed numerical integers.
int main()
{
   char a = -1;
   short int b = -2;
   int c = -3;
   long int d = -4;

   return 0;
}
After starting WinDbg and executing all the initialization statements, we see the result below.
WinDbg - Memory view after initializations

As with the first simple program, the blue arrow and blue rectangle show the inserted 0xCC done by the rep stos instruction.
Let's investigate how each type is stored. We know they are stored in two's-complement little-endian format.

Below, I've used the following shorthand to denote the two's-complement conversion for each type;

Positive number (using correct size) -> Invert all bits -> Added 1

char a = -1;
0x01 -> 0xFE -> 0xFF

ebp-0x5: 0xFF
short int b = -2;
0x0002 -> 0xFFFD -> 0xFFFE

ebp-0x14: 0xFE
ebp-0x13: 0xFF
int c = -3;
0x00000003 -> 0xFFFFFFFC -> 0xFFFFFFFD

ebp-0x20: 0xFD
ebp-0x1F: 0xFF
ebp-0x1E: 0xFF
ebp-0x1D: 0xFF
long int d = -4;
0x00000004 -> 0xFFFFFFFB -> 0xFFFFFFFC

ebp-0x2C: 0xFC
ebp-0x2B: 0xFF
ebp-0x2A: 0xFF
ebp-0x29: 0xFF
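The conversions above can be verified programmatically as well. The sketch below prints the bit pattern of each variable using the width of its type; it assumes the usual MSVC/x86 sizes (char is 1 byte, short int is 2 bytes, int and long int are 4 bytes).

#include <cstdio>

int main()
{
   char a = -1;
   short int b = -2;
   int c = -3;
   long int d = -4;

   // Casting to the unsigned type of the same width exposes the
   // two's-complement bit pattern of each value.
   printf("a = 0x%02X\n",  (unsigned int)(unsigned char)a);    // 0xFF
   printf("b = 0x%04X\n",  (unsigned int)(unsigned short)b);   // 0xFFFE
   printf("c = 0x%08X\n",  (unsigned int)c);                   // 0xFFFFFFFD
   printf("d = 0x%08lX\n", (unsigned long)d);                  // 0xFFFFFFFC

   return 0;
}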

You are welcome to leave comments, complaints or questions!

Jul 14, 2016

Representation of data types in memory - Part 1

Normally we don't need to bother with how fundamental data types are stored in memory. We just expect it to work. Recently I became curious, so I learned the details of how a fundamental data type is represented in memory, especially on the stack. As usual, I'm using Windows Vista 32 bit with Microsoft Visual C++ 2010 Express and WinDbg. Note that I'm not considering C++11.

I've divided this topic into several posts. In the first one, I will focus on one group of fundamental data types: numerical integer types, and more specifically, the unsigned ones. In the following posts, I will consider the signed numerical integers, floating types and other aspects.

Well, let's look into the numerical integer types. You probably know the integer types already. Below is a recap cited from www.cplusplus.com.
"Numerical integer types:
They can store a whole number value, such as 7 or 1024. They exist in a variety of sizes, and can either be signed or unsigned, depending on whether they support negative values or not."
Let's start with a very simple program; here we just use the unsigned integer type. When compiling this program in Debug mode and executing it, how are the unsigned integers represented in memory?
int main()
{
   unsigned int a = 1;
   unsigned int b = 2;
   unsigned int c = 3;
   unsigned int d = 4;

   return 0;
}
After starting WinDbg and executing all the initialization statements, we see the result below.
WinDbg - Memory view after initializations

The blue arrow above indicates that I've set the Memory view to ebp-0x0F0 according to the Disassembly view. Before any initializations have been done, a block of memory is initialized to 0xCC thanks to the rep stos instruction. This is typically done in Debug mode. The rep stos instruction starts inserting 0xCC at ebp-0x0F0.

Further, we can also see that each initialized integer is "guarded" by four bytes of 0xCC (before and after each integer).

The x86 architecture uses the little-endian format. I will briefly explain the little-endian format here. If you want more in-depth information, just search the net.

Endianness describes how a sequence of bytes is stored in memory. This means that endianness only matters for data types that are more than one byte in size. Little-endian means that the Least Significant Byte (LSB) is stored at the lowest address and the Most Significant Byte (MSB) is stored at the highest address.

In the example above we are dealing with integers, so let's use an integer as an example. An integer is four bytes in size, as can be seen in the Memory view. An integer can be written as "ByteA ByteB ByteC ByteD", where ByteA is the MSB and ByteD is the LSB. The memory will look like this when we are using the little-endian format.

Base Address: ByteD
Base Address + 1: ByteC
Base Address + 2: ByteB
Base Address + 3: ByteA

Let's make the example above more realistic by using an actual value from our simple program, for instance;

   unsigned int a = 1;

As we know, an unsigned int is four bytes in size, so the number will be 0x00000001. According to the Disassembly view, the statement "unsigned int a = 1;" will be saved on the stack at memory address ebp-0x8. The LSB, i.e. 0x01, is saved at memory address ebp-0x8, and the other bytes are saved according to below.

ebp-0x8: 0x01
ebp-0x7: 0x00
ebp-0x6: 0x00
ebp-0x5: 0x00
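The same layout can be seen without a debugger by dumping the bytes of the variable in increasing address order. This is a minimal sketch; on a little-endian machine (such as x86) it prints 0x01 at the lowest address followed by three 0x00 bytes.

#include <cstdio>

int main()
{
   unsigned int a = 1;

   // Print each byte of a together with its address, lowest address first.
   const unsigned char* p = reinterpret_cast<const unsigned char*>(&a);
   for (unsigned i = 0; i < sizeof a; ++i)
      printf("%p: 0x%02X\n", (void*)(p + i), p[i]);

   return 0;
}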

The example above was only dealing with unsigned int. Now we move on to another simple program, which shows the data representation of each unsigned numerical integer type: char, short int, int and long int.
int main()
{
   unsigned char a = 1;
   unsigned short int b = 2;
   unsigned int c = 3;
   unsigned long int d = 4;

   return 0;
}
After starting WinDbg and executing all the initialization statements, we see the result below.
WinDbg - Memory view after initializations
As with the first simple program, the blue arrow and blue rectangle show the inserted 0xCC done by the rep stos instruction.

Let's investigate how each type is stored. We know they are stored in little-endian format, meaning the LSB is stored in the lowest address.

   unsigned char a = 1;

ebp-0x5: 0x01

   unsigned short int b = 2;

ebp-0x14: 0x02
ebp-0x13: 0x00

   unsigned int c = 3;

ebp-0x20: 0x03
ebp-0x1F: 0x00
ebp-0x1E: 0x00
ebp-0x1D: 0x00

   unsigned long int d = 4;

ebp-0x2C: 0x04
ebp-0x2B: 0x00
ebp-0x2A: 0x00
ebp-0x29: 0x00

From the results above, we can note that the unsigned char is only one byte in size, so the endianness format does not matter. We can also see that both unsigned int and unsigned long int are four bytes in size.
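As a quick sanity check of the sizes discussed above, sizeof can be printed for each type. The values shown in the comments are the sizes on the 32-bit MSVC/x86 setup used in this post; other platforms may report different values, for long int in particular.

#include <cstdio>

int main()
{
   printf("unsigned char      : %u byte(s)\n", (unsigned)sizeof(unsigned char));       // 1
   printf("unsigned short int : %u byte(s)\n", (unsigned)sizeof(unsigned short int));  // 2
   printf("unsigned int       : %u byte(s)\n", (unsigned)sizeof(unsigned int));        // 4
   printf("unsigned long int  : %u byte(s)\n", (unsigned)sizeof(unsigned long int));   // 4

   return 0;
}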

You are welcome to leave comments, complaints or questions!