Jul 14, 2016

Representation of data types in memory - Part 1

Normally we don't need to worry about how fundamental data types are stored in memory. We just expect it to work. Recently I became curious, so I learned the details of how a fundamental data type is represented in memory, especially on the stack. As usual, I'm using Windows Vista 32-bit with Microsoft Visual C++ 2010 Express and WinDbg. Note that I'm not considering C++11.

I've divided this topic into several posts. In this first one, I will focus on one group of fundamental data types: the numerical integer types, and more specifically, the unsigned ones. In the following posts, I will consider the signed numerical integers, floating-point types and other aspects.

Well, let's look into the numerical integer types. You probably know the integer types already. Below is a recap quoted from www.cplusplus.com.
"Numerical integer types:
They can store a whole number value, such as 7 or 1024. They exist in a variety of sizes, and can either be signed or unsigned, depending on whether they support negative values or not."
Let's start with a very simple program that uses only the unsigned int type. When this program is compiled in Debug mode and executed, how are the unsigned integers represented in memory?
int main()
{
   unsigned int a = 1;
   unsigned int b = 2;
   unsigned int c = 3;
   unsigned int d = 4;

   return 0;
}
After starting WinDbg and executing all the initialization statements, we see the result below.
WinDbg - Memory view after initializations

The blue arrow above indicates that I've set the Memory view to ebp-0x0F0, the address taken from the Disassembly view. Before any initialization is done, a block of memory is filled with 0xCC by the rep stos instruction. This is typically done in Debug mode. The rep stos instruction starts writing 0xCC at ebp-0x0F0.
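
The effect of that fill can be imitated in portable C++. The sketch below is only an emulation, not what the compiler emits: the helper names make_filled_frame and read_uint are made up for illustration, and the buffer size of 0xF0 bytes assumes the filled block runs from ebp-0x0F0 up to ebp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: emulate the Debug-mode fill by setting every
// byte of a buffer to 0xCC, as rep stos does for the stack frame.
std::vector<unsigned char> make_filled_frame(std::size_t size)
{
    return std::vector<unsigned char>(size, 0xCC);
}

// Hypothetical helper: read sizeof(unsigned int) bytes at 'offset',
// the way a local that was never assigned would be read.
unsigned int read_uint(const std::vector<unsigned char>& frame,
                       std::size_t offset)
{
    unsigned int value = 0;
    assert(offset + sizeof value <= frame.size());
    std::memcpy(&value, &frame[offset], sizeof value);
    return value;
}
```

With a buffer from make_filled_frame(0xF0), read_uint returns 0xCCCCCCCC at any valid offset when unsigned int is four bytes. Since all four bytes are equal, the byte order does not matter for this particular value.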

Further, we can also see that each initialized integer is "guarded" by four bytes of 0xCC (before and after each integer).

The x86 architecture uses the little-endian format. I will briefly explain it here. If you want more in-depth information, just search the net.

Endianness describes the order in which a sequence of bytes is stored in memory. This means that endianness only matters for data types that are more than one byte in size. Little-endian means that the Least Significant Byte (LSB) is stored at the lowest address and the Most Significant Byte (MSB) is stored at the highest address.

Since the example above deals with integers, let's use an integer as an example. An integer is four bytes in size, which can be seen in the Memory view. An integer can be written as "ByteA ByteB ByteC ByteD", where ByteA is the MSB and ByteD is the LSB. In little-endian format, the memory looks like this:

Base Address: ByteD
Base Address + 1: ByteC
Base Address + 2: ByteB
Base Address + 3: ByteA
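
The layout above can also be checked from inside a program, without a debugger. The following is a minimal sketch; the helpers bytes_in_memory_order and is_little_endian are names I made up for illustration, and the code relies only on the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: return 'size' bytes starting at 'data' as hex
// text, in increasing address order (base address first).
std::string bytes_in_memory_order(const void* data, std::size_t size)
{
    assert(data != 0);
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    static const char digits[] = "0123456789ABCDEF";

    std::string text;
    for (std::size_t i = 0; i < size; ++i) {
        if (i != 0) {
            text += ' ';
        }
        text += digits[bytes[i] >> 4];    // high nibble
        text += digits[bytes[i] & 0x0F];  // low nibble
    }
    return text;
}

// True when the least significant byte sits at the lowest address.
bool is_little_endian()
{
    unsigned int probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}
```

On a little-endian machine with a four-byte unsigned int, passing the address of a variable holding 0x0A0B0C0D gives "0D 0C 0B 0A", which is ByteD, ByteC, ByteB, ByteA in the notation above.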

Let's make the example above more concrete by using an actual value from our simple program, for instance:

   unsigned int a = 1;

As we know, the unsigned int is four bytes in size, so the value is 0x00000001. According to the Disassembly view, the variable from the statement "unsigned int a = 1;" is stored on the stack at memory address ebp-0x8. The LSB, i.e. 0x01, is stored at ebp-0x8, and the other bytes are stored as shown below.

ebp-0x8: 0x01
ebp-0x7: 0x00
ebp-0x6: 0x00
ebp-0x5: 0x00
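
Going the other way, the value can be rebuilt from the bytes as they appear in the Memory view. This is a small sketch under the assumption of 8-bit bytes; from_little_endian is a hypothetical helper, not something taken from the program above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: rebuild an unsigned value from 'count' bytes
// listed from the lowest address to the highest (little-endian order).
unsigned long from_little_endian(const unsigned char* bytes,
                                 std::size_t count)
{
    assert(count <= sizeof(unsigned long));
    unsigned long value = 0;
    for (std::size_t i = 0; i < count; ++i) {
        // The byte at offset i supplies bits 8*i up to 8*i+7.
        value |= static_cast<unsigned long>(bytes[i]) << (8 * i);
    }
    return value;
}
```

Feeding it the four bytes 0x01, 0x00, 0x00, 0x00 in that order returns 1, because only the byte at the lowest address contributes to the lowest eight bits.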

The example above dealt only with unsigned int. Now we move on to another simple program, which shows the data representation of each unsigned numerical integer type: char, short int, int and long int.
int main()
{
   unsigned char a = 1;
   unsigned short int b = 2;
   unsigned int c = 3;
   unsigned long int d = 4;

   return 0;
}
After starting WinDbg and executing all the initialization statements, we see the result below.
WinDbg - Memory view after initializations
As in the first simple program, the blue arrow and blue rectangle show the 0xCC bytes written by the rep stos instruction.

Let's investigate how each type is stored. We know they are stored in little-endian format, meaning the LSB is stored at the lowest address.

   unsigned char a = 1;

ebp-0x5: 0x01

   unsigned short int b = 2;

ebp-0x14: 0x02
ebp-0x13: 0x00

   unsigned int c = 3;

ebp-0x20: 0x03
ebp-0x1F: 0x00
ebp-0x1E: 0x00
ebp-0x1D: 0x00

   unsigned long int d = 4;

ebp-0x2C: 0x04
ebp-0x2B: 0x00
ebp-0x2A: 0x00
ebp-0x29: 0x00

From the results above, we can note that the unsigned char is only one byte in size, so endianness does not matter for it. The unsigned short int occupies two bytes, and both unsigned int and unsigned long int are four bytes in size.
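
The sizes read from the Memory view can be cross-checked with sizeof. The sketch below uses a made-up struct UnsignedSizes and function unsigned_sizes. Note that the results depend on the compiler and platform: with Visual C++ the unsigned long int is four bytes, while a typical 64-bit Linux compiler makes it eight.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical record holding the size in bytes of each unsigned type.
struct UnsignedSizes {
    std::size_t char_size;
    std::size_t short_size;
    std::size_t int_size;
    std::size_t long_size;
};

// Collect the sizes as reported by the compiler in use.
UnsignedSizes unsigned_sizes()
{
    UnsignedSizes sizes;
    sizes.char_size  = sizeof(unsigned char);
    sizes.short_size = sizeof(unsigned short int);
    sizes.int_size   = sizeof(unsigned int);
    sizes.long_size  = sizeof(unsigned long int);

    // sizeof(unsigned char) is 1 by definition in C++.
    assert(sizes.char_size == 1);
    return sizes;
}
```

On the setup used in this post, I would expect the four fields to be 1, 2, 4 and 4, matching the number of bytes listed for each variable above.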

You are welcome to leave comments, complaints or questions!
