Contents

How to load ELF into memory?

In this article, the loading process described is limited to ELF on Linux; it does not cover how PE is loaded on Windows. But don’t worry: many of the concepts (apart from the terminology) are similar on Windows.

Why not just map to the physical memory address?

Why is a program always loaded at virtual memory addresses? We already know that a program runs as a process, and each process tends to behave as if it owns all of the memory by itself. Long ago, processes ran directly on physical memory, which means each process was mapped to physical memory addresses. However, this method has several problems:

  1. The address spaces of processes are not separated from each other, so a malicious process can access the address space of another.
  2. Mapping the whole program into memory is inefficient.
  3. A program can only access data at fixed addresses, but the memory addresses actually allocated are not fixed.

Earlier methods mapped directly to physical memory addresses: fixed partitioning and dynamic partitioning. With fixed partitioning, each partition has a fixed size (the partitions may be unequal to each other); however, the fixed sizes cause internal fragmentation. With dynamic partitioning, each partition is sized to fit its process; however, this suffers from external fragmentation, and even with compaction it needs a strategy for choosing which partition to use.

The method that solves these problems is to have processes access virtual memory addresses, and to map virtual addresses to physical addresses. Notice that non-consecutive physical addresses can still be consecutive in the virtual address space, which means a process doesn’t need to care about non-contiguity in physical memory! Each process can only access its own virtual address space (which solves problem 1). For the methods of mapping from virtual address space to physical address space, I would like to introduce segmentation, used in the past, and paging, which is widely used now.
Segmentation maps the virtual address space to the physical address space byte by byte (a one-to-one byte mapping). For example, suppose process A needs 10 MB of memory and is given (0x00000000 ~ 0x00A00000) in its virtual address space. Then 10 MB of physical memory is allocated and mapped from that virtual range. For example,

# virtual address
0x00000000 ~ 0x00A00000              0x00000000 ~ 0x00100000
                \    process A                  \      process B
                 v                               v
        0x00100000 ~ 0x00B00000           0x00B00000 ~ 0x00C00000
# physical address

The memory accessed in the virtual address space and in the physical address space has the same size. Segmentation solves the first issue (A and B map to different physical address ranges) and the third issue (no matter which physical range is allocated, the process just follows its own mapping starting from 0x00000000). However, mapping the whole process into memory is still inefficient. To solve this, we first need to know the principle of locality: a process only accesses part of its data and code frequently, which means a more fine-grained method can achieve higher efficiency. This method is called paging.
Paging separates the address space into pages, and breaks all physical memory into small fixed-size partitions called frames (a form of fixed partitioning). Each process is given a number of pages, and each frame can hold exactly one page. The page size is decided by the hardware. The data and code accessed frequently are loaded into memory, and the remaining parts are kept on disk. With paging, the memory wasted to internal fragmentation is never more than (page_size - 1) bytes per process. Here is an example: assume process A uses virtual address space VAS_A and process B uses VAS_B. VAS_A and VAS_B can be separated into pages (VP_A1, VP_A2, ... VP_A8) and (VP_B1, VP_B2, ... VP_B8); there are also physical pages (PP1, PP2, ... PP6) and disk pages (DP0, DP1).

          #VAS_A#   #physical#  #VAS_B#
           VP_A8      PP6        VP_B8
#Disk#     VP_A7 -->  PP5  <--   VP_B7
 DP1  <--  VP_A6      PP4        VP_B6
 DP0  <--  VP_A5   >  PP3        VP_B5
           VP_A4  /   PP2        VP_B4
           VP_A3 /    PP1  <     VP_B3
           VP_A2             \   VP_B2
           VP_A1              -  VP_B1

In the diagram above, VP_A6 and VP_A5 are kept on disk at first. When process A needs to access these two pages, the hardware raises a page fault, and the OS reads them from disk into physical memory. The details of page faults are described in the next section. VP_A7 and VP_A3 of process A, and VP_B7 and VP_B1 of process B, are mapped to physical memory by pages, and VP_A7 and VP_B7 share the same physical page! Mapping by pages solves the second problem and also makes it convenient to build in protection: each page can be given different privileges to restrict what processes may access.
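The (page_size - 1) fragmentation bound mentioned earlier can be checked with a quick sketch. This is a minimal illustration, assuming 4096-byte pages; only a process's last page is partially filled, so that is where all the waste lives:

```python
# Sketch: internal fragmentation under paging.
# Only the last page of a process is partially filled, so the waste is
# (pages * page_size - size), which is always less than one page.

PAGE_SIZE = 4096  # typical page size on x86 Linux (assumption)

def pages_needed(size: int) -> int:
    """Number of fixed-size frames needed to hold `size` bytes."""
    return (size + PAGE_SIZE - 1) // PAGE_SIZE  # ceiling division

def internal_fragmentation(size: int) -> int:
    """Bytes wasted in the last, partially filled page."""
    return pages_needed(size) * PAGE_SIZE - size

print(pages_needed(10_000))            # 3 pages for a 10 000-byte process
print(internal_fragmentation(10_000))  # 2288 wasted bytes, < 4096
```

Compare this with dynamic partitioning, where external fragmentation has no such fixed bound.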
To summarize the mapping from virtual address space to physical address space: almost every CPU relies on a component, the Memory Management Unit (MMU), to handle it. This work is called dynamic address translation.

[CPU] -> virtual address -> [MMU] -> physical address -> [physical memory]

(figure: /12-21-21/paging.png)
Take a look at the picture above. After process B is swapped out, we can still place D in non-consecutive physical pages. Applying dynamic address translation (look up the frame number in the page table with the virtual page number, then add the offset) to consecutive virtual addresses, we find that the underlying physical addresses are actually non-consecutive.
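The translation step can be sketched by splitting a virtual address into a page number and an offset, then substituting the frame number from the page table. The page-table contents below are made-up values for illustration:

```python
# Sketch: dynamic address translation with a single-level page table.
PAGE_SIZE = 4096
OFFSET_BITS = 12               # log2(PAGE_SIZE)

# Hypothetical page table: virtual page number -> physical frame number.
page_table = {0: 5, 1: 2, 2: 7}

def translate(vaddr: int) -> int:
    vpn = vaddr >> OFFSET_BITS          # virtual page number
    offset = vaddr & (PAGE_SIZE - 1)    # offset within the page is kept
    frame = page_table[vpn]             # missing entry would be a "page fault"
    return (frame << OFFSET_BITS) | offset

# Consecutive virtual addresses can map to non-consecutive physical ones:
print(hex(translate(0x0000)))  # 0x5000
print(hex(translate(0x1000)))  # 0x2000
```

Note how addresses 0x0000 and 0x1000 are adjacent virtually but land in frames 5 and 2 physically.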

Takeaway: this section talks about why paging is used rather than segmentation.

How to map by pages?

Continuing from the previous section: how is a process built up, loaded, and run?

  1. Build up an individual virtual address space.
  2. Read the ELF header and build up the mapping between the program and the virtual address space.
  3. Set the CPU registers to the entry point of the program and start running.

Step 1 doesn’t actually create a space (of course, it is virtual); it just builds a data structure for mapping from the virtual address space to the physical address space.
Step 2 maps the ELF into the virtual address space. For example, if the page size is 4096 bytes, then each mapping is done in units of 4096 bytes.

# ELF
header | .text | ...
           \
            v
... | ... | .text | ...
# virtual address space
        0x08048000 ~ 0x08049000

Mapping by page, .text occupies 0x08048000 ~ 0x08049000 in the virtual address space, as shown above. This mapping is, again, just a data structure, and it can also be used to look up the location of a faulting page in the ELF.
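A minimal sketch of such a mapping structure (the layout and names here are invented for illustration; in Linux this role is played by the kernel's VMA structures): given a faulting virtual address, find which ELF file offset backs that page.

```python
# Sketch: the per-process mapping from virtual ranges to ELF file offsets.
# On a page fault, the kernel walks a structure like this to find where
# in the file the missing page's content lives.

# Hypothetical mappings: (virtual start, length, file offset of segment).
mappings = [
    (0x08048000, 0x2000, 0x0000),   # e.g. the code segment (.text)
    (0x0804A000, 0x1000, 0x2000),   # e.g. the data segment
]

def file_offset_for(vaddr: int):
    """Return the ELF file offset backing the byte at vaddr, or None."""
    for start, length, file_off in mappings:
        if start <= vaddr < start + length:
            return file_off + (vaddr - start)
    return None  # unmapped address -> segmentation fault

print(hex(file_offset_for(0x08048010)))  # 0x10: 16 bytes into the file
```

The point is that the structure stores ranges, not per-byte entries, which is why building it is cheap even before any page is loaded.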
In step 3, the OS hands control of execution over to the process.

What’s a page fault?

The three steps above have not loaded any instructions or data into memory; only the mapping data structure has been built. Assume the entry point of the program is 0x08048000, which is the start address of .text. When the CPU tries to execute the instruction at this address, it finds that this page (0x08048000 ~ 0x08049000) is an empty page, so a page fault is raised and control passes to the OS. The exception handler in the OS searches the data structure built in step 2, which maps the ELF to the virtual address space. That data structure tells the OS which part of the ELF the empty page corresponds to, and the OS allocates a physical page and loads the content into it. Control is then given back to the process, which continues from the address that faulted. Therefore, as the process executes, page faults happen again and again.

Briefly, it works like the following steps:

  1. Execute, page by page.
  2. The process asks for a page that is not held in memory.
  3. No frame currently holds that page.
  4. A page fault is raised.
  5. The OS loads the desired page from disk into a frame in physical memory.
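The steps above can be simulated in a few lines. This is a sketch only: the three-frame memory, the FIFO eviction policy, and the page names are arbitrary choices for illustration, not what Linux actually uses:

```python
# Sketch: demand paging. Pages start on "disk"; touching an absent page
# raises a fault, and the handler loads it into a free frame (FIFO
# eviction when memory is full).
from collections import deque

NUM_FRAMES = 3
frames = {}                  # page -> frame number
free = deque(range(NUM_FRAMES))
loaded_order = deque()       # tracks loading order for FIFO eviction
faults = 0

def access(page: str) -> None:
    global faults
    if page in frames:
        return                           # hit: page already in memory
    faults += 1                          # page fault: "OS" takes over
    if not free:
        victim = loaded_order.popleft()  # evict the oldest page
        free.append(frames.pop(victim))
    frames[page] = free.popleft()        # "read from disk" into the frame
    loaded_order.append(page)

for p in ["A", "B", "A", "C", "D", "B"]:
    access(p)
print(faults)  # 4 faults: A, B, C, D (the repeated A and B are hits)
```

Accessing D with all three frames full evicts A, exactly the "no frame holds the page" case in step 3 above.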

Takeaway: this section talks about how mapping works together with page faults.

Section or Segment in memory?

If we mapped each section to its own page, it would be an annoying waste whenever a binary has many sections. Therefore, an ELF is mapped into pages by segments instead of by sections. What’s a segment? A segment is a combination of sections in memory, for the purpose of reducing scattered pages.

readelf -S xxx.elf
readelf -l xxx.elf

The ELF has a section header table, which we can inspect with the -S option, and a program header table, which we can inspect with the -l option to see the section-to-segment mapping. However, the size of a segment is still not a multiple of the page size, which means pages would still be scattered. For example, suppose an ELF has three segments (seg1, seg2, seg3), and seg2 alone needs three pages. If each segment were mapped to pages individually, we would need five pages of physical memory. To reduce the number of pages needed, a page that contains parts of multiple segments is mapped twice in virtual memory. For example,

|--seg1--|-----------seg2------------|---seg3---|
|------page------|------page------|------page------|

From the picture above, we can imagine that if we pack all segments together, only three pages are enough. That is close to the solution for the waste. First, what is the difficulty if we use only three pages to hold all the segments?

|--seg1--|-----------seg2------------|---seg3---|
|------page------|    |------page------|  |------page------|
|   seg1 + seg2' |    |     seg2''     |  | seg2''' + seg3 |
# map to virtual address space
|   seg1 + seg2' |     seg2''     | seg2''' + seg3 |
v          v                                  v
seg1       seg2                               seg3

It seems we could still locate the start address of each segment if we loaded them into continuous virtual address space. However, if we wanted to swap out page 1 after seg1 is no longer needed, part of seg2 would be swapped out with it. Therefore, we want to avoid having the starting addresses of two segments inside the same page. If we map each page containing mixed segments twice,

|   seg1 + seg2' |    |     seg2''     |  | seg2''' + seg3 |
# map to virtual address space
|   seg1 + seg2' |   seg1 + seg2' |     seg2''     | seg2''' + seg3 | seg2''' + seg3 |
v                           v                                                   v
seg1                      seg2                                                  seg3

Now all segment starting addresses are in different virtual pages. For seg2, we can still read continuous memory across its three pages. From this we see that a segment’s starting address does not need to be a multiple of the page size.
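This layout rule can be sketched numerically. The segment sizes below are hypothetical; the rule being illustrated is that each segment keeps its in-page offset from the packed file layout but starts in a fresh virtual page, so five virtual pages are mapped while only three physical pages are needed:

```python
# Sketch: give each segment its own starting virtual page while keeping
# its offset within the page, so boundary pages can be mapped twice.
PAGE = 4096
# Hypothetical segments as (file_offset, size), packed back to back:
segs = {"seg1": (0, 1000), "seg2": (1000, 9000), "seg3": (10000, 1500)}

BASE = 0x08048000
vaddr = {}
page = BASE // PAGE
for name, (off, size) in segs.items():
    in_page = off % PAGE
    vaddr[name] = page * PAGE + in_page          # in-page offset is kept
    page += (in_page + size + PAGE - 1) // PAGE  # pages this segment spans

print({name: hex(addr) for name, addr in vaddr.items()})
# Only ceil((1000 + 9000 + 1500) / 4096) = 3 physical pages are needed,
# because the two boundary pages are each mapped at two virtual addresses.
```

Note that seg2 starts at 0x080493E8, not on a page boundary, matching the observation above that segment starts need not be page-aligned.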

Takeaway: this section talks about how segments are loaded into memory. The problem is simpler on Windows: the starting addresses of all segments are multiples of the page size. There is no need to worry about the waste there, because PE files don’t have as many sections as ELF files do.

Stack and Heap in memory

In addition to the segments of the ELF, the stack and heap are also allocated in the virtual address space. To see the virtual address space allocated for a process:

# get pid
./xxx.elf &
# list out all virtual address space for the pid
cat /proc/[pid]/maps

In addition to the virtual areas allocated for segments, we can also find [heap], [stack], and [vdso]. These three areas are not mapped to any ELF; they are called Anonymous Virtual Memory Areas.
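As a sketch, the anonymous areas can be picked out of maps output programmatically. The sample lines below are made up in the style of /proc/[pid]/maps; on a real system you would read the file for an actual pid:

```python
# Sketch: picking the anonymous areas ([heap], [stack], [vdso]) out of
# /proc/<pid>/maps-style output. The sample text is hard-coded here.
sample = """\
08048000-08049000 r-xp 00000000 08:01 123456 /home/user/xxx.elf
09f00000-09f21000 rw-p 00000000 00:00 0      [heap]
b7fff000-b8000000 r-xp 00000000 00:00 0      [vdso]
bffdf000-c0000000 rw-p 00000000 00:00 0      [stack]
"""

anonymous = []
for line in sample.splitlines():
    fields = line.split()
    # Anonymous areas carry a [name] tag instead of a backing file path.
    if len(fields) >= 6 and fields[5].startswith("["):
        anonymous.append(fields[5])

print(anonymous)  # ['[heap]', '[vdso]', '[stack]']
```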

How does Linux load ELF?

Back to our big question: how does Linux load an ELF? In a bash process, we enter a command to run the ELF; bash calls fork() to create a new process, and the new process calls execve() to start loading the ELF.

bash -> fork() => new process
# new process
-> execve('xxx.elf', argv, envp)  # start loading
-> sys_execve()  # /kernel/Process.c
-> do_execve()
-> search_binary_handler()

First, do_execve() finds the ELF and reads its first 128 bytes to check the format. The first four bytes are called the magic number. For example,

# ELF
[0x7F, 'E', 'L', 'F']
# Java class file
[0xCA, 0xFE, 0xBA, 0xBE]
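A format check in the spirit of what the kernel does can be sketched in a few lines (detect() is a made-up name for illustration; the real kernel compares the buffer inside each registered binfmt handler):

```python
# Sketch: recognizing a file format from its leading magic bytes.
ELF_MAGIC = b"\x7fELF"                 # 0x7F followed by "ELF"
JAVA_CLASS_MAGIC = b"\xca\xfe\xba\xbe" # Java .class files, "cafebabe"

def detect(first_bytes: bytes) -> str:
    if first_bytes.startswith(ELF_MAGIC):
        return "ELF"
    if first_bytes.startswith(JAVA_CLASS_MAGIC):
        return "Java class"
    return "unknown"

print(detect(b"\x7fELF\x02\x01\x01\x00"))   # ELF
print(detect(b"\xca\xfe\xba\xbe\x00\x00"))  # Java class
```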

It then calls search_binary_handler(). From the name, we can already guess that this function decides the loading method based on the format. For example, the loader for ELF is load_elf_binary(), while load_script() handles script-based executables. load_elf_binary() is defined in /fs/binfmt_elf.c. The major steps are:

  1. simple check of the binary format
  2. find the .interp section for dynamic linking
  3. map into memory based on the program header
  4. initialize the process environment
  5. modify the return address of the system call to be the entry point of the ELF

There are several details here which will be discussed later, in the article on dynamic linking. After load_elf_binary() is done, control returns to do_execve(), then to sys_execve(), and the next step returns to the entry point of the ELF instead of back to execve()! Why? Because step 5 already modified the return address of the system call made by execve(); therefore, we arrive at the door of the ELF.

What’s the next plan?

OK, it really took some time for me to write this note, but it was interesting. It brought me back to when I first learned about operating systems. What’s different is that this time I know why I’m learning, and I know more about ELF. The ELF discussed in this article is the statically linked case; I would like to explore the dynamically linked case in the next article.