x86 Exploitation 101: heap overflows… unlink me, would you please?

Well, do the previous techniques apply to the dynamic allocation scenario? What if, instead of a statically allocated array, there’s a malloc-ed space? Would that work? Well, more or less, but things get REALLY more complicated. And I mean it for real. So, instead of stack overflows, there will be heap overflows: problem is that heap and stack work in different ways. In addition, the heap is handled differently according to the allocator implementation: this makes heap overflow exploits really dependent on the allocator implementation and on the operating system. As for now I’m taking care about Linux, I decided to analyze the exploitation scenario and the history of the most common Linux allocator, i.e. the one included in GNU C library (glibc).

Why should I care about the history? Glibc developers patched, time after time, the most of the bugs in the malloc implementation that allowed exploits to work. This, anyway, is a good training exercise to really get into this stuff and understand the general thought. Also, as the exploitation of a heap overflow strongly relies on the implementation of the allocator, a deep analysis on how it’s implemented is mandatory.

So, first things first. How is the malloc implemented in glibc? The implementation is a variation of the ptmalloc2 implementation, which is a variation of dlmalloc. Anyway, as the glibc code says, “There have been substantial changes made after the integration into glibc in all parts of the code. Do not look for much commonality with the ptmalloc2 version”. Even if it’s a variation of a variation, the fundamental ideas are still the same. Actually the ptmalloc2 supports more than one heap at the same time, and each heap is identified by the following structure:

typedef struct _heap_info
  mstate ar_ptr; /* Arena for this heap. */
  struct _heap_info *prev; /* Previous heap. */
  size_t size;   /* Current size in bytes. */
  size_t mprotect_size; /* Size in bytes that has been mprotected
                           PROT_READ|PROT_WRITE.  */
  /* Make sure the following data is properly aligned, particularly
     that sizeof (heap_info) + 2 * SIZE_SZ is a multiple of
  char pad[-6 * SIZE_SZ & MALLOC_ALIGN_MASK];
} heap_info;

The most important thing in this structure is the ar_ptr variable, as it’s a pointer to the arena, i.e. the heap itself. Also, there’s the size variable storing the size of the heap. Having more than one heap in a multi-threading context is really useful, as if a thread is reading the content of a variable in a given arena and another thread asks to allocate new memory, it is possible to create a new arena and use the newly created one to perform all the required operations.

Each arena is described by the following structure:

struct malloc_state
  /* Serialize access.  */
  mutex_t mutex;

  /* Flags (formerly in max_fast).  */
  int flags;

  /* Fastbins */
  mfastbinptr fastbinsY[NFASTBINS];

  /* Base of the topmost chunk -- not otherwise kept in a bin */
  mchunkptr top;

  /* The remainder from the most recent split of a small request */
  mchunkptr last_remainder;

  /* Normal bins packed as described above */
  mchunkptr bins[NBINS * 2 - 2];

  /* Bitmap of bins */
  unsigned int binmap[BINMAPSIZE];

  /* Linked list */
  struct malloc_state *next;

  /* Linked list for free arenas.  */
  struct malloc_state *next_free;

  /* Memory allocated from the system in this arena.  */
  INTERNAL_SIZE_T system_mem;
  INTERNAL_SIZE_T max_system_mem;

In this structure, there’s the mutex for this arena, used during concurrent accesses, and two other really important fields: top and bins. The top field is a chunk of memory (the fundamental element of the allocator) and it’s where the data for which the space has been allocated are going to be stored. Each chunk in this is a memory fragment that can be allocated. At the beginning, there’s only one big chunk in the arena (called wilderness), which is pointed by the top field itself: this chunk is always free and its size represents the free space of the arena. Also it marks the end of the available space of the arena.

The bins array is composed by double-linked lists to chunks that were allocated and that were successively freed (this means that, at the beginning, all the bins are empty). Each bin stores a list of chunks of specified size in order to allow the allocator to easily search for a free chunk, given the size: the research will be performed by starting looking for the smallest and best-fitting one.

Bins and Chucks

The chunk is described by the following structure:

struct malloc_chunk
  INTERNAL_SIZE_T      prev_size;  /* Size of previous chunk (if free).  */
  INTERNAL_SIZE_T      size;       /* Size in bytes, including overhead. */

  struct malloc_chunk* fd;         /* double links -- used only if free. */
  struct malloc_chunk* bk;

  /* Only used for large blocks: pointer to next larger size.  */
  struct malloc_chunk* fd_nextsize; /* double links -- used only if free. */
  struct malloc_chunk* bk_nextsize;

The different fields of this structure are used in different ways, depending on the status of the chunk itself. If the chunk is allocated, only the first two fields are present. The prev_size fields specifies the size of the previous chunk if the previous one is free (if it is allocated, then this field is “shared” with the previous chunk, enlarging it of 4 bytes, in order to decrease the waste of space), while the size field specifies the size of the current chunk. Then, if the chunk is in free state, there are in addition two pointers: fd and bk. These are the pointers used to double link the list of chunks. The other two pointers (present anyway only in free chunks) are not that important in this context and won’t be examined.

How come there’s no flag to tell if the current chunk is free or allocated? Well, a feature of the chunks in this implementation is that they are 8-bytes aligned: this means that the last three bits of the size field can be used as general purpose flags.

  • The LSB of the size variable tells us if the previous chunk is allocated or not (PREV_INUSE)
  • The following bit tells if the chunk was allocated by using the mmap system call (IS_MMAPPED)
  • The third bit specifies if the chunk is stored in the main arena or not (NON_MAIN_ARENA)

In order to know if a given chunk is free or not it is necessary to get the next chunk by adding size to the pointer of the current chunk (obtaining the address of the next chunk), and checking the LSB of the size field of the next chunk.

When the malloc function is called, the first thing done is to search in the bins if there’s already a previously-freed chunk available matching the specified size, otherwise, a new chunk is created in the wilderness next to last allocated one. If a chunk is found in the bins, it is necessary to remove it from the list, in order to keep the whole structure coherent. This is done by using the infamous unlink macro defined in malloc.c: this macro uses the bk and the fd fields of the chunk to be moved to perform its task. Before glibc 2.3.4 (released at the end of the 2004), the unlink macro was defined as follows:

/* Take a chunk off a bin list */
#define unlink(P, BK, FD) {                                            \
  FD = P->fd;                                                          \
  BK = P->bk;                                                          \
  FD->bk = BK;                                                         \
  BK->fd = FD;                                                         \

This macro is just an extraction of an item from a double-linked list: business as usual. If no space is available on the heap, new memory for the heap is requested to the operating system, by using the sbrk or the mmap system call (if mmap is used, then the chunk is marked with the IS_MMAPPED bit). The heap address where the chunk has been allocated is then returned to malloc.

However there’s another situation in which the unlink macro may be used: during a free. In fact, if the chunk(s) next (both before and after) to the one that is going to be freed is/are not used, then all of them are merged together in one chunk. This means that all the chunks that were in free status before the merging need to be unlinked from the bins where they were by using the aforementioned macro. The resulting chunk is finally moved to the unsorted_chunks list

On July 2000, Solar Designer on Openwall and then on November 2001 MaXX on Phrack #57, published two articles on how to exploit this macro. Their whole point is: what if it was possible to modify the fd and the bk pointers? In fact, if we look at the malloc_chunk structure, the whole unlink macro can be reduced to the following instructions:

FD = *P + 8;
BK = *P + 12;
FD + 12 = BK;
BK + 8 = FD;

This means that, if we could control the fd and the bk pointers, it would be possible to overwrite the FD+12 location with the BK content. If BK points to the shellcode address, that would be awesome. Actually the BK+8 location gets overwritten as well, and that would be in the middle of the shellcode itself: this means that the first instruction of the shellcode should jump over this overwritten part. Here the chance of overwriting any DWORD of memory is given: the problem is which one would be of any use to be overwritten? Overwriting the return address, just like in the stack overflows, is a painful, as it depends on the stack situation at the moment. Interesting alternatives would be about overwriting something in libc, an exception handler address, or the free function address itself. This means that every single function pointer can actually be overwritten.

MaXX, in his article on Phrack, proposed the idea of overwriting the first function pointer emplacement in the .dtors section (he actually re-proposed the ideas already exposed in this article). gcc provides an interesting feature: constructors and destructors. The idea behind this is exactly the same used in C++ for classes. It is possible to specify attributes for some functions that will be automatically executed before the main function (constructors) or after (destructors). The declarations are specified in the following way:

static void start(void) __attribute__ ((constructor));
static void stop(void) __attribute__ ((destructor));

where start and stop are arbitrary names for functions. The addresses of these functions will be stored, respectively, in the .ctors and in .dtors section in the following way: there are 4 bytes set to 0xFF at the beginning and 4 bytes set to 0x00 at the end and, in between, there are the addresses of the functions. Of course, if there are no constructors/destructors defined, the .ctors/.dtors section would look like 0xFFFFFFFF 0x00000000. The goal is to overwrite the 0x00000000 part with the address of the function that has to be executed as a constructor (the head MUST be left as 0xFFFFFFFF).

So, the “only” thing left to do is to be able to control the fd and the bk pointers. How this can be done? To do this, two adjacent allocate chunks are required and an overflow must be possible on the first one. In fact, let’s say that we have the following piece of code:

#include <stdlib.h>
#include <string.h>
int main(int argc, char **argv)
  char *first_buf;
  char *second_buf;

  first_buf = (char *)malloc(78 * sizeof(char));
  second_buf = (char *)malloc(20 * sizeof(char));

  strcpy(first_buf, argv[1]);

  return 0;

The heap situation will look like:

first_buf size has been aligned to 8-bit (so, from 78 to 80 byte of buffer space) and includes the prev_size and the size fields (80+4+4=88). In addition the LSB has been set to 1, as the chunk is used: the final size for this chunk is 89 (0x59). The second_buf chunk size is already 8-bit aligned and shares the first 4 bytes with the previous chunk: this means that the final size of the chunk itself is 25 (0x19), as the LSB is set as well.

It is clear that, if we overflow first_buf, it is possible to overwrite the data of second_buf (metadata included). So here’s the strategy:

  • The second_buf chunk must be “transformed” into a freed chunk
  • The address of memory where the 4 bytes are going to be written (minus 12) must be stored in the fd field of the chunk storing second_buf
  • The 4 bytes to be written must be stored in the bk field of the chunk storing second_buf
  • The whole thing must be triggered when free(first_buf) is executed

The strategy is pretty simple: the only tricky point is the first one, but its solution is simple and awesome. Glibc’s implementation of malloc checks if a chunk is free or not with the following macro:

#define inuse_bit_at_offset(p, s)   (((mchunkptr) (((char *) (p)) + (s)))->size & PREV_INUSE)

where p is the address of the chunk and s is its size (it actually checks the PREV_INUSE bit of the size field of the following chunk). This would mean modify a third chunk? Well, no… What if the size of the second chunk is set to 0? The size of the second chunk itself is read! As the size is set to 0, the PREV_INUSE bit is clear: according to glibc the second_buf is a free chunk. That’s everything we need, actually.

The payload to be written in first_buf will be:

  • 80 bytes of useless data
  • 4 bytes of useless data overwriting the prev_size field of second_buf‘s chunk
  • 4 bytes set to 0 overwriting the size field of second_buf‘s chunk
  • 4 bytes set to the address of the last for bytes of dtors (the one having 0x00000000) overwriting the bk field of second_buf‘s chunk
  • 4 bytes set to the address of the shellcode to be executed overwriting the fd field of second_buf‘s chunk

This is really everything’s needed to execute an heap overflow exploiting the unlink macro bug.

An alternative way of exploitation exists: the double-free scenario. What happens if the same chunk is freed twice? Some glibc’s versions ago, it would have been inserted twice in the free chunks list. Let’s say that one of these two chunks is then reallocated and that we modify the locations where the fd and bk fields are. Then let’s say that we get managed to allocate the “second” chunk still in the free list: as the unlink macro is called at allocation time as well, the whole mechanism is triggered again in a similar way. This trick known as the double-free exploitation, and even if it’s not THAT documented, there are known exploits aiming at this kind of vulnerability.

Sadly, all this won’t work nowadays for four reasons:

  1. The RELRO technique (enabled by default on recent Linux distributions) marks the relocation sections used to dynamically dynamically loaded functions read-only (.ctors, .dtors, .jcr, .dynamic and .got): this means that the program crashes if it tries to modifies one of these sections.
  2. The __do_global_dtors_aux function (which is the one executing the destructors) has been hardened in 2007 in such a way that only the destructors actually defined in the code are going to be executed.
  3. As said at the beginning, the unlink macro has been hardened in order to check the pointers before unlinking:
    #define unlink(P, BK, FD) {                                            \
      FD = P->fd;                                                          \
      BK = P->bk;                                                          \
      if (__builtin_expect (FD->bk != P || BK->fd != P, 0))                \
        malloc_printerr (check_action, "corrupted double-linked list", P); \
      else {                                                               \
        FD->bk = BK;                                                       \
        BK->fd = FD;                                                       \
      }                                                                    \

    In a normal situation P->fd->bk points to P (the same is true for P->bk->fd): if this isn’t the case, then it means that the pointers have been modified and that the double-linked list is corrupted.

  4. Checks for double free have been added more or less everywhere

The next step in this field is the publication of another article on Phrack #61 by jp on August 2003 (still, some time before the unlink macro was patched on glibc) that aims at build a higher layer on what MaXX discovered two years before. jp’s main point is: what should I do if I know that I can write four bytes of its memory with almost any arbitrary data with the unlink technique? At the beginning of the paper he defined the new acronym aa4bmo meaning “Almost Arbitrary 4 Bytes Mirrored Overwrite” and for respect’s sake I will keep using this acronym from now on. When MaXX wrote his article, there was no NON_MAIN_ARENA bit, so what jp did is an update of the techniques exposed by MaXX in order to have them working on the 2003 version of glibc and the implementation of the aa4bmoPrimitive function that allows to write more or less complex programs exploiting the aforementioned vulnerability.

Based on the aa4bmo primitive, Phantasmal Phantasmagoria wrote another interesting article called “Exploiting the Wilderness“: in fact he proved that, in case an overflowable buffer is located next to the wilderness, then it is possible to have the aa4bmo.

Of course, this jp’s whole work (and everything else based on that) stopped working when the unlink macro was hardened.

MaXX, jp and Phantasmal Phantasmagoria’s work were an important step in heap overflow exploitation’s history, as they opened minds to new roads and it didn’t take too much to realize that there were/are other ways to exploit heap overflows. In fact, in 2005 Phantasmal Phantasmagoria came out with a theoretical article Malloc Maleficarum and started a new rush to the heap overflow exploits. All these next steps will be explained in the next article.


10 thoughts on “x86 Exploitation 101: heap overflows… unlink me, would you please?

    1. Hello, if you’re talking about the first_buf malloc, the actual size used in the allocation process is 80, as said in the article. Otherwise, would you please pointing out the exact dubious step?

  1. “As for now I’m taking care about Linux, I decided to analyze the exploitation scenario and the history of the most common Linux allocator, i.e. the one included in GNU C library (glibc).”

    Just nit picking around! Therefore why here linux is solely mentioned, instead of GNU / linux, which I would say is hardly related to the allocator, except for brk() and related syscall 🙂 ? Furthermore, they are a POSIX standard, and consequently are totally abstracted by the API (except for the syscall ABI you would be right to say) ? Indeed, here we are here mentioning with a surgical precision the GNU dynamic memory allocator only if I understand well.

    Nonetheless, thank you for this neat article, I will warmly recommend it around.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s