
ELF obfuscation: let analysis tools show wrong external symbol calls

Introduction

Now that the hack.lu 2014 CTF is over, I can finally publish a small ELF analysis tool fuck up I found some months ago. I used it in a challenge of the CTF ("the union") because I did not find anything about it on the internet (you could almost say it was a kind of "0 day" for obfuscating stuff in analysis tools).

So, a short back story on how I came to it. I gave a little talk at a colloquium about ELF basics and some obfuscation with the help of ELF sections. I discussed some ELF stuff with the guys there and an idea was raised: "What would happen if you put two dynamic string tables in there, a manipulated one referenced by the section header table and the original one referenced by the dynamic segment?" And this is what this post is all about.


The analysis tools fuck up

The biggest fuck up that (most) analysis tools make is that they use the section header table of the ELF file to get information about the binary. The problem is that the sections are completely optional and not needed to execute the file. You can delete them completely and still execute the binary without any problem. Why does this work? Because the loader uses the segments (described by the program header table), which are mandatory for an executable. But the sections (normally) contain more detailed information about the binary, and therefore analysis tools like IDA, gdb, ... use them.

IDA (the latest version is 6.6 at the time of writing), for example, has an option to use the "program header table" (which means it just uses the segments) to analyze the binary. Unfortunately, IDA still uses the information from the sections if they are available. In an earlier post I showed how to restore external symbol calls in IDA 6.1 with IDAPython and a Python library called ZwoELF when no sections are available (the current versions do it themselves, I think in a similar way). ZwoELF was written with the idea of using only the information a loader has when it starts the binary, and therefore it gets the correct information about the binary.
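To make the "sections are optional" point concrete, here is a minimal sketch in Python. It assumes a 32-bit little-endian ELF image (like the x86 challenge binary), and the function name strip_section_table_reference is made up for illustration; it is not part of ZwoELF or any other tool. It only clears the three header fields that reference the section header table and leaves the program header fields, which are what the loader reads, untouched.

```python
import struct

# ELF32 file header, little-endian: e_ident[16], e_type, e_machine, e_version,
# e_entry, e_phoff, e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum,
# e_shentsize, e_shnum, e_shstrndx
EHDR32 = struct.Struct("<16sHHIIIIIHHHHHH")


def strip_section_table_reference(image: bytes) -> bytes:
    """Return a copy of the image whose header no longer references sections.

    Hypothetical helper for illustration; assumes ELFCLASS32 / little-endian.
    """
    fields = list(EHDR32.unpack_from(image, 0))
    ident = fields[0]
    if ident[:4] != b"\x7fELF" or ident[4] != 1 or ident[5] != 1:
        raise ValueError("not a 32-bit little-endian ELF image")
    fields[6] = 0    # e_shoff: file offset of the section header table
    fields[12] = 0   # e_shnum: number of section headers
    fields[13] = 0   # e_shstrndx: index of the section name string table
    # e_phoff / e_phnum (the segments) are deliberately left alone.
    return EHDR32.pack(*fields) + image[EHDR32.size:]
```

A tool that relies on sections finds nothing in the result, while everything the loader needs is still in place.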

So the question you should ask yourself is: "Why shouldn't we manipulate the section header table to show false information about the binary?". But before we come to this and how you can do this, we should know what the dynamic string table is.

The dynamic string table (section) holds all the strings that are needed for dynamic linking. This means that all names of the symbols that are used to call functions from other libraries (for example malloc(), printf(), ...) are in there.


Examples

So let us start with an example before I describe how it exactly works. As an example I use the hack.lu 2014 CTF challenge "the union". It is a normal x86 ELF binary with nothing special. You can download the manipulated version here and the plain version without any section modification here. If you want to solve this challenge yourself you should not read ahead. There are spoilers ahead. If you want to see how it was solved, you can see write-ups here.

The following symbol strings were modified in the binary:


printf -> fputs
strncmp -> strcmp
system -> printf
 


First we take a look at IDA 6.6. Here is the graph view of the trapdoor function of the unmodified binary:



As you can see, the first basic block contains a call to "printf" and the left basic block contains a call to "system". Now we take a look at the modified version (I even used the option to use the "program header table" to load the binary):



We can see that the call to "printf" in the first basic block is shown as a call to "fputs", and the call to "system" in the second basic block is shown as a call to "printf". So IDA (like other analysis tools) is fooled by a manipulated section header table. It is even cooler when someone uses the decompiler of IDA:



In this case, the arguments are chosen to make sense when you look at the assembly. But if some push instructions for arguments do not make any sense (because the called function does not take that many arguments), they will not be displayed ;-)

So let us take a look at readelf. Readelf is a good starting point to get information about the binary. But it also uses the section header table (unless you tell it otherwise). So when you use it without any special parameter you get this:


sqall@towel:~/Desktop/hacklu$ readelf -a theunion
[...]
Relocation section '.rel.plt' at offset 0x4a0 contains 17 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
0804b00c  00000107 R_386_JUMP_SLOT   00000000   fputs
0804b010  00000207 R_386_JUMP_SLOT   00000000   fflush
0804b014  00000307 R_386_JUMP_SLOT   00000000   free
0804b018  00000407 R_386_JUMP_SLOT   00000000   sleep
0804b01c  00000507 R_386_JUMP_SLOT   00000000   __stack_chk_fail
0804b020  00000607 R_386_JUMP_SLOT   00000000   strcat
0804b024  00000707 R_386_JUMP_SLOT   00000000   malloc
0804b028  00000807 R_386_JUMP_SLOT   00000000   puts
0804b02c  00000907 R_386_JUMP_SLOT   00000000   printf
0804b030  00000a07 R_386_JUMP_SLOT   00000000   __gmon_start__
0804b034  00000b07 R_386_JUMP_SLOT   00000000   exit
0804b038  00000c07 R_386_JUMP_SLOT   00000000   strlen
0804b03c  00000d07 R_386_JUMP_SLOT   00000000   __libc_start_main
0804b040  00000e07 R_386_JUMP_SLOT   00000000   fopen
0804b044  00000f07 R_386_JUMP_SLOT   00000000   fgetc
0804b048  00001007 R_386_JUMP_SLOT   00000000   __isoc99_scanf
0804b04c  00001107 R_386_JUMP_SLOT   00000000   strcmp
[...]
 


In my post about "restoring external symbols with ida python" I also mentioned readelf. I got some answers that it can also use the dynamic segment to read the information (the dynamic segment is mandatory for dynamically linked ELF binaries). So let us use this parameter also in this case:


sqall@towel:~/Desktop/hacklu$ readelf -a -D theunion
[...]
'PLT' relocation section at offset 0x80484a0 contains 136 bytes:
 Offset     Info    Type            Sym.Value  Sym. Name
0804b00c  00000107 R_386_JUMP_SLOT   00000000   fputs
0804b010  00000207 R_386_JUMP_SLOT   00000000   fflush
0804b014  00000307 R_386_JUMP_SLOT   00000000   free
0804b018  00000407 R_386_JUMP_SLOT   00000000   sleep
0804b01c  00000507 R_386_JUMP_SLOT   00000000   __stack_chk_fail
0804b020  00000607 R_386_JUMP_SLOT   00000000   strcat
0804b024  00000707 R_386_JUMP_SLOT   00000000   malloc
0804b028  00000807 R_386_JUMP_SLOT   00000000   puts
0804b02c  00000907 R_386_JUMP_SLOT   00000000   printf
0804b030  00000a07 R_386_JUMP_SLOT   00000000   __gmon_start__
0804b034  00000b07 R_386_JUMP_SLOT   00000000   exit
0804b038  00000c07 R_386_JUMP_SLOT   00000000   strlen
0804b03c  00000d07 R_386_JUMP_SLOT   00000000   __libc_start_main
0804b040  00000e07 R_386_JUMP_SLOT   00000000   fopen
0804b044  00000f07 R_386_JUMP_SLOT   00000000   fgetc
0804b048  00001007 R_386_JUMP_SLOT   00000000   __isoc99_scanf
0804b04c  00001107 R_386_JUMP_SLOT   00000000   strcmp
[...]
 


You see that there is still no "strncmp" or "system". In fact, it is the same list as without the "-D" or "--use-dynamic" option. Somehow readelf still uses the section header table to retrieve the information. To see that you can retrieve the correct information, let us use the "readElf.py" script of the ZwoELF python library examples:


sqall@towel:~/Desktop/hacklu$ python /home/sqall/Desktop/elf_test/git/ZwoELF/examples/readElf.py ./theunion
[...]
Jump relocation entries (17 entries)
No.     MemAddr         File offset     Info            Type            Sym. value      Sym. name
        (r_offset)                      (r_info)        (r_type)
0       0x0804b00c      0x0000300c      0x00000107      R_386_JMP_SLOT  0x00000000      printf
1       0x0804b010      0x00003010      0x00000207      R_386_JMP_SLOT  0x00000000      fflush
2       0x0804b014      0x00003014      0x00000307      R_386_JMP_SLOT  0x00000000      free
3       0x0804b018      0x00003018      0x00000407      R_386_JMP_SLOT  0x00000000      sleep
4       0x0804b01c      0x0000301c      0x00000507      R_386_JMP_SLOT  0x00000000      __stack_chk_fail
5       0x0804b020      0x00003020      0x00000607      R_386_JMP_SLOT  0x00000000      strcat
6       0x0804b024      0x00003024      0x00000707      R_386_JMP_SLOT  0x00000000      malloc
7       0x0804b028      0x00003028      0x00000807      R_386_JMP_SLOT  0x00000000      puts
8       0x0804b02c      0x0000302c      0x00000907      R_386_JMP_SLOT  0x00000000      system
9       0x0804b030      0x00003030      0x00000a07      R_386_JMP_SLOT  0x00000000      __gmon_start__
10      0x0804b034      0x00003034      0x00000b07      R_386_JMP_SLOT  0x00000000      exit
11      0x0804b038      0x00003038      0x00000c07      R_386_JMP_SLOT  0x00000000      strlen
12      0x0804b03c      0x0000303c      0x00000d07      R_386_JMP_SLOT  0x00000000      __libc_start_main
13      0x0804b040      0x00003040      0x00000e07      R_386_JMP_SLOT  0x00000000      fopen
14      0x0804b044      0x00003044      0x00000f07      R_386_JMP_SLOT  0x00000000      fgetc
15      0x0804b048      0x00003048      0x00001007      R_386_JMP_SLOT  0x00000000      __isoc99_scanf
16      0x0804b04c      0x0000304c      0x00001107      R_386_JMP_SLOT  0x00000000      strncmp
[...]
 


As you can see, "strncmp" and "system" are displayed. Because the ordering of the output is the same, you can also see that "fputs" is actually "printf".
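Since both listings enumerate the same relocation entries in the same order, the renamed symbols fall out of a position-by-position comparison. A minimal sketch in Python, with the two name lists copied by hand from the outputs above (the function name renamed_symbols is made up for illustration):

```python
# Symbol names in relocation order, copied from the two listings above.
SECTION_VIEW = [  # what readelf prints (section header table)
    "fputs", "fflush", "free", "sleep", "__stack_chk_fail", "strcat",
    "malloc", "puts", "printf", "__gmon_start__", "exit", "strlen",
    "__libc_start_main", "fopen", "fgetc", "__isoc99_scanf", "strcmp",
]
LOADER_VIEW = [  # what readElf.py prints (dynamic segment)
    "printf", "fflush", "free", "sleep", "__stack_chk_fail", "strcat",
    "malloc", "puts", "system", "__gmon_start__", "exit", "strlen",
    "__libc_start_main", "fopen", "fgetc", "__isoc99_scanf", "strncmp",
]


def renamed_symbols(section_view, loader_view):
    """Map relocation index -> (name shown by tools, name the loader uses)."""
    if len(section_view) != len(loader_view):
        raise ValueError("the listings must cover the same relocation entries")
    return {
        index: (shown, real)
        for index, (shown, real) in enumerate(zip(section_view, loader_view))
        if shown != real
    }


for index, (shown, real) in renamed_symbols(SECTION_VIEW, LOADER_VIEW).items():
    print(f"entry {index}: shown as {shown}, actually {real}")
```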

Finally, let us take a look at gdb. Gdb has the same problem as IDA and readelf: it uses the manipulated sections to get additional information about the binary. Let us take a look at the "trapdoor" function in this binary:


sqall@towel:~/Desktop/hacklu$ gdb ./theunion
[...]
gdb-peda$ x/20i 0x8049208
   0x8049208:   push   ebp
   0x8049209:   mov    ebp,esp
   0x804920b:   sub    esp,0x18
   0x804920e:   mov    DWORD PTR [esp],0x8049b08
   0x8049215:   call   0x80485d0 <puts@plt>
   0x804921a:   mov    eax,ds:0x804b080
   0x804921f:   mov    DWORD PTR [esp+0x4],eax
   0x8049223:   mov    DWORD PTR [esp],0x8049b28
   0x804922a:   call   0x8048560 <fputs@plt>
   0x804922f:   cmp    DWORD PTR [ebp+0x8],0x0
   0x8049233:   je     0x8049243
   0x8049235:   mov    DWORD PTR [esp],0x8049b68
   0x804923c:   call   0x80485e0 <printf@plt>
   0x8049241:   jmp    0x804924f
   0x8049243:   mov    DWORD PTR [esp],0x8049b88
   0x804924a:   call   0x80485d0 <puts@plt>
   0x804924f:   mov    eax,ds:0x804b080
   0x8049254:   mov    DWORD PTR [esp],eax
   0x8049257:   call   0x8048570 <fflush@plt>
   0x804925c:   leave
 


We can see the valid looking call to "fputs" at 0x804922a and "printf" at 0x804923c. So let us take a look at the unmodified version of the binary:


sqall@towel:~/Desktop/hacklu$ gdb ./theunion_unmodified
[...]
gdb-peda$ x/20i 0x8049208
   0x8049208 <trapDoor>:        push   ebp
   0x8049209 <trapDoor+1>:      mov    ebp,esp
   0x804920b <trapDoor+3>:      sub    esp,0x18
   0x804920e <trapDoor+6>:      mov    DWORD PTR [esp],0x8049b08
   0x8049215 <trapDoor+13>:     call   0x80485d0 <puts@plt>
   0x804921a <trapDoor+18>:     mov    eax,ds:0x804b080
   0x804921f <trapDoor+23>:     mov    DWORD PTR [esp+0x4],eax
   0x8049223 <trapDoor+27>:     mov    DWORD PTR [esp],0x8049b28
   0x804922a <trapDoor+34>:     call   0x8048560 <printf@plt>
   0x804922f <trapDoor+39>:     cmp    DWORD PTR [ebp+0x8],0x0
   0x8049233 <trapDoor+43>:     je     0x8049243 <trapDoor+59>
   0x8049235 <trapDoor+45>:     mov    DWORD PTR [esp],0x8049b68
   0x804923c <trapDoor+52>:     call   0x80485e0 <system@plt>
   0x8049241 <trapDoor+57>:     jmp    0x804924f <trapDoor+71>
   0x8049243 <trapDoor+59>:     mov    DWORD PTR [esp],0x8049b88
   0x804924a <trapDoor+66>:     call   0x80485d0 <puts@plt>
   0x804924f <trapDoor+71>:     mov    eax,ds:0x804b080
   0x8049254 <trapDoor+76>:     mov    DWORD PTR [esp],eax
   0x8049257 <trapDoor+79>:     call   0x8048570 <fflush@plt>
   0x804925c <trapDoor+84>:     leave
 


I also left the symbols in the unmodified version intact. We see that the call at 0x804922a is now to "printf" instead of "fputs" and the call at 0x804923c is now to "system". But what happens if you "step in" these functions? Let us try with the call at 0x804922a.


sqall@towel:~/Desktop/hacklu$ gdb ./theunion
[...]
gdb-peda$ b *0x804922a
Breakpoint 1 at 0x804922a
gdb-peda$ r
[...]
   0x804921a:   mov    eax,ds:0x804b080
   0x804921f:   mov    DWORD PTR [esp+0x4],eax
   0x8049223:   mov    DWORD PTR [esp],0x8049b28
=> 0x804922a:   call   0x8048560 <fputs@plt>
   0x804922f:   cmp    DWORD PTR [ebp+0x8],0x0
   0x8049233:   je     0x8049243
   0x8049235:   mov    DWORD PTR [esp],0x8049b68
   0x804923c:   call   0x80485e0 <printf@plt>
gdb-peda$ b printf
Breakpoint 2 at 0xf7e541f0
gdb-peda$ b fputs
Breakpoint 3 at 0xf7e6b930
gdb-peda$ c
[...]
=> 0xf7e541f0 <printf>: push   ebx
   0xf7e541f1 <printf+1>:       sub    esp,0x18
   0xf7e541f4 <printf+4>:       call   0xf7f2e22b
   0xf7e541f9 <printf+9>:       add    ebx,0x15ee07
   0xf7e541ff <printf+15>:      lea    eax,[esp+0x24]
 


We can see that we do not call "fputs"; we end up in "printf". But how does this actually work?


The duplicated dynamic string table

It works quite simply. Take a look at the section header table of the unmodified binary:

sqall@towel:~/Desktop/hacklu$ readelf -S theunion_unmodified
There are 30 section headers, starting at offset 0x218c:


Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .interp           PROGBITS        08048154 000154 000013 00   A  0   0  1
  [ 2] .note.ABI-tag     NOTE            08048168 000168 000020 00   A  0   0  4
  [ 3] .note.gnu.build-i NOTE            08048188 000188 000024 00   A  0   0  4
  [ 4] .gnu.hash         GNU_HASH        080481ac 0001ac 00002c 04   A  5   0  4
  [ 5] .dynsym           DYNSYM          080481d8 0001d8 000150 10   A  6   1  4
  [ 6] .dynstr           STRTAB          08048328 000328 0000e3 00   A  0   0  1
  [ 7] .gnu.version      VERSYM          0804840c 00040c 00002a 02   A  5   0  2
[...]
 


You see the ".dynstr" section. This section holds the dynamic string table, which is just a sequence of C strings lined up one after another. Each dynamic symbol has a field called "st_name" whose value is an offset into this string table. So when you want the name of a symbol, you go to the start of the dynamic string table and add the "st_name" value to its offset/address; the C string found there is the name of the symbol. But the loader does not determine the offset/address of the dynamic string table (or .dynstr section) through the section header table. It gets the address through the dynamic segment. Let us take a look at the dynamic segment of the modified binary:


sqall@towel:~/Desktop/hacklu$ readelf -d theunion

Dynamic section at offset 0x2f14 contains 24 entries:
  Tag        Type                         Name/Value
 0x00000001 (NEEDED)                     Shared library: [libc.so.6]
 0x0000000c (INIT)                       0x8048528
 0x0000000d (FINI)                       0x80497a4
 0x00000019 (INIT_ARRAY)                 0x804af08
 0x0000001b (INIT_ARRAYSZ)               4 (bytes)
 0x0000001a (FINI_ARRAY)                 0x804af0c
 0x0000001c (FINI_ARRAYSZ)               4 (bytes)
 0x6ffffef5 (GNU_HASH)                   0x80481ac
 0x00000005 (STRTAB)                     0x8048328
 0x00000006 (SYMTAB)                     0x80481d8
 [...]
 


You can see that the "STRTAB" entry has the same address as the ".dynstr" entry in the section header table of the unmodified binary. This value is what the loader needs to execute the binary correctly. Now take a look at the section header table of the modified binary:

sqall@towel:~/Desktop/hacklu$ readelf -S theunion
There are 28 section headers, starting at offset 0x317c:


Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .interp           PROGBITS        08048154 000154 000013 00   A  0   0  1
  [ 2] .note.ABI-tag     NOTE            08048168 000168 000020 00   A  0   0  4
  [ 3] .note.gnu.build-i NOTE            08048188 000188 000024 00   A  0   0  4
  [ 4] .gnu.hash         GNU_HASH        080481ac 0001ac 00002c 04   A  5   0  4
  [ 5] .dynsym           DYNSYM          080481d8 0001d8 000150 10   A  6   1  4
  [ 6] .dynstr           STRTAB          08048328 0020fd 0000e3 00   A  0   0  1
  [ 7] .gnu.version      VERSYM          0804840c 00040c 00002a 02   A  5   0  2
[...]
 


We see that the address is exactly the same, but the offset of the ".dynstr" section is modified. This is the whole trick. We copied the dynamic string table to the end of the binary (or somewhere in between, as we like) and overwrote the offset of the ".dynstr" entry in the section header table with the offset of our new dynamic string table. In this copy we can now modify the names of the symbols, and the analysis tools will use these strings to retrieve additional information about the symbols.
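The copy-and-redirect step can be sketched in a few lines of Python. This is not the ZwoELF script, just an illustration under stated assumptions: a 32-bit little-endian ELF image, the first allocated string table section is taken to be ".dynstr", and the function names find_dynstr and redirect_dynstr are made up.

```python
import struct

EHDR32 = struct.Struct("<16sHHIIIIIHHHHHH")  # ELF32 file header
SHDR32 = struct.Struct("<10I")               # ELF32 section header
SHT_STRTAB = 3    # section type: string table
SHF_ALLOC = 0x2   # section flag: occupies memory at run time


def find_dynstr(image):
    """Return (position of the section header, sh_offset, sh_size).

    Assumption: the first string table with SHF_ALLOC set is .dynstr
    (.strtab and .shstrtab are normally not allocated).
    """
    ehdr = EHDR32.unpack_from(image, 0)
    e_shoff, e_shentsize, e_shnum = ehdr[6], ehdr[11], ehdr[12]
    for i in range(e_shnum):
        pos = e_shoff + i * e_shentsize
        shdr = SHDR32.unpack_from(image, pos)
        if shdr[1] == SHT_STRTAB and shdr[2] & SHF_ALLOC:
            return pos, shdr[4], shdr[5]
    raise ValueError("no allocated string table section found")


def redirect_dynstr(image, doctored):
    """Append a doctored string table and point .dynstr's sh_offset at it."""
    pos, _, sh_size = find_dynstr(image)
    if len(doctored) != sh_size:
        raise ValueError("doctored table must keep the original size")
    out = bytearray(image)
    new_offset = len(out)
    out += doctored
    # sh_offset is the fifth 32-bit field of the section header (byte 16).
    # sh_addr and the original table bytes stay exactly as they were, so
    # the loader (which goes through the dynamic segment) sees no change.
    struct.pack_into("<I", out, pos + 16, new_offset)
    return bytes(out)
```

A caller would read the original table with `image[off:off + size]`, swap names in it (for example `table.replace(b"system\0", b"printf\0")`), and pass the result as `doctored`.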

This technique has a limitation though. As I explained above, the "st_name" field of a dynamic symbol contains the offset to the C string. We cannot modify this offset value. The string has to be at exactly the same offset and must end with a null byte. This means we can only replace a symbol name with a name of the same or a smaller length. A longer name would run into the next symbol name. We also have a problem when two or more symbols have names that share an ending, for example when we want to manipulate the name of "printf" and the binary also uses the "sprintf" symbol. To save space in the string table, the linker stores only "sprintf" and lets the "st_name" field of "printf" point one byte into that string. When we now change "printf" to "flux", we also change the name of "sprintf" to "sflux". So keep this in mind if you use this trick.
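Both limitations are easy to reproduce on a toy string table. The following Python sketch uses made-up helper names (name_at, rename_in_strtab) and a hand-written table in which "printf" shares its bytes with "sprintf", as described above:

```python
def name_at(strtab: bytes, st_name: int) -> str:
    """Read the NUL-terminated name that starts at offset st_name."""
    end = strtab.index(b"\0", st_name)
    return strtab[st_name:end].decode("ascii")


def rename_in_strtab(strtab: bytes, st_name: int, new_name: bytes) -> bytes:
    """Overwrite the name at st_name without moving any other string."""
    end = strtab.index(b"\0", st_name)
    old_len = end - st_name
    if len(new_name) > old_len:
        raise ValueError("replacement would run into the next symbol name")
    patched = bytearray(strtab)
    # Pad with NUL bytes so every later offset stays where it was.
    patched[st_name:end] = new_name.ljust(old_len, b"\0")
    return bytes(patched)


# Toy table: "sprintf" starts at offset 1, "printf" reuses its tail at
# offset 2, "puts" starts at offset 9.
strtab = b"\0sprintf\0puts\0"
patched = rename_in_strtab(strtab, 2, b"flux")

print(name_at(patched, 2))  # the renamed "printf"
print(name_at(patched, 1))  # "sprintf" changes along with it
print(name_at(patched, 9))  # unrelated names keep their offsets
```

Trying `rename_in_strtab(strtab, 9, b"fwrite")` raises the ValueError, because "fwrite" does not fit into the four bytes of "puts".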


What can we do against it?

Well, to spot it you just have to check whether the ".dynstr" section and the "STRTAB" entry in the dynamic segment point to the same place. But comparing the addresses is not enough (in the example above they are identical): you have to calculate the file offset from the "STRTAB" address and compare it to the ".dynstr" offset. If these offsets are the same, you are good to go. Nevertheless, it is a nice trick to throw analysts off (without any performance impact whatsoever) and fool analysis tools. One can argue that it is too easy to spot and every trained analyst can find it in no time. But at the hack.lu 2014 CTF it was hard enough that only 10 teams solved the challenge. Automated approaches like "renaming functions after their functionality (here under windows)" would definitely be fooled as long as they use everything from the section table (which analysis tools usually do).
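The check described above could look roughly like this in Python. It is a sketch under assumptions, not a finished detector: 32-bit little-endian ELF only, the first allocated string table section is taken to be ".dynstr", and all function names are made up for illustration.

```python
import struct

EHDR32 = struct.Struct("<16sHHIIIIIHHHHHH")  # ELF32 file header
PHDR32 = struct.Struct("<8I")                # ELF32 program header
SHDR32 = struct.Struct("<10I")               # ELF32 section header
DYN32 = struct.Struct("<iI")                 # dynamic entry: d_tag, d_val
PT_LOAD, PT_DYNAMIC = 1, 2
DT_NULL, DT_STRTAB = 0, 5
SHT_STRTAB, SHF_ALLOC = 3, 0x2


def program_headers(image):
    ehdr = EHDR32.unpack_from(image, 0)
    e_phoff, e_phentsize, e_phnum = ehdr[5], ehdr[9], ehdr[10]
    return [PHDR32.unpack_from(image, e_phoff + i * e_phentsize)
            for i in range(e_phnum)]


def vaddr_to_offset(phdrs, vaddr):
    """Translate a virtual address to a file offset via the PT_LOAD segments."""
    for p_type, p_offset, p_vaddr, _paddr, p_filesz, *_rest in phdrs:
        if p_type == PT_LOAD and p_vaddr <= vaddr < p_vaddr + p_filesz:
            return vaddr - p_vaddr + p_offset
    raise ValueError("address is not backed by any loadable segment")


def strtab_offset_from_dynamic(image):
    """File offset of the string table the loader will actually use."""
    phdrs = program_headers(image)
    dynamic = next((p for p in phdrs if p[0] == PT_DYNAMIC), None)
    if dynamic is None:
        raise ValueError("no dynamic segment")
    pos, end = dynamic[1], dynamic[1] + dynamic[4]
    while pos < end:
        d_tag, d_val = DYN32.unpack_from(image, pos)
        if d_tag == DT_NULL:
            break
        if d_tag == DT_STRTAB:
            return vaddr_to_offset(phdrs, d_val)
        pos += DYN32.size
    raise ValueError("no DT_STRTAB entry")


def dynstr_offset_from_sections(image):
    """sh_offset of the first allocated string table (assumed .dynstr)."""
    ehdr = EHDR32.unpack_from(image, 0)
    e_shoff, e_shentsize, e_shnum = ehdr[6], ehdr[11], ehdr[12]
    for i in range(e_shnum):
        shdr = SHDR32.unpack_from(image, e_shoff + i * e_shentsize)
        if shdr[1] == SHT_STRTAB and shdr[2] & SHF_ALLOC:
            return shdr[4]
    return None  # no section header table or no such section


def dynstr_is_redirected(image):
    """True if tools reading sections see a different table than the loader."""
    from_sections = dynstr_offset_from_sections(image)
    if from_sections is None:
        return False
    return from_sections != strtab_offset_from_dynamic(image)
```

For the challenge binary, the section view points at file offset 0x20fd, while the "STRTAB" address 0x8048328 should resolve to 0x328 (assuming the usual first loadable segment at 0x8048000 with file offset 0, which matches the ".dynstr" offset of the unmodified binary).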

For IDA, the script to restore the external symbol calls which I linked above can also be used to get the symbol names automatically. At the moment I am trying to restore the complete section table from the information I can gather from the binary. The idea behind it is that you cannot correct all analysis tools, so why not correct the section table for all these analysis tools? But it is just a project I do in my free time (with a lot of other projects). So do not expect any results too soon.

I added a script that can do this manipulation to the ZwoELF examples. So if you want to test it yourself, feel free to use it.
