Last week, we identified a bug in Qt with Olivier’s new signal-slot syntax. Upon further investigation, it turned out not to be a Qt issue, but an ABI one. That prompted me to investigate further and to conclude that dynamic libraries need a big overhaul on Linux.
tl;dr (a.k.a. Executive Summary)
Shared libraries on Linux are linked with -fPIC, which makes all variable references and function calls indirect, unless they are static. That’s because, in addition to making the code position-independent, it makes every variable and function interposable by another module: each can be overridden by the executable and by LD_PRELOAD’ed libraries. The indirect accesses have a performance cost, and we should do away with them without sacrificing position-independence.
Plus, there are a few more actions we should take (like prelinking) to improve performance even further.
Details
Note: in the following, I will show x86-64 assembly and will restrict myself to that architecture. However, the problems and solutions apply to many other architectures as well, like x86 and ARM, so they should concern you even if you don’t care about x86-64. About the only platform this mostly does not apply to is IA-64.
The basics
Imagine the following C file, which also compiles in C++ mode:
extern void *externalVariable;
extern void externalFunction(void);

void myFunction()
{
    externalFunction();
    externalVariable = &externalFunction;
}
The code above demonstrates three features of the languages in one function: it loads the address of a function, it calls a function and it writes to a variable. The compiler does not know where the function and variable are: they might be in another .o file linked into this ELF module or they might be in another ELF module (i.e., a library) this module links to.
The compiler produces the following assembly output (gcc 4.6.0, -O3):
    call    externalFunction
    movq    $externalFunction, externalVariable(%rip)
This assembly snippet makes use of two symbols whose values the assembler does not know, so the assembled .o contains three relocations. GCC has produced the most efficient and most compact compilation of the code I wrote.
When we link this .o into an executable, we start to see the drawbacks. The first is that both instructions need to encode, in their bits, the values of the symbols we didn’t know. So the linker must somehow fix this. It fixes the call instruction by making it call a stub or a trampoline, which jumps to the actual address. This stub is placed in a separate section of code called the Procedure Linkage Table (PLT). The contents of the PLT stub are not that important; suffice it to say that it performs an indirect jump.
The movq instruction cannot be fixed. There’s simply no way, because it writes a constant value to a constant location, directly. Even if there were an instruction, or a pair of instructions, wide enough to write any 64-bit value to any address in the 64-bit space, we’d still have a problem: those values are not known at link time. So instead of fixing the instruction, the linker “fixes” the values. For the address of externalFunction, it uses the address of the PLT stub it created in the previous paragraph. For the externalVariable variable, it creates a copy relocation, which means the dynamic linker will need to find the variable where it is, copy its value to a fixed location in the executable and then tell everyone that the variable is actually in the executable.
What are the consequences of this? For the PLT call, it’s a small performance impact that could not be avoided. Since the address of the actual externalFunction function is not known at compile- and link-time, and we don’t want to leave a text relocation, the only way to place that call is to find the address at run-time and call it indirectly.
For the copy relocation, the consequences for the executable are small. The code it will execute is still the most efficient and most compact. The dynamic linker will have to find where the symbol actually is at load-time, which is something that it would have to do anyway, plus copy its contents, checking that the size hasn’t changed. This is done only once, then the code runs in its most efficient form.
The fact that we resolved &externalFunction to the address of the PLT stub means that any use of that function pointer (an indirect call) will end up in a function that does an indirect call too. That is, it’s a doubly-indirect call. I seriously doubt any processor can do proper branch prediction, speculative execution, and prefetching of code under those circumstances.
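To make that concrete, here is a minimal sketch in C of how the doubly-indirect call comes about; externalFunction is the symbol from the example above, while callback and invoke are hypothetical names added only for illustration.

extern void externalFunction(void);

/* In a position-dependent executable, the static linker resolves this
   initialiser to the address of the PLT stub, not the real function. */
static void (*callback)(void) = &externalFunction;

void invoke(void)
{
    callback();   /* indirect call through the pointer, which lands on the PLT
                     stub, which in turn performs another indirect jump */
}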
It gets worse
So far we’ve analysed what happens in an executable. Now let’s see what happens when we try to build the same C code for a shared library. We do that by introducing the -fPIC compiler option, which tells the compiler to generate position-independent code. The compiler produces the following assembly output:
    call    externalFunction@PLT
    movq    externalFunction@GOTPCREL(%rip), %rdx
    movq    externalVariable@GOTPCREL(%rip), %rax
    movq    %rdx, (%rax)
When assembled, the .o still contains three relocations, albeit of different type.
When we compare the output of the position-dependent and the position-independent code, we notice the following:
- The call is still a call, but now we’re explicitly calling the PLT stub. This might seem irrelevant, since the linker would have fixed the call anyway to point to the PLT if it had to, but it isn’t.
- The single movq instruction was split into three. This is required by the x86-64 architecture, since the instruction set cannot encode both a 64-bit value and the 64-bit address to store it at in the same instruction (such an instruction would be at least 17 bytes long, which is two bytes longer than the maximum instruction length).
- The values for the two symbols are loaded indirectly. Instead of encoding the two values in those two middle movq instructions, the compiler is loading the values from another linker-generated structure called the Global Offset Table (GOT).
The compiler needed to generate the code above since it doesn’t know where the symbols will actually be. As was the case before, those symbols can be linked into the same ELF module as this compilation unit, or they may be found elsewhere in another ELF module this one links to.
Moreover, the compiler and linker need to deal with the possibility that an executable might have done exactly what our executable in the previous section did: create a copy relocation on the variable and fix the address of the function to its own PLT stub. In order to work properly, this code must deal with the fact that its own variable might have ended up elsewhere, and that &externalFunction might have a different value.
That means the indirect call through the PLT and the three movq instructions remain, even if those two symbols were in the same compilation unit!
The problem is that even if at first glance you’d think that the compiler should know for a fact where those symbols are, it actually doesn’t. The -fPIC option doesn’t enable only position-independent code. It also enables ELF symbol interposition, which is when another module “steals” the symbol. That happens normally by way of the copy relocations, but can also happen if an LD_PRELOAD’ed module were to override those symbols. So the compiler and linker must produce code that deals with that possibility.
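For illustration, here is a minimal sketch of what such interposition looks like in practice; the file name and build line are hypothetical, but the mechanism is the standard LD_PRELOAD one.

/* interpose.c -- built with something like:
 *    gcc -shared -fPIC interpose.c -o libinterpose.so
 * and activated with:
 *    LD_PRELOAD=./libinterpose.so ./myapplication
 * Because LD_PRELOAD'ed modules come first in the dynamic linker's search
 * order, every reference to the default-visibility symbol externalFunction,
 * in the executable and in all libraries, now resolves to this definition. */
#include <stdio.h>

void externalFunction(void)
{
    puts("externalFunction was interposed");
}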
In the end, we’re left with indirect calls, indirect symbol address loadings and indirect variable references, which impact code performance. In addition, the linker must leave behind relocations by name for the dynamic linker to resolve at load-time.
All this for the possibility of interposition?
Yes, it seems so. The impact is there for this little-known and little-used feature. Instead of optimising for the common-case scenario where the symbols are not overridden, the ABI optimises for the corner case.
Another argument is that the ABI optimises for executable code, placing the impact on the libraries. The argument is valid if the executables are much larger and more complex than the libraries themselves. It’s valid too if we consider that application developers write sloppy code, whereas library developers will write very optimised code.
I don’t think that argument holds anymore. Libraries have become much more complex in the past 10-15 years and do a lot more than they once did. They are not mere wrappers around system calls, like libc 4 and 5 were on Linux in the late 90s. Moreover, if we consider the rise of interpreted languages, like Perl, Python, Ruby, even QML and JavaScript, the code belonging to the ELF executables is negligible. Compare the size of the executables with the libraries that actually do the interpretation:
-rwxr-xr-x. 2 root root   13544 Aug  5 06:27 /usr/bin/perl
-rwxr-xr-x. 2 root root    9144 Apr 12  2011 /usr/bin/python
-rwxr-xr-x. 1 root root    5160 Dec 29 13:46 /usr/bin/ruby
-r-xr-xr-x. 1 root root 1763488 Apr 12  2011 /usr/lib64/libpython2.7.so.1.0
-rwxr-xr-x. 1 root root  947736 Dec 29 13:46 /usr/lib64/libruby.so.1.8.7
-rwxr-xr-x. 1 root root 1524064 Aug  5 06:27 /usr/lib64/perl5/CORE/libperl.so
That’s valid even for interpreters that JIT the code. However optimised the code they generate may be, the current understanding is that performance-critical operations are implemented in native code, which means libraries or plugins.
Existing solutions
Partial solution for private symbols
When developing your library, if you know that certain symbols are private and will never be used by any other library, you have an option: you can declare their ELF visibility to be “hidden”, which has two consequences. The obvious one is that the linker will not add the hidden symbols to the dynamic symbol table, so other ELF modules simply cannot find them. If they can’t find them, they can’t steal them. And if they can’t steal them, the linker does not need to produce a PLT stub for the function call, so the call instruction will be linked into a simple, direct call, just as it was in the executable of the first part.
The other consequence is an optimisation that the compiler does. Since it also knows that the externalVariable variable cannot be stolen, it does not need to address the variable indirectly. The generated assembly becomes:
    call    externalFunction@PLT
    movq    externalFunction@GOTPCREL(%rip), %rax
    movq    %rax, externalVariable(%rip)
The .o file will still contain three relocations. However, note how the address of the externalFunction function is still obtained indirectly, even though the compiler knows it cannot be interposed. That means the linker will still generate a load-time relocation for the dynamic linker to get the address of that function. Fortunately, it’s a simpler relocation, since the symbol name itself is not present.
If there’s a reason for getting the address indirectly like this, I have yet to find it.
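As a minimal sketch, assuming both symbols really are private to the library, the declarations from the original example could be annotated like this:

extern void *externalVariable __attribute__((visibility("hidden")));
extern void externalFunction(void) __attribute__((visibility("hidden")));

void myFunction()
{
    externalFunction();                   /* direct call, no PLT stub */
    externalVariable = &externalFunction; /* direct store; the function's
                                             address is still fetched via the
                                             GOT, as noted above */
}

In practice, many libraries instead build with -fvisibility=hidden so that everything is hidden by default, and mark only their public API with an export macro, rather than annotating every private symbol by hand.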
Partial solution for public non-interposable symbols
If your symbols are public, however, you cannot use the ELF “hidden” visibility trick. But if you know that they cannot and will not ever be stolen or interposed, you have another possibility, which is to tell that to the compiler and linker.
If you declare a variable with ELF “protected” visibility, you’re telling the compiler and linker that it cannot be stolen, yet can be placed in the dynamic symbol table for other ELF modules to reference. You just have to be absolutely sure that they will not ever be interposed, because that will create subtle bugs that are hard to track down. That includes access to those symbols by position-dependent executable code, like we did in the first section.
The GCC syntax __attribute__((visibility("protected"))) works on ELF platforms only, whereas the one with the “hidden” keyword is known to work on non-ELF platforms too, like Mac OS X (Mach-O) and IBM AIX (XCOFF).
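Here is a small sketch of how a library header might guard against that portability issue; the MYLIB_PROTECTED macro name is hypothetical:

#if defined(__ELF__)
#  define MYLIB_PROTECTED __attribute__((visibility("protected")))
#else
#  define MYLIB_PROTECTED __attribute__((visibility("default")))
#endif

/* Exported to other ELF modules but, on ELF, guaranteed not interposable. */
MYLIB_PROTECTED void *externalVariable;
MYLIB_PROTECTED void externalFunction(void);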
Another way to do the same is to use one of two linker options: -Bsymbolic and -Bsymbolic-functions. They do basically the same as the protected visibility: they keep the symbols in the dynamic symbol table, but they make the linker use the symbol inside the library unconditionally. The difference between those two options is that the former applies to all symbols, whereas the latter applies to functions only.
The reason why -Bsymbolic-functions exists requires looking back at the executable code from the first section. While the variable reference required a copy relocation, the function call was done indirectly, through the PLT stub. A variable can be moved, but moving code isn’t possible, so the executable code needs to deal with the code being elsewhere anyway. For that reason, it’s possible to symbolically bind function calls inside a library without affecting executables.
Or so we thought. The problem we discovered last week deals with the situation where you treat a function as data: taking its address. As we saw in the first part, the linker will resolve the address of the function to the address of the PLT stub found in the executable. But if you symbolically bind the function in the library, there it will resolve to the real address. If you compare the two addresses, they won’t be the same.
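A minimal sketch of the mismatch, assuming the library is linked with -Bsymbolic-functions; isExternalFunction is a hypothetical helper added only for illustration:

/* In the library: */
void externalFunction(void) { }

int isExternalFunction(void (*f)(void))
{
    /* Bound symbolically, so this compares against the real entry point. */
    return f == &externalFunction;
}

/* In a position-dependent executable, the call
 *     isExternalFunction(&externalFunction)
 * passes the address of the executable's own PLT stub, so the comparison
 * unexpectedly fails. */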
Proposed solutions
Some of the solutions I propose are ABI- and binary-compatible with existing builds; others are ABI-incompatible and would require recompilation. Unfortunately, the best solution would require source-incompatible changes. Still, all the changes below give libraries a bit of optimisation at the cost of making executables slightly less optimised.
Use of PLT in function calls should rest only with the linker
As we saw in the code generated for the library, with -fPIC, the compiler decided to make the call indirectly by adding “@PLT” to the symbol name. Turns out that the linker doesn’t really care about this and will generate (or not) the PLT stub if needed. If that’s the case, the compiler should not make a judgement call about where the symbol is located just because of -fPIC.
Function addresses should always be resolved through the GOT
Function calls already require a pointer-sized variable somewhere and a relocation to make it point to the valid entry point of the function being called. What’s more, taking addresses of functions is a somewhat rare operation, compared to the number of function calls across ELF modules.
That being the case, we can take a small “hit” in performance: the loading of a function address should happen via the GOT in position-dependent code (executables), just as it is done for position-independent code.
The benefit of doing this is that the function address we load will point exactly to the function’s real entry point, instead of the PLT stub. When we call this function, we avoid the doubly-indirect branching we found earlier.
PLT stubs should use the regular GOT entry, if one exists
If a given function is both called and has its address taken, the PLT stub should reference the GOT entry that was used for taking the address. The reason why it isn’t already so, I guess, is that the entries in the .got.plt section aren’t initialised with the target function’s address, but with the address of the local module’s function resolver. This trick allows for the “lazy resolution” of functions: they are resolved only the first time they are called.
I wouldn’t ask for all functions to be resolved at load-time, but if the address of the function is taken anyway, the dynamic linker will need to resolve it at load time. So why waste CPU cycles in a function call if the address was computed already?
Copy relocations should be deprecated
Instead of copying the variable from the library into the executable, executables should use indirect addressing for reading variables and writing to them, as well as taking their addresses. One benefit of doing this is avoiding the actual copying. For example, for read-only variables, they may remain in read-only pages of memory, instead of being copied to read-write pages found in the executable.
The big drawback of this is that the indirect addressing is a lot more expensive, since it requires two memory references, not just one. The next suggestion might help alleviate the problem.
The linker should relax instructions used for loading variable addresses
This is a suggestion found in the IA-64 ABI: the compiler generates the instructions needed to load the address of the variable from the GOT, then use it as it needs to. If the linker concludes (by whichever means, like protected or hidden symbols, the use of one of the symbolic options, or because this is an ELF application and the symbol is defined in it) that the symbol must reside in the current ELF module, it can change the load instruction into a register-to-register move or similar.
For our x86-64 64-bit case, the instructions the compiler generated were:
    movq    externalVariable@GOTPCREL(%rip), %rax
    movq    %rdx, (%rax)
By changing one bit in the opcode of the first instruction, with no code size change, we can produce:
    leaq    externalVariable(%rip), %rax
    movq    %rdx, (%rax)
The x86 instruction “LEA” means “Load Effective Address”. Instead of loading 64 bits from a memory address and storing them in the register, that instruction places the computed address itself in the register; the linker also resolves the relocation so that this address is that of externalVariable, not of its GOT entry. This isn’t as optimised as the original code found in the executable, for two reasons: it requires two instructions instead of just one and it requires an additional register.
It’s possible to generate even more efficient code if the assembler leaves a 32-bit immediate offset in the second movq instruction, making it 6 bytes long. This extra immediate would have no impact on the original code, besides making it longer, but it would allow the linker to optimise the code further:
The original would be:
    movq     externalVariable@GOTPCREL(%rip), %rax
    movq.d32 %rdx, 0x0(%rax)
And it would get relaxed to:
    nopl.d32 0x0(%rax)
    movq     %rdx, externalVariable(%rip)
That is, the first 6-byte instruction is resolved to a 6-byte NOP, whereas the second 6-byte instruction executes the actual store, with no extra register use. The compiler cannot know that the register will be left untouched, but at least there is no dependency between the two instructions that might cause a CPU stall.
The same applies to other architectures too. The full -fPIC code on ARM to store a value from a register into a variable is the following:
    ldr     r3, .L2+8       @ points to a constant whose value is: externalVariable(GOT)
.LPIC1:
    ldr     r3, [r4, r3]    @ r4 contains the base address of the GOT
    str     r2, [r3, #0]
If the linker can conclude the symbol must be in the current ELF module and cannot change, it may be able to avoid the extra load (the middle instruction) by changing the code to be:
    ldr     r3, .L2+8       @ points to a constant whose value is: externalVariable-(.LPIC1+8)
.LPIC1:
    add     r3, pc, r3
    str     r2, [r3, #0]
Unlike x86, the ARM instructions cannot be optimised further, since the immediates encodable in the instructions have limited range.
The linker should relax instructions used for loading function addresses
Similar to the above, but instead looking at function addresses. The original library code is:
movq externalFunction@GOTPCREL(%rip), %rdx
But it can be relaxed to:
leaq externalFunction(%rip), %rdx
With ARM, the original code is:
    ldr     r3, .L2+8       @ points to a constant of value: externalFunction(GOT)
    ldr     r2, [r4, r3]    @ r4 contains the address of the base of the GOT
But relaxed, it would be:
    ldr     r2, .L2+8       @ points to a constant of value: externalFunction-(.LPIC0+8)
.LPIC0:
    add     r2, pc, r2
There should be a way to tell the compiler where the symbol is
We’re already able to tell the compiler that a symbol is in the current module, with the hidden visibility attribute. We should also be able to tell the compiler that the symbol is in the current module but exported, as well as that the symbol is in another module.
I would suggest simply using the existing ELF markers and being explicit about them:
- __attribute__((visibility("hidden"))): symbol is in this ELF module and is not exported (equivalent on Windows: no decoration);
- __attribute__((visibility("protected"))): symbol is in this ELF module and is exported (equivalent on Windows: __declspec(dllexport));
- __attribute__((visibility("default"))): symbol is in another ELF module (equivalent on Windows: __declspec(dllimport)); this also applies to symbols that must be overridable according to the library’s API (like C++’s global operator new).
Considering the other suggestions, we know the references to symbols with “default” visibility can be relaxed into simpler and more efficient code in the presence of one of the symbolic binding options. That means we can use the “default” visibility for cases of uncertain symbols.
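As a sketch of how a library header could encode the three cases, following the Windows dllexport/dllimport convention; the MYLIB_* macro names are hypothetical:

#if defined(_WIN32)
#  define MYLIB_PRIVATE
#  define MYLIB_EXPORT  __declspec(dllexport)
#  define MYLIB_IMPORT  __declspec(dllimport)
#else
#  define MYLIB_PRIVATE __attribute__((visibility("hidden")))
#  define MYLIB_EXPORT  __attribute__((visibility("protected")))
#  define MYLIB_IMPORT  __attribute__((visibility("default")))
#endif

/* When compiling the library itself, its public API is declared with
   MYLIB_EXPORT; users of the library see the same declarations with
   MYLIB_IMPORT, and everything else is MYLIB_PRIVATE. */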
Getting there
Some of the solutions I listed are already possible and they should be used immediately in all libraries. That is especially true about the use of the hidden visibility: all libraries, without exception, should make use of this feature. In fact, since this option was introduced in GCC 4.0 seven years ago, many libraries have started using it and are now “good citizens”, for they access their own private data most efficiently, they don’t have huge symbol tables (which impact lookup speed) and they don’t pollute the global namespace with unnecessary symbols.
Other solutions are not yet possible to implement. The solution I personally feel is most important to implement first is that for ELF executables: they need to stop using copy relocations and they should resolve addresses of functions via the GOT. Only once that is done can libraries start using the “protected” visibility and generate improved code. This implies changing the psABI for the affected platforms, which may not be an easy transition.
An alternative to using the “protected” visibility is to use the symbolic binding options. The code relaxation optimisations would come in handy at this point to optimise at link-time the code that the compiler could not make a decision on. Unfortunately, those options apply to all symbols in a library, so libraries that must have overridable symbols need to use an extra option (--dynamic-list) and list each symbol one by one.
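For illustration, a sketch of what that might look like, with hypothetical file and symbol names: the dynamic list is a plain text file in the linker’s version-script-like syntax, naming the symbols that must remain interposable, and it is passed alongside the symbolic binding option (e.g. -Wl,-Bsymbolic-functions -Wl,--dynamic-list=overridable.list).

{
  my_overridable_hook;
  my_allocation_function;
};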
Using -fPIE
The compiler option -fPIE tells the compiler to generate position-independent code for executables. It is similar to the -fPIC option in that it generates position-independent code, but it has the added optimisation that the compiler can assume none of its symbols can be interposed.
With executables compiled with this option, copy relocations and direct loading of function addresses aren’t used. This solves the problem we had. Therefore, compiling executables with this option allows us to start using some of the optimisations I described before.
Unfortunately, as its description says, this option also generates position-independent code, which can be less efficient than position-dependent code in some situations. My preference would be to have position-dependent executables without the copy relocations. However, there’s an added side-effect of this option: it defines the __PIC__ macro, whose absence can be used to abort compilations of code using libraries that have transitioned to the more efficient options.
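A sketch of how a transitioned library’s header could exploit that, relying on the fact that both -fPIC and -fPIE define __PIC__:

#if !defined(__PIC__)
#  error "This library's clients must be compiled with -fPIC or -fPIE"
#endif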
Further work and further reading
I highly recommend Ulrich Drepper’s “How to Write Shared Libraries” paper. His recommendations did not go as far as suggesting changing the ABI as I have, but he makes many that library developers should adhere to, regardless of whether my recommendations are accepted or not. For example, using static functions and data where possible and avoiding arrays of pointers are recommendations I have made to many people.
Other work necessary is to improve prelinking support. Shared libraries are position-independent, but they can be prelinked to a preferred location in memory. One optimisation I have yet to see done is to use the read-only pages of prelinked data when the library is loaded at that preferred address (the .data.rel.ro sections).