Why is std::function slower than a function pointer?
Introduction
The other day I came across some code that passes a function to std::sort. Though sort will let you call it with a std::function, there was an admonishment to use the function pointer version of sort instead. Using std::function is almost 4x slower for a trivial float comparison.
What’s going on? Does std::function not compile down to a function pointer??
I tried using godbolt to see what was happening but that wasn’t very helpful. Time to fire up the debugger.
To the debugger!
After using a bunch of higher-level languages for 20 years which have pretty crap debugging capabilities, I’m pleasantly surprised to discover that debugging in C++ is excellent. Especially now that there’s a great front-end to gdb called Seer.
I compiled a test case with g++ foo.cpp -o foo -g, fired up seergdb, and set a breakpoint inside of this no-capture lambda argument to std::function:
std::function<bool(int,int)> cmp_stdfunc =
[](int a, int b){ return a < b; };
The result was immediately illuminating! I expected std::sort to call the function described in cmp_stdfunc, but there were actually about four more stack frames between the inside of sort and the return a < b.
Unexpected stack frames1 between higher-order2 sort and the std::function comparison
(1) The std::function object operator() method
This method tests to see if the function pointer (called the “functor”3 in the code) is empty and raises if it is. Otherwise it calls an “invoker” method on the function object and forwards the args.
operator()(_ArgTypes... __args) const {
if (_M_empty())
__throw_bad_function_call();
return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...);
}
(2) The invoker
The invoker is a wrapper that basically dereferences the function pointer in the function object and calls a second invoker, forwarding the args.
This invoker is selected when the lambda4 is stored. In this case, it selects an invoker that can invoke this no-capture lambda4 callable.
static _Res
_M_invoke(const _Any_data& __functor, _ArgTypes&&... __args)
{
return std::__invoke_r<_Res>(*_Base::_M_get_pointer(__functor),
std::forward<_ArgTypes>(__args)...);
}
(3) The second invoker
The second invoker has a special constexpr case for handling functions that return void. If the return type of the function is not void, it instruments a branch that returns the function’s return value. Otherwise it doesn’t try returning it. I think this is just for template5 type discipline and is not meaningful for logic. E.g. return foo() is not cool if foo() itself is void.
template<typename _Res, typename _Callable, typename... _Args>
constexpr enable_if_t<is_invocable_r_v<_Res, _Callable, _Args...>, _Res>
__invoke_r(_Callable&& __fn, _Args&&... __args)
noexcept(is_nothrow_invocable_r_v<_Res, _Callable, _Args...>)
{
using __result = __invoke_result<_Callable, _Args...>;
using __type = typename __result::type;
using __tag = typename __result::__invoke_type;
if constexpr (is_void_v<_Res>)
std::__invoke_impl<__type>(__tag{}, std::forward<_Callable>(__fn),
std::forward<_Args>(__args)...);
else
return std::__invoke_impl<__type>(__tag{},
std::forward<_Callable>(__fn),
std::forward<_Args>(__args)...);
}
(4) The third invoker
This third invoker finally invokes the function pointer, though it does it as a std::forward(__f) on the function pointer itself (and also forwards the args). I don’t understand why this frame exists at all, I assume it’s more template-related stuff.
template<typename _Res, typename _Fn, typename... _Args>
constexpr _Res
__invoke_impl(__invoke_other, _Fn&& __f, _Args&&... __args)
{ return std::forward<_Fn>(__f)(std::forward<_Args>(__args)...); }
Finally the comparison function (return a < b) is called.
Discussion of the frames
It’s tempting to blame the slowdown on all of these wrapper calls, but my guess is a decent compiler would optimize away a lot of them1(remember I didn’t use any -O flags). That said, there is some work that does not look trivial to delete.
One issue is the function argument could be “not set”. This is some runtime flexibility offered in std::function, which allows creating the object with no function argument, or allows for changing it. If I was designing this interface I would require the function pointer to be provided at construction. Make invalid states unrepresentable. But I’m sure this is naive of me. Anyway! This demands a test for null-ness. Right off the bat you can imagine every iteration of std::sort inserting an unseful extra if (cmp != nullptr) { ... } before calling cmp.
But what about dereferencing and calling it through the invoker? Why are we calling an invoker to call a function when we could just call the function? Because of type erasure6!
Type erasure6 and runtime penalties
One rule of thumb in C++ is if you see a type container that doesn’t fully encode the underlying representation of the thing it contains, you can expect there to be a runtime penalty.
You may look at std::function<bool(int, int)> and protest that this has everything needed to call the function. And if your function is a function, function pointer, or a no-capture lambda4 (which boils down to a function pointer), that’s true! But std::function actually supports all callable objects. That means it can also dispatch to method calls for objects, and lambdas4 that capture environment. Why are these different?
The compiler can call a function simply by pushing its arguments onto the stack and jumping to it. But an object method has a hidden third argument, a pointer to the object itself – the this pointer. That needs to be bundled with the function pointer. And for a lambda4 that captures its calling environment7 this extra information can be a pointer to an entire object graph.
So, a full std::function object architecture must necessarily store the function pointer, a flag indicating what kind of callable object it is and an additional pointer to a payload that is either the this pointer or the closed-over calling environment objects. Since the kind of callable is itself not represented in the type we have erased (or hidden) some information.
This “type erasure”6 necessarily incurs a runtime cost.
Why not use trampolines8 for uniform calling?
If you’re used to dynamic languages or strictly statically typed functional programming languages this semantic overhead pushed onto the programmer and danced around at runtime seems doubly counter-productive.
Surely we could re-write calls to object methods and capturing-lambdas4 so that they conform to the same function interface but also bundle the necessary state? In fact, yes, we could! These are called “trampolines”8. For an object method you could dynamically create a function, inter the object itself into the body of the function, and write the desired method call. This has an additional overhead but approximately the same as the type erased approach, but you only pay it if you use this kind of callable. You can then pass this around and use it interchangeably in any binary interface that expects a function. You could do similarly with capturing-lambdas.
You could, but this is discouraged. The primary reason is security. Many execution environments disable runtime machine-code generation since it’s a security vulnerability. A trampoline8 would require treating data as code which is a modern day no-no. Code is code and data is data. Code is generally marked as non-writable and data is marked as non-executable. Violating either of those norms can crash your program in strict environments.
Why is it 4x slower again?
In total, the overhead above simply calling a function pointer when using std::function is at least: a nullness check, a dereference of the invoker, and then a call into the invoker. That can sum up to an approximate 4x penalty above doing a simple float comparison that’s being provided to std::sort.
Footnotes (partly auto-generated)
-
Stack frames: Data structures that store information about active function calls, including local variables, parameters, and return addresses. Each function call creates a new stack frame. ↩ ↩2
-
Higher-order function: A function that takes other functions as arguments or returns functions as results.
std::sortis a higher-order function because it accepts a comparison function as a parameter. ↩ -
Functor: An object that can be called like a function by implementing the
operator()method. In C++, this includes function objects, lambdas, and function pointers. This kind of “functor” is completely different from functors in OCaml or Haskell. ↩ -
Lambda: An anonymous function that can be defined inline. In C++, lambdas can capture variables from their surrounding scope, making them more flexible than regular function pointers. Capturing requires bundling them with additional state. Non-capturing lambdas are effectively function pointers. ↩ ↩2 ↩3 ↩4 ↩5 ↩6
-
Template: A C++ feature that allows writing generic code that works with different types. Templates are resolved at compile time, creating type-specific code for each instantiation. ↩
-
Type erasure: More formally, a technique where specific type information is hidden behind a common interface.
std::functionuses type erasure to store different kinds of callable objects (functions, lambdas, functors) in a uniform way, at the cost of runtime overhead. ↩ ↩2 ↩3 -
Closure/closed-over environment: When a lambda “captures” variables from its surrounding scope, those variables become part of the lambda’s closure. This captured data must be stored and passed along with the function. ↩
-
Trampolines: Small pieces of dynamically generated code that adapt one calling convention to another. They could theoretically make all callable types uniform, but are security risks since they require treating data as executable code. ↩ ↩2 ↩3