Mohegan SkunkWorks

Sun, 09 Aug 2009 10:27:45 EST

Factorials, Tail Recursion and CPS ... in C

Recursive algorithms are elegant. However, if the recursion is not a tail call the growth of the stack leads to a stack overflow.

Tail call recursion is a technique whereby the last call in a recursive function does not depend on the variables pushed on the stack. In other words the function returns the value of its additional (recursive) call.

Functional languages like Haskell or Lisp are designed to support the use of tail recursive algorithms.The JVM -although now the target platform of a lisp like clojure or a hybrid functional language like scala - does not support tail recursion at all. In C/C++ the compiler can in fact replace tail recursive calls with a simple loop, thereby eliminating the allocation for additional stack frames all together. In this post I'll consider various implementations of the humble factorial to illustrate some of these things.

The factorial of an integer n is defined as:

n! = n * (n-1)!  
1! = 1  

Here's a straightforward implementation using a while :

fact = 1;  
while (n > 0) {  
  	  fact *=n;  

A naive recursive implementation looks like this :

	int  fact(int n) {  
        if (n == 1) return 1;  
   	    return n * fact(n-1);  

fact cannot return a value until all previous values have been popped of the stack, causing the stack to grow. A tail recursive version of the same algorithm uses an accumulator to remove the multiplication on the last line :

int fact(int n, int accum) {  
	if n==1 return accum;  
	return fact(n-1, n*accum);  

When discussing recursion the y-combinator cannot be far behind.

This document discusses the reduction of a factorial in scheme to an application of the y-combinator. When I read this, a third possibility of calculating the factorial sprung to mind: pass the function as part of the recursion. This is very much like a CPS transformation: Each subsequent call is passed a continuation of the previous calculation.

C doesn't support continuations. The role of the continuation is played by the stack, but that is not accessible as a first-order object in C.

A first approach is to rewrite the previous expressions as :

int fact(Func f, int n, int accum) {  
  	    if n == 1 return accum;  
  	    return f(f, n-1, n*accum);  

What's the signature of f ? It's a function which takes as it's first argument a function , which takes as it's first argument a function, which.... In other words we now need to explicitly type the recursion, which is not possible in C. Never mind then ? Well, we can now take advantage of C's lack of typing and change the implementation just a little:

     int fact(void* f, int val, int acc) {  
  	      if (val == 1) return acc;  
  	      Func* fptr = (Func * ) f;  
  	      return fptr((void *) f, val-1, acc*val);  

This does the trick.

The file factorial.c in c_fact contains all three implementations discussed above, as well as the 'loop' based calculation. I'm compiling on a 64 bit machine (AMD Turion dual core), running Ubuntu.

It's instructive too look at the assembly code generated by the compilet for each implementation, at various levels of optimization.

First, I compile a non-optimized debug version, like so :

gcc -m32 -g    -c -o factorial.o factorial.c  
gcc -m32 factorial.o -o factorial  
objdump -D ./factorial 

The -m32 will force gcc to generate 32 bit code. The 64 bit code is not going to be materially different. objdump -D is used to disassemble the resulting executable.

Here is the dump for the first variation of the factorial function :

 0804845e factorial:  
 804845e:	55                   	push   %ebp  
 804845f:	89 e5                	mov    %esp,%ebp  
 8048461:	83 ec 08             	sub    $0x8,%esp  
 8048464:	83 7d 08 00          	cmpl   $0x0,0x8(%ebp)  
 8048468:	75 09                	jne    8048473 <factorial+0x15>  
 804846a:	c7 45 fc 01 00 00 00 	movl   $0x1,-0x4(%ebp)  
 8048471:	eb 17                	jmp    804848a <factorial+0x2c>  
 8048473:	8b 45 08             	mov    0x8(%ebp),%eax  
 8048476:	83 e8 01             	sub    $0x1,%eax  
 8048479:	89 04 24             	mov    %eax,(%esp)  
 **804847c:	e8 dd ff ff ff       	call   804845e <factorial>**  
 8048481:	89 c2                	mov    %eax,%edx  
 8048483:	0f af 55 08          	imul   0x8(%ebp),%edx  
 8048487:	89 55 fc             	mov    %edx,-0x4(%ebp)  
 804848a:	8b 45 fc             	mov    -0x4(%ebp),%eax  
 804848d:	c9                   	leave   
 804848e:	c3                   	ret     

Notice that in position 804847c the factorial function calls itself again.

The growth of the stack is quite obvious in the debugger:

#0  factorial (val=3) at factorial.c:23  
#1  0x08048481 in factorial (val=4) at factorial.c:24  
#2  0x08048481 in factorial (val=5) at factorial.c:24  
#3  0x08048481 in factorial (val=6) at factorial.c:24  
#4  0x08048481 in factorial (val=7) at factorial.c:24  
#5  0x08048481 in factorial (val=8) at factorial.c:24  
#6  0x08048481 in factorial (val=9) at factorial.c:24  
#7  0x08048481 in factorial (val=10) at factorial.c:24  
#8  0x0804861e in main (argc=2, argv=0xfffe8364) at factorial.c:66 

The assembly code generated for the other two implementations is not substantially different.

0804848f <factorial2>:  
 804848f:	55                   	push   %ebp  
 8048490:	89 e5                	mov    %esp,%ebp  
 8048492:	83 ec 0c             	sub    $0xc,%esp  
 8048495:	83 7d 08 01          	cmpl   $0x1,0x8(%ebp)  
 8048499:	75 08                	jne        80484a3 <factorial2+0x14>  
 804849b:	8b 45 0c             	mov    0xc(%ebp),%eax  
 804849e:	89 45 fc             	mov    %eax,-0x4(%ebp)  
 80484a1:	eb 1c                	jmp        80484bf <factorial2+0x30>  
 80484a3:	8b 45 08             	mov    0x8(%ebp),%eax  
 80484a6:	0f af 45 0c          	imul   0xc(%ebp),%eax  
 80484aa:	8b 55 08             	mov    0x8(%ebp),%edx  
 80484ad:	83 ea 01             	sub    $0x1,%edx  
 80484b0:	89 44 24 04          	mov    %eax,0x4(%esp)  
 80484b4:	89 14 24             	mov    %edx,(%esp)  
 80484b7:	e8 d3 ff ff ff       	call       804848f <factorial2>  
 80484bc:	89 45 fc             	mov    %eax,-0x4(%ebp)  
 80484bf:	8b 45 fc             	mov    -0x4(%ebp),%eax  
 80484c2:	c9                   	leave   
 80484c3:	c3                   	ret     
 080484c4 <factorial3>:  
 80484c4:	55                   	push   %ebp  
 80484c5:	89 e5                	mov    %esp,%ebp  
 80484c7:	83 ec 28             	sub    $0x28,%esp  
 80484ca:	83 7d 0c 01          	cmpl   $0x1,0xc(%ebp)  
 80484ce:	75 08                	jne        80484d8 <factorial3+0x14>  
 80484d0:	8b 45 10             	mov    0x10(%ebp),%eax  
 80484d3:	89 45 ec             	mov    %eax,-0x14(%ebp)  
 80484d6:	eb 29                	jmp        8048501 <factorial3+0x3d>  
 80484d8:	8b 45 08             	mov    0x8(%ebp),%eax  
 80484db:	89 45 fc             	mov    %eax,-0x4(%ebp)  
 80484de:	8b 45 10             	mov    0x10(%ebp),%eax  
 80484e1:	0f af 45 0c          	imul   0xc(%ebp),%eax  
 80484e5:	8b 55 0c             	mov    0xc(%ebp),%edx  
 80484e8:	83 ea 01             	sub    $0x1,%edx  
 80484eb:	89 44 24 08          	mov    %eax,0x8(%esp)  
 80484ef:	89 54 24 04          	mov    %edx,0x4(%esp)  
 80484f3:	8b 45 08             	mov    0x8(%ebp),%eax  
 80484f6:	89 04 24             	mov    %eax,(%esp)  
 80484f9:	8b 45 fc             	mov    -0x4(%ebp),%eax  
 80484fc:	ff d0                	call   *%eax  
 80484fe:	89 45 ec             	mov    %eax,-0x14(%ebp)  
 8048501:	8b 45 ec             	mov    -0x14(%ebp),%eax  
 8048504:	c9                   	leave   
 8048505:	c3                   	ret    

In both cases the recursion is indicated by the assembly instruction call on the function itself.

Ok, now let's turn up the optimization level a bit :

gcc -m32 -g -O2   -c -o factorial.o factorial.c  
gcc -m32 factorial.o -o factorial 

Here's the assembly code generated by the compiler for all three versions of the factorial:

 08048410 <factorial>:  
 8048410:	55                   	push   %ebp  
 8048411:	b8 01 00 00 00       	mov    $0x1,%eax  
 8048416:	89 e5                	mov    %esp,%ebp  
 8048418:	8b 55 08             	mov    0x8(%ebp),%edx  
 804841b:	85 d2                	test   %edx,%edx  
 804841d:	74 09                	je         8048428 <factorial+0x18>  
 804841f:	90                   	nop     
 8048420:	0f af c2             	imul   %edx,%eax  
 8048423:	83 ea 01             	sub    $0x1,%edx  
 8048426:	75 f8                	jne        8048420 <factorial+0x10>  
 8048428:	5d                   	pop    %ebp  
 8048429:	c3                   	ret     
 804842a:	8d b6 00 00 00 00    	lea    0x0(%esi),%esi  
 08048430 <factorial2>:  
 8048430:	55                   	push   %ebp  
 8048431:	89 e5                	mov    %esp,%ebp  
 8048433:	8b 55 08             	mov    0x8(%ebp),%edx  
 8048436:	8b 45 0c             	mov    0xc(%ebp),%eax  
 8048439:	83 fa 01             	cmp    $0x1,%edx  
 804843c:	74 0d                	je         804844b <factorial2+0x1b>  
 804843e:	66 90                	xchg   %ax,%ax  
 8048440:	0f af c2             	imul   %edx,%eax  
 8048443:	83 ea 01             	sub    $0x1,%edx  
 8048446:	83 fa 01             	cmp    $0x1,%edx  
 8048449:	75 f5                	jne         8048440 <factorial2+0x10>  
 804844b:	5d                   	pop    %ebp  
 804844c:	c3                   	ret     
 804844d:	8d 76 00             	lea    0x0(%esi),%esi  
 08048450 <factorial3>:  
 8048450:	55                   	push   %ebp  
 8048451:	89 e5                	mov    %esp,%ebp  
 8048453:	8b 45 0c             	mov    0xc(%ebp),%eax  
 8048456:	8b 4d 08             	mov    0x8(%ebp),%ecx  
 8048459:	8b 55 10             	mov    0x10(%ebp),%edx  
 804845c:	83 f8 01             	cmp    $0x1,%eax  
 804845f:	74 0f                	je     8048470 <factorial3+0x20>  
 8048461:	0f af d0             	imul   %eax,%edx  
 8048464:	83 e8 01             	sub    $0x1,%eax  
 8048467:	89 45 0c             	mov    %eax,0xc(%ebp)  
 804846a:	89 55 10             	mov    %edx,0x10(%ebp)  
 804846d:	5d                   	pop    %ebp  
 804846e:	ff e1                	jmp    *%ecx  
 8048470:	5d                   	pop    %ebp  
 8048471:	89 d0                	mov    %edx,%eax  
 8048473:	c3                   	ret     
 8048474:	8d b6 00 00 00 00    	lea    0x0(%esi),%esi  
 804847a:	8d bf 00 00 00 00    	lea    0x0(%edi),%edi  

At this stage, the recursion has been replaced by a loop in the first two factorial algorithms (factorial and factorial2 above).

In factorial3 the recursion has been replaced by a jmp instruction. This is obviously an improvement on the call instruction, since it skips saving the instruction pointer on the stack. Note though, that for all intents and purposes the instruction pointer is already on the stack, as the function pointer is an argument of the function itself.

It's instructive so see how all three implementations behave in the debugger. If you set a breakpoint at the entry point of the first two functions the breakpoint will respected the first time functions are entered. Since the recursion has been replaced by a loop, this will be the only time the breakpoint is hit.

That's pretty annoying, if you want to examine the state of the function body on subsequent iterations by breaking on the function invocation. The compiler has rewritten your function body and essentially eliminated that entry point. Sure, gdb allows you to debug at the assembly level, and with the assembly generated by objdump in hand you're certainly in a position to examine the function body nevertheless. But you have to work a level down (or is it up ?) on the abstraction scale. The optimization of the third factorial function does not eliminate the function invocation as a possible break point.

What's the upshot of all of this ?

A C/C++ compiler should be able to optimize a simple recursive algorithm, like the factorial, by replacing the recursive call with a loop. The compiler rewrites the function body, in effect changing the algorithm. That's easy to verify by inspecting the assembly code. For a simple enough function like the first two factorial functions this is probably not too much of an issue. When the function body is more complex, there is going to be a disconnect between the code generated by the compiler and the original. That will complicate reasoning about the correctness of it.

However, the extend of the rewrite can be controlled. The optimization of the CPS like algorithm kept the original function body substantially in place, but did remove the build up of the stack. The benefit here is that the non-optimized and optimized function are not substantially different, as can be verified by inspection in the debugger.