sys_clone: Beyond Processes and Threads

Originally posted on January 1st, 2009, on the now-defunct Tuxology technoblog

Most Linux developers are aware of the two kinds of library calls for creating a new context of execution in Linux:

fork() and the related vfork(), which create a new process, complete with its own process ID, private address space and a private copy of the file descriptors, file system attributes (such as the working directory) and signal handlers, all distinctly separate from those of the parent that called fork(). (vfork() is a special case: the child borrows the parent's memory until it calls execve() or _exit().)

And pthread_create(), which creates a new thread inside the existing process, sharing the same process ID, address space, file descriptors, file system attributes and signal handlers with the caller.

Looking at these two options, it would seem that we are faced with a “share all or share nothing” attitude: either we go the new process route and get a private copy of everything (Linux actually uses copy-on-write semantics to make more efficient use of memory), or we go the new thread route and share all resources. For the most part, these two options are sufficient.
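To make the “share all or share nothing” split concrete, here is a minimal sketch. It assumes Linux and a glibc recent enough (2.34 or later) that the pthread functions live in libc itself; with an older glibc, compile with -pthread. The names counter, counter_after_fork() and counter_after_thread() are hypothetical. A write made by a forked child lands in the child's private copy, while a write made by a thread is visible to the caller.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A global that the new execution context will try to modify. */
static int counter;

static void *thread_body(void *arg)
{
    (void)arg;
    counter = 42;           /* same address space: the creator sees this */
    return NULL;
}

/* Returns the creator's view of counter after a forked child wrote to it. */
int counter_after_fork(void)
{
    counter = 0;
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        counter = 42;       /* changes only the child's copy-on-write copy */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return counter;         /* still 0 in the parent */
}

/* Returns the creator's view of counter after a thread wrote to it. */
int counter_after_thread(void)
{
    pthread_t tid;
    counter = 0;
    if (pthread_create(&tid, NULL, thread_body, NULL) != 0)
        return -1;
    pthread_join(tid, NULL);
    return counter;         /* 42: the thread shares our memory */
}
```

counter_after_fork() returns 0 and counter_after_thread() returns 42.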

Sometimes, however, a less black-and-white approach is called for: for example, when we would like to create a new context of execution that shares the file descriptors and file system attributes of its creator, but not its address space (or maybe just a portion of it) or its process ID.

For exactly these cases, Linux offers clone(2). clone() is a library function implemented in the C library, Glibc, and is layered on top of the underlying sys_clone system call.

Like fork(2) and pthread_create(3), clone creates a new execution context that is scheduled independently of its creator.

Unlike fork and pthread_create, however, clone provides fine-grained control over the properties of this new context:
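Because clone() is a wrapper, the system call itself can also be invoked directly through syscall(2). The sketch below assumes an architecture such as x86-64 or arm64, where the first two raw arguments are the flags and the new stack pointer (s390 swaps those two); the helper name raw_clone_exit_status() is hypothetical. With SIGCHLD as the only flag and no new stack, the child keeps running on a copy-on-write copy of the caller's stack, much like after fork(). Since the raw call skips Glibc's own fork-time bookkeeping, the child here does nothing but _exit().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>        /* SIGCHLD */
#include <sys/syscall.h>   /* SYS_clone */
#include <sys/types.h>
#include <sys/wait.h>      /* waitpid(), WIFEXITED(), WEXITSTATUS() */
#include <unistd.h>        /* syscall(), _exit() */

/*
 * Calls the raw clone system call, bypassing the Glibc wrapper.
 * Assumption: flags come first and the stack pointer second (true on
 * x86-64 and arm64). All remaining arguments are unused here and zero.
 * Returns the child's exit status, or -1 on error.
 */
int raw_clone_exit_status(void)
{
    long pid = syscall(SYS_clone, (long)SIGCHLD, 0L, 0L, 0L, 0L);
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(7);           /* child: leave immediately with status 7 */

    int status;
    if (waitpid((pid_t)pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```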

  • What it will and will not share with its creator.
  • Which parent process it will belong to.
  • Which signal, if any, will be delivered to its parent when it terminates.
  • Where the new task's call stack is located.
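Each item in the list corresponds to a clone() argument: the flags word selects what is shared (and, with CLONE_PARENT, which parent the task belongs to), its low byte holds the termination signal, and the stack argument gives the stack location. The helper below is a hypothetical sketch (run_task(), example_task() and TASK_STACK_SIZE are names invented here) that creates a task and returns its exit status. It always requests SIGCHLD and never CLONE_PARENT, so that the caller can reap the task with an ordinary waitpid().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>       /* clone(), CLONE_* flags */
#include <signal.h>      /* SIGCHLD */
#include <stdint.h>      /* intptr_t */
#include <stdlib.h>      /* malloc(), free() */
#include <sys/wait.h>    /* waitpid() */

#define TASK_STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

/*
 * Runs fn(arg) in a new task and returns its exit status, or -1 on error.
 * share_flags selects what is shared (CLONE_FS, CLONE_FILES, ...) and must
 * not include CLONE_PARENT, because we wait for the task ourselves.
 */
int run_task(int (*fn)(void *), void *arg, int share_flags)
{
    char *stack = malloc(TASK_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downwards on most architectures: pass its top.
       SIGCHLD in the low byte is the signal sent to us on exit. */
    pid_t pid = clone(fn, stack + TASK_STACK_SIZE, share_flags | SIGCHLD, arg);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status;
    pid_t r = waitpid(pid, &status, 0);
    free(stack);                     /* the task has exited by now */
    if (r == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* Example task: exits with the small integer passed through arg. */
int example_task(void *arg)
{
    return (int)(intptr_t)arg;
}
```

For instance, run_task(example_task, (void *)(intptr_t)5, CLONE_FS | CLONE_FILES) returns 5, the value example_task() returned in the new task.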

As an example, here is a code snippet that creates a new task (for lack of a better word) implementing our previous example: a new process with its own process ID, a private address space (a copy-on-write copy of the creator's address space) and a separate set of signal handlers, but which shares the file descriptor table and file system attributes with its creator:

 
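#define _GNU_SOURCE        /* Glibc declares clone() only with this */
#include <sched.h>         /* clone(), CLONE_* flags */
#include <signal.h>        /* SIGCHLD */
#include <sys/mman.h>      /* mmap(), munmap() */

#define STACK_SIZE 4096

int test(void *arg);       /* the function the new task runs */

void *stack = mmap(NULL, STACK_SIZE,
   PROT_READ | PROT_WRITE,
   MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN, -1, 0);

/* check for alloc errors (stack == MAP_FAILED)... */

/* The stack grows downwards on most architectures: pass its top. */
clone(test, (char *)stack + STACK_SIZE,
    CLONE_FS | CLONE_FILES | CLONE_PARENT | SIGCHLD, NULL);
/* check for clone errors (a return value of -1)... */

/* Without CLONE_VM the new task has its own copy of this mapping,
   so the creator can unmap its copy right away. */
munmap(stack, STACK_SIZE);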
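The snippet does not show test() itself and, because of CLONE_PARENT, the new task becomes a sibling of its creator, which therefore cannot wait for it. The self-contained sketch below drops CLONE_PARENT for exactly that reason, passes clone() the top of the mapping, and gives test() an illustrative body (invented here) that makes the sharing observable: the task changes directory to / and opens /dev/null. Since CLONE_FS and CLONE_FILES are set, both effects show up in the creator. The task reports its new descriptor through its exit status, which assumes the descriptor number is below 255.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>       /* open(), fcntl() */
#include <sched.h>       /* clone(), CLONE_* flags */
#include <signal.h>      /* SIGCHLD */
#include <sys/mman.h>    /* mmap(), munmap() */
#include <sys/wait.h>    /* waitpid() */
#include <unistd.h>      /* chdir(), getcwd(), close() */

#define STACK_SIZE (64 * 1024)

/* The new task: change directory, open a file, report the descriptor. */
static int test(void *arg)
{
    (void)arg;
    if (chdir("/") == -1)                   /* shared via CLONE_FS */
        return 255;
    int fd = open("/dev/null", O_RDONLY);   /* shared via CLONE_FILES */
    return fd < 0 ? 255 : fd;               /* exit status carries the fd */
}

/*
 * Runs test() in a task sharing the fd table and file system attributes,
 * but not the address space. Returns the fd the task opened, or -1.
 */
int run_shared_fs_files_task(void)
{
    void *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    pid_t pid = clone(test, (char *)stack + STACK_SIZE,
                      CLONE_FS | CLONE_FILES | SIGCHLD, NULL);
    int status = 0;
    int ok = pid != -1 && waitpid(pid, &status, 0) == pid
             && WIFEXITED(status);
    munmap(stack, STACK_SIZE);
    if (!ok || WEXITSTATUS(status) == 255)
        return -1;
    return WEXITSTATUS(status);
}
```

After a successful call, the returned descriptor is open in the creator as well and the creator's working directory is /, even though test() ran in a separate process.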

As you can see in the example above, we first allocate a stack for the new task using a private anonymous memory mapping, and ask the kernel (via MAP_GROWSDOWN) to grow the mapping downwards as needed (not every architecture has a downward-growing stack, though).
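A different but common way to prepare such a stack, shown here as a hedged sketch rather than the article's method, is to map a fixed-size region marked MAP_STACK, make its lowest page inaccessible as a guard so that an overflow faults instead of silently overwriting neighboring memory, and hand clone() the highest address. alloc_task_stack() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>      /* size_t */
#include <sys/mman.h>    /* mmap(), mprotect(), munmap() */
#include <unistd.h>      /* sysconf() */

/*
 * Maps a stack of at least `size` bytes plus one guard page below it.
 * Returns the top of the usable region (the address to pass to clone()),
 * or NULL on error. *base and *total receive the start and length of the
 * whole mapping, for a later munmap().
 */
void *alloc_task_stack(size_t size, void **base, size_t *total)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size = (size + page - 1) / page * page;        /* round up to pages */

    void *mem = mmap(NULL, size + page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;

    /* The lowest page becomes the guard: running off the stack faults. */
    if (mprotect(mem, page, PROT_NONE) == -1) {
        munmap(mem, size + page);
        return NULL;
    }
    *base = mem;
    *total = size + page;
    return (char *)mem + size + page;              /* stack grows down from here */
}
```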

Then, we use the clone() library call to create the new task, sharing the file descriptor table and file system attributes, but not the address space, with the creator.

We also ask, with CLONE_PARENT, that the new task have the same parent as its creator, and, by placing SIGCHLD in the low byte of the flags, that a SIGCHLD signal be sent to that parent when the new task terminates, as would normally be the case.
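The choice of SIGCHLD matters beyond signal delivery: according to clone(2), a child whose termination signal is anything other than SIGCHLD must be waited for with the __WALL or __WCLONE option. The hypothetical sketch below (quit() and no_signal_needs_wall() are invented names) creates a task with no termination signal at all, a low byte of 0, and checks that a plain waitpid() reports ECHILD while waitpid() with __WALL reaps the task.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>       /* errno, ECHILD */
#include <sched.h>       /* clone() */
#include <stdlib.h>      /* malloc(), free() */
#include <sys/wait.h>    /* waitpid(), __WALL */

#define STACK_SIZE (64 * 1024)

static int quit(void *arg)
{
    (void)arg;
    return 3;            /* becomes the task's exit status */
}

/*
 * Returns 1 if a task created without a termination signal is invisible
 * to a plain waitpid() (ECHILD) but is reaped by waitpid() with __WALL.
 */
int no_signal_needs_wall(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return 0;

    pid_t pid = clone(quit, stack + STACK_SIZE, 0, NULL);  /* low byte 0 */
    if (pid == -1) {
        free(stack);
        return 0;
    }

    int status;
    int plain_fails = waitpid(pid, &status, 0) == -1 && errno == ECHILD;
    int reaped = waitpid(pid, &status, __WALL) == pid
                 && WIFEXITED(status) && WEXITSTATUS(status) == 3;
    free(stack);
    return plain_fails && reaped;
}
```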

As you can see, the clone() library call and the underlying sys_clone system call are powerful tools for creating execution contexts beyond the standard thread or process variety. Many more options are detailed in the clone(2) man page.