15 Nov 2018

Preemptive multitasking is now more than 50 years old, initially introduced in PDP-6 Monitor and MULTICS in 1964. Instead of relying on processes to cooperatively release the CPU to the kernel, they registered timer interrupt handlers to move control back to kernel code.

Nevertheless, processes were still able to yield control of the CPU to the kernel during a time-slice by calling any system call.

System calls are often distinguished into blocking or non-blocking depending on the probability that the kernel will return the control to the calling process’ code before the next tick.

In real world, non-blocking syscalls might block the process and blocking syscalls might not block at all: it all depends on implementation details and on the state of the system at run-time.

Classical examples of blocking system calls are

sleep that should block for a certain number of seconds
wait that could block until a children process exit
read that would block until a chunk of data is available
write that would block on a pipe without enough space for the new data

CPU control and time

Sleep is a very interesting syscall because its whole point is just to yield the CPU for a while, so that other processes can use it. Its main use case is to poll for a certain resource to become available without subtracting computing power that could be used to produce such resource: it’s a tool for cooperative multitasking in a OS supporting preemptive multitasking.

Other blocking system calls (like wait, read and write) depend on external events to return control to the process. Soon programmers realized that an external event might never occur and designed ways to mitigate the risks.

Unix introduced signals and services like alarm and setitimer to give back control to the calling process on certain events or after a certain time.

Later, new syscalls like select, poll, kqueue or sigtimedwait were introduced with a timeout parameter from the very beginning.

Plan 9 from Bell Labs

Compared to the 400 system calls of Linux, Plan 9’s API is rather simpler but it still supports a bit of each time-control styles:

sleep suspends the calling process for a number of milliseconds specified by the argument.
alarm causes an alarm note to be sent to the invoking process after a number of milliseconds.
tsemacquire only waits for a number of milliseconds to acquire a semaphore.

By design, Plan 9 provides only a very limited support for non-blocking I/O (through alarm) and no support for I/O multiplexing:

if a program needs to wait for several resources, it usually calls rfork
if a program want to serve several concurrent clients, it usually expose a 9P2000 filesystem

Furthermore, with libthread, Plan 9 provides an implementation of Hoare’s Communicating Sequential Processes in which dedicated (preemptively scheduled) processes are used to issue any blocking system calls and several cooperatively-scheduled threads sharing memory in a single process are used to access global state.

The Curse of Frankenstein

The complex interactions between the different ways to handle timeouts in Unix (and, to a lesser extent, in Plan 9) are direct effects of its design philosophy.

As explained by Richard P. Gabriel in his famous essay, “Unix and C are the ultimate computer viruses”.

Worse is better, isn’t it?

Worse-is-Better is a model of software design and implementation which has the following characteristics (in approximately descending order of importance):

Simplicity
The design must be simple, both in implementation and interface. It is more important for the implementation to be simple than the interface. Simplicity is the most important consideration in a design.
Correctness
The design should be correct in all observable aspects, but it is slightly better to be simple than correct.
Consistency
The design must not be overly inconsistent. Consistency can be sacrificed for simplicity in some cases, but it is better to drop those parts of the design that deal with less common circumstances than to introduce either complexity or inconsistency in the implementation.
Completeness
The design must cover as many important situations as is practical. All reasonably expected cases should be covered.

Completeness can be sacrificed in favor of any other quality. In fact, completeness must be sacrificed whenever implementation simplicity is jeopardized. Consistency can be sacrificed to achieve completeness if simplicity is retained; especially worthless is consistency of interface.

In a way the so called “New Jersey style” was a rush for a minimum viable product able to minimize the time-to-market and to gain the first mover advantage.

Quoting Gabriel

Both early Unix and C compilers had simple structures, are easy to port, require few machine resources to run, and provide about 50%-80% of what you want from an operating system and programming language.

[…]

It is important to remember that the initial virus has to be basically good. If so, the viral spread is assured as long as it is portable. Once the virus has spread, there will be pressure to improve it, possibly by increasing its functionality closer to 90%, but users have already been conditioned to accept worse than the right thing.

So far so good.

Early Unix, like Plan 9, didn’t aim at perfection, but at serving the majority of users’ need well enough.

But is “enough”, enough?

The Unix philosophy spread beyond Unix, and it became the common wisdom of software engineering so far.

For early Unix and Plan 9, enough is enough. Look at 9front to see a modern system that follows this philosophy consistently.

There is a catch, however.
Plan 9 (like early Unix), is a system evolving as a whole. This is particularly visible in the 9front repository. The whole system evolves together, all programs are modified consistently, answering to the ever changing needs of its users.

Plan 9 is one application of computers. An operating system split into several executables, but still one thing.

So in its asymptotic path toward completeness, all pieces progress together. When they provides 90% of required functionality, it really means 90%. Or 95%. Or 97%. Or 99%.

But what if we split the system into independent components and assign them to different teams following the same philosophy?

This is what happened to mainstream operating systems.

With their widespread adoption, the variety of people needs couldn’t be served anymore by a single company. Thus good APIs and development tools provided a strategic advantage to build ecosystems of applications that, by running on a certain OS, would have expanded its market share.

Unix was particularly well suited for this kind of race for application developers.

However this process had a significant but overlooked drawback: the running operating system lose its unity, its coherence. The operating system became a collection of variegate software, developed by a variety of different teams, each with their own goals and taste and set of best practices.

More pieces, more cracks!

Take one application that works correctly 99% of times, and you will be fine 99% of times.

But what if you have 10 applications each working correctly 99% of times?

The probability that they work correctly together at each computer clock is (99%)^10. That is: their composition will do what you expect roughly 90% of times.

With such assumption, the probability of a system behaving correctly goes down pretty quickly approaching the formula

p(n) = (0.99)ⁿ

Probability of whole system correctness.

How many programs are you running right now? :-D

This is the curse of Frankenstein: more pieces, more cracks.

Simplex Sigillum Veri

Plan 9 from Bell Labs followed the New Jersey style from the very beginning and still does so in most if it’s incarnations.

But what about Jehanne, from Meucci’s laboratory?

Turns out the design of Jehanne doesn’t follow the New Jersey style. I have no rush. Yet Jehanne doesn’t follow the MIT/Stanford style either.

To follow Gabriel’s scheme, we could say that Jehanne is based upon

Simplicity
The design must be simple. Few simple, easy to learn and orthogonal abstractions should be able to describe any use case conceivable.
If the implementation is difficult, the design is not simple enough.
Correctness
The design should be correct in all observable aspects.
Incorrectness is not allowed.
Consistency
The design must not be inconsistent. Any inconsistency reveals a design problem: either a missing abstraction or abstractions that are not orthogonal enough.
Composability
Completeness should naturally emerge as a (desirable) side effect.
As such, it cannot be a goal: the design must cover as few important situations as practical and let the user compose the abstractions provided to build what he needs.

I call this set of principles simplex sigillum veri.

It’s deeply inspired by the works of other Italian hackers, such as Antonio Meucci, Pier Giorgio Perotto, Guglielmo Marconi, Renzo Piano and Leonardo da Vinci.

The Awake system call

Plan 9 supports 39 system calls (not counting obsolete ones). Since some system calls can be expressed in term of the others, in Jehanne I moved the duplicates in userspace.

The idea is that a smaller kernel surface makes it easier to get it right and simple in the long run.

During such clean up, I realized that sleep, alarm and tsemacquire were somehow “ugly”: I supposed that they were mixing orthogonal issues and looked for the missing abstraction they were hiding.

This is how the awake syscall was born:

long awake(long milliseconds);

Awake is the complement of sleep: it books a new time slice in the future. It’s a fundamental building block that can be used to implement other services in user space, like sleep, alarm and tsemacquire.

It interrupts a blocking syscall after a certain number of milliseconds. On failure it returns 0. On success it returns a negative long that can be used as an identifier to release the booked time slice.

On wakeup, no note or signal is sent to the process, the process’ error string left unchanged: the blocking syscall simply returns to the process with a ~0ULL result.

A process can register how many wakeups it want (within a global system cap) and each wake up will have a chance to interrupt a system call.

Wakeups are booked in two distinct group that do not interact: normal process execution and note handlers. Such groups are automatically reset respectively on exits and on noted.

In libc, two very simple functions wrap awake to enhance readability.

Awakened tells the calling process whether a certain wakeup already occurred. Forgivewkp tells the kernel to remove a certain wakeup.

With these primitives it’s trivial to move sleep and tsemaquire to libc.

In particular, tsemaquire shows pretty well the simple idiom of awake:

	/* book a time slice in the future */
	long wkup = awake(ms);
	while(blocking_syscall() == -1){
		if(jehanne_awakened(wkup)){
			/* handle syscall timed out */
		}
		/* the syscall has been otherwise interrupted, you can
		 * - try again
		 * - fail with an error
		 * - do whatever fit
		 */
	}
	/* the syscall completed, release the booked time slice... */
	jehanne_forgivewkp(wkup);

	/* ...and enjoy */

(the mindful reader will notice that alarm is still waiting to be moved to user space… the fact is that it’s too boring of a task!)

To be fair, awake was originally designed to interrupt rendezvous only, to enable timeouts support for QLock, RWLock and Rendez in libc.

But from the very beginning it was clear that it could have been generalized and composed with other system calls, to provide other services.

With the advent of libposix, I used awake to implement support for POSIX non-blocking I/O, signals and sigset waiters.

Which in turn enabled the port of newlib and of MirBSD Korn Shell to Jehanne.

All this with more or less 600 lines of code.

Two kernel processes, a timer and a ringer, cooperate through a linked list of wakeups kept in order of expiration that is filled by the system calls. On each tick, if the first element of the registry is expired, the interrupt handler awake the timer, that prepare a collection of expired timers.

Then the ringer start a loop to interrupt the process properly.

These two different processes make it possible to keep processing new wakeups while the ringer is waiting for the process to be in an interruptible state.

Not all blocking system calls can be interrupted though. Create is a notable example of a blocking system call that has been excluded from the interruptible ones, to prevent a timeout to leave an orphan file behind.

Issues and future uses

Awake will be used to implement all timeouts of Jehanne’s system calls.

Even without libposix, it provides a simple mechanism to implement non-blocking I/O, if needed.

And it will be used to implement timeouts in multiplexing I/O too.

Unfortunately, as an interface to the kernel scheduler, awake doesn’t work with user space schedulers that keep control of their own tasks. Right now, this just means that qlockt, rlockt, wlockt and rsleept cannot work when linking libthread. In the long run, it might make slightly more difficult to port virtual machines and programming languages that provide their own scheduler in user space, like Go or Java.

The solution however is already outlined by libthread usage of a custom rendezvous: we will simply override the system call to give control to the desired scheduler.