Why you shouldn't block on D-Bus calls


I've found myself writing this several times recently in the context of Telepathy, so I thought I'd make a blog post that I can refer people to.

D-Bus is defined (by the wire protocol spec) to be an asynchronous message-passing system, and libdbus behaves accordingly; there are no blocking calls at the wire protocol level. However, libdbus provides a "blocking" API (dbus_do_something_and_block), via a mechanism I'll refer to as "pseudo-blocking" for the purposes of this post. Most D-Bus bindings (at least dbus-glib, dbus-python and QtDBus) expose this pseudo-blocking behaviour to library users.

These pseudo-blocking calls work like this:

send a method call message (call it M)
while a reply to M has not been received:
- select() or poll() on the D-Bus socket, ignoring all other I/O
- whenever a whole message has been received:
  - check whether the message is a reply to M
  - if it is (call it R), stop
  - if it is not, put it on the incoming message queue

The messages received between M and R are delivered when the main loop is next entered.
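The loop above can be modelled with a toy simulation. This is only a sketch, not how libdbus is implemented: the deque called `wire` stands in for the D-Bus socket, and the message dictionaries, signal names and serial number are all invented for illustration.

```python
from collections import deque

def pseudo_blocking_call(wire, serial, pending):
    """Read messages from `wire` until the reply to `serial` arrives.

    Anything that is not the awaited reply is parked on `pending`
    and only delivered once the main loop runs again.
    """
    while wire:
        msg = wire.popleft()              # stands in for select()/poll() + read
        if msg.get("reply_to") == serial:
            return msg                    # this is R: stop here
        pending.append(msg)               # not R: defer to the main loop
    raise TimeoutError("no reply to call %d" % serial)

# Order in which the bus delivered messages after call M (serial 1) was sent.
wire = deque([
    {"kind": "signal", "name": "Closed"},
    {"kind": "signal", "name": "Received"},
    {"kind": "reply", "reply_to": 1, "body": "ok"},
])
pending = deque()

handled = []
reply = pseudo_blocking_call(wire, 1, pending)
handled.append("reply:" + reply["body"])                     # caller sees R first
while pending:                                               # main loop runs later
    handled.append("signal:" + pending.popleft()["name"])

print(handled)
```

In this model the two signals were sent before the reply, but the caller handles the reply first and only sees the signals afterwards.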

This can cause a number of problems:

- Messages are re-ordered: messages received between M and R aren't delivered until after the reply, violating the ordering guarantee that the D-Bus daemon usually provides.

  (This causes practical problems if a signal indicating object destruction is delayed: the client gets an "UnknownMethod" error reply, has to guess that this is because the object has vanished, and can't know why it vanished until the signal indicating its destruction arrives with more details.)

- The client is completely unresponsive until the service replies. If the service has somehow got wedged, the client will be unresponsive for (by default) 25 seconds until the call times out. For example, telepathy-gabble is meant to be purely non-blocking and asynchronous, but there are cases where it will do blocking I/O on SSL connections due to Loudmouth bugs.

  (Clients shouldn't crash or lock up, whatever happens to the services they depend on.)

- The client can't parallelize calls: if a signal (e.g. an incoming Text message in Telepathy) causes method calls to be made, a client that uses pseudo-blocking calls can't even start processing the next message until those method calls return.

- If two processes make pseudo-blocking calls on each other, deadlock occurs. This is particularly tricky in the presence of a plugin architecture and shared D-Bus connections: a plugin that "knows" it's a client and not a service, and a plugin in the same process that "knows" it's a service and not a client, can end up sharing a connection, resulting in a process that is both a service and a client (and hence deadlock-prone).

  (We've seen this happen in the OLPC Sugar environment and on Nokia internet tablets; it's not just a theoretical concern.)
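The deadlock case is easy to reproduce in miniature. The following is a sketch under loud assumptions: two threads and two `queue.Queue` inboxes stand in for two processes on the bus, the function name `pseudo_blocking_peer` is invented, and the timeout is shortened from the 25-second default to 0.2 seconds so the example finishes quickly.

```python
import queue
import threading

def pseudo_blocking_peer(name, inbox, peer_inbox, results, timeout=0.2):
    """Call the other peer, then wait only for the reply to that call."""
    peer_inbox.put(("call", name))             # send method call M
    deferred = []                              # everything that is not R
    try:
        while True:
            kind, sender = inbox.get(timeout=timeout)
            if kind == "reply":
                results[name] = "got reply"
                return
            # The peer's call to us is parked here, so it is never answered.
            deferred.append((kind, sender))
    except queue.Empty:
        results[name] = "timed out"

inbox_a, inbox_b = queue.Queue(), queue.Queue()
results = {}
threads = [
    threading.Thread(target=pseudo_blocking_peer,
                     args=("A", inbox_a, inbox_b, results)),
    threading.Thread(target=pseudo_blocking_peer,
                     args=("B", inbox_b, inbox_a, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```

Neither peer ever answers the other's call, because each is busy waiting for its own reply, so both sit idle until their timeouts expire.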

As a result, telepathy-glib's code generation mechanism does not generate any pseudo-blocking code. There are several alternative modes of operation that we do support:

- Fully asynchronous: call a method now, pass it a callback, get your callback called from the main loop later. This can be awkward to program with, but is the only way forward for most non-trivial projects.
- Re-entrant main loop: re-enter the main loop, processing all messages (D-Bus, GUI, network, anything), and run it until the reply has been received. This is useful for trivial projects (like telepathy-glib's regression tests), but dangerous for non-trivial projects (like Empathy), and I'm somewhat regretting implementing this.
- Ignoring the reply: when appropriate (it rarely is), you can make an asynchronous call with callback = NULL and the reply will be ignored completely.
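To make the fully asynchronous and reply-ignoring modes concrete, here is a minimal sketch of a main loop that dispatches every message in arrival order. `ToyConnection`, `call_async`, `dispatch_one` and the method names `GetStatus` and `Ping` are hypothetical and are not telepathy-glib API; the point is only that callbacks run from the loop and nothing is re-ordered.

```python
from collections import deque

class ToyConnection:
    """Hypothetical stand-in for a D-Bus connection driven by a main loop."""

    def __init__(self):
        self.incoming = deque()      # messages waiting for the main loop
        self.callbacks = {}          # serial -> callback awaiting a reply
        self.next_serial = 1

    def call_async(self, method, callback=None):
        """Send a call and return at once; the reply is handled later."""
        serial = self.next_serial
        self.next_serial += 1
        if callback is not None:     # callback=None means "ignore the reply"
            self.callbacks[serial] = callback
        return serial

    def dispatch_one(self, log):
        """One main-loop iteration: deliver the oldest message, in order."""
        msg = self.incoming.popleft()
        if msg["kind"] == "reply":
            callback = self.callbacks.pop(msg["reply_to"], None)
            if callback is not None:
                callback(msg["body"])
        else:
            log.append("signal:" + msg["name"])

log = []
conn = ToyConnection()
s1 = conn.call_async("GetStatus", lambda body: log.append("reply:" + body))
s2 = conn.call_async("Ping")         # reply will be ignored

# Simulated traffic: a signal arrives before either reply.
conn.incoming.extend([
    {"kind": "signal", "name": "Closed"},
    {"kind": "reply", "reply_to": s2, "body": "pong"},
    {"kind": "reply", "reply_to": s1, "body": "ok"},
])
while conn.incoming:
    conn.dispatch_one(log)

print(log)
```

Here the signal is handled before the reply, matching the order in which the messages arrived, and the reply to the second call is dropped silently because no callback was registered for it.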

We make a couple of narrowly targeted exceptions to the "no pseudo-blocking" policy in internal code: telepathy-glib internals make a small number of pseudo-blocking method calls to the dbus-daemon, which is the one component we can definitely trust to return results promptly.

Here are some examples of the same strategies in other libraries:

- Pseudo-blocking: dbus-python method calls with no special keyword arguments; dbus-glib dbus_g_proxy_call; QDBus calls with mode QDBus::Block
- Fully asynchronous: dbus-python method calls with reply_handler and error_handler keyword arguments; dbus-glib dbus_g_proxy_begin_call; QDBusConnection::callWithCallback
- Re-entrant: QDBus calls with mode QDBus::BlockWithGui
- Ignoring reply: dbus-python method calls with ignore_reply keyword argument; QDBus calls with mode QDBus::NoBlock