Chapter 20: Multi Threading

The C++98 standard did not acknowledge the existence of multi-threading. Between then and the release of the current C++ standard computers have evolved into multi-core machines, and using multi-threading is by now a real option to consider when developing software.

Multi-threading is an extensive and complex subject, and many good reference texts on the subject exist. C++'s multi-threading support is built upon the facilities offered by the pthreads library (cf. Nichols, B., et al.'s Pthreads Programming, O'Reilly). However, in line with C++'s current-day philosophy the multi-threading implementation offered by the language provides a high-level interface to multi-threading, and using the raw pthread building blocks is hardly ever necessary (cf. Williams, A. (2012): C++ Concurrency in Action).

This chapter covers the facilities for multi-threading as supported by C++. Although the coverage aims at providing the tools and examples allowing you to create your own multi-threaded programs, it necessarily is far from complete: the topic of multi-threading is simply too extensive for that. The above-mentioned reference texts provide a good starting point for any further study of multi-threading.

A thread of execution (commonly abbreviated to a thread) is a single flow of control within a program. It differs from a separately executed program, as created by the fork(1) system call, in the sense that threads all run inside one program, while fork(1) creates independent copies of a running program. Multi-threading means that multiple tasks are being executed in parallel inside one program, and no assumptions can be made as to which thread runs first or last, or at what moment in time. As long as the number of threads does not exceed the number of cores, each thread may be active at the same time. If the number of threads exceeds the number of cores, the operating system resorts to task switching, offering each thread time slices in which it can perform its tasks. Task switching takes time, and the law of diminishing returns applies here as well: if the number of threads greatly exceeds the number of available cores (also called overpopulation), then the overhead incurred may exceed the benefit of being able to run multiple tasks in parallel.

Since all threads are running inside one single program, all threads share the program's data and code. When the same data are accessed by multiple threads, and at least one of the threads is modifying these data, access must be synchronized to avoid that threads read data while these data are being modified by other threads, and to avoid that multiple threads modify the same data at the same time.

So how do we run a multi-threaded program in C++? Let's look at hello world, the multi-threaded way:

    #include <iostream>
    #include <thread>

    void hello()
    {
        std::cout << "hello world!\n";
    }

    int main()
    {
        std::thread hi(hello);
        hi.join();
    }

When compiling multi-threaded programs using the GNU g++ compiler the -pthread option must be specified.

To create a multi-threaded program defined in a source file the g++ compiler can be called like this:

    g++ -std=c++14 -pthread -Wall source.cc

20.1: Multi Threading

In C++ multi-threading may be implemented at various levels of abstraction. In general the highest level of abstraction which is available to implement a multi-threaded problem should be used. Not so much because it's often simpler than using lower levels of abstraction, but because higher levels of abstraction are usually semantically closer to the original problem description, resulting in code which is easier to understand and therefore easier to maintain. In addition, high-abstraction classes provide exception safety and prevent the occurrence of memory leaks.

C++'s main tool for creating multi-threaded programs is the class std::thread, and an example of its use was already shown at the beginning of this chapter.

Characteristics of individual threads can be queried from the std::this_thread namespace. Also, std::this_thread offers some control over the behavior of an individual thread.

To synchronize access to shared data C++ offers mutexes (implemented by the class std::mutex) and condition variables (implemented by the class std::condition_variable).

Members of these classes may throw system_error objects (cf. section 10.9) when encountering a low-level error condition.

20.1.1: The namespace std::this_thread

The namespace std::this_thread contains functions that are uniquely associated with the currently running thread.

Before using the namespace this_thread the <thread> header file must be included.

Inside the std::this_thread namespace several free functions are defined, providing information about the current thread or that can be used to control its behavior:

20.1.2: The class std::thread

Multi threading in C++ starts off with objects of the class std::thread. Each object of this class handles a separate thread.

Before using std::thread objects the <thread> header file must be included.

Thread objects can be constructed in various ways:

The class std::thread does not provide a copy constructor.

The following members are available:

Things to note:

A thread ends when the function executing the thread finishes. When a thread object is destroyed while its thread function is still running, terminate is called, aborting the program. Bad news: the destructors of existing objects aren't called and thrown exceptions are left uncaught. This happens in the following program, as the thread is still active when main ends:

    #include <iostream>
    #include <thread>

    void hello()
    {
        while (true)
            std::cout << "hello world!\n";
    }

    int main()
    {
        std::thread hi(hello);
    }   // hi's destructor calls terminate: hi was never joined
There are several ways to solve this problem. One of them is discussed in the next section.

Static data and threads: thread_local

With multi-threaded programs the well-known distinction between global and local data is somewhat too coarse. For single- and multi-threaded programs alike, global data are available to all of the program's code, and local data are available to the function (or compound statement) in which the local data are defined. But multi-threaded programs may feel the need for an intermediate type of data, uniquely available to the different threads.

The thread_local keyword provides this intermediate data level. Global variables declared as thread_local are global within each individual thread. Each thread owns a copy of the thread_local variables, and may modify them at will. A thread_local variable in one thread is completely separated from that variable in another thread. Here is an example:

    #include <iostream>
    #include <thread>

    using namespace std;

    thread_local int t_value = 100;

    void modify(char const *label, int newValue)
    {
        cout << label << " before: " << t_value << ". Address: " <<
                                                        &t_value << endl;
        t_value = newValue;
        cout << label << " after: " << t_value << endl;
    }

    int main()
    {
        thread(modify, "first", 50).join();
        thread(modify, "second", 20).join();
        modify("main", 0);
    }
Running this program shows that each separate thread starts with t_value being 100, and then modifies it without affecting the values of t_value used by other threads.

Note that, although the t_value variables are unique to each thread, identical addresses may be shown for them. Since each thread uses its own stack, these variables may occupy the same relative locations within their respective stacks, giving the illusion that their physical addresses are identical.

Exceptions and join()

Once a thread starts and it isn't detached it must eventually join its starting (parent) thread, or the program aborts. Usually, once a thread has started the parent thread continues to do some work by itself:
    void childActions();
    void doSomeWork();

    void parent()
    {
        thread child(childActions);
        doSomeWork();
        child.join();
    }
However, maybe doSomeWork can't complete its work, and throws an exception, to be caught outside of parent. This, unfortunately, ends parent, and child.join() is missed. Consequently, the program aborts because of a thread that hasn't been joined.

Clearly, all exceptions must be caught, join must be called, and the exception must be rethrown. But parent cannot use a function try-block, as the thread object is already out of scope once execution reaches the matching catch-clause. So we get:

    void childActions();
    void doSomeWork();

    void parent()
    {
        thread child(childActions);
        try
        {
            doSomeWork();
        }
        catch (...)
        {
            child.join();
            throw;
        }
        child.join();
    }
This is ugly: suddenly the function's code is clobbered with a try-catch clause, as well as some unwelcome code-duplication.

This situation can be avoided using object based programming. Like, e.g., unique pointers, which use their destructors to encapsulate the destruction of dynamically allocated memory, we can use a comparable technique to encapsulate thread joining in an object's destructor.

By defining the thread object inside a class we're sure that by the time our object goes out of scope, even if the childActions function throws an exception, the thread's join member is called. Here are the bare essentials of our JoinGuard class, providing the join-guarantee (using in-line member implementations for brevity):

    #include <thread>

    class JoinGuard
    {
        std::thread d_thread;

        public:
            JoinGuard(std::thread &&threadObj)
            :
                d_thread(std::move(threadObj))
            {}
            ~JoinGuard()
            {
                if (d_thread.joinable())
                    d_thread.join();
            }
    };
Here is an example of how JoinGuard could be used. Note that a named JoinGuard object is defined: a mere temporary would be destroyed immediately, joining the thread before doSomeWork is called:

    #include <iostream>
    #include <stdexcept>
    #include "joinguard.h"

    void childActions();

    void doSomeWork()
    {
        throw std::runtime_error("doSomeWork throws");
    }

    void parent()
    {
        JoinGuard guard{std::thread{childActions}};
        doSomeWork();
    }

    int main()
    try
    {
        parent();
    }
    catch (std::exception const &exc)
    {
        std::cout << exc.what() << '\n';
    }

20.2: Synchronization (mutexes)

Objects of mutex classes are used to protect shared data.

Before using mutexes the <mutex> header file must be included.

One of the key characteristics of multi-threaded programs is that threads may share data. Functions running as separate threads have access to all global data, and may also share the local data of their parent threads. However, unless proper measures are taken, this may easily result in data corruption, as illustrated by the following simulation of some steps that could be encountered in a multi-threaded program:

Time step:    Thread 1:     var        Thread 2:       description
    0                        5
    1           starts                                  T1 active
    2           writes var                              T1 commences writing
    3           stopped                                 Context switch
    4                                   starts          T2 active
    5                                   writes var      T2 commences writing
    6                       10          assigns 10      T2 writes 10
    7                                   stopped         Context switch
    8           assigns 12                              T1 writes 12
    9                       12

In this example, threads 1 and 2 share the variable var, initially having the value 5. At step 1 thread 1 becomes active, and commences writing a value into var. However, it is interrupted by a context switch, and thread 2 is started (step 4). Thread 2 also wants to write a value into var, and succeeds until time step 7, when another context switch takes place. By now var is 10. However, thread 1 was also in the process of writing a value into var, and it is given the chance to complete its work: it assigns 12 to var in time step 8. Once time step 9 is reached, thread 2 proceeds on the (erroneous) assumption that var must be equal to 10. Clearly, from the point of view of thread 2 its data have been corrupted.

In this case data corruption was caused by multiple threads accessing the same data in an uncontrolled way. To prevent this from happening, access to shared data should be protected in such a way that only one thread at a time may access the shared data.

Mutexes are used to prevent the abovementioned kinds of problems by offering a guarantee that data are only accessed by the thread that could lock the mutex that is used to synchronize access to those data.

Exclusive data access completely depends on cooperation between the threads. If thread 1 uses mutexes, but thread 2 doesn't, then thread 2 may freely access the common data. Of course that's bad practice, which should be avoided.

It is stressed that although using mutexes is the programmer's responsibility, their implementation isn't: mutexes offer the necessary atomic calls. When requesting a mutex-lock the thread is blocked (i.e., the mutex statement does not return) until the lock has been obtained by the requesting thread.

Apart from the class std::mutex the class std::recursive_mutex is available. When a recursive_mutex is locked multiple times by the same thread it increases its lock count. Before other threads may access the protected data the recursive mutex must be unlocked that same number of times. Moreover, the classes std::timed_mutex and std::recursive_timed_mutex are available. In addition to plain lock attempts they offer lock attempts that give up after a specified duration, or once a specified point in time has been reached.

The members of the mutex classes perform their actions atomically: when two threads are trying to lock a mutex only one can succeed. In the above example: if both threads would use a mutex to control access to var, thread 1 would not have been able to assign 12 to var while thread 2 was still assuming that its value was 10. We could even have two threads running purely parallel (e.g., on two separate cores). E.g.:

Time step:    Thread 1:        Thread 2:        description
    1         starts           starts           T1 and T2 active
    2         locks            locks            Both threads try to 
                                                lock the mutex
    3         blocks...        obtains lock     T2 obtains the lock,
                                                and T1 must wait
    4         (blocked)        processes var    T2 processes var,
                                                T1 still blocked
    5         obtains lock     releases lock    T2 releases the lock,
                                                and T1 immediately 
                                                obtains the lock
    6         processes var                     now T1 processes var
    7         releases lock                     T1 also releases the lock

Although mutexes can directly be used in programs, this rarely happens. It is more common to embed mutex handling in locking classes that make sure that the mutex is automatically unlocked again when the mutex lock is no longer needed. Therefore, this section merely offers an overview of the interfaces of the mutex classes. Examples of their use will be given in the upcoming sections (e.g., section 20.3).

All mutex classes offer the following constructors and members:

The timed-mutex classes (timed_mutex, recursive_timed_mutex) also offer these members:

20.2.1: Initialization in multi-threaded programs

Before using the std::once_flag and the std::call_once function, introduced in this section, the <mutex> header file must be included.

In single threaded programs the initialization of global data does not necessarily occur at one point in code. An example is the initialization of the object of a singleton class (cf. Gamma et al. (1995), Design Patterns, Addison-Wesley). Singleton classes may define a single static pointer data member Singleton *s_object, pointing to the singleton's object, and may offer a static member instance, implemented something like this:

    Singleton &Singleton::instance()
    {
        return *(s_object ? s_object : (s_object = new Singleton));
    }

With multi-threaded programs this approach immediately gets complex. For example, if two threads call instance at the same time while s_object still equals 0, then both may call new Singleton, and one of the two dynamically allocated Singleton objects becomes unreachable. Threads calling instance after s_object was initialized for the first time may return a reference to either of those objects. Not exactly the expected behavior of a singleton.

Mutexes (cf. section 20.2) can be used to solve these kinds of problems, but they result in some overhead and inefficiency, as the mutex must be inspected at each call of Singleton::instance.

When variables must dynamically be initialized, and the initialization should take place only once, the std::once_flag type and the std::call_once function should be used.

The call_once function expects two or three arguments:

A thread-safe implementation of the singleton's instance function can now easily be designed (using in-class implementations for brevity):
    class Singleton
    {
        static std::once_flag s_once;
        static Singleton *s_singleton;

        public:
            static Singleton *instance()
            {
                std::call_once(s_once, [] { s_singleton = new Singleton; });
                return s_singleton;
            }
    };

However, there are additional ways to initialize data, even for multi-threaded programs:

20.2.2: Shared mutexes

Shared mutexes (via the type std::shared_mutex) are available after including the <shared_mutex> header file. Shared mutex types behave like timed_mutex types and optionally have the characteristics described below.

The class shared_mutex provides a non-recursive mutex with shared ownership semantics, comparable to, e.g., the shared_ptr type. A program using shared_mutexes is undefined if:

Shared mutex types provide a shared lock ownership mode. Multiple threads can simultaneously hold a shared lock ownership of a shared_mutex type of object. But no thread can hold a shared lock while another thread holds an exclusive lock on the same shared_mutex object, and vice-versa.

The type shared_mutex offers the following members:

20.3: Locks and lock handling

Locks are used to simplify the use of mutexes. Before locks can be used the <mutex> header file must be included.

Whenever threads share data, and at least one of the threads may change common data, mutexes should be used to prevent threads from using the same data synchronously.

Usually locks are released at the end of action blocks. Releasing a lock requires an explicit call to the mutex's unlock function, which introduces problems comparable to those we've seen with thread's join member.

To simplify locking and unlocking two mutex wrapper classes are available:

The class lock_guard offers a limited, but useful interface:

Here is a simple example of a multi-threaded program using lock_guards to prevent information inserted into cout from getting mixed.

    #include <fstream>
    #include <iostream>
    #include <mutex>
    #include <thread>
    using namespace std;

    bool oneLine(istream &in, mutex &mut, int nr)
    {
        lock_guard<mutex> lg(mut);
        string line;
        if (not getline(in, line))
            return false;
        cout << nr << ": " << line << endl;
        return true;
    }
    void io(istream &in, mutex &mut, int nr)
    {
        while (oneLine(in, mut, nr))
            ;
    }
    int main(int argc, char **argv)
    {
        ifstream in(argv[1]);
        mutex ioMutex;
        thread t1(io, ref(in), ref(ioMutex), 1);
        thread t2(io, ref(in), ref(ioMutex), 2);
        thread t3(io, ref(in), ref(ioMutex), 3);
        t1.join();
        t2.join();
        t3.join();
    }

As with lock_guard, a mutex-type must be specified when defining objects of the class std::unique_lock. The class unique_lock is much more elaborate than the basic lock_guard class template. Its interface does not define a copy constructor or overloaded assignment operator, but it does define a move constructor and a move assignment operator. In the following overview of unique_lock's interface Mutex refers to the mutex-type that is specified when defining a unique_lock:

In addition to the members of the classes std::lock_guard and std::unique_lock the functions std::lock and std::try_lock are available. These functions can be used to prevent deadlocks, the topic of the next section.

20.3.1: Deadlocks

A deadlock occurs when two locks are required to process data, but one thread obtains the first lock and another thread obtains the second lock, whereupon each thread waits for the lock held by the other. C++ defines the generic std::lock and std::try_lock functions that can be used to help prevent such situations.

Before these functions can be used the <mutex> header file must be included.

In the following overview L1 &l1, ... represents one or more references to objects of lockable types:

As an example consider the following little multi-threaded program: The threads use mutexes to obtain unique access to cout and to an int value. However, fun1 first locks cout (line 8), and then value (line 11); fun2 first locks value (line 17) and then cout (line 20). Clearly, if fun1 has locked cout, fun2 can't obtain that lock until fun1 has released it. Unfortunately, fun2 has locked value, and the functions only release their locks when returning. But in order to access the information in value, fun1 must obtain a lock on value, which it can't, as fun2 has already locked value: the threads are waiting for each other, and neither thread gives in.

     2: int value;
     3: mutex valueMutex;
     4: mutex coutMutex;
     6: void fun1()
     7: {
     8:     lock_guard<mutex> lg1(coutMutex);
     9:     cout << "fun 1 locks cout\n";
    11:     lock_guard<mutex> lg2(valueMutex);
    12:     cout << "fun 1 locks value\n";
    13: }
    15: void fun2()
    16: {
    17:     lock_guard<mutex> lg1(valueMutex);
    18:     cerr << "fun 2 locks value\n";
    20:     lock_guard<mutex> lg2(coutMutex);
    21:     cout << "fun 2 locks cout\n";
    22: }
    24: int main()
    25: {
    26:     thread t1(fun1);
    27:     fun2();
    28:     t1.join();
    29: }

A good recipe for avoiding deadlocks is to prevent nested (or multiple) mutex lock calls. But if multiple mutexes must be used, always obtain the locks in the same order. Rather than doing this yourself, std::lock and std::try_lock should be used whenever possible to obtain multiple mutex locks. These functions accept multiple arguments, which must be lockable types like lock_guard, unique_lock, or even a plain mutex. The previous deadlocking program can be modified to call std::lock to lock both mutexes. In this example using one single mutex would also work, but the modified program now looks as similar as possible to the previous one. Note how in lines 11 and 22 a different ordering of the unique_lock arguments was used: it is not necessary to use an identical argument order when calling std::lock or std::try_lock.

     2: int value;
     3: mutex valueMutex;
     4: mutex coutMutex;
     6: void fun1()
     7: {
     8:     unique_lock<mutex> lg1(coutMutex, defer_lock);
     9:     unique_lock<mutex> lg2(valueMutex, defer_lock);
    11:     lock(lg1, lg2);
    13:     cout << "fun 1 locks cout\n";
    14:     cout << "fun 1 locks value\n";
    15: }
    17: void fun2()
    18: {
    19:     unique_lock<mutex> lg1(coutMutex, defer_lock);
    20:     unique_lock<mutex> lg2(valueMutex, defer_lock);
    22:     lock(lg2, lg1);
    24:     cout << "fun 2 locks cout\n";
    25:     cout << "fun 2 locks value\n";
    26: }
    28: int main()
    29: {
    30:     thread t1(fun1);
    31:     thread t2(fun2);
    32:     t1.join();
    33:     t2.join();
    34: }

20.3.2: Shared locks

Shared locks are available through the type std::shared_lock, after including the <shared_mutex> header file.

An object of the type std::shared_lock controls the shared ownership of a lockable object within a scope. Shared ownership of the lockable object may be acquired at construction time or thereafter, and once acquired, it may be transferred to another shared_lock object. Objects of type shared_lock cannot be copied, but move construction and assignment is supported.

The behavior of a program is undefined if the contained pointer to a mutex (pm) has a non-zero value and the lockable object pointed to by pm does not exist for the entire remaining lifetime of the shared_lock object. The supplied mutex type must be a shared_mutex or a type having the same characteristics.

The type shared_lock offers the following constructors, destructor and operators:

The following members are provided:

20.4: Event handling (condition variables)

This section introduces condition variables. Condition variables allow programs to synchronize threads using the states of data, rather than simply locking the access to data (which is realized using mutexes).

Before condition variables can be used the <condition_variable> header file must be included.

To start our discussion, consider a classic producer-consumer scenario: the producer generates items which are consumed by a consumer. The producer can only produce a certain number of items before its storage capacity has filled up, and the consumer cannot consume more items than the producer has produced.

At some point the producer's storage capacity has filled to the brim, and the producer has to wait until the client has at least consumed some items, thereby creating space in the producer's storage. Similarly, the consumer cannot start consuming until the producer has at least produced some items.

Implementing this scenario only using mutexes (data locking) is not an attractive option, as merely using mutexes forces a program to implement the scenario using polling: processes must continuously (re)acquire the mutex's lock, determine whether they can perform some action, followed by the release of the lock. Often there's no action to perform, and the process is busy acquiring and releasing the mutex's lock. Polling forces threads to wait until they can lock the mutex, even though continuation might already be possible. The polling interval could be reduced, but that too isn't an attractive option, as that increases the overhead associated with handling the mutexes (also called `busy waiting').

Condition variables can be used to prevent polling. Threads can use condition variables to notify waiting threads that there is something for them to do. This way threads can synchronize on data values (states).

As data values may be modified by multiple threads, threads still need to use mutexes, but only for controlling access to the data. In addition, condition variables allow threads to release ownership of mutexes until a certain value has been obtained, until a preset amount of time has been passed, or until a preset point in time has been reached.

The prototypical setup of threads using condition variables looks like this:

No matter which thread starts, the thread holding the mutex's lock will at some point release the lock, allowing the other process to (re)acquire it. If the consumer starts it immediately releases the lock once it enters its waiting state; if the producer starts it releases the lock once the condition is true.

This protocol hides a subtle initial synchronization requirement. The consumer will miss the producer's notification if it (i.e., the consumer) hasn't yet entered its waiting state. So waiting (consumer) threads should start before notifying (producer) threads. Once threads have started, no assumptions can be made anymore about the order in which any of the condition variable's members (notify_one, notify_all, wait, wait_for, and wait_until) are called.

Condition variables come in two flavors: objects of the class std::condition_variable are used in combination with objects of type unique_lock<mutex>. Because of optimizations which are available for this specific combination using condition_variables is somewhat more efficient than using the more generally applicable class std::condition_variable_any, which may be used with any (e.g., user supplied) lock type.

Condition variable classes (covered in detail in the next two sections) offer members like wait, wait_for, wait_until, notify_one and notify_all that may concurrently be called. The notifying members are always atomically executed. Execution of the wait members consists of three atomic parts:

So, when returning from wait-members, the previously waiting thread has reacquired the mutex's lock.

In addition to the condition variable classes the following free function and enum type are provided:

20.4.1: The class std::condition_variable

The class std::condition_variable merely offers a default constructor. No copy constructor or overloaded assignment operator is provided.

Before using the class condition_variable the <condition_variable> header file must be included.

The class's destructor requires that no thread is blocked by the thread destroying the condition_variable. So all threads waiting on a condition_variable must be notified before a condition_variable object's lifetime ends. Calling notify_all (see below) before a condition_variable's lifetime ends takes care of that, as the condition_variable's thread releases its lock of the mutex variable, allowing one of the notified threads to lock the mutex.

In the following member-descriptions the type Predicate indicates that a provided Predicate argument can be called as a function without arguments, returning a bool. Also, other member functions are frequently referred to. It is tacitly assumed that all members referred to below are called using the same condition variable object.

The class condition_variable supports several wait members, which block the thread until it is notified by another thread (or until a configurable waiting time has passed). However, wait members may also unblock spuriously, without any thread having called a notify member. Therefore, threads returning from wait members should verify that the required condition is actually true. If not, calling wait again may be appropriate. The next piece of pseudo code illustrates this scheme:

    while (conditionNotTrue())
        condVar.wait(lock);

The class condition_variable's members are:

Threads should verify that the required condition is true when wait-members of condition variables return.

20.4.2: The class std::condition_variable_any

Different from the class condition_variable the class std::condition_variable_any can be used with any (e.g., user supplied) lock type, and not just with the stl-provided unique_lock<mutex>.

Before using the class condition_variable_any the <condition_variable> header file must be included.

The functionality that is offered by condition_variable_any is identical to the functionality offered by the class condition_variable, albeit that the lock-type that is used by condition_variable_any is not predefined. The class condition_variable_any therefore requires the specification of the lock-type that must be used by its objects.

In the interface shown below this lock-type is referred to as Lock. Most of condition_variable_any's members are defined as member templates, defining a Lock type as one of their parameters. The requirements of these lock-types are identical to those of the stl-provided unique_lock, and user-defined lock-type implementations should provide at least the interface and semantics that are also provided by unique_lock.

This section merely presents the interface of the class condition_variable_any. As its interface offers the same members as condition_variable (allowing, where applicable, passing any lock-type instead of just unique_lock to corresponding members), the reader is referred to the previous section for a description of the semantics of the class members.

Like condition_variable, the class condition_variable_any only offers a default constructor. No copy constructor or overloaded assignment operator is provided.

Also, like condition_variable, the class's destructor requires that no thread is blocked by the current thread. This implies that all other (waiting) threads must have been notified; those threads may, however, subsequently block on the lock specified in their wait calls.

Note that, in addition to Lock, the types Clock, Duration, Period, Predicate, and Rep are template types, defined just like the identically named types mentioned in the previous section.

Assuming that MyMutex is a user defined mutex type, and that MyLock is a user defined lock-type (cf. section 20.3 for details about lock-types), then a condition_variable_any object can be defined and used like this:

    MyMutex mut;
    MyLock<MyMutex> ul(mut);
    condition_variable_any cva;

    cva.wait(ul);               // wait, using the user-provided lock type


These are the class condition_variable_any's members:

20.4.3: An example using condition variables

Condition variables are used to synchronize threads on the values of data, rather than on the mere access to data (for which plain mutex-objects can be used). Using condition variables, a thread simply sleeps until it is notified by another thread. In a producer-consumer type of program this is usually accomplished like this:
    consumer loop:
        - wait until there's an item in store,
            then reduce the number of stored items
        - remove the item from the store
        - increment the number of available storage locations
        - do something with the retrieved item

    producer loop:
        - produce the next item
        - wait until there's room to store the item,
            then reduce the number of available storage locations
        - store the item
        - increment the number of stored items

It is important that the two storage administrative tasks (registering the number of available items and available storage locations) are either performed by the consumer or by the producer. For the consumer `waiting' means:

This scheme is implemented in a class Semaphore, offering members wait and notify_all. For a more extensive discussion of semaphores see Tanenbaum, A.S. and Austin, T. (2013) Structured Computer Organization, Pearson Prentice-Hall.

The data member containing the actual count is called d_available. It is protected by mutex d_mutex. In addition a condition_variable d_condition is defined:

    mutable std::mutex d_mutex;
    std::condition_variable d_condition;
    size_t d_available;

The waiting process is implemented through its member function wait:

     1: void Semaphore::wait()
     2: {
     3:     std::unique_lock<std::mutex> lk(d_mutex);   // get the lock
     4:     while (d_available == 0)
     5:         d_condition.wait(lk);   // internally releases the lock
     6:                                 // and waits, on exit
     7:                                 // acquires the lock again
     8:     --d_available;              // dec. available
     9: }   // the lock is released
In line 5 d_condition.wait releases the lock. It waits until receiving a notification, and re-acquires the lock just before returning. Consequently, wait's code always has complete and unique control over d_available.

What about notifying a waiting thread? This is handled in lines 4 and 5 of the member function notify_all:

     1: void Semaphore::notify_all()
     2: {
     3:     std::lock_guard<std::mutex> lk(d_mutex);    // get the lock
     4:     if (d_available++ == 0)
     5:         d_condition.notify_all();   // or use notify_one to wake up
     6:                                     // just one other thread
     7: }   // the lock is released
At line 4 d_available is always incremented; by using a postfix increment it can simultaneously be tested for being zero. If it was initially zero then d_available is now one, and a thread waiting until d_available exceeds zero may now continue. Waiting threads are notified by calling d_condition.notify_all; in situations where only one waiting thread should continue, notify_one can be used instead.

Using the facilities of the class Semaphore, whose constructor expects the initial value of its semaphore data member, the classic consumer-producer paradigm can now be implemented using multi-threading (a more elaborate example of the producer-consumer program is found in the yo/threading/examples/ directory in the C++ Annotations's source archive):

    Semaphore available(10);
    Semaphore filled(0);
    std::queue<size_t> itemQueue;

    void consumer()
    {
        while (true)
        {
            filled.wait();
                // with multiple consumers: mutex-lock the queue accesses
            size_t item = itemQueue.front();
            itemQueue.pop();
            available.notify_all();
            process(item);      // not implemented here
        }
    }

    void producer()
    {
        size_t item = 0;
        while (true)
        {
            ++item;             // produce the next item
            available.wait();
                // with multiple producers: mutex-lock the queue accesses
            itemQueue.push(item);
            filled.notify_all();
        }
    }

    int main()
    {
        thread consume(consumer);
        thread produce(producer);
        consume.join();
        produce.join();
    }

20.5: Atomic actions: mutexes not required

Before using the facilities introduced in this section the <atomic> header file must be included.

When data are shared among multiple threads, data corruption is usually prevented using mutexes. To increment a simple int using this strategy code as shown below is commonly used:

    {
        lock_guard<mutex> lk{ intVarMutex };
        ++intVar;           // now intVar can safely be incremented
    }

The compound statement is used to limit the lock_guard's lifetime, so that intVar is only locked for a short while.

This scheme is not complex, but having to define a lock_guard, plus a matching mutex, for every single use of a simple variable quickly becomes annoying and cumbersome.

C++ offers a way out through the use of atomic data types. Atomic data types are available for all basic types, and also for (trivial) user defined types. Trivial types are (see also section 23.6.2) all scalar types, arrays of elements of a trivial type, and classes whose constructors, copy constructors, and destructors all have default implementations, and their non-static data members are themselves of trivial types.

The class template std::atomic<Type> is available for all built-in types, including pointer types. E.g., std::atomic<bool> defines an atomic bool type. For many types alternative somewhat shorter type names are available. E.g, instead of std::atomic<unsigned short> the type std::atomic_ushort can be used. Refer to the atomic header file for a complete list of alternate names.

If Trivial is a user-defined trivial type then std::atomic<Trivial> defines an atomic variant of Trivial: such a type does not require a separate mutex to synchronize access by multiple threads.

Objects of the class template std::atomic<Type> cannot directly be copied or assigned to each other. However, they can be initialized by values of type Type, and values of type Type can also directly be assigned to std::atomic<Type> objects. Moreover, since atomic<Type> types offer conversion operators returning their Type values, an atomic<Type> object can also be assigned to or initialized by another atomic<Type> object using a static_cast:

    atomic<int> a1 = 5;
    atomic<int> a2{ static_cast<int>(a1) };

The class std::atomic<Type> provides several public members, shown below. Non-member (free) functions operating on atomic<Type> objects are also available.

The std::memory_order enumeration defines the following symbolic constants, which are used to specify ordering constraints of atomic operations:

Most atomic member functions may optionally be given a final memory_order argument; where this is not available it is explicitly mentioned at the function's description. The memory order cannot be specified for the overloaded operators provided by atomic<Type>.

Here are the standard available std::atomic<Type> member functions:

In addition to the above members, integral atomic types `Integral' (essentially the atomic variants of all built-in integral types) also offer the following member functions:

Some of the free functions have names ending in _explicit. The _explicit functions define an additional parameter `memory_order order', which is not available for the non-_explicit functions (e.g., atomic_load(atomic<Type> *ptr) and atomic_load_explicit(atomic<Type> *ptr, memory_order order)).

Here are the free functions that are available for all atomic types:

In addition to the abovementioned free functions, atomic<Integral> types also offer the following free functions:

20.6: An example: threaded quicksort

The quicksort sorting algorithm (Hoare, 1962) is a well-known sorting algorithm. Given an array of n elements, it works like this:

Converting this algorithm to a multi-threaded algorithm appears to be a simple task:

    void quicksort(Iterator begin, Iterator end)
    {
        if (end - begin < 2)                // less than 2 elements are left
            return;                         // and we're done

        Iterator pivot = partition(begin, end); // determine an iterator
                                                // pointing to the pivot element

        thread lhs(quicksort, begin, pivot);    // start threads on the left-hand
                                                // side sub-array
        thread rhs(quicksort, pivot + 1, end);  // and on the right-hand side
                                                // sub-array
        lhs.join();                         // wait for both threads to finish
        rhs.join();                         // and we're done
    }

Unfortunately, this translation to a multi-threaded approach won't work for reasonably large arrays because of a phenomenon called overpopulation: more threads are started than the operating system is prepared to give us. In those cases a Resource temporarily unavailable exception is thrown, and the program ends.

Overpopulation can be avoided by using a pool of workers, where each `worker' is a thread, which in this case is responsible for handling one (sub) array, but not for the nested calls. The pool of workers is controlled by a scheduler, receiving the requests to sort sub-arrays, and passing these requests on to the next available worker.

The main data structure of the example program developed in this section is a queue of std::pairs containing iterators of the array to be sorted (cf. Figure 25, the sources of the program are found in the C++ Annotations's yo/threading/examples/multisort directory). Two queues are being used: one queue is a task-queue, receiving the iterators of sub-arrays to be partitioned. Instead of immediately launching new threads (the lhs and rhs threads in the above example), the ranges to be sorted are pushed on the task-queue. The other queue is the work-queue: elements are moved from the task-queue to the work-queue, where they will be processed by one of the worker threads.

Figure 25: Data structure used for multi-threading quicksort

The program's main function starts the workforce, reads the data, pushes the arrays begin and end iterators on the task queue and then starts the scheduler. Once the scheduler ends the sorted array is displayed:

    int main()
    {
        workForce();            // start the worker threads
        readData();             // read the data into vector<int> g_data
        g_taskQ.push(           // prepare the main task
                    Pair(g_data.begin(), g_data.end()));
        scheduler();            // sort g_data
        display();              // show the sorted elements
    }

The workforce consists of a bunch of detached threads. Each thread represents a worker, implemented in the function void worker. Since the number of worker threads is fixed, overpopulation doesn't occur. Once the array has been sorted and the program stops these detached threads simply end:

    for (size_t idx = 0; idx != g_sizeofWorkforce; ++idx)
        thread(worker).detach();

The scheduler continues for as long as there are sub-arrays to sort. When this is the case the task queue's front element is moved to the work queue. This reduces the work queue's size, and prepares an assignment for the next available worker. The scheduler now waits until a worker is available. Once workers are available one of them is informed of the waiting assignment, and the scheduler waits for the next task:

    void scheduler()
    {
        while (newTask())
        {
            g_workforce.wait();         // wait for a worker to be available
            g_worker.notify_all();      // activate a worker
        }
    }

The function newTask simply checks whether the task queue is empty. If so, and none of the workers is currently busy sorting a sub-array then the array has been sorted, and newTask can return false. When the task queue is empty but a worker is still busy, it may be that new sub-array dimensions are going to be placed on the task queue by an active worker. Whenever a worker is active the Semaphore g_workforce's size is less than the size of the work force:

    bool wip()
    {
        return g_workforce.size() != g_sizeofWorkforce;
    }

    bool newTask()
    {
        bool done;

        unique_lock<mutex> lk(g_taskMutex);
        while ((done = g_taskQ.empty()) && wip())
            g_taskCondition.wait(lk);   // wait until notified by a worker

        return not done;
    }

Each detached worker thread performs a continuous loop. In the loop it waits for a notification by the scheduler. Once it receives a notification it retrieves its assignment from the work queue, and partitions the sub-array specified in its assignment. Partitioning may result in new tasks. Once this has been completed the worker has completed its assignment: it increments the available workforce and notifies the scheduler that it should check whether all tasks have been performed:

    void worker()
    {
        while (true)
        {
            g_worker.wait();            // wait for action
            lock_guard<mutex> lk(g_taskMutex);
                // retrieve the assignment from the work queue, partition
                // the sub-array, and notify the scheduler (not shown here)
        }
    }

Sub-arrays smaller than two elements need no partitioning. All larger sub-arrays are partitioned relative to their first element. The std::partition generic algorithm does this well, but if the pivot is itself an element of the array to partition then the pivot's eventual location is undetermined: it may be found anywhere in the series of elements which are at least equal to the pivot. The two required sub-arrays, however, can easily be constructed:

The two iterator pairs defining these two sub-arrays are thereupon added to the task queue, creating two new tasks to be dealt with by the scheduler:
    void partition(Pair const &range)
    {
        if (range.second - range.first < 2)
            return;                     // sub-arrays of fewer than 2
                                        // elements need no partitioning
        auto rhsBegin = std::partition(range.first + 1, range.second,
                            [pivot = *range.first](int value)
                            {
                                return value < pivot;
                            });
        auto lhsEnd = rhsBegin - 1;

        swap(*range.first, *lhsEnd);

        pushTask(range.first, lhsEnd);
        pushTask(rhsBegin, range.second);
    }

20.7: Shared States

Just before a thread ends it may have produced some results, which may have to be communicated to other threads. In multi-threaded programs several classes and functions are available that produce shared states, making it easy to communicate results to other threads. Results could be values, objects, or exceptions.

Objects that contain such shared states are called asynchronous return objects. However, due to the nature of multi-threading, a thread may request the results of an asynchronous return object before these results are actually available. In those cases the requesting thread blocks, waiting for the results to become available. Asynchronous return objects offer wait and get members which, respectively, wait until the results have become available, and produce the asynchronous results once they are available. The phrase that is used to indicate that the results are available is `the shared state has been made ready'.

Shared states are made ready by asynchronous providers. Asynchronous providers are simply objects or functions providing results to shared states. Making a shared state ready means that an asynchronous provider

Once a shared state has been made ready it contains a value, object, or exception which can be retrieved by objects having access to the shared state. While code is waiting for a shared state to become ready the value or exception that is going to be stored in the shared state may be computed. When multiple threads try to access the same shared state they must use synchronizing mechanisms (like mutexes, cf. section 20.2) to prevent access-conflicts.

Shared states use reference counting to keep track of the number of asynchronous return objects or asynchronous providers that hold references to them. Such a return object or provider may release its reference to a shared state (which is called `releasing the shared state'). If it holds the last reference to the shared state, then the shared state is destroyed.

On the other hand, an asynchronous provider may also abandon its shared state. In that case the provider, in sequence,

Objects of the class std::future (see the next section) are asynchronous return objects. They can be produced by the std::async (section 20.10) family of functions, and by objects of the classes std::packaged_task (section 20.11), and std::promise (section 20.12).

20.8: Asynchronous return objects: std::future

Condition variables allow threads to wait until data have obtained certain values. A thread may also have to wait until a sub-thread has finished when calling a sub-thread's join member.

Waiting may be unwelcome: instead of just waiting our thread might also be doing something useful. It might as well pick up the results produced by a sub-thread at some point in the future.

In fact, exchanging data among threads always poses some difficulties, as it requires shared variables, and the use of locks and mutexes to prevent data corruption. Rather than waiting and using locks it would be nice if some asynchronous task could be started, allowing the initiating thread (or even other threads) to pick up the result at some point in the future, when the results are needed, without having to worry about data locks or waiting times. For situations like these C++ provides the class std::future.

Before using the class std::future the <future> header file must be included.

Objects of the class template std::future harbor the results produced by asynchronously executed tasks. The class std::future is a class template. Its template type parameter specifies the type of the result returned by the asynchronously executed task. This type may be void.

On the other hand, the asynchronously executed task may throw an exception (ending the task). In that case the future object catches the exception, and rethrows it once its return value (i.e., the value returned by the asynchronously executed task) is requested.

In this section the members of the class template future are described. Future objects are commonly initialized through anonymous future objects returned by the factory function std::async or by the get_future members of the classes std::promise, and std::packaged_task (introduced in upcoming sections). Examples of the use of std::future objects are provided in those sections.

Some of future's members return a value of the strongly typed enumeration std::future_status. This enumeration defines three symbolic constants: future_status::ready, future_status::timeout, and future_status::deferred.

Error conditions are returned through std::future_error exceptions. These error conditions are represented by the values of the strongly typed enumeration std::future_errc (covered in the next section).

The class future itself provides the following constructors:

The class future does not offer a copy constructor or an overloaded assignment operator.

Here are the members of the class std::future:

The class std::future<ResultType> declares the following friends:

(cf. section 20.12), and

    template<typename Function, typename... Args>
        std::future<typename result_of<Function(Args...)>::type> 
        std::async(std::launch, Function &&fun, Args &&...args);

(cf. section 20.10).

20.8.1: The std::future_error exception and the std::future_errc enum

Members of the class std::future may return errors by throwing std::future_error exceptions. These error conditions are represented by the values of the strongly typed enumeration std::future_errc which defines the following symbolic constants:

The class std::future_error is derived from the class std::exception, and offers, in addition to the char const *what() const member also the member std::error_code const &code() const, returning an std::error_code object associated with the thrown exception.

20.9: Shared asynchronous return objects: std::shared_future

When a thread activates an asynchronous provider (e.g., a std::async) then the return value of the asynchronously called function becomes available in its activating thread through a std::future object. The future object cannot be used by another thread. If this is required (e.g., see this chapter's final section) the future object must be converted to a std::shared_future object.

Before using the class std::shared_future the <future> header file must be included.

Once a shared_future object is available, its get member (see below) can repeatedly be called to retrieve the results of the original future object. This is illustrated by the next small example:

     1: int main()
     2: {
     3:     std::promise<int> promise;
     4:     promise.set_value(15);
     6:     auto fut = promise.get_future();
     7:     auto shared1 = fut.share();
     9:     std::cerr << "Result: " << shared1.get() << '\n'
    10:               << "Result: " << shared1.get() << '\n';
    11:               << "Valid: " << fut.valid() << '\n';
    13:     auto shared2 = fut.share();
    15:     std::cerr << "Result: " << shared2.get() << '\n'
    16:               << "Result: " << shared2.get() << '\n';
    17: }
In lines 9 and 10 the promise's results are retrieved multiple times, but having obtained the shared_future in line 7, the original future object no longer has an associated shared state. Therefore, when another attempt is made (in line 13) to obtain the shared_future, a no associated state exception is thrown and the program aborts.

However, multiple copies of shared_future objects may co-exist. When multiple copies of shared_future objects exist (e.g. in different threads), the results of the associated asynchronous task are made ready (become available) at exactly the same moment in time.

The relationship between the classes future and shared_future resembles the relationship between the classes unique_ptr and shared_ptr: there can be only one unique_ptr pointing to data, whereas many shared_ptr objects may point to the same data.

The effect of calling any member of a shared_future object for which valid() == false other than the destructor, the move-assignment operator, or valid is undefined.

The class shared_future supports the following constructors:

The class's destructor destroys the shared_future object for which it is called. If the object for which the destructor is called is the last shared_future object, and no std::promise or std::packaged_task is associated with the results associated with the current object, then the results are also destroyed.

Here are the members of the class std::shared_future:

20.10: Starting a new thread: std::async

In this section the function template std::async is covered. Async is used to start asynchronous tasks, returning values (or void) to the calling thread, which is hard to realize merely using the std::thread class.

Before using the function async the <future> header file must be included.

When starting a thread using the facilities of the class std::thread the initiating thread at some point commonly calls the thread's join method. At that point the thread must have finished or execution blocks until join returns. While this often is a sensible course of action, it may not always be: maybe the function implementing the thread has a return value, or it could throw an exception.

In those cases join cannot be used: if an exception leaves a thread, then your program ends. Here is an example:

     1: void thrower()
     2: {
     3:     throw std::exception();
     4: }
     6: int main()
     7: try
     8: {
     9:    std::thread subThread(thrower);
    10: }
    11: catch (...)
    12: {
    13:     std::cerr << "Caught exception\n";
    14: }
In line 3 thrower throws an exception, leaving the thread. This exception is not caught by main's try-block (as it is thrown in another thread). As a consequence, the program terminates.

This scenario doesn't occur when std::async is used. Async may start a new asynchronous task, and the activating thread may retrieve the return value of the function implementing the asynchronous task or any exception leaving that function from a std::future object returned by the async function. Basically, async is called similarly to the way a thread is started using std::thread: it is passed a function and optionally arguments which are forwarded to the function.

Although the function implementing the asynchronous task may be passed as first argument, async's first argument may also be a value of the strongly typed enumeration std::launch:

    enum class launch
    {
        async,
        deferred
    };

When passing launch::async the asynchronous task immediately starts; when passing launch::deferred the asynchronous task is deferred. When std::launch is not specified the default value launch::async | launch::deferred is used, giving the implementation freedom of choice, usually resulting in deferring execution of the asynchronous task.

So, here is the first example again, this time using async to start the sub-thread:

     1: bool fun()
     2: {
     3:     return std::cerr << "    hello from fun\n";
     4: }
     5: int exceptionalFun()
     6: {
     7:     throw std::exception();
     8: }
    10: int main()
    11: try
    12: {
    13:     auto fut1 = std::async(std::launch::async, fun);
    14:     auto fut2 = std::async(std::launch::async, exceptionalFun);
    16:     std::cerr << "fun returned " << std::boolalpha << fut1.get() << '\n';
    17:     std::cerr << "exceptionalFun did not return " << fut2.get() << '\n';
    18: }
    19: catch (...)
    20: {
    21:     std::cerr << "caught exception thrown by exceptionalFun\n";
    22: }
Now the threads start immediately, and although the results become available around lines 13 and 14, the thrown exception doesn't terminate the program. The first thread's return value is made available in line 16, and the exception thrown by the second thread is simply caught by main's try-block (line 19).

The function template async has several overloaded versions:

When calling async all arguments except for the std::launch argument must be references, pointers, or move-constructible objects: once the thread itself starts, another move construction is used to construct an object for the duration of the thread. When a pointer to an object is passed, the sub-thread uses the object referred to by the pointer, and neither copy- nor move-construction is required. However, when using a pointer to an object the programmer should make sure that the object's lifetime exceeds the duration of the thread (note that this is not automatically guaranteed, as the asynchronous task may not actually start before the future's get member is called).

Because of the default std::launch::deferred | std::launch::async argument used by the basic async call it is likely that the function which is passed to async doesn't immediately start. The launch::deferred policy allows the implementor to defer its execution until the program explicitly asks for the function's results. Consider the following program:

     1: void fun()
     2: {
     3:     std::cerr << "    hello from fun\n";
     4: }
     6: std::future<void> asyncCall(char const *label)
     7: {
     8:     std::cerr << label << " async call starts\n";
     9:     auto ret = std::async(fun);
    10:     std::cerr << label << " async call ends\n";
    11:     return ret;
    12: }
    14: int main()
    15: {
    16:     asyncCall("First");
    17:     asyncCall("Second");
    18: }
Although async is called in line 9, the program's output may not show fun's output line when it is run. This is a result of the (default) use of launch::deferred: the system simply defers fun's execution until requested, which doesn't happen. But the future object that's returned by async has a member wait. Once wait returns the shared state must be available. In other words: fun must have finished. Here is what happens when after line 9 the line ret.wait() is inserted:
    First async call starts
        hello from fun
    First async call ends
    Second async call starts
        hello from fun
    Second async call ends

Actually, evaluation of fun can be requested at the point where we need its results, maybe even after calling asyncCall, as shown in the next example:

     1: int main()
     2: {
     3:     auto ret1 = asyncCall("First");
     4:     auto ret2 = asyncCall("Second");
     6:     ret1.get();
     7:     ret2.get();
     8: }
Here the ret1 and ret2 std::future objects are created, but their fun functions aren't evaluated yet. Evaluation occurs at lines 6 and 7, resulting in the following output:
    First async call starts
    First async call ends
    Second async call starts
    Second async call ends
        hello from fun
        hello from fun

The std::async function template is used to start a thread, making its results available to the calling thread. On the other hand, we may only be able to prepare (package) a task (a thread), but may have to leave the completion of the task to another thread. Scenarios like this are realized through objects of the class std::packaged_task, which is the topic of the next section.

20.11: Preparing a task for execution: std::packaged_task

The class template std::packaged_task allows a program to `package' a function or functor and pass the package to a thread for further processing. The processing thread then calls the packaged function, passing it its arguments (if any). After completing the function the packaged_task's future is ready, allowing the program to retrieve the results produced by the function. Thus, functions and the results of function calls can be transferred between threads.

Before using the class template packaged_task the <future> header file must be included.

Before describing the class's interface, let's first look at an example to get an idea about how a packaged_task can be used. Remember that the essence of packaged_task is that part of your program prepares (packages) a task for another thread to complete, and that the program at some point needs the result of the completed task.

To clarify what's happening here, let's first look at a real-life analogy. Every now and then I make an appointment with my garage to have my car serviced. The `package' in this case consists of the details about my car: its make and type determine the kind of actions my garage performs when servicing it. My neighbor also has a car, which also needs to be serviced every now and then. This also results in a `package' for the garage. At the appropriate time my neighbor and I take our cars to the garage (i.e., the packages are passed to another thread). The garage services the cars (i.e., calls the functions stored in the packaged_tasks [note that the tasks differ, depending on the types of the cars]), and performs some actions associated with the service (e.g., registering that my car or my neighbor's car has been serviced, or ordering replacement parts). In the meantime my neighbor and I go about our own business (the program continues while a separate thread runs as well). But by the end of the day we'd like to use our cars again (i.e., get the results associated with the packaged_task). A common result in this example is the garage's bill, which we have to pay (the program obtains the packaged_task's results).

Here is a little C++ program illustrating the use of a packaged_task (assuming the required headers and using namespace std have been specified):

    mutex carDetailsMutex;
    condition_variable condition;
    string carDetails;
    packaged_task<size_t (std::string const &)> serviceTask;

    size_t volkswagen(string const &type)
    {
        cout << "performing maintenance by the book for a " << type << '\n';
        return type.size() * 75;            // the size of the bill
    }

    size_t peugeot(string const &type)
    {
        cout << "performing quick and dirty maintenance for a " << type << '\n';
        return type.size() * 50;            // the size of the bill
    }

    void garage()
    {
        while (true)
        {
            unique_lock<mutex> lk(carDetailsMutex);
            while (carDetails.empty())
                condition.wait(lk);

            cout << "servicing a " << carDetails << '\n';
            serviceTask(carDetails);
            carDetails.clear();
        }
    }

    int main()
    {
        thread(garage).detach();

        while (true)
        {
            string car;
            if (not getline(cin, car) || car.empty())
                break;
            {
                lock_guard<mutex> lk(carDetailsMutex);
                carDetails = car;
            }
            serviceTask = packaged_task<size_t (string const &)>(
                        car[0] == 'v' ? volkswagen : peugeot
                    );
            auto bill = serviceTask.get_future();
            condition.notify_one();
            cout << "Bill for servicing a " << car <<
                                    ": EUR " << bill.get() << '\n';
        }
    }
Now that we've seen an example of a program using a packaged_task, let's have a look at its interface. Note that the class packaged_task is a class template: its template type parameter specifies the prototype of a function or function object implementing the task performed by the packaged_task object.

Constructors and destructor:

Member functions:

The following non-member (free) function operating on packaged_task objects is available:

20.12: The class `std::promise'

In addition to std::packaged_task and std::async the class template std::promise can be used to obtain the results from a separate thread.

Before using the class template promise the <future> header file must be included.

A promise is used to obtain the results from another thread without further synchronization requirements. Consider the following program:

    void compute(int *ret)
    {
        *ret = 9;
    }

    int main()
    {
        int ret = 0;
        std::thread(compute, &ret).detach();
        cout << ret << '\n';
    }
Chances are that this program shows the value 0: the cout statement has already been executed before the detached thread has had a chance to complete its work. In this example that problem can easily be solved by using a non-detached thread, and using the thread's join member, but when multiple threads are used that requires named threads and as many join calls. Instead, using a promise might be preferred:
    void compute(promise<int> &ref)
    {
        ref.set_value(9);
    }

    int main()
    {
        std::promise<int> prom;
        std::thread(compute, ref(prom)).detach();

        cout << prom.get_future().get() << '\n';
    }
This example also uses a detached thread, but its results are kept for future reference in a promise object, instead of directly being assigned to a final destination variable. The promise object provides access to a future object holding the computed value. The future's get member blocks until the future has been made ready, at which point the result becomes available. By then the detached thread may or may not have completed. If it has already completed its work then get returns immediately, otherwise there is a slight delay.

Promises are useful when implementing a multi-threaded version of some algorithm without having to use additional synchronization statements. As an example consider matrix multiplication. Each element of the resulting product matrix is computed as the inner product of two vectors: the inner product of a row of the left-hand matrix operand and a column of the right-hand matrix operand becomes element [row][column] of the resulting matrix. Since each element of the resulting matrix can be computed independently of the other elements, a multi-threaded implementation is quite feasible. In the following example the function innerProduct (lines 4..11) leaves its result in a promise object:

     1: int m1[2][2] = {{1, 2}, {3, 4}};
     2: int m2[2][2] = {{3, 4}, {5, 6}};
     4: void innerProduct(promise<int> &ref, int row, int col)
     5: {
     6:     int sum = 0;
     7:     for (int idx = 0; idx != 2; ++idx)
     8:         sum += m1[row][idx] * m2[idx][col];
    10:     ref.set_value(sum);
    11: }
    13: int main()
    14: {
    15:     promise<int> result[2][2];
    17:     for (int row = 0; row != 2; ++row)
    18:     {
    19:         for (int col = 0; col != 2; ++col)
    20:             thread(innerProduct, ref(result[row][col]), row, col).detach();
    21:     }
    23:     for (int row = 0; row != 2; ++row)
    24:     {
    25:         for (int col = 0; col != 2; ++col)
    26:             cout << setw(3) << result[row][col].get_future().get();
    27:         cout << '\n';
    28:     }
    29: }
Each inner product is computed by a separate (anonymous and detached) thread (lines 17..21), which starts as soon as the run-time system allows it to start. By the time the threads have finished the resulting inner products can be retrieved from the promises' futures. Since futures' get members block until their results are actually available, the resulting matrix can simply be displayed by calling those members in sequence (lines 23..28).

So, a promise allows us to use a thread to compute a value (or an exception, see below), which may then be collected by another thread at some point in the future. The promise remains available, and as a consequence further synchronization of the threads and the program starting the threads is not necessary. When the promise object contains an exception, rather than a value, its future's get member rethrows the stored exception.

Here is the class promise's interface. Note that the class promise is a class template: its template type parameter ReturnType specifies the template type parameter of the std::future that can be retrieved from it.

Constructors and destructor:

Member functions:

The following non-member (free) function operating on promise objects is available:

20.12.1: Exception propagation: std::exception_ptr

Std::promise's member set_exception does not expect a std::exception argument, but an object of the class std::exception_ptr. In this section we take a closer look at the class exception_ptr.

Before using the class exception_ptr the <future> header file must be included.

The class exception_ptr's default constructor initializes it to a null pointer. In the following code snippet the variable isNull is set to true:

    std::exception_ptr obj;
    bool isNull = obj == nullptr;

The class exception_ptr provides copy and move constructors as well as copy and move assignment operators.

Two exception_ptr objects can be compared for equality. They are equal if they refer to the same exception. Move assignment transfers the exception referred to by the right-hand side operand to the left-hand side operand, and turns the right-hand side operand into a null pointer.

There is no published method directly retrieving the exception to which an exception_ptr object refers. However, there are free functions constructing or handling exception_ptr objects, like std::make_exception_ptr, std::current_exception, and std::rethrow_exception.

20.13: An example: multi-threaded compilations

In this section another example program is developed, illustrating the use of packaged_tasks.

Like the multi-threaded quicksort example a worker pool is used. However, in this example the workers in fact do not know what their task is. In the current example the tasks happen to be identical, but different tasks might as well have been used, without having to update the workers.

The program uses a class Task containing a command specification (d_command) and a task specification (d_task) (cf. Figure 26). The sources of the program are found in the yo/threading/examples/multicompile directory of the C++ Annotations.

Figure 26: Data structure used for the multi-threading compilation

As before, main first starts its workforce as detached threads. Following this, the compilation jobs are performed by yet another detached thread. Eventually, the results of the compilation jobs are handled by results:

    int main()
    {
        workforce();                    // start the worker threads
        thread(jobs).detach();          // start working on the jobs
        results();                      // handle the results
    }

The jobs-thread reads the names of the files to compile from the standard input stream, and passes them on to dispatch for further handling by the workers. Lines 7 and 8 are executed at the end, and are a safeguard against empty input. The safeguard is discussed below, at newResult's description.

     1: void jobs()
     2: {
     3:     string line;
     4:     while (getline(cin, line) && dispatch(line))
     5:         ;
     7:     g_ready = true;
     8:     g_resultCond.notify_one();
     9: }

Next the dispatcher. It ignores empty lines. Also, if a compilation has failed by the time the dispatcher is called, processing stops (lines 6-7). Otherwise, the dispatcher waits for an available worker, prepares a new task, and notifies a worker to handle it:

     1: bool dispatch(string const &line)
     2: {
     3:     if (line.empty())
     4:         return true;
     6:     if (g_done.load())
     7:         return false;
     9:     g_workforce.wait();
    10:     newTask(line);
    11:     g_worker.notify_all();
    12:     return true;
    13: }

The function newTask prepares the program for the next task. First a Task object is created. Task contains the name of the file to compile, and a packaged_task. It encapsulates all activities that are associated with a packaged_task. Here is its (in-class) definition:

     1: typedef packaged_task<Result (string const &fname)> PackagedTask;
     3: class Task
     4: {
     5:     string d_command;
     6:     PackagedTask d_task;
     8:     public:
     9:         Task()  = default;
    11:         Task(string const &command, PackagedTask &&tmp)
    12:         :
    13:             d_command(command),
    14:             d_task(move(tmp))
    15:         {}
    17:         void operator()()
    18:         {
    19:             d_task(d_command);
    20:         }
    22:         shared_future<Result> result()
    23:         {
    24:             return d_task.get_future().share();
    25:         }
    26: };

Note (lines 22-25) that result returns a shared_future. Since the dispatcher runs in a different thread than the one processing the results, the futures created by the dispatcher must be shared with the futures required by the function processing the results. Hence the shared_futures returned by Task::result.

Once a Task object has been constructed its shared_future is pushed on the result queue. Although the actual results aren't available by this time, the result function (see below) is notified that something has been pushed on its queue. Additionally, the Task itself is pushed on the task queue, where a worker may retrieve it:

    void newTask(string const &line)
    {
        Task task{ line, PackagedTask{ compile } };
        pushResultQ(task.result());         // push the shared_future

        lock_guard<mutex> lk(g_taskMutex);  // push the task itself
        g_taskQ.push(move(task));
    }

    void pushResultQ(shared_future<Result> const &sharedResult)
    {
        lock_guard<mutex> lk(g_resultQMutex);
        g_resultQ.push(sharedResult);
        g_ready = true;
        g_resultCond.notify_one();
    }

The workers have a simple task: retrieve the next task from the task queue's front, then perform that task. Whatever happens inside the tasks themselves is of no concern: the worker thread eventually ends when the program ends:

    void worker()
    {
        Task task;
        while (true)
        {
            unique_lock<mutex> lk(g_taskMutex);
            while (g_taskQ.empty())         // wait until a task is available
                g_worker.wait(lk);

            task = move(g_taskQ.front());   // retrieve the next task
            g_taskQ.pop();
            lk.unlock();

            task();                         // and perform it
        }
    }
This completes the description of how tasks are handled. The task itself remains to be described. In the current example program source files are being compiled. The compilation command is passed to the constructor of a CmdFork object, which starts the compiler as a child process. The result of the compilation is retrieved via its childExit (returning the compiler's exit code) and childOutput (returning any textual output produced by the compiler) members. If compilation fails, the exit value won't be zero. In this case no further compilation tasks will be initiated (lines 11 and 12; the implementation of the class CmdFork is available from the C++ Annotations' yo/threading/examples/cmdfork directory). Here is the function compile:

     1: Result compile(string const &line)
     2: {
     3:     string command("/usr/bin/g++ -Wall --std=c++11 -c " + line);
     5:     CmdFork cmdFork(command);
     6:     cmdFork.fork();
     8:     Result ret {cmdFork.childExit() == 0,
     9:                 line + "\n" + cmdFork.childOutput()};
    11:     if (not ret.ok)
    12:         g_done = true;
    14:     return ret;
    15: }

The results function continues for as long as there are results. Once results are available they are displayed. Whenever a compilation has failed the Result's ok member is false and results ends:

    void results()
    {
        while (newResult())
        {
            Result const &result = g_resultQ.front().get();
            cerr << result.display;
            if (not result.ok)
                return;
            g_resultQ.pop();
        }
    }
The function newResult controls results' while-loop. It returns true when results are available in the result queue. When no filenames are presented at the standard input stream, or when no result has as yet been pushed on the result queue, newResult waits. It also waits when there are no results available in the result queue, but at least one worker thread is busy. Whenever a result is pushed on the result queue, and also once the input stream has been processed, newResult is notified. Following the notification newResult returns, and results either ends or shows the results appearing at the result queue's front:

    bool newResult()
    {
        unique_lock<mutex> lk(g_resultQMutex);
        while (                             // wait while the queue is empty
            g_resultQ.empty()               // and workers are still busy
            &&                              // or the input hasn't completely
            (                               // been processed yet
                g_workforce.size() != g_sizeofWorkforce
                ||
                not g_ready.load()
            )
        )
            g_resultCond.wait(lk);

        return not g_resultQ.empty();
    }

20.14: Transactional Memory

Transactional memory is used to simplify shared data access in multi-threaded programs. Its benefits are best illustrated by a small program. Consider a situation where threads need to write information to a stream. A plain example of such a program would be:

    void fun(int value)
    {
        for (size_t rept = 0; rept != 10; ++rept)
            cout << "fun " << value << '\n';
    }

    int main()
    {
        thread thr{ fun, 1 };
        fun(2);
        thr.join();
    }
When this program is run the fun 1 and fun 2 messages are intermixed. To prevent this we traditionally define a mutex, lock it, write the message, and release the lock:

    void fun(int value)
    {
        static mutex guard;

        for (size_t rept = 0; rept != 10; ++rept)
        {
            lock_guard<mutex> lk{ guard };
            cout << "fun " << value << '\n';
        }
    }

Transactional memory handles the locking for us. Transactional memory is used when statements are embedded in a synchronized block. The function fun, using transactional memory, looks like this:

    void fun(int value)
    {
        for (size_t rept = 0; rept != 10; ++rept)
            synchronized
            {
                cout << "fun " << value << '\n';
            }
    }

To compile source files using transactional memory the g++ compiler option -fgnu-tm must be specified.

The code inside a synchronized block is executed as a single transaction, as if the block were protected by a mutex. Different from using mutexes, transactional memory is implemented in software instead of using hardware facilities.

Considering how easy it is to use transactional memory compared to using the mutex-based locking mechanism, transactional memory appears too good to be true. And in a sense it is. When encountering a synchronized block the thread unconditionally executes the block's statements. At the same time it keeps a detailed log of all its actions. Once the statements have been completed the thread checks whether another thread didn't start executing the block just before it. If so, it reverses its actions, using the synchronized block's log. The implication of this should be clear: there's at least the overhead of maintaining the log, and if another thread started executing the synchronized block before the current thread then there's the additional overhead of reverting its actions and trying again.

The advantages of transactional memory should also be clear: the programmer no longer is responsible for correctly controlling access to shared memory; risks of encountering deadlocks have disappeared, as has all administrative overhead of defining mutexes, locking and unlocking. Especially for inherently slow operations like writing to files transactional memory can greatly simplify parts of your code. Consider a std::stack. Its top element can be inspected, but its pop member does not return the topmost element. Retrieving the top element and then removing it traditionally requires a mutex lock surrounding the inspection of the stack's size: if the stack is empty the lock is released and we wait; if it is not empty its topmost element is retrieved, after which it is removed from the stack. Using transactional memory we get something as simple as:

    bool retrieve(stack<Item> &itemStack, Item &item)
    {
        synchronized
        {
            if (itemStack.empty())
                return false;

            item = std::move(itemStack.top());
            itemStack.pop();
            return true;
        }
    }

Variants of synchronized are: