Byte-addressable non-volatile main memory (NVRAM) combines the performance benefits of conventional DRAM-based main memory with the durability of secondary storage. Systems where NVRAM co-exists with (or even replaces) traditional volatile main memory have recently become commercially available and are anticipated to become increasingly widespread in the future. Consequently, there is growing interest in recoverable concurrent objects (also called persistent or durable objects) — concurrent objects that leverage state saved in NVRAM for efficient recovery from crash failures.
In this tutorial, we will describe emerging shared-memory models for recoverable objects, as well as algorithms for implementing recoverable versions of mutual-exclusion locks and lock-free concurrent objects.
Research is about asking the right questions. Distributed computing started with the impossibility result of the "coordinated attack problem", and the question of "mutual exclusion". In this class, I'll do what is called in US Football, "Monday Morning Quarterbacking". How could our team score easily "if only...."
I'll describe the two problems mentioned above and pose similar alternative problems. I'll show that the solutions to these problems are natural and almost trivial, in a new setting called "message adversary", and by this I'll cover some of the main achievements of theoretical distributed computing that took decades in the making.
How to think about concurrent systems mathematically is explained using the Paxos consensus algorithm as an example. First, the problem to be solved is precisely specified. Then, a "shared memory" voting algorithm is specified and shown to implement the problem specification. Finally, the Paxos algorithm is specified and shown to implement the voting algorithm. How mathematical thinking is used in industry is then briefly discussed.
More and more applications are now distributed and in non-trivial distributed applications, it appears that the computing entities (processes) have to agree in one way or another, for example, to take a common decision, execute specific actions, or validate some commitment. The most famous distributed agreement problem is the consensus problem. A process crash failure occurs when a process stops prematurely; it can be seen as a benign failure, as a crashed process did not pollute the computation before crashing. The situation is different with Byzantine failures. A process has a Byzantine behavior when it arbitrarily deviates from its intended behavior. Let us notice that, from a failure hierarchy point of view, process crashes (unexpected halting) constitute a strict subset of Byzantine failures. As message-passing distributed systems are more and more pervasive, the assumption "no process has a bad behavior" is no longer sensible. Hence, agreement in Byzantine message-passing systems is becoming a more and more important issue of fault-tolerance.
This lecture aims to show how the consensus problem can be solved in both synchronous and asynchronous distributed message-passing systems prone to Byzantine failures. We present some lower bounds on the ratio of Byzantine processes that can be tolerated and how these bounds evolve according to the expected complexity of the obtained solutions. In synchronous systems, we will present the basic consensus algorithm called EIG (exponential information gathering) and show how its exponential complexity can be lowered. In the asynchronous model, consensus cannot be solved deterministically, we will show how to circumvent this impossibility by using timing assumptions or randomization. For this and in order to have more intuitive algorithms, we will introduce a series of broadcast primitives that help reduce the noise due to the Byzantine behavior of part of the processes.
University of Nantes
How do you know that your program is efficient? What is efficiency in the first place? First, we need to choose a complexity metric and agree that the metric is relevant. Second, we need to show that the program performs "well", having this metric in mind. Ideally, we would like to show that no program can solve the given problem with a better complexity. How to do it? We need to establish a tight lower bound, i.e. (1) find a complexity level that no program can beat, and (2) show that our program exhibits precisely this complexity.
In this lecture, we discuss popular techniques designed for establishing (tight) lower bounds in distributed computing. Models for distributed systems come in many flavors: processes may communicate through shared memory or by message passing, they may be synchronous or asynchronous, prone to failures or not, and so on. Different models require different lower bound results. Nevertheless, many of these results rely on common techniques, having to do with information transfer and the difficulty of dealing with indistinguishable scenarios. This lecture will explore and explain some of these techniques, together with their applications.
Nonblocking concurrent data structures are an increasingly valuable tool for shared-memory parallel programming. By ensuring that no reachable state precludes forward progress by any given thread, nonblocking structures avoid performance anomalies in the presence of preemption, and can in principle tolerate thread failures. They can also outperform lock-based alternatives in several important cases, and are competitive in others. They are, however, quite difficult to write — due, among other things, to inherent data races, interactions with the language and hardware memory model, and the need for concurrent memory management.
This course will briefly survey background material on hardware primitives and memory models, together with formal notions of safety, liveness, and proof techniques. It will then explore nonblocking versions of important data structures, including stacks, queues, linked lists, hash tables, skip lists, and search trees. In the process, it will introduce appropriate memory management techniques. To the extent that time permits, it will also point to work on several more advanced topics, including condition synchronization (partial methods), combining and flat combining, universal constructions, nonblocking software transactional memory, and RCU.
University of Rochester
In the first part of this tutorial we will provide a new general framework that can be used to explain many consensus protocol variants. We start with a simple primary backup replication scheme and via a series of steps extend it in several dimensions:
from crash failures, to omission failures, to Byzantine failures.
from single-shot to multi-shot.
from synchrony, to partial synchrony, to asynchrony.
In the second part of this tutorial we will touch upon the deep connections between Byzantine fault-tolerance and blockchains.
Modern servers have dozens or even hundreds of cores, which can execute many threads of computation in parallel. In such a system, the difference between the performance of a bad implementation and a good one can easily be 100x.
This lecture uses concurrent data structures as examples to explain high performance implementation techniques and surprising performance pitfalls. Along the way, we will cover linearizability, which is a way to define what correctness means for a concurrent data structure.
Students should finish with knowledge of some simple linearizable concurrent data structures, and an understanding of how these data structures interact with various aspects of real systems, with a special focus on processor caches.
University of Waterloo
Modern distributed data management systems face a new challenge: how to allow autonomous, mutually-distrusting parties to co-operate safely and effectively. This challenge presents many questions familiar from classical distributed systems: how to combine multiple steps into a single atomic action, how to recover from failures, how to synchronize concurrent access to data, and what it means for that action to be correct. These lectures describe how blockchain-based ledgers require rethinking the foundations of classical distributed computing.