Gbcast

From HandWiki

Gbcast (also known as group broadcast) is a reliable multicast protocol that provides ordered, fault-tolerant (all-or-none) message delivery in a group of receivers within a network of machines that experience crash failure.[1][2][3] The protocol is capable of solving Consensus in a network of unreliable processors, and can be used to implement state machine replication.[4][5] Gbcast can be used in a standalone manner, or can support the virtual synchrony execution model, in which case Gbcast is normally used for group membership management while other, faster, protocols are often favored for routine communication tasks.

History

Introduced in 1985,[1] Gbcast was the first widely deployed reliable multicast protocol to implement state machine replication with dynamically reconfigurable membership. Although this problem had been treated theoretically under various models in prior work, Gbcast innovated by showing that the same multicasts used to update replicated data within the state machine can also be used to dynamically reconfigure the group membership, which can then evolve to permit members to join and leave at will, in addition to being removed upon failure. This functionality, together with a state transfer mechanism used to initialize joining members, represents the basis of the virtual synchrony process group execution model.

The term state machine replication was first suggested by Leslie Lamport[4] and was widely adopted after publication of a survey paper written by Fred B. Schneider.[6] The model covers any system in which some deterministic object (a state machine) is replicated in such a way that a series of commands can be applied to the replicas fault-tolerantly. A reconfigurable state machine is one that can vary its membership, adding new members or removing old ones.[7] Some state machine protocols can also ride out the temporary unavailability of a subset of the current members without requiring reconfiguration when such situations arise, including Gbcast and also Paxos,[5] Lamport's widely cited protocol for state machine replication.

State machine replication is closely related to the distributed Consensus problem,[8] in which a collection of processes must agree upon some decision outcome, such as the winner of an election. In particular, it can be shown that any solution to the state machine replication problem would also be capable of solving distributed consensus. As a consequence, impossibility results for distributed consensus [9] apply to solutions to the state machine replication problem. Implications of this finding are discussed under liveness.

Gbcast is somewhat unusual in that most solutions to the state machine replication problem are closely integrated with the application being replicated. Gbcast, in contrast, is designed as a multicast API and implemented by a library that delivers messages to group members. Lamport, Malkhi and Zhou note that few reliable multicast protocols have the durability properties required to correctly implement the state machine model. Gbcast does exhibit the necessary properties.[7]

The Gbcast protocol was first described in a 1985 publication that discussed infrastructure supporting the virtual synchrony model in the Isis Toolkit.[1] Additional details were provided in a later 1987 journal article,[2] and an open-source version of the protocol was released by the Cornell developers in November of that year. Isis used the protocol primarily for maintaining the membership of process groups but also offered an API that could be called directly by end-users. The technology became widely used starting in 1988, when the Isis system was commercialized and support became available. Commercial support for the system ended in 1998 when Stratus Computer, then the parent of Isis Distributed Systems, refocused purely on hardware solutions for the telecommunications industry.

Examples of systems that used Isis in production settings include the New York Stock Exchange, where it was employed for approximately a decade to manage a configurable, fault-tolerant and self-healing reporting infrastructure for the trading floor, to relay quotes and trade reports from the "back office" systems used by the exchange to overhead display. The French Air Traffic Control System continues to use Isis; since 1996 the system has been employed to create fault-tolerant workstation clusters for use by air traffic controllers and to reliably relay routing updates between air traffic control centers; over time the French technology has also been adopted by other European ATC systems. The US Navy AEGIS has used Isis since 1993 to support a reliable and self-healing communication infrastructure. Isis also had several hundred other production users in the financial, telecommunications, process control, SCADA and other critical infrastructure domains. More details can be found in.[3]

Problem statement

The fundamental problem solved by Gbcast is this: we are given an initial set of group members and wish to support a multicast abstraction, permitting members of the group to send messages that encode various commands or requests. The protocol must agree on the messages to deliver, and on their ordering, so that if any member of the group sends a message, every member of the group that doesn't fail will receive that message and in the same order with respect to other delivered messages.

The set of group members changes each time a member fails or joins, and Gbcast is also used to maintain group membership by means of special multicasts that are delivered to the application as "new view" events, but that also adjust the group membership list maintained by the Gbcast protocol library. The application thus sees a series of membership views that start with an "initial view" when a particular group member joins, and then evolve over time, and that are ordered with respect to other view-changing events and multicast messages. These multicasts are delivered to all the non-failed members listed in the view during which delivery is scheduled, a property referred to as virtual synchrony.

Network partitions can split a group into two or more disjoint subgroups, creating the risk of split brain behavior, in which some group members take a decision (perhaps, to launch the rocket) without knowing that some other partition of the group has taken a different, conflicting decision. Gbcast offers protection against this threat: the protocol ensures that progress occurs only in a single primary partition of the group. Thus, should a network partition arise, at most one subgroup of members will continue operations, while the other is certain to stall and shut down.

Should a failed member recover (or if a partitioning failure caused some member to be incorrectly sensed as faulty and hence dropped from the view), after communication is restored, that member can rejoin. An incarnation number is used to avoid ambiguity: a counter that will be incremented each time a process joins the group, and is treated as part of the process identifier. Any given (processor-id, process-id, incarnation-number) tuple joins the group at most once, then remains in the group until it fails, or is forced to leave because a time out occurred.

Any dynamically reconfigurable system, including both Gbcast and Paxos, can enter states from which no further progress is possible. For example, this could happen if operational processes are wrongly removed from the configuration, and then too many real crashes occur within the remaining members of the view. In such situations, the data center management infrastructure is responsible for restarting the entire application. This is in contrast to the behavior of non-reconfigurable (vanilla) Paxos, which can tolerate disruptions of unlimited duration and then will resume once enough group members are accessible, without intervention of the management infrastructure. The following terms are used in the detailed protocol description.

Processes

  • Processes run on processors that operate at arbitrary speed.
  • Processes may experience crash (halting) failures.
  • A process is uniquely identified by a three-tuple: (processor-id, process-id, incarnation-number).
  • Processes with stable storage may re-join the protocol after failures (following a crash-recovery failure model), after incrementing the incarnation number.
  • Processes do not collude, lie, or otherwise attempt to subvert the protocol. (That is, Byzantine failures [10] don't occur.)

Network

  • All processes in the system can send messages to all other processes in the system.
  • Messages are sent asynchronously: there is no time bound on message delivery.
  • Messages may be lost, reordered, or duplicated.
  • Messages are delivered without corruption.

These are weak assumptions: a network that never delivers any messages would satisfy them (we would say that such a network is experiencing a complete and permanent partitioning failure). The network conditions required for Gbcast to guarantee progress are discussed below. In practice Gbcast is normally used within data centers; these have networks that can experience transient failures, but in which partitioning failures are rare, and generally impact just small subsets of the nodes. Thus for purposes of analysis we assume a harsher networking environment than would arise in actual deployments.

To simplify the presentation, we assume that a TCP-like acknowledgement / retransmission scheme is employed, creating the illusion of a reliable, sequenced, non-repeating message channel between each pair of processes. A timeout occurs if this channel abstraction retries repeatedly and is unable to obtain an acknowledgement for some message. Using the same TCP-like channels, we can also support a 1-to-all capability, whereby a single process sends some message over its channels to all the other members of some view of some group. This is done by mapping the 1-to-all request into multiple 1-to-1 messages. Notice that these 1-to-all channels lack any atomicity guarantee: if the sender fails while a message is being sent, it might reach just some of the destinations.

Process Groups and Views

Gbcast is defined with respect to a "process group:" a set of processes. In a deployed system such a group might have a name (like a file name), a way to initially contact the group, and other attributes such as flow-control parameters. However, those kinds of details are omitted here for brevity.
The term membership view is a list of members, rank-ordered by age (determined by the view in which each member most recently joined the group) and with ties broken by a lexicographic ordering rule.
The initial membership of the group is specified by an external agent and defines the first membership view of the group.
Subsequent membership views arise by applying add and remove commands and are identified by sequence number.
New views are reported to the processes belonging to the view by means of "new view" events. The application is notified via an upcall (a call from the library into a handler defined by the application program).

Multicast Messages

Members of a view can request that multicast messages be sent to a process group without knowledge of the membership that will apply at the time of delivery.
The Gbcast protocol carries out these operations with a series of guarantees, discussed below.
Delivery is by upcall to the application, which can perform whatever action the message requests.

Roles

Gbcast is best understood in terms of a set of roles.

Application

An application corresponds to a program which can be launched on one or more processors. Each application process then joins one or more process groups.
An application process belonging to a group initiates new multicasts by invoking Gbcast. The protocol is considered to have terminated when all members of the target group have either acknowledged delivery of the message, or have been detected as faulty, via a mechanism explained below.
Incoming Gbcast messages are delivered via upcalls, as are view change notifications.
As noted earlier, the members of a group observe the same sequence of upcalls starting when they initially join: an initial view and then a sequence of new views and multicast messages. All members of a group receive any particular multicast in the same view, and the multicast is delivered to all non-failed members of that view.

Leader

The leader of a group is defined with respect to some view of the group, and is the member with lowest rank in the view. As noted, the rank is age-ordered (with older members having lower rank), and ties are broken using a lexicographic sort.

Failure detection

All components of the system are permitted to participate in the role of "detecting" failures. Detection is distinct from the reporting of the failure (which occurs through a new view and is ordered with respect to message deliveries).
The channel abstraction supported by the network layer senses failures by timeouts. (Notice that under the network model, a process that attempts to send a message to a crashed target process will always experience a timeout, but it is also possible that the channel abstraction could misreport an operational process as faulty if messages are delayed because of a transient partitioning failure).
Any process that experiences a timeout can declare that the endpoint of the associated channel has failed.
If a process learns of a failure for some (processor-id, process-id, incarnation-number) tuple, it includes that information on the next outgoing message on all channels.
A process that considers some other process to have failed will reject messages from the failed incarnation, responding "you have failed". (That is, processes gossip about failures, and shun failed group members).
An incoming message from a new incarnation of a failed process is treated as a message from a "new" process.

Failed process

Any member of the current view that has been detected as failed is considered to be a failed process.
An operational process that learns that it is considered to have failed (by attempting to communicate with some other process that rejects the message, thereby "shunning" it) might exit from the system, or can increase its incarnation number and rejoin.

New Leader

If every lower-ranked process in the current view is a failed process, then the next highest-ranked non-failed process is designated as the new leader.
The new leader must run a protocol, discussed below, to become the leader.

Quorums

Quorums are used to guarantee the safety properties of Gbcast by ensuring that there is a single globally agreed-upon sequence of group views and multicast messages and by preventing progress in more than one partition if a group becomes fragmented into two or more partitions (disjoint subsets of members that can communicate with other members of their subsets, but not with members of other subsets). Quorums are defined for a specific view.

Given view i with n members {A,B,C….}, a quorum of the view is any majority subset of the members of that view. Notice that this is in contrast to the way the term is defined in systems that have a static underlying membership: for Gbcast, the quorum size will change over time as the membership of a group changes and new views become defined.

Safety and liveness properties

In order to guarantee safety, Gbcast defines three safety properties and ensures they hold, regardless of the pattern of failures:

Non-triviality

Only multicasts actually sent by some group member are delivered. If a process receives a message from a group member that it considers to have failed, it will reject those messages.

Consistency

If any member of a view delivers a multicast (or reports a new view) in some order relative to other multicasts, then all other members of the same view that deliver the same message (or report the same view) will do so in the same order.

Conditional liveness

If multicast M is sent in some view and the sender remains operational, then eventually all members of that view (with the exception of any that crash) will deliver M. Liveness cannot be guaranteed under all conditions, hence we impose a further condition: we require this property only while sufficiently many processes remain non-faulty (we'll discuss this further below).

Basic Gbcast

This protocol is the one used under normal conditions.

Recall that in Gbcast, each operational process has a current view, and each view defines a leader. Only a process that believes itself to be the leader in the current view can initiate a new multicast; other members must relay multicasts by sending them to the leader, over 1-to-1 connections, and then waiting for the leader to run the protocol.

Should the leader fail while some member that is not the leader is attempting to relay a multicast, the sender must determine the status of its pending request. This is accomplished as follows: Notice that members observe the delivery of their own multicasts. Accordingly, if a new view becomes defined in which the old leader has failed, either the multicast has been delivered (in which case the sender knows this because it was one of the receivers), or the delivery of the new view allows it to conclude that the leader failed to relay the pending message, and that it should be resent by asking the new leader to relay it (non-triviality).

Prepare step

The leader proposes some sequence of one or more multicast messages by using the 1-to-all reliable network layer to send the message(s) to the members of the most current view, identifying each by means of an integer sequence number. The sequence numbers reset to 1 as each new view is defined (via a special kind of multicast, as explained below). A leader "talks to itself", participating in the protocol just as do other members. During recovery (discussed below), a new leader might re-propose some previously proposed view or message, as the new leader attempts to complete protocols that the old leader might have started but failed to complete. When this occurs, the new leader will respect the original sequencing and will re-propose the identical view or message.

Promise step

Each recipient retains a copy of the message(s) and responds with a promise to deliver them (such a promise will be fulfilled so long as the recipient itself remains a member of the group view, but if the recipient fails, the promise might not be carried out). During recovery, a recipient might receive a duplicated prepare request for the same message. If some message is re-proposed with the same sequence number, a recipient simply repeats its promise.

Commit step

The leader collects promise messages until, for each member of the group, it either has a promise message or a timeout has occurred causing the leader to suspect the corresponding member as faulty (recall that in this latter case, the leader will shun the suspected member, and because the message-sending subsystem piggybacks this information on the next messages it sends, any process receiving a subsequent message from the leader will also begin to shun these newly suspected members).
If the leader receives promises from a quorum of members, as defined with respect to the view in which it is running the protocol, it sends a commit request. If the leader lacks a quorum, and hence suspects more than a majority of group members, it will never again be able to make progress, and the leader therefore terminates (the application program may rejoin the group using a new process name, but further progress by this process in the old view, under the old process name, is impossible).
Notice that the leader may also have learned of failures during the prepare phase or the propose phase.
In the prepare phase, some view members may have failed to acknowledge the propose request, in which case the leader's channel to those members will have experienced timeouts. The leader will have marked them as failed members.
Additionally, it may be the case that by receiving the promise messages in the promise phase, the leader has learned of failed members that were detected by other group members. Thus, at the start of the commit phase, the leader has a quorum of promises together with a possibly empty list of failed view members.
The leader therefore sends the "Commit" message to the non-failed members of the view, together with a proposal for a view change event that will remove the failed member(s) from the view, thereby combining a commit step and a propose step into a single actions. Recall that the after any failure detection occurs, the first message to each member in the group will piggyback that failure detection information, and that members shun failed members. Thus members that learn of a failure instantly begin to shun failed members, and the leader takes the further step of starting a view change protocol (which will then take some time to complete).
If a proposal changed the view by adding members, the leader sends the new view to the joining members; it becomes their initial view, and they can then participate in any subsequent runs of the protocol.
During recovery, a participant might receive a duplicated commit for a previously committed message. If so, it enters the delivery phase but does not redeliver the message or view to the application.

Delivery step

If a member receives a Commit message, it delivers the associated message(s) or new view(s) to the application, in the order that they were proposed by the leader. The leader learns that this step has succeeded when the acknowledgements used by the reliable 1-to-1 channel are received.

Message flow: Basic Gbcast, simplest case

(Quorum size = 2, view1={A,B,C})

Member   Leader        Members      Application Layer
            A          A  B  C       A  B  C
  |         |          |  |  |       |  |  |
  X-------->|          |  |  |       |  |  |  Request that the leader send a multicast M
  |         X--------->|->|->|       |  |  |  Propose(1.1: M) (View 1, sequence 1, message M)
  |         |<---------X--X--X       |  |  |  Promise(1.1)
  |         X--------->|->|->|       |  |  |  Commit(1.1)
  |         |<---------X--X--X------>M->M->M  Committed(1.1); Delivers M
  |         |          |  |  |       |  |  |

Error cases in basic Gbcast

The simplest error cases are those in which one or more members fail, but a quorum remains active. In the example below, the group consists of {A,B,C} with A playing the leader role. C fails during the promise phase and a timeout occurs within the reliable channel from the leader to process C. The leader therefore commits the delivery of M, but simultaneously initiates a protocol to remove C from the group, which commits, creating the new view {A,B}. If C has not actually failed, it can now rejoin the group but with a new incarnation number: in effect, C must rejoin as C'. Any messages from C to A or B will be rejected from the instant that each learns of the apparent failure: C will be shunned by A and B.

Message flow: Basic Gbcast, failure of member other than the Leader

(Quorum size = 2, view1={A,B,C})

Member   Leader        Members      Application Layer 
            A          A  B  C       A  B  C
  |         |          |  |  |       |  |  |
  X-------->|          |  |  |       |  |  |  Request(M)
  |         X--------->|->|->|       |  |  |  Propose(1.1: M)
  |         |          |  |  *       |  |  *  !! C FAILS !!
  |         |<---------X--X          |  |     Promise(1.1)
  |         X--------->|->|          |  |     Commit(1.1); Propose(1.2: "remove C")
  |         |<---------X--X--------->M->M     Committed(M); Delivers M; Promise(1.2)
  |         X--------->|->|->|       |  |     Commit(1.2);
  |         |<---------X--X--X------>V->V     Committed(1.2); Delivers view2={A,B}
  |         |          |  |          |  |

Notice that the Commit and the new Proposal (and the piggybacked failure notification) are combined into a single message. This ensures that any process that commits an action after a new failure has been sensed simultaneously learns of that failure and will shun the associated process, and that the process will quickly be removed from the view. If C hasn't crashed, it can rejoin by incrementing its incarnation number (so it is now named C') and then requesting that it be added back into the group by the leader. It will be appended to the membership list with its new name, and will have the highest rank (because it is the youngest member) among members of the view.

Message flow: Basic Gbcast, add members {D,E,F}, failure of member other than the Leader

In the example shown below, a group that initially contains members {A,B,C} is asked to add {D,E,F}, but member C fails during the protocol. Membership change requests are treated as a special kind of multicast and the sequence of events is the same. The example is thus nearly identical to the prior one, but now a series of new view events are delivered to the application. (Quorum size = 2, view1={A,B,C})

Member   Leader        Members               Application Layer 
            A          A  B  C  D  E  F       A  B  C  D  E  F
  |         |          |  |  |                |  |  |  |  |  |
  X-------->|          |  |  |                |  |  |  |  |  |  Request("add D,E,F")
  |         X--------->|->|->|                |  |  |  |  |  |  Propose(1.1: "add D,E,F")
  |         |          |  |  *                |  |  *  |  |  |  !! C FAILS !!
  |         |<---------X--X                   |  |     |  |  |  Promise(1.1)
  |         X--------->|->|                   |  |     |  |  |  Commit(1.1); Propose(2.1: "remove C")
  |         |<---------X--X-----X--X--X------>V->V---->V->V->V  Committed(1.1); Deliver view2={A,B,C,D,E,F}; Promise(2.1)
  |         X--------->|->|---->|->|->|       |  |     |  |  |  Commit(2.1)
  |         |<---------X--X-----X--X--X------>V->V---->V->V->V  Committed(2.1); Deliver view3={A,B,D,E,F}
  |         |          |  |     |  |  |       |  |     |  |  |

At the end of the protocol, the new active view is view3={A,B,D,E,F} and the new quorum size is 3. But notice that there was an "intermediate" view, view2={A,B,C,D,E,F} with quorum size of 4. Had the leader not received 4 promises to the proposal phase that removed C, it would not have been able to run the commit phase for view3. This illustrates a basic policy: the quorum required to commit a new view is always based on the size of the prior view.

Takeover protocol, used when the leader fails

The next failure case is when a leader fails, resulting in a new leader. To take over as the leader, the new leader first runs a takeover protocol, and then the new leader can run basic Gbcast as above. The takeover protocol is as follows:

Inquiry Step

The new leader sends a 1-to-n message interrogating non-failed members to learn of any messages they have promised to deliver.

Promise-List Step

Each recipient sends the current list of promised messages to the leader. If a recipient lacks its initial view, it sends a request for an initial view to the leader.
The new leader waits until it has either received a promise-list from each of the members it contacted, or has timed out. If a timeout occurs, the new leader suspects the member in question, and will shun it, as will any other members that it contacts. It will eventually propose a view that excludes these shunned members, as explained further below.

Repeat If Necessary

The new leader examines the promise-list, looking for membership-change messages that add new members. If any are present, it iterates the inquiry phase and promise-list collection phase, sending inquiries to the new members. This in turn could lead to the discovery of additional proposals that add still further members. The process terminates when every member (current or proposed to be added) has responded with a promise-list or been suspected by the new leader.

Check for Quorums

At the end of the inquiry phase, the leader has received promise-list responses from some of the processes it contacted; any unresponsive members will now be suspected. The new leader constructs a list of proposed views. To advance to the next step of the take-over proposal, the new leader must have received a quorum of responses from each of the committed or proposed views on this list. If it has failed to receive a quorum of responses for any committed or proposed view on the list, the new leader has failed to take over as leader and will never succeed. It terminates the protocol and must rejoin the system as a new member, using a new process incarnation number.

Start as New Leader

Having successfully checked for quorums, the new leader becomes the leader. It can now run the basic protocol. It re-proposes any promised messages or view-changes, in the order it learned them from the promise-lists, following them with a new view-change command that removes the old leader and any other members that failed to respond during the inquiry phase. If any member responded, during the promise-list phase, that it lacks its initial view, the new leader sends the appropriate initial view to that member.

Dueling Leaders

It is possible that the promise-lists include two or more distinct proposals for the same slot. This happens (only) if a first leader A became partitioned from the system, but nonetheless made a proposal X that was seen only by a small (non quorum) set of members. A new leader B then took over successfully, but didn't learn of A's proposal (which cannot have become committed). B now proposes Y, again at a small minority of members. Now B is believed to have failed and C takes over. It is possible for C to learn of proposals X and Y, for the same slot. C should ignore the proposal associated with the older leader, A, but retain the proposal associated with the newer leader, B: in this situation, proposal X cannot have achieved a quorum and hence cannot have become committed, whereas proposal Y, made by the more recent leader, could have become committed (otherwise, which is to say if X might have been reached a quorum, B would have learned of and hence repeated proposal X; thus because B didn't learn of X, X must not have received a quorum).
Note that C's take-over protocol uses a deterministic ordering among leaders A and B to determine that proposal X is doomed, because leader B must have shunned A in order to become leader. Conversely, C must assume that proposal Y may become committed, even if A suspected that B has failed, because proposal Y intersected with C's take-over step. The rule is implemented: by numbering the leaders sequentially and including the leader-number in the proposal. During the inquire step, a new leader can then use the proposal from the leader with the larger number, if it receives conflicting proposals for the same slot.

Failure Suspicions Piggyback on Outgoing Messages

Notice that the new leader believes the old leader to have failed, and may also believe that other members have failed. Thus, the inquiry phase, and or the new propose phase, may also carry piggybacked failure messages for one or more members. This is a central requirement for the protocol, because it ensures that those members will subsequently be shunned: if further communication is received from a shunned member, the receiver will reject those messages. It follows that if any member executes the promise-list phase for an old leader L, no further propose or commit messages from L will be processed by that member. From this we can see that the promise-list collected by the new leader will be complete, containing all promised messages that could possibly have achieved a quorum in the current view. It may also contain some additional promised messages that have not yet achieved a quorum.

Message flow: Basic Gbcast, failure of Leader, TakeOver, Basic Gbcast by the new leader

(Quorum size = 2, view 1={A,B,C})

Member   Leader        Members      Application Layer 
         A  B          A  B  C       A  B  C
  |      |             |  |  |       |  |  |
  X----->|             |  |  |       |  |  |  Request(M)
  |      X------------>|->|  |       |  |  |  Propose(1.1: M) !! Leader fails during send, Propose doesn't reach C !!
  |      *<------------X—-X  |       |  |  |  Promise(1.1) 
  |         |          *  |  |       *  |  |  !! A (THE LEADER) HAS FAILED !! 
  |         |             |  |          |  |  !! NEW LEADER: B !!
  |         ?------------>|->|          |  |  Inquire("B is taking over because A has failed")
  |         |<------------X--X          |  |  PromiseLists(1.1: M)
  |         X------------>|->|          |  |  Propose(1.1: M); Propose(1.2: "remove A")
  |         |<------------X--X--------->|  |  Promise(1.1); Promise(1.2) 
  |         X------------>|->|--------->|  |  Commit(1.1); Commit(1.2); 
  |         |<------------X--X-------->M;V->M;V  Committed(1.1); Committed(1.2); Delivers(M). Delivers view2={B,C}

Message flow: Basic Gbcast, Add members {D,E,F}, failure of the Leader

As an example of a more complex case, here the leader fails in the middle of a commit that increases the size of the view

(Quorum size = 2, view 1={A,B,C})

Member   Leader        Members                Application Layer 
         A  B          A  B  C  D  E  F       A  B  C  D  E  F
  |      |             |  |  |  |  |  |       |  |  |  |  |  |
  X----->|             |  |  |  |  |  |       |  |  |  |  |  |  Request("add D, E, F")
  |      X------------>|->|  |  |  |  |       |  |  |  |  |  |  Propose(1.1) !! Leader fails during send, Propose doesn't reach C !!
  |      *<------------X—-X  |  |  |  |       |  |  |  |  |  |  Promise(1.1) 
  |         |          *  |  |  |  |  |       *  |  |  |  |  |  !! A (THE LEADER) HAS FAILED !! 
  |         |             |  |  |  |  |          |  |  |  |  |  !! NEW LEADER: B !! 
  |         ?------------>|->|  |  |  |          |  |  |  |  |  Inquire("B is taking over because A has failed")
  |         |<------------X--X  |  |  |          |  |  |  |  |  PromiseLists(1.1: "add D, E, F");
  |         ?-------------|--|->|->|->|          |  |  |  |  |  Iterated Inquire("B is taking over because A has failed")
  |         |<------------|--|--X--X--X          |  |  |  |  |  PromiseLists(1.1: "add D, E, F");
  |         X------------>|->|->|->|->|          |  |  |  |  |  Propose(1.1: "add D, E, F"); Propose(2.1: "remove A")
  |         |<------------X--X--X--X--X          |  |  |  |  |  Promise(1.1); Promise(2.1); 
  |         X------------>|->|->|->|->|          |  |  |  |  |  Commit(1.1); Commit(2.1); 
  |         |<------------X--X->X->X->X -------->V->V->V->V->V  Committed(1.1); Committed(2.1); Delivers
                                                                   view2={A,B,C,D,E,F}. Delivers view3={B,C,D,E,F}

In this example we see the inquiry iteration "in action": B learns of the protocol that adds {D,E,F} in a first phase of the inquiry, hence it repeats the inquiry, this time contacting D, E and F. There is no need to repeat the inquiry at C since this would simply return the same information previously obtained.

In this example, the final commit actually causes two views to be delivered in succession at members B and C. Even though the two proposals were sent concurrently, the commit for view2 requires a promise from a quorum of view1, whereas the commit for view3 requires a quorum response from the members of view2. Although the sending of initial views isn't explicitly shown in the diagram, the joining members don't participate in the 1.1 protocol because they don't join the group until view2. Notice that at members B and C a pipelining effect arises: events associated with view2 are already being proposed even as events in view1 are still being committed.

Correctness

To show that Gbcast satisfies non-triviality we start by tracing backwards from an arbitrary delivery action to the point at which a client requested the corresponding event; clearly, only messages that were legitimately sent will be delivered. However, nontriviality for this protocol goes further: we must also show that messages from a given member are delivered only while that member is still a live participant in some view. Accordingly, we look at the case in which the leader initiates some multicast but then fails before it is delivered. Here, the new leader either discovers the pending proposal, and will order it before the view-change event, or the new leader fails to discover the pending proposal, in which case all members of the new view will shun any late-arriving incoming message from the old leader. Thus either a multicast message is delivered while the view in which it was sent is still pending, or it will not be delivered at all.

To establish consistency we begin by analysis of the case in which there is just a single leader that never fails or loses connectivity with a quorum. Since the leader sequences the events and includes each member starting with the first view that contains that member, all members deliver the identical messages starting from the view in which they were added to the system.

When a new leader takes over, the inquiry is required to reach a quorum of members for the most recent committed view. This quorum necessarily will include at least one process that received any proposal that the old leader could have committed. Thus the new leader will learn of any potentially committed proposal and include it as a preflix to its own new proposals. It follows that if any process delivers any event, then if the system makes progress, every surviving member will eventually deliver that same event and in the same order.

We can show that a joining member will receive its initial view by analysis of the two relevant cases. If the leader doesn't fail, it sends the initial view on an eventually reliable channel. If the leader does fail and some member lacks its initial view, the new leader sends that view after receipt of the "promise-list" response to its inquiry-phase message.

A logical partitioning of the group is impossible because of the shunning rule. In order to commit any new view, the old leader must obtain promises from a quorum of the current view. A new leader, taking over, will learn of any view that could have become committed. To commit its own proposed next view, it will thus be required to interact with a quorum of that intermediary view, if any. In a scenario that could lead to partitioning, the leader, A, might have timed out on B and gone on to create a sequence of new views and events that excluded B. But in this case a majority of the old or of the intermediary view members will have learned that A believes B to have failed, and will shun B when it inquires. In either case, B is prevented from obtaining a quorum and hence cannot make progress. A symmetric argument shows that if B succeeds in defining a new view that excludes A, A would be unable to obtain a quorum for any other new view that it might attempt to propose.

Liveness

The Gbcast protocol will make progress provided that at all times in the execution, if view v holds at time t, then less than a quorum of members of v fail (or are suspected as failing) within some subset of the members of the view. To maximize progress, it is important that excluded but still live members rejoin the group, so that erroneous failure detections don't cause the view to shrink in a persistent manner. However, the protocol will not recover and make progress if at any time, every process suspects more than a quorum of members of the current view of having failed.

This property is similar to but "stronger" than <>W, the "weakest failure detector" for achieving consensus, as defined by Chandra and Toueg. To see this, consider a run in which a mutually suspecting quorum arises "too quickly" for processes that have been wrongly excluded from the view to rejoin it. Gbcast will not make progress and, indeed, the group will need to shut down and restart.

Arguably, such runs would be unlikely in the kinds of data centers where Gbcast is typically used, but clearly they can be constructed in an adversarial manner.

Discussion: Failure sensing

The Gbcast protocol presumes that the probability of incorrect failure suspicions will be low; the scheme breaks down if failure suspicions occur frequently and operational processes are often suspected as faulty. By analogy, consider the TCP protocol, in which the failure to receive an acknowledgement will eventually cause a connection to break. TCP is used nearly universally, a tremendous disruption to the Web would result if TCP connections frequently were to break when neither endpoint has failed. Thus timeouts are set conservatively. A similar assumption is required for systems that use Gbcast.

In contrast, there are other failure detection schemes, such as the one explored by Chandra and Toueg, that can yield high rates of incorrect failure suspicions. Some protocols, including Paxos, are able to tolerate incorrect failure suspicions without any costly consequence. Whether one approach is inherently better than the other is beyond the scope of this discussion. We simply underscore that the approaches differ, and that Gbcast would be ineffective if timeouts are set overly aggressively.

One extreme scenario is worthy of further mention: network partitioning events. Modern data centers and networks often experience events in which a single machine and all the processes on it becomes transiently partitioned from a larger pool of machines that remain connected to one another. Such cases are treated as failures in Gbcast, but if the surviving, connected members include a sufficiently large number of processes, the majority portion of the system will simply reconfigure itself to exclude the disconnected member. It can reconnect and rejoin the group later when the partition heals.

A more extreme kind of partitioning is sometimes seen in data centers: in this situation, a network switch might fail, causing a collection of machines (perhaps, a whole rack or even an entire container) to become disconnected from the Internet and from the remainder of the data center. In such cases one could imagine a group in which all members begin to suspect all other members; Gbcast will not progress in this case and the management infrastructure would need to relaunch the entire application. On the other hand, in most large data centers, the operating systems of the machines experiencing such a failure would also shut down, restarting only when connectivity is restored. Thus in practice, the restart of the system is unavoidable. This said, there are protocols, such as Paxos, that could ride out such an outage if the machines themselves were to remain operational and later regained adequate connectivity.

The Transis system explored extensions to the Gbcast protocol that permit multiple partitions to form, to make independent progress, and then to remerge. This topic, however, is beyond the scope of the present discussion.

Discussion: Dueling leaders

In the Paxos protocol, a situation can arise in which two or more leaders "duel" by proposing different commands for the same slot. This can also occur in Gbcast.

In the normal sequence of events, one leader takes over because the prior leader has failed, learns of any proposals the prior leader made during the inquiry phase, and then repeats those same proposals, extended with new ones. Thus no duel over the content of slots arises because the same proposals are repeated in the same slots.

The closest situation to a duel is seen if the old leader has become partitioned from the majority and the new leader, taking over, is unable to contact some set of members (but does obtain the required quorum during the INQUIRE phase). Here the new leader may be unaware of some proposals that the old leader made, or might still issue, if those reach only the members the new leader didn't contact.

The shunning mechanism resolves such duels. When the new leader obtained a quorum during the INQUIRE phase, it also blocked the old leader from ever again achieving a quorum for any new PROPOSE it might initiate: a majority of members are now shunning the old leader. Thus if any proposal is missed by the new leader it necessarily is a proposal that didn't reach a quorum of members, and won't reach a quorum in the future. Moreover, members aware of such a proposal will be shunned by the new leader, since (when it gave up waiting for them to respond to its INQUIRE) it considers them to have failed. Any member learning of new proposals from the new leader will shun them as well.

Shunning of leaders in Gbcast occurs in the pre-determined order of leader ranks: a higher-ranking leader only shuns a lower-ranking leader when it tries to take-over its place. The Paxos ballots mechanism serves precisely the same purpose, but differs in allowing participants to attempt to take-over repeatedly, eaach time assuming a new ballot ("rank"). The result is that, one the one hand, Paxos leader demotion is reversible, and on the other, dueling leaders could theoretically continue forever.

Bi-simulation equivalence to Paxos

Although superficially quite different, upon close study Gbcast is seen to be surprisingly similar to Paxos. Indeed, Paxos can be "transformed" into Gbcast with the following (reversible) sequence of steps. For brevity we describe these steps informally and omit a detailed proof.

Note that this transformation does not address durability. Gbcast treats durable state as a property of the application, not the protocol, whereas Paxos logs events to a set of durable command logs, and hence can still recover its state even after the whole service is shut down and restarted. The equivalent behavior with Gbcast involves having the application log all received messages, but that case will not be considered here.

  1. Start with the basic Paxos protocol. Add a process incarnation number to distinguish a rejoining process from one that has been continuously a member of the view. Impose an age-based ordering on the members of the group, designate the oldest member (breaking ties lexicographic) as the leader. Non-leaders issue requests through the leader.
  2. Both protocols permit batching of requests: Basic Paxos has a concurrency parameter, alpha: a leader can concurrently run a maximum of alpha instances of the protocol. Gbcast permits the leader to propose multiple events in a single protocol instance, which could be message deliveries or view events.
  3. Paxos does not normally require reliable, ordered communication. Modify the protocol to run over the reliable one-to-one channel abstraction (a one-to-many message would be sent by Paxos over a set of one-to-one channels). We can now assume that any message sent will either be received and delivered in order, or that a timeout will occur at the sender side.
  4. The Paxos slot number will become the Gbcast sequence number. The Paxos ballot number is, in effect, transformed into the proposing leader-number used to discriminate between conflicting proposals during the inquire step.
  5. Define a category of view-modifying commands that operate by adding or removing processes from the group membership. Introduce a failure detection mechanism as used in Gbcast, asking the leader to remove any timed-out members. A member removed from the group that reestablishes connectivity to the group should rejoin with a new incarnation number. Report views by upcalls to the application.
  6. Basic Paxos can propose a multicast to just a quorum of group members, hence a typical member may have gaps in its command list. This is why, in Paxos, a learner must read a quorum of members and merge their command lists. In our modified protocol, any multicast is proposed to all non-failed members, while failed members are dropped from the view. Thus unlike Paxos, our modified protocol has the property that any single live member has the full committed event list. In effect, the protocol has a write quorum equal to the current membership view size, and a read quorum of 1. This can be convenient when building applications that maintain the actual state of a database or object and for which it is inconvenient to represent state as a series of updates in command lists that must be merged to learn the actual sequence of events.

The same quorum mechanisms that define Paxos, including the inquiry used when a new Paxos leader takes over, are now seen to correspond precisely to the steps of Gbcast. The ballot mechanism, generally viewed as the hallmark of Paxos protocols, reduces to a counter that tracks the order of succession of leaders. This simplification is fundamentally due to the guarantee that once a leader is suspected, it will be removed from the view and would need to rejoin before participating in the protocol.

It follows that Gbcast and Paxos can be transformed, each to the other, without changing assumptions and with the identical correctness properties. Obviously, the protocols don't look very similar, but they have a deep connection. Indeed, one can make a stronger claim: any sequence of delivery events exhibited by Gbcast can also arise in some run of Paxos, and vice versa: any sequence of learned events from Paxos can also arise in some run of Gbcast.

The type of proof outlined above is formally called a bi-simulation: one shows that any (input-sequence, output-behavior) pair that one protocol can exhibit is also possible with the other protocol. Notice that in carrying out a bisimulation, features that one protocol supports but the other lacks can be ignored if they are not considered to be part of the "behavior" being studied. For example, the Gbcast reporting of new views (events that Paxos lacks) are not treated as output events here.

Summary of differences between Paxos and Gbcast

  • Gbcast has no durable state: the protocol does not maintain a log of events on disk, and durability is treated as an application-specific property. In contrast, Paxos guarantees durability: after recovering from a complete shutdown of the system, a Paxos application will still be able to learn the full log of received messages.
  • In the propose phase, Gbcast must wait for responses from all participants (or for the maximal timeout and then suspect the remaining ones), instead of making progress with the fastest quorum. In Gbcast, the cost of a failure suspicion is high and the protocol may cease to make progress if too many failures are suspected, forcing a management layer to restart the entire application group. Thus, in practice, Gbcast requires conservative timeout settings relative to Paxos.
  • With Gbcast, if an error does occur (e.g. an operational process is suspected and shunned), that process must drop out (it can rejoin under a different name). With Paxos, if f>0, should a process be unable to participate in a protocol instance, it can continue to participate in subsequent protocol instances without error.
  • Operational members of a view will never have gaps in their command lists with Gbcast (every member has a complete state). Operational members can have gaps in their command lists when using Paxos (learners merge a quorum of lists in Paxos to "fill" these gaps).
  • With Paxos, to propose multiple commands we use alpha>1, but in this case commands can be committed in a different order from the order in which they were initiated (one case in which this problematic scenario is seen involves dueling leaders; leader A proposes commands a1 and a2, and leader B proposes commands b1 and b2; both then fail and leader C, taking over, ends up committing b2, and then a1: an outcome that might not be desired by the applications that initiated the requests [11]). With Gbcast, the leader can initiate multiple commands by issuing a single propose that describes a series of actions. The group will be committed all at once, hence the order of initiation will be respected.
  • With Gbcast, a command is delivered in the view in which it was initiated. Reconfigurable Paxos can commit a command in a slot associated with a membership view prior to the active membership view at the time when the commit occurs. Thus, in Paxos, if an application is in some way view sensitive, commands must carry a view identifier, so that recipients can determine whether or not the command is still executable.
  • Gbcast does not require that the protocol be halted when changing configurations: the rate of new proposals can be constant even across membership changes. For many implementations of reconfigurable Paxos, this would not be the case.
  • With both Gbcast and Paxos, reconfiguration is only possible if a quorum of the prior view is accessible and can acknowledge the new view. However, in Paxos, the requirement also extends to learning the outcomes of commands proposed for slots associated with the old view. In practice, this can cause the Paxos reconfiguration computation to extend over a longer period than for Gbcast, in which any state is stored within the application, not a long-lived command list: Paxos cannot discard the state associated with an old view until the new view is active and any replicas have learned the old state.
  • Gbcast does not require a garbage collection protocol because, as each message or view is committed and reported to the application it can be discarded. Paxos maintains state using a quorum scheme in the command logs at its acceptors, and requires a garbage collection protocol to free these command slots once the outcome is committed and all learners (replicas) have learned the outcome.

Liveness comparison

Both Paxos and Gbcast are subject to the FLP impossibility result.[9] Thus neither protocol can be guaranteed live under all possible conditions. At best we can talk about the conditions under which liveness is guaranteed, expressed as predicates on the failure detection mechanism: if the condition for liveness holds, then the protocol will be live. The liveness conditions of Basic Paxos and Gbcast are similar but not identical.

In Gbcast, progress will never resume if a circle of mutual suspicions arises, as noted above: once a quorum of mutually shunning processes arises, the shunning mechanism makes it impossible for any leader to obtain a quorum of promises.

With an (unmodified) Paxos protocol, this problem will not arise: once the excessive level of mutual suspicions ends, progress resumes. Thus Paxos makes progress with any failure-detection mechanism satisfying the <>W condition, even if periods arise during which more than a quorum of mutual suspicions occur.

For example, if we start with a group containing {A.B,C} and cause an extended network partition, Paxos would resume when the network partition resolves but Gbcast will shut down permanently and some form of management infrastructure may need to restart the system. If it is necessary to preserve group state across the failure, such an infrastructure would identify the last member to fail and restart the group using some form of checkpoint stored by that last member.

In Paxos deployments, it is common to require human operator intervention to reconfigure Paxos. In such settings, Gbcast may be able to make progress during period when Paxos cannot. Suppose that a group has membership that slowly drops to less than a quorum of the original group size. Gbcast can continue to operate with even a single member. Paxos would cease to make progress during periods when less than a quorum of its view are active.

Need for state transfer

Systems such as Isis that implement Gbcast typically provide a state transfer mechanism: at the instant the new view showing some joining member is delivered, some existing member makes a checkpoint of its copy of the group state. This is then copied to the new member, which loads the checkpoint as the initial group state as of the instant it joined. (Various out-of-band copying schemes can be used to pre-load some state prior to the join for cases where the state is too large to transfer at the last moment this way). State transfer is needed because in Gbcast, once a member is dropped from a group, it will no longer receive updates. Gbcast is typically used by applications that maintain their state in memory and apply updates one by one as received, hence once a gap arises, a replica is no longer useful.

Notice that this is in contrast to Paxos. In that protocol, gaps can arise as a consequence of the basic quorum update scheme, which doesn't guarantee that every member will see every update and can run over unreliable message passing layers that might never deliver some messages. The Paxos learner algorithm reads multiple histories and combines them to fill such gaps. Thus Paxos will normally ride out transient failures, continuing to operate without actually dropping the failed member from the group. The failed member misses updates, yet state transfer is not needed unless a group is being reconfigured.

Which dynamically reconfigurable state machine replication protocol came first?

The Gbcast protocol was published early in a period when several state machine protocols capable of managing their own membership were introduced: Gbcast, View-Stamped Replication (Oki and Liskov [12]), Basic Paxos (Lamport [5]), the partial synchrony protocol of Dwork, Lynch and Stockmeyer,[13] etc. Among these, Gbcast was the first to be published, in papers that appeared in 1985 and 1987; the others were published starting in 1988. One could thus argue that Gbcast was really the first Paxos protocol. Such a statement, however, treats "Paxos" as a fairly broad term covering a family of protocols that all implement state machine replication, all support dynamic reconfiguration of their membership, and have identical correctness properties but vary in their liveness conditions. Under this definition, Gbcast is a Paxos protocol.

If equivalence is formalized using bisimulation, in which any run that one protocol can exhibit is also exhibited by the other, and in which the assumptions made and the conditions for progress are identical, the comparison becomes more complex. Under this definition, Gbcast is not a Paxos protocol: although each can exhibit the same runs as the other (viewed purely in terms of requests from the application and notifications to the application), they have similar, but not identical, liveness conditions. However, this sort of stringent definition poses a different problem: if one adopts it, some versions of Paxos are not Paxos protocols. For example, "Cheap Paxos" and "Vertical Paxos" are not bisimulation-equivalent to Basic Paxos.[14]

Thus the question has no answer unless one makes it more specific, and has a different answer depending upon the definition of equivalence one uses.

See also

References

  1. 1.0 1.1 1.2 Birman, Kenneth (Dec 1985). "Replication and Fault-Tolerance in the ISIS System". 10th ACM Symposium on Operating Systems Principles. pp. 79–86. 
  2. 2.0 2.1 Birman, Kenneth; Joseph, Thomas (February 1987). "Reliable Communication in the Presence of Failures". ACM Transactions on Computer Systems 5: 47–76. doi:10.1145/7351.7478. https://ecommons.cornell.edu/bitstream/1813/6534/1/85-694.pdf. 
  3. 3.0 3.1 Birman, Kenneth (July 1999). "A Review of Experiences with Reliable Multicast". Software: Practice and Experience 29 (9): 741–774. doi:10.1002/(sici)1097-024x(19990725)29:9<741::aid-spe259>3.0.co;2-i. 
  4. 4.0 4.1 Lamport, Leslie (July 1978). "Time, Clocks and the Ordering of Events in a Distributed System". Communications of the ACM 21 (7): 558–565. doi:10.1145/359545.359563. http://research.microsoft.com/users/lamport/pubs/pubs.html#time-clocks. Retrieved 2007-02-02. 
  5. 5.0 5.1 5.2 Lamport, Leslie (May 1998). "The Part-Time Parliament". ACM Transactions on Computer Systems 16 (2): 133–169. doi:10.1145/279227.279229. http://research.microsoft.com/users/lamport/pubs/pubs.html#lamport-paxos. Retrieved 2007-02-02. 
  6. Schneider, Fred (1990). "Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial". ACM Computing Surveys 22 (4): 299–319. doi:10.1145/98163.98167. http://www.cs.cornell.edu/fbs/publications/smsurvey.pdf. 
  7. 7.0 7.1 "Reconfiguring a State Machine". SIGACT News 41 (1): 63–73. March 2010. doi:10.1145/1753171.1753191. 
  8. Pease, Marshall; Robert Shostak; Leslie Lamport (April 1980). "Reaching Agreement in the Presence of Faults". Journal of the Association for Computing Machinery 27 (2): 228–234. doi:10.1145/322186.322188. http://research.microsoft.com/users/lamport/pubs/pubs.html#reaching. Retrieved 2007-02-02. 
  9. 9.0 9.1 Fischer, M. (April 1985). "Impossibility of distributed consensus with one faulty process". Journal of the ACM 32 (2): 374–382. doi:10.1145/3149.214121. http://portal.acm.org/ft_gateway.cfm?id=214121&type=pdf&coll=DL&dl=GUIDE&CFID=8106370&CFTOKEN=39451481. 
  10. Lamport, Leslie; Robert Shostak; Marshall Pease (July 1982). "The Byzantine Generals Problem". ACM Transactions on Programming Languages and Systems 4 (3): 382–401. doi:10.1145/357172.357176. http://research.microsoft.com/users/lamport/pubs/pubs.html#byz. Retrieved 2007-02-02. 
  11. Birman, Ken; Dahlia Malkhi; Robbert van Renesse (November 2011). Virtually Synchronous Methodology for Dynamic Service Replication. Microsoft Research TechReport MSR-2010-151. http://www.cs.cornell.edu/projects/quicksilver/public_pdfs/vs-submit.pdf. 
  12. Oki, Brian; Barbara Liskov (1988). "Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems". PODC '88: Proceedings of the seventh annual ACM Symposium on Principles of Distributed Computing. pp. 8–17. doi:10.1145/62546.62549. http://portal.acm.org/citation.cfm?id=62549. 
  13. Dwork, Cynthia; Lynch, Nancy; Stockmeyer, Larry (April 1988). "Consensus in the Presence of Partial Synchrony". Journal of the ACM 35 (2): 288–323. doi:10.1145/42282.42283. http://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf. 
  14. Lamport, Leslie (2012). "Unpublished remark".