Commit 2411d2f

lankas authored and evergreen committed
SERVER-43387 Summarize changes to Election Protocol in Replication Architecture Guide
1 parent ca5ed31 commit 2411d2f

1 file changed

+82
-48
lines changed

src/mongo/db/repl/README.md

Lines changed: 82 additions & 48 deletions
@@ -472,33 +472,59 @@ extended to support a sharded cluster and use a **Lamport Clock** to provide **c
472472

473473
## Step Up
474474

475-
A node runs for election when it does a priority takeover or when it doesn't see a primary within
476-
the election timeout.
477-
478-
#### Candidate Perspective
479-
480-
A candidate node first runs a dry-run election. In a **dry-run election**, a node sends out
481-
`replSetRequestVotes` commands to every node asking if that node would vote for it, but the
482-
candidate node does not increase its term. If a primary ever sees a higher term than its own, it
483-
steps down. By first conducting a dry-run election, we prevent nodes from increasing their own term
484-
when they would not win and prevent needless primary stepdowns. If the node fails the dry-run
485-
election, it just continues replicating as normal. If the node wins the dry-run election, it begins
486-
a real election.
487-
488-
In the real election, the node first increments its term and votes for itself. It then starts a
489-
[`VoteRequester`](https://github.com/mongodb/mongo/blob/r3.4.2/src/mongo/db/repl/vote_requester.h),
475+
There are a number of ways that a node will run for election:
476+
* If it hasn't seen a primary within the election timeout (which defaults to 10 seconds).
477+
* If it realizes that it has higher priority than the primary, it will wait and run for
478+
election (also known as a **priority takeover**). The amount of time the node waits before calling
479+
an election is directly related to its priority in comparison to the priority of the rest of the set
480+
(so higher priority nodes will call for a priority takeover faster than lower priority nodes).
481+
Priority takeovers allow users to specify a node that they would prefer be the primary.
482+
* Newly elected primaries attempt to catch up to the latest applied OpTime in the replica
483+
set. Until this process (called primary catchup) completes, the new primary will not accept
484+
writes. If a secondary realizes that it is more up-to-date than the primary and the primary takes
485+
longer than `catchUpTakeoverDelayMillis` (default 30 seconds), it will run for election. This
486+
behavior is known as a **catchup takeover**. If primary catchup is taking too long, catchup
487+
takeover can help allow the replica set to accept writes sooner, since a more up-to-date node will
488+
not spend as much time (or any time) in catchup. See the "Transitioning to PRIMARY" section for
489+
further details on primary catchup.
490+
* The `replSetStepUp` command can be run on an eligible node to cause it to run for election
491+
immediately. We don't expect users to call this command, but it is run internally for election
492+
handoff and testing.
493+
* When a node is stepped down via the `replSetStepDown` command, if the `enableElectionHandoff`
494+
parameter is set to true (the default), it will choose an eligible secondary to run the
495+
`replSetStepUp` command on a best-effort basis. This behavior is called **election handoff**. It
496+
shortens failover time for the replica set, since it skips waiting for the election
497+
timeout. If `replSetStepDown` was called with `force: true` or the node was stepped down while
498+
`enableElectionHandoff` is false, then nodes in the replica set will wait until the election
499+
timeout triggers to run for election.
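
The election handoff rule in the last bullet can be sketched as a small decision function. This is an illustrative Python sketch, not the server's implementation; `should_attempt_handoff` and its parameter names are hypothetical and only mirror the `force` option and the `enableElectionHandoff` parameter described above.

```python
def should_attempt_handoff(force: bool, enable_election_handoff: bool = True) -> bool:
    """Return True if a node stepping down via replSetStepDown would try
    election handoff (hypothetical helper, for illustration only).

    Handoff is attempted only when the parameter is enabled (the default)
    and the stepdown was not forced. Otherwise the remaining nodes wait
    for the election timeout before running for election.
    """
    return enable_election_handoff and not force
```

For example, `should_attempt_handoff(force=True)` is false, so the set falls back to waiting for the election timeout.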
500+
501+
502+
### Candidate Perspective
503+
504+
A candidate node first runs a dry-run election. In a **dry-run election**, a node starts a
505+
[`VoteRequester`](https://github.com/mongodb/mongo/blob/r4.2.0/src/mongo/db/repl/vote_requester.h),
490506
which uses a
491-
[`ScatterGatherRunner`](https://github.com/mongodb/mongo/blob/r3.4.2/src/mongo/db/repl/scatter_gather_runner.h)
492-
to send a `replSetRequestVotes` command to every single node. Each node then decides if it should
493-
vote "aye" or "nay" and responds to the candidate with their vote. When nodes respond, the
494-
`ReplicationCoordinator` updates its `SlaveInfo` map to say that those nodes are still up for
495-
liveness information. The candidate node must be at least as up to date as a majority of voting
507+
[`ScatterGatherRunner`](https://github.com/mongodb/mongo/blob/r4.2.0/src/mongo/db/repl/scatter_gather_runner.h)
508+
to send a `replSetRequestVotes` command to every node asking if that node would vote for it. The
509+
candidate node does not increase its term during a dry-run because if a primary ever sees a higher
510+
term than its own, it steps down. By first conducting a dry-run election, we make it unlikely that
511+
nodes will increase their own term when they would not win and prevent needless primary stepdowns.
512+
If the node fails the dry-run election, it just continues replicating as normal. If the node wins
513+
the dry-run election, it begins a real election.
514+
515+
If the candidate was stepped up as a result of an election handoff, it will skip the dry-run and
516+
immediately call for a real election.
517+
518+
In the real election, the node first increments its term and votes for itself. It then follows the
519+
same process as the dry-run to start a `VoteRequester` to send a `replSetRequestVotes` command to
520+
every single node. Each node then decides if it should vote "aye" or "nay" and responds to the
521+
candidate with their vote. The candidate node must be at least as up to date as a majority of voting
496522
members in order to get elected.
497523

498524
If the candidate received votes from a majority of nodes, including itself, the candidate wins the
499525
election.
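
The candidate's flow described above can be sketched roughly as follows. This is a simplified Python model, not the `VoteRequester` code: the `Candidate` class, the `Voter` callable type, and `run_for_election` are hypothetical names, and the sketch assumes the candidate also counts its own vote during the dry run.

```python
from dataclasses import dataclass
from typing import Callable, List

# A voter is modelled as a function taking the dry-run flag and
# returning True for "aye" and False for "nay" (an assumption of this sketch).
Voter = Callable[[bool], bool]


@dataclass
class Candidate:
    term: int


def has_majority(ayes: int, voting_members: int) -> bool:
    """A strict majority of all voting members, the candidate included."""
    return ayes > voting_members // 2


def run_for_election(candidate: Candidate, others: List[Voter],
                     skip_dry_run: bool = False) -> bool:
    total = len(others) + 1  # the other voters plus the candidate itself
    if not skip_dry_run:
        # Dry run: ask for votes without touching our own term.
        ayes = 1 + sum(vote(True) for vote in others)
        if not has_majority(ayes, total):
            return False  # keep replicating as normal, term unchanged
    # Real election: increment the term and vote for ourselves first.
    candidate.term += 1
    ayes = 1 + sum(vote(False) for vote in others)
    return has_majority(ayes, total)
```

Passing `skip_dry_run=True` models a candidate that was stepped up through election handoff. Note that in this model a failed dry run leaves `candidate.term` untouched, which is the point of running it first.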
500526

501-
#### Voter Perspective
527+
### Voter Perspective
502528

503529
When a node receives a `replSetRequestVotes` command, it first checks if the term is up to date and
504530
updates its own term accordingly. The `ReplicationCoordinator` then asks the `TopologyCoordinator`
@@ -507,7 +533,7 @@ if it should grant a vote. The vote is rejected if:
507533
1. It's from an older term.
508534
2. The config versions do not match.
509535
3. The replica set name does not match.
510-
4. The last committed OpTime that comes in the vote request is older than the voter's last applied
536+
4. The last applied OpTime that comes in the vote request is older than the voter's last applied
511537
OpTime.
512538
5. If it's not a dry-run election and the voter has already voted in this term.
513539
6. If the voter is an arbiter and it can see a healthy primary of greater or equal priority. This is
@@ -519,32 +545,40 @@ the `local.replset.election` collection. This information is read into memory at
519545
future elections. This ensures that even if a node restarts, it does not vote for two nodes in the
520546
same term.
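
The rejection rules listed above can be read as an ordered series of checks. The Python sketch below is only an illustration of that list: `VoteRequest`, `VoterState`, and `rejection_reason` are hypothetical names, OpTimes are modelled as `(term, timestamp)` tuples compared lexicographically (an assumption), and the voter's term is assumed to have already been updated from the request.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

OpTime = Tuple[int, int]  # (term, timestamp); a simplification for this sketch


@dataclass
class VoteRequest:
    term: int
    config_version: int
    set_name: str
    last_applied: OpTime
    dry_run: bool


@dataclass
class VoterState:
    term: int
    config_version: int
    set_name: str
    last_applied: OpTime
    voted_in_term: Optional[int] = None  # term of the last real vote, if any
    is_arbiter: bool = False
    sees_healthy_primary_of_ge_priority: bool = False


def rejection_reason(req: VoteRequest, voter: VoterState) -> Optional[str]:
    """Return why the vote is rejected, or None if it is granted."""
    if req.term < voter.term:
        return "older term"
    if req.config_version != voter.config_version:
        return "config version mismatch"
    if req.set_name != voter.set_name:
        return "replica set name mismatch"
    if req.last_applied < voter.last_applied:
        return "candidate's last applied OpTime is older"
    if not req.dry_run and voter.voted_in_term == req.term:
        return "already voted in this term"
    if voter.is_arbiter and voter.sees_healthy_primary_of_ge_priority:
        return "arbiter sees a healthy primary"
    return None
```

Because the "already voted" check is skipped for dry runs, a voter in this model can answer "aye" to a dry-run request in a term where it has already cast a real vote.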
521547

522-
#### Transitioning to PRIMARY
523-
524-
Now that the candidate has won, it must become PRIMARY. First it notifies all nodes that it won the
525-
election via a round of heartbeats. Then the node checks if it needs to catch up from the former
526-
primary. Since the node can be elected without the former primary's vote, the primary-elect will
527-
attempt to replicate any remaining oplog entries it has not yet replicated from any viable sync
528-
source. While these are guaranteed to not be committed, it is still good to minimize rollback when
529-
possible.
530-
531-
The primary-elect uses the
532-
[`FreshnessScanner`](https://github.com/mongodb/mongo/blob/r3.4.2/src/mongo/db/repl/freshness_scanner.h)
533-
to send a `replSetGetStatus` request to every other node to see the last applied OpTime of every
534-
other node. If the primary-elect’s last applied OpTime is less than the newest last applied OpTime
535-
it sees, it will schedule a timer for the catchup-timeout. If that timeout expires or if the node
536-
reaches the old primary's last applied OpTime, then the node ends catch-up phase. The node then
537-
stops the `OplogFetcher`.
538-
539-
At this point the node goes into "drain mode". This is when the node has already logged "transition
540-
to PRIMARY", but has not yet applied all of the oplog entries in its oplog buffer.
541-
`replSetGetStatus` will now say the node is in PRIMARY state. The applier keeps running, and when it
542-
completely drains the buffer, it signals to the `ReplicationCoordinator` to finish the step up
543-
process. The node marks that it can begin to accept writes. According to the Raft Protocol, no oplog
544-
entries from previous terms can be committed until an oplog entry in the current term is committed.
545-
The node now writes a "new primary" noop oplog entry so that it can commit older writes as soon as
546-
possible. Finally, the node drops all temporary collections and logs “transition to primary
547-
complete”.
548+
### Transitioning to PRIMARY
549+
550+
Now that the candidate has won, it must become PRIMARY. First it clears its sync source and notifies
551+
all nodes that it won the election via a round of heartbeats. Then the node checks if it needs to
552+
catch up from the former primary. Since the node can be elected without the former primary's vote,
553+
the primary-elect will attempt to replicate any remaining oplog entries it has not yet replicated
554+
from any viable sync source. While these are guaranteed to not be committed, it is still good to
555+
minimize rollback when possible.
556+
557+
The primary-elect uses the responses from the recent round of heartbeats to see the latest applied
558+
OpTime of every other node. If the primary-elect’s last applied OpTime is less than the newest last
559+
applied OpTime it sees, it will set that as its target OpTime to catch up to. At the beginning of
560+
catchup, the primary-elect will schedule a timer for the catchup-timeout. If that timeout expires or
561+
if the node reaches the target OpTime, then the node ends the catch-up phase. The node then clears
562+
its sync source and stops the `OplogFetcher`.
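
As a rough illustration of how the catchup target and the end of the catch-up phase are determined, consider the Python sketch below. The function names are hypothetical, OpTimes are modelled as `(term, timestamp)` tuples (an assumption), and the heartbeat responses are reduced to a plain list of last applied OpTimes.

```python
from typing import Iterable, Optional, Tuple

OpTime = Tuple[int, int]  # (term, timestamp); a simplification for this sketch


def catchup_target(own_last_applied: OpTime,
                   heartbeat_last_applied: Iterable[OpTime]) -> Optional[OpTime]:
    """Return the OpTime to catch up to, or None if no catchup is needed."""
    newest = max(heartbeat_last_applied, default=own_last_applied)
    # Only catch up if some other node is strictly ahead of us.
    return newest if own_last_applied < newest else None


def catchup_finished(last_applied: OpTime, target: OpTime, timed_out: bool) -> bool:
    """Catch-up ends when the timer fires or the target OpTime is reached."""
    return timed_out or last_applied >= target
```

In this model a primary-elect that is already the most up-to-date node gets `None` as its target and has nothing to catch up on.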
563+
564+
We will ignore whether or not **chaining** is enabled for primary catchup so that the primary-elect
565+
can find a sync source. Note that the primary-elect will not necessarily sync
565+
from the most up-to-date node, but its sync source will sync from a more up-to-date node. This
566+
means that the primary-elect will still be able to catch up to its target OpTime. Since catchup is
568+
best-effort, it could time out before the node has applied operations through the target OpTime.
569+
Even if this happens, the primary-elect will not step down.
570+
571+
At this point, whether catchup was successful or not, the node goes into "drain mode". This is when
572+
the node has already logged "transition to PRIMARY", but has not yet applied all of the oplog
573+
entries in its oplog buffer. `replSetGetStatus` will now say the node is in PRIMARY state. The
574+
applier keeps running, and when it completely drains the buffer, it signals to the
575+
`ReplicationCoordinator` to finish the step up process. The node marks that it can begin to accept
576+
writes. According to the Raft Protocol, we cannot update the commit point to reflect oplog entries
577+
from previous terms until the commit point is updated to reflect an oplog entry in the current term.
578+
The node writes a "new primary" noop oplog entry so that it can commit older writes as soon as
579+
possible. Once the commit point is updated to reflect the "new primary" oplog entry, older writes
580+
will automatically be part of the commit point by nature of happening before the term change.
581+
Finally, the node drops all temporary collections and logs “transition to primary complete”.
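
The commit point rule above can be made concrete with a small sketch. This is a simplified Python model of the Raft restriction, not the server's commit point logic: `advance_commit_point` is a hypothetical name, and OpTimes are modelled as `(term, timestamp)` tuples.

```python
from typing import Tuple

OpTime = Tuple[int, int]  # (term, timestamp); a simplification for this sketch


def advance_commit_point(commit_point: OpTime,
                         majority_replicated: OpTime,
                         current_term: int) -> OpTime:
    """Advance the commit point only to an entry written in the current term.

    Entries from earlier terms become committed implicitly once a
    current-term entry (such as the "new primary" noop) is committed,
    because they precede it in the oplog.
    """
    if majority_replicated[0] != current_term:
        return commit_point  # an older-term entry cannot move the commit point
    return max(commit_point, majority_replicated)
```

For instance, with a new primary in term 4, a majority-replicated write at `(3, 10)` does not move the commit point, but once the noop at `(4, 11)` is majority-replicated the commit point becomes `(4, 11)`, which also covers `(3, 10)`.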
548582

549583
## Step Down
550584
