@@ -472,33 +472,59 @@ extended to support a sharded cluster and use a **Lamport Clock** to provide **c
472472
473473## Step Up
474474
475- A node runs for election when it does a priority takeover or when it doesn't see a primary within
476- the election timeout.
477-
478- #### Candidate Perspective
479-
480- A candidate node first runs a dry-run election. In a ** dry-run election** , a node sends out
481- ` replSetRequestVotes ` commands to every node asking if that node would vote for it, but the
482- candidate node does not increase its term. If a primary ever sees a higher term than its own, it
483- steps down. By first conducting a dry-run election, we prevent nodes from increasing their own term
484- when they would not win and prevent needless primary stepdowns. If the node fails the dry-run
485- election, it just continues replicating as normal. If the node wins the dry-run election, it begins
486- a real election.
487-
488- In the real election, the node first increments its term and votes for itself. It then starts a
489- [ ` VoteRequester ` ] ( https://github.com/mongodb/mongo/blob/r3.4.2/src/mongo/db/repl/vote_requester.h ) ,
475+ There are a number of ways that a node will run for election:
476+ * If it hasn't seen a primary within the election timeout (which defaults to 10 seconds).
477+ * If it realizes that it has higher priority than the primary, it will wait and run for
478+ election (also known as a ** priority takeover** ). The amount of time the node waits before calling
479+ an election is directly related to its priority in comparison to the priority of the rest of the set
480+ (so higher priority nodes will call for a priority takeover faster than lower priority nodes).
481+ Priority takeovers allow users to specify a node that they would prefer be the primary.
482+ * Newly elected primaries attempt to catch up to the latest applied OpTime in the replica
483+ set. Until this process (called primary catchup) completes, the new primary will not accept
484+ writes. If a secondary realizes that it is more up-to-date than the primary and the primary takes
485+ longer than ` catchUpTakeoverDelayMillis ` (default 30 seconds), it will run for election. This
486+ behavior is known as a ** catchup takeover** . If primary catchup is taking too long, catchup
487+ takeover can help allow the replica set to accept writes sooner, since a more up-to-date node will
488+ not spend as much time (or any time) in catchup. See the "Transitioning to PRIMARY" section for
489+ further details on primary catchup.
490+ * The ` replSetStepUp ` command can be run on an eligible node to cause it to run for election
491+ immediately. We don't expect users to call this command, but it is run internally for election
492+ handoff and testing.
493+ * When a node is stepped down via the ` replSetStepDown ` command, if the ` enableElectionHandoff `
494+ parameter is set to true (the default), it will choose an eligible secondary to run the
495+ ` replSetStepUp ` command on a best-effort basis. This behavior is called ** election handoff** . It
496+ shortens failover time, since the replica set skips waiting for the election
497+ timeout. If ` replSetStepDown ` was called with ` force: true ` or the node was stepped down while
498+ ` enableElectionHandoff ` is false, then nodes in the replica set will wait until the election
499+ timeout triggers to run for election.
500+
501+
502+ ### Candidate Perspective
503+
504+ A candidate node first runs a dry-run election. In a ** dry-run election** , a node starts a
505+ [ ` VoteRequester ` ] ( https://github.com/mongodb/mongo/blob/r4.2.0/src/mongo/db/repl/vote_requester.h ) ,
490506which uses a
491- [ ` ScatterGatherRunner ` ] ( https://github.com/mongodb/mongo/blob/r3.4.2/src/mongo/db/repl/scatter_gather_runner.h )
492- to send a ` replSetRequestVotes ` command to every single node. Each node then decides if it should
493- vote "aye" or "nay" and responds to the candidate with their vote. When nodes respond, the
494- ` ReplicationCoordinator ` updates its ` SlaveInfo ` map to say that those nodes are still up for
495- liveness information. The candidate node must be at least as up to date as a majority of voting
507+ [ ` ScatterGatherRunner ` ] ( https://github.com/mongodb/mongo/blob/r4.2.0/src/mongo/db/repl/scatter_gather_runner.h )
508+ to send a ` replSetRequestVotes ` command to every node asking if that node would vote for it. The
509+ candidate node does not increase its term during a dry-run because if a primary ever sees a higher
510+ term than its own, it steps down. By first conducting a dry-run election, we make it unlikely that
511+ nodes will increase their own term when they would not win and prevent needless primary stepdowns.
512+ If the node fails the dry-run election, it just continues replicating as normal. If the node wins
513+ the dry-run election, it begins a real election.
514+
515+ If the candidate was stepped up as a result of an election handoff, it will skip the dry-run and
516+ immediately call for a real election.
517+
518+ In the real election, the node first increments its term and votes for itself. It then follows the
519+ same process as the dry-run to start a ` VoteRequester ` to send a ` replSetRequestVotes ` command to
520+ every single node. Each node then decides if it should vote "aye" or "nay" and responds to the
521+ candidate with their vote. The candidate node must be at least as up to date as a majority of voting
496522members in order to get elected.
497523
498524If the candidate received votes from a majority of nodes, including itself, the candidate wins the
499525election.
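
The candidate flow described above can be sketched as follows. This is a minimal illustration, not the server's implementation: the names `Candidate`, `has_majority`, and `run_for_election` are hypothetical, voters are modeled as plain callables, and the sketch assumes the dry run asks for votes at the candidate's current (unincremented) term.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical model: a voter is a callable (term, dry_run) -> True for "aye".
Voter = Callable[[int, bool], bool]


@dataclass
class Candidate:
    term: int


def has_majority(ayes_from_others: int, num_voting_members: int) -> bool:
    # The candidate's own vote counts toward the majority.
    return ayes_from_others + 1 > num_voting_members // 2


def run_for_election(candidate: Candidate,
                     other_voters: List[Voter],
                     skip_dry_run: bool = False) -> bool:
    """Return True if the candidate wins the real election."""
    total = len(other_voters) + 1

    if not skip_dry_run:
        # Dry run: ask for votes WITHOUT increasing the term, so that a
        # losing candidate does not force a healthy primary to step down.
        ayes = sum(1 for vote in other_voters if vote(candidate.term, True))
        if not has_majority(ayes, total):
            return False  # keep replicating as normal

    # Real election: increment the term, vote for self, request votes again.
    candidate.term += 1
    ayes = sum(1 for vote in other_voters if vote(candidate.term, False))
    return has_majority(ayes, total)
```

Passing `skip_dry_run=True` corresponds to the election handoff case, where the candidate goes straight to the real election.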
500526
501- #### Voter Perspective
527+ ### Voter Perspective
502528
503529When a node receives a ` replSetRequestVotes ` command, it first checks if the term is up to date and
504530updates its own term accordingly. The ` ReplicationCoordinator ` then asks the ` TopologyCoordinator `
@@ -507,7 +533,7 @@ if it should grant a vote. The vote is rejected if:
5075331 . It's from an older term.
5085342 . The config versions do not match.
5095353 . The replica set name does not match.
510- 4 . The last committed OpTime that comes in the vote request is older than the voter's last applied
536+ 4 . The last applied OpTime that comes in the vote request is older than the voter's last applied
511537 OpTime.
5125385 . If it's not a dry-run election and the voter has already voted in this term.
5135396 . If the voter is an arbiter and it can see a healthy primary of greater or equal priority. This is
@@ -519,32 +545,40 @@ the `local.replset.election` collection. This information is read into memory at
519545future elections. This ensures that even if a node restarts, it does not vote for two nodes in the
520546same term.
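
The rejection rules listed above can be sketched as a single decision function. This is an illustrative model only: `VoteRequest`, `Voter`, and the field names are hypothetical, an OpTime is assumed to be a `(term, timestamp)` tuple compared lexicographically, and the arbiter rule is reduced to a precomputed boolean. A real voter would also persist its last vote durably (the text mentions the `local.replset.election` collection); here it is only kept in memory.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumption: an OpTime is modeled as (term, timestamp), compared as a tuple.
OpTime = Tuple[int, int]


@dataclass
class VoteRequest:
    term: int
    config_version: int
    set_name: str
    last_applied: OpTime
    dry_run: bool


@dataclass
class Voter:
    term: int
    config_version: int
    set_name: str
    last_applied: OpTime
    last_vote_term: Optional[int] = None
    is_arbiter: bool = False
    # Hypothetical flag standing in for the arbiter's view of the primary.
    sees_healthy_primary_of_geq_priority: bool = False

    def process(self, req: VoteRequest) -> Tuple[bool, str]:
        """Return (granted, reason); reason is empty when the vote is granted."""
        # First bring our own term up to date with the request.
        if req.term > self.term:
            self.term = req.term

        if req.term < self.term:
            return False, "candidate's term is older"
        if req.config_version != self.config_version:
            return False, "config versions do not match"
        if req.set_name != self.set_name:
            return False, "replica set name does not match"
        if req.last_applied < self.last_applied:
            return False, "candidate's last applied OpTime is older"
        if not req.dry_run and self.last_vote_term == req.term:
            return False, "already voted in this term"
        if self.is_arbiter and self.sees_healthy_primary_of_geq_priority:
            return False, "arbiter can see a healthy primary"

        if not req.dry_run:
            # Only real votes are recorded; a dry run does not use up the vote.
            self.last_vote_term = req.term
        return True, ""
```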
521547
522- #### Transitioning to PRIMARY
523-
524- Now that the candidate has won, it must become PRIMARY. First it notifies all nodes that it won the
525- election via a round of heartbeats. Then the node checks if it needs to catch up from the former
526- primary. Since the node can be elected without the former primary's vote, the primary-elect will
527- attempt to replicate any remaining oplog entries it has not yet replicated from any viable sync
528- source. While these are guaranteed to not be committed, it is still good to minimize rollback when
529- possible.
530-
531- The primary-elect uses the
532- [ ` FreshnessScanner ` ] ( https://github.com/mongodb/mongo/blob/r3.4.2/src/mongo/db/repl/freshness_scanner.h )
533- to send a ` replSetGetStatus ` request to every other node to see the last applied OpTime of every
534- other node. If the primary-elect’s last applied OpTime is less than the newest last applied OpTime
535- it sees, it will schedule a timer for the catchup-timeout. If that timeout expires or if the node
536- reaches the old primary's last applied OpTime, then the node ends catch-up phase. The node then
537- stops the ` OplogFetcher ` .
538-
539- At this point the node goes into "drain mode". This is when the node has already logged "transition
540- to PRIMARY", but has not yet applied all of the oplog entries in its oplog buffer.
541- ` replSetGetStatus ` will now say the node is in PRIMARY state. The applier keeps running, and when it
542- completely drains the buffer, it signals to the ` ReplicationCoordinator ` to finish the step up
543- process. The node marks that it can begin to accept writes. According to the Raft Protocol, no oplog
544- entries from previous terms can be committed until an oplog entry in the current term is committed.
545- The node now writes a "new primary" noop oplog entry so that it can commit older writes as soon as
546- possible. Finally, the node drops all temporary collections and logs “transition to primary
547- complete”.
548+ ### Transitioning to PRIMARY
549+
550+ Now that the candidate has won, it must become PRIMARY. First it clears its sync source and notifies
551+ all nodes that it won the election via a round of heartbeats. Then the node checks if it needs to
552+ catch up from the former primary. Since the node can be elected without the former primary's vote,
553+ the primary-elect will attempt to replicate any remaining oplog entries it has not yet replicated
554+ from any viable sync source. While these are guaranteed to not be committed, it is still good to
555+ minimize rollback when possible.
556+
557+ The primary-elect uses the responses from the recent round of heartbeats to see the latest applied
558+ OpTime of every other node. If the primary-elect’s last applied OpTime is less than the newest last
559+ applied OpTime it sees, it will set that as its target OpTime to catch up to. At the beginning of
560+ catchup, the primary-elect will schedule a timer for the catchup-timeout. If that timeout expires or
561+ if the node reaches the target OpTime, then the node ends the catch-up phase. The node then clears
562+ its sync source and stops the ` OplogFetcher ` .
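
As a rough sketch of the catchup logic just described, the target selection and the exit condition might look like the following. The function names and the `(term, timestamp)` tuple model of an OpTime are assumptions made for illustration, not the server's actual code.

```python
from typing import Iterable, Optional, Tuple

# Assumption: an OpTime is modeled as (term, timestamp), compared as a tuple.
OpTime = Tuple[int, int]


def choose_catchup_target(own_last_applied: OpTime,
                          heartbeat_last_applied: Iterable[OpTime]) -> Optional[OpTime]:
    """Pick the newest last applied OpTime seen in heartbeat responses.

    Returns None when no other node is ahead, i.e. no catchup is needed.
    """
    newest = max(heartbeat_last_applied, default=own_last_applied)
    return newest if newest > own_last_applied else None


def catchup_finished(last_applied: OpTime,
                     target: Optional[OpTime],
                     elapsed_ms: int,
                     catchup_timeout_ms: int) -> bool:
    """Catchup ends when the target is reached or the timeout expires."""
    if target is None:
        return True
    return last_applied >= target or elapsed_ms >= catchup_timeout_ms
```

Either exit path leads to the same next step: the node stops fetching and moves on, whether or not it reached the target.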
563+
564+ Primary catchup ignores whether or not ** chaining** is enabled so that the primary-elect
565+ can find a sync source. Note that the primary-elect will not necessarily sync
566+ from the most up-to-date node, but its sync source will sync from a more up-to-date node, so
567+ the primary-elect will still be able to catch up to its target OpTime. Since catchup is
568+ best-effort, it could time out before the node has applied operations through the target OpTime.
569+ Even if this happens, the primary-elect will not step down.
570+
571+ At this point, whether catchup was successful or not, the node goes into "drain mode". This is when
572+ the node has already logged "transition to PRIMARY", but has not yet applied all of the oplog
573+ entries in its oplog buffer. ` replSetGetStatus ` will now say the node is in PRIMARY state. The
574+ applier keeps running, and when it completely drains the buffer, it signals to the
575+ ` ReplicationCoordinator ` to finish the step up process. The node marks that it can begin to accept
576+ writes. According to the Raft Protocol, we cannot update the commit point to reflect oplog entries
577+ from previous terms until the commit point is updated to reflect an oplog entry in the current term.
578+ The node writes a "new primary" noop oplog entry so that it can commit older writes as soon as
579+ possible. Once the commit point is updated to reflect the "new primary" oplog entry, older writes
580+ will automatically be part of the commit point by nature of happening before the term change.
581+ Finally, the node drops all temporary collections and logs “transition to primary complete”.
548582
549583## Step Down
550584