2.2. Replicator Database

Changed in version 2.1.0: Scheduling replicator was introduced. Replication states, by default, are no longer written back to documents. There are new replication job states and new API endpoints _scheduler/jobs and _scheduler/docs.

Changed in version 3.2.0: Fair share scheduling was introduced. Multiple _replicator databases get an equal (configurable) chance of running their jobs. Previously, replication jobs were scheduled without regard to their originating database.

Changed in version 3.3.0: The winning_revs_only: true replicator option was added to replicate only the winning document revisions.

The _replicator database works like any other in CouchDB, but documents added to it will trigger replications. Create (PUT or POST) a document to start a replication. DELETE a replication document to cancel an ongoing replication.

These documents have exactly the same content as the JSON objects we used to POST to _replicate (fields source, target, create_target, create_target_params, continuous, doc_ids, filter, query_params, use_checkpoints, checkpoint_interval).

Replication documents can have a user defined _id (handy for finding a specific replication request later). Design Documents (and _local documents) added to the replicator database are ignored.

The default replicator database is _replicator. Additional replicator databases can be created. To be recognized as such by the system, their database names should end with /_replicator.

2.2.1. Basics

Let’s say you POST the following document into _replicator:

{
    "_id": "my_rep",
    "source": "http://user:password@myserver.com/foo",
    "target": {
        "url": "http://localhost:5984/bar",
        "auth": {
            "basic": {
                "username": "adm",
                "password": "pass"
            }
        }
    },
    "create_target": true,
    "continuous": true
}
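
Such a document can be created with a single HTTP request. A minimal sketch, assuming the admin credentials adm:pass used elsewhere in this section and the JSON above saved in a file named rep.json:

$ curl -X POST http://adm:pass@localhost:5984/_replicator \
     -H "Content-Type: application/json" -d @rep.json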

In the couch log you’ll see 2 entries like these:

[notice] 2017-04-05T17:16:19.646716Z node1@127.0.0.1 <0.29432.0> -------- Replication `"a81a78e822837e66df423d54279c15fe+continuous+create_target"` is using:
    4 worker processes
    a worker batch size of 500
    20 HTTP connections
    a connection timeout of 30000 milliseconds
    10 retries per request
    socket options are: [{keepalive,true},{nodelay,false}]
[notice] 2017-04-05T17:16:19.646759Z node1@127.0.0.1 <0.29432.0> -------- Document `my_rep` triggered replication `a81a78e822837e66df423d54279c15fe+continuous+create_target`

The replication state of this document can then be queried from http://adm:pass@localhost:5984/_scheduler/docs/_replicator/my_rep
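
For example, the same state can be fetched with curl:

$ curl http://adm:pass@localhost:5984/_scheduler/docs/_replicator/my_rep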

{
    "database": "_replicator",
    "doc_id": "my_rep",
    "error_count": 0,
    "id": "a81a78e822837e66df423d54279c15fe+continuous+create_target",
    "info": {
        "revisions_checked": 113,
        "missing_revisions_found": 113,
        "docs_read": 113,
        "docs_written": 113,
        "changes_pending": 0,
        "doc_write_failures": 0,
        "checkpointed_source_seq": "113-g1AAAACTeJzLYWBgYMpgTmHgz8tPSTV0MDQy1zMAQsMckEQiQ1L9____szKYE01ygQLsZsYGqcamiZjKcRqRxwIkGRqA1H-oSbZgk1KMLCzTDE0wdWUBAF6HJIQ",
        "source_seq": "113-g1AAAACTeJzLYWBgYMpgTmHgz8tPSTV0MDQy1zMAQsMckEQiQ1L9____szKYE01ygQLsZsYGqcamiZjKcRqRxwIkGRqA1H-oSbZgk1KMLCzTDE0wdWUBAF6HJIQ",
        "through_seq": "113-g1AAAACTeJzLYWBgYMpgTmHgz8tPSTV0MDQy1zMAQsMckEQiQ1L9____szKYE01ygQLsZsYGqcamiZjKcRqRxwIkGRqA1H-oSbZgk1KMLCzTDE0wdWUBAF6HJIQ"
    },
    "last_updated": "2017-04-05T19:18:15Z",
    "node": "node1@127.0.0.1",
    "source_proxy": null,
    "target_proxy": null,
    "source": "http://myserver.com/foo/",
    "start_time": "2017-04-05T19:18:15Z",
    "state": "running",
    "target": "http://localhost:5984/bar/"
}

The state is running, which means the replicator has scheduled this replication job to run. The replication document contents stay the same. Previously, before version 2.1, the document was updated with a triggered state.

The replication job will also appear in

http://adm:pass@localhost:5984/_scheduler/jobs

{
    "jobs": [
        {
            "database": "_replicator",
            "doc_id": "my_rep",
            "history": [
                {
                    "timestamp": "2017-04-05T19:18:15Z",
                    "type": "started"
                },
                {
                    "timestamp": "2017-04-05T19:18:15Z",
                    "type": "added"
                }
            ],
            "id": "a81a78e822837e66df423d54279c15fe+continuous+create_target",
            "info": {
                "changes_pending": 0,
                "checkpointed_source_seq": "113-g1AAAACTeJzLYWBgYMpgTmHgz8tPSTV0MDQy1zMAQsMckEQiQ1L9____szKYE01ygQLsZsYGqcamiZjKcRqRxwIkGRqA1H-oSbZgk1KMLCzTDE0wdWUBAF6HJIQ",
                "doc_write_failures": 0,
                "docs_read": 113,
                "docs_written": 113,
                "missing_revisions_found": 113,
                "revisions_checked": 113,
                "source_seq": "113-g1AAAACTeJzLYWBgYMpgTmHgz8tPSTV0MDQy1zMAQsMckEQiQ1L9____szKYE01ygQLsZsYGqcamiZjKcRqRxwIkGRqA1H-oSbZgk1KMLCzTDE0wdWUBAF6HJIQ",
                "through_seq": "113-g1AAAACTeJzLYWBgYMpgTmHgz8tPSTV0MDQy1zMAQsMckEQiQ1L9____szKYE01ygQLsZsYGqcamiZjKcRqRxwIkGRqA1H-oSbZgk1KMLCzTDE0wdWUBAF6HJIQ"
            },
            "node": "node1@127.0.0.1",
            "pid": "<0.1174.0>",
            "source": "http://myserver.com/foo/",
            "start_time": "2017-04-05T19:18:15Z",
            "target": "http://localhost:5984/bar/",
            "user": null
        }
    ],
    "offset": 0,
    "total_rows": 1
}

_scheduler/jobs shows more information, such as a detailed history of state changes. If a persistent replication has not yet started, has failed, or has completed, information about its state can be found only in _scheduler/docs. Keep in mind that some replication documents could be invalid and never become replication jobs. Others might be delayed because they are fetching data from a slow source database.
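
For example, a summary of all replication documents, including those that have failed or completed, can be listed like this (again assuming the admin credentials used above):

$ curl http://adm:pass@localhost:5984/_scheduler/docs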

If there is an error, for example if the source database is missing, the replication job will crash and retry after a wait period. Each successive crash will result in a longer waiting period.

For example, POST-ing this document

{
    "_id": "my_rep_crashing",
    "source": "http://user:password@myserver.com/missing",
    "target": {
        "url": "http://localhost:5984/bar",
        "auth": {
            "basic": {
                "username": "adm",
                "password": "pass"
            }
        }
    },
    "create_target": true,
    "continuous": true
}

when the source database is missing, will result in periodic starts and crashes at increasingly longer intervals. The history list from _scheduler/jobs for this replication would look something like this:

[
      {
          "reason": "db_not_found: could not open http://adm:*****@localhost:5984/missing/",
          "timestamp": "2017-04-05T20:55:10Z",
          "type": "crashed"
      },
      {
          "timestamp": "2017-04-05T20:55:10Z",
          "type": "started"
      },
      {
          "reason": "db_not_found: could not open http://adm:*****@localhost:5984/missing/",
          "timestamp": "2017-04-05T20:47:10Z",
          "type": "crashed"
      },
      {
          "timestamp": "2017-04-05T20:47:10Z",
          "type": "started"
      }
]

_scheduler/docs shows a shorter summary:

{
      "database": "_replicator",
      "doc_id": "my_rep_crashing",
      "error_count": 6,
      "id": "cb78391640ed34e9578e638d9bb00e44+create_target",
      "info": {
           "error": "db_not_found: could not open http://myserver.com/missing/"
      },
      "last_updated": "2017-04-05T20:55:10Z",
      "node": "node1@127.0.0.1",
      "source_proxy": null,
      "target_proxy": null,
      "source": "http://myserver.com/missing/",
      "start_time": "2017-04-05T20:38:34Z",
      "state": "crashing",
      "target": "http://localhost:5984/bar/"
}

Repeated crashes are reported as the crashing state. The -ing suffix implies this is a temporary state: the user could create the missing database at any moment, and the replication job would then return to normal.
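
For example, creating the missing source database would let the crashing job recover on one of its next retries. A sketch, using the myserver.com host and user:password credentials from the example above:

$ curl -X PUT http://user:password@myserver.com/missing
{"ok":true}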

2.2.2. Documents describing the same replication

Let’s suppose 2 documents are added to the _replicator database in the following order:

{
    "_id": "my_rep",
    "source": "http://user:password@myserver.com/foo",
    "target": "http://adm:pass@localhost:5984/bar",
    "create_target": true,
    "continuous": true
}

and

{
    "_id": "my_rep_dup",
    "source": "http://user:password@myserver.com/foo",
    "target": "http://adm:pass@localhost:5984/bar",
    "create_target": true,
    "continuous": true
}

Both describe exactly the same replication (only their _ids differ). In this case, document my_rep triggers the replication, while my_rep_dup will fail. Inspecting _scheduler/docs explains exactly why it failed:

{
    "database": "_replicator",
    "doc_id": "my_rep_dup",
    "error_count": 1,
    "id": null,
    "info": {
        "error": "Replication `a81a78e822837e66df423d54279c15fe+continuous+create_target` specified by document `my_rep_dup` already started, triggered by document `my_rep` from db `_replicator`"
    },
    "last_updated": "2017-04-05T21:41:51Z",
    "source": "http://myserver.com/foo/",
    "start_time": "2017-04-05T21:41:51Z",
    "state": "failed",
    "target": "http://user:****@localhost:5984/bar",
}

Notice the state for this replication is failed. Unlike crashing, the failed state is terminal. As long as both documents are present, the replicator will not retry the my_rep_dup replication. Another reason for ending up in the failed state is a malformed document: for example, if the worker process count is specified as a string ("worker_processes": "a few") instead of an integer, the replication will fail.
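
As a sketch, a document malformed in exactly that way would immediately end up in the failed state (the _id is arbitrary; the invalid worker_processes value is the one mentioned above):

{
    "_id": "my_rep_malformed",
    "source": "http://user:password@myserver.com/foo",
    "target": "http://adm:pass@localhost:5984/bar",
    "worker_processes": "a few"
}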

2.2.3. Replication Scheduler

Once replication jobs are created they are managed by the scheduler. The scheduler is the replication component which periodically stops some jobs and starts others. This behavior makes it possible to have a larger number of jobs than the cluster could run simultaneously. Replication jobs which keep failing will be penalized and forced to wait. The wait time increases exponentially with each consecutive failure.

When deciding which jobs to stop and which to start, the scheduler uses a round-robin algorithm to ensure fairness. Jobs which have been running the longest time will be stopped, and jobs which have been waiting the longest time will be started.

Note

Non-continuous (normal) replications are treated differently once they start running. See the Normal vs Continuous Replications section for more information.

The behavior of the scheduler can be configured via the max_jobs, interval and max_churn options. See the Replicator configuration section for additional information.
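
For example, these settings could be adjusted in the configuration (a sketch; the values shown are illustrative, not recommendations):

[replicator]
max_jobs = 500
interval = 60000
max_churn = 20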

2.2.4. Replication states

Replication jobs during their life-cycle pass through various states. This is a diagram of all the states and transitions between them:

Replication state diagram

Blue and yellow shapes represent replication job states.

Trapezoidal shapes represent external APIs; that's how users interact with the replicator. Writing documents to _replicator is the preferred way of creating replications, but posting to the _replicate HTTP endpoint is also supported.

Six-sided shapes are internal API boundaries. They are optional for this diagram and are only shown as additional information to help clarify how the replicator works. There are two processing stages: the first is where replication documents are parsed and become replication jobs, and the second is the scheduler itself. The scheduler runs replication jobs, periodically stopping and starting some. Jobs posted via the _replicate endpoint bypass the first component and go straight to the scheduler.

2.2.4.1. States descriptions

Before explaining the details of each state, it is worth noting the color and shape of each state in the diagram:

Blue vs. yellow partitions states into “healthy” and “unhealthy”, respectively. Unhealthy states indicate something has gone wrong and might need the user’s attention.

Rectangle vs. oval separates “terminal” states from “non-terminal” ones. Terminal states are those which will not transition to other states any more. Informally, jobs in a terminal state will not be retried and do not consume memory or CPU resources.

  • Initializing: Indicates that the replicator has noticed the change in the replication document. Jobs should transition quickly through this state. Being stuck here for a while could mean there is an internal error.

  • Failed: Replication document could not be processed and turned into a valid replication job for the scheduler. This state is terminal and requires user intervention to fix the problem. A typical reason for ending up in this state is a malformed document. For example, specifying an integer for a parameter which accepts a boolean. Another reason for failure could be specifying a duplicate replication. A duplicate replication is a replication with identical parameters but a different document ID.

  • Error: Replication document update could not be turned into a replication job. Unlike the Failed state, this one is temporary, and the replicator will keep retrying periodically. There is an exponential backoff applied in case of consecutive failures. The main reason this state exists is to handle filtered replications with custom user functions. Filter function content is needed in order to calculate the replication ID. A replication job cannot be created until the function code is retrieved. Because retrieval happens over the network, temporary failures have to be handled.

  • Running: Replication job is running normally. This means there might be a change feed open, and when changes are noticed, they are processed and posted to the target. A job is still considered Running even if its workers are currently not streaming changes from source to target and are just waiting on the change feed. Continuous replications will most likely end up in this state.

  • Pending: Replication job is not running and is waiting its turn. This state is reached when the number of replication jobs added to the scheduler exceeds replicator.max_jobs. In that case, the scheduler will periodically stop and start subsets of jobs, trying to give each one a fair chance at making progress.

  • Crashing: Replication job has been successfully added to the replication scheduler. However, an error was encountered during the last run. The error could be a network failure, a missing source database, a permissions error, etc. Repeated consecutive crashes result in an exponential backoff. This state is considered temporary (non-terminal) and replication jobs will be periodically retried.

  • Completed: This is a terminal, successful state for non-continuous replications. Once in this state the replication is “forgotten” by the scheduler and it doesn’t consume any more CPU or memory resources. Continuous replication jobs will never reach this state.

Note

Maximum backoff interval for states Error and Crashing is calculated based on the replicator.max_history option. See Replicator configuration section for additional information.

2.2.4.2. Normal vs Continuous Replications

Normal (non-continuous) replications, once started, will be allowed to run to completion. That behavior preserves their semantics of replicating a snapshot of the source database to the target. For example, if new documents are added to the source after the replication has started, those updates should not show up on the target database. Stopping and restarting a normal replication would violate that constraint.

Warning

When there is a mix of continuous and normal replications, once normal replications are scheduled to run, they might temporarily starve continuous replication jobs.

However, normal replications will still be stopped and rescheduled if an operator reduces the value for the maximum number of replications. This is so that if an operator decides replications are overwhelming a node, they have the ability to recover. Any stopped replications will be resubmitted to the queue to be rescheduled.
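
For example, an operator could lower the job limit at runtime through the configuration API. A sketch, assuming a node reachable via the _local alias and admin credentials adm:pass:

$ curl -X PUT http://adm:pass@localhost:5984/_node/_local/_config/replicator/max_jobs -d '"100"'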

2.2.5. Compatibility Mode

Previous versions of the CouchDB replicator wrote state updates back to replication documents. In cases where user code programmatically reads those states, a compatibility mode can be enabled via a configuration setting:

[replicator]
update_docs = true

In this mode the replicator will continue to write state updates to the documents.

To effectively disable the scheduling behavior, which periodically stops and starts jobs, set the max_jobs configuration setting to a large number. For example:

[replicator]
max_jobs = 9999999

See Replicator configuration section for other replicator configuration options.

2.2.6. Canceling replications

To cancel a replication, simply DELETE the document which triggered the replication. To update a replication, for example to change the number of workers or the source, simply update the document with the new data. Any extra application-specific data in the replication documents is ignored by the replicator.
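
For example, the continuous replication created earlier could be canceled like this (a sketch; $REV stands for the current revision of the my_rep document):

$ curl -X DELETE "http://adm:pass@localhost:5984/_replicator/my_rep?rev=$REV"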

2.2.7. Server restart

When CouchDB is restarted, it checks its _replicator databases and restarts replications described by documents if they are not already in a completed or failed state. If they are, they are ignored.

2.2.8. Clustering

In a cluster, replication jobs are balanced evenly among all the nodes such that a replication job runs on only one node at a time.

Every time there is a cluster membership change, that is, when nodes are added or removed, as happens in a rolling reboot, the replicator application will notice the change, rescan all the documents and running replications, and re-evaluate their cluster placement in light of the new set of live nodes. This mechanism also provides replication fail-over in case a node fails. Replication jobs started from replication documents (but not those started from the _replicate HTTP endpoint) will automatically migrate to one of the live nodes.

2.2.9. Additional Replicator Databases

Imagine the replicator database (_replicator) has these two documents which represent pull replications from servers A and B:

{
    "_id": "rep_from_A",
    "source": "http://user:password@aserver.com:5984/foo",
    "target": {
        "url": "http://localhost:5984/foo_a",
        "auth": {
            "basic": {
                "username": "adm",
                "password": "pass"
            }
        }
    },
    "continuous": true
}
{
    "_id": "rep_from_B",
    "source": "http://user:password@bserver.com:5984/foo",
    "target": {
        "url": "http://localhost:5984/foo_b",
        "auth": {
            "basic": {
                "username": "adm",
                "password": "pass"
            }
        }
    },
    "continuous": true
}

Now without stopping and restarting CouchDB, add another replicator database. For example another/_replicator:

$ curl -X PUT http://adm:pass@localhost:5984/another%2F_replicator/
{"ok":true}

Note

A / (%2F) character in a database name, when used in a URL, should be escaped.

Then add a replication document to the new replicator database:

{
    "_id": "rep_from_X",
    "source": "http://user:password@xserver.com:5984/foo",
    "target": "http://adm:pass@localhost:5984/foo_x",
    "continuous": true
}

From now on, there are three replications active in the system: two replications from A and B, and a new one from X.

Then remove the additional replicator database:

$ curl -X DELETE http://adm:pass@localhost:5984/another%2F_replicator/
{"ok":true}

After this operation, replication pulling from server X will be stopped and the replications in the _replicator database (pulling from servers A and B) will continue.

2.2.10. Fair Share Job Scheduling

When multiple _replicator databases are used, and the total number of jobs on any node is greater than max_jobs, replication jobs will be scheduled such that each of the _replicator databases by default gets an equal chance of running its jobs.

This is accomplished by assigning a number of “shares” to each _replicator database and then automatically adjusting the proportion of running jobs to match each database’s proportion of shares. By default, each _replicator database is assigned 100 shares. It is possible to alter the share assignments for each individual _replicator database in the [replicator.shares] configuration section.

The fair share behavior is perhaps easier described with a set of examples. Each example assumes the default of max_jobs = 500, and two replicator databases: _replicator and another/_replicator.

Example 1: If _replicator has 1000 jobs and another/_replicator has 10, the scheduler will run about 490 jobs from _replicator and 10 jobs from another/_replicator.

Example 2: If _replicator has 200 jobs and another/_replicator also has 200 jobs, all 400 jobs will get to run as the sum of all the jobs is less than the max_jobs limit.

Example 3: If both replicator databases have 1000 jobs each, the scheduler will run about 250 jobs from each database on average.

Example 4: If both replicator databases have 1000 jobs each, but _replicator was assigned 400 shares, then on average the scheduler would run about 400 jobs from _replicator and 100 jobs from another/_replicator.

The proportions described in the examples are approximate and might oscillate a bit, and also might take anywhere from tens of minutes to an hour to converge.
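
As a sketch, the share assignment from Example 4 could be expressed in the configuration like this, leaving another/_replicator at its default of 100 shares:

[replicator.shares]
_replicator = 400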

2.2.11. Replicating the replicator database

Imagine that on server C you have a replicator database with the following two pull replication documents in it:

{
     "_id": "rep_from_A",
     "source": "http://user:password@aserver.com:5984/foo",
     "target": "http://adm:pass@localhost:5984/foo_a",
     "continuous": true
}
{
     "_id": "rep_from_B",
     "source": "http://user:password@bserver.com:5984/foo",
     "target": "http://adm:pass@localhost:5984/foo_b",
     "continuous": true
}

Now you would like to have the same pull replications going on in server D, that is, you would like to have server D pull replicating from servers A and B. You have two options:

  • Explicitly add two documents to server D’s replicator database

  • Replicate server C’s replicator database into server D’s replicator database

Both alternatives accomplish exactly the same goal.
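
The second alternative could be expressed as a one-shot replication of the replicator databases themselves. A minimal sketch, posted to server D’s _replicate endpoint (cserver.com and its credentials are placeholders for server C, and localhost stands for server D):

{
    "source": "http://user:password@cserver.com:5984/_replicator",
    "target": "http://adm:pass@localhost:5984/_replicator"
}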

2.2.12. Delegations

Replication documents can have a custom user_ctx property. This property defines the user context under which a replication runs. For the old way of triggering a replication (POSTing to /_replicate/), this property is not needed. That’s because information about the authenticated user is readily available while the replication runs, and the replication itself is not persistent in that case. Now, with the replicator database, the problem is that information about which user is starting a particular replication is only present when the replication document is written. The information in the replication document and the replication itself are persistent, however. This implementation detail implies that in the case of a non-admin user, a user_ctx property containing the user’s name and a subset of their roles must be defined in the replication document. This is enforced by the document update validation function present in the default design document of the replicator database. The validation function also ensures that non-admin users are unable to set the value of the user context’s name property to anything other than their own user name. The same principle applies for roles.

For admins, the user_ctx property is optional, and if it’s missing it defaults to a user context with name null and an empty list of roles, which means design documents won’t be written to local targets. If writing design documents to local targets is desired, the role _admin must be present in the user context’s list of roles.

Also, for admins the user_ctx property can be used to trigger a replication on behalf of another user. This is the user context that will be passed to local target database document validation functions.

Note

The user_ctx property only has effect for local endpoints.

Example delegated replication document:

{
    "_id": "my_rep",
    "source": "http://user:password@bserver.com:5984/foo",
    "target": "http://adm:pass@localhost:5984/bar",
    "continuous": true,
    "user_ctx": {
        "name": "joe",
        "roles": ["erlanger", "researcher"]
    }
}

As stated before, the user_ctx property is optional for admins, while being mandatory for regular (non-admin) users. When the roles property of user_ctx is missing, it defaults to the empty list [].

2.2.13. Selector Objects

Including a Selector Object in the replication document enables you to use a query expression to determine if a document should be included in the replication.

The selector specifies fields in the document, and provides an expression to evaluate with the field content or other data. If the expression resolves to true, the document is replicated.

The selector object must:

  • Be structured as valid JSON.

  • Contain a valid query expression.

The syntax for a selector is the same as the selector syntax used for _find.

Using a selector is significantly more efficient than using a JavaScript filter function, and is the recommended option if filtering on document attributes only.
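
For example, a replication document with a selector might look like the following sketch (the type field and database names are placeholders and assume the source documents carry such a field):

{
    "_id": "rep_recipes_only",
    "source": "http://adm:pass@localhost:5984/foo",
    "target": "http://adm:pass@localhost:5984/bar",
    "selector": {
        "type": "recipe"
    },
    "continuous": true
}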

2.2.14. Specifying Usernames and Passwords

There are multiple ways to specify usernames and passwords for replication endpoints:

  • In an {"auth": {"basic": ...}} object:

    Added in version 3.2.0.

    {
        "target": {
            "url": "http://someurl.com/mydb",
            "auth": {
                "basic": {
                    "username": "$username",
                    "password": "$password"
                }
            }
        },
        ...
    }
    

    This is the preferred format as it allows including characters like @, : and others in the username and password fields.

  • In the userinfo part of the endpoint URL. This allows for a more compact endpoint representation; however, it prevents using characters like @ and : in usernames or passwords:

    {
        "target": "http://adm:pass@localhost:5984/bar"
        ...
    }
    

    Specifying credentials in the userinfo part of the URL is deprecated as per RFC3986. CouchDB still supports this way of specifying credentials and doesn’t yet have a target release for when support will be removed.

  • In an "Authorization: Basic $b64encoded_username_and_password" header:

    {
        "target": {
            "url": "http://someurl.com/mydb",
            "headers": {
                "Authorization": "Basic dXNlcjpwYXNz"
            }
        },
        ...
    }
    

    This method has the downside of going through the extra step of base64 encoding. In addition, it could give the impression that it encrypts or hides the credentials, so it could encourage inadvertent sharing and leaking of credentials.

When credentials are provided in multiple forms, they are selected in the following order:

  • "auth": {"basic": {...}} object

  • URL userinfo

  • "Authorization: Basic ..." header.

First, the auth object is checked, and if credentials are defined there, they are used. If they are not, the URL userinfo is checked. If credentials are found there, those are used; otherwise, the basic auth header is used.

2.2.15. Replicate Winning Revisions Only

Use the winning_revs_only: true option to replicate “winning” document revisions only. These are the revisions that would be returned by the GET db/doc API endpoint by default, or appear in the _changes feed with the default parameters.

POST http://couchdb:5984/_replicate HTTP/1.1
Accept: application/json
Content-Type: application/json

{
    "winning_revs_only" : true,
    "source" : "http://source:5984/recipes",
    "target" : "http://target:5984/recipes"
}

Replication with this mode discards conflicting revisions, so it could be one way to remove conflicts through replication.

Replication IDs and checkpoint IDs generated by winning_revs_only: true replications will be different from those generated by default, so it is possible to first replicate the winning revisions, and then later “backfill” the rest of the revisions with a regular replication job.

The winning_revs_only: true option can be combined with filters or other options like continuous: true or create_target: true.
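
For example, a persistent replication document combining these options might look like this sketch (the database names are placeholders):

{
    "_id": "winning_rep",
    "source": "http://adm:pass@localhost:5984/recipes",
    "target": "http://adm:pass@localhost:5984/recipes_no_conflicts",
    "winning_revs_only": true,
    "continuous": true,
    "create_target": true
}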