Upgrade a Vault cluster

This tutorial provides a set of standard operating procedures (SOPs) for upgrading Vault Enterprise clusters to a newer version.

The assumption is that all Vault clusters are running a minimum of three nodes as recommended in the Vault Reference Architecture guide.

The upgrade procedures differ depending on which storage backend is used (Integrated Storage or Consul). Differences in the steps are highlighted in these procedures.

Personas

These standard operating procedures are primarily aimed at operations personnel.

Prerequisites

The following knowledge and prerequisite steps are required to upgrade a Vault cluster. Make sure you understand or have carried out each of them before attempting an upgrade.

Working knowledge of Vault: Some working knowledge of Vault is required in order to follow these SOPs.
Vault cluster configuration is defined: Vault (and Consul, where it is used as a storage backend) infrastructure is configured as per the Vault Reference Architecture.
A cluster configuration as defined in the reference architecture for your storage backend, such as our Vault with Integrated Storage Reference Architecture, is required.
Vault has been initialised: This SOP assumes that you have already initialised Vault, that keyholders with access to the unseal keys are available for each cluster, that you have access to tokens with sufficient privileges for each cluster, and that encrypted data is stored in the storage backend.
Upgrading Vault Guide: These SOPs assume that you have already reviewed the Upgrading Vault guides along with the Vault version-specific upgrade notes in that area of the documentation.

General guidance

When upgrading Vault, you should bear in mind this general advice:

Follow the Rolling Update Procedures when updating Vault.
If you want to upgrade non-LTS Vault versions across multiple versions (for example, 1.3.10 -> 1.7.x), we recommend a combined approach of upgrading multiple Vault replicated clusters with a rolling update procedure on each cluster. Make sure to review the upgrade notes for each intervening version, as there may be additional steps or configurations required at each step.
If you are attempting to combine multiple operations requiring downtime (for example a storage migration AND an upgrade or hardware replacement AND an upgrade), you may find the DR Upgrade Approach helpful.
As well as following the above advice, always check the release notes and changelog (Upgrading Vault - Guides) to see if there are any breaking changes or special steps required for an upgrade of Vault.
You should never fail over from a newer version of Vault to an older version. Our procedures are designed to prevent this.
You should never replicate from a newer version of Vault to an older version.
Vault does not support true zero-downtime upgrades, but with a proper upgrade procedure the downtime should be very short (a few hundred milliseconds to a second, depending on the speed of access to the storage backend).
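
The rule that you never fail over or replicate from a newer version to an older one can be checked mechanically before acting. The following is a minimal sketch, not part of the Vault CLI: `version_ge` and `safe_direction` are hypothetical helper names, and the comparison assumes plain MAJOR.MINOR.PATCH version strings.

```shell
# Hypothetical helper, not part of the Vault CLI.
# Succeeds (exit status 0) when version $1 is at or above version $2.
# Assumes plain MAJOR.MINOR.PATCH strings; build suffixes such as "+ent"
# or "-rc1" are stripped before comparing.
version_ge() {
  vg_a="${1%%[+-]*}"
  vg_b="${2%%[+-]*}"
  for vg_i in 1 2 3; do
    vg_x="${vg_a%%.*}"
    vg_y="${vg_b%%.*}"
    if [ "$vg_x" -gt "$vg_y" ]; then return 0; fi
    if [ "$vg_x" -lt "$vg_y" ]; then return 1; fi
    # Drop the component just compared; pad with 0 when none remain.
    case "$vg_a" in *.*) vg_a="${vg_a#*.}" ;; *) vg_a=0 ;; esac
    case "$vg_b" in *.*) vg_b="${vg_b#*.}" ;; *) vg_b=0 ;; esac
  done
  return 0
}

# Hypothetical guard: allow failover or replication from version $1 to
# version $2 only when the destination is not older than the source.
safe_direction() {
  if version_ge "$2" "$1"; then
    echo "ok: $1 -> $2"
  else
    echo "refuse: $2 is older than $1"
    return 1
  fi
}
```

For example, `safe_direction 1.3.10 1.6.7` succeeds, while `safe_direction 1.6.7 1.3.10` prints a refusal and returns a non-zero status. The components are compared numerically, so 1.10.0 is treated as newer than 1.9.9.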

IMPORTANT NOTE

Always back up your data before upgrading. Vault does not make backward-compatibility guarantees for its data store. Simply replacing the newly-installed Vault binary with the previous version will not cleanly downgrade Vault, as upgrades may perform changes to the underlying data structure that make the data incompatible with a downgrade. If you need to roll back to a previous version of Vault, you should roll back your data store as well. The procedures defined below include a step to back up Vault's storage as part of the upgrade procedure.

Understanding your Vault installation

Typical Vault Enterprise installations may have multiple Disaster Recovery Replica and Performance Replica clusters, usually located in different regions or datacenters. Additionally, Vault HA setups can differ on whether a load balancer is in use, what addresses clients are being given to connect to Vault (standby + leader, leader-only, or discovered via service discovery), etc.

Before beginning the upgrade procedure, you should make a note of each of these details and carefully produce an upgrade schedule, taking into account the advice given in these operating procedures.

An example of structuring an upgrade plan for multiple clusters is available in the general upgrade guidance documentation.

Procedures

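Procedures are given below for three different scenarios: a rolling upgrade of a single HA cluster, an upgrade of replication clusters, and combining multiple operations requiring downtime.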

Rolling Upgrade Procedure for upgrading a single HA cluster

Note

If your Vault Enterprise deployment consists of Disaster Recovery or Performance Replica clusters, read the next section on Replication Cluster Upgrade Procedure in conjunction with this section.

Integrated Storage

If you are running Vault 1.11.0 or later with Integrated Storage as the storage backend, use the automated upgrade feature provided by autopilot. See the Automate Upgrades with Vault Enterprise tutorial for more details. Follow the steps in this section if your Vault version is 1.10.x or earlier.

Note

Interruption of service might occur as long as the Vault cluster is running different versions of Vault. Therefore it is highly recommended to complete the rolling upgrade as soon as possible.

Follow these steps to perform a rolling upgrade of a Vault HA cluster:

Take a backup of your Vault cluster. The steps depend on whether you're using Integrated Storage (raft) or the Consul storage backend.
Integrated Storage: Take a snapshot of the raft storage layer of Vault.
$ vault operator raft snapshot save demo.snapshot
Consul: Take a Consul snapshot.
$ consul snapshot save demo.snapshot
In either case, save the created snapshot file in a safe location in case the need arises to restore from the snapshot.
Determine the leader and follower nodes in your Vault cluster.
With Integrated Storage, list the raft peers:
$ vault operator raft list-peers
Alternatively, query the sys/leader endpoint:
$ curl http://127.0.0.1:8200/v1/sys/leader
Bring down a follower node.
```
$ systemctl stop vault
```
Replace the old Vault binary with the new Vault binary.
Confirm that the new binary is in place and that the new version is correctly installed.
```
$ vault --version
```
Restart Vault on the updated node.
```
$ systemctl start vault
```
Check the logs on the restarted node for any errors. Address these and roll back if necessary.
Repeat the steps on the other follower nodes.
Bring down the active node.
```
$ systemctl stop vault
```
This should trigger a change in the active Vault node, as well as a leadership change in the underlying raft storage layer (in case of raft Integrated Storage). Additionally, examine the logs streaming from the remaining Vault HA nodes to confirm the active node.
If leadership and logs look OK, update the old active/leader node by replacing its old Vault binary with the new Vault binary.
Restart Vault.
```
$ systemctl start vault
```
It should be added back into the cluster automatically. Repeat the step that determines the leader and follower nodes to confirm this.
If you experience a problem during the upgrade process, remove all updated nodes, restore the backup taken at the start of this procedure, bring up the old (not upgraded) leader, and check the logs for errors. Then restore the old version on all followers and add them back into the cluster, making sure the leader does not change.
Otherwise, once the cluster upgrade is complete, remember to unseal it if you use the Shamir unseal method.
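
The ordering used in the steps above (followers first, active node last) can be derived from the peer list. The following is a minimal sketch: `upgrade_order` is a hypothetical helper name, and it assumes the tabular output of `vault operator raft list-peers` has Node, Address, State, and Voter columns preceded by two header lines.

```shell
# Hypothetical helper: reads `vault operator raft list-peers` output on stdin
# and prints node names in upgrade order: followers first, leader last.
# Assumes the table layout Node / Address / State / Voter with two header lines.
# Usage: vault operator raft list-peers | upgrade_order
upgrade_order() {
  awk 'NR > 2 && $3 == "follower" { print $1 }
       NR > 2 && $3 == "leader"   { leader = $1 }
       END { if (leader != "") print leader }'
}
```

Given a peer list in which a node named vault_1 is the leader and vault_2 and vault_3 are followers (illustrative names), the helper prints vault_2, vault_3, then vault_1, one per line. It only suggests an order; it does not stop or restart anything.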

Replication Clusters Upgrade Procedure

Vault Enterprise installations can consist of multiple Disaster Recovery and Performance Replica clusters. With this in mind, it's very important to upgrade secondary clusters first, following the procedure below.

If you're not already aware of the status of a given cluster, use the following command to identify whether it is a primary or secondary cluster.

$ vault read -format=json sys/replication/status

If it is a primary, then in either the dr or performance section of the response you will see something similar to the following, indicating that it has known secondaries. These secondary clusters should therefore be upgraded first:

 "dr": {
   "cluster_id": "12ace5c2-3876-8f99-db57-ee60f4cc6c80",
   "known_secondaries": [
     "secondary"
   ],
   "last_reindex_epoch": "0",
   "last_wal": 37,
   "merkle_root": "c519ae23573c4108c634c48d272a48d86d39bd65",
   "mode": "primary",
   "primary_cluster_addr": "",
   "state": "running"
 },

If the cluster you're currently working with is a secondary, the above command will return something similar to the below:

 "dr": {
   "cluster_id": "12ace5c2-3876-8f99-db57-ee60f4cc6c80",
   "known_primary_cluster_addrs": [
     "https://10.0.0.10:8201"
   ],
   "last_reindex_epoch": "1586437497",
   "last_remote_wal": 0,
   "merkle_root": "c519ae23573c4108c634c48d272a48d86d39bd65",
   "mode": "secondary",
   "primary_cluster_addr": "https://10.0.0.10:8201",
   "secondary_id": "secondary",
   "state": "stream-wals"
 },
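
Reading the mode field by eye works for one cluster, but scripting it helps when you have several. The following is a minimal sketch, assuming jq is installed and that the dr and performance sections sit under the top-level "data" key of the JSON response; `replication_mode` is a hypothetical helper name.

```shell
# Hypothetical helper: reads the JSON produced by
# `vault read -format=json sys/replication/status` on stdin and prints the
# mode of the requested replication type ("dr" or "performance").
# Assumes jq is installed and that both sections live under the "data" key.
# Usage: vault read -format=json sys/replication/status | replication_mode dr
replication_mode() {
  jq -r --arg type "$1" '.data[$type].mode // "unknown"'
}
```

A result of "secondary" means the cluster should be upgraded before its primary. The helper prints "unknown" when the requested section is missing from the response.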

Once you have confirmed the current cluster is a secondary, you should then follow the Rolling Upgrade Procedure (for upgrading a single HA cluster) to upgrade the nodes within that secondary cluster. Within those steps, it is very important to take backups of your cluster prior to upgrading.
Once the secondary cluster has been upgraded, it is important to re-verify the replication status of your cluster, using either the CLI or the API. For further reading, see the documentation on checking replication status.
$ vault read -format=json sys/replication/status
$ curl -s $VAULT_ADDR/v1/sys/replication/status | jq
Repeat the steps for each secondary until you need to upgrade your primary cluster.
Once satisfied with the functionality of the upgraded secondary instances, upgrade the primary instance.

Combining multiple operations requiring downtime

This procedure (also known as the DR Update Method) involves creating a new cluster running the target version of Vault, configuring Disaster Recovery Replication onto it, then failing over from the primary to the new DR cluster.

Generally speaking, upgrade risk can be reduced by isolating upgrades from other major changes to your Vault deployment. We recommend performing, then validating, major changes separately. If there are issues, separating the procedures provides clarity on which change introduced an issue. If necessary, this procedure can be used to combine an upgrade with another procedure that requires downtime, like a storage migration or a hardware replacement.

This procedure does not support demoting the original DR primary cluster and replicating to it from a newly upgraded and promoted cluster. For example, if you have Cluster A (a DR Primary) on 1.3.10 and Cluster B (a new DR secondary upgraded to 1.6.7), you cannot promote Cluster B and replicate to Cluster A until Cluster A is upgraded to 1.6.7 or above.

This limitation exists because Vault does not make backward-compatibility guarantees for its data store.

Take a snapshot of the underlying storage for the existing primary cluster.
Integrated Storage: Take a snapshot of the raft storage layer of Vault using this command.
$ vault operator raft snapshot save demo.snapshot
Consul: Take a Consul snapshot.
$ consul snapshot save demo.snapshot
In either case, save the created snapshot file in a safe location in case the need arises to restore from the snapshot.
Build the infrastructure necessary to support the new DR replica. The resources allocated to this replica should be identical to your existing primary, but have your target version of Vault installed.
Follow the Disaster Recovery Replication Setup tutorial to configure Disaster Recovery replication from the primary onto this new DR secondary cluster.

Confirm that DR has been successfully configured.

$ vault read -format=json sys/replication/status

You should see something similar to the below when running on the primary.

"dr": {
  "cluster_id": "12ace5c2-3876-8f99-db57-ee60f4cc6c80",
  "known_secondaries": [
    "secondary"
  ],
  "last_reindex_epoch": "0",
  "last_wal": 37,
  "merkle_root": "c519ae23573c4108c634c48d272a48d86d39bd65",
  "mode": "primary",
  "primary_cluster_addr": "",
  "state": "running"
},

You should see something similar to the below when running on the secondary.

 "dr": {
   "cluster_id": "12ace5c2-3876-8f99-db57-ee60f4cc6c80",
   "known_primary_cluster_addrs": [
     "https://10.0.0.10:8201"
   ],
   "last_reindex_epoch": "1586437497",
   "last_remote_wal": 0,
   "merkle_root": "c519ae23573c4108c634c48d272a48d86d39bd65",
   "mode": "secondary",
   "primary_cluster_addr": "https://10.0.0.10:8201",
   "secondary_id": "secondary",
   "state": "stream-wals"
 },
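
Before demoting the primary, it can help to gate the failover on the new secondary's status rather than checking it by eye. The following is a minimal sketch, assuming jq is installed, that the dr section sits under the top-level "data" key, and that "stream-wals" (the state shown in the sample above) indicates the secondary is streaming from the primary; `dr_secondary_ready` is a hypothetical helper name.

```shell
# Hypothetical pre-failover check: reads the secondary's replication status
# JSON on stdin and succeeds only when the DR mode is "secondary" and the
# state is "stream-wals". Assumes jq is installed.
# Usage: vault read -format=json sys/replication/status | dr_secondary_ready
dr_secondary_ready() {
  jq -e '.data.dr.mode == "secondary" and .data.dr.state == "stream-wals"' >/dev/null
}
```

The function prints nothing and reports the result through its exit status, so it can be used directly in an `if` statement ahead of the demote and promote steps.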

You will need to demote the DR Primary cluster and promote the DR Secondary cluster. For Vault versions 1.4 and later, the most efficient way of performing these actions is using the Batch DR Operation Token Strategy. Follow those instructions to generate and save a DR operation token.

Demote the DR Primary cluster. This will stop the demoted cluster from answering requests to replicate data from secondaries. After this cluster is demoted, you can disable it. If you plan to eventually have the demoted cluster act as a DR secondary, you must upgrade the demoted cluster before using the update primary API command.

$ VAULT_ADDR=$PRIMARY_CLUSTER_ADDR vault write -f sys/replication/dr/primary/demote

$ curl -X POST -H "X-Vault-Token: $VAULT_TOKEN" $PRIMARY_CLUSTER_ADDR/v1/sys/replication/dr/primary/demote | jq
{
    "request_id": "9f87fcef-6a97-fef8-d20b-2a154107f4bf",
    "lease_id": "",
    "renewable": false,
    "lease_duration": 0,
    "data": null,
    "wrap_info": null,
    "warnings": [
      "This cluster is being demoted to a replication secondary. Vault will be unavailable for a brief period and will resume service shortly."
    ],
    "auth": null
}

Promote the upgraded DR Secondary cluster, using the DR batch operation token generated earlier. Note that promoting the upgraded cluster will not cause it to replicate to the old primary cluster. Replication should not be enabled between the two clusters until the downstream cluster's Vault version is at or above the upstream cluster's Vault version. Once promotion is complete, if the old cluster is not needed it can be discarded.

$ vault write sys/replication/dr/secondary/promote dr_operation_token=...

 WARNING! The following warnings were returned from Vault:

    * This cluster is being promoted to a replication primary. Vault will be
    unavailable for a brief period and will resume service shortly.

Alternatively, to promote the cluster using the API, create the request payload.

$ tee payload.json <<EOF
{
    "dr_operation_token": "b.AAAAAQI2iOzgDFxWOv..."
}
EOF

Invoke the /sys/replication/dr/secondary/promote endpoint.

$ curl -X POST -d @payload.json $VAULT_ADDR/v1/sys/replication/dr/secondary/promote | jq
{
    "request_id":"44650636-e6d8-22dd-9c74-9ee75b8e480f",
    "lease_id":"",
    "renewable":false,
    "lease_duration":0,
    "data":null,
    "wrap_info":null,
    "warnings":[
        "This cluster is being promoted to a replication primary. Vault will be unavailable for a brief period and will resume service shortly."
    ],
    "auth":null
}

Don't forget to take any necessary steps to ensure clients can communicate with the new primary, such as updating DNS records to point to the new cluster once failover has completed.
