Monitoring a tripleo Overcloud upgrade

The tripleo overcloud upgrades workflow (WIP Docs) has been well tested for upgrades to stable/liberty. There is ongoing work to adapt this workflow for upgrades to stable/mitaka/newton (current master), as well as to change the process altogether and make it more composable.

This post is a description of the kinds of things I look for when monitoring a stable/liberty upgrade - verification points after a given step and some explanation at various points that may or may not be helpful. I recently had to share a lot of this information as part of a customer POC upgrade and thought it would be useful to have it written down somewhere.

For reference, the overcloud being upgraded in the examples below was deployed like:

openstack overcloud deploy --templates /home/stack/tripleo-heat-templates \
  -e /home/stack/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
  -e /home/stack/tripleo-heat-templates/environments/puppet-pacemaker.yaml \
  --control-scale 3 --compute-scale 1 --libvirt-type qemu \
  -e /home/stack/tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml \
  -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org'

Upgrade your undercloud.

The first thing to check, and very likely have to reinstate, is any post-install customization you made to your undercloud, such as the creation of a new ovs interface for talking to your overcloud nodes, or any custom IP routes. The undercloud upgrade will revert those and you'll have to re-add/create them.
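
As a hedged sketch of what I mean (the bridge name, interface and subnet below are just placeholders - substitute whatever your environment actually needed):

# are the custom ports still attached to the undercloud bridge?
sudo ovs-vsctl list-ports br-ctlplane
# re-add a custom route if the upgrade dropped it
ip route show | grep 10.10.10.0/24 || \
    sudo ip route add 10.10.10.0/24 dev br-ctlplane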

The upgrade to liberty delivers a new upgrade-non-controller.sh script for the undercloud, so you can check this:

[stack@instack ~]$ which upgrade-non-controller.sh
/bin/upgrade-non-controller.sh

Other than that, I always just sanity-check that services are running OK post-upgrade:

[stack@instack ~]$ openstack-service status
MainPID=2107 Id=neutron-dhcp-agent.service ActiveState=active
MainPID=2106 Id=neutron-openvswitch-agent.service ActiveState=active
MainPID=1191 Id=neutron-server.service ActiveState=active
MainPID=1232 Id=openstack-glance-api.service ActiveState=active
MainPID=1172 Id=openstack-glance-registry.service ActiveState=active
MainPID=1201 Id=openstack-heat-api-cfn.service ActiveState=active
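
If you want a second opinion beyond openstack-service, a quick alternative (a hedged sketch, assuming a systemd-based undercloud) is to ask systemd directly whether anything failed to come back after the upgrade:

# should list no units if everything restarted cleanly
sudo systemctl list-units 'openstack-*' 'neutron-*' --state=failed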

Execute the upgrade initialization step

This is called the initialization step since it sets up the repos on the overcloud nodes (for the release we are upgrading to) and delivers the upgrade script to the non-controller nodes. This step is instigated by including the major-upgrade-pacemaker-init.yaml environment file in the deployment command. For example:

openstack overcloud deploy --templates /home/stack/tripleo-heat-templates \
  -e /home/stack/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
  -e /home/stack/tripleo-heat-templates/environments/puppet-pacemaker.yaml \
  --control-scale 3 --compute-scale 1 --libvirt-type qemu \
  -e /home/stack/tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml \
  -e /home/stack/tripleo-heat-templates/environments/major-upgrade-pacemaker-init.yaml \
  -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org'
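
While that stack update is in progress I usually keep an eye on it from the undercloud. A minimal sketch, assuming the default stack name of overcloud:

source ~/stackrc
# watch the overall stack state
watch -n 10 "heat stack-list"
# and list any nested resources that are not yet complete
heat resource-list -n 5 overcloud | grep -vi complete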

Once the heat stack has gone to UPDATE_COMPLETE you can check all non-controller nodes for the presence of the newly delivered upgrade script tripleo_upgrade_node.sh:

[root@overcloud-novacompute-0 ~]# ls -l /root
-rwxr-xr-x. 1 root root 348 Jun  3 11:26 tripleo_upgrade_node.sh
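
Rather than logging into each node by hand, a convenience loop along these lines works too (a hedged sketch - it just pulls the ctlplane IPs of the non-controller nodes from the undercloud's nova and assumes the usual heat-admin user):

source ~/stackrc
for ip in $(nova list | grep -Ei 'compute|ceph' | \
            grep -Eo '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'); do
  echo "--- $ip ---"
  ssh heat-admin@$ip "ls -l /root/tripleo_upgrade_node.sh"
done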

One point to note is that the rpc version which we will use for pinning nova rpc during the upgrade is set in the compute upgrade script:

[root@overcloud-novacompute-0 ~]# cat tripleo_upgrade_node.sh
### DO NOT MODIFY THIS FILE
### This file is automatically delivered to the compute nodes as part of the
### tripleo upgrades workflow

# pin nova to kilo (messaging +-1) for the nova-compute service

crudini  --set /etc/nova/nova.conf upgrade_levels compute mitaka

yum -y install python-zaqarclient  # needed for os-collect-config
yum -y update

The upgrade_levels compute line above is actually written using the parameter we passed in via the major-upgrade-pacemaker-init.yaml environment file.
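
If you need a different pin you can set that parameter yourself in an extra environment file passed alongside major-upgrade-pacemaker-init.yaml. A hedged sketch - the UpgradeLevelNovaCompute parameter name is what I believe the templates use, so double check it against your tripleo-heat-templates:

# write a small environment file overriding the compute rpc pin
cat > ~/upgrade-level.yaml <<'EOF'
parameter_defaults:
  UpgradeLevelNovaCompute: mitaka
EOF
# ...and add "-e ~/upgrade-level.yaml" to the deploy command above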

You should also see the updated /etc/yum.repos.d/* on all overcloud nodes after this step, so you can confirm that everything is in order for the upgrade to proceed.
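
A quick way to eyeball that from the undercloud (hedged - 192.0.2.10 is just an example node IP, take yours from nova list):

ssh heat-admin@192.0.2.10 "ls /etc/yum.repos.d/ && sudo yum repolist enabled | head"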


Upgrade controller nodes (and your entire pacemaker cluster)

(I skipped upgrading the swift nodes, as there isn't much of interest to say about it; see the WIP Docs for more, or ping me.)

This step will upgrade your controller nodes and during this process the entire cluster will be taken offline - this is normal. This step is instigated by including the major-upgrade-pacemaker.yaml environment file. For example:

openstack overcloud deploy --templates /home/stack/tripleo-heat-templates \
  -e /home/stack/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
  -e /home/stack/tripleo-heat-templates/environments/puppet-pacemaker.yaml \
  --control-scale 3 --compute-scale 1 --libvirt-type qemu \
  -e /home/stack/tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml \
  -e /home/stack/tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml \
  -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org'

I typically observe the pacemaker cluster during the upgrade process. For example, on one controller I have watch -d pcs status running, and on another I have watch -d "pcs status | grep -ni stop -C 2". During the upgrade the pacemaker cluster goes down completely at some point, the yum packages are updated, and then the cluster is brought back up.

Once you start to see pacemaker services go down, it means that the code in major_upgrade_controller_pacemaker_1.sh is running; eventually the cluster is stopped completely.

Every 2.0s: pcs status | grep -ni stop -C2 -B1                                                               Fri Jun  3 11:52:07 2016

Error: cluster is not currently running on this node

At this point you can start to monitor /var/log/yum.log to see packages being upgraded.

[root@overcloud-controller-0 ~]# tail -f /var/log/yum.log
Jun 03 11:51:52 Updated: erlang-otp_mibs-18.3.3-1.el7.x86_64
Jun 03 11:51:52 Installed: python2-rjsmin-1.0.12-2.el7.x86_64
Jun 03 11:51:52 Updated: python-django-compressor-2.0-1.el7.noarch
Jun 03 11:51:53 Updated: ntp-4.2.6p5-22.el7.centos.2.x86_64
Jun 03 11:51:53 Updated: rabbitmq-server-3.6.2-3.el7.noarch
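
If you want to see the output of the upgrade scripts themselves (rather than just the yum transactions), remember that they are delivered and run through heat via os-collect-config, so following its journal on a controller shows what is happening:

# on any controller, follow the upgrade script output as heat applies it
sudo journalctl -u os-collect-config -f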

Once the cluster starts to come back online and services start, you know that major_upgrade_controller_pacemaker_2.sh is being executed.
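
I usually wait for the cluster to look fully healthy before trusting it - a hedged sketch of what I check:

# run on a controller; the stopped count should drop to 0 and no failed
# actions should be listed once major_upgrade_controller_pacemaker_2.sh is done
sudo pcs status | grep -ic stopped
sudo crm_mon -1 | grep -i failed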

After the stack is UPDATE_COMPLETE, you can check the rpc pin is set on nova.conf on all controllers:

[root@overcloud-controller-0 ~]# grep -rni upgrade -A 1 /etc/nova/*
/etc/nova/nova.conf:106:[upgrade_levels]
/etc/nova/nova.conf-107-compute = mitaka
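
At this point I also like to confirm the control plane is healthy from the API side before moving on to the computes - a minimal sketch, run from the undercloud with the overcloud credentials:

source ~/overcloudrc
nova service-list      # all services should be up
neutron agent-list     # all agents should be alive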

Upgrade compute and ceph nodes

This uses the upgrade-non-controller.sh script to execute tripleo_upgrade_node.sh on each non-controller node, for example:

[stack@instack ~]$ upgrade-non-controller.sh --upgrade overcloud-novacompute-0

On both node types you can check that the yum update has been executed successfully. Note that the tripleo_upgrade_node.sh script is customized for each node type, so it will differ between compute and ceph nodes, for example. However, in all cases there will at some point be a yum -y update. See major_upgrade_compute.sh and major_upgrade_ceph_storage.sh for more info on how else they might differ.
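
A hedged way to spot-check that from the undercloud (192.0.2.20 is again just an example node IP):

ssh heat-admin@192.0.2.20 "sudo tail -n 20 /var/log/yum.log"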

For compute nodes you can check that upgrade_levels is set for the nova rpc pinning in /etc/nova/nova.conf (which in the case of computes is used by nova-compute itself; api/scheduler/conductor etc. live on the controllers).

[root@overcloud-novacompute-0 ~]# grep -rni upgrade -A 1 /etc/nova/*
/etc/nova/nova.conf:106:[upgrade_levels]
/etc/nova/nova.conf-107-compute = mitaka

Upgrade converge - apply config deployment-wide and restart things.

The last step in the upgrade workflow is where we re-apply the deployment-wide config as specified by the tripleo-heat-templates used in the deploy/upgrade commands. It is instigated by including the major-upgrade-pacemaker-converge.yaml environment file, for example:

openstack overcloud deploy --templates /home/stack/tripleo-heat-templates \
  -e /home/stack/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
  -e /home/stack/tripleo-heat-templates/environments/puppet-pacemaker.yaml \
  --control-scale 3 --compute-scale 1 --libvirt-type qemu \
  -e /home/stack/tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml \
  -e /home/stack/tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml \
  -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org'

For both major-upgrade-pacemaker-init.yaml (upgrade initialization) and major-upgrade-pacemaker.yaml (controller upgrade) we specify the following in the resource registry:

OS::TripleO::ControllerPostDeployment: OS::Heat::None
OS::TripleO::ComputePostDeployment: OS::Heat::None
OS::TripleO::ObjectStoragePostDeployment: OS::Heat::None
OS::TripleO::BlockStoragePostDeployment: OS::Heat::None
OS::TripleO::CephStoragePostDeployment: OS::Heat::None

which means that things like the controller-config-pacemaker.yaml are not applied to controllers during those steps. That is, application of the overcloud_*.pp puppet manifests does not happen during upgrade initialization or controller upgrade.

However, for converge we simply do not override this in the major-upgrade-pacemaker-converge.yaml environment file, so the normal puppet manifests get applied for each node, delivering any config changes (e.g. upgrades to liberty had to deal with a rabbitmq password change causing issues).

Since we are applying new config, we need to make sure everything is restarted properly to pick it up, so pacemaker_resource_restart.sh is run after the normal puppet manifests are applied.

So during this step the pacemaker cluster will first go into an "unmanaged" state; this is to be expected and not a cause for alarm. This is because, as a matter of practice, before applying the controller puppet manifest we set the cluster to maintenance mode (as we are going to write the pacemaker resource definitions/constraints to the cib).

After the manifest is applied, maintenance mode is unset again.
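
If you want to confirm for yourself that the unmanaged state is just maintenance mode, you can check the cluster property on a controller while converge is running (a hedged sketch):

# shows maintenance-mode: true while the manifest is being applied,
# and false (or nothing at all) once it has been unset again
sudo pcs property | grep -i maintenance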

You should then see services restarting as pacemaker_resource_restart.sh is being executed. Seeing all the services running again at this point is a good indication that the converge step is coming to an end successfully.
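
As a final hedged sanity check once the converge stack goes UPDATE_COMPLETE, I make sure the cluster is clean and the cloud still answers:

# on a controller - expect a stopped count of 0
sudo pcs status | grep -ic stopped
# and from the undercloud, with the overcloud credentials
source ~/overcloudrc
nova list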


