Async Drink

Error handling with Ansible

Sep 25, 2019

Sometimes things go wrong, and no safeguard we build in could have prevented it. For a programmer, handling errors is perhaps just as important as trying to prevent them in the first place.

Take for example a configuration file stored on a user's hard drive. If you just assume this file always exists, you are in for a surprise when an adventurous user decides to remove it. All of a sudden your program could crash and the adventurous user will probably blame you!

Ansible is no different in this regard. Things may go wrong and you may need to handle them or recover from them. I use the word 'may' very liberally here as you'll find in this article that sometimes just letting errors happen is the best course of action.

Manually detecting errors

What exactly constitutes an error might not be obvious to Ansible. For example you could have a custom command or shell script running on your host with very specific output describing success or failure.

Let's say we have a program called check_system which does a system check. If all is well the message 'OK' will be written to stdout. If something is wrong the message 'ERROR' followed by a list of error information will be written to stdout.

Ansible has no way of detecting information this specific so we have to let Ansible know when something is wrong.

The example below shows how to do this:

- name: Perform a system check
  command: check_system
  register: output
  failed_when: "'ERROR' in output.stdout"

This basic task executes the check_system command. The register keyword stores the task result, including the stdout and stderr streams and the exit code, in the output variable. Next we use the failed_when keyword to describe when this task has failed.

Using the failed_when keyword we can detect errors Ansible would otherwise never have detected.

Manually detecting errors is only needed when Ansible cannot do so itself. Ansible will follow the convention of treating non-zero exit codes as errors. This covers most errors automatically.
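
One caveat worth sketching: once failed_when is set on a task, it takes the place of the default exit code check. The variant below is a minimal sketch, assuming check_system also returns a non-zero exit code for some failures (the original description of check_system does not say so). It tests the exit code and the output together:

- name: Perform a system check
  command: check_system
  register: output
  # failed_when replaces the default check, so the exit code is tested explicitly
  failed_when: output.rc != 0 or 'ERROR' in output.stdout

With this condition the task fails on either signal, instead of only on the 'ERROR' text.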

How and when to handle errors

Being able to detect when something goes wrong is great. Being able to react to it is even better.

Before we get into this however we should first ask ourselves if we really want to recover from an error. Wouldn't it make much more sense to just let the playbook fail? This way you are forced to investigate what went wrong.

I've found two scenarios in which reacting to an error is useful. The first simply ignores the error given a specific condition. The second recovers from an error to prevent leaving the host in an undesirable state.

Ignoring an error

One way to recover from an error is simply to ignore it. If the error is detected we let Ansible know it is OK. My advice is to only do this when an error is expected and of no further consequence.

As an example let's write a set of tasks which manually interact with Docker:

- name: Kill Docker container
  command: docker kill my_container

- name: Update Docker image
  command: docker pull my_app

- name: Start docker container
  command: docker run -d --name my_container my_app

When running these tasks the Docker container my_container is killed first. Next the image is pulled so the host has the latest version. Lastly the Docker container is started again.

We know Docker will return an error when trying to kill a container which does not exist. If you tried to run the above tasks (without the Docker container my_container available) your playbook would stop at the first task. This might not be desired, as the missing container might simply mean it was already killed earlier. The Docker container not being available could be considered expected and of no consequence.

We want Ansible to ignore the errors thrown by the first command if the container was not found. To do so we change the first command:

- name: Kill Docker container
  command: docker kill my_container
  register: output
  failed_when:
    - "'my_container' not in output.stdout"
    - "'No such container' not in output.stderr"

We again use the failed_when keyword, but this time with a list of conditions. The conditions are combined with a logical and: only when all of them are true is the result considered a failure.

We first check if the stdout stream does not contain the container name. We know if the Docker command was successful that it should. Next we check the stderr stream. Only if it does not contain the text hinting at the container not existing do we consider it a failure.

When either one of the conditions is false we can safely assume everything went right (as any error encountered is expected and of no consequence) and move on with the other tasks.
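
The same idea can also be written as a single expression. The sketch below is a variant, not an exact equivalent: it assumes docker kill returns a non-zero exit code on failure and that the 'No such container' text in stderr stays stable across Docker versions.

- name: Kill Docker container
  command: docker kill my_container
  register: output
  # Fail only on a non-zero exit code that is not the expected missing container error
  failed_when: output.rc != 0 and 'No such container' not in output.stderr

Checking the exit code instead of stdout avoids depending on what Docker prints on success, at the cost of relying on the exit code being meaningful.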

Recovering from an error

Another scenario in which you wish to handle an error is to prevent an undefined state. An undefined state could for example be the result of an invalid configuration file leaving an application in an unpredictable (and usually not very stable) state.

In such scenarios you'd at least wish to revert to a known state before stopping the playbook. I've found Ansible blocks very useful in these cases.

Blocks will allow us to group a set of tasks together and run error handlers should anything go wrong (truthfully you can do a lot more with blocks; I'll probably discuss them in a future article).

Take a look at the set of tasks below:

- name: Copy Nginx config
  copy:
    src: config
    dest: /etc/nginx/nginx.conf

- name: Restart Nginx
  service:
    name: nginx
    state: restarted

The above tasks will update the Nginx configuration and then restart the Nginx service. However, if we ever make a syntax error in our Nginx configuration file, the restart will fail. Even worse: Nginx will not even be running at this point! If this is a production system, your end users will not be able to reach your website.

It would be nice if we could recover from this by making a backup of the old configuration file and reverting to this backup in case of failure.

- block:
    - name: Backup Nginx config
      command: cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.backup

    - name: Copy Nginx config
      copy:
        src: config
        dest: /etc/nginx/nginx.conf

    - name: Restart Nginx
      service:
        name: nginx
        state: restarted
  rescue:
    - name: Restore Nginx backup config
      command: cp /etc/nginx/nginx.conf.backup /etc/nginx/nginx.conf

    - name: Restart Nginx
      service:
        name: nginx
        state: restarted

First we create a backup of the Nginx config. Next we copy the new one to the host. Lastly we try to restart the Nginx service. If any of these tasks fails, we restore the backup of the Nginx config and try to restart the Nginx service again.

There is one problem with this approach however. Right now we have recovered from an error scenario but it doesn't change the fact an error did occur. You wouldn't want the playbook to continue and you would want to be notified of this. So let's make a small adjustment to this playbook:

- block:
    - name: Backup Nginx config
      command: cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.backup

    - name: Copy Nginx config
      copy:
        src: config
        dest: /etc/nginx/nginx.conf

    - name: Restart Nginx
      service:
        name: nginx
        state: restarted
  rescue:
    - name: Restore Nginx backup config
      command: cp /etc/nginx/nginx.conf.backup /etc/nginx/nginx.conf

    - name: Restart Nginx
      service:
        name: nginx
        state: restarted

    - name: Force failure
      fail:
        # ansible_failed_result holds the result of the task that triggered the rescue
        msg: "Could not update Nginx config: {{ ansible_failed_result.msg | default('unknown error') }}"

By using the Ansible fail module we can manually trigger failures. In this case we display a custom error message with the error output appended to it.

This will prevent the playbook from continuing while also making sure the Nginx service keeps running with a stable configuration.
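
As a complementary safeguard, the copy module accepts a validate command that runs against the temporary file before it replaces the destination. The following is a minimal sketch, not part of the playbook above. It assumes the nginx binary is on the PATH of the host and that the copied file is a complete main configuration, so that nginx -t can test it on its own:

- name: Copy Nginx config
  copy:
    src: config
    dest: /etc/nginx/nginx.conf
    # %s is replaced with the path of the temporary file being validated
    validate: nginx -t -c %s

If validation fails, the task fails before the existing configuration is touched, so in a block like the one above the rescue section would still run.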

In all other cases

In my experience it is rare to recover from errors and continue with the playbook as if nothing happened. The above two examples illustrate the most common scenarios I have encountered: expected errors which can be ignored and recovering from errors which would leave the host in an undefined state (and then not continuing the playbook after).

My advice for other scenarios is to just let the errors happen. Your playbook will stop running, prompting you to fix whatever the problem is.

Think carefully about when ignoring errors is safe and how you can recover from them. Doing so incorrectly can do more harm than good.

© Copyright 2019 by Mathyn