Error handling with Ansible
Sep 25, 2019
Sometimes things go wrong, and no safeguard we build in could have prevented it. For a programmer, handling errors is perhaps just as important as trying to prevent them in the first place.
Take for example a configuration file stored on a user's hard drive. If you just assume this file always exists, you are in for a surprise when an adventurous user decides to remove it. All of a sudden your program could crash, and the adventurous user will probably blame you!
Ansible is no different in this regard. Things may go wrong and you may need to handle them or recover from them. I use the word 'may' very liberally here as you'll find in this article that sometimes just letting errors happen is the best course of action.
Manually detecting errors
What exactly constitutes an error might not be obvious to Ansible. For example you could have a custom command or shell script running on your host with very specific output describing success or failure.
Let's say we have a program called check_system which does a system check. If all is well, the message 'OK' will be written to stdout. If something is wrong, the message 'ERROR' followed by a list of error information will be written to stdout.
Ansible has no way of detecting information this specific so we have to let Ansible know when something is wrong.
The below example describes how to do this:
- name: Perform a system check
  command: check_system
  register: output
  failed_when: "'ERROR' in output.stdout"
This basic task executes the check_system command. The register keyword stores the task result, including the stdout and stderr streams, in the output variable. Next we use the failed_when keyword to describe when this task has failed.
Using failed_when we can detect errors Ansible would otherwise never have detected.
Manually detecting errors is only needed when Ansible cannot do so itself. Ansible will follow the convention of treating non-zero exit codes as errors. This covers most errors automatically.
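One thing to keep in mind: once failed_when is set, its condition decides the outcome of the task, so a non-zero exit code without 'ERROR' in stdout would no longer fail it. The sketch below combines both signals. It assumes check_system also follows the exit-code convention, which the example above does not state.

```yaml
- name: Perform a system check
  command: check_system
  register: output
  # Fail on a non-zero exit code OR on the 'ERROR' marker in stdout.
  # Assumption: check_system exits non-zero when it cannot run properly.
  failed_when: "output.rc != 0 or 'ERROR' in output.stdout"
```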
How and when to handle errors
Being able to detect when something goes wrong is great. Being able to react to it is even better.
Before we get into this however we should first ask ourselves if we really want to recover from an error. Wouldn't it make much more sense to just let the playbook fail? This way you are forced to investigate what went wrong.
I've found two scenarios in which reacting to an error is useful. The first simply ignores the error given a specific condition. The second recovers from an error to prevent leaving the host in an undesirable state.
Ignoring an error
One way to recover from an error is simply by ignoring it. If the error is detected, we simply let Ansible know it is OK. My advice is to only do this when an error is expected and of no further consequence.
As an example let's write a set of tasks which manually interacts with Docker:
- name: Kill Docker container
  command: docker kill my_container
- name: Update Docker image
  command: docker pull my_app
- name: Start docker container
  command: docker run -d --name my_container my_app
When running these tasks, the Docker container my_container is killed first. Next the image is pulled so we have the latest version. Lastly the Docker container is started again.
We know Docker will throw an error when trying to kill a container which does not exist. If you tried to run the above tasks without the Docker container my_container available, your playbook would stop at the first task. This might not be desired, as a missing container might simply mean it was already killed earlier. The Docker container not being available could be considered expected and of no consequence.
We want Ansible to ignore the errors thrown by the first command if the container was not found. To do so we change the first command:
- name: Kill Docker container
  command: docker kill my_container
  register: output
  failed_when:
    - "'my_container' not in output.stdout"
    - "'No such container' not in output.stderr"
We again use the failed_when keyword, but this time with a list of conditions. Ansible joins the conditions with an implicit and, so the result is only considered a failure when all of them are true.
We first check that the stdout stream does not contain the container name. We know that it should if the Docker command was successful. Next we check the stderr stream. Only if it also does not contain the text hinting at the container not existing do we consider it a failure.
When either one of the conditions is false we can safely assume everything went right (as any error encountered is expected and of no consequence) and move on with the other tasks.
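The same intent can also be expressed through the exit code instead of the stdout check. The variant below is only a sketch. It assumes docker kill exits non-zero on failure and that the 'No such container' text appears on stderr for a missing container; the exact wording may differ between Docker versions.

```yaml
- name: Kill Docker container
  command: docker kill my_container
  register: output
  # Fail only on a non-zero exit code that is NOT the expected
  # "container does not exist" error. List items are joined with "and".
  failed_when:
    - output.rc != 0
    - "'No such container' not in output.stderr"
  # Report a change only when a container was actually killed.
  changed_when: output.rc == 0
```

Checking the exit code makes the condition independent of what a successful docker kill prints to stdout.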
Recovering from an error
Another scenario in which you wish to handle an error is to prevent an undefined state. An undefined state could for example be the result of an invalid configuration file leaving an application in an unpredictable (and usually not very stable) state.
In such scenarios you'd at least wish to revert to a known state before stopping the playbook. I've found Ansible blocks very useful in these cases.
Blocks will allow us to group a set of tasks together and run error handlers should anything go wrong (truthfully you can do a lot more with blocks; I'll probably discuss them in a future article).
Take a look at the set of tasks below:
- name: Copy Nginx config
  copy:
    src: config
    dest: /etc/nginx/nginx.conf
- name: Restart Nginx
  service:
    name: nginx
    state: restarted
The above tasks update the Nginx configuration and then restart the Nginx service. However, if our Nginx configuration file contains a syntax error, the restart will fail. Even worse: Nginx will not even be running at this point! If this is a production system, your end users will not be able to reach your website.
It would be nice if we could recover from this by making a backup of the old configuration file and reverting to this backup in case of failure.
- block:
    - name: Backup Nginx config
      command: cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.backup
    - name: Copy Nginx config
      copy:
        src: config
        dest: /etc/nginx/nginx.conf
    - name: Restart Nginx
      service:
        name: nginx
        state: restarted
  rescue:
    - name: Restore Nginx backup config
      command: cp /etc/nginx/nginx.conf.backup /etc/nginx/nginx.conf
    - name: Restart Nginx
      service:
        name: nginx
        state: restarted
First we create a backup of the Nginx config. Next we copy the new one to the host. Lastly we try to restart the Nginx service. If any of these tasks fails, we put back the backup of the Nginx config and again try to restart the Nginx service.
There is one problem with this approach however. Right now we have recovered from an error scenario but it doesn't change the fact an error did occur. You wouldn't want the playbook to continue and you would want to be notified of this. So let's make a small adjustment to this playbook:
- block:
    - name: Backup Nginx config
      command: cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.backup
    - name: Copy Nginx config
      copy:
        src: config
        dest: /etc/nginx/nginx.conf
    - name: Restart Nginx
      service:
        name: nginx
        state: restarted
      register: output # Register error output
  rescue:
    - name: Restore Nginx backup config
      command: cp /etc/nginx/nginx.conf.backup /etc/nginx/nginx.conf
    - name: Restart Nginx
      service:
        name: nginx
        state: restarted
    - name: Force failure
      fail:
        # The service module reports failures in msg rather than stderr.
        # The default covers the case where an earlier task failed and
        # output was never registered.
        msg: "Could not update Nginx config: {{ output.msg | default('no error details captured') }}"
By using the Ansible fail module we can manually trigger failures. In this case we display a custom error message with the error output appended to it.
This will prevent the playbook from continuing while also making sure the Nginx service keeps running with a stable configuration.
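Registering the output only captures a failure of the restart task itself. Inside a rescue section Ansible also provides the variables ansible_failed_task and ansible_failed_result, which describe whichever task in the block failed. The sketch below uses them for the error message; the task name 'Report the failure' and the fallback text are my own choices for illustration.

```yaml
- block:
    - name: Backup Nginx config
      command: cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.backup
    - name: Copy Nginx config
      copy:
        src: config
        dest: /etc/nginx/nginx.conf
    - name: Restart Nginx
      service:
        name: nginx
        state: restarted
  rescue:
    - name: Restore Nginx backup config
      command: cp /etc/nginx/nginx.conf.backup /etc/nginx/nginx.conf
    - name: Restart Nginx
      service:
        name: nginx
        state: restarted
    - name: Report the failure
      fail:
        # ansible_failed_task and ansible_failed_result are set by Ansible
        # when the rescue section is entered; no register is needed.
        msg: >-
          Task '{{ ansible_failed_task.name }}' failed:
          {{ ansible_failed_result.msg | default('no message returned') }}
```

This way the message names the failing task even when the backup or copy step is the one that went wrong.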
In all other cases
In my experience it is rare to recover from errors and continue with the playbook as if nothing happened. The above two examples illustrate the most common scenarios I have encountered: expected errors which can be ignored and recovering from errors which would leave the host in an undefined state (and then not continuing the playbook after).
My advice for other scenarios is to just let the errors happen. Your playbook will stop running, prompting you to fix whatever the problem is.
Think carefully about when ignoring errors is safe and how you can recover from them. Doing so incorrectly can do more harm than good.