Nobody is perfect. Human error is inevitable in any process that involves people. The impact of an error depends on how you handle it and on the processes that are in place. What can you do to prevent something from going terribly wrong? I will walk you through three scenarios that can happen when specific procedures are not followed. For each scenario I describe procedures that may reduce the impact or prevent the error altogether.
In the first scenario, a developer makes a code change. Let us say you are working on a financial system that runs 24/7, and a bug is noticed on a Friday. The bug does not impact the business directly, but it seems easy enough to fix just before the weekend. You decide to fix it, but it takes a little longer than expected and everyone has already left the office. Because you are confident and sure of your business, you deploy your fix directly to production. You pack your things and go home.
The best-case scenario is that your fix is fine and everyone is happy. The worst case is that your fix broke something else, and the company lost a lot of money and customers.
Which procedures should be in place to reduce the impact of the change in the worst-case scenario?
Code review (four-eyes principle)
As a developer, your code should always be reviewed by another developer. This not only reduces the chance of a mistake but also increases code quality.
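The four-eyes rule can also be enforced by tooling rather than by convention. The following is a minimal sketch, assuming a hypothetical `ChangeRequest` model that is not tied to any particular code-hosting platform: a change may only be merged once someone other than its author has approved it.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A proposed code change waiting for review (hypothetical model)."""

    author: str
    approvals: set[str] = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        # The author cannot act as the second pair of eyes.
        if reviewer == self.author:
            raise ValueError("authors cannot approve their own change")
        self.approvals.add(reviewer)

    def may_merge(self) -> bool:
        # At least one approval from someone other than the author is required.
        return len(self.approvals) >= 1


change = ChangeRequest(author="alice")
change.approve("bob")
print(change.may_merge())
```

Most code-hosting platforms offer a comparable setting for protected branches, which is usually preferable to a home-grown check.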
Your code should be fully covered by unit tests. These tests should run successfully before every deployment. Most mistakes with big consequences should be caught by the tests.
For this procedure, a production-like environment (acceptance) is required. Everything that you deploy to production should first be deployed and tested on the acceptance environment. Depending on your organization, customers, developers, or dedicated testers should test all changes. In general, the tester should not be the same person as the one who made the change.
You are a system administrator, and your responsibility is to create and maintain a stable cloud environment based on a microservice architecture. To deliver 24/7 support, there is always one colleague on standby outside office hours. You are working on a recovery script that restores all microservices to a new cluster and moves all connections over from the existing cluster to the new one. Before you test this on the acceptance environment, one of your colleagues reviews the script and approves it. With all prerequisites fulfilled, you run the script. But by accident you run it on the production environment instead of acceptance. Unfortunately, there was a bug in the script, and it destroyed the existing production environment completely.
This scenario is difficult to prevent completely. As a system administrator you need full access to the environment, and you must be able to solve an issue by yourself when providing 24/7 support. For this scenario we will look at the options that help to recover as much as possible.
One of the most obvious options for recovery is having a backup of the systems. This is most useful when data is lost, because data that has been removed cannot easily be recovered without one. But creating a backup every once in a while does not solve the issue for everyone, and it might not even be necessary. To decide what type of backup you need, or whether you need one at all, answer these questions:
• What kind of data is involved?
• What is the impact if recent data is lost?
• Is the data available from another source?
• How often is this data required?
How your continuous integration (CI) and continuous delivery (CD) are organized can also have an impact on recovery. If your applications are developed in house, a proper CI/CD pipeline is useful not only for day-to-day work but also for recovery. Releasing a new version of the application to production should not require a lot of knowledge of the product itself, so that a system administrator can release the applications again if something goes wrong. Also, a new build from the main branch should deliver a stable version of the application in case the original build is lost.
You are working on a reporting tool that needs a lot of data from different sources. For development you are running a local database with some dummy data. After some time this local database contains a lot of unwanted records, and you decide to clean it up. To do so you open a SQL tool and write a truncate query for some tables. But then a production issue demands your attention, and you need to make some changes in the production database. For this you open a new window in your SQL tool and execute the required changes on production. When you are done, you continue the clean-up of your local database and execute the truncate query. Unfortunately, the query is executed on the production database instead of the local one.
With the procedures described in scenario 2 you can reduce the impact if something happens. But there are also options to reduce the chance of this kind of error happening in the first place.
Distinguish production from local
To reduce the chance of mixing up environments, there are tools and procedures you can follow. It is recommended to use different tools for every environment. As an example, you can use Chrome and SQL Developer for production, and Firefox and IntelliJ for development.
For some web browsers it is also possible to change the look of the browser based on the webpage you visit (a plugin might be required). By doing so you can easily see which environment you are working on.
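The same visual cue can be carried over to scripts and command-line tools. The sketch below is only an illustration of the idea: the host name in `PRODUCTION_HOSTS` is hypothetical, and the colors are standard ANSI escape codes that most terminals support. It prints a red banner for production and a green one for everything else.

```python
# Hypothetical host names; replace them with your own production hosts.
PRODUCTION_HOSTS = {"db.prod.example.com"}

RED_BANNER = "\033[41;97m"    # ANSI: red background, bright white text
GREEN_BANNER = "\033[42;30m"  # ANSI: green background, black text
RESET = "\033[0m"             # ANSI: reset all attributes


def environment_label(host: str) -> str:
    """Classify a host as production or not, based on the known host list."""
    return "PRODUCTION" if host in PRODUCTION_HOSTS else "non-production"


def banner(host: str) -> str:
    """Return a colored one-line banner naming the environment and host."""
    label = environment_label(host)
    color = RED_BANNER if label == "PRODUCTION" else GREEN_BANNER
    return f"{color} {label}: {host} {RESET}"


print(banner("db.prod.example.com"))
print(banner("localhost"))
```

Printing such a banner at the start of a script or session makes it harder to mistake a production window for a local one.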
Close production connections
The simplest solution is to just close the tools and browsers as soon as you are done with the production issue. With this you will never have a connection open to production when you are working on something else. Therefore, the chance of an error is massively reduced.
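When production changes are made from a script instead of an interactive tool, closing the connection can be automated. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for a real production database. The function name `run_production_fix` and its parameters are assumptions for illustration only.

```python
import sqlite3
from contextlib import closing


def run_production_fix(database_path: str, statement: str) -> None:
    """Open a connection, run one statement, and always close the connection."""
    # closing() guarantees connection.close() is called when the block exits,
    # even if the statement raises an exception.
    with closing(sqlite3.connect(database_path)) as connection:
        # Using the connection as a context manager commits on success
        # and rolls back on error; it does not close the connection itself.
        with connection:
            connection.execute(statement)
```

Because the connection only exists inside the function, no production session is left open for a later query to land in by mistake.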
As the scenarios above show, it is not always possible to prevent human error from breaking your environment. Where humans work, errors will be made, especially since people need specific rights to do their job. But with the right processes and recovery protocols it is possible to reduce the chance of a human error happening and, if an error is made, to reduce its impact.