For large, high-scale applications, the promise of “enterprise grade” availability and reliability is key to customer confidence in the application. Continuous delivery pipelines for such scaled-out applications typically consist of multiple environments.
DevOps enables faster, automated delivery of changes, which gets the latest features to customers sooner. However, any change to a production system carries risk. Safe deployment guidelines help manage this risk for large scaled-out applications, thereby keeping the promise made to customers.
In this blog post, we share the safe deployment guidelines that we follow at Microsoft and how we configure the pipelines, or release definitions, in Visual Studio Team Services to enforce them.
Gradual rollout to multiple environments
For the applications under discussion, the production environment comprises multiple scale units (one per region).
You may want to deploy the changes to a test or staging environment before deploying to any of the production environments (as a final quality validation), and then to a canary environment that serves synthetic load while interacting with the production instances of dependent services.
It is also recommended not to deploy to all production environments in one go, which would expose all customers to the changes at once. A gradual rollout exposes the changes to customers over a period of time, implicitly validating the changes in production with a smaller set of customers at a time.
In effect, the deployment pipeline would look like the following:
As an example, for an application deployed in 12 regions, with the US regions (4) carrying a high load, the European regions (4) a medium load, and the Asian regions (4) a relatively lighter load, the order of rollout would be as follows.
Test: Run final functional validation on the application.
Canary: Process synthetic load on the application, interacting with production instances of dependent services.
Pilot: Pilot customers (internal and early adopter customers) are onboarded to a separate scale unit. Deploy after deployment to Canary succeeds.
Asian regions 1, 2, 3 & 4: Asian regions have a lighter load. Deploy to all regions in parallel after deployment to Pilot succeeds.
European regions 1, 2, 3 & 4: European regions have a medium load. Deploy to all regions in parallel after deployment to all Asian regions succeeds.
US regions 1, 2, 3 & 4: US regions have a high load. Deploy to all regions in parallel after deployment to all European regions succeeds.
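The rollout order above can be modeled as a list of waves, where each wave starts only after the previous one succeeds and the environments within a wave deploy in parallel. The following is a minimal Python sketch of that idea, not how Visual Studio Team Services implements it; `roll_out`, the `deploy` callable, and the short environment labels are hypothetical names used for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Rollout waves in order. Environments inside a wave deploy in parallel.
ROLLOUT_WAVES = [
    ["Test"],
    ["Canary"],
    ["Pilot"],
    ["Asia 1", "Asia 2", "Asia 3", "Asia 4"],
    ["Europe 1", "Europe 2", "Europe 3", "Europe 4"],
    ["US 1", "US 2", "US 3", "US 4"],
]


def roll_out(deploy, waves=ROLLOUT_WAVES):
    """Deploy wave by wave and stop at the first wave that has a failure.

    `deploy` is a hypothetical callable that takes an environment name and
    returns True on success. Returns the waves that fully succeeded.
    """
    completed = []
    for wave in waves:
        # Deploy every environment of the wave in parallel.
        with ThreadPoolExecutor(max_workers=len(wave)) as pool:
            results = list(pool.map(deploy, wave))
        if not all(results):
            break  # Later waves are never exposed to the change.
        completed.append(wave)
    return completed
```

In this sketch, a failure in any European region means the US regions, which carry the highest load, are never touched.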
In a release definition, we use environment triggers to configure the environment deployment conditions as follows.
If required, you can deploy to the four scale units within each group of regions sequentially for an additional level of control.
Uniform deployment workflow for all environments
As discussed above, we deploy the application to each scale unit independently. A deployment and validation process is defined for each of the scale units. As a best practice, you should follow the same procedure to deploy those bits in every environment. The only things that should change from one environment to the next are the configuration you apply and the validation steps you execute.
To keep the deployment procedure the same across environments, we define a task group for it and include that task group in each of the environments. The environment-specific settings are parameterized, and their values are managed using environment variables in the release definition.
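To illustrate the idea of one shared procedure with per-environment values, here is a small Python sketch. It only mimics the concept of a task group with parameters; the step texts, variable names (`build`, `environment`, `settings_file`), and the `render_workflow` helper are assumptions made for the example and are not part of Visual Studio Team Services.

```python
# The shared deployment procedure, written once with placeholders.
DEPLOY_TASK_GROUP = (
    "Deploy build {build} to {environment}",
    "Apply settings from {settings_file}",
)


def render_workflow(variables, validation_steps):
    """Build the workflow for one environment.

    The deploy steps come from the shared task group; only the variable
    values and the validation steps differ between environments.
    """
    deploy = [step.format(**variables) for step in DEPLOY_TASK_GROUP]
    validate = [f"Validate: {step}" for step in validation_steps]
    return deploy + validate


canary_workflow = render_workflow(
    {"build": "1.4.2", "environment": "Canary", "settings_file": "canary.json"},
    ["synthetic load check"],
)
```

Because every environment renders the same `DEPLOY_TASK_GROUP`, a change to the procedure is made in one place and applies everywhere.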
The deployment workflow for each of the environments in our release definitions looks like the following.
Manual approval for rollouts
There are various reasons why you may not want the application to be updated at a particular time. It could be an upcoming major event for which you want to avoid all risk, or a known issue with a dependency that requires changes to your application to be deferred.
Configuring manual approvals before the pipeline begins ensures that two pairs of eyes confirm the application is not in one of these special circumstances and can be updated.
Moreover, there might be urgent hotfixes or special changes that do not apply to all the scale units. For such changes, we need to bypass the pipeline and deploy directly to specific environments only. We would still like to get approvals in such scenarios. However, when a change flows through the full pipeline, we do not want a separate approval for each of the environments.
So, in a nutshell, we are looking for one approval at the start of the deployment sequence. The sequence may or may not start from the Test environment.
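The requirement can be stated as a simple rule: whichever environment starts the sequence asks for approval, and the environments that follow it in the same flow do not ask again. The Python sketch below models only that rule; it does not describe how Visual Studio Team Services evaluates approvals, and `approval_prompts` is a hypothetical helper.

```python
def approval_prompts(sequence):
    """Map each environment in a deployment sequence to whether it prompts.

    Only the environment that starts the sequence prompts an approver.
    The rest proceed under that single approval.
    """
    return {env: index == 0 for index, env in enumerate(sequence)}


# Full pipeline flow: one prompt, at Test.
full_flow = approval_prompts(["Test", "Canary", "Pilot"])

# Hotfix deployed directly to one scale unit: still one prompt.
hotfix = approval_prompts(["US 3"])
```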
We configure approvals for each of the environments in our release definitions to fulfil these requirements. The approvals for the production environments are configured like the following.
Segregation of roles
As we discussed above, we would like to have two people analyze every deployment and ensure that all’s well for deploying to the environment.
It is possible that the approver named in the release definition is the same person who requests the deployment. In that case, the roles of deployment submitter and approver are not split between two different users, which risks a wrong deployment going through because of human oversight.
Release management provides an option to prevent the submitter from approving a deployment, thereby helping enforce segregation of roles.
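The check behind this option is simple to express. The following Python sketch shows the rule under the assumption that users are identified by a name or email string compared case-insensitively; `eligible_approvers` and its `deny_submitter` flag are hypothetical names, not the product's API.

```python
def eligible_approvers(approvers, submitter, deny_submitter=True):
    """Return the approvers who may approve a deployment.

    When `deny_submitter` is set, the person who requested the deployment
    is removed from the list, so a second person must approve.
    """
    if not deny_submitter:
        return list(approvers)
    return [a for a in approvers if a.casefold() != submitter.casefold()]
```

If the returned list is empty, the deployment cannot be approved until someone other than the submitter is added as an approver.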
Health check during roll out
With the above environment and approval settings, the validation phase for each environment plays a key role in ensuring the environments are healthy after deployment. It may not always be possible to fully automate the validation and health monitoring of environments.
In such circumstances, adopting a “wait and auto-promote” criterion for the production environments is recommended. The pipeline is paused for a certain duration, during which team members monitor various health indicators of the service and can abort the pipeline if it is not appropriate to continue. Users can manually restart the pipeline from the next environment once the issue is analyzed.
Including a manual intervention task in the validation workflow for production environments helps us configure the release definitions to be in “wait and auto-promote” mode with a 24-hour wait between environments.
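The “wait and auto-promote” behavior amounts to a timed pause that an operator can interrupt. Here is a minimal Python sketch of that pattern using `threading.Event`; it is an illustration of the concept only, and `wait_and_auto_promote` and `abort_requested` are hypothetical names rather than anything provided by the manual intervention task.

```python
import threading

TWENTY_FOUR_HOURS = 24 * 60 * 60  # Wait between environments, in seconds.


def wait_and_auto_promote(abort_requested, wait_seconds=TWENTY_FOUR_HOURS):
    """Pause the pipeline, then promote unless someone aborted.

    `abort_requested` is a threading.Event that a team member sets when
    health indicators look wrong. Returns True to promote to the next
    environment and False if the pipeline was aborted during the wait.
    """
    # Event.wait returns True if the event was set before the timeout.
    aborted = abort_requested.wait(timeout=wait_seconds)
    return not aborted
```

A monitoring operator (or an automated health probe, if one exists) would call `abort_requested.set()` from another thread to stop the rollout before the wait elapses.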
Branch filters for deployments
With extensive use of Git as the version control system for development, developers commit changes in various branches. All the changes come together in master or release branches. To ensure that only complete features are deployed, it is recommended to allow only artifacts generated from these branches to be deployed to the production environments.
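A branch filter is essentially a pattern match on the branch that produced the artifact. The Python sketch below shows the idea with the standard `fnmatch` module; the `release/*` naming pattern is an assumption about how release branches are named, and `allowed_in_production` is a hypothetical helper, not the filter syntax of the release definition.

```python
from fnmatch import fnmatchcase

# Assumed naming: a single master branch and release branches under release/.
PRODUCTION_BRANCH_FILTERS = ("master", "release/*")


def allowed_in_production(branch, filters=PRODUCTION_BRANCH_FILTERS):
    """Return True if an artifact built from `branch` may reach production."""
    return any(fnmatchcase(branch, pattern) for pattern in filters)
```

With these filters, an artifact built from a feature branch can still go to a test environment but is rejected before any production environment.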
Secure the pipelines
We have now configured our pipelines to ensure that changes are deployed safely to all the environments. Team members can now create releases and start the deployments. To avoid any issues, we need to ensure that the configurations are not disturbed and that the checks put in place are not bypassed by users.
We configure release permissions on the release definitions to avoid any unwanted changes. Specifically, we control which users are granted the “Administer release permissions”, “Delete release definition”, “Delete release environment”, “Delete releases”, “Edit release definition”, “Edit release environment”, “Manage release approvers” and “Manage releases” permissions.
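One way to keep these settings from drifting is to review periodically who holds the sensitive permissions. The Python sketch below assumes you already have the grants available as a mapping from user to permission names (how you obtain that data is outside this example); `unexpected_grants` and the shape of `grants` are hypothetical.

```python
# The permissions listed above that can alter or bypass the pipeline checks.
RESTRICTED_PERMISSIONS = frozenset({
    "Administer release permissions",
    "Delete release definition",
    "Delete release environment",
    "Delete releases",
    "Edit release definition",
    "Edit release environment",
    "Manage release approvers",
    "Manage releases",
})


def unexpected_grants(grants, trusted_users):
    """Find users outside `trusted_users` who hold restricted permissions.

    `grants` maps a user name to an iterable of permission names.
    Returns a dict of user -> sorted list of restricted permissions held.
    """
    findings = {}
    for user, permissions in grants.items():
        if user in trusted_users:
            continue
        restricted = sorted(set(permissions) & RESTRICTED_PERMISSIONS)
        if restricted:
            findings[user] = restricted
    return findings
```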