Application Production Readiness Guide
Deploying an application to production is not the end state but a crucial stage of the software development lifecycle. A lot can go wrong, especially when you are deploying a new service for the first time or releasing a major feature update. To help you avoid incidents, this guide offers a blueprint for deploying and managing your application in production.
Production Readiness
- Architecture Review: This is normally the very first step, done before development starts. But as development progresses, many things change and the documented architecture becomes outdated. It is always good practice to keep the document updated and to verify it once more before the final release.
- Security: Security is another important pillar of a healthy service. It protects the data stored in and shared across the various subsystems. Verifying that good security practices are followed before releasing the service will pay off in the long run.
- Service Dashboard: Consolidate your service monitoring dashboards in one place to provide a holistic view of service health and performance. This helps in understanding the various components and the usage of the application.
- Alarms: Set up the standard alarms, such as memory, CPU and GPU utilization (with a threshold at 70% or 80%), so that your team learns about a potential issue early and can mitigate it proactively. These alarms also help you put the right scaling methods in place as required. For metrics whose normal range is unknown at release time, it is fine to start with an arbitrary threshold and adjust it once real data is available.
- Scaling, Caching and Latency: Every application is built to support a certain set of users and a certain number of transactions, but you often need to be ready to scale up or down based on usage. It is best practice to define scaling policies that upscale and downscale the application based on various parameters, to avoid downtime or customer impact. Ensure a proper caching mechanism is implemented, both to cache data and to invalidate cached data. This helps you maintain your low-latency SLA.
- Beta Testing: Beta testing provides a developer environment to test the flows, interactions and end-to-end working of the application. This helps build confidence.
- Gamma Testing or UAT (User Acceptance Testing): It is always better to test your application or feature with a small set of braveheart users who can provide feedback based on real usage. This also lets you test your application before opening it to a wider audience.
- Runbooks: Some tasks require a set of manual operations, e.g. onboarding a user or migrating existing customers. These should be properly documented to avoid issues.
- UT coverage: Define a minimum unit test (UT) coverage criterion, say 80% or 90%, and write unit tests to meet it. The better the coverage, the lower the chance of bugs.
- Integration Tests: Unit tests are a great way to catch issues in a function or a unit of code, but integration tests let you test a functionality as a whole. They also ensure that future changes do not break existing functionality.
- CM (Change Management) Process: The Runbooks item above covers documenting manual operations. While runbooks define the steps of a manual operation, the Change Management process ensures that the proper steps are followed and that the changes are documented. This establishes a best practice for any manual change and avoids issues in production.
- CI/CD Pipeline: Avoid manual intervention as much as you can. A full CI/CD pipeline will ensure the changes are properly reviewed, unit tested and integration tested.
- Dev Testing: Perform as much dev testing as you can, for all the scenarios you can think of. This raises the quality of the deliverable and the safety of the code deployed in production.
- Incident Mitigation Plan: However well the system is designed and developed, there will always be known or unknown risks associated with it. It is always good practice to document a risk mitigation plan in case an incident happens. For example: what if the web server crashes, or what if web page redirection is broken? A mitigation plan acts as a guide that lets any developer on the team act quickly during an incident. “Develop for the best, prepare for the worst!”
- Dependency Documentation: If your application depends on one or more other applications, review each dependency, document it, and create a mitigation plan for cases such as that service starting to throttle requests or going down.
- Logging: Ensure proper logging is implemented at all levels, such as info, error, warn and debug. Also ensure that no critical or sensitive data is printed in the logs and that an appropriate log level is configured for each stage.
- Load Testing: Load testing your application gives you a number for the load at which it might break. This helps you implement a proper scaling mechanism to avoid downtime.
- Rollback Mechanism: What if a faulty deployment reaches production? What if untested code reaches production? What if the deployed code does not behave correctly in production? To mitigate such issues, a proper rollback mechanism should be established and documented so that you can quickly restore the application to its last known good state.
- A/B Testing: For a major feature update, it is better to collect metrics from A/B customer groups. These metrics provide insight into the released feature and its adoption, and help you decide whether to roll it out to a wider audience or roll it back.
- Roll Out Mechanism: Alpha, beta, gamma, UAT and production are stages you can set up to target different kinds of customers. But to release your application or a major feature to a wider audience, define a proper rollout strategy, such as rolling out to a certain geographic user base, to a certain percentage of users, or to a certain number of users. Also define how you will roll out to the wider audience once the initial rollout is a success.
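
As an illustration of the Roll Out Mechanism item, rolling out to a certain percentage of users could be sketched roughly as below. This is only one possible approach, not a prescribed one: it hashes the user ID into a stable bucket, so the same user always gets the same answer and raising the percentage only ever adds users. The function name `in_rollout`, the feature name `new-checkout` and the user IDs are all hypothetical.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Return True if the user falls inside the first `percent` of buckets."""
    # Hash feature and user together so each feature gets an independent split.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10000  # stable bucket in 0..9999
    return bucket < percent * 100  # e.g. 5 percent -> buckets 0..499


# Hypothetical user base, used only to show how the cohorts relate.
users = [f"user-{i}" for i in range(1000)]
enabled_5 = {u for u in users if in_rollout(u, "new-checkout", 5)}
enabled_50 = {u for u in users if in_rollout(u, "new-checkout", 50)}

# Everyone in the 5% cohort stays enabled when the rollout grows to 50%.
print(enabled_5 <= enabled_50)
```

Because the bucket depends only on the feature and user ID, no per-user state needs to be stored, and rolling back is a matter of lowering the percentage.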
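
The threshold alarms from the Alarms item can be pictured with a minimal sketch like the following. It assumes utilization readings arrive as percentages in a plain dictionary; the `Alarm` class, the `breached` function and the sample readings are hypothetical, and a real setup would normally use the alarm features of your monitoring system instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    metric: str
    threshold_percent: float


def breached(alarms, readings):
    """Return the metrics whose latest reading meets or exceeds its threshold."""
    # A metric with no reading is treated as 0.0 here (an assumption of this sketch).
    return [a.metric for a in alarms
            if readings.get(a.metric, 0.0) >= a.threshold_percent]


# Thresholds follow the 70% / 80% suggestion from the Alarms item.
ALARMS = [Alarm("cpu", 80.0), Alarm("memory", 80.0), Alarm("gpu", 70.0)]

# Hypothetical readings, in percent.
print(breached(ALARMS, {"cpu": 91.5, "memory": 42.0, "gpu": 70.0}))  # ['cpu', 'gpu']
```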
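
For the caching advice under Scaling, Caching and Latency, the two halves (caching data and invalidating it) might look like this in a very small in-process form. It is a sketch under stated assumptions: entries expire after a fixed time-to-live, and the `TTLCache` class and `load_profile` function are hypothetical. A production service would more likely rely on a dedicated cache.

```python
import time


class TTLCache:
    """Cache values for a fixed time-to-live, with explicit invalidation."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._store = {}     # key -> (expires_at, value)

    def get(self, key, loader):
        """Return the cached value, calling loader(key) on a miss or expiry."""
        entry = self._store.get(key)
        now = self._clock()
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader(key)
        self._store[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop a key, e.g. right after the underlying data is updated."""
        self._store.pop(key, None)


def load_profile(user_id):
    # Hypothetical stand-in for a slow database or service call.
    return {"id": user_id}


cache = TTLCache(ttl_seconds=30)
profile = cache.get("u1", load_profile)  # miss: calls load_profile
profile = cache.get("u1", load_profile)  # hit: served from the cache
cache.invalidate("u1")                   # next get reloads the data
```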
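
The Logging item asks for two things: a suitable log level per stage, and no critical data in the logs. A rough sketch of both with Python's standard `logging` module follows. The stage names, the stage-to-level mapping and the secret pattern are assumptions made for illustration, and a simple pattern like this will not catch every kind of sensitive value.

```python
import logging
import re

# Hypothetical mapping from deployment stage to minimum log level.
STAGE_LEVELS = {"dev": logging.DEBUG, "beta": logging.INFO, "prod": logging.WARNING}

# Hypothetical pattern: key=value pairs whose key suggests a secret.
SECRET = re.compile(r"\b(password|token|secret)=\S+", re.IGNORECASE)


class RedactSecrets(logging.Filter):
    """Mask secret-looking key=value pairs before a record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the final message first so values passed as args are masked too.
        record.msg = SECRET.sub(r"\1=[REDACTED]", record.getMessage())
        record.args = ()
        return True  # keep the record; only its text is changed


def build_logger(stage: str) -> logging.Logger:
    logger = logging.getLogger(f"app.{stage}")
    logger.setLevel(STAGE_LEVELS[stage])
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.addFilter(RedactSecrets())
        logger.addHandler(handler)
    return logger


log = build_logger("prod")
log.warning("login failed for user=%s password=%s", "alice", "hunter2")
log.info("this line is below the prod level and is dropped")
```

Attaching the filter to the handler means every record that handler emits is masked, whichever part of the application produced it.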