"How did we not catch it before it went to production?"
"Oh, that never happened on Staging before, and we can't seem to reproduce it in our test environments."
You’ve been there before. After spending hours upon hours in the trenches, you figure out that the staging database runs a different version than production, or that your “stateless” app is not 100% stateless. But how would you tell, when it’s running on just one instance?
Staging, Pre-Production, or whatever you call the last line of defense before you deploy to production, is there to emulate production.
So how come you still have new production issues popping up from time to time?
Yes, you can’t predict user behavior with 100% accuracy, traffic behaves differently, and, well, chaos will find a way. But there is another reason, one that you can avoid: trying to save money, or team resources, on staging environments.
Here is a scenario that I see in almost every company I visit: they have a “Dev” environment, sometimes a QA environment, and at least one Staging environment that they use before production. In recent years, they even started using different cloud accounts for each, which, in my book, entitles them to a participation trophy. But then you take a closer look and discover that underneath, things are not optimal. Yes, the application seems similar on Staging, but the underlying architecture is a bit different:
- Instead of having multiple application instances, they only have one.
- There is just one Elasticsearch server.
- There is just one Kafka broker.
- There are no load balancers.
- The database is running with a different configuration than production.
- Sizing is off.

These are the outcomes of trying to save money. Why keep all these resources running when there is no live traffic?
Let’s go back to the original purpose of staging environments. They allow you to mimic production, and that includes the infrastructure. Another reason is that once you do have a production issue, you can go back and reproduce it.
To be a good problem solver in our line of work, one of the first things you do is figure out what changed: what the differences between states are. You don’t want too many moving parts, and if possible, you want those changes confined to the application, not the underlying infrastructure.
Mirroring the production environment is expensive, but then again, you went to all this trouble to make sure that the business is 100% available.
After all, it’s all about risk management, and you may end up taking the risk that production issues will happen from time to time, but that should be a mindful decision.
The cost is not always “direct”; sometimes it is the prioritization and time investment that causes these differences. You may think your team’s efforts are better spent elsewhere, but skimping on resources, the human kind included, may not be your best play here.
Is there a middle ground?
Again, it’s a matter of risk tolerance. If your AWS bills are eating up your runway, you may have to consider saving costs on the staging environment. So here are a few things to keep in mind that may help you avoid production issues, by keeping the environments similar where things tend to fail:
Application Versions - Make sure that the supporting services are all running on the exact same version. That includes your database servers, message brokers, Spark, the underlying Kubernetes, etc.
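As a starting point, here is a minimal sketch of a version parity check. The endpoints are assumptions (Elasticsearch is shown because it reports its version on its root endpoint); extend the dictionary with however your own databases, brokers, and clusters report theirs.

```python
# Minimal sketch: compare reported versions of supporting services between
# staging and production. The URLs below are hypothetical internal endpoints.
import requests

SERVICES = {
    "elasticsearch": {
        "staging": "http://es.staging.internal:9200",      # assumed hostname
        "production": "http://es.prod.internal:9200",      # assumed hostname
    },
}

for name, envs in SERVICES.items():
    versions = {}
    for env, url in envs.items():
        info = requests.get(url, timeout=5).json()
        # Elasticsearch returns {"version": {"number": "..."}} on its root endpoint.
        versions[env] = info.get("version", {}).get("number", "unknown")
    if len(set(versions.values())) > 1:
        print(f"MISMATCH in {name}: {versions}")
    else:
        print(f"{name}: {versions['staging']} everywhere")
```

A check like this is cheap to run in CI, so version drift gets flagged before anyone has to debug it in production.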
Sizing - If your applications run on separate servers in production, do the same in Staging. Don’t run your application and the database on the same instance, or you may never see the network-related issues that can happen in production.
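If you run on AWS, a quick comparison of instance types per environment can reveal sizing drift. The sketch below assumes instances carry an Environment tag; the tag name and values are illustrative.

```python
# Minimal sketch: tally EC2 instance types per environment, assuming an
# "Environment" tag convention (an assumption; adjust to your own tagging).
from collections import Counter
import boto3

ec2 = boto3.client("ec2")

def instance_types(environment: str) -> Counter:
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "tag:Environment", "Values": [environment]}]
    )
    counts = Counter()
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                counts[instance["InstanceType"]] += 1
    return counts

print("staging:   ", instance_types("staging"))
print("production:", instance_types("production"))
```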
Security - Just because this environment is not accessible to the world doesn’t mean you can skip the same safeguards. And I’m not talking only about the security side of things. If you use self-signed certificates in one place and full certificate chains in another, it’s going to cause you trouble in the long run. Allowing root users in Kubernetes in one cluster while restricting them in production leaves room for errors.
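One cheap way to catch the certificate mismatch above is to attempt a fully verified TLS handshake against each environment. The hostnames below are placeholders.

```python
# Minimal sketch: verify the full certificate chain of each environment.
# A self-signed or incomplete chain fails verification and gets reported.
import socket
import ssl

HOSTS = ["api.staging.example.com", "api.example.com"]  # assumed hostnames

def verify_chain(host: str, port: int = 443) -> str:
    context = ssl.create_default_context()  # verifies against system CAs
    try:
        with socket.create_connection((host, port), timeout=5) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                issuer = dict(item[0] for item in tls.getpeercert()["issuer"])
                return f"{host}: OK, issued by {issuer.get('organizationName', issuer)}"
    except ssl.SSLCertVerificationError as err:
        return f"{host}: FAILED verification ({err.reason})"

for host in HOSTS:
    print(verify_chain(host))
```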
Redundancy / High Availability - Systems behave differently when running in clusters or with multiple instances. Is your stateless app really stateless? Do your applications lock files and resources? What happens when an application instance dies? Keep the minimum number of nodes recommended by your vendors, and for your own services, keep at least two instances.
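If you run on Kubernetes, a sketch like the following can flag single-replica Deployments in staging. The “staging” context, namespace, and the two-instance threshold are assumptions.

```python
# Minimal sketch: flag Deployments that run a single replica in staging,
# because HA behavior is never exercised with only one instance.
from kubernetes import client, config

config.load_kube_config(context="staging")  # assumes a "staging" kubeconfig context
apps = client.AppsV1Api()

for deploy in apps.list_namespaced_deployment(namespace="staging").items:
    replicas = deploy.spec.replicas or 1
    if replicas < 2:
        print(f"{deploy.metadata.name}: only {replicas} replica(s); "
              "failover and locking issues will never show up here")
```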
Load Balancers - Similar to the above, use load balancers in the staging environment, and keep them with the same configuration. You don’t want to discover “stickiness” issues in production.
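A rough way to spot unintended stickiness is to send repeated requests through the staging load balancer and count which backend answered. The URL and the X-Backend-Id header below are hypothetical; many setups expose an instance id through a header or a /whoami-style endpoint.

```python
# Minimal sketch: tally which backend answers repeated requests through the
# load balancer. One backend taking every hit suggests stickiness (or a
# single instance behind the LB).
from collections import Counter
import requests

URL = "https://app.staging.example.com/whoami"  # assumed endpoint

hits = Counter()
with requests.Session() as session:  # a session keeps cookies, like a real client
    for _ in range(50):
        response = session.get(URL, timeout=5)
        hits[response.headers.get("X-Backend-Id", response.text.strip())] += 1

print(hits)
```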
Application profiles and configuration - In development environments, it is common to have different, more permissive configuration profiles.
Verbose log levels or open JMX ports are required during development, among other things that developers (sometimes) rightfully request to do their work. This “leeway” also allows developers to use shortcuts that would not work in production. For example, the Kafka brokers in development didn’t require client certificates, so the developer never bothered implementing them. If you catch such issues only in production, you are going to have a bad time.
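One way to keep that leeway visible is to diff the configuration profiles themselves. The sketch below compares two Java-style .properties files; the file names are illustrative, and the same idea applies to YAML files or environment variables.

```python
# Minimal sketch: diff two properties files to surface settings that differ
# between the dev and prod profiles (or exist only in one of them).
def load_properties(path: str) -> dict[str, str]:
    props = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                props[key.strip()] = value.strip()
    return props

dev = load_properties("application-dev.properties")    # assumed file name
prod = load_properties("application-prod.properties")  # assumed file name

for key in sorted(dev.keys() | prod.keys()):
    if dev.get(key) != prod.get(key):
        print(f"{key}: dev={dev.get(key, '<missing>')!r} prod={prod.get(key, '<missing>')!r}")
```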
Deployment strategies and code - This is your last chance to check your code. Use the same deployment strategies as you would in production.
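If you deploy to Kubernetes, the rollout strategy is part of the Deployment spec, so it can be compared directly between clusters. A sketch, assuming “staging” and “production” kubeconfig contexts and illustrative namespace and service names:

```python
# Minimal sketch: compare the rollout strategy of the same Deployment in
# staging and production. Context, namespace, and Deployment names are
# assumptions for illustration.
from kubernetes import client, config

def strategy(context: str, namespace: str, name: str):
    config.load_kube_config(context=context)
    deploy = client.AppsV1Api().read_namespaced_deployment(name, namespace)
    return deploy.spec.strategy.type, deploy.spec.strategy.rolling_update

for ctx in ("staging", "production"):
    print(ctx, strategy(ctx, namespace="web", name="checkout-service"))
```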
Autoscaling - This is a tough one, as traffic patterns are very different from production. It takes time to fine-tune Autoscaling, and there will be differences between the two environments, but you want to have it in place for two reasons:
- To be able to test things out when performing stress tests (you are stress testing, right?).
- It forces you to think of resources as dynamic, and encountering that behavior before going to production is vital.
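For the first point, even a crude load burst will tell you whether scaling reacts at all. The sketch below is illustrative only (the URL and numbers are assumptions); a dedicated tool such as k6, Locust, or JMeter is the better long-term answer.

```python
# Minimal sketch: fire a burst of concurrent requests at staging and watch
# whether the autoscaler reacts (e.g. `kubectl get hpa -w` in another shell).
import concurrent.futures
import requests

URL = "https://app.staging.example.com/health"  # assumed endpoint

def hit(_):
    try:
        return requests.get(URL, timeout=5).status_code
    except requests.RequestException:
        return "error"

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(hit, range(2000)))

print({code: results.count(code) for code in set(results)})
```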