On a bit of an AWS Whitepaper binge as of late. This post catalogs some of the important highlights and takeaways I’ve had reading through a number of them. Although it’s all presented in the context of AWS products and services, a lot of the information is generally applicable to any cloud architecture. Reading these is a great way to get familiar with the space, as other cloud providers (Google Cloud, Microsoft Azure, etc.) no doubt have similar offerings now and will in the future.
Check out the References section at the bottom of this post. I’ve linked to some specific whitepapers that I found the most interesting/generally applicable.
Another quick tip: You might want to just cruise over to https://aws.amazon.com/whitepapers/ and work your way down. The most recent papers are first. Things move so fast you’ll want to read the papers published in the last year or so (unless the only papers on a particular topic you’re interested in are older). I plan to keep visiting the site periodically. As new offerings come out, papers will be released covering best practices for them.
Know & Understand Your Workloads
Match workloads to the right instance types. Is your workload CPU-heavy? RAM-heavy? I/O-heavy? Disk-heavy? Something else?
Pick the right instance size for the workload and re-evaluate periodically, since new instance types come out all the time. Load test different instance types and compare performance; sometimes fewer, larger instances can be faster and more cost-effective than more, smaller ones. Revisiting the available instance types periodically helps you spot types that better suit your workload, or that better match how it has evolved over time.
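For example, here’s a minimal sketch of comparing candidates, assuming Python with boto3 and locally configured AWS credentials; the instance types listed are arbitrary picks:

    import boto3

    # Assumes AWS credentials and a default region are configured locally.
    ec2 = boto3.client("ec2")

    # Pull vCPU and memory specs for a few candidate instance types
    # before load testing each one against the real workload.
    response = ec2.describe_instance_types(
        InstanceTypes=["m5.large", "m5.xlarge", "c5.xlarge", "r5.large"]
    )

    for itype in response["InstanceTypes"]:
        name = itype["InstanceType"]
        vcpus = itype["VCpuInfo"]["DefaultVCpus"]
        mem_gib = itype["MemoryInfo"]["SizeInMiB"] / 1024
        print(f"{name}: {vcpus} vCPUs, {mem_gib:.1f} GiB RAM")

The specs only narrow the field; the load test against your actual workload is what decides.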
Use the Right Tool for the Job
Consider your use case carefully. For example, don’t just default to an RDBMS. Take a step back and, again, think about your workload. What are your access patterns? Primary-key-only lookups? Is the workload read-heavy or write-heavy? Do you need strong consistency? What’s your tolerance for data loss?
In much the same vein as the advice on server workloads, pick the right tool for the job. A key-value store, for example, may make a lot more sense for your particular use case. Polyglot persistence is an important ingredient in your cloud strategy. It’s also important not to treat the choice of database as a one-time, fixed decision: be prepared to evolve it over time as needs and requirements change.
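To make the key-value route concrete, here’s a minimal sketch using DynamoDB through boto3; the sessions table name and key schema are hypothetical:

    import boto3

    # Hypothetical table: partition key "session_id" (string).
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("sessions")

    # Simple key-value access pattern: write and read by primary key.
    table.put_item(Item={"session_id": "abc123", "user": "alice"})
    item = table.get_item(Key={"session_id": "abc123"}).get("Item")
    print(item)

If your access pattern really is primary-key-only, this buys you predictable latency and managed scaling without the overhead of a relational schema.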
Prefer Serverless
Go to higher levels of abstraction where possible to reduce costs and management overhead. Move beyond just letting someone else manage the VMs or physical servers: let them manage the whole software stack beneath your business logic so you can focus on what actually brings your users value. Running infrastructure is a cloud provider’s core competency; let them do it so you can focus on yours. In terms of costs, some of the differences I’ve seen reported online are staggering.
In serverless architectures, you get additional economies of scale by minimizing idle, under-utilized, or redundant resources. You don’t need to provision extra capacity for scaling or high availability, since that’s managed automatically, which can mean a lower cost per transaction. In addition, all kinds of cross-cutting concerns can be handled for you: caching, authentication, monitoring, auto-scaling, network latency optimizations with edge locations, and so on. This can also improve the end-to-end latency of your entire system from the end user’s perspective.
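To make that concrete, here’s a minimal sketch of an AWS Lambda handler in Python; the API Gateway proxy-style event shape is an assumption for illustration:

    import json

    # Everything beneath this function (provisioning, scaling, patching,
    # high availability) is the provider's problem, not yours.
    def handler(event, context):
        name = (event.get("queryStringParameters") or {}).get("name", "world")
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"message": f"hello, {name}"}),
        }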
Leverage Managed Services
These are designed for high availability and scalability, so they can lower your risk. They can reduce organizational complexity, and the need for in-house expertise on specific technologies.
ElastiCache with Redis/Memcached, Amazon RDS with pretty much any popular database of your choice, DynamoDB, SQS, SES, EMR, Kinesis… the list goes on and on. Take the infrastructure you already use and move it into the cloud so the cloud provider can manage it for you at scale. Many managed services are designed with a multi-Availability-Zone architecture for disaster recovery, which means you can achieve high availability with no additional effort.
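As a sketch, here’s how you might ask for that multi-AZ posture when creating an RDS instance with boto3; the identifiers and credentials are placeholders:

    import boto3

    rds = boto3.client("rds")

    # MultiAZ=True asks RDS to maintain a synchronous standby in another
    # Availability Zone and fail over to it automatically.
    rds.create_db_instance(
        DBInstanceIdentifier="my-app-db",       # placeholder name
        DBInstanceClass="db.t3.medium",
        Engine="postgres",
        MasterUsername="appuser",
        MasterUserPassword="change-me-please",  # use a secrets manager in practice
        AllocatedStorage=20,
        MultiAZ=True,
    )

One flag, and the failover machinery is someone else’s operational burden.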
One key point that really can’t be overstated: you can take more risks because you aren’t invested in something built in-house or paid for with large up-front costs. Design decisions can be revisited more objectively and aggressively, with less organizational baggage.
Similar to instance types, always be on the lookout for new managed service offerings, as they can also lower costs and alleviate operational burden.
Apply Cost Optimization Strategies
Don’t rely on forecasting: measure what you actually use and make decisions based on what you need. By repeatedly analyzing your applications and workloads as they evolve, you can right-size your instance types and save money. Of course, you’ll need tools that let you analyze your expenditure and usage over time.
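For example, a minimal sketch that breaks down actual spend by service using the Cost Explorer API via boto3 (this assumes Cost Explorer is enabled on the account; the date range is arbitrary):

    import boto3

    ce = boto3.client("ce")

    # Query actual (not forecast) monthly spend, grouped by service.
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": "2023-01-01", "End": "2023-02-01"},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )

    for result in response["ResultsByTime"]:
        for group in result["Groups"]:
            service = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            print(f"{service}: ${amount:.2f}")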
Tagging, tagging, tagging. Segregate resources with tags so you can directly attribute costs to specific areas or business units, and then accurately assess the costs and ROI of those areas in usage reports. You can get pretty creative with tags and categorize things in all kinds of interesting, cross-cutting ways; it’s critical to getting the most out of cost-related reporting. Raising the visibility of costs across groups also encourages more efficient usage, since cost savings at the team level become well understood.
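A quick sketch of tagging in code with boto3; the instance ID and tag scheme here are hypothetical:

    import boto3

    ec2 = boto3.client("ec2")

    # Tags like these let cost reports be sliced by team, environment,
    # and cost center in cross-cutting ways.
    ec2.create_tags(
        Resources=["i-0123456789abcdef0"],  # placeholder instance ID
        Tags=[
            {"Key": "team", "Value": "payments"},
            {"Key": "environment", "Value": "production"},
            {"Key": "cost-center", "Value": "cc-1042"},
        ],
    )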
Make liberal use of auto-scaling (up, down, and off!) wherever possible. It saves money and lets you handle spikes in resource usage effectively.
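The “off” part is easy to overlook. As a sketch, scheduled scaling can turn a development environment off overnight and back on in the morning; the group name and schedule below are illustrative:

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Scale a (hypothetical) dev group to zero at 20:00 UTC every day...
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="dev-workers",
        ScheduledActionName="scale-off-overnight",
        Recurrence="0 20 * * *",
        MinSize=0, MaxSize=0, DesiredCapacity=0,
    )

    # ...and back up to two instances at 08:00 UTC.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="dev-workers",
        ScheduledActionName="scale-up-morning",
        Recurrence="0 8 * * *",
        MinSize=2, MaxSize=10, DesiredCapacity=2,
    )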
Back to workloads again: are yours relatively predictable over the long term? Predictable long-term workloads are excellent candidates for Reserved Instances. Unpredictable workloads may be better suited to on-demand instances, though there are other options as well. Even if you pay up-front for Reserved Instances and later change your mind, you can resell the capacity on the open market and recoup some of the cost.
Can the workload be flexibly scheduled? Can it tolerate interruptions? Think about Spot Instances. There are a few different options when it comes to Spot, so there’s even some flexibility here (e.g., bidding for fixed-duration instances).
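Here’s a minimal sketch of a one-time Spot request with boto3; the AMI and instance type are placeholders:

    import boto3

    ec2 = boto3.client("ec2")

    # A one-time request for interruptible capacity, typically at a
    # fraction of the on-demand price.
    ec2.request_spot_instances(
        InstanceCount=2,
        Type="one-time",
        LaunchSpecification={
            "ImageId": "ami-0123456789abcdef0",  # placeholder AMI
            "InstanceType": "c5.xlarge",
        },
    )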
Quoting from this great blog post by Jeff Barr:
Build Price-Aware Applications – I’ve said it before: cloud computing is a combination of a business model and a technology. You can write code (and design systems) that are price-aware, and that have the potential to make your organization’s cloud budget go a lot further. This is a new area for a lot of technologists; my advice to you is to stretch your job description (and your internal model of who you are and what your job entails) to include designing for cost savings.
This really resonates with me. I like the idea of cost-awareness as a fundamental tenet of the cloud programming model. Plus, as a programmer, it’s another fun little optimization area, one that can have a significant impact on the business.
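A sketch of what price-aware code might look like: check the recent spot price and pick an instance type accordingly. The instance types and the price threshold here are purely illustrative assumptions:

    import boto3

    ec2 = boto3.client("ec2")

    # Fetch the most recent spot price for the preferred instance type.
    history = ec2.describe_spot_price_history(
        InstanceTypes=["c5.xlarge"],
        ProductDescriptions=["Linux/UNIX"],
        MaxResults=1,
    )
    price = float(history["SpotPriceHistory"][0]["SpotPrice"])

    # Fall back to a smaller type if the preferred one is too expensive.
    instance_type = "c5.xlarge" if price < 0.10 else "c5.large"
    print(f"spot price ${price:.4f}/hr -> launching {instance_type}")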
Think About Security Everywhere
Just because you’re in the cloud doesn’t mean security is an afterthought. AWS’s shared responsibility model should really hammer this home. The nice thing about the cloud is that many of the tools (security groups, ACLs, user and group controls, key management, auditing/traceability, etc.) are there for you, but you still have to configure them in a way that makes sense for your applications.
And as with security in any system, some first principles always apply. Take the principle of least privilege: control access with groups, users, roles, and permissions across all your resources. Be as specific as possible and leverage the types of controls available to you: user-level, group-level, resource-level, or capability (permission) level. Don’t allow users to share identities, abstract types of access and usage into distinct roles, use multi-factor authentication where possible, and avoid using the root account for day-to-day activities.
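As a sketch of least privilege in practice, here’s boto3 creating an IAM policy scoped to read-only access on a single, hypothetical S3 bucket, instead of a broad s3:* grant:

    import json

    import boto3

    iam = boto3.client("iam")

    # Only the two actions this consumer needs, on one bucket only.
    policy_document = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-reports",    # hypothetical bucket
                "arn:aws:s3:::example-reports/*",
            ],
        }],
    }

    iam.create_policy(
        PolicyName="reports-read-only",
        PolicyDocument=json.dumps(policy_document),
    )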
Protecting your data in transit and at rest is also critically important. That includes making sure data is only transmitted over HTTPS, encrypting sensitive data at rest where appropriate, and so on.
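For example, a minimal sketch that turns on default server-side encryption for a hypothetical S3 bucket with boto3, so every new object is encrypted at rest without callers having to opt in:

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_encryption(
        Bucket="example-sensitive-data",  # placeholder bucket name
        ServerSideEncryptionConfiguration={
            "Rules": [{
                "ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}
            }]
        },
    )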
Using managed services offers security benefits as well: AWS is responsible for the security configuration of its managed services, including things like securing the operating system, applying security patches and upgrades, and configuring firewalls.
Conclusion
As cloud computing becomes ever more ubiquitous and moves to higher levels of abstraction, the case for leveraging cloud services only becomes more compelling. More and more, you get to focus on what makes your business unique and invest your efforts there. I’ll keep an eye on the whitepapers site as more cloud offerings come out. By the way, Google and Microsoft both have whitepapers online that are well worth checking out too:
https://cloud.google.com/s/results?q=whitepaper&p=%2F
https://azure.microsoft.com/en-us/resources/
Have the members of your team read about best practices; everyone should have a high-level understanding of the offerings and capabilities. If you’re working on a brownfield project, explore your options for a hybrid cloud strategy. It can really help lower the risk of migrating to the cloud. You can always start with a few simple, non-mission-critical services and work your way up from there as you gain confidence.
Got some best practices to suggest or feedback? Leave a comment!
References
Architecting for the Cloud: AWS Best Practices
AWS Well-architected Framework
AWS Serverless Multi-Tier Architectures
Dynamo: Amazon’s Highly Available Key-value Store
Cost Optimization with AWS: Architecture, Tools, and Best Practices
AWS Security Best Practices