Reflecting on the AWS Outage and Cloud Disruption Risks


December 9, 2021 at 4:38 PM

The popularity of cloud computing and cloud hosting has skyrocketed over the past several years, and the trend is likely to continue. Cloud hosting began as a more efficient alternative to hosting data on-premises, and COVID-19 accelerated its adoption by making remote work, collaboration, and anytime access to data not just a possibility but the norm.

As this trend continues, the threat of service disruptions, outages, data corruption, and lost access to hosted services is also growing, and each incident is becoming more disruptive to businesses and consumers alike.

Everyone loves and has come to rely on the convenience of the internet. These days, it isn’t really a convenience; it has become a necessity in our daily lives, whether at work or at play. Daily activities can include FaceTiming with a friend or family member, watching Disney+ on the train, making a grocery or gift purchase on the Amazon app, turning on the Roomba or smart oven on the way home from work, logging into your bank or investment account, or simply switching on your video game console to play your new game. The list is endless, but what happens when many or all of the services we rely on are unavailable or unusable all at once?

When the internet is unavailable, alternatives for conducting our daily affairs may exist, but they are not always convenient or readily available. Recent outages have shown that many organizations have not implemented disaster recovery plans that account for unexpected outages. We may all be able to think of a backup way to get things done, but we, the consumers, shouldn’t need disaster recovery plans for these situations; the companies offering these services do!

On the morning of December 7, 2021, Amazon reported impacts to multiple AWS APIs in the US-EAST-1 Region. Within minutes, many widely used websites, including social media, gaming, home video security, and entertainment sites, went down. Users were unable to log in, place orders, track shipments, access online banking, watch movies, or trade stocks. Because of this outage, large parts of the internet were either partially or fully down, including many financial institutions and government agencies such as the Social Security Administration.

While outages in various data centers do happen, this particular outage is a prime example (no pun intended) of the risk many companies accept if they rely on only one AWS or other cloud-hosted data center rather than running across multiple cloud locations simultaneously.

One well-known financial institution, T. Rowe Price, was partially down for most of the outage. Although users were able to access the website, they could not log in to their personal investing or trading accounts, view balances, make trades or withdrawals, or adjust contributions. During the outage, the only error a user received when logging in was a generic, “We are unable to log in to your account at this time. Please try again later.” The site provided a contact number, but a user who called would verify their identity through a series of prompts, be put on hold, and then be informed that there were technical difficulties before the call disconnected. There was no notice on the website’s main page, nothing posted to Twitter, no statement released, and no human interaction via phone. While outages to social media, gaming, and entertainment do interfere with our daily lives and may cause financial hardship for some, the inability to access your home security system, investments, or bank account is a significant interference everyone should be prepared to handle.

When companies don’t have disaster recovery, incident response, and business continuity plans in place, there is an increased risk of business/website outages, damage to public perception and reputation, loss of client trust, financial losses, and potential litigation.

During the outage, one user on Downdetector.com posted, “If their brokers use the same TRP website, we investors may all have lost the gains from stock market today. TRP should compensate their clients accordingly.....” As of December 8, 2021, no litigation has been filed against T. Rowe Price or any other sites that had issues due to the outage, though the possibility remains that litigation may ensue. When major services go down, service organizations should be fully prepared to respond, react, and remediate. Three of the most important plans for organizations to consider are disaster recovery, incident response, and business continuity plans.

Disaster Recovery Plans:

Disaster recovery (DR) plans and policies are often the factor that determines whether an organization’s services go down briefly, with clients and stakeholders promptly advised of the status and normal service quickly restored, or go down for many hours or days, or even permanently. Disaster recovery plans are typically developed by senior management with extensive input from the information technology and information security teams. These plans should focus on redundancy options, communication, and the recovery of technology assets. Disaster recovery plans usually need to be activated when a loss of infrastructure or data has occurred or is likely to occur. Procedures for recovering purged or corrupted information from previous backups and archives, along with related material such as power restoration instructions for electrical failures, details on spinning up parallel technology and infrastructure, and high availability and automatic failover configurations for technology assets, are typically documented, updated on an annual basis, and incorporated into the disaster recovery plan so that your clients and stakeholders can feel confident that your services and offerings will be available when needed.
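To make the idea of automatic failover across cloud locations a little more concrete, here is a minimal Python sketch. It is an illustration only, not how any organization named in this article operates: the endpoint URLs, the /health path, and the function names are hypothetical, and a real deployment would more likely rely on DNS-level or load-balancer health checks than on client-side code.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence


def http_health_check(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection errors, timeouts, HTTP error statuses, and malformed
        # URLs are all treated as "this location is unhealthy".
        return False


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool] = http_health_check,
) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None if all are down."""
    for url in endpoints:
        if is_healthy(url):
            return url
    return None


# Hypothetical endpoints in two separate cloud locations (assumed names).
ENDPOINTS = [
    "https://primary.example.invalid/health",
    "https://secondary.example.invalid/health",
]

# Simulate the primary location being down by stubbing the health check.
active = pick_endpoint(ENDPOINTS, is_healthy=lambda url: "secondary" in url)
print(active)
```

When pick_endpoint returns None, every location is unavailable, which is the point at which a status notice and the communication steps in the plan would come into play.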

Incident Response Plans:

With cyber threats becoming more frequent, intelligent, and prevalent, it is the responsibility of all organizations to develop a comprehensive solution to combat them proactively. However, proactivity does not always prevent a threat from causing its intended damage. For situations where edge security and personnel awareness have failed, an incident response (IR) plan is the most relevant and effective program to activate. Like the disaster recovery plan, the incident response plan is often established by senior leadership, but it is influenced more by the information security, forensic, and cybersecurity teams. Determining the source, vector, and target of an attack on internal systems is paramount to identifying the correct course of action after an incident has been identified. Incidents can be observed and reported by anyone in the organization. However, as with disaster recovery operations, the crisis management team or incident response team must enforce incident response procedures. An incident is best described as any situation, occurrence, or anomaly that may have an adverse impact on the security or confidentiality of protected information, assets, or business processes. Incident response policies must be practiced, tested, and regularly reviewed to keep them up to date with the ever-changing threat landscape.

Business Continuity Plans:

The business continuity plan (BCP) is a key facet of the disaster recovery and incident response process, and many parts of those plans can be found or referenced within it. Because a business continuity plan is designed to provide guidance on the key components, objectives, and processes for continuing operations during a business interruption (such as a cloud hosting outage), it is frequently used as a blanket response plan for most types of events, which is not the correct course of action. A business continuity plan requires extensive analysis of business objectives and tolerances, typically through a business impact analysis (BIA). The plan is developed by key executive team members and almost always requires their consent and authorization, along with strict adherence to its procedures, to ensure the most cost-effective continuance of operations. All facets of the organization are included in a business continuity plan so that the organization’s most critical assets can be identified and categorized. Involving everyone from the smallest internal business units to partners, stakeholders, and vendors is what makes the plan so valuable when it is needed.
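As a rough illustration of how a business impact analysis might categorize critical assets, the Python sketch below ranks business functions by how long the organization can tolerate them being down. The field names, the ranking rule, and every figure are assumptions made up for this example; an actual BIA weighs many more factors and is driven by the organization's own data.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class BusinessFunction:
    name: str
    max_tolerable_downtime_hours: float  # assumed tolerance set by the business
    hourly_loss_usd: float               # assumed estimate of loss per hour down


def rank_by_criticality(functions: Iterable[BusinessFunction]) -> List[BusinessFunction]:
    """Order functions most critical first.

    Shortest tolerable downtime comes first; ties are broken by the
    larger estimated hourly loss.
    """
    return sorted(
        functions,
        key=lambda f: (f.max_tolerable_downtime_hours, -f.hourly_loss_usd),
    )


# Hypothetical inputs for illustration only.
FUNCTIONS = [
    BusinessFunction("marketing site", 24, 500),
    BusinessFunction("account login", 1, 20_000),
    BusinessFunction("trade execution", 1, 50_000),
]

for function in rank_by_criticality(FUNCTIONS):
    print(function.name)
```

A ranking like this is only a starting point for the executive team, who still decide which functions receive redundancy first and at what cost.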

If more organizations had had robust plans in place earlier this week, the outages might not have lasted as long as they did, financial losses might have been avoided, public perception might have been less damaged, and the potential for litigation might have been reduced. While there is an underlying assumption that all large corporations already have response plans in place, recent widespread, newsworthy outages are revealing that many companies’ plans are not adequate to deal with current cyber threats and vulnerabilities.

As cloud hosting gains popularity worldwide, outages caused by these services going down will likely increase as well. Does this mean onsite hosting will make a comeback? Most likely not, but many experts in the industry are starting to recommend hybrid hosting solutions, which combine on-premises and cloud services running in conjunction with each other. Whether your company is fully in the cloud or running a hybrid model, periodic cloud security assessments and other related services are crucial to understanding the level of risk your organization is willing to accept, mitigate, or transfer. Well-known cloud services such as Amazon Web Services, Microsoft Azure, and Google Cloud offer strong environments that allow users to leverage modern, well-secured hardware. However, just because their physical environment is certified as “secured” does not mean that your environment is secure or compliant. Securing what is contained in the cloud is just as important as making sure the cloud infrastructure itself is secure. Most importantly, developing a plan of action to keep unexpected downtime to a minimum can help ensure an organization is protecting its tangible and intangible assets (e.g., reputation)!