What the hack happened? A CISO perspective on the Cosmos DB vulnerability

The recent uproar over the Microsoft Azure Cosmos DB vulnerability has hit the boardroom. Many major companies use the Microsoft cloud, so Azure customers were in for a rough surprise. Ami Luttwak, Chief Technology Officer of Wiz (the company that found the vulnerability), describes it as “the worst cloud vulnerability you can imagine.”

Bloomberg reports that Microsoft warned thousands of its cloud computing customers, including some of the world’s largest companies, that intruders could have been able to read, change or even delete their main databases. In this blog, I don’t describe the incident or ‘chase the ambulance.’ Instead, I give my personal take, drawing on experience from other industries, and elaborate on what I would do if I were the Chief Information Security Officer of a global pharmaceutical company using Azure and the CEO asked me, “What the hack happened and what do we need to do?”

Everything is under control, right?

Although we don’t yet know all the facts, there are some things we can be sure of when evaluating this potential crisis. Most global organizations have some form of risk framework in place to help them decide whether sensitive information of a certain type can be stored safely in a particular cloud environment, and what controls and policies are needed. Azure Cosmos DB appears to have ticked all the boxes for thousands of organizations. Most of them probably have some form of cloud security and compliance product in place, using the native tools of the cloud vendor or products such as Palo Alto Networks’ Prisma Cloud. So, everything is under control, right?

Is this what I will tell my CEO?

That is, until your cloud vendor makes a configuration change that introduces a vulnerability in the backend of the cloud environment. Judging from the facts we have so far, this is what appears to have happened here. So, is this like a building collapsing even though all building code requirements were met and the construction company is one of the biggest in the world? Is that what I, as a CISO, will tell the CEO? Was this an accident waiting to happen, even though we agreed to use Cosmos DB only after a thorough evaluation?

The CEO will want to know what was in the database and whether the data was accessed. The Israeli cybersecurity firm that discovered the bug claims that the data from thousands of databases has been exposed for several months. Microsoft’s email to customers said it had fixed the vulnerability and that there was no evidence that the flaw had been exploited. “We have no indication that external entities outside the researcher (Wiz) had access to the primary read-write key,” says a copy of the email seen by Reuters.

I have to take Microsoft’s word for it?

In other words: thousands of customers had a narrow escape because the Israeli firm found the bug before a malicious actor used it. Microsoft says it has fixed it. As a CISO, you have to believe that, because you have no way of doing your own research and testing in the backend of Azure. And you assume that Microsoft would face a tsunami of legal battles if it tried to sweep things under the carpet and this surfaced in a future investigation.

So, I tell my CEO that we have no way of knowing whether company data was accessed other than having Microsoft’s word for it. And that we will intensify our efforts with our security partners by searching for suspicious activities on the dark web that might involve our data.

Take a step back and consider this:

I ask myself what can be learned from this. Another round of evaluations of the cloud vendors? A new assessment of the risk frameworks, the relevance ratings, and the required controls for our crown jewels? Continuous and automated compliance and vulnerability testing of our cloud properties (and would that have found this one)? A whole host of questions goes through my mind during my first assessment. But what other points of view and relevant possibilities can I consider while formulating my storyline and action plan before I go to the CEO?

Fact(ory) of life: IT glitches like misconfigurations or poorly performed changes happen all the time in large IT factories. Complex technologies and processes need to be released under severe market pressure to maintain a competitive advantage. Even Big Tech firms find it tough to stick to disciplined processes such as life cycle management and key management ceremonies.

Lessons from other industries: We have learned from other sectors that making safety part of the design, lifecycle, and assurance process raises the maturity of the entire industry. HACCP [1], IATA [2], and FDA approvals are examples of how entire industries mature over decades, and sometimes over an entire era. Large IT cloud providers could learn from military practices such as checklists, sign-offs, and hours of training and exercises before you are allowed to go into combat.

Craftsmanship: Bad engineering practices make every solution vulnerable. The IT and security industry lacks something we observe in surgery, aviation, and the military. There is nothing that compares with med-school residency (practical hands-on work as a co-assistant), bar exams, or the ‘10,000-hour rule’. I see inexperienced people with insufficient miles on the clock taking up the CISO role. This is due mainly to talent and labor scarcity, and HR finds it hard to assess the right IT skills. In practice, I still see engineers doing security on the side or as an afterthought. I don’t know of any car manufacturer installing the airbags as an afterthought!

Technical errors occur all the time. In High-Reliability Organizations (HROs) such as oil and gas companies, disregarding health and safety guidelines immediately leads to sanctions, adjustment of policies, notifications, and extensive briefings on how to improve.

The role of the CISO has changed rapidly in recent years. The CISO has gone from being purely a technician to being an orchestrator capable of managing a wide variety of third parties and collecting near-real-time data on their organizational performance (even where it is outsourced). An earlier blog describes these CISO archetypes; working out which one applies to you might help to determine what the company needs when dealing with hybrid clouds.

Known and unknown bads: Zero-day issues and other (un)known security issues will always be a fact of life. But why is it that the well-known Common Vulnerabilities and Exposures mechanisms used by security professionals don’t exist in cloud environments [3]? Of course, cloud providers have CVEs, but they’re hidden in a shady cloud. This means that the CISO’s role is different and requires anticipating unknown bads, such as this Cosmos DB issue. The CISO needs to be equipped and quickly informed when something has happened, so that a response plan can be developed before the issue explodes in the CISO’s face.

Big Tech and cloud providers invest a lot of money in improving engineering capabilities, including cybersecurity. AWS, Google, and Microsoft visited the White House recently to follow up on the Presidential Executive Order with additional investments in cybersecurity [4]. This is a good thing. Invest in security engineering training, phase out old stuff like the SMB protocol, and train the cloud teams in the right mindset for working in the IT factory of Big Tech firms. Invest in bug bounty programs and the right audit and assurance capabilities. Gear up the internal and external audit teams with new capabilities for more software-defined environments that run on algorithms. My previous blogs already looked ahead towards emerging roles in the digital domain.

Bug bounty programs: Microsoft paid Wiz a US$40,000 bounty. I personally regard bug bounty programs, large and small, as highly effective in ridding apps and systems of teething problems. After the Heartbleed and Rowhammer [5] memory issues, we now have chip manufacturers gearing up their security-by-design processes. Some of these bug bounty hunters were rewarded with US$100,000.

Multiple cloud strategy: Outsourcing your data to multiple cloud vendors is still a way of spreading the technical risk as well as the vendor, product, and platform lock-in risk. It can help to put your eggs in separate baskets. Innovative technologies like confidential computing support this data portability concept. By creating encrypted enclaves at chip and memory level, you can port and process data (for example via AI algorithms) without disclosing the content of the data. Vendors like AWS and Microsoft are far ahead in developing this technology to avoid data leakage via clouds. AMD, Intel and Big Tech are working on this together through the Confidential Computing Consortium [6].

My eight steps before I go to the CEO

So, taking these considerations into account, I develop my storyline as the CISO of a publicly listed global pharmaceutical company (so fact-finding, wording, and openness are key), and prepare my action plan and briefing to the CEO on how we can deal with such facts of (cloud) life.

1. Get threat and incident news that is weighted and last-minute up-to-date

Before you get caught out by such an issue, you need to ask yourself whether you are well informed, in real time, about news like this. I have been surprised by threats and problems in the past, and they made my life as a CISO extremely stressful. These days, I use our in-house developed, AI-based newsfeed (comparable to Reuters) that digests thousands of CVE updates, proprietary threat intel database updates, vendor updates, articles and tweets a day. These are analyzed and weighted daily to set actions for our SOC team, legal department, or other departments that need to act. This is my AWACS [7]. It helps me navigate through the sense and nonsense of cyber-incident news notifications. It helps me as a CISO, and it is also invaluable when our CIRT team is escalating a customer incident. In some cases, however, even these newsfeeds are too late. With the Kaseya hack, some companies reported that their breach had started before the news came out.
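
To make the idea of "analyzed and weighted" a bit more tangible, here is a minimal sketch of how such a feed could rank items. It is not how our newsfeed actually works: the source weights, the `NewsItem` fields, the relevance factor and the threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical trust weights per feed type (assumed values, tune to taste).
SOURCE_WEIGHT = {
    "cve": 1.0,
    "vendor": 0.9,
    "threat_intel": 0.8,
    "article": 0.5,
    "tweet": 0.2,
}


@dataclass(frozen=True)
class NewsItem:
    source: str          # one of the SOURCE_WEIGHT keys
    title: str
    severity: float      # 0.0 to 1.0, as reported or estimated
    products: frozenset  # product names mentioned in the item


def score(item, products_in_use):
    """Combine source trust, severity, and relevance to our own estate."""
    # Items that mention a product we run count fully, others only marginally.
    relevance = 1.0 if item.products & products_in_use else 0.1
    return SOURCE_WEIGHT.get(item.source, 0.1) * item.severity * relevance


def triage(items, products_in_use, threshold=0.5):
    """Return the titles that deserve action today, highest score first."""
    scored = [(score(item, products_in_use), item) for item in items]
    hits = [pair for pair in scored if pair[0] >= threshold]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [item.title for _, item in hits]
```

A vendor notice about a product you run ends up at the top of the list, while a tweet about the same product or a CVE for a product you don't use stays below the threshold.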

2. Transparently formulate a first-version high-level assessment

People might tell you this Cosmos bug is a theoretical risk that is highly unlikely to manifest itself. But you don’t know this until you’ve thoroughly investigated it. So, that’s what’s needed initially – a high-level assessment and then drill down to determine the real impact. And be very open and transparent along this path, so that it is clear to everybody what has happened. I have learned from medical and aviation mishaps that transparency first helps to understand the issue, and then to learn and improve yourself and each other.

3. Determine the classification of data involved

Of course, I need to determine what kind of data was residing in that Cosmos database and the level of data classification. Highly confidential drug research? Or less confidential information? And how much data was involved? Together, the answers determine the materiality of the event and the level at which we should “brace ourselves” for impact.
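
A first rough materiality estimate can be as simple as combining classification and volume. The sketch below is only an illustration: the classification labels, the weights, the volume buckets and the thresholds are assumptions, and every organization will have its own scale.

```python
# Hypothetical weights per classification level (assumed, not a standard).
CLASSIFICATION_WEIGHT = {
    "public": 0,
    "internal": 1,
    "confidential": 3,
    "highly_confidential": 5,
}


def volume_bucket(records):
    """Bucket the number of records: 1 (small), 2 (medium), 3 (large)."""
    if records <= 1_000:
        return 1
    if records <= 1_000_000:
        return 2
    return 3


def materiality(datasets):
    """datasets: list of (classification, record_count) tuples.

    The worst dataset drives the outcome, so we take the maximum score.
    Returns (score, level) with level in {"low", "medium", "high"}.
    """
    worst = 0
    for classification, records in datasets:
        weight = CLASSIFICATION_WEIGHT[classification]
        worst = max(worst, weight * volume_bucket(records))
    if worst >= 10:
        level = "high"
    elif worst >= 3:
        level = "medium"
    else:
        level = "low"
    return worst, level
```

For example, `materiality([("internal", 500), ("highly_confidential", 20_000)])` is driven entirely by the research dataset and lands in the "high" band, which is the signal to brace for impact.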

4. Check basic release hygiene: what did we update/activate

Did we really activate the feature (in this case the Jupyter Notebook feature of Azure Cosmos DB) in the first days of its release, and were we notified that this new feature was being released?

5. Double check: are we sure we don’t use this product (shadow IT)?

One answer during this triage might be “we don’t use Cosmos” or “we don’t use Azure”. But you and I won’t be the first CISOs to have been surprised by rogue assets or other types of shadow IT that could damage the business. You might not run it officially, and it might not be on the asset list, but I know from experience that you can double the official number of assets once you count rogue assets that use services you have never seen. One way to tackle this, and I will do that immediately (never waste a good incident), is to do a full sweep of the entire corporate cloud consumption for rogue, violating, and harmful apps.
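
At its core, such a sweep is a comparison between what the asset list says and what is actually found. A minimal sketch, assuming the discovered resources have already been exported as simple records with an `id` and a `service` field (both field names are hypothetical, not a vendor schema):

```python
def find_rogue_assets(official_inventory, discovered):
    """Group discovered resources that are missing from the official asset list.

    official_inventory: iterable of asset ids from the asset register.
    discovered: list of dicts with hypothetical keys 'id' and 'service'.
    Returns {service: [ids]} for everything not in the register.
    """
    # Compare case-insensitively, since ids are often typed in by hand.
    known = {asset_id.lower() for asset_id in official_inventory}
    rogue_by_service = {}
    for resource in discovered:
        if resource["id"].lower() not in known:
            rogue_by_service.setdefault(resource["service"], []).append(resource["id"])
    return rogue_by_service
```

If the result contains a database service that nobody claims to use, the answer "we don't use Cosmos" needs a second look.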

6. Fox and the henhouse: do your own research to check vendor promises

As in the case of any event, don’t take Microsoft’s response for granted but do your own investigation and take additional action where needed. Apart from regenerating the keys as Microsoft suggests, I recommend reviewing all past activities in the Azure Cosmos DB account.
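
Reviewing past activities can start with a simple filter over exported log records: which key-related operations happened during the exposure window, and which of them came from outside our own networks? The sketch below assumes the records have been exported to plain dictionaries; the field names (`time`, `operation`, `caller_ip`) and the operation labels are placeholders, not the actual Azure log schema, and the exposure window is something you have to establish yourself.

```python
import ipaddress
from datetime import datetime

# Placeholder labels for key-related operations (assumed, not Azure's names).
SENSITIVE_OPS = {"list_keys", "regenerate_key"}


def suspicious_events(events, trusted_networks, window_start, window_end):
    """Return key-related events in the window that came from untrusted IPs.

    events: list of dicts with hypothetical keys 'time' (ISO 8601 string),
            'operation' and 'caller_ip'.
    trusted_networks: CIDR strings such as "10.0.0.0/8".
    """
    networks = [ipaddress.ip_network(cidr) for cidr in trusted_networks]
    flagged = []
    for event in events:
        when = datetime.fromisoformat(event["time"])
        if not (window_start <= when <= window_end):
            continue  # outside the exposure window
        if event["operation"] not in SENSITIVE_OPS:
            continue  # not a key-related operation
        caller = ipaddress.ip_address(event["caller_ip"])
        if not any(caller in network for network in networks):
            flagged.append(event)
    return flagged
```

An empty result is not proof that nothing happened, since the logs may not cover everything, but every flagged event is a concrete question to put to the vendor.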

7. Check: was there a feasible technological alternative that would have stopped this (e.g. confidential computing)?

Equally important, I would investigate confidential computing to see if it might be a solution. The good thing is that Big Tech embraces this initiative and is willing to develop labs and proofs of concept with industries, including healthcare and pharma. So, the timing might be spot on. Just like every security measure, this might only be a temporary solution that will be vulnerable in the future (think about Meltdown and what Mimikatz can do).

8. Develop a storyline/visual that easily explains the technical detail for board level execs

Finally, I will develop an easy-to-understand storyline. From experience, I know that many senior managers like CEOs are visually oriented: “A picture tells a thousand words.” One way to prepare a non-technical storyline is to use visualization, since that helps stakeholders to better understand the root cause and the impact. This GIF visualizes the attack as a burglar breaking into your house using the primary key.

I use the Golden Circle to take boards along on the Why, What, and How of the company’s digital journey, and to decide whether they need regular training or just need to be kept informed. My experience with boards is that they support the CISO rather than point at the CISO.

Maybe it’s a wake-up call that there is a limit to the current approaches in cloud security posture management. There is one thing you know for sure. There is no easy answer. Automobiles brought environmental challenges and traffic casualties. The aviation industry evolved by learning from accidents. The best we can do as an industry is to put pressure on these cloud providers to be more transparent about their internal capabilities and practices. And I believe that we ourselves need to take responsibility for providing non-tech people with a clear picture of what has happened during a security breach or similar event.

By: Yuri Bobber & Mark Butterhoff

P.S. A possible conspiracy theory: the person who found the vulnerability was formerly CTO of Microsoft’s cloud security group. Was this issue really just found, or was this a long-known vulnerability used for his own company’s marketing? 🙂

¹ Hazard analysis and critical control points, or HACCP, is a systematic preventive approach to food safety from biological, chemical, and physical hazards in production processes that can cause the finished product to be unsafe and designs measures to reduce these risks to a safe level.

² The International Air Transport Association (IATA) represents 290 airlines or 82% of total air traffic. IATA supports many areas of aviation activity and helps to formulate industry policy on critical aviation issues.

³ The Common Vulnerabilities and Exposures (CVE) system provides a reference-method for publicly known information-security vulnerabilities and exposures. … The Security Content Automation Protocol uses CVE, and CVE IDs are listed on Mitre’s system as well as in the US National Vulnerability Database.

⁴ LinkedIn message by Satya Nadella on his visit to the White House https://www.linkedin.com/feed/update/urn:li:activity:6836412665846472704/

⁵ Row hammer is a security exploit that takes advantage of an unintended and undesirable side effect in dynamic random-access memory (DRAM) in which memory cells interact electrically between each other by leaking their charges, and possibly changing the contents of nearby memory rows that were not addressed in the original memory access.

⁶ The Confidential Computing Consortium (CCC) brings together hardware vendors, cloud providers, and software developers to accelerate the adoption of Trusted Execution Environment (TEE) technologies and standards.