When the Agent Breaks the Environment: What the AWS Outages Tell Us About AI Identity and Operational Risk
- Ryan Rowcliffe

Amazon just handed the industry a case study it didn't ask for. Most organizations are reading the wrong part of it.
By now you've probably seen the reporting: at least two recent AWS outages were traced back to Amazon's own internal AI coding tools, as first reported by the Financial Times and covered by TechRadar. A 13-hour interruption in December 2025 was linked to Kiro, Amazon's agentic AI IDE, which reportedly deleted and recreated an environment without sufficient human oversight. A separate 15-hour outage in October 2025 hit harder, taking down public-facing apps and services across the board. Amazon's official response was carefully worded: "user error, not AI error." The root cause in both cases? Misconfigured access controls and permissions granted to AI agents at the same level as human workers, but without the approval gates a human change would require.
Call it whatever you want. The identity problem is what broke things.
The AI and NHI Identity Access Problem Nobody Is Ready to Admit
When Amazon says these outages were caused by misconfigured access controls, what they're actually describing is an identity governance failure playing out at machine speed. The AI agent had permissions. It used them. It did exactly what it was permitted to do, and the environment came down as a result.
This is where most post-mortems go sideways. The conversation immediately pivots to AI guardrails, approval workflows, and change management process improvements. Those conversations matter, but they skip something more fundamental: in both incidents, nobody had real-time visibility into what those non-human identities were actually doing with their access until the damage was already done.
Think about what incident responders had to work through once those outages started. They had to figure out which service accounts, API keys, or agent credentials were involved in the chain of events. They had to trace the blast radius across machine-to-machine connections to understand what touched what, and when. In a system as complex as AWS's internal infrastructure, that discovery process is not trivial. It's time-consuming, manually intensive, and often relies on logs that were never designed to answer the question: "Show me every NHI involved in this workflow and what it did in the last four hours."
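To make that question concrete, here is a minimal sketch of the kind of query responders need answered. The event records and identity names are hypothetical and deliberately simplified, not an exact match for any vendor's log schema; the point is that "every NHI active in this window, and what it touched" should be one cheap lookup, not hours of manual log archaeology.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Hypothetical, simplified audit events; field names are illustrative.
events = [
    {"time": "2025-12-04T09:12:00Z", "identity": "svc-agent-ide",
     "service": "cloudformation", "action": "DeleteStack"},
    {"time": "2025-12-04T09:13:30Z", "identity": "svc-agent-ide",
     "service": "cloudformation", "action": "CreateStack"},
    {"time": "2025-12-04T05:00:00Z", "identity": "svc-billing",
     "service": "s3", "action": "GetObject"},
]

def nhis_in_window(events, end, hours=4):
    """Which non-human identities acted in the last N hours,
    and what (service, action) pairs did each one touch?"""
    start = end - timedelta(hours=hours)
    touched = defaultdict(set)
    for e in events:
        t = datetime.fromisoformat(e["time"].replace("Z", "+00:00"))
        if start <= t <= end:
            touched[e["identity"]].add((e["service"], e["action"]))
    return dict(touched)

end = datetime(2025, 12, 4, 10, 0, tzinfo=timezone.utc)
print(nhis_in_window(events, end))
```

Run against a four-hour window ending at 10:00, only the agent credential surfaces; the billing account's 05:00 activity falls outside the window and stays out of scope.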
Amazon's engineers are among the best in the world, and even they spent 13 hours on one incident and 15 hours on another. The question worth asking isn't just what caused the outage. It's how much of that recovery time was consumed by identity discovery rather than actual remediation.
The Scale Problem Is Only Going to Get Worse
Here's what makes these incidents a leading indicator rather than isolated events. The wave of agentic coding tools, from Kiro to Cursor to GitHub Copilot Workspace and dozens of others, is accelerating software output at a pace that traditional identity governance was never built to absorb. Every new application shipped creates new service accounts, API connections, and machine-to-machine trust relationships. Every agentic workflow deployed introduces non-human identities operating with credentials, permissions, and behavioral patterns that most organizations cannot map, let alone monitor.
According to Entro Labs' NHI and Secrets Risk Report for H1 2025, non-human identities already outnumber human users by 144 to 1, up from 92 to 1 just a year prior. That's a 56% jump in the ratio in a single year, and it happened before agentic coding tools hit mainstream enterprise adoption. The move from 92:1 to 144:1 is not a rounding error. It's a structural shift in what enterprise identity actually looks like, and it's happening faster than most governance programs can track.
That number is about to move again. As organizations race toward the kind of AI adoption rates Amazon is reportedly targeting, the application sprawl that follows will create an identity surface area that dwarfs anything we've seen before. Service meshes, microservices, agentic pipelines, API integrations, cloud-native workloads: all of it communicates through NHIs. Each one is a credential. Each one is a potential point of misconfiguration. Each one can cause an outage, exfiltrate data, or traverse your environment in ways your traditional IGA platform will never catch.
The same Entro research found that 8.7% of NHIs are overprivileged and idle, and that over 5.5% of AWS machine identities hold full administrator privileges, often by default rather than by design. These aren't edge cases. They're the normal state of enterprise NHI management today, and they exist at 144 times the scale of your human identity population.
Traditional identity governance platforms were built around human identity lifecycle management. They were not designed to handle the volume, velocity, or behavioral complexity of NHIs operating at machine speed. They cannot tell you in real time that a service account just modified a permission boundary it has never touched before, or that an AI agent is making API calls to a service it has never previously accessed. That visibility gap is where the next wave of operational failures will live.
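The "never touched before" check itself is conceptually simple, which is what makes the gap so striking. Here is a toy sketch under loud assumptions: the baseline is just the set of (service, action) pairs an identity has historically exercised, and the identity names are invented. Real platforms use far richer behavioral models; set membership is the simplest possible version of the idea.

```python
# Hypothetical per-identity baseline: the (service, action) pairs
# each NHI has historically been observed exercising.
baseline = {
    "svc-deploy-bot": {("s3", "PutObject"),
                       ("lambda", "UpdateFunctionCode")},
}

def is_anomalous(identity, service, action, baseline):
    """Flag any call this identity has never made before.
    First-time actions are exactly what governance tools miss."""
    known = baseline.get(identity, set())
    return (service, action) not in known

# A deploy bot suddenly writing IAM policy is a first-time action
# worth surfacing in real time.
print(is_anomalous("svc-deploy-bot", "iam", "PutRolePolicy", baseline))  # True
print(is_anomalous("svc-deploy-bot", "s3", "PutObject", baseline))       # False
```

The hard part is not the check; it is maintaining an accurate, continuously updated baseline across an NHI population 144 times the size of your workforce.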
What Identity Observability Changes for AI and Workload Protection
This is precisely the problem that identity observability was built to address. The concept is straightforward: rather than relying solely on governance controls configured at some point in the past, you maintain continuous visibility into what every identity is actually doing, in real time, across your entire environment.
For the AWS scenarios specifically, consider what that visibility layer could have provided. An identity observability platform maintains a behavioral baseline for every service account and agent credential in scope. When Kiro began executing the sequence of actions that led to the environment deletion, anomaly detection against that baseline could have surfaced the deviation before the change was complete. At minimum, it would have immediately answered the "what happened and who did it" question the moment the outage started. The NHI graph shows exactly which credentials were involved, what services they touched, what the dependency chain looked like, and where the misconfiguration lived. Discovery time compresses from hours to minutes. Remediation can begin from a position of knowledge rather than uncertainty.
This isn't theoretical. It's the operational advantage that identity observability delivers when implemented as a continuous monitoring layer across both human and non-human identities. At AuthMind, we built our platform specifically to close this visibility gap, particularly for the NHI population that traditional tools leave dark. When an AI agent takes an action that cascades across a complex service environment, you should not be spending the first half of your incident response window figuring out which identities were involved. You should already know.
The practical path forward for most organizations comes down to three priorities. First, build a complete inventory of your NHIs, including service accounts, API keys, agent credentials, and OAuth tokens. Most organizations find two to three times more NHIs than they expected once they actually look. Second, baseline normal behavior for those identities. What services do they typically call? What permissions do they typically exercise? What does a normal day look like for each identity? Third, implement continuous monitoring that flags deviations from that baseline in real time, with enough context to understand the scope and potential impact of any anomalous activity before it becomes a 13-hour incident.
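Priority one is harder than it sounds because NHIs surface in multiple systems under inconsistent records. A small sketch of the de-duplication step, with made-up source names and fields: merge every discovery source into one catalog that records where each identity was found.

```python
# Hypothetical discovery sources; names and record shapes are invented.
sources = {
    "cloud_iam":  [{"id": "svc-deploy-bot", "type": "service_account"}],
    "ci_secrets": [{"id": "gh-actions-token", "type": "api_key"},
                   {"id": "svc-deploy-bot", "type": "service_account"}],
    "vault":      [{"id": "agent-ide-cred", "type": "agent_credential"}],
}

def build_inventory(sources):
    """Priority one: a single de-duplicated NHI catalog, noting
    every system each identity was discovered in."""
    inventory = {}
    for source, items in sources.items():
        for item in items:
            entry = inventory.setdefault(
                item["id"], {"type": item["type"], "found_in": set()})
            entry["found_in"].add(source)
    return inventory

inv = build_inventory(sources)
print(len(inv))  # distinct NHIs across all sources
```

Even in this toy, the same service account appears in two sources; at enterprise scale, that overlap (and the two-to-three-times undercount the article describes) is why inventory has to come before baselining and monitoring.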
A Reminder That Applies to Everyone: Identity Protection Is Fundamental
Amazon isn't a cautionary tale about a company that moved carelessly. It's a cautionary tale about what happens to any organization, regardless of sophistication, when identity visibility fails to keep pace with the rate of change in the environment. Their engineers "let an agent resolve an issue without intervention," as one source told the FT. That sentence describes a pattern playing out in organizations everywhere, often without the transparency that a public post-mortem eventually forces.
As agentic coding tools proliferate and the software they produce populates enterprise environments with new identities and new connectivity, every organization will eventually face a version of this scenario: an outage that results from an AI agent acting on permissions it should not have had, or should not have been able to exercise without approval, or whose impact nobody could trace quickly enough to contain. The question isn't whether it will happen. The question is whether you'll find out in minutes or in hours.
Identity is no longer just a security concern. It's a core operational dependency. The AWS incidents make that impossible to argue against. Whether you're a CISO thinking about your NHI attack surface, a platform engineer building the infrastructure that agentic tools will run on, or a business leader trying to understand why a 13-hour outage happened on your watch, the answer will increasingly trace back to identity. Build the visibility layer now, before the agent decides to recreate the environment.