What is the principle of least privilege and why it's such a hard thing to achieve
Why systems tend to get less secure over time
What is least privilege
In an access control scheme, policies can allow and deny operations. It's a generic concept, but in the case of AWS this control is implemented with IAM policies: you can attach a policy to an admin user that allows creating new S3 buckets, and attach a policy to another user that prevents them from accessing a DynamoDB table. The less the principals in an account are allowed to do, the more secure the account is. "Least privilege" is a configuration of permissions where everybody is allowed to do what is required to do their job, but nothing else.
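As a minimal sketch of what this looks like in IAM's JSON policy syntax (the table name is a placeholder, and in practice the two statements would live in separate policies attached to different users):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowBucketCreation",
      "Effect": "Allow",
      "Action": "s3:CreateBucket",
      "Resource": "*"
    },
    {
      "Sid": "DenyTableAccess",
      "Effect": "Deny",
      "Action": "dynamodb:*",
      "Resource": "arn:aws:dynamodb:*:*:table/example-table"
    }
  ]
}
```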
Let's say a user needs access to an S3 bucket. You can solve this in multiple ways, such as attaching the AdministratorAccess jolly joker policy that grants access to everything inside the account, which includes reading objects from the S3 bucket. This definitely ticks the box of "give access", but it feels a bit extreme.
To make things stricter, you can grant access only to the S3 service, which is a lot more restricted than the administrator policy. Going further, you can grant access to a single bucket, and only for read operations. But there are still things to chip off from this permission set. For example, you can restrict the IP address the request must come from.
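A sketch of this narrowed-down policy, assuming a hypothetical example-bucket and an office IP range of 203.0.113.0/24:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyFromOfficeIp",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*"
      ],
      "Condition": {
        "IpAddress": {"aws:SourceIp": "203.0.113.0/24"}
      }
    }
  ]
}
```

Each step, from AdministratorAccess through service-wide access down to this single-bucket, read-only, IP-restricted statement, removes options an attacker could abuse without removing anything the user needs.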
You might also want to add time-based constraints, such as limiting access to office hours. But AWS's policy language has no condition for recurring schedules, so you can't configure this. What least privilege means in practice depends on the service you are using.
It's easy to see why giving less access is a good idea. In the hands of a competent administrator, new users and services start with a small set of permissions.
Why systems tend to get less secure
But over time, permissions tend to accumulate and no longer reflect what is strictly necessary.
When somebody gets fewer permissions than the least privilege point defined above, they cannot do their job. A webserver might not be able to connect to the database, or a back-office worker is unable to access some documents. This is an operational problem.
On the other hand, if the permissions allow more than strictly necessary, it's a security problem. If there is a breach, this leniency gives the hacker more options. Maybe a system can access other databases in addition to the ones it uses, which makes a limited attack surface wider.
These two types of problems behave differently in two important ways: what is the probability they occur, and how easy it is to attribute the problem to a person. These differences determine how permissions naturally evolve over time.
Probability of failure
There is an asymmetry in how likely an operational problem is to surface compared to a security problem. When a policy is too strict and it prevents systems or users from doing their work, it is immediately apparent. Visitors to the website get error responses, or someone complains to their boss. There is a 100% chance this happens shortly after the permissions are modified.
When a policy is too lenient, it enables or exacerbates a potential future security breach, one that might never come or might happen far in the future. Think about all the companies that go for years with virtually no cybersecurity incident and only get breached when they get into the spotlight. In one such case, a site owner opened a ticket in the Firefox issue tracker complaining that the browser warned users that the login form was served via HTTP; an SQL injection vulnerability then allowed deleting the database behind the webserver. The site had worked like this for a long time, but hours after this message appeared, it was breached:
We have our own security system, and it has never been breached in more than 15 years.
The longer something goes without breaking, the more secure it seems, and this happens even in fields where safety is a lot more important. For example, a memo from the space shuttle program observes this problem:
The argument that the same risk was flown before without failure is often accepted as an argument for the safety of accepting it again.
Security problems appear later and with a low probability, and we humans are particularly bad at assessing this. It is the "turkey problem" Nassim Taleb wrote about in The Black Swan:
"Consider a turkey that is fed every day. Every single feeding will firm up the bird's belief that it is the general rule of life to be fed every day by friendly members of the human race 'looking out for its best interests,' as a politician would say.
"On the afternoon of the Wednesday before Thanksgiving, something unexpected will happen to the turkey. It will incur a revision of belief."
The process of devaluing low-probability incidents also leads to "normalization of deviance", where in the absence of negative outcomes the unacceptable becomes acceptable. In 2014, an airplane overran the runway and crashed. During the investigation, it became clear that the pilot disregarded the pre-flight safety checks, the ones that should be done before every takeoff. Skipping them had become the norm because it never resulted in any consequences, until it did.
The same thing applies to policies and IT security. A permission that allows too much access might never lead to a security incident. Day after day, people grow more accustomed to lenient security controls and at the same time more confident that their systems are secure.
Accountability
The second factor is accountability. When an employee complains to their boss that they are no longer allowed to access a system and it hinders their work, a short investigation points to the IT admin who revoked that permission. But when a security incident happens years down the road, there is no single person to blame. Responsibility is smeared across the whole company.
Security is in the hands of the people who define access control on a daily basis. When they get the blame for accidentally removing essential permissions, they will err on the side of their own job security and leave excessive access alone. After some time, this becomes the norm, and people who think otherwise will be in the minority.
When the majority accepts leniency as the norm, it turns into peer pressure against anyone who thinks otherwise. In the extreme case, whenever someone raises doubts they get ridiculed with "our security is amazingly good". With no negative effects in sight, stifling security-related concerns becomes business as usual, and everybody who thinks otherwise is labeled overly apprehensive.
An unsolved problem
Unfortunately, money apparently does not solve this problem. The Twitch incident, where they allowed public writes to an S3 bucket, is a prime example that even the big players are not immune. And as Twitch is a tech company, they can't even claim unfamiliarity with the technology.
In another case, healthcare data got exposed in a similar way. In this case, strict regulations and expensive audits that are needed to store sensitive data were not enough to stop this leak.
Both of these cases happened because of exposed S3 buckets, and there are many other examples, as this happens regularly (just Google "S3 bucket negligence award"). But how can this happen? S3 buckets are private by default. Somebody had to explicitly grant public read permissions (and, more recently, also disable the public access block AWS put in place).
Worse still, there are only two ways to make a bucket public: either by attaching a bucket policy or by setting the public-read ACL on the objects. Either of these is easily detectable, and most AWS security tools flag these misconfigurations. Public S3 buckets are both hard to configure and easy to detect, yet there is a constant stream of new breaches.
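For reference, the bucket-policy variant looks roughly like this (the bucket name is a placeholder); the wildcard principal is exactly what security scanners look for:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-bucket/*"
    }
  ]
}
```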
Some solutions
What are the possible solutions? While there is no "do this and you'll be fine" advice, adopting certain practices helps with security.
Striving for simplicity is one of the most effective ones. Instead of giving permissions to individual users, using Role-Based Access Control or Attribute-Based Access Control greatly reduces the number of policies in a system. This makes it easier to reason about what each user can do and to spot the outlying cases that need further investigation.
Apart from consolidating permissions, you can also impose hard boundaries and thus reduce the number of connections between principals and resources. In AWS, if you move production resources to a separate account along with the people who need access to them, you cut the number of fine-grained permissions to a fraction. If you have 8 developers and 2 people responsible for operations, you can reduce the number of users in the production account from 10 to 2.
Another good practice is to strive for observability. When policies are scattered around an account, and possibly across multiple accounts, it's hard to get a "big picture" overview. As a result, everybody will have tunnel vision, focusing on small parts of the access control. An intuitive dashboard that shows who can do what is a great way to let anybody spot errors.
Then automation helps enforce established security constraints. You can add tools that inspect resources and spot problems, such as Config in AWS, and linters that prevent invalid resources from being deployed, such as tfsec. Automation works like tests in programming: it codifies what would otherwise be an informal decision. You decide and code things once, and they are enforced automatically.
And finally, conducting reviews, preferably by people who are not influenced by the "things have always been like this" mentality, is a more costly but effective way of keeping systems secure. Red team vs blue team wargame exercises are commonplace in cybersecurity, and you can also order third-party penetration testing. Just make sure that these are not one-time exercises but are repeated periodically.