Apr 9, 2025

Securing Production: A Deep Dive into AWS IAM Best Practices with Rowan Udell

TLDR;

Visibility of IAM principals, permissions, and activities is key to defining the right IAM policies for the organization. CloudTrail and AWS Access Analyzer help tremendously with this.
Permissions management is a continuous process. Assign permissions based on need, monitor them, analyze the usage, and optimize based on that usage. Leveraging some of AWS's out-of-the-box IAM capabilities, like AWS Identity Center, SCPs, RCPs, Access Analyzer, Data Perimeter, and Session Policies, helps in this regard.
Implementation of different IAM best practices like Least Privilege, JIT, etc. depends on the organization's maturity level. These are advanced-level IAM security implementations.

Transcript

Host: Hi, everyone. This is Purusottam, and thanks for tuning into ScaleToZero podcast. Today's episode is with Rowan Udell.

Rowan is an AWS specialist with over 10 years of hands-on experience who helps customers get clarity in the complex cloud security landscape we are in. He's the author of the Practical AWS IAM Guide, and he's also an AWS Community Builder.

He helps organizations implement effective security practices without sacrificing developer productivity. His practical approach to serverless architectures and identity management has made him a respected voice in the AWS community.

Thank you so much, Rowan, for joining me today on the podcast.

Rowan: Thanks, Puru. It's nice to finally get here. Took a little bit of organizing, but we did it.

Host: Yeah. Anything you want to add about your journey? Like, I read just four bullets, but do you want to add anything? How did you get into it? Anything that keeps you motivated even today to work in the security landscape?

Rowan: Yeah. Yeah, look, it's interesting. I think it's one of those things where I started in cloud, as you mentioned, over a decade ago, which is both good and bad. I think it gave me the benefit of learning the cloud as the cloud kind of developed itself. You know, these days I can only imagine what it's like for someone starting, you know, brand new to technology or to software development in the cloud, because there's just so much more to learn.

So I had the benefit of kind of like a nice smooth learning curve, which I think really helped me. And, you know, I think for myself, it's been interesting to see. In the early days of cloud, we spent a lot of time talking about, should we use cloud, should we use public cloud, there's this whole public and private cloud thing going on. These days, thankfully, I don't have to have that conversation anymore. Everyone's using all the clouds. And now it's just a matter of how you do it and also how you do it in a way that's secure. Because obviously, over the years, there have been plenty of examples of how not to do things securely. And unfortunately, I don't think that's going to stop anytime soon.

So it keeps things interesting and keeps things kind of improving from a security perspective. You know, it's good work to do, and hopefully you do it and you make things better than when you first arrived.

Host: Mm-hmm. Yeah, absolutely. Nowadays, cloud has become the de facto choice, right? Nobody even thinks about building their own servers unless there are very specific use cases that you are trying to solve for.

Rowan: Exactly. Exactly. It's not to say everything has to be in the cloud, but everyone kind of recognizes that the cloud is really good at some stuff, you know, especially those on demand workloads or scaling to, you know, ridiculous scale, if you're training models and things like that. So it's, it's nice, like I said, to not have to have that conversation.

We used to have to actually really say, you know, it's public cloud, but it doesn't mean everything is public, you know, and we have to spend a lot more time kind of having these very basic conversations. And like I said, we've moved beyond those now. So we have more interesting conversations like this one.

Host: So now speaking of interesting conversations, we generally ask this question to all of our guests and we get unique answers based on the geography, the persona, the role that you are in. So the question is, what does a day in your life look like?

Rowan: Yeah. So at the moment, I'm an independent consultant. I used to work for a consulting company, for an AWS partner, for many years, but now I do my own consulting. I work with a number of clients at the moment, probably more on the startup side of things, and just really helping them use AWS in a secure way. You know, some of them are going for their compliance certifications, things like SOC 2, PCI, that kind of thing. So I help a lot in that space.

And look, you know, for me, I just enjoy using the AWS services. That's not to say they're all perfect and they don't have any flaws or anything like that, but I do still enjoy using them, particularly in the serverless space. You know, I used to have to patch servers and, you know, splice cables and things like that. And I don't have to do that anymore. And I don't miss that kind of work. So, yeah, that's day to day. You know, these days I'm pretty hands-on, which is fun, and yeah, like I said, I'm just helping out startups, at least this month.

Host: Yeah, hands-on is always fun. So one of the things that you touched on, one of the keywords that you used is public cloud versus everything being public. It's not the same, right? Even though you say that you are hosting your services in public cloud, that doesn't mean everything is public.

So one of the key aspects of that is identity and access management. And today's topic is also focused on that, right? So let's dive into it. So one of our past guests, Chad Lawrence, he works for AWS, said in our podcast that humans do not need access to production cloud accounts. And this should be one of the golden rules for all the organizations. But in practice, what we have seen is many organizations even today have human accounts or access keys in production environments.

Even AWS has stopped recommending that you create access keys; you should use roles and things like that instead. What have you seen? Why do you think organizations are not able to follow it or achieve this?

Rowan: Yeah, look, I mean, you can't argue with the statement. I think he's right to say those things, but I think there is a bit of nuance in there where, I mean, it's a pretty advanced state to be in to say, hey, no humans touch production. So for a lot of organizations, it is an aspirational goal. They just have bigger problems than someone having access to production, probably more business problems, like, we can't release our software in a timely manner, or, data is public that shouldn't be public. Like these are bigger problems. These are immediate problems that kind of take precedence over that kind of best practice of, hey, humans shouldn't be touching production.

So I definitely agree that everyone should be aiming for that. But I think you got to realize that in practice, you don't just go from zero to one. In this case, you don't just go, you don't just wake up one day and turn off production access to all humans because it probably won't have a good outcome.

So I think organizations need to focus on, okay, well, what are the steps that we can take to get closer to this point? And so the big one for me, and I've done this with a lot of customers and I will continue to do it with a lot of customers, I'm sure, is firstly getting visibility into, you know, why are humans accessing production? You know, what is the need that they're addressing? So this is a great example where, you know, AWS has recently, or relatively recently, released AWS Identity Center.

And it gives you a way to centralize your identities across your entire organization. And this is, I think, a really important first step. There's really not any reason not to do something like this, where you say, okay, look, I'm not going to stop you accessing production just yet. I want to eventually, but first of all, I'm going to centralize your access. So at least then I have that central point, you know, in my CloudTrail logs where I can say, hey, I can see that this person did this thing at this time in production, or anywhere else for that matter.

And then you can ask why. Why did they do this? Why did this person take this action at this time? And the answer might be, hey, our software requires manual changes in order to deploy it. Like, okay, well, that's a problem you're going to have to solve first. You know, maybe you need infrastructure as code, maybe you need more automation.

The solutions, you know, will depend upon the actual answer to that why question. And once you have that, then you have a chance of getting to that state of no human access to production, but that's probably going to be a long road, you know, and probably the older the organization, the longer that journey is going to be. If you're a brand new company that's only been around for a year or two, yeah, probably you can get there, you know, in a couple of weeks or months, that kind of thing, if you're really kind of cloud native.

So yeah, I agree with the sentiment, but, you know, in practice you need to build that muscle and work your way up to that point.
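Editor's sketch: the "who did what in production, and why" question Rowan describes can be approximated by grouping CloudTrail records by the human session that made them. This is a minimal illustration, not Rowan's own tooling. It assumes the records have already been loaded as dictionaries with the standard `userIdentity.arn` and `eventName` fields, and that human sessions come through Identity Center roles, whose names start with `AWSReservedSSO_`. The function name and the sample events are hypothetical.

```python
from collections import Counter


def summarize_human_activity(events, human_role_marker="AWSReservedSSO_"):
    """Count (session name, event name) pairs for human-driven CloudTrail events.

    `events` is an iterable of CloudTrail records already parsed into dicts.
    Sessions are treated as "human" when the caller ARN contains the role
    name prefix that Identity Center uses for its provisioned roles.
    """
    counts = Counter()
    for event in events:
        arn = event.get("userIdentity", {}).get("arn", "")
        if human_role_marker not in arn:
            continue  # skip pipelines, service roles, and other automation
        # For assumed-role sessions the last ARN segment is the session name,
        # which Identity Center typically sets to the user's sign-in name.
        who = arn.rsplit("/", 1)[-1]
        counts[(who, event.get("eventName", "unknown"))] += 1
    return counts


# Hypothetical sample records for illustration only.
sample_events = [
    {
        "eventName": "DeleteDBInstance",
        "userIdentity": {
            "arn": "arn:aws:sts::111122223333:assumed-role/"
                   "AWSReservedSSO_AdministratorAccess_abc123/rowan@example.com"
        },
    },
    {
        "eventName": "UpdateFunctionCode",
        "userIdentity": {
            "arn": "arn:aws:sts::111122223333:assumed-role/deploy-pipeline/build-42"
        },
    },
]

for (who, action), n in summarize_human_activity(sample_events).items():
    print(f"{who} called {action} {n} time(s)")
```

Each row in the summary is a prompt for the "why" conversation: a manual deployment step points to missing automation, while an ad hoc report may point to a missing read-only role.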

Host: Yeah, I think one of the things that you touched on is business comes first, right? Just because you want to follow best practices, you cannot just get rid of all the production access and then you would have very difficult conversation with engineering and DevOps and other parts of the organization, right?

Rowan: Definitely. Totally. I mean, think about it. If you're in AWS, you can literally go there. I could go to any customer and turn off their human access to production in about 10 minutes. And that's not even rushing. That's doing it slow. But if that means they can't deploy any changes, if they can't fix any bugs, if they can't produce any reports, it's going to be a bad time for them. And they're like, hey, we can't do business. We can't actually do the job that we're here to do. I'm like, yeah, but you're following best practice.

Host: Mm-hmm. Yeah, true. And one of the things that you touched on is Identity Center. I think AWS also has recently rolled out Access Analyzer, where you can see what type of access exists and who has been accessing what and things like that. So that also gives you some additional ammunition to take that call.

Rowan: It's not really valid. Yeah, I mean, it comes back to that idea of visibility that I mentioned. That's the key thing that you need to start with, kind of understanding, because then you can improve. You know, you can't manage what you can't measure. It's kind of like that same concept where if we know what we're dealing with here, then we can improve things. And like I said, chances are that's going to be a gradual thing. Hopefully there's some value that can be delivered before you get to that final endpoint of no human access, you know.

So the first step is often going to be something like, hey, let's get rid of long-term access keys and let's move to short-term access keys so that we can at least know that if anything goes wrong, well, there's only a small window in which it's an issue. And again, Identity Center does this very nicely, which is why AWS is pushing it so hard and quite rightly.
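Editor's sketch: a first pass at "get rid of long-term access keys" is simply finding them. The function below is an illustration under stated assumptions, not an AWS-provided tool. It expects key metadata shaped like the `AccessKeyMetadata` entries returned by IAM's `ListAccessKeys` API (`UserName`, `AccessKeyId`, `Status`, `CreateDate`), and the 90-day threshold is an arbitrary choice for the example.

```python
from datetime import datetime, timedelta, timezone


def find_stale_access_keys(key_metadata, max_age_days=90, now=None):
    """Return (user, key id) pairs for active keys older than `max_age_days`.

    `key_metadata` is a list of dicts with the same field names as the
    AccessKeyMetadata entries from IAM ListAccessKeys. `CreateDate` must be
    a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        (key["UserName"], key["AccessKeyId"])
        for key in key_metadata
        if key["Status"] == "Active" and key["CreateDate"] < cutoff
    ]


# Hypothetical metadata for illustration only.
sample_keys = [
    {
        "UserName": "ci-legacy",
        "AccessKeyId": "AKIAEXAMPLEOLDKEY000",
        "Status": "Active",
        "CreateDate": datetime(2022, 1, 15, tzinfo=timezone.utc),
    },
    {
        "UserName": "ci-legacy",
        "AccessKeyId": "AKIAEXAMPLEOFFKEY000",
        "Status": "Inactive",
        "CreateDate": datetime(2021, 6, 1, tzinfo=timezone.utc),
    },
]

for user, key_id in find_stale_access_keys(sample_keys):
    print(f"{user} still has a long-lived active key: {key_id}")
```

The resulting list is a migration backlog: each entry is a candidate for replacement with short-lived credentials from Identity Center or a role.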

Host: Yeah, so we'll come to short-term access versus long-term in a bit, because that is another area I want to touch on as well.

So what I'm hearing is it's all about your maturity level of the organization, right? Based on what maturity level you are in, you might have different set of goals. Like on day one, you cannot say that I will not have anyone having access to production, but as you mature in your adoption and security curves, you can slowly get to that end state.

One of the things that you highlighted is Identity Center. Any other recommendations that you have so that organizations can set this as a goal and achieve it?

Rowan: Yeah, I think the other thing, the other thing which jumps to mind is… if you take advantage of, for example, AWS Organizations. So in the early days of AWS, it was really hard to create accounts and things and manage the connections between those.

That's kind of what IAM Access Analyzer is doing. It observes those trust relationships and gives you a nice report to say, this thing over here trusts these things over here, are you okay with that? Is that a good thing or a bad thing?

Before AWS Organizations came out, it was really hard and really painful to do that. We spent a lot of time doing it. Now that Organizations is here and you have things like service control policies and, more recently, resource control policies, you really should be taking advantage of these things.

So it's not even the policy aspect, but just grouping things into accounts. You know, AWS accounts are kind of like an ownership group to say, hey, these resources are owned by this.

And from an IAM perspective, it provides that blast radius around it where it's like, okay, if someone has admin access in this account, then they can only affect things in this account. So if your production environment is not in this account, well, then you don't need to worry about it. You see a lot of organizations, sorry, a lot of AWS customers, especially pre-Organizations, that don't have enough accounts. They've put too many things in one account. They've got multiple teams, they've got multiple environments in there.

And that's just a really painful situation. There are ways to deal with it with IAM. It's just hard and it's complex, and complexity rarely makes things more secure, other than the fact that no one understands them. If you can't understand your system, well, maybe the attackers can't understand it either, so therefore you're kind of protected. It's not a good place to be.

So yeah, once you have the accounts in place, then you layer on the service control policies and the resource control policies, which are really nice because they apply at that organization level, they override individual account settings.

So I can say, hey, the root user in this particular AWS account is disabled, and the account can't leave the organization, and no one can delete the backups, and all this kind of stuff that is hard to impossible to do from within that account. Like if you say, hey, I don't want anyone to delete this, but you're an administrator, well, you can delete it anyway, because you can delete the resource policy that's protecting it.

That's where SCPs and RCPs come in and override that. So that's another really powerful tool that, especially in the case of RCPs, which are only a few weeks old at this stage, organizations definitely need to be assessing and hopefully incorporating into their security posture.
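Editor's sketch: the three guardrails Rowan lists (no root user activity, no leaving the organization, no deleting backups) map naturally onto a service control policy. The document below is an illustrative starting point rather than a vetted production policy. The choice of AWS Backup actions to deny is an assumption about what "delete the backups" means in a given environment, and the function name is hypothetical.

```python
import json


def build_guardrail_scp():
    """Return an example SCP document as a Python dict.

    Deny statements in an SCP apply to every principal in the member
    accounts it is attached to, including account administrators.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyLeavingOrganization",
                "Effect": "Deny",
                "Action": "organizations:LeaveOrganization",
                "Resource": "*",
            },
            {
                # Matches the root user of any member account.
                "Sid": "DenyRootUserActions",
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {
                    "StringLike": {"aws:PrincipalArn": "arn:aws:iam::*:root"}
                },
            },
            {
                # Assumption: backups live in AWS Backup vaults.
                "Sid": "DenyDeletingBackups",
                "Effect": "Deny",
                "Action": [
                    "backup:DeleteRecoveryPoint",
                    "backup:DeleteBackupVault",
                ],
                "Resource": "*",
            },
        ],
    }


print(json.dumps(build_guardrail_scp(), indent=2))
```

Because the policy is attached at the organization level, an administrator inside a member account cannot remove it, which is the override behavior described above. In practice a policy like this usually needs exceptions (for example, a break-glass role) before it is attached broadly.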

Host: I'm glad that you touched on some of these areas. I still remember there was a time when folks used to create different VPCs to segregate their workloads rather than having different accounts. We have come a long way from that. And with AWS Organizations and AWS Identity Center, then you have SCPs and RCPs that help you define best practices.

And also have guardrails around who can access what.

And then from a monitoring perspective or visibility perspective, you have Access Analyzer, where you can see, compared to what you have set up, what exactly is happening. So these are two areas which practitioners can use to get to a better state in a way. So, let's say I have a team which does the IAM management, policy management and things like that.

So one of the questions that we got from Steven is what strategy would you recommend teams to maximize their return on investment in these areas? Like you hired a few folks and they're doing some of this. How do you make sure you get the maximum value out of it?

Rowan: Yeah, look, it's an interesting question. It's kind of a tough one. Because I guess the thing that I come back to is when I just have my security hat on, security is a non-functional requirement. It's not something that's going to change the functionality of your system.

So getting it back into that kind of business terminology is a challenge. You know, you have to end up trying to kind of quantify, okay, well, what's the cost of an outage? Well, it depends, you know, it's how long is a piece of string kind of thing where, you know, if it's a long outage, well, it can make all the difference to a business.

Even a small outage costs you in terms of customer trust, you know, let alone any SLAs that you have in place with your customers. But yeah, giving an actual number to that, like that's going to be very case by case. I think for myself, when I think about it, you know, coming back to that idea of measuring and managing things, from an IAM perspective I like to focus on things like, you know, how many IAM users are in the environment, and can we get that down to zero or near enough to zero?

You can even go so far as to say, let's look at how many permissions we're actually giving people access to, you know, because if you give someone administrator access, then they have all of the permissions, and that's potentially not great. So really trying to link the metrics that you can get out of IAM and your security program back to a number that you can then improve on, whether that number needs to go down from a permissions perspective or maybe up from a logging perspective, I'm not sure.

Again, it's gonna depend upon the application. It is hard to kind of translate that into ROI. I guess the times when it isn't hard is when there's actually real issues going on. It's like, okay, we know we're getting taken out. Something happens every couple of weeks because one team breaks a role that someone else depends upon.

So if you have already existing problems and you can look at those and say, okay, well, this is what normal is now. Let's get rid of it. Let's get to a stage where that doesn't happen. That's a little bit of an easier kind of ROI equation to solve for. But, you know, on its own, it's really hard to actually quantify that for the business. You know, it's an ongoing challenge, and I would love to hear if Steven or anyone else has a better answer. I'd be very, very open to hearing it.
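Editor's sketch: the two numbers Rowan mentions, how many IAM users exist and how many principals hold administrator access, are easy to compute once the inventory has been exported. The function below is a deliberately simple illustration. The input shapes (a list of user names and a mapping from principal name to attached managed policy names) are assumptions made for the example, not the output format of any particular AWS API.

```python
def iam_metrics(iam_users, attached_policies_by_principal):
    """Compute two trackable IAM numbers from an exported inventory.

    `iam_users` is a list of IAM user names.
    `attached_policies_by_principal` maps a principal name (user or role)
    to the list of managed policy names attached to it.
    """
    admins = sorted(
        principal
        for principal, policies in attached_policies_by_principal.items()
        if "AdministratorAccess" in policies
    )
    return {
        "iam_user_count": len(iam_users),
        "admin_principal_count": len(admins),
        "admin_principals": admins,
    }


# Hypothetical inventory for illustration only.
metrics = iam_metrics(
    iam_users=["ci-legacy", "old-contractor"],
    attached_policies_by_principal={
        "ci-legacy": ["AdministratorAccess"],
        "old-contractor": ["ReadOnlyAccess"],
        "deploy-role": ["AdministratorAccess", "ReadOnlyAccess"],
    },
)
print(metrics)
```

Tracked over time, both counts give the security program a number that should trend toward zero, which is easier to report to the business than an abstract ROI figure.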

Host: So we will be meeting with Steven soon. So yeah, we'll definitely pose that question to him and we'll see what are his thoughts.

So one of the things that you touched on is Identity Center, right? Which sort of lets you connect to your identity provider and sort of streamline who has access, like what type of users have access to cloud.

Rowan: Cool. I'll stick back.

Host: So we had another podcast guest, Joseph South. He is a principal cloud security engineer, and he focuses on IAM. One of the things that he said, and I quote, "the ability to create users and roles at will is an advantage in the cloud, and at the same time, it's a downfall of the cloud IAM." What are your thoughts on that?

Rowan: Yeah, look, I guess the first thing that comes to mind is, you know, IAM is a very powerful tool and like any tool, it can be used to build things, you know, or it can destroy things. You know, you can build a house with a hammer or you can demolish a house with a hammer.

So at the end of the day, it comes down to kind of how you use it. And I guess how deliberately you use it. And I know this is something that I was very much guilty of in the early days, especially when serverless became a little bit more of a common offering. You know, I can remember when Lambda first appeared in the Sydney region and I was, you know, happy to not have to build AMIs and, you know, manage and patch servers and things like that.

But it took me a long time before I started to pay attention. Like, oh, look, every time I deploy this function, I deploy this role and this role tells it what it has access to do. I was like, oh, maybe I should pay more attention to that instead of just giving it, you know, all of the permissions or nearly all of the permissions.

So, you know, it wasn't till I really gave it the importance that it deserved that I went down this path of writing my book and things like that. But like I said, it took me an embarrassingly long time to actually realize that.

So, you know, when it comes to IAM access, you know, it really is the important thing in your permissions design and setup. And we can talk a little bit more about it, but I think we touched on it a bit: identity is the new perimeter. You know, we touched on networks before, and it used to be that the network was the perimeter, and you really spent a lot of time designing your networks, protecting your networks, ensuring things weren't publicly accessible.

And I don't feel like we've done the same thing to our identities yet. I think it's getting better. I think we're seeing it a lot more in terms of like data perimeters now that everyone's like, you know, really paying attention to their data.

Now we're starting to apply some of those similar concepts, but to the identity, because the reality is in the cloud, the right identity can change the network. The right identity can do whatever it wants. And in AWS terms, that's IAM. That's saying, hey, if you have the ability to create IAM resources, if you have the ability to attach policies to things, you can escalate your privileges and do pretty much whatever you want. Yeah, you might have to jump through a few hoops to get to it, but you can do it.

And, you know, specifically in AWS terms, this is what I think permissions boundaries are really meant for. You can use them to do other things, but I think the primary use of permissions boundaries is to allow a subset of IAM functionality. Say, hey, you can create a role, for example, for your serverless function, but the role can only do these things.

So it's that kind of second layer of defense, instead of saying, here you go, you can just create IAM roles because I know you need to do it for your serverless application. So there are ways to mitigate this. They're not necessarily easy to do.

In this case, in this example specifically, I'd say a much easier solution would be to give developers or teams their own AWS accounts and just say, hey, it's your account. If you do something silly in it, well, you'll have to clean up your own mess.

Unfortunately, if you're in an environment where you have multiple teams, you can't really do that or you won't have much fun if you do do that. So yeah, this is the thing where you have to treat it with the respect it deserves and probably spend some time setting it up in a way that makes sense.
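Editor's sketch: the "you can create a role, but the role can only do these things" pattern is typically expressed with the `iam:PermissionsBoundary` condition key. The policy below is a minimal illustration of that idea, not a complete delegation setup. The boundary policy ARN is a hypothetical placeholder, and a real deployment would also need to stop developers from editing the boundary policy itself and would usually scope `Resource` to a role name prefix.

```python
import json


def build_delegated_role_policy(boundary_arn):
    """Return an example identity policy that delegates limited IAM access.

    Developers may create and modify roles only when the given permissions
    boundary is attached, and may not strip the boundary off afterwards.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CreateAndEditRolesOnlyWithBoundary",
                "Effect": "Allow",
                "Action": [
                    "iam:CreateRole",
                    "iam:PutRolePolicy",
                    "iam:AttachRolePolicy",
                ],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {"iam:PermissionsBoundary": boundary_arn}
                },
            },
            {
                "Sid": "DenyRemovingBoundary",
                "Effect": "Deny",
                "Action": "iam:DeleteRolePermissionsBoundary",
                "Resource": "*",
            },
        ],
    }


# Hypothetical boundary policy ARN for illustration only.
example_boundary = "arn:aws:iam::111122223333:policy/serverless-function-boundary"
print(json.dumps(build_delegated_role_policy(example_boundary), indent=2))
```

The effective permissions of any role created this way are the intersection of its own policies and the boundary, so even a role given broad permissions cannot exceed what the boundary allows.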

Host: So I think you touched on, again, the boundary, defining your boundary, and also maybe having a data perimeter and things like that. So even though you are providing permissions to someone to, let's say, create a role or something like that, they have a boundary even for that, like what type of roles they can create or what type of permissions they can assign to those roles. So that in a way helps you keep a check on what's going on in your IAM.

So one of the things that you touched on earlier around like short-term access and limited access and things like that. And the term that folks use is just-in-time versus having permanent access or just-in-case access. And AWS, I think, also has recently open-sourced a library. I think it's called Smart or something, which does just-in-time as well.

Rowan: I think it's TEAM. It's an acronym, T-E-A-M.

Host: TEAM, correct, not Smart, correct, I was thinking about something else. Thanks for correcting it. Yeah, TEAM. So can you help our audience understand what JIT is and how it helps in the cloud context?

Rowan: Yeah, look, so just-in-time access is, in my mind, a very advanced strategy. Like it's really about reducing that standing access down to as little as possible. And you really have to have like a high level of automation in order to be able to do just-in-time access just-in-time. And so there's very few organizations out there that I have seen that are really ready for this. There are some.

And those are often the ones that are writing blog posts and sharing things like that. You just want to double check the TEAM solution. I don't think it's been shown much love recently. So just double check the GitHub issues before you go deploying that one. I'd be happy to be wrong about that if it has gotten some love recently.

Yeah, look, so being able to say, hey, I don't have access to an environment, you know, maybe my production environment, but then I'll put in a request, I'll raise a ticket, you know, there's a really nice kind of audit trail to say Rowan requested access to production. Was he approved by someone? Yes, he was. Okay. Now he has access for this, you know, predefined window. You know, that's a really nice solution. I think, you know, coming back to one of our earlier points, a lot of organizations have other problems that they need to solve before they get to that stage.

But again, it's a good goal to aspire to and to work towards, you know, over time, because, you know, it really is that kind of sweet spot. Not having access provisioned just dramatically reduces the chance of things going wrong. You know, like a lot of issues are caused by, the developer thought that they were in this particular account, but guess what? They were in the production account. So when they deleted that database, it was the wrong database.

You know, I'd say that's an unfortunately large number of incidents and outages. So you very effectively protect yourself against that. The audit trail as well, like I touched on before, is really nice. You know, it just makes it easier to understand, hey, why did you do this? You've put a little description in your request saying I need to go and change this thing.

And then, you know, you can alert on it, saying, hey, if you access the environment without this, well, that's probably a red flag. That's a bad thing, you know, and that would definitely catch a lot of the kind of attacks that we're seeing out there now, you know, where people took keys and did something they shouldn't have done, or, you know, they got them off a developer's machine. You can protect yourself against a lot of those, you know.

So yeah, it's definitely a very nice thing to have, but it's also very advanced, you know, like once you get to that stage, I'm not really sure if there's much else left for you to do from a security perspective.
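Editor's sketch: the just-in-time flow Rowan describes has three parts: a request with a reason, an approval, and a predefined window, plus an alert for any access that falls outside an approved window. The code below models only that logic. It is not how TEAM or any other product implements it, and every class, field, and function name is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class AccessRequest:
    requester: str
    account: str
    reason: str                       # the "why" that forms the audit trail
    start: datetime
    duration: timedelta
    approved_by: Optional[str] = None  # None until someone approves

    def covers(self, who, account, when):
        """True if this request is approved and grants `who` access at `when`."""
        return (
            self.approved_by is not None
            and self.requester == who
            and self.account == account
            and self.start <= when < self.start + self.duration
        )


def find_unapproved_access(access_events, requests):
    """Return access events not covered by any approved request (red flags)."""
    return [
        event
        for event in access_events
        if not any(
            r.covers(event["who"], event["account"], event["when"])
            for r in requests
        )
    ]


# Hypothetical data for illustration only.
window_start = datetime(2025, 4, 9, 9, 0, tzinfo=timezone.utc)
requests = [
    AccessRequest("rowan", "production", "rotate database credentials",
                  window_start, timedelta(hours=1), approved_by="puru"),
]
access_events = [
    {"who": "rowan", "account": "production",
     "when": window_start + timedelta(minutes=20)},
    {"who": "rowan", "account": "production",
     "when": window_start + timedelta(hours=3)},
]

for event in find_unapproved_access(access_events, requests):
    print(f"ALERT: {event['who']} accessed {event['account']} at {event['when']}")
```

The second event falls outside the approved hour, so it is the one flagged. The same check would flag stolen credentials used with no request on file, which is the attack pattern mentioned above.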

Host: I love how you connected it to maturity level again, right? Like maybe on day one, you should not think about JIT because it goes back to what you said earlier, that business comes first, right? Like if you are a one-year-old or two-year-old startup, you want to move fast and you may not want to implement a just-in-time kind of a solution. But as you mature, maybe just-in-time could be one of those areas where you invest and you limit access.

Rowan: Yeah. I mean, exactly what you've identified there, which I didn't say, is there's a cost associated with doing these things. You don't get just-in-time access for free. You don't just click your fingers and do that, just like with the production access thing that we were talking about. It takes effort.

There's this really good quote. I think it's by one of the principal engineers. Is it Eric Brandwine? I think it is. I could be getting it wrong, but he says, "Least privilege equals maximum effort". You know, so in the context of security, especially IAM, everyone talks about doing least privilege. Yeah, we just do least privilege, do least privilege.

And again, very few customers or customers of AWS, very few clients I work with actually do least privilege because it is very hard. It takes a lot of effort. You need to know exactly what you need to do in AWS in order to get to that stage. And I think a lot of people, maybe security practices, maybe compliance auditors just say, no, just do least privilege. It's just not that easy and it's a lot of effort.

Host: And it's also not that you just do it once and you are done with least privilege. You have to constantly keep monitoring and optimizing permissions and revising the permission assignments. It's a continuous activity. So that means there is effort.

Rowan: Definitely. Yeah, exactly. You know, like if everything was static, maybe you don't have to, but I mean, in the cloud, the cloud itself is changing underneath you. You know, for example, RCPs just came out. So now you've got to kind of incorporate that, and that's a good thing. But from a business perspective, you know, to bring it back to that business thing, it's like, well, we don't have to change if the business doesn't change.

But most businesses are going to change over time. You know, some will change faster, some will change slower, but I would argue that almost every business that wants to keep being a business will have to change and evolve and improve over time. And that then flows on to, okay, well, this was least privilege according to last year, but it's not least privilege according to this year.

So, you know, you've got to take that into account. I think a lot of people forget about it, maybe don't realize it, don't appreciate it, and the effort that goes alongside that.

Host: Yeah, yeah. No, I love that thing. Least privilege has the maximum effort. And that is very, very true. This is one of the things that happened with one of our customers also, that someone had permanent access and they deleted some CloudFormation stacks, which had some resources attached to them and they got deleted. And they were like, what just happened? Some of their services stopped working. And when they analyzed, they found out that they had very high privileges given to some users, and they had permanent access as well.

So that's where, as you said, as you mature, you would start looking into some of these areas. And things like just-in-time can help you with optimizing the permissions and reducing the attack surface.

One of the questions that often comes up when it comes to cloud IAM: like you touched on earlier, the network used to be one of the focus areas. So folks were optimizing the network architecture and things like that. Now in the cloud, identity is one of those areas. And it's one of the core components of Zero Trust or Data Perimeter and things like that.

So we got this question from a security leader of a healthcare startup. They have a remote first culture. And since they are in healthcare, they have to follow some of the strictest compliance regulations, audit regulations, and things like that.

How should they provide access to the data to their employees in a secure manner? And how do you see IAM play a role?

Rowan: Yeah, look, I'm working with a health tech startup at the moment. So I definitely know the challenges in this space. Look, the thing for me, think, and it kind of comes back to the visibility I was talking about earlier from an access perspective. But in this case, it's more of that data focus, know, talk about a data perimeter than obviously data is the actual thing that you're concerned about.

And so for me, the very first thing to do, which unfortunately, I still think few organizations do do is know exactly where all of your sensitive data is. And I don't mean mostly, I mean exactly like, and from a technical perspective, you know, you should be to go to anyone on the team and say, Hey, where's the PII data? Where's the person identifiable information or where is the sensitive information? And they should know off the top of their head where that is. Is it this bucket? Is it that bucket? Is it this database? Is it that DynamoDB table?

Because unless you know where that is, you really can't do a good job of securing it. Because when it comes to practical, you know, kind of hands-on tools and actions around protecting data, you're going to have different options depending upon where it is. So for example, I've recently rediscovered the AWS database encryption libraries.

There's these client-side SDKs that AWS provides that integrate things like DynamoDB and KMS and say, we'll encrypt this on the client side. So this is nice because it means that even an administrator in the AWS account, in the AWS console looking at a DynamoDB table, only sees masked values. But when a customer, when the end user maybe does have access to that data or should have access to that data, when they request it, it gets dynamically decrypted and presented to them in a secure and compliant way.

And so they've actually made this really quite easy to do, which is nice because it's not like it's a unique problem. You know, it's common to almost all health tech businesses and even other industries as well, especially when you look at things like PCI and things like that. So there's a lot of ways to solve this problem and you don't need to reinvent the wheel to solve this problem because other people have already done it.

But like I said, you have no chance of doing it well if you don't know where your data is. And so I would definitely say to all organizations, if you have any questions in your mind about what the answer to that question is, you should go off and find that out straight away. That's the very first step.

Host: Which means like labeling your data, whether it's PII, PHI, and different sensitivity levels, and that helps you in understanding what type of data resides where. One of the things that we have seen with some customers is they use tags. But not everybody uses tags. Do you think tags are one of the solutions for this, or are there other solutions that you recommend generally?

Rowan: Yeah, look, tags is not a bad solution on AWS. You know, most resources now can be tagged, especially the ones that store data. So I would use that for this data classification thing. I have some reservations with tags for attribute-based access control, but that's a very different topic. Obviously still security related, but you just have to remember that originally tags on AWS were used for cost allocation and still are used for cost allocation.

That's really their primary function. Can we use that also for data classification? Yeah, it turns out we can. You just have to be careful because once you start relying on this for more than just cost allocation, then you need to be a lot more careful about, well, who has access to change these tags and rewrite these tags. And that's why I have issues with it from an attribute-based access control perspective, because now you're not depending just on IAM. You're now depending upon the tagging functionality of all the AWS services, which is, unfortunately, somewhat inconsistent in terms of when you can do it, how the permissions are handled for it, and things like that, mainly for historical reasons.

But yeah, tags for data classification is fine. You know, in the very worst case scenario, you would classify things at a resource level and say, hey, this database has sensitive information in it, this bucket has sensitive information in it, and maybe that's in the naming convention or something like that. And that way you can instantly know, you know, this is a sensitive resource.

Host: It's interesting that you touched on like tags is a good way, but who has access to modify the tags? I had never thought about it from that perspective. Like I always used to think, yeah, you apply tags and you use tags for filtering and finding out different types of resources and things like that.

But yeah, I mean, if you give tagging permissions to everyone, then somebody, some insider, can also make those changes, right? So that could lead to...

Rowan: Yeah. And look, it's unfortunately even worse than that. Like, that would be the scenario if tagging was consistent across resources. But what you have to remember is that for some resources in AWS, you can only create tags when you create the resource and then you can never come back and edit them.

Or there are other ones where the tagging call is a separate API call. And that's great because you can say, Hey, do you have permission or not for that? For some of them, it's included in the update command, you know, it says like, I'm updating this resource.

And I just happen to be updating the tags too, but you can't tell that in advance. There's no condition that prevents that kind of thing. So yeah, look, it's definitely a gotcha in terms of using tags for things other than cost allocation. And I have a very old blog post that still gets a lot of visits about my feelings about this, but that's probably the topic for another podcast.

Host: Now, I see your point, right? That it's not consistent even at an API level. Like even if at an organization level, you decide that, we want to have tags and we want to enforce it and we want to limit access to who can add tags, but it's not uniform for all the APIs. So it becomes a little difficult to sort of implement and then manage it long term.

Rowan: And unfortunately you've triggered me now. Now I'm thinking about the Organizations tag policy, which you'd think would be used to enforce tags, but unfortunately it only checks that, if a tag is already set, it adheres to the policy that you've set. So it doesn't actually force you to do any tagging. So you can't say, hey, I want a tag policy that says everything should have an owner tag, as an example. It doesn't do that. It only can say, well, if there is an owner tag, I want it to be one of these three values kind of thing.
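Editor's sketch: the behavior Rowan describes can be illustrated with a small model. The tag policy below follows the AWS Organizations tag policy syntax, but the `owner` key, the three team values, and the `complies_with_tag_policy` helper are hypothetical and only approximate the real compliance check. The second document is a commonly documented SCP pattern (not something Rowan recommends in this episode) that denies a create call when the tag is missing from the request, shown here for `ec2:RunInstances` only.

```python
# Tag policy in AWS Organizations tag policy syntax.
# The "owner" key and the three allowed values are hypothetical.
TAG_POLICY = {
    "tags": {
        "owner": {
            "tag_key": {"@@assign": "owner"},
            "tag_value": {"@@assign": ["team-a", "team-b", "team-c"]},
        }
    }
}


def complies_with_tag_policy(resource_tags: dict, policy: dict) -> bool:
    """Simplified model: a tag is only validated if it is present."""
    for rule in policy["tags"].values():
        key = rule["tag_key"]["@@assign"]
        allowed = rule["tag_value"]["@@assign"]
        if key in resource_tags and resource_tags[key] not in allowed:
            return False
    # A resource with no "owner" tag at all still passes.
    return True


# SCP sketch: deny launching an instance when no "owner" tag is supplied
# in the request. The Null condition is true when the key is absent.
REQUIRE_OWNER_TAG_SCP = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyRunInstancesWithoutOwnerTag",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {"Null": {"aws:RequestTag/owner": "true"}},
        }
    ],
}

print(complies_with_tag_policy({}, TAG_POLICY))
print(complies_with_tag_policy({"owner": "someone-else"}, TAG_POLICY))
```

In this model an untagged resource is reported as compliant, which is the gap Rowan points out. The SCP approach has to be written per action, which ties back to his point about tagging being inconsistent across services.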

Unfortunately, a lot of people get caught out with that. They're like, just use the Organizations tag policy, that'll solve this problem, won't it? Like, no, it unfortunately won't.

Host: I think we went too deep into the tags, I guess.

Rowan: We'll change the topic of the podcast, you know.

Host: Haha! So far, we were talking about access, let's say, human's roles and things like that. But when it comes to cloud, let's say my primary purpose is to build some apps and deploy it. And as part of that, we generally do service-to-service interaction analysis and see which service is talking to which service.

And when we bring in cloud, sort of that becomes an additional attack surface. Now you need to figure out, let's say you have a serverless Lambda running, talking to S3, what type of permissions are there. And not just that, like if you are deploying your applications in Kubernetes with microservices architecture, you need to also have guardrails between a service to service communication and things like that.

So in some of these scenarios, what have you seen? How do organizations keep the attack surface to a minimum? IAM is definitely used in that case, because these services need some permission on the cloud.

Rowan: Well, from a network perspective, I actually started making my own tool for visualizing that. It's called yourpublic.cloud. That's just basically trying to find... The challenge I had was that all the different AWS services have different things that could or could not be public. And I just wanted to see it all in one place.

And whether or not it's because of the whole two-pizza team thing that used to be happening in AWS, it just wasn't consistent and there was no one way to see, okay, how is everything configured from an external perspective? Bringing it back to the identity side of things, this is where I think we've touched on the main tools that are there for you, things like Access Analyzer, things like changing your account privilege, sorry, your account setup.

The key thing to remember with accounts is that within an account, you don't need a resource policy to talk to a resource and to interact with it. You do need an identity policy, so the principal who's taking the action, whether it's a human or a Lambda function or an EC2 instance, it needs to be given permission to do things, but there doesn't need to be anything on the resource side.

Once you cross that account boundary in AWS, that's where you need not only an identity policy for the principal, but also a resource policy for the resource to say, yes, I explicitly trust this principal or this particular account.
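Editor's sketch: for the cross-account case Rowan describes, the two policies might look like the pair below, using an S3 bucket as the resource. The account IDs, bucket name, and role name are hypothetical placeholders. Explicit denies, SCPs, and permission boundaries are left out.

```python
import json

# Hypothetical placeholders, not values from the episode.
DATA_ACCOUNT = "111122223333"   # account that owns the bucket
APP_ACCOUNT = "444455556666"    # account where the principal lives
BUCKET = "example-shared-bucket"
ROLE_ARN = f"arn:aws:iam::{APP_ACCOUNT}:role/example-app-role"
OBJECTS_ARN = f"arn:aws:s3:::{BUCKET}/*"

# Identity policy: attached to the role in the application account.
identity_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject", "Resource": OBJECTS_ARN}
    ],
}

# Resource policy: attached to the bucket in the data account. It names
# the role in the other account as a trusted principal.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": ROLE_ARN},
            "Action": "s3:GetObject",
            "Resource": OBJECTS_ARN,
        }
    ],
}

print(json.dumps(bucket_policy, indent=2))
```

Across the account boundary both documents are needed: removing either one means the request is not allowed.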

So it gives you an extra control point that you didn't have before necessarily. And that's why resource control policies are so relevant, because they enable you to put guardrails at an organization level, just like you had service control policies from the identity and principal perspective there, saying, hey, you can have these kinds of identity policy actions, and you can't have these other ones. Resource control policies do the same thing for resource policies and say, hey, you can be shared with these kinds of things, but not other things.
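Editor's sketch: a resource control policy of the kind Rowan mentions could look like the following. It follows the data perimeter pattern of denying access to S3 resources for principals outside the organization, while leaving AWS service principals alone. The organization ID is a hypothetical placeholder, and a real rollout would likely need further exceptions.

```python
import json

ORG_ID = "o-exampleorgid"  # hypothetical organization ID

# RCP sketch: deny S3 access to any principal that is not part of the
# organization. The IfExists variants keep the statement from matching
# requests where the condition key is absent.
restrict_to_org_rcp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EnforceOrgIdentities",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": "*",
            "Condition": {
                "StringNotEqualsIfExists": {"aws:PrincipalOrgID": ORG_ID},
                "BoolIfExists": {"aws:PrincipalIsAWSService": "false"},
            },
        }
    ],
}

print(json.dumps(restrict_to_org_rcp, indent=2))
```

Like an SCP, this only sets a ceiling. It does not grant anything, so the identity and resource policies still have to allow the access.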

So those are the main tools. And obviously, IAM Access Analyzer to kind of review the actual resulting configuration. So you might have your intention being one thing, but if that's not what's actually running in your environment, then better that you know exactly what that is.

So, you know, from an identity perspective, there's that aspect of it. And yeah, like I said, from the network perspective, you know, that public attack surface, it is unfortunately hard. There's a lot of different services that can be configured in a lot of different ways. There are the main ones, you know, like load balancers, CloudFront distributions, Lambda function URLs. I think people are generally kind of aware of those things depending upon how many accounts they have. But yeah, this is something that I definitely have some feelings about.

Host: So we spoke about IAM, the usage of it, how do you limit and things like that. Now, when it comes to organizations, I am curious to understand what have you seen? Like what are some of the common misconfigurations that you have seen folks do, which can be maybe fixed quickly.

Rowan: The biggest one that you see kind of again and again is those organizations, so those customers that predate AWS Organizations, you know, and they haven't necessarily gone back. And again, it kind of comes back to that non-functional requirement. The AWS environment is operating, you know, their applications are running in AWS, but they're really missing out on a lot of the newer features that have come and are just not taking advantage of those things.

I guess, one of my big things about security on AWS is I try and simplify things as much as possible because complexity doesn't generally result in more security. In fact, the opposite is true. If you're struggling to understand and visualize what you've got because it's too complex, then there's a pretty good chance you haven't secured it either.

And in the world we live in, it's only a matter of time until this kind of comes to light. So as much as possible, I try and simplify things, and that means doing those organizational-level controls. So for example, SCPs, service control policies, they're the only way to limit the root account in an AWS account. So every account has a root account credential. It's kind of a holdover from the original setup of accounts. Actually, just recently in the last couple of months, since last re:Invent, you can now delete those credentials, which is good too. And that obviously reduces the chance of someone guessing your password.

Before that, there were, like, some 50-character passwords that were set automatically when you created an account through Organizations. But this way you can delete it entirely and then still layer on a service control policy that says, hey, I don't want the root user in this account to do anything, because it's what enables an account to leave an organization, as an example.

So you kind of want to put these guardrails in place because they dramatically reduce your attack surface and they don't take too much effort. You know, once you've set up these service control policies, they can apply to every account you ever have.

So you might have one account today and a hundred accounts next year. You don't actually have to do anything extra. They just apply. So yeah, taking advantage of those features can really simplify your security posture while not compromising on how well it protects you.
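Editor's sketch: the root user guardrail Rowan describes is commonly written as an SCP like the one below. The statement IDs are arbitrary, and the second statement is an added illustration of blocking the leave-organization action directly. Note that SCPs do not apply to the organization's management account.

```python
import json

# SCP sketch: block all actions by the root user of member accounts,
# and block any principal from removing the account from the organization.
deny_root_scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyRootUser",
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                "StringLike": {"aws:PrincipalArn": "arn:aws:iam::*:root"}
            },
        },
        {
            "Sid": "DenyLeaveOrganization",
            "Effect": "Deny",
            "Action": "organizations:LeaveOrganization",
            "Resource": "*",
        },
    ],
}

print(json.dumps(deny_root_scp, indent=2))
```

Attached at the organization root or an organizational unit, a policy like this is inherited by accounts created later, which is the "one account today, a hundred next year" effect described above.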

Host: So speaking of features, what are some of the lesser known features that you think are super helpful? Like, like SCPs, RCPs, things like that are like industry-wide known, right? What have you seen some like small tricks which, or maybe not tricks, like some lesser known features which can be helpful?

Rowan: Yeah, yeah, the ones I don't see used enough. So for example, GuardDuty is one thing that I still don't see turned on all the time. I think people may be afraid of the cost, but you know, at smaller scales, it's very cost effective. And when you see some of the things that it can pick up, for example, it can pick up things like credential exfiltration. So someone might grab some short-term keys off your EC2 instance and then start using them somewhere else. GuardDuty is going to flag that. And that's just one example.

But I find it really effective and I find not many people actually using it in practice, in production. From an IAM perspective, you've got things like session policies. Now, they're kind of the very last policy that gets evaluated and they're a little bit different in that they can only remove permissions. They can't grant any permissions.

So, you as a principal, you might be assuming a role. The most permissions you can have are already in that role. You can't say, hey, give me administrator as well, please. But what you can do is limit certain permissions.

So again, it's a pretty advanced technique, but you might say, hey, I'm Rowan. I have administrator access, but I don't want to use that right now. So what I'll use is my session policy. And you can do this through how you manage your sessions. When you start them off, maybe if you have a script or something, you say, hey, this script is going to run. It's going to assume a role, which is totally fine, it gets some short term credentials, which is great, but it's also going to limit itself to just the permissions that it needs. Even though it theoretically could have admin access, I'm not going to use it because I know that's bad. That's not the right thing to do.

So I'm going to do the right thing and limit myself and say, hey, this thing is only doing these actions. So I'm going to limit myself to those actions. And that's a way to do it dynamically that doesn't require kind of infrastructure changes. You know, I don't need to redeploy my or anything like that, which is not something you should be doing on a regular basis.

Host: Yeah, that's a pretty neat one. So let's say you are an admin, but you are trying to just change one, let's say some configuration in S3, you got some script from ChatGPT or something like that, and you're running it. You might want to limit it, right?

Rowan: Great example, yeah. You might not trust it 100%, you know, so you don't want to say, hey, give it access to everything. Yeah, you can apply that when you do the assume role command. You could say, yep, just put some guardrails on it. Maybe just give it access to S3. Don't give it access to everything. Or maybe just this particular bucket. Just scope my actions to this particular bucket. And you've reduced that blast radius of anything going wrong with not much effort really.

And especially where I find myself using it mostly is in, like, scripting kind of examples. I liked your example there where it's like, okay, I know I'm gonna run this script multiple times. I don't necessarily wanna provision a role just for this script. That seems like a bit too much work. So I will just declare up the top of my script what it is I expect to do. And that way, if you do run into any permissions errors, you can be like, okay, either I'm doing too much or maybe I didn't understand what I was actually intending to do, you know, and that unfortunately happens quite a lot, you know, like, oh, I needed this permission as well.

But it forces you to deal with that and be aware of that rather than just saying, hey, I'll just give you administrator access and hope for the best.
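Editor's sketch: declaring a session policy "up the top of my script" might look like this with boto3, where `Policy` is a real parameter of the STS `AssumeRole` call. The role ARN, bucket name, chosen S3 actions, and helper names are hypothetical. `boto3` is imported inside `scoped_session` so the policy-building part can be read and run on its own.

```python
import json


def build_session_policy(bucket: str) -> dict:
    """Declare the only actions this script expects to perform."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetBucketVersioning", "s3:PutBucketVersioning"],
                "Resource": f"arn:aws:s3:::{bucket}",
            }
        ],
    }


def assume_role_kwargs(role_arn: str, session_name: str, bucket: str) -> dict:
    """Arguments for sts.assume_role, including the inline session policy."""
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "Policy": json.dumps(build_session_policy(bucket)),
        "DurationSeconds": 900,
    }


def scoped_session(role_arn: str, session_name: str, bucket: str):
    """Assume the role with the session policy applied and return a session."""
    import boto3  # needs AWS credentials that are allowed to assume the role

    sts = boto3.client("sts")
    creds = sts.assume_role(**assume_role_kwargs(role_arn, session_name, bucket))["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )


kwargs = assume_role_kwargs(
    "arn:aws:iam::111122223333:role/example-admin-role",  # hypothetical role
    "scoped-script",
    "example-bucket",
)
print(kwargs["Policy"])
```

The resulting credentials carry the intersection of the role's permissions and the session policy, so even an administrator role is limited to the two bucket actions for the lifetime of that session.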

Host: Yeah, true, true. No, this is a pretty neat feature. I'll definitely give it a try myself. I'm looking forward to trying it out. So speaking of ChatGPTs, and we cannot end the podcast without talking about LLMs because we live in that era, right? So with the LLMs, their capabilities and their ever-changing and ever-evolving possibilities,

Rowan: Of course not.

Host: How do you see these helping or causing issues for, let's say, an IAM engineer?

Rowan: Yeah, look, we're still not yet at the stage where I think you can really trust it to do things like write policies. Now, it can help, especially in some of those really common scenarios, like, I've got a Lambda function with an API gateway. Some of those scenarios that have been seen very much obviously exist in the training, you know, data set that these models were trained on. But I have definitely noticed it myself. As soon as you get out of those very common scenarios, it starts to hallucinate things and just not do a great job of it. So that is often where you end up when you're trying to do something like least privilege, you know, where you're getting really specific about, well, I'm doing these update actions, but I'm not doing the delete actions and stuff like that for very specific services or resources. Unfortunately, it just doesn't work very well for that now.

I did see some things from an AWS engineer just in the last few days around MCP into the programmatic reference for services, which just announced resource and condition support as well. So it will probably get better in the very near term. But at the same stage, at the moment, you still need to watch it pretty closely from an IAM perspective.

I think more generally in the industry, I think we're probably going to start to see with all the vibe coding going on, there's probably going to be some vibe coding security incidents in the not too distant future, you know, which was probably going to happen anyway, whether people used vibes or not to code their solutions. But I think it's only a matter of time before that becomes a bit of a news item.

But I guess part of me is hopeful that it will help people do a better job of that kind of baseline security, at least kind of asking the questions and things like that. But unfortunately, I don't think it's going to solve all of our security problems anytime soon.

Host: So yeah, like nowadays with agentic AI, we're looking at autonomous services running. And maybe in the IAM space also, you might see someone building an agentic solution, having all the guardrails that we spoke about, like your SCPs, RCPs, and your data perimeter, things like that. And like the session policies, those should help organizations limit the blast radius, even in the GenAI era we are in.

One last question that I have for you on this is, like you have been in the IAM space for a while, and especially we have LLMs and all of those, a new technological shift happening. What do you think users should expect, like IAM users should expect in 2025 or beyond?

Rowan: Yeah. Yeah. Look, I mean, like I said, other than the potential kind of security incident that I suspect will happen soon. Yeah, look, I think, again, I just touched on it there, but to give it a bit more detail, I hope that it will kind of raise the floor of the security posture here, because what you can do, you know, what I've definitely done, is to say, hey, give me a security review of my code.

And even if that's not perfect, just going through that process, you know, and especially, I imagine, someone maybe who's less experienced with secure coding practices, you know, they may not be able to get the attention of a senior developer or someone to go through their code with a fine-tooth comb and, you know, point out, hey, look, you're just accepting this user input and you're just, you know, putting that in your database. That's probably not a good idea.

You know, AI is quite good at looking at that code and going, yep, you should fix this, you should fix this, you know, and you might say, hey, assess my code against the OWASP Top 10. So I think if people do that, you know, and we're seeing more kind of static code analysis tools that are based on LLMs that actually do a pretty good job of giving decent suggestions and, you know, might even suggest, hey, here's how you can solve it. Because guess what? I've got a lot of examples of how that, you know, Top 10 issue was addressed in this code base. It's very similar to yours. You know, again, it's something you don't need to reinvent the wheel for. And I think that's a good fit for code.

So I would love to see that become part of people's workflow and just say, hey, because it costs us so little from a time and effort perspective to get the AI, you know, the LLM, to review this code, let's do it. You know, and kind of just enhance that security review and awareness space. So that's what I hope it gets to.

I think, you know, like with any new technology, you know, I think we're still in the trust but verify stage, you know, still need to be reviewing things before they go into production. You know, in the case of, you know, again, IAM policies is where I've probably spent most of my time, you know, the...

A lot of what I've seen it do is just hallucinate actions and just put actions in there that don't exist. You know, unfortunately, I'm going to mention tagging again. You know, it'll say, well, you want to have the update tag permission for this particular service. I'll say no, because it doesn't exist. You know, I can't set that permission. So, you know, that's just an example where things do go wrong. Now, is that a big deal? You know, if you're granting people permissions that don't exist, it's not that big of a deal.

But yeah, you do need to still pay attention to it because ultimately you're going to have to deal with the result of whatever it spits out. But I think that's a broader concern than just security or IAM policies.
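Editor's sketch: one way to catch the hallucinated actions Rowan mentions is to compare a generated policy against a list of actions known to exist. `KNOWN_ACTIONS` below is a tiny hand-written sample for illustration only; in practice the list would come from the AWS Service Authorization Reference. The `unknown_actions` helper and the `s3:UpdateBucketTags` action in the sample policy are made up for the example.

```python
# Tiny illustrative sample; a real check would load the full action list.
KNOWN_ACTIONS = {
    "s3:GetObject",
    "s3:PutBucketTagging",
    "dynamodb:TagResource",
    "dynamodb:UntagResource",
}


def unknown_actions(policy: dict, known_actions: set[str]) -> list[str]:
    """Return actions in the policy that are not in the known list."""
    known = {action.lower() for action in known_actions}  # IAM actions are case-insensitive
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    missing = []
    for statement in statements:
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        for action in actions:
            if "*" in action or "?" in action:
                continue  # wildcard patterns are not checked in this sketch
            if action.lower() not in known:
                missing.append(action)
    return missing


# Policy as an LLM might produce it; "s3:UpdateBucketTags" is invented.
generated_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:UpdateBucketTags"],
            "Resource": "*",
        }
    ],
}

print(unknown_actions(generated_policy, KNOWN_ACTIONS))
```

A check like this only flags names that do not exist. It says nothing about whether the remaining actions are appropriate, so the review step Rowan describes is still needed.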

Host: No, and you're right. Like, often when you get a response from Claude or ChatGPT for an IAM policy, if you ask the same question again in a way that says, hey, I don't think it works, you'll see, hey, I'm sorry, this is not right, let me tell you another way. Like, there is that hallucination impact for sure with LLMs, but hopefully that will get better in the near future.

And one of the things that you touched on was that you would want to see folks maybe using AI to do some of the security scans, right? As part of their pipeline or something like that. Because GenAI models have seen the common weakness patterns or OWASP Top 10 based attacks and things like that, they would be able to go through it pretty quickly versus a human security engineer. So that's definitely a value add that we can get from GenAI tools.

So that sort of brings us to the end of the podcast. But before I let you go, I have one last question. Any reading recommendation that you have for our audience? Like it can be a blog or a book or a podcast or anything.

Rowan: Well, I guess one that's kind of in the IAM space that isn't my book is, I like to follow these, there's these repos where they actually track all the changes to the managed IAM policies. Now, I find it interesting. I don't know if you've ever heard of it. There's one, I think it's, what's it? M-A-M-I-P, MAMIP. I can't remember what the acronym stands for. But it basically tracks, it's by z0ph, it monitors AWS managed IAM policy changes. Another one is called track IAM, but basically it will show you all of the changes that AWS is making to their managed policies. And so one of the funny things about this is you can actually see when services are about to be released, because before they can put the blog post up announcing it, they need to make sure that, you know, the AWS permissions exist so that you can then jump into your account and use the service that you've just read the blog post about.

So especially around kind of re:Invent time, you get to see kind of what's coming. But it just kind of shows you where the changes are happening from an AWS side of things, because whenever they do these changes, you obviously need permissions to use those changes. So I kind of keep an eye on those every couple of days just to get a feel for what's going on. So yeah, that's what I read. I don't know if it's for everyone though.

Host: Okay. Yeah, no, I think you're right. Also, if you follow security researchers who focus a lot on AWS, they also talk about new permissions or permissions in, like, staging environments and things like that. So yeah, absolutely makes sense. So yeah, that sort of brings us to the end of the podcast.

It was a very engaging conversation. I learned a few things from it, which I'm looking forward to trying out. So thank you so much, Rowan, for coming to the podcast and sharing your experience and knowledge.

Rowan: No worries. Thanks for having me.

Host: Absolutely. And to our audience, thank you so much for watching. See you in the next episode.