Managing ACLs on Gen 2 Data Lake
Hello!
Authorising access to data can become very difficult to manage if not enough thought is given to authorisation at the outset, and this is especially true of something like Azure Data Lake, because it’s going to store a large amount of data across a lot of directories. The access control model presents a number of options, the most straightforward of which is role-based access control (RBAC). But RBAC roles are called “coarse-grained” for a reason: they do not allow access to be granted on specific directories, only on the entire storage account (or, at best, an entire container). If you store your data across many storage accounts and can manage access this way then this is not really a problem, but it is highly unlikely anyone can effectively use the RBAC permissions on their own.
So the next step is to incorporate Access Control Lists (ACLs). Each ACL entry pairs an Azure AD object (a user, group, service principal or managed identity) with the combination of permissions (read, write, execute) it has on the relevant directory or file.
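To make the shape of an entry concrete, here is a minimal sketch in Python that parses and formats the short-form ACL string used by Data Lake Storage Gen 2 (`[default:]type:id:permissions`). The `AclEntry` class and `parse_acl_entry` function are hypothetical names made up for illustration, not part of any Azure SDK, and the object ID in the usage note is a placeholder.

```python
from dataclasses import dataclass

VALID_TYPES = {"user", "group", "mask", "other"}


@dataclass(frozen=True)
class AclEntry:
    """One ACL entry (hypothetical helper type, not an Azure SDK class)."""
    type: str            # "user", "group", "mask" or "other"
    object_id: str       # Azure AD object ID; empty for owner, mask and other entries
    permissions: str     # three characters in r, w, x order, e.g. "r-x"
    default: bool = False  # True if the entry belongs to the default ACL

    def to_short_form(self) -> str:
        scope = "default:" if self.default else ""
        return f"{scope}{self.type}:{self.object_id}:{self.permissions}"


def parse_acl_entry(text: str) -> AclEntry:
    """Parse a string such as 'default:group:<object-id>:r-x'."""
    parts = text.strip().split(":")
    default = parts[0] == "default"
    if default:
        parts = parts[1:]
    if len(parts) != 3:
        raise ValueError(f"expected type:id:permissions, got {text!r}")
    entry_type, object_id, permissions = parts
    if entry_type not in VALID_TYPES:
        raise ValueError(f"unknown entry type {entry_type!r}")
    # Each position must be its own flag (r, w, x in that order) or "-".
    if len(permissions) != 3 or any(
        char not in (flag, "-") for char, flag in zip(permissions, "rwx")
    ):
        raise ValueError(f"invalid permissions {permissions!r}")
    return AclEntry(entry_type, object_id, permissions, default)
```

For example, `parse_acl_entry("group:00000000-0000-0000-0000-000000000000:r-x")` describes a group that can read and traverse, but not write to, the item the ACL is attached to.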
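This is the “fine-grained” approach to managing data access, because permissions are set on individual directories and files. To reach a file, an identity needs permission not only on the file itself but also on every folder from the root down to the directory that contains it; this traversal is what the “execute” permission authorises.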
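The traversal rule can be sketched in a few lines of Python. The function below is a hypothetical helper (not from any SDK) that lists the minimum permission an identity needs on each item in order to read one file: execute on every directory from the root down, and read on the file itself. The path used in the usage note is made up for illustration.

```python
from pathlib import PurePosixPath
from typing import List, Tuple


def permissions_needed_to_read(file_path: str) -> List[Tuple[str, str]]:
    """Return (path, minimum permissions) pairs, ordered from the root down."""
    path = PurePosixPath(file_path)
    if not path.is_absolute():
        raise ValueError("file_path must start at the root, e.g. '/raw/data.csv'")
    # path.parents runs from the containing directory up to the root; reverse it.
    directories = list(path.parents)[::-1]
    needed = [(str(directory), "--x") for directory in directories]
    needed.append((str(path), "r--"))
    return needed
```

Calling `permissions_needed_to_read("/raw/sales/data.csv")` yields `--x` on `/`, `/raw` and `/raw/sales`, and `r--` on the file. Miss the execute permission on any one of those directories and the read fails, however generous the permissions on the file itself.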
Access Control Lists in Azure Data Lake Storage Gen 2 are an important concept to understand, and it is always worth reading and re-reading the documentation on them. The documentation is very thorough and explains the concepts in detail, and because the ACL model is more complex than the RBAC model, it can be a lot more difficult to manage. Which brings us back to the beginning of this post: authorising access to data is hard and can go wrong very easily if not managed properly.
With that in mind, some time last year I published a PowerShell module that can be used to create folders and apply ACLs to them. The readme contains all the info you need to get started. The concept is that you want to keep the access control lists as shallow as possible, populate the default ACLs at the right level so that everything created underneath the directory inherits the right permissions, and add users to groups that already have an entry rather than giving each user an entry of their own.
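To show why the level at which default ACLs are populated matters, here is a simplified Python model of inheritance. It is not how the PowerShell module works, just a sketch under stated assumptions: the `Item` class and `create_child` function are hypothetical, ACLs are reduced to a dictionary of entry to permissions, and the mask and umask are ignored. A new directory copies its parent's default ACL into both its access ACL and its default ACL; a new file copies it into its access ACL only, because files have no default ACL.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Item:
    """A directory or file with its ACLs (simplified, hypothetical model)."""
    name: str
    is_directory: bool
    access_acl: Dict[str, str] = field(default_factory=dict)
    default_acl: Dict[str, str] = field(default_factory=dict)


def create_child(parent: Item, name: str, is_directory: bool) -> Item:
    """Create a child item that inherits from the parent's default ACL."""
    if not parent.is_directory:
        raise ValueError("only directories can contain child items")
    child = Item(name, is_directory)
    # The parent's default ACL becomes the child's access ACL.
    child.access_acl = dict(parent.default_acl)
    if is_directory:
        # Directories also carry the default ACL forward to their own children.
        child.default_acl = dict(parent.default_acl)
    return child


# Usage: one group entry set at the top flows down to everything created later.
raw = Item("raw", True,
           access_acl={"group:readers": "r-x"},
           default_acl={"group:readers": "r-x"})
sales = create_child(raw, "sales", is_directory=True)
data = create_child(sales, "data.csv", is_directory=False)
```

Note that the copy happens at creation time. Changing the default ACL on `raw` afterwards does not touch `sales` or `data.csv`, which is why it pays to get the default ACLs right before the directories fill up.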
Anyway, lots and lots of tests have been written, but they are hidden in a private Azure DevOps project because of the risk of exposing secrets. Hopefully something clever can be done with managed identities, but that will mean using self-hosted build agents.