Significance of User and Groups in HDFS

     User and Groups significance in HDFS

Before we even start lets take a look back on how user and groups are handled in Linux.
The key take away from the previous articles were

lets Look at the various relationship that exists
1. Every group has a group id.
2. Every user has a user id
3. In linux its not possible to have user without a group id.(by default when a user is created , it has a group with same name)
4. A user can have one primary group and multiple secondary groups.
5. A group can have multiple users.
6. Authentication is done based on username and password.
7. Authorization is done based on groups as unix follow POSIX permission  for   user : group : others
8.A user cannot exist without a group.
9. A group can exist without a user.
10. A file can only have usernames and groups which are part of the Linux OS (Local or Remote service)
11. A file ownership can never be changed to a non existent user ( Create a file and try chown XXXXXX fileName ).
12. Linux is applying authorization policy not only during reading the file but also while creating the file.
13. In linux system there can be no resource which is being handled by a random user which the OS is not aware of.
14. OS maintains(locally or LDAP) a table of user and groups, and will never allow a user outside of this mapping to create, delete or own a file. 

Lets try creating a file on hdfs.
1. Lets change our current user to hdfs on Locally by   sudo su hdfs.
2. hdfs is the superuser for HDFS filesystem , like root is the super user on Linux file System.
3.  Lets create a dir in hdfs.  hadoop dfs -mkdir /tmp/testDir
4.  Change the ownership of /tmp/kunal to random user and group by hadoop dfs -chown XXXX:YYYYY /tmp/kunal
5.  Doing a ls on  hdfs  “/tmp” by hadoop dfs -ls /tmp  | grep testDir which will display
     drwx-xr-x   –   XXXX YYYY    0       2018-02-20 11:00 /tmp/testDir

Key take aways.
1.  hdfs user is the super user in hdfs.
2. HDFS has no strict policy regarding user and groups like your linux OS.
3. You interact with HDFS through hdfs client, hdfs client takes the username of the user through which it was run on the linux OS.
4. HDFS always checks for permissions while reading a file, while creating or chown it does no check who is creating the files.
5. Your linux OS users in a way are related to the user on HDFS, as your hdfs clients pickup the Linux user through which it was run.
6. HDFS provides two kind of security mapping POSIX and ACLS: and its for ACLS that it requires user to group mapping to be made available to it.
7. In the HDFS file system user and group are not as tight coupled as Linux.
8. User Identity is never maintained with the HDFS, the user identity mechanism is extrinsic to HDFS itself. There is no provision within HDFS for creating user identities, establishing groups, or processing user credentials.



The Hadoop Distributed File System (HDFS) implements a permissions model for files and directories that shares much of the POSIX model. Each file and directory is associated with an owner and a group. The file or directory has separate permissions for the user that is the owner, for other users that are members of the group, and for all other users. For files, the r permission is required to read the file, and the w permission is required to write or append to the file. For directories, the r permission is required to list the contents of the directory, the w permission is required to create or delete files or directories, and the x permission is required to access a child of the directory.In contrast to the POSIX model, there are no setuid or setgid bits for files as there is no notion of executable files. For directories, there are no setuid or setgid bits directory as a simplification. The sticky bit can be set on directories, preventing anyone except the superuser, directory owner or file owner from deleting or moving the files within the directory. Setting the sticky bit for a file has no effect. Collectively, the permissions of a file or directory are its mode. In general, Unix customs for representing and displaying modes will be used, including the use of octal numbers in this description. When a file or directory is created, its owner is the user identity of the client process, and its group is the group of the parent directory (the BSD rule).HDFS also provides optional support for POSIX ACLs (Access Control Lists) to augment file permissions with finer-grained rules for specific named users or named groups. ACLs are discussed in greater detail later in this document.Each client process that accesses HDFS has a two-part identity composed of the user name, and groups list. Whenever HDFS must do a permissions check for a file or directory foo accessed by a client process,

  • If the user name matches the owner of foo, then the owner permissions are tested;
  • Else if the group of foo matches any of member of the groups list, then the group permissions are tested;
  • Otherwise the other permissions of foo are tested.

If a permissions check fails, the client operation fails.

User Identity

As of Hadoop 0.22, Hadoop supports two different modes of operation to determine the user’s identity, specified by the property:

  • simpleIn this mode of operation, the identity of a client process is determined by the host operating system. On Unix-like systems, the user name is the equivalent of `whoami`.
  • kerberosIn Kerberized operation, the identity of a client process is determined by its Kerberos credentials. For example, in a Kerberized environment, a user may use the kinit utility to obtain a Kerberos ticket-granting-ticket (TGT) and use klist to determine their current principal. When mapping a Kerberos principal to an HDFS username, all components except for the primary are dropped. For example, a principal todd/[email protected] will act as the simple username todd on HDFS.

Regardless of the mode of operation, the user identity mechanism is extrinsic to HDFS itself. There is no provision within HDFS for creating user identities, establishing groups, or processing user credentials.

Group Mapping

Once a username has been determined as described above, the list of groups is determined by a group mapping service, configured by the property. See Hadoop Groups Mapping for details.