By default, there is a gap between the security model of Kubernetes and Hadoop. Specifically, Hadoop uses Kerberos, a three-party protocol built on symmetric key cryptography to ensure any clients accessing the cluster are who they claim to be. In order to avoid frequent authentication checks against a Kerberos server, Delegation Tokens, a lightweight two-party authentication method, was introduced to complement Kerberos authentication. The Hadoop delegation token by default has a lifespan of one day and can be renewed for up to seven days. Kubernetes, on the other hand, uses a certificate-based approach for authentication, and does not expose the owner of a job in any of its public-facing APIs. Therefore, it is not possible to securely determine the authorized user from within the pod using the native Kubernetes API and then use that username to fetch the Hadoop delegation token for HDFS access.
To allow for Kubernetes workloads to securely access HDFS, we built Kube2Hadoop, a scalable and secure integration with HDFS Kerberos. This enables AI modelers at LinkedIn to use HDFS data in Kubernetes pods with access control through a user account or a headless account. Headless accounts are oftentimes used to denote a virtual team that is working on projects that would share the same data within the team. The data acquired can then be used in their model exploration and training with KubeFlow components such as the tf-operator and mpi-operator. In this blog, we will describe the design and authentication model of Kube2Hadoop.
Read on to see how it works and a link to the GitHub repo.