Press "Enter" to skip to content

Category: Hadoop

RBAC in Hadoop with Kudu and Ranger

Attila Bukor takes us through the process of setting up role-based access controls on Impala tables:

After setting up the integration it’s time to create some policies, as now only trusted users are allowed to perform any action; everyone else is locked out. Resource-based access control (RBAC) policies can be set up for Kudu in Ranger, but Kudu currently doesn’t support tag-based policies, row-level filtering or column masking.

Click through for the process, as well as current limitations.

Comments closed

Join Algorithm Selection in Spark

The Hadoop in Real World team takes us through the selection criteria for join types:

There are several factors Spark takes into account before deciding on the type of join algorithm to use to join datasets at runtime.

Spark has the following 5 algorithms to choose from –

1. Broadcast Hash Join
2. Shuffle Hash Join
3. Shuffle Sort Merge Join
4. Broadcast Nested Loop Join
5. Cartesian Product Join (a.k.a Shuffle-and-Replicate Nested Loop Join)

Read on to learn which join types are supported in which circumstances, as well as rules of precedence.

Comments closed

Synchronizing Metadata between Spark Tables and Serverless Pool

Charl Roux takes us through one back-end integration mechanism between tables in Azure Synapse Analytics Spark pools and serverless SQL pool:

Synapse provides an exciting feature which allows you to sync Spark database objects to Serverless pools and to query these objects without the Spark pool being active or running.  Synapse workspaces are accessed exclusively through an Azure AD Account and objects are created within this context in the Spark pool. In some scenarios I would like to share the data which I’ve created in my Spark database with other users for reporting or analysis purposes. This is possible with Serverless and in this article I will show you how to complete the required steps from creation of the object to successful execution. 

Click through for the demonstration.

Comments closed

Clustering with Apache NiFi

Brent Segner and Ryan O’Donnell explain how clustering works with Apache NiFi:

Although it is entirely possible to deploy NiFi in a single node configuration, this does not represent a best practice for an enterprise graded deployment and would introduce unnecessary risk into a production environment where scaling to meet demand and resiliency are paramount.  In order to get around this concern, as of release 1.0.0, NiFi provides the ability to cluster nodes together using either an embedded or external Apache Zookeeper instance as a highly reliable distributed coordinator. While a simple Google search shows there is plenty of debate around whether it is better to use an embedded or external Zookeeper service as both sides have merit, for the sake of argument and this blog, we will use the embedded flavor in the deployment. 

Click through for more information, including a walkthrough on configuration.

Comments closed

Time-Saving Tips for Databricks

Robert Blackburn has a few tips for us:

Adding bigger or more nodes to your cluster increases costs. There are also diminishing returns. You do not need 64 cores if you are only using 10. But you still need a minimum that matches your processing requirements. If your utilization looks like this, you must increase the size of your cluster.

Click through for several good tips.

Comments closed

Secure Cluster Connectivity in Azure Databricks

Abhinav Garg and Premal Shah have an announcement:

We’re excited to announce the general availability of Secure Cluster Connectivity (also commonly known as No Public IP) on Azure Databricks. This release applies to Microsoft Azure Public Cloud and Azure Government regions, in both Standard and Premium pricing tiers. Hundreds of our global customers including large financial services, healthcare and retail organizations have already adopted the capability to enable secure and reliable deployments of the Azure Databricks unified data platform. It allows them to securely process company and customer data in private Azure Virtual Networks, thus satisfying a major requirement of their enterprise governance policies.

Read on fore more detail about how this works.

Comments closed

Connecting Confluent and Databricks on Azure

Angela Chu, et al, take us through a streaming data ingestion process:

How do you process IoT data, change data capture (CDC) data, or streaming data from sensors, applications, and sources in real time? Apache Kafka® and Azure Databricks are widely adopted technologies in the industry, but they require specific skills and expertise to run. Leveraging Confluent Cloud and Azure Databricks as fully managed services in Microsoft Azure, you can implement new real-time data pipelines with less effort and without the need to upgrade your datacenter (or set up a new one).

This blog post demonstrates how to configure Azure Databricks to interact with Confluent Cloud so that you can ingest, process, store, make real-time predictions and gain business insights from your data.

Click through for a detailed demonstration.

Comments closed

Using Spark Pools in Azure Synapse Analytics

Rahul Mehta shows how to create and use an Apache Spark pool in Azure Synapse Analytics:

In the last part of the Azure Synapse Analytics article series, we learned how to create a dedicated SQL pool. Azure Synapse support three different types of pools – on-demand SQL pool, dedicated SQL pool and Spark pool. Spark provides an in-memory distributed processing framework for big data analytics, which suits many big data analytics use-cases. Azure Synapse Analytics provides mechanisms to use SQL on-demand pool to query data as a service, SQL dedicated pool for data warehousing using distributed data processing engine, and Spark pool for analytics using in-memory big data processing engine. This article shows how to create a Spark pool in Azure Synapse Analytics and further how to process the data using it.

Click through for a demo on setup and a sample notebook to get started.

Comments closed

A Review of AWS Athena

John McCormack updates an older review:

AWS’s own documentation is the best place for full details on the Athena offering, this post hopes to serve as further explanation and also act as an anchor to some more detailed information. As it is a managed service, Athena requires no administration, maintenance or patching. It’s not designed for regular querying of tables in a way that you would with an RDBMS. Performance is geared around querying large data sets which may include structured data or semi-structured data. There are no licensing costs like you may have with some Relational Database Management Systems (RDBMS) such as SQL Server and costs are kept low, as you only pay when you run queries in AWS Athena.

Click through for an overview of product benefits.

Comments closed