Press "Enter" to skip to content

Day: February 17, 2021

Join Algorithm Selection in Spark

The Hadoop in Real World team takes us through the selection criteria for join types:

There are several factors Spark takes into account before deciding on the type of join algorithm to use to join datasets at runtime.

Spark has the following 5 algorithms to choose from –

1. Broadcast Hash Join
2. Shuffle Hash Join
3. Shuffle Sort Merge Join
4. Broadcast Nested Loop Join
5. Cartesian Product Join (a.k.a Shuffle-and-Replicate Nested Loop Join)

Read on to learn which join types are supported in which circumstances, as well as rules of precedence.

Comments closed

RBAC in Hadoop with Kudu and Ranger

Attila Bukor takes us through the process of setting up role-based access controls on Impala tables:

After setting up the integration it’s time to create some policies, as now only trusted users are allowed to perform any action; everyone else is locked out. Resource-based access control (RBAC) policies can be set up for Kudu in Ranger, but Kudu currently doesn’t support tag-based policies, row-level filtering or column masking.

Click through for the process, as well as current limitations.

Comments closed

Data Quality Monitoring with SQL

Ryan Kearns and Barr Moses walk us through key principles for monitoring data quality in a relational database:

Next, we want to assess the field-level, distributional health of our data. Distribution tells us all of the expected values of our data, as well as how frequently each value occurs. One of the simplest questions is, “how often is my data NULL”? In many cases, some level of incomplete data is acceptable — but if a 10% null rate turns into 90%, we’ll want to know.

This covers a couple examples around data freshness and completeness, and I appreciate the level of detail in here. Nothing is earth-shattering, but at the same time, it’s important to have a catalog of the sorts of issues which can pop up. H/T Mark Hutchinson.

Comments closed

Updates to SQL Server Big Data Clusters

Rahul Ajmera fills us in on what they’ve been doing with SQL Server Big Data Clusters:

Today, we’re announcing the release of the latest cumulative update (CU9) for SQL Server Big Data Clusters, which includes important capabilities:

– Support to configure BDC post deployment.
– Improved experience for encryption at rest.
– Ability to install Python packages at Spark job submission time.
– Upgraded software versions for most of our OSS components (Grafana, Kibana, FluentBit, etc.) to ensure Big Data Clusters images are up to date with the latest enhancements and fixes.
– Miscellaneous improvements and bug fixes.

This announcement highlights some of the major improvements, provides additional context to better understand the design behind these capabilities, and points you to relevant resources to learn more and get started.

Click through for more detail on a few of the items.

Comments closed

SPN Registration and dbatools

Jess Pomfret takes us through some SPN pains:

But instead of getting a quick answer to my question, I just got the following error:

WARNING: [15:19:49][Get-DbaDatabase] Error occurred while establishing connection to dscsvr1 | The target principal name is incorrect. Cannot generate SSPI context.

Just reading the article brought back some bad troubleshooting memories for me… But as usual, I’m impressed that dbatools has a cmdlet or two to help with that troubleshooting.

Comments closed

Memory-Optimized Table Variables and tempdb Contention

Erik Darling notes that memory-optimized table variables can be useful in specific circumstances:

First, yes, they do help relieve tempdb contention if you have code that executes under both high concurrency and frequency. And by high, I mean REALLY HIGH.

Like, Snoop Dogg high.

Because you can’t get rid of in memory stuff, I’m creating a separate database to test in.

Been there. When tempdb object creation causes massive contention, this certainly alleviates the stress.

As Erik notes, there are some tradeoffs to this, meaning that you have a real decision to make rather than simply using memory-optimized user-defined table types as a starting point.

Comments closed

Azure Data Factory and JSON Array Hand-Offs

Rayis Imayev wants to pass a JSON array from one Azure Data Factory pipeline to another:

This next post came out of an error message during my attempt to pass a hard-coded array value between pipelines. Strangely, this use-case worked well in the pipeline that was already deployed in ADF, however, I was getting an error message while trying to test and execute this very same pipeline in a Debug mode.

Click through for the explanation.

Comments closed

Renaming All Column Names on All Tables in One Power Query Statement

Soheil Bakhshi has achieved mass production:

previously wrote a blog post explaining how to rename all columns in a table in one go with Power Query. One of my visitors raised a question in the comments about the possibility to rename all columns from all tables in one go. Interestingly enough, one of my customers had a similar requirement. So I thought it is good to write a Quick Tip explaining how to meet the requirement.

Click through to see how to build an expression which iterates over all columns in all tables.

Comments closed