Press "Enter" to skip to content

Author: Kevin Feasel

Impala Improvements in CDH 5.15.0

Michael Ho, et al, share some improvements in Apache Impala’s scalability in the Cloudera Distribution of Hadoop:

Kudu RPC (KRPC) supports asynchronous RPCs. This removes the need to have a single thread per connection. Connections between hosts are long-lived. All RPCs between two hosts multiplex on the same established connection. This drastically cuts down the number of TCP connections between hosts and decouples the number of connections from the number of query fragments.

The error handling semantics are much cleaner and the RPC library transparently re-establishes broken connections. Support for SASL and TLS are built-in. KRPC uses protocol buffers for payload serialization. In addition to structured data, KRPC also supports attaching binary data payloads to RPCs, which removes the cost of data serialization and is used for large data objects like Impala’s intermediate row batches. There is also support for RPC cancellation which comes in handy when a query is cancelled because it allows query teardown to happen sooner.

Looks like there were some pretty nice gains out of this project.

Comments closed

Azure Data Factory Data Flows

Joost van Rossum takes a look at data flows in Azure Data Factory:

2) Create Databricks Service
Yes you are reading this correctly. Under the hood Data Factory is using Databricks to execute the Data flows, but don’t worry you don’t have to write code.
Create a Databricks Service and choose the right region. This should be the same as your storage region to prevent high data movement costs. As Pricing Tier you can use Standard for this introduction. Creating the service it self doesn’t cost anything.

Joost shows the work you have to do to build out a data flow. This has been a big hole in ADF—yeah, ADF seems more like an ELT tool than an ETL tool but even within that space, there are times when you need to do a bit more than pump-and-dump.

Comments closed

Changing Red Hat’s SSH Port On An Azure VM

Paul Randal has a post showing you how to change the default SSH port on a Red Hat Enterprise Linux VM hosted in Azure:

The steps that need to be performed are:
– Allow the new port in the RHEL firewall
– Change the SSH daemon to listen on the new port
– Add an incoming rule in the VM network security group for the new port
– Remove the rule that allows port 22

The Ubuntu process will be pretty close to this as well.

Comments closed

Stream Analytics And Power BI

Brad Llewellyn gives us a demo on connecting a Stream Analytics stream to Power BI for data analysis:

We understand that streaming data isn’t typically considered “Data Science” by itself.  However, it’s are often associated and setting up this background now opens up some cool applications in later posts.  For this post, we’ll cover how to sink streaming data to Power BI using Stream Analytics.

The previous posts in this series used Power BI Desktop for all of the showcases.  This post will be slightly different in that we will leverage the Power BI Service instead.  The Power BI Service is a collaborative web interface that has most of the same reporting capabilities as Power BI Desktop, but lacks the ability to model data at the time of writing.  However, we have heard whispers that data modeling capabilities may be coming to the service at some point.  The Power BI Service is also the standard method for sharing datasets, reports and dashboards across organizations.  For more information on the Power BI Service, read this.

Brad has a nice demo, so check it out.

Comments closed

Using SWITCHOFFSET

Doug Kline has a video and T-SQL script around date/time offsets and particularly the SWITCHOFFSET function:

— so, before SWITCHOFFSET existed, …

SELECT SWITCHOFFSET(SYSDATETIMEOFFSET(),'-05:00') AS [EST the easy way], TODATETIMEOFFSET(DATEADD(HOUR, -5, SYSDATETIMEOFFSET()), '-05:00') AS [EST the hard way]

— so, thinking of a DATETIMEOFFSET data type as a complex object

— with many different parts: year, month, day, hour, time zone, etc.

— it looks like SWITCHOFFSET changes two things: time zone and hour

This was an interesting video. I typically think entirely in UTC and let the calling application convert to time zones as needed, but if that’s not an option for you, knowing about SWITCHOFFSET() is valuable.

Comments closed

Azure Kubernetes LoadBalancer External IP Woes

Andrew Pruski writes up some issues he had with creating a LoadBalancer service in Azure Kubernetes:

I logged a case with MS Support and when they came back to me, they advised that the service principal that is spun up in the background had expired. This service principal is required to allow the cluster to interact with the Azure APIs in order to create other Azure resources.

When a service is created within AKS with a type of LoadBalancer, a Load Balancer is created in the background which provides the external IP I was waiting on to allow me to connect to the cluster.

Because this principal had expired, the cluster was unable to create the Load Balancer and the external IP of the service remained in the pending state.

There were a lot of steps here; click through to see just how many.

Comments closed

Password Protect Everything, Including Hadoop

George Leopold summarizes a recent Securonix report:

The malware spreads via brute-force attacks on weak passwords “or by exploiting one of three vulnerabilities found on Hadoop YARN Resource Manager, Redis [in-memory key-value store service] and ActiveMQ,” Securonix said. Once logged into database services, the malware can for example delete existing databases stored on a server and create another with a ransom note specifying a bitcoin payment.

The security analyst recommends continuous review of cloud-based services like Hadoop and YARN instances and their exposure to the Internet. Along with strong passwords, companies should “restrict access whenever possible to reduce the potential attack surface.”

It’s pretty standard advice: secure your data, password-protect your systems, and minimize the number of computers that get to touch your computers.

Comments closed

Preparing Text Data For Natural Language Processing

Shirin Glander takes us through the process of preparing natural language data for machine learning using Keras:

As with any neural network, we need to convert our data into a numeric format; in Keras and TensorFlow we work with tensors. The IMDB example data from the keras package has been preprocessed to a list of integers, where every integer corresponds to a word arranged by descending word frequency.

So, how do we make it from raw text to such a list of integers? Luckily, Keras offers a few convenience functions that make our lives much easier.

This is a very nice tutorial if you’re new to the process.

Comments closed

Daylight Savings Time Calculations In Power BI

Fred Kaffenberger shows us how to convert UTC to local time zones with daylight savings time:

Quick tip for DST Refresh Date function Power BI Service. I’ll put the code up front, and explain it below. I’ll also say a bit about how to use it at the end. The United States and other places, like Australia, have a pesky thing called Daylight Savings Time. This means that in Central Time US, the offset from Universal Time Coordinated (UTC) is sometimes -6 and other times it’s -5. While Power Query can convert time zones, it doesn’t handle DST. And, my users like to see when the reports were refreshed as a step in evaluating data quality. In 2019, US DST is from March 10 – November 3 (2 AM local time). So, the functions here need to be updated every year.

As promised, here’s the custom function. 

Click through for the custom function and a nice explanation of how it works.

Comments closed

Copy Measures Between Power BI Files

Matt Allington shows us how you can copy measures between PBIX files:

Warning!
Ok, here is the warning.  This is not supported by Microsoft.  If you do this and it breaks your model, you will not get support from Microsoft (or me for that matter).  So, back everything up and keep the backups – don’t delete them.  Consider yourself suitably warned 🙂 .

This warning just makes me more likely to do it…

Comments closed