
Curated SQL Posts

Vectorized R I/O in Apache Spark 3.0

Hyukjin Kwon gives us a preview of SparkR improvements in Apache Spark 3.0:

When SparkR does not require interaction with the R process, the performance is virtually identical to other language APIs such as Scala, Java and Python. However, significant performance degradation happens when SparkR jobs interact with native R functions or data types.

Databricks Runtime introduced vectorization in SparkR to improve the performance of data I/O between Spark and R. We are excited to announce that using the R APIs from Apache Arrow 0.15.1, the vectorization is now available in the upcoming Apache Spark 3.0 with substantial performance improvements.

This blog post outlines Spark and R interaction inside SparkR, the current native implementation and the vectorized implementation in SparkR with benchmark results.

Certain operations get ridiculously faster with this change.


Troubleshooting Kafka Remote Connections

Robin Moffatt explains common errors people run into when trying to connect to remote Kafka clusters:

In this example, my client is running on my laptop, connecting to Kafka running on another machine on my LAN called asgard03:

The initial connection succeeds. But note that the BrokerMetadata we get back shows that there is one broker, with a hostname of localhost. That means that our client is going to be using localhost to try to connect to a broker when producing and consuming messages. That’s bad news, because on our client machine, there is no Kafka broker at localhost (or if there happened to be, some really weird things would probably happen).

As usual, things boil down to “Configure it correctly and it works.”
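
To make the failure mode concrete, here is a minimal Python sketch of the check a client could run on the broker metadata it gets back. It is not taken from Robin's post and uses no Kafka client library: the function names and the list of (host, port) pairs are assumptions for illustration, standing in for whatever metadata your client actually returns.

```python
import ipaddress

# Names that always resolve to the local machine (illustrative, not exhaustive).
LOOPBACK_NAMES = {"localhost", "localhost.localdomain"}


def is_loopback(host: str) -> bool:
    """Return True if host is a loopback name or a loopback IP address."""
    if host.lower() in LOOPBACK_NAMES:
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, so treat it as an ordinary hostname (e.g. asgard03).
        return False


def suspicious_brokers(bootstrap_host, advertised):
    """Return advertised (host, port) pairs that point at loopback
    even though the client bootstrapped against a remote machine."""
    if is_loopback(bootstrap_host):
        # Client and broker share a machine, so localhost is legitimate.
        return []
    return [(host, port) for host, port in advertised if is_loopback(host)]


# Hypothetical metadata mirroring the scenario in the excerpt:
# the client bootstraps against asgard03, but the broker advertises localhost.
print(suspicious_brokers("asgard03", [("localhost", 9092)]))
```

An empty list means the advertised addresses at least point somewhere other than the client itself; anything in the list is a hint that the broker's listener configuration needs attention.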


Kafka + Kotlin

Unni Mana shows how to create a Kafka consumer and producer in the Kotlin language:

We are using KafkaTemplate to send the message to a topic called test_topic. This will return a ListenableFuture object from which we can get the result of this action. This approach is the easiest one if you just want to send a message to a topic.

Generally, when we talk about the Hadoop ecosystem and functional programming languages on the Java Virtual Machine, we think Scala. But this is an example showing that Kotlin is in that discussion too.


Azure Active Directory and the DatabricksPS Library

Gerhard Brueckl has updated the DatabricksPS library:

Databricks recently announced that it is now also supporting Azure Active Directory Authentication for the REST API which is now in public preview. This may not sound super exciting but is actually a very important feature when it comes to Continuous Integration/Continuous Delivery pipelines in Azure DevOps or any other CI/CD tool. Previously, whenever you wanted to deploy content to a new Databricks workspace, you first needed to manually create a user-bound API access token. As you can imagine, manual steps are also bad for otherwise automated processes like a CI/CD pipeline. With Databricks REST API finally supporting Azure Active Directory Authentication of regular users and service principals, this last manual step is finally also gone!

If you do use Databricks and haven’t tried out DatabricksPS, I highly recommend it. I think it’s a much nicer experience than hitting the REST API directly, particularly because it deals with continuation tokens and making multiple calls to get your results.


Returning Multiple Values in Power BI with ConcatenateX

Nick Edwards shows how you can use the ConcatenateX DAX function to combine values:

In this blog post we’ll take a quick look at using ConcatenateX function to view a concatenated string of dates where the max daily sales occurred for a given month.

I came across this function whilst going through the excellent “Mastering DAX 2nd Edition Video Course” by the guys from SQLBI.com. So credit to Marco and Alberto for sharing this.

So how does it work? If we had a list of dates ranging from 01/01/2020 to 31/12/2020 and we wanted to see which days we achieved maximum sales for each given month in a year we could use the ConcatenateX function to return these dates in a single row per month.
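
The logic described above can be sketched outside of DAX. The following Python is an illustration of the idea only, not Nick's measure: the function name, the separator, and the sample sales figures are all made up for the example.

```python
from collections import defaultdict
from datetime import date


def max_sales_days_by_month(daily_sales, sep=", "):
    """For each (year, month), join the dates on which sales hit that month's maximum."""
    by_month = defaultdict(list)
    for day, amount in daily_sales.items():
        by_month[(day.year, day.month)].append((day, amount))

    result = {}
    for month, rows in sorted(by_month.items()):
        top = max(amount for _, amount in rows)
        # Keep every day that ties for the maximum, in date order.
        days = sorted(day for day, amount in rows if amount == top)
        result[month] = sep.join(d.strftime("%d/%m/%Y") for d in days)
    return result


# Hypothetical data: two days tie for the January maximum.
sales = {
    date(2020, 1, 5): 100,
    date(2020, 1, 17): 250,
    date(2020, 1, 23): 250,
    date(2020, 2, 2): 90,
    date(2020, 2, 14): 300,
}

for month, days in max_sales_days_by_month(sales).items():
    print(month, days)
```

The interesting case is the tie in January: a plain "date of max sales" lookup would have to pick one date, whereas concatenation returns both in a single row.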

Click through for the demo.


Quick PowerShell Tips

Shane O’Neill has a few PowerShell tips for you:

If you spend a lot of time in a PowerShell console, it’s not rash to presume that you’re going to be running some of the same commands over and over again.

That’s where PowerShell’s history comes into play.

By using the command Get-History or even its alias h, you can see the commands that you’ve run before:

Click through to see how it works, as well as a few other tips.


Diagram Visualization with Graphviz

Mikey Bronowski walks through an introduction to the Graphviz diagramming language:

I came across Graphviz which is an open-source graph visualization software initiated by AT&T Labs Research. It can process the graphs that are written in the DOT language.

What is the DOT language?

In short, it is a graph description language that has few keywords like graph, digraph, node, edge. You cannot miss it has something to do with graphs.
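
As a rough illustration of how little syntax is involved, here is a small Python sketch that emits DOT text for a directed graph using the digraph, node, and edge (->) constructs mentioned above. The helper name and the pipeline stages are hypothetical, not from Mikey's post; the output is plain text that could be handed to Graphviz for rendering.

```python
def to_dot(name, edges):
    """Build DOT source for a directed graph from (source, target) pairs."""
    lines = [f"digraph {name} {{"]
    # A default node attribute: draw every node as a box.
    lines.append("    node [shape=box];")
    for src, dst in edges:
        # Quote node IDs so names containing spaces remain valid.
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical three-stage data flow.
print(to_dot("pipeline", [("source db", "staging"), ("staging", "warehouse")]))
```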

I’ve used the R implementation of this as well. It doesn’t create beautiful diagrams, but it is fast, easy, and the output makes sense.


Alternatives to Circling Elements on a Page

Cole Nussbaumer Knaflic has some alternatives to circling an item you want people to notice:

You’ve seen it before: a circle on a slide or graph that is meant to highlight something of note. People tend to be surprised when I express admiration towards this approach. I love that it means someone took the time to consider the data and the viewer and thought, “I’d like people to look here” or “I want to make sure my audience doesn’t miss this.” Then they took an action—adding the circle—to help ensure it.

That said, the circle is a blunt tool. It’s better than nothing: if you are facing such a time constraint that you don’t have a minute to spare for anything beyond quickly adding a circle, do it. If you do have more than a minute, however, there are other eloquent solutions you can employ. This will typically involve making changes to how you design the way the data or supporting elements are formatted.

Cole then lists out several alternatives. When I circle (or wrap with a rectangle), it’s usually one of two scenarios: either I’ve just grabbed a screenshot (or have frozen the screen in ZoomIt) and that’s my primary tool available, or I’m working with a pre-generated image and can’t change it. But when you have a chance to alter the base graph or image, Cole has several excellent techniques to make certain items stand out in contrast to others.


Installing TensorFlow and Keras for R on SQL Server 2019 ML Services

I have a post on using TensorFlow and Keras in R on SQL Server 2019 Machine Learning Services:

What I’m doing is building a new virtual environment named r-reticulate, which is what the reticulate package in R desires. Inside that virtual environment, I’m installing the latest versions of tensorflow-probability, tensorflow, and keras. I had DLL loading problems with TensorFlow 2.1 on Windows, so if you run into those, the proper solution is to ensure that you have the appropriate Visual C++ redistributables installed on your server.

Then, I switched back to the base virtual environment and installed the same packages. My thinking here is that I’ll probably need them for other stuff as well (and don’t tell anybody, but I’m not very good with Python environments).

Please continue not to tell anybody that I’m not very good with Python environments. I tend to dump things in the base environment, forget which one I’m in, and all kinds of other bad practices. I think I’m secretly undermining myself in Python, but I don’t have enough proof yet.
