Day: June 10, 2020

Vectorized R I/O in Apache Spark 3.0

Hyukjin Kwon gives us a preview of SparkR improvements in Apache Spark 3.0:

When SparkR does not require interaction with the R process, the performance is virtually identical to other language APIs such as Scala, Java and Python. However, significant performance degradation happens when SparkR jobs interact with native R functions or data types.

Databricks Runtime introduced vectorization in SparkR to improve the performance of data I/O between Spark and R. We are excited to announce that using the R APIs from Apache Arrow 0.15.1, the vectorization is now available in the upcoming Apache Spark 3.0 with the substantial performance improvements.

This blog post outlines Spark and R interaction inside SparkR, the current native implementation and the vectorized implementation in SparkR with benchmark results.

Certain operations get ridiculously faster with this change.

Troubleshooting Kafka Remote Connections

Robin Moffatt explains common errors people run into when trying to connect to remote Kafka clusters:

In this example, my client is running on my laptop, connecting to Kafka running on another machine on my LAN called asgard03:

The initial connection succeeds. But note that the BrokerMetadata we get back shows that there is one broker, with a hostname of localhost. That means that our client is going to be using localhost to try to connect to a broker when producing and consuming messages. That’s bad news, because on our client machine, there is no Kafka broker at localhost (or if there happened to be, some really weird things would probably happen).
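To make the failure mode concrete, here is a minimal Python sketch of the check a remote client could run against the broker metadata it gets back. It is a toy model rather than Kafka client code: the function names, the tuple layout of the metadata, and port 9092 are assumptions made for illustration.

```python
import ipaddress


def is_loopback(host: str) -> bool:
    """Return True if host is 'localhost' or a literal loopback IP (no DNS lookup)."""
    if host.lower() == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, so treat it as an ordinary hostname.
        return False


def unreachable_advertised_brokers(bootstrap_host, advertised_brokers):
    """List brokers that advertise a loopback address to a remote client."""
    if is_loopback(bootstrap_host):
        # Client and broker share a machine, so loopback addresses are fine.
        return []
    return [(host, port) for host, port in advertised_brokers if is_loopback(host)]


# Hypothetical metadata mirroring the excerpt: the client bootstraps via
# asgard03, but the single broker advertises itself as localhost.
metadata = [("localhost", 9092)]
print(unreachable_advertised_brokers("asgard03", metadata))
```

A non-empty result means the client would go on to dial its own machine for produce and consume requests, which is the situation the excerpt describes.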

As usual, things boil down to “Configure it correctly and it works.”

Kafka + Kotlin

Unni Mana shows how to create a Kafka consumer and producer in the Kotlin language:

We are using KafkaTemplate to send the message to a topic called test_topic. This will return a ListenableFuture object from which we can get the result of this action. This approach is the easiest one if you just want to send a message to a topic.

Generally, when we talk about the Hadoop ecosystem and functional programming languages on the Java Virtual Machine, we think Scala. But this is an example showing that Kotlin is in that discussion too.

Azure Active Directory and the DatabricksPS Library

Gerhard Brueckl has updated the DatabricksPS library:

Databricks recently announced that it is now also supporting Azure Active Directory Authentication for the REST API which is now in public preview. This may not sound super exciting but is actually a very important feature when it comes to Continuous Integration/Continuous Delivery pipelines in Azure DevOps or any other CI/CD tool. Previously, whenever you wanted to deploy content to a new Databricks workspace, you first needed to manually create a user-bound API access token. As you can imagine, manual steps are also bad for otherwise automated processes like a CI/CD pipeline. With Databricks REST API finally supporting Azure Active Directory Authentication of regular users and service principals, this last manual step is finally also gone!

If you do use Databricks and haven’t tried out DatabricksPS, I highly recommend it. I think it’s a much nicer experience than hitting the REST API directly, particularly because it deals with continuation tokens and making multiple calls to get your results.

Returning Multiple Values in Power BI with ConcatenateX

Nick Edwards shows how you can use the ConcatenateX DAX function to combine values:

In this blog post we’ll take a quick look at using ConcatenateX function to view a concatenated string of dates where the max daily sales occurred for a given month.

I came across this function whilst going through the excellent “Mastering DAX 2nd Edition Video Course” by the guys from SQLBI.com. So credit to Marco and Alberto for sharing this.

So how does it work? If we had a list of dates ranging from 01/01/2020 to 31/12/2020 and we wanted to see which days we achieved maximum sales for each given month in a year we could use the ConcatenateX function to return these dates in a single row per month.
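The idea can be sketched outside of DAX. The Python below is only an illustration of the logic (group by month, find the monthly maximum, join the matching dates into one string), not the ConcatenateX function itself; the function name and the sales figures are made up for the example.

```python
from collections import defaultdict
from datetime import date


def max_sales_dates_by_month(daily_sales):
    """Map (year, month) to a comma-separated string of the peak-sales dates."""
    by_month = defaultdict(list)
    for day, amount in daily_sales.items():
        by_month[(day.year, day.month)].append((day, amount))

    result = {}
    for month, rows in sorted(by_month.items()):
        peak = max(amount for _, amount in rows)
        # Several days can tie for the monthly maximum, so keep all of them.
        best_days = sorted(day for day, amount in rows if amount == peak)
        result[month] = ", ".join(d.strftime("%d/%m/%Y") for d in best_days)
    return result


# Hypothetical daily sales; January has two days tied for the maximum.
sales = {
    date(2020, 1, 5): 100,
    date(2020, 1, 17): 250,
    date(2020, 1, 23): 250,
    date(2020, 2, 2): 90,
    date(2020, 2, 14): 300,
}
for month, dates in max_sales_dates_by_month(sales).items():
    print(month, dates)
```

The tie in January is the case that motivates concatenation: a single "date of max sales" value would hide one of the two days.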

Click through for the demo.

Quick PowerShell Tips

Shane O’Neill has a few PowerShell tips for you:

If you spend a lot of time in a PowerShell console, it’s not rash to presume that you’re going to be running some of the same commands over and over again.

That’s where PowerShell’s history comes into play.

By using the command Get-History or even its alias h, you can see the commands that you’ve run before:

Click through to see how it works, as well as a few other tips.

Diagram Visualization with Graphviz

Mikey Bronowski walks through an introduction to the Graphviz diagramming language:

I came across Graphviz which is an open-source graph visualization software initiated by AT&T Labs Research. It can process the graphs that are written in the DOT language.

What is the DOT language?

In short, it is a graph description language that has few keywords like graph, digraph, node, edge. You cannot miss it has something to do with graphs.
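As a rough illustration of how those keywords fit together, the following Python sketch builds a small DOT document as plain text. The helper name, the graph name, and the styling attributes are assumptions chosen for the example, and the sketch assumes node names are simple alphanumeric identifiers that need no quoting.

```python
def to_dot(name, edges, directed=True):
    """Build a DOT description from a list of (source, target) pairs."""
    keyword = "digraph" if directed else "graph"
    arrow = "->" if directed else "--"
    lines = [f"{keyword} {name} {{"]
    # Default attributes applied to every node and edge in the graph.
    lines.append("    node [shape=box];")
    lines.append("    edge [color=gray];")
    for src, dst in edges:
        lines.append(f"    {src} {arrow} {dst};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical three-step pipeline.
print(to_dot("pipeline", [("extract", "transform"), ("transform", "load")]))
```

Saving the printed text to a file and running it through Graphviz's dot command (for example with the -Tpng option) should produce an image of the graph.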

I’ve used the R implementation of this as well. It doesn’t create beautiful diagrams, but it is fast, easy, and the output makes sense.
