Paired RDDs in Spark

Ramandeep Kaur explains how Paired Resilient Distributed Datasets (PairRDDs) differ from regular RDDs:

So, assuming that you have a fair idea about what Spark is and the basics of RDDs: a paired RDD is one kind of RDD, one whose elements are key/value pairs of data. Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network. For example, pair RDDs have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs together by grouping elements with the same key.

When datasets are described in terms of key/value pairs, it is common to want to aggregate statistics across all elements with the same key.

Paired RDDs bring us back to that key-value pair paradigm which Hadoop’s version of MapReduce brought to the forefront.
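
If you're coming at this from the .NET side (as much of this roundup does), note that the dotnet driver doesn't expose the RDD API, so reduceByKey() and join() aren't directly available there; still, the same per-key ideas map onto the DataFrame API. A rough C# sketch, with the file paths and column names made up for illustration:

using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// Sketch only: spark-dotnet surfaces DataFrames rather than pair RDDs, so this shows
// the equivalent per-key aggregation and key-based join with DataFrames instead.
var spark = SparkSession.Builder().AppName("per-key-sketch").GetOrCreate();

// Hypothetical inputs: two CSVs keyed on a "key" column.
var sales = spark.Read().Option("header", true).Option("inferSchema", true).Csv(@"C:\data\sales.csv");
var regions = spark.Read().Option("header", true).Csv(@"C:\data\regions.csv");

// reduceByKey-style aggregation: one summed row per key.
var totals = sales.GroupBy("key").Agg(Sum(Col("value")).Alias("total"));

// join()-style merge: rows from both sides matched on the same key.
var joined = totals.Join(regions, "key");
joined.Show();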

Spark for .NET Developers

Ed Elliott has a long-form post covering spark-dotnet:

The .NET driver is made up of two parts. The first part is a Java JAR file which is loaded by Spark and then runs the .NET application. The second part runs in the .NET process and acts as a proxy between the .NET code and the Java classes (from the JAR file), which translate the requests into Java calls in the Java VM that hosts Spark.

The .NET driver is added to a .NET program using NuGet and ships both the .NET library and two Java JARs. One JAR is for Spark 2.3 and one for Spark 2.4, and you do need to use the correct one for your installed version of Spark.

As much as I’ve enjoyed his series, getting it in a single-post format is great.
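
To make those moving parts a little more concrete, here's roughly what the .NET side of a minimal spark-dotnet program looks like. This is only a sketch assuming the Microsoft.Spark NuGet package; the app name is made up:

using Microsoft.Spark.Sql;

class Program
{
    static void Main(string[] args)
    {
        // The Microsoft.Spark NuGet package supplies this API; at runtime the Java JAR
        // described above is loaded by Spark, launches this program, and proxies its
        // calls into the Java VM that hosts Spark.
        var spark = SparkSession.Builder().AppName("hello-spark-dotnet").GetOrCreate();

        // From here on, every DataFrame operation is forwarded to the JVM side.
        var df = spark.Range(0, 5);
        df.Show();
    }
}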

Parsing Rows Manually with Spark .NET

Ed Elliott shows how we can solve a challenging problem when newlines are in the wrong place:

So the first thing we need to do is to read in the whole file in one chunk; if we just do a standard read, the file will get broken into rows based on the newline character:

var file = spark.Read().Option("wholeFile", true).Text(@"C:\git\files\newline-as-data.txt");

This solution is a bit complex. As Ed points out, you’re better off reshaping the file before you try to process it. If it’s a structured file like the example Ed has, a regular expression can do the trick.
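
As a rough illustration of that reshaping idea, here's a sketch that pre-processes the file with a regular expression before handing it to Spark. The record layout is hypothetical (it assumes every genuine row starts with a numeric ID and a comma), as is the fixed-up file name; the real layout is in Ed's post:

using System.IO;
using System.Text.RegularExpressions;
using Microsoft.Spark.Sql;

// Hypothetical layout: real rows start with "<digits>,", so any newline not followed
// by that pattern is a stray newline inside a value and gets replaced with a space.
var raw = File.ReadAllText(@"C:\git\files\newline-as-data.txt");
var reshaped = Regex.Replace(raw, @"\r?\n(?!\d+,)", " ");
File.WriteAllText(@"C:\git\files\newline-fixed.txt", reshaped);

// After reshaping, a plain line-based read gives one row per record again.
var spark = SparkSession.Builder().AppName("reshape-sketch").GetOrCreate();
var file = spark.Read().Text(@"C:\git\files\newline-fixed.txt");
file.Show();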

SQL Server CTP 3.2 and Java Extensibility

Niels Berglund walks us through what has changed with Java support in ML Services in SQL Server 2019 CTP 3.2:

One of the announcements of what is new in CTP 3.2 was that SQL Server now includes Azul Systems' Zulu Embedded right out of the box for all scenarios where we use Java in SQL Server, including Java extensibility.

So, in this post, we look at the impact, if any, this has on how we use the Java extensibility framework in SQL Server 2019.

This also affects PolyBase.

ClassNotFoundException and .NET Spark

Ed Elliott takes us through two causes for a ClassNotFoundException when running a Spark job with .NET Spark:

There was a breaking change in version 0.4.0 that renamed the class used to load the dotnet driver in Apache Spark.

To fix the issue, you need to use the new package name, which adds an extra dotnet near the end. Change:

spark-submit --class org.apache.spark.deploy.DotnetRunner

Click through to see what this line of code should read after the change. If that change doesn't fix your problem, Ed has a broader solution.

Adding Aggregates to Table.Profile

Chris Webb shows us how to add additional aggregates to Table.Profile in M:

A few years ago I blogged about the Table.Profile M function and how you could use it to create a table of descriptive statistics for your data:

https://blog.crossjoin.co.uk/2016/01/12/descriptive-statistics-in-power-bim-with-table-profile/

Since that post was written, a new optional second parameter called additionalAggregates has been added to the function. It allows you to add your own custom columns containing aggregate values to the output of Table.Profile, so I thought I'd write a follow-up on how to use it.

Click through for that follow-up.

Keeping Bash Scripts Reusable

Kellyn Pot’vin-Gorman explains some of the concepts behind scripting for longevity:

I'm going to admit that the reason I didn't embrace PowerShell at first was that most of the examples I found were full of hardcoded values. I found it incredibly obtuse, but I started to realize that it came from many sources who might not have the scripting history that those from other shells do (this was just my theory, not a lot of evidence to prove this one, so keep that in mind…). As PowerShell scripts have matured, I've noticed how many are starting to build them with more dynamic values and advanced scripting options, and with this, I've become more comfortable with PowerShell.

I think the best way to learn is to see real examples, so let’s demonstrate.

Read on for those examples.

Hooking SQL Server to Kafka

Niels Berglund has an interesting scenario for us:

We see how the procedure in Code Snippet 2 takes relevant gameplay details and inserts them into the dbo.tb_GamePlay table.

In our scenario, we want to stream the individual gameplay events, but we cannot alter the services which generate the gameplay. We instead decide to generate the event from the database using, as we mentioned above, the SQL Server Extensibility Framework.

Click through for the scenario in depth and how to use Java to tie together SQL Server and Kafka.
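
Niels does the producing from Java inside the extensibility framework, so click through for the real thing. Purely to give a flavor of what pushing a single event to Kafka involves, here's a generic C# sketch using the Confluent.Kafka client; the broker address, topic name, and payload are all made up, and this is not the approach from the post:

using System;
using Confluent.Kafka;

// Not Niels' approach (he uses Java via the SQL Server Extensibility Framework);
// this is only a generic sketch of producing one event to a Kafka topic from C#.
var config = new ProducerConfig { BootstrapServers = "localhost:9092" };

using var producer = new ProducerBuilder<Null, string>(config).Build();

// Hypothetical gameplay event serialized as JSON.
var result = await producer.ProduceAsync(
    "gameplay-events",
    new Message<Null, string> { Value = "{\"gameId\": 42, \"action\": \"score\"}" });

Console.WriteLine($"Delivered to {result.TopicPartitionOffset}");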

Reading and Writing CSV Files with spark-dotnet

Ed Elliott continues a series on Spark for .NET:

How do you read and write CSV files using the dotnet driver for Apache Spark?

I have a runnable example here:
https://github.com/GoEddie/dotnet-spark-examples

Specifically:
https://github.com/GoEddie/dotnet-spark-examples/tree/master/examples/split-csv

The quoted links will take you straight to the code, but click through to see Ed’s commentary.
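
If you just want the shape of it before you click over, a minimal sketch of reading and writing CSV with the dotnet driver looks something like this; the paths, options, and app name are placeholders rather than anything from Ed's repo:

using Microsoft.Spark.Sql;

var spark = SparkSession.Builder().AppName("csv-sketch").GetOrCreate();

// Read a CSV with a header row, letting Spark guess the column types.
var df = spark.Read()
    .Option("header", true)
    .Option("inferSchema", true)
    .Csv(@"C:\git\files\input.csv");

df.Show();

// Write it back out as CSV (Spark writes a directory of part files, not a single file).
df.Write()
    .Mode("overwrite")
    .Option("header", true)
    .Csv(@"C:\git\files\output");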

How .NET Code Talks to Spark

Ed Elliott has a great diagram showing how user-written .NET code communicates with Spark via the Java VM:

4. Spark-dotnet Java driver listens on a TCP port
The spark-dotnet Java driver listens on a TCP socket. This socket is used to communicate between the Java VM and the dotnet code; the dotnet code doesn't run in the Java VM but is in a separate process, communicating with the Java VM via that TCP port. The year is 2019; we serialize and deserialize data all the time and don't even know it. Hell, Notepad probably even does it.

It’s serialization & deserialization as well as TCP sockets all the way down.
