Misc Languages – Page 21

So the first thing we need to do is to read in the whole file in one chunk, if we just do a standard read the file will get broken into rows based on the newline character:
var file = spark.Read().Option("wholeFile", true).Text(@"C:\git\files\newline-as-data.txt");

This solution is a bit complex. As Ed points out, you’re better off reshaping the file before you try to process it. If it’s a structured file like the example Ed has, a regular expression can do the trick.

Comments closed

SQL Server CTP 3.2 and Java Extensibility

Published 2019-08-07 by Kevin Feasel

Niels Berglund walks us through what has changed with Java support in ML Services in SQL Server 2019 CTP 3.2:

One of the announcements of what is new in CTP 3.2 was that SQL Server now includes Azul System’sZulu Embedded right out of the box for all scenarios where we use Java in SQL Server, including Java extensibility.
So, in this post, we look at the impact, (if any), this has to how we use the Java extensibility framework in SQL Server 2019.

This also affects PolyBase.

Comments closed

ClassNotFoundException and .NET Spark

Published 2019-08-05 by Kevin Feasel

Ed Elliott takes us through two causes for a ClassNotFoundException when running a Spark job with .NET Spark:

There was a breaking change with version 0.4.0 that changed the name of the class that is used to load the dotnet driver in Apache Spark.
To fix the issue you need to use the new package name which adds an extra dotnet near the end, change:
spark-submit --class org.apache.spark.deploy.DotnetRunner

Click through to see what you should change this line of code to read. If that change doesn’t fix your problem, Ed has a broader solution.

Comments closed

Adding Aggregates to Table.Profile

Published 2019-08-05 by Kevin Feasel

Chris Webb shows us how to add additional aggregates to Table.Profile in M:

A few years ago I blogged about the Table.Profile M function and how you could use it to create a table of descriptive statistics for your data:
https://blog.crossjoin.co.uk/2016/01/12/descriptive-statistics-in-power-bim-with-table-profile/
Since that post was written a new, optional second parameter has been added to the function called additionalAggregates which allows you to add your own custom columns containing aggregate values to the output of Table.Profile, so I thought I’d write a follow-up on how to use it.

Click through for that follow-up.

Comments closed

Keeping Bash Scripts Reusable

Published 2019-08-01 by Kevin Feasel

Kellyn Pot’vin-Gorman explains some of the concepts behind scripting for longevity:

I’m going to admit, that the reason I didn’t embrace Powershell at first, was most of the examples I found were of full of hardcoded values. I found it incredibly obtuse, but I started to realize that it came from many sources who might not have the scripting history that those of other shells, (this was just my theory, not a lot of evidence to prove on this one, so keep that in mind…) As Powershell scripts have matured, I’ve noticed how many are starting to build them with more dynamic values and advance scripting options, and with this, I’ve become more comfortable with Powershell.
I think the best way to learn is to see real examples, so let’s demonstrate.

Read on for those examples.

Comments closed

Hooking SQL Server to Kafka

Published 2019-07-17 by Kevin Feasel

Niels Berglund has an interesting scenario for us:

We see how the procedure in Code Snippet 2 takes relevant gameplay details and inserts them into the dbo.tb_GamePlay table.
In our scenario, we want to stream the individual gameplay events, but we cannot alter the services which generate the gameplay. We instead decide to generate the event from the database using, as we mentioned above, the SQL Server Extensibility Framework.

Click through for the scenario in depth and how to use Java to tie together SQL Server and Kafka.

Comments closed

Reading and Writing CSV Files with spark-dotnet

Published 2019-07-16 by Kevin Feasel

Ed Elliott continues a series on Spark for .NET:

How do you read and write CSV files using the dotnet driver for Apache Spark?
I have a runnable example here:
https://github.com/GoEddie/dotnet-spark-examples
Specifcally:
https://github.com/GoEddie/dotnet-spark-examples/tree/master/examples/split-csv

The quoted links will take you straight to the code, but click through to see Ed’s commentary.

Comments closed

How .NET Code Talks to Spark

Published 2019-07-15 by Kevin Feasel

Ed Elliott has a great diagram showing how user-written .NET code communicates with Spark over the Java VM:

4. Spark-dotnet Java driver listens on tcp port
The spark-dotnet Java driver listens on a TCP socket. This socket is used to communicate between the Java VM and the dotnet code, the dotnet code doesn’t run in the Java VM but is in a separate process communitcating with the Java VM via that TCP postrt. The year is 2019, we serialize and deserialize data all the time and don’t even know it, hell notepad probably even does it.

It’s serialization & deserialization as well as TCP sockets all the way down.

Comments closed

Spark and dotnet in a Single Container

Published 2019-07-11 by Kevin Feasel

Ed Elliott shows how you can combine Spark and .NET Core in a single Docker container:

This is quite new syntax in docker and you need at least docker 17.05 (client and daemon), after the images “FROM blah” you can specify a name “core” in this case, then later you can copy from the first image to the second using “–from=” on the “COPY” command.
In this dockerfile I have added Spark 2.4.3 and the default environment variables we need to get spark running, if you grab this dockerfile and run “docker build -t dotnet-spark .” you should get an images you can then run which includes the dependencies for dotnet as well as spark.

Ed includes all of the scripts needed to test this out, too.

Comments closed

Scala 2.13 Changes

Published 2019-07-02 by Kevin Feasel

Anmol Sarna takes us through what’s new in Scala 2.13:

Last, but not the least, the team has invested heavily in compiler speedups during the 2.13 cycle which resulted in some major changes with respect to the compiler.
Compiler performance in 2.13 is 5-10% better compared to 2.12, thanks mainly to the new collections.

There are a lot of changes in this version. I wonder how long before Spark supports it fully.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Misc Languages

Parsing Rows Manually with Spark .NET

SQL Server CTP 3.2 and Java Extensibility

ClassNotFoundException and .NET Spark

Adding Aggregates to Table.Profile

Keeping Bash Scripts Reusable

Hooking SQL Server to Kafka

Reading and Writing CSV Files with spark-dotnet

How .NET Code Talks to Spark

Spark and dotnet in a Single Container

Scala 2.13 Changes