
Category: Misc Languages

SPLIT And FLATTEN In Snowflake DB

Koen Verbeeck takes us through a couple of Snowflake functions which behave quite differently from their SQL Server equivalents:

I’m doing a little series on some of the nice features/capabilities in Snowflake (the cloud data warehouse). In each part, I’ll highlight something that I think is interesting enough to share. It might be some SQL function that I’d really like to have in SQL Server, it might be something else.
Let’s start with SPLIT. Splitting strings is something most of us have to do from time to time. In SQL Server, we had to wait until SQL Server 2016, when the table-valued function STRING_SPLIT came along. The syntax of both functions is exactly the same: you specify the string to split and the delimiter.

It turns out that SPLIT doesn’t quite split the same way that STRING_SPLIT does in SQL Server. It gets closer when you add FLATTEN, but someone coming from a SQL Server background will still need to pay attention.


Getting CSV Row Counts

Dave Mason shares a few techniques for getting row counts of CSV files:

I was reminded of this recently as I was working with R, trying to read a nearly 2 GB data file. I wanted to read in 5% of the data and output it to a smaller file that would make the test code run faster. The particular function I was working with needed a row count as one of its parameters. For me, that meant I had to determine the number of rows in the source file and multiply by 0.05. I tied the code for all of those tasks into one script block.
Now, not to my surprise, it was slow. In my short experience, I’ve found R isn’t particularly snappy, even when the data can fit comfortably in memory. I was pretty sure I could beat R’s record count performance handily with C#. And I did. I found some related questions on StackOverflow. A small handful of answers discussed the efficiency of various approaches. I only tried two C# variations: my original attempt, and a second version that was supposed to be faster (the improvement was nominal).

To fact-check Dave (because this blog is about nothing other than making sure Dave is right), I checked the source code to the wc command. That command also streams through the entire file, so Dave’s premise looks good.
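The underlying approach is the same in any language: stream through the file and count lines rather than loading everything into memory. As a rough illustration of that idea (in Scala rather than the R and C# Dave actually used, and with a made-up file name), something like this keeps memory usage flat:

import scala.io.Source

object CsvRowCount {
  def main(args: Array[String]): Unit = {
    // Hypothetical file name; swap in the real path.
    val source = Source.fromFile("bigfile.csv")
    try {
      // getLines() returns a lazy iterator, so the file is streamed
      // rather than read into memory in one go.
      val rows = source.getLines().size
      println(s"Total rows: $rows")
      println(s"5% sample size: ${math.ceil(rows * 0.05).toInt}")
    } finally {
      source.close()
    }
  }
}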


Java With Visual Studio Code

Niels Berglund learns about Visual Studio Code and writing Java in VS Code:

I mentioned above how Maven is a build automation tool primarily for Java projects. There are other build tools as well, but in this post I use Maven, as it is – as mentioned above – the de facto standard for Java-based projects.

Above, we saw how we installed Maven as well as the VS Code Maven extension; however, before we start to use it, let us talk a little about Maven archetypes and naming conventions.

If I were in Java day in and day out, I’d probably go with IntelliJ IDEA, but for occasional work I can see VS Code doing well. That’s the interesting use case for Visual Studio Code: with all of the extensions, it’s a really good multi-purpose IDE while still being worse than the “natural” IDE for almost any single language.


LISTAGG In Snowflake DB

Koen Verbeeck continues investigating Snowflake capabilities:

Since SQL Server 2017, you have the STRING_AGG function, which has almost the exact same syntax as its Snowflake counterpart. There are two minor differences:
– Snowflake has an optional DISTINCT.
– SQL Server sorts ascending by default. If you want a different sort order, you can specify one in the WITHIN GROUP clause. In Snowflake, there is no guaranteed sorting unless you specify it (again, in the WITHIN GROUP clause).

It looks like LISTAGG is the ANSI standard name, though SQL Server followed Postgres’s lead in calling its function STRING_AGG.


Miscellany On Java In SQL Server

Niels Berglund continues a series on SQL Server 2019 and Java support:

This post is the fourth in which I look at the Java extension in SQL Server, i.e., the ability to execute Java code from inside SQL Server. The previous three posts are:
– SQL Server 2019 Extensibility Framework & Java – Hello World: We looked at installing and enabling the Java extension, as well as some very basic Java code.
– SQL Server 2019 Extensibility Framework & Java – Passing Data: In this post, we discussed what is required to pass data back and forth between SQL Server and Java.
– SQL Server 2019 Extensibility Framework & Java – Null Values: This post is a follow-up to the Passing Data post, and we look at how to handle null values in data passed to Java.
This fourth post acts as a “roundup” of miscellaneous “stuff” I did not cover in the three previous posts.

If you haven’t seen the first three, check them out too.


Using Sqoop’s Logic To Improve Spark JDBC Performance

Avi Yehuda analyzes how Sqoop works to make relational database access from Spark faster:

Sqoop performed so much better almost instantly; all you needed to do was set the number of mappers according to the size of the data, and it worked perfectly.
Since both Spark and Sqoop are based on the Hadoop map-reduce framework, it’s clear that Spark can work at least as well as Sqoop; I only needed to find out how to do it. I decided to look closer at what Sqoop does to see if I could imitate that with Spark.
By turning on the verbose flag of Sqoop, you can get a lot more details. What I found was that Sqoop splits the input across the different mappers, which makes sense (this is map-reduce, after all, and Spark does the same thing). But before doing that, Sqoop does something smart that Spark doesn’t do.

Read on to see what in particular Sqoop does, and how you can use that in your Spark code.
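Without giving away the article’s specific finding, the standard way to get Sqoop-style parallel reads out of Spark’s JDBC data source is its partitioning options: you give Spark a split column plus its bounds, and it opens one connection per partition. A minimal sketch, with every connection detail hypothetical:

import org.apache.spark.sql.SparkSession

object PartitionedJdbcRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partitioned-jdbc-read").getOrCreate()

    // Instead of a single JDBC connection doing all the work, Spark splits the
    // read into numPartitions ranges over partitionColumn, much like Sqoop's
    // --split-by / --num-mappers behaviour.
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/sales") // hypothetical connection string
      .option("dbtable", "public.orders")                   // hypothetical table
      .option("user", "etl_user")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .option("partitionColumn", "order_id") // numeric, date, or timestamp column
      .option("lowerBound", "1")             // e.g. the result of SELECT MIN(order_id)
      .option("upperBound", "10000000")      // e.g. the result of SELECT MAX(order_id)
      .option("numPartitions", "10")         // number of parallel reads
      .load()

    println(orders.rdd.getNumPartitions)
  }
}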


Building A Full-Stack App With F#

Shanglun Wang has a tutorial on building a full application using F#:

In the United States, there is a popular saying: “It’s five o’clock somewhere”.
In some parts of the world, 5:00 pm is the earliest time when it is socially acceptable to have a drink, or a traditional cup of tea.
Today, we will build an application based on this concept: an application that, at any given time, searches through the various time zones, finds out where it is five o’clock, and provides that information to the user.

It’s a neat app.  
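The core lookup is compact in any language; as a rough sketch of the idea (in Scala with java.time, rather than the F# the post actually uses), you can filter the available time zone IDs down to the ones where the local hour is currently 17:

import java.time.{ZoneId, ZonedDateTime}
import scala.jdk.CollectionConverters._ // Scala 2.13 converters

object FiveOClockSomewhere {
  def main(args: Array[String]): Unit = {
    // Walk every known time zone and keep the ones where it is currently 5 pm.
    val fiveOClockZones = ZoneId.getAvailableZoneIds.asScala.toSeq
      .map(id => ZoneId.of(id))
      .filter(zone => ZonedDateTime.now(zone).getHour == 17)
      .map(_.getId)
      .sorted

    fiveOClockZones.foreach(println)
  }
}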


Passing Data To Java In SQL Server 2019

Niels Berglund continues looking at the SQL Server 2019 Extensibility Framework’s support for Java:

In Java, there are also helper components (a topic for future posts), but the integration is not as tight, so when we want to pass data into and out of Java, we need to write somewhat more explicit code to make data passing possible.
In our Java code, we need to represent the data passed in and out as class member column arrays. You define, in your classes, one array per column passed in and one array per column returned. These column arrays are some of the “magic” members I mentioned above, and they are the equivalent of InputDataSet and OutputDataSet.
The components that are part of the Java extension need to know about these members, as the components either populate them when pushing data into Java or read from them when returning data from Java. The way the components know about the members is based on a naming standard.

It’s definitely easier to pass data in and get data back from R and Python, but I suppose part of that is Java being a static, compiled language.


Working With Key-Value Pairs In Spark

Teena Vashist shows us a few of the functions available with Spark for working with key-value pairs:

1. Creating Key/Value Pair RDD: 
The pair RDD arranges the data of a row into two parts. The first part is the key and the second part is the value. In the example below, I used the parallelize method to create an RDD, and then I used the length method to create a pair RDD. The key is the length of each word and the value is the word itself.

scala> val rdd = sc.parallelize(List("hello","world","good","morning"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val pairRdd = rdd.map(a => (a.length,a))
pairRdd: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[1] at map at <console>:26
scala> pairRdd.collect().foreach(println)
(5,hello)
(5,world)
(4,good)
(7,morning)

Click through for more operations.  Spark is a bit less KV-centric than classic MapReduce jobs, but there are still plenty of places where you want to use them.
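As a taste of those other operations, here is one more step you could run against the same pairRdd (illustrative only; it is not from the original post), combining all of the words that share a length:

// reduceByKey merges the values for each key, combining on each partition
// before the shuffle, which is why it is generally preferred over groupByKey.
val wordsByLength = pairRdd.reduceByKey((w1, w2) => w1 + "," + w2)
wordsByLength.collect().foreach(println)
// Expected pairs (output order may vary):
// (4,good)
// (5,hello,world)
// (7,morning)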
