I mentioned above how Maven is a build automation tool for primarily Java projects. There are other build tools as well, but in this post, I use Maven as it is – which I mentioned above – the de-facto standard for Java-based projects.
Above we saw how we installed Maven as well as the VS Code Maven extension, however before we start to use it let us talk a little about Maven archetypes and naming conventions.
If I were in Java day in and day out, I’d probably go with IntelliJ IDEA but for occasional work, I can see VS Code doing well. That’s the interesting use case for Visual Studio Code: with all of the extensions, it’s a really good multi-purpose IDE while still being worse than the “natural” IDE for almost any single language.
Since SQL Server 2017, you have the STRING_AGG function, which has almost the exact same syntax as its Snowflake counterpart. There are two minor differences:
– Snowflake has an optional DISTINCT
– SQL Server has a default ascending sorting. If you want another sorting, you can specify one in the WITHIN GROUP clause. In Snowflake, there is no guaranteed sorting unless you specify it (again in the WITHIN GROUP clause).
Scalastyle is a style checker for Scala. It checks your Scala code against a number of configurable rules, and marks the code which violates these rules with warning or error markers in your source code.
Click through for the step-by-step process.
This post is the fourth post where I look at the Java extension in SQL Server, i.e. the ability to execute Java code from inside SQL Server. The previous three posts are:
SQL Server 2019 Extensibility Framework & Java – Hello World: We looked at installing and enabling the Java extension, as well as some very basic Java code.
SQL Server 2019 Extensibility Framework & Java – Passing Data: In this post, we discussed what is required to pass data back and forth between SQL Server and Java.
SQL Server 2019 Extensibility Framework & Java – Null Values: This, the Null Values, post is a follow up to the Passing Data post, and we look at how to handle
nullvalues in data passed to Java.
This fourth post acts as a “roundup” of miscellaneous “stuff” I did not cover in the three previous posts
If you haven’t seen the first three, check them out too.
Sqoop performed so much better almost instantly, all you needed to do is to set the number of mappers according to the size of the data and it was working perfectly.
Since both Spark and Sqoop are based on the Hadoop map-reduce framework, it’s clear that Spark can work at least as good as Sqoop, I only needed to find out how to do it. I decided to look closer at what Sqoop does to see if I can imitate that with Spark.
By turning on the verbose flag of Sqoop, you can get a lot more details. What I found was that Sqoop is splitting the input to the different mappers which makes sense, this is map-reduce after all, Spark does the same thing. But before doing that, Sqoop does something smart that Spark doesn’t do.
Read on to see what in particular Sqoop does, and how you can use that in your Spark code.
In the United States, there is a popular saying: “It’s five o’clock somewhere”.
In some parts of the world, 5:00 pm is the earliest time when it is socially acceptable to have a drink, or a traditional cup of tea.
Today, we will build an application based on this concept. We will build an application that, at any given time, searches through the various time zones, find out where it is five o’clock, and provide that information to the user.
It’s a neat app.
In Java, there are also helper components, (a topic for future posts), but the integration is not as tight, so when we want to pass data into and out of Java we need to code somewhat more explicit to make data passing possible.
In our Java code, we need to represent the data passed in and out as class member column arrays. You define in your classes, one array per column passed in, and one array per column returned. These column arrays are some of the “magic” members I mentioned above, and they are the equivalent to
The components that are part of the Java extension need to know about these members as the components either populate them when pushing data into Java or read from them when returning data from Java. The way that the components know about the members is based on a naming standard.
It’s definitely easier to pass data in and get data back from R and Python, but I suppose part of that is Java being a static, compiled language.
1. Creating Key/Value Pair RDD:
The pair RDD arranges the data of a row into two parts. The first part is the Key and the second part is the Value. In the below example, I used a
parallelizemethod to create a RDD, and then I used the
lengthmethod to create a Pair RDD. The key is the length of the each word and the value is the word itself.
scala> val rdd = sc.parallelize(List("hello","world","good","morning"))rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD at parallelize at <console>:24scala> val pairRdd = rdd.map(a => (a.length,a))pairRdd: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD at map at <console>:26scala> pairRdd.collect().foreach(println)(5,hello)(5,world)(4,good)
Click through for more operations. Spark is a bit less KV-centric than classic MapReduce jobs, but there are still plenty of places where you want to use them.
Recently, my company faced the serious challenge of loading a 10 million rows of CSV-formatted geographic data to MongoDB in real-time.
We first tried to make a simple Python script to load CSV files in memory and send data to MongoDB. Processing 10 million rows this way took 26 minutes!
26 minutes for processing a dataset in real-time is unacceptable so we decided to proceed differently.
I’m not sure the test was totally fair, but the results comport to my biases… There is some good advice here: storing data in optimized formats (Parquet in this instance) can make a big difference, Spark is useful for ETL style operations, and Scala is generally the fastest language in the Spark world.
In SQL Server 2019 Microsoft added the ability to execute custom Java code along the same lines we execute R and Python, and this blog post intends to give an introduction of how to install and enable the Java extension, as well as execute some very basic Java code. In future posts, I drill down how to pass data back and forth between SQL Server and Java.
There may very well be future posts discussing how the internals differ between Java and R/Python, but I want to talk about that a little bit in this post as well, as it has an impact on how we write and call Java code.
The not-so-secret here is that Java itself is less interesting of a language than, say, Scala. And the reason you’d support Scala? To interact with an Apache Spark cluster. I think that’s a big part of why you’d want the installer to load Java 1.8 instead of 1.9 or later (which contain API changes which break Spark). Definitely give this a careful read, as there are more working parts and more gotchas than R or Python support.