Press "Enter" to skip to content


Looking At The Robin Hood Caching Algorithm

Adrian Colyer reviews a paper on a multi-system caching algorithm:

The thing about this common pattern is that we need to wait for all of these back-end requests to complete before returning to the user. So improving the average latency of these requests doesn’t help us one little bit.

Since each request must wait for all of its queries to complete, the overall request latency is defined to be the latency of the request’s slowest query. Even if almost all backends have low tail latencies, the tail latency of the maximum of several queries could be high.

(See ‘The Tail at Scale’).

The user can thus easily experience P99 latency or worse.
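
To put a rough number on that claim: if a request fans out to n backend queries and each query independently exceeds its P99 latency 1% of the time, the chance that at least one query (and therefore the whole request) runs slow is 1 - 0.99^n. A quick sketch in Scala, assuming independence, which real systems only approximate:

    object TailAtScale {
      // Chance that at least one of n independent queries exceeds its P99 latency.
      def pSlow(n: Int): Double = 1 - math.pow(0.99, n)

      def main(args: Array[String]): Unit = {
        Seq(1, 20, 100).foreach { n =>
          println(f"n = $n%3d: chance of a slow request = ${pSlow(n)}%.2f")
        }
        // n =   1: 0.01,  n =  20: 0.18,  n = 100: 0.63
      }
    }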

Techniques to mitigate tail latencies include making redundant requests, clever use of scheduling, auto-scaling and capacity provisioning, and approximate computing. Robin Hood takes a different (complementary) approach: use the cache to improve tail latency!

Robin Hood doesn’t necessarily allocate caching resources to the most popular backends; instead, it allocates caching resources to the backends (currently) responsible for the highest tail latency.
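
The paper’s actual algorithm works in terms of per-backend request blocking counts and shifts cache in small increments; as a loose Scala sketch of the spirit of that policy (the names and numbers here are mine, not from the paper):

    object RobinHoodSketch {
      final case class Backend(name: String, cacheMB: Int, p99Ms: Double)

      // Take a small slice of cache from every backend that is not the current
      // worst offender and hand all of it to the backend with the highest P99.
      def rebalance(backends: Vector[Backend], sliceMB: Int = 8): Vector[Backend] = {
        val worst  = backends.maxBy(_.p99Ms).name
        val donors = backends.count(b => b.name != worst && b.cacheMB >= sliceMB)
        backends.map { b =>
          if (b.name == worst) b.copy(cacheMB = b.cacheMB + donors * sliceMB)
          else if (b.cacheMB >= sliceMB) b.copy(cacheMB = b.cacheMB - sliceMB)
          else b
        }
      }

      def main(args: Array[String]): Unit = {
        val start = Vector(
          Backend("users",  512, p99Ms = 12.0),
          Backend("search", 512, p99Ms = 230.0), // current tail-latency culprit
          Backend("ads",    512, p99Ms = 40.0)
        )
        rebalance(start).foreach(println) // search gains cache; the others donate
      }
    }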

This is a great review of an interesting algorithm.


Data Modeling In Cassandra

Charmy Garg walks us through some of the basics of modeling tables in Cassandra:

Two basic goals in Cassandra which we should keep in mind:

  • Spread data evenly around the cluster – You want every node in the cluster to have roughly the same amount of data. Rows are spread around the cluster based on a hash of the partition key, which is the first element of the PRIMARY KEY. So, the key to spreading data evenly is this: pick a good primary key.

  • Minimize the number of partitions read – Partitions are groups of rows that share the same partition key. When you issue a read query, you want to read rows from as few partitions as possible. Why is this important? [Each partition may reside on a different node. The coordinator will generally need to issue separate commands to separate nodes for each partition you request. This adds a lot of overhead and increases the variation in latency. Furthermore, even on a single node, it’s more expensive to read from multiple partitions than from a single one due to the way rows are stored.]
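
To make those two goals concrete, here is a small sketch using the DataStax Java driver (cassandra-driver-core 3.x) from Scala; the telemetry keyspace, readings table, and column names are invented for illustration. Rows spread around the cluster by sensor_id (the partition key), and a query that names one sensor_id reads from a single partition:

    import com.datastax.driver.core.Cluster

    object SensorReadings {
      def main(args: Array[String]): Unit = {
        val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
        val session = cluster.connect()

        session.execute(
          "CREATE KEYSPACE IF NOT EXISTS telemetry WITH replication = " +
            "{'class': 'SimpleStrategy', 'replication_factor': 1}")

        // sensor_id is the partition key (spreads rows around the cluster);
        // reading_time is a clustering column (orders rows within a partition).
        session.execute(
          """CREATE TABLE IF NOT EXISTS telemetry.readings (
            |  sensor_id    uuid,
            |  reading_time timestamp,
            |  value        double,
            |  PRIMARY KEY (sensor_id, reading_time)
            |) WITH CLUSTERING ORDER BY (reading_time DESC)""".stripMargin)

        // Naming the full partition key means this read touches one partition
        // on one node instead of scattering across the cluster.
        val sensorId = java.util.UUID.randomUUID()
        session.execute(
          "SELECT reading_time, value FROM telemetry.readings " +
            "WHERE sensor_id = ? LIMIT 10", sensorId)

        cluster.close()
      }
    }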

Charmy also has a couple of pitfalls that people used to the relational database model may hit.


Voice Control For Shiny Apps

Over at Jumping Rivers, an example of using a JavaScript library to control a page using voice commands:

I have found that performance across all devices and browsers is definitely not equal. By far the best browser I have found for viewing the apps is Google Chrome. I have also tended to find that my Ubuntu machines don’t do as well as Microsoft machines in picking up words correctly. A chat I had with someone recently suggested this might be down to drivers under Ubuntu for the microphones but that is not my area of expertise. Voice recognition was also fine on both of my Blackberry phones (one running BB OS 10, the other running Android 7).

It is worth noting that this does require an internet connection to function; in Chrome, the voice-to-text is performed in the cloud.

The other thing I have noticed is that annyang seems relatively sensitive to background noise. This isn’t so bad for functions called using specific phrases but does sometimes have a large effect on the multi-word splats. This is because the splats are greedy, and the background noise makes the recognition engine think that you are still talking long after you have finished, which gives the appearance of the application hanging.

The solution is by no means perfect, but it does look quite interesting.


Reading Excel Files In An Office-less World

Bill Fellows shows us how to read from an Excel file on a machine without Microsoft Office installed:

A common problem working with Excel data is Excel itself. Working with it programmatically requires an installation of Office, and the resulting license cost, and once everything is set, you’re still working with COM objects, which present their own set of challenges. If only there was a better way.

Enter, the better way – EPPlus. This is an open source library that wraps the OpenXml library which allows you to simply reference a DLL. No more installation hassles, no more licensing (LGPL) expense, just a simple reference you can package with your solutions.

Let’s look at an example.

Read on for the example. A couple of alternatives I like in R are readxl and XLConnect.


Building A Basic Kafka Producer

M. Mallikarjun shows us a simple producer in Kafka:

A Kafka producer is an application that can act as a source of data in a Kafka cluster. A producer can publish messages to one or more Kafka topics.

So, how many ways are there to implement a Kafka producer? Well, there are a lot! But in this article, we shall walk you through two ways.

  1. Kafka Command Line Tools
  2. Kafka Producer Java API
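
Not to steal the article’s thunder, but as a quick taste of the second approach, here is a minimal producer written in Scala against the Java API (the broker address and topic name are placeholders):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object SimpleProducer {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092") // placeholder broker
        props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        try {
          // send() is asynchronous; it returns a java.util.concurrent.Future.
          producer.send(new ProducerRecord("demo-topic", "key-1", "hello, Kafka"))
        } finally {
          producer.close() // flushes any buffered records before exiting
        }
      }
    }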

You can write producers in quite a few languages.  Java is the example here, but there are several libraries, including a good one for .NET.


Writing Higher-Order Functions With Scala

Jyoti Sachdeva explains the concept of higher-order functions and shares an example in Scala:

In this blog, I’m going to explain higher-order functions.

A higher-order function takes other functions as parameters or returns a function as a result.

This is possible because functions are first-class values in Scala. What does that mean?

It means that functions can be passed as arguments to other functions, and functions can return other functions.

The map function is a classic example of a higher order function.

Higher-order functions are one of the key components of functional programming and allow us to reason about programs in small chunks at a time.
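
A minimal Scala illustration of both directions (the function names here are mine):

    object HigherOrderDemo {
      // Takes a function as a parameter: applies f to every element.
      def mapList[A, B](xs: List[A])(f: A => B): List[B] = xs.map(f)

      // Returns a function as its result.
      def multiplier(factor: Int): Int => Int = x => x * factor

      def main(args: Array[String]): Unit = {
        println(mapList(List(1, 2, 3))(_ * 2)) // List(2, 4, 6)

        val triple = multiplier(3)
        println(triple(10)) // 30
      }
    }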


What’s New With Machine Learning Services

Niels Berglund looks at SQL Server 2019’s Machine Learning Services offering for updates:

So, when I read What’s new in SQL Server 2019, I came across a lot of interesting “stuff”, but one thing that stood out was Java language programmability extensions. In essence, it allows us to execute Java code in SQL Server by using a pre-built Java language extension! The way it works is as with R and Python; the code executes outside of the SQL Server engine, and you use sp_execute_external_script as the entry-point.

I haven’t had time to execute any Java code as of yet, but in the coming days, I definitely will drill into this. Something I noticed is that the architecture for SQL Server Machine Learning Services has changed (or had additions to it).

That Java support is for Spark, I’d imagine.  And I hope they allow for Scala.


Hadoop + SQL Server In 2019

Travis Wright shows off a big part of what the SQL Server team has been working on the last couple of years:

SQL Server 2019 big data clusters provide a complete AI platform. Data can be easily ingested via Spark Streaming or traditional SQL inserts and stored in HDFS, relational tables, graph, or JSON/XML. Data can be prepared by using either Spark jobs or Transact-SQL (T-SQL) queries and fed into machine learning model training routines in either Spark or the SQL Server master instance using a variety of programming languages, including Java, Python, R, and Scala. The resulting models can then be operationalized in batch scoring jobs in Spark, in T-SQL stored procedures for real-time scoring, or encapsulated in REST API containers hosted in the big data cluster.

SQL Server big data clusters provide all the tools and systems to ingest, store, and prepare data for analysis as well as to train the machine learning models, store the models, and operationalize them.
Data can be ingested using Spark Streaming, by inserting data directly to HDFS through the HDFS API, or by inserting data into SQL Server through standard T-SQL insert queries. The data can be stored in files in HDFS, or partitioned and stored in data pools, or stored in the SQL Server master instance in tables, graph, or JSON/XML. Either T-SQL or Spark can be used to prepare data by running batch jobs to transform the data, aggregate it, or perform other data wrangling tasks.

Data scientists can choose either to use SQL Server Machine Learning Services in the master instance to run R, Python, or Java model training scripts or to use Spark. In either case, the full library of open-source machine learning libraries, such as TensorFlow or Caffe, can be used to train models.

Lastly, once the models are trained, they can be operationalized in the SQL Server master instance using real-time, native scoring via the PREDICT function in a stored procedure in the SQL Server master instance; or you can use batch scoring over the data in HDFS with Spark. Alternatively, using tools provided with the big data cluster, data engineers can easily wrap the model in a REST API and provision the API + model as a container on the big data cluster as a scoring microservice for easy integration into any application.

I’ve wanted Spark integration ever since 2016 and we’re going to get it.


When Cassandra Makes Sense

Anmol Sarna explains the pros and cons of using Apache Cassandra:

But as we know, nothing is perfect, and the Cassandra database is no exception. What I mean by this is that you cannot have a perfect package: if you wish for one brilliant feature, then you might have to compromise on the other features. In today’s blog, we will be going through some of the benefits of selecting Cassandra as your database, as well as the problems/drawbacks that one might face if he/she chooses Cassandra for his/her application.
I have also written some earlier blogs which you can go through for reference if you want to know what Cassandra is, how to set it up, and how it performs its reads and writes.

The only question we have is whether we should pick Cassandra over the other databases that are available. So let’s start by having a quick look at when to use the Cassandra database. This will give a clear picture to all those who are confused in deciding whether to give Cassandra a try or not.

This is a level-headed analysis of Cassandra, so check it out.


The Basic Paradigms Of Functional Programming

Ayush Hooda explains a couple core principles behind functional programming:

A pure function can be defined like this:

  • The output of a pure function depends only on (a) its input parameters and (b) its internal algorithm, which is unlike an OOP method, which can depend on other fields in the same class as the method.

  • A pure function has no side effects, i.e., it does not read anything from the outside world or write anything to the outside world. – For example, it does not read from a file, web service, UI, or database, and does not write anything either.

  • As a result of those first two statements, if a pure function is called with an input parameter x an infinite number of times, it will always return the same result y. – For instance, any time a “string length” function is called with the string “Ayush”, the result will always be 5.
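
A small Scala illustration of the distinction (the impure counterexample is mine):

    object PurityDemo {
      // Pure: depends only on its input and has no side effects.
      // stringLength("Ayush") returns 5 no matter how many times you call it.
      def stringLength(s: String): Int = s.length

      // Impure: the result also depends on the outside world (the system clock),
      // so two calls with the same input can return different values.
      def stampedLength(s: String): String =
        s"${java.time.Instant.now()}: ${s.length}"

      def main(args: Array[String]): Unit = {
        println(stringLength("Ayush"))  // 5
        println(stringLength("Ayush"))  // still 5
        println(stampedLength("Ayush")) // changes from run to run
      }
    }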

If I could add one more thing, it’d be the idea that functions are first-class data types. In other words, a function can be an input to another function, the same as any other data type like int, string, etc. It takes some time to get used to that concept, but once you do, these types of languages become quite powerful.
