Selecting a List of Columns from Spark

Unmesha SreeVeni shows us how we can create a list of column names in Scala to pass into a Spark DataFrame’s select function:

Now our example dataframe is ready.
Create a List[String] with column names.
scala> var selectExpr : List[String] = List("Type","Item","Price") selectExpr: List[String] = List(Type, Item, Price)

Now our list of column names is also created.
Lets select these columns from our dataframe.
Use .head and .tail to select the whole values mentioned in the List()

Click through for a demo.

Case Classes In Scala

Shubham Dangare explains what case classes are in Scala:

Case class is scale way to allow pattern matching on an object without requiring a large amount of boilerplate. All you need to do is add a single case keyword modifier to each class that you want to pattern matching using such modifier makes scala compiler add some syntactic conveniences to your class and compiler add companion object(with the apply method)
Adds factory method with the name of the class this means that for instance, you can write StringValue(“X”) to construct a StringValue object instead of using new StringValue(“X”)

Given how useful case classes are in Spark, it’s good to know how they operate. For more background on the topic, Alessandro Lacava has a post from a few years back describing the topic well.

No Type Equivalence In M

Imke Feldmann notes an oddity in types in Power Query:

But this function will not return any matches. I also tried out a (potentially) slower version using Table.SelectColumns(Types, each [Value] = x[Types]) – but still no match. 

What I found particularly frustrating here was, that in some cases, these lookups or filters on type-columns worked.

That behavior seems odd to me. Imke shares a link from Microsoft which explains that the behavior occurs, but the why behind it eludes me.

Using AWS Lambda To Get Into Nice Restaurants

Stephane Maarek gives us the best use of AWS Lambda I’ve seen yet:

One attentive eye would have noticed that the booking platform is not hosted on the restaurant website at http://www.septime-charonne.fr/en/ but instead on https://module.lafourchette.com.

Upon using the Chrome Web Developer Tools to analyze the network calls being made between my browser and the booking service, I stumbled upon an easy to use and completely unprotected REST API:

I love the bonus hack at the end.

Working With WebHDFS From Node.js

Somanth Veettil shows us how to use Node.js to work with the WebHDFS REST API:

There is an npm module, “node-webhdfs,” with a wrapper that allows you to access Hadoop WebHDFS APIs. You can install the node-webhdfs package using npm:
npm install webhdfs 
After the above step, you can write a Node.js program to access this API. Below are a few steps to help you out.

Click through for examples on how the package works.

Pure Versus Impure Functions

On the Knoldus blog, Siddhant describes pure and impure functions:

As we can see, a value is pure, if it conforms very strictly to its type. In case of pureFunctionValue the declared type said that it was a function which takes an Int and returns a String, and that was indeed what it did. It could take an Int and it gave us back a reference to a String value and did nothing else.

In case of impureFunctionValue the declared type said that it was a function which takes an Int and returns a String. Indeed we could feed it an Int, but when we did so, it did something else apart from returning us a String. It printed stuff out to the console. This, friends, was not expressed in the type, thus it is a side-effect of the function and thus the value in question is impure, and not exactly a function in the mathematical sense.

Pure functions are great because they’re easy to reason about: you have an input, you have an output, and you can guarantee that nothing else changes in between. Impure functions are great because if we only had pure functions, our programs would add zero value. Impure functions drive I/O, including the ability to see what those pure functions did. The trick in functional programming is to push as much logic into the pure space as possible, making it easier to focus on the impure space and make sure you didn’t goof up there.

Querying Cosmos DB Execution Metrics

Hasan Savran shows us how to retrieve execution metrics for a Cosmos DB call:

When I speak about CosmosDB, I always get questions like “How can I retrieve information about the execution plans?” or “Isn’t there a tool like SSMS which can show me what’s happening in the background?” Usually, questions like that comes from DBAs. If you have questions like that, I have good and bad news for you. Good news is, Yes you can get retrieve metrics from CosmosDB about execution plans. Bad news is, you need to know some programming to be able to do that because you need to use CosmosDB SDK.

     The only way to access this information is from CosmosDB SDK 2.x. I couldn’t retrieve execution metrics by using SDK 3.x for custom queries. Here is the available metrics you can retrieve from CosmosDB queries.

I wonder if this is a “this is still new” thing, a “you don’t need these where you’re going” thing, or a “this is exactly how we envisioned implementation” thing. Especially around getting per-query metrics after the fact.

SPLIT And FLATTEN In Snowflake DB

Koen Verbeeck takes us through a couple of Snowflake functions which behave quite differently from their SQL Server equivalent:

I’m doing a little series on some of the nice features/capabilities in Snowflake (the cloud data warehouse). In each part, I’ll highlight something that I think it’s interesting enough to share. It might be some SQL function that I’d really like to be in SQL Server, it might be something else.
Let’s start with SPLIT. Splitting string is something most of us have to do from time to time. In SQL Server, we had to wait until SQL Server 2016 when the table-valued function STRING_SPLIT came along. The syntax of both functions is exactly the same. You specify the string to split and the delimiter.

It turns out that SPLIT doesn’t quite split the same way that STRING_SPLIT does in SQL Server. It gets closer when you add FLATTEN but someone coming from a SQL Server background will still need to pay attention.

Getting CSV Row Counts

Dave Mason shares a few techniques for getting row counts of CSV files:

I was reminded of this recently as I was working with R, trying to read a nearly 2 GB data file. I wanted to read in 5% of the data and output it to a smaller file that would make the test code run faster. The particular function I was working with needed a row count as one of its parameters. For me, that meant I had to determine the number of rows in the source file and multiply by 0.05. I tied the code for all of those tasks into one script block.
Now, none to my surprise, it was slow. In my short experience, I’ve found R isn’t particularly snappy–even when the data can fit comfortably in memory. I was pretty sure I could beat R’s record count performance handily with C#. And I did. I found some related questions on StackOverflow. A small handful of answers discussed the efficiency of various approaches. I only tried two C# variations: my original attempt, and a second version that was supposed to be faster (the improvement was nominal).

To fact-check Dave (because this blog is about nothing other than making sure Dave is right), I checked the source code to the wc command. That command also streams through the entire file, so Dave’s premise looks good.

Java With Visual Studio Code

Niels Berglund learns about Visual Studio Code and writing Java in VS Code:

I mentioned above how Maven is a build automation tool for primarily Java projects. There are other build tools as well, but in this post, I use Maven as it is – which I mentioned above – the de-facto standard for Java-based projects.

Above we saw how we installed Maven as well as the VS Code Maven extension, however before we start to use it let us talk a little about Maven archetypes and naming conventions.

If I were in Java day in and day out, I’d probably go with IntelliJ IDEA but for occasional work, I can see VS Code doing well. That’s the interesting use case for Visual Studio Code: with all of the extensions, it’s a really good multi-purpose IDE while still being worse than the “natural” IDE for almost any single language.

Categories

March 2019
MTWTFSS
« Feb  
 123
45678910
11121314151617
18192021222324
25262728293031