Achilleus has a two-parter on working with columns in Spark. Part 1 covers some of the basic syntax and several functions:
Also, we can have typed columns which is basically a column with an expression encoder specified for the expected input and return type.
scala> val name = $"name".as[String]
name: org.apache.spark.sql.TypedColumn[Any,String] = name
scala> val name = $"name"
name: org.apache.spark.sql.ColumnName = name
There are more than 50 methods(67 the last time I counted ) that can be used for transformations on the column object. We will be covering some of the important methods that are generally used.
This is one of the most important function that is used in many of the window operations.We can talk about the window function in detail when discuss about aggregation in spark but for now, it will be fair enough to say that over method provides a way to apply an aggregation over a window specification which in turn can be used to specify partition, order and frame boundaries of the aggregation.
Check out both of these posts for useful tidbits.
Now our example dataframe is ready.
Create a List[String] with column names.
scala> var selectExpr : List[String] = List("Type","Item","Price") selectExpr: List[String] = List(Type, Item, Price)
Now our list of column names is also created.
Lets select these columns from our dataframe.
Use .head and .tail to select the whole values mentioned in the List()
Click through for a demo.
Case class is scale way to allow pattern matching on an object without requiring a large amount of boilerplate. All you need to do is add a single case keyword modifier to each class that you want to pattern matching using such modifier makes scala compiler add some syntactic conveniences to your class and compiler add companion object(with the apply method)
Adds factory method with the name of the class this means that for instance, you can write StringValue(“X”) to construct a StringValue object instead of using new StringValue(“X”)
Given how useful case classes are in Spark, it’s good to know how they operate. For more background on the topic, Alessandro Lacava has a post from a few years back describing the topic well.
But this function will not return any matches. I also tried out a (potentially) slower version using Table.SelectColumns(Types, each [Value] = x[Types]) – but still no match.
What I found particularly frustrating here was, that in some cases, these lookups or filters on type-columns worked.
That behavior seems odd to me. Imke shares a link from Microsoft which explains that the behavior occurs, but the why behind it eludes me.
Upon using the Chrome Web Developer Tools to analyze the network calls being made between my browser and the booking service, I stumbled upon an easy to use and completely unprotected REST API:
I love the bonus hack at the end.
There is an npm module, “node-webhdfs,” with a wrapper that allows you to access Hadoop WebHDFS APIs. You can install the node-webhdfs package using npm:
npm install webhdfs
After the above step, you can write a Node.js program to access this API. Below are a few steps to help you out.
Click through for examples on how the package works.
As we can see, a value is pure, if it conforms very strictly to its type. In case of
pureFunctionValuethe declared type said that it was a function which takes an Int and returns a String, and that was indeed what it did. It could take an Int and it gave us back a reference to a String value and did nothing else.
In case of
impureFunctionValuethe declared type said that it was a function which takes an Int and returns a String. Indeed we could feed it an Int, but when we did so, it did something else apart from returning us a String. It printed stuff out to the console. This, friends, was not expressed in the type, thus it is a side-effect of the function and thus the value in question is impure, and not exactly a function in the mathematical sense.
Pure functions are great because they’re easy to reason about: you have an input, you have an output, and you can guarantee that nothing else changes in between. Impure functions are great because if we only had pure functions, our programs would add zero value. Impure functions drive I/O, including the ability to see what those pure functions did. The trick in functional programming is to push as much logic into the pure space as possible, making it easier to focus on the impure space and make sure you didn’t goof up there.
When I speak about CosmosDB, I always get questions like “How can I retrieve information about the execution plans?” or “Isn’t there a tool like SSMS which can show me what’s happening in the background?” Usually, questions like that comes from DBAs. If you have questions like that, I have good and bad news for you. Good news is, Yes you can get retrieve metrics from CosmosDB about execution plans. Bad news is, you need to know some programming to be able to do that because you need to use CosmosDB SDK.
The only way to access this information is from CosmosDB SDK 2.x. I couldn’t retrieve execution metrics by using SDK 3.x for custom queries. Here is the available metrics you can retrieve from CosmosDB queries.
I wonder if this is a “this is still new” thing, a “you don’t need these where you’re going” thing, or a “this is exactly how we envisioned implementation” thing. Especially around getting per-query metrics after the fact.
I’m doing a little series on some of the nice features/capabilities in Snowflake (the cloud data warehouse). In each part, I’ll highlight something that I think it’s interesting enough to share. It might be some SQL function that I’d really like to be in SQL Server, it might be something else.
Let’s start with SPLIT. Splitting string is something most of us have to do from time to time. In SQL Server, we had to wait until SQL Server 2016 when the table-valued function STRING_SPLIT came along. The syntax of both functions is exactly the same. You specify the string to split and the delimiter.
It turns out that
SPLIT doesn’t quite split the same way that
STRING_SPLIT does in SQL Server. It gets closer when you add
FLATTEN but someone coming from a SQL Server background will still need to pay attention.
I was reminded of this recently as I was working with R, trying to read a nearly 2 GB data file. I wanted to read in 5% of the data and output it to a smaller file that would make the test code run faster. The particular function I was working with needed a row count as one of its parameters. For me, that meant I had to determine the number of rows in the source file and multiply by 0.05. I tied the code for all of those tasks into one script block.
Now, none to my surprise, it was slow. In my short experience, I’ve found R isn’t particularly snappy–even when the data can fit comfortably in memory. I was pretty sure I could beat R’s record count performance handily with C#. And I did. I found some related questions on StackOverflow. A small handful of answers discussed the efficiency of various approaches. I only tried two C# variations: my original attempt, and a second version that was supposed to be faster (the improvement was nominal).
To fact-check Dave (because this blog is about nothing other than making sure Dave is right), I checked the source code to the wc command. That command also streams through the entire file, so Dave’s premise looks good.