Achilleus has a two-parter on working with columns in Spark. Part 1 covers some of the basic syntax and several functions:
Also, we can have typed columns which is basically a column with an expression encoder specified for the expected input and return type.
scala> val name = $"name".as[String]
name: org.apache.spark.sql.TypedColumn[Any,String] = name
scala> val name = $"name"
name: org.apache.spark.sql.ColumnName = name
There are more than 50 methods(67 the last time I counted ) that can be used for transformations on the column object. We will be covering some of the important methods that are generally used.
This is one of the most important function that is used in many of the window operations.We can talk about the window function in detail when discuss about aggregation in spark but for now, it will be fair enough to say that over method provides a way to apply an aggregation over a window specification which in turn can be used to specify partition, order and frame boundaries of the aggregation.
Check out both of these posts for useful tidbits.
Apache Kafka is one of today’s most commonly used event streaming platforms. While using the Kafka platform, quite often, we run into a scenario where we have to process a large number of events/messages that are placed on a broker. Traditional approaches, where a consumer is listening to a topic and then processes these message within the consumer itself, can become a performance bottleneck if the number of messages being placed on the topic is high. In such cases, the rate at which a consumer can process messages will be very low, as there are a large number of messages getting placed on the topic. A potential solution that can be applied in such a scenario is to offload message processing to the worker threads in a thread pool.
Click through for the Java code.
The high level process is to:
1. Create a measure that returns a colour as the result
1. It can be a word, such as blue, red, green
2. It can be a hex code for a colour, like #40E0D0″, “#FFA07A”
2. Use conditional formatting and use the measure to apply the formatting on the text as a rule.
Read on for a demo.
Pedro Lopes who is now on the SQL Server dev team wrote the blog post “T-SQL Misconceptions – JOIN ON vs. WHERE” which does a good job of explaining the “it depends” around the newer style of joins. Why link this article? Microsoft deprecated the OUTER JOIN operators in SQL Server 2008. Two other sources of information on this are here and here (the first is Ward Pond’s old technet blog, and sadly will probably go away soon). If you’re keeping score at home, WHERE clauses are not deprecated except if you’re using *= and =*). The changes people made were wholly unnecessary and as the author, the newer stuff is harder to decipher than what I originally did. They were putting their own biases onto things.
I personally do not like ANSI-89 syntax because it blurs the lines between filters and join criteria. Those are separate things serving different purposes and keeping them separate clarifies queries more than it obscures. Allan’s example doesn’t have any filters but in a more complex scenario with several filters and several join criteria, it can require extra care figuring out what’s going on, especially with multi-column join criteria and filters mixed in (meaning it’s not join criteria and then filters, but a mishmash of the two).
I love the questions I get when presenting: Can You Force a Parallel Plan in Query Store. I haven’t a clue. The trick I think is going to be in setting up the test. Let’s try it out.
Reliably Getting a Parallel Plan
Because this is for testing, rather than try to build some crazy query that may or may not go parallel, I’ve decided to cheat. I’ll take advantage of a little functionality that ensures I see a parallel plan when I want to.
Read on to see Grant’s trick and also the results of this test.
Azure warn you not to to store data on the D drive in Azure VMs, but following this advice could mean you are missing out on some very fast local storage. It’s good general advice because this local storage is not permanently attached to your instance, meaning you could lose data or log files if your VM is stopped and restarted but what if you could afford to lose certain files? Say files that are recreated during startup anyway.
TempDB is the ideal candidate for this. No other database is suitable! Putting the tempdb data and log files onto D drive can be achieved quite easily with a little bit of effort. And you will most likely see a big improvement in tempdb read/write latency.
John ended up seeing much bigger gains than I did when I tried this, but with a difference that big, it’s definitely worth using the temporary drive for tempdb.
Once we have crafted the perfect configuration and shipped it out to our target nodes, it’s time for the magic to happen. The MOF file that we created by executing our configuration is translated and enacted by the Local Configuration Manager (LCM) on each target node. The LCM is the engine of DSC and plays a vital role in managing our target nodes.
The LCM on each target node has many settings that can be configured using a meta configuration. This document is written very similarly to our regular DSC configurations and then pushed out to the target node.
I’m going to cover a few of the important LCM settings for use in
pushmode. This is where the LCM passively waits for a MOF file to arrive. The other option is
pullmode- this is a little more complicated to set up and in this scenario the LCM is set to actively check in with a pull server for new configurations.
Click through to see some of those important settings.
To test FROID, which is the codename for Microsoft’s initiative to inline those awful scalar valued function things that people have been griping about for like 20 years, I like to take functions I’ve seen used in real life and adapt them a bit to work in the Stack Overflow database.
Read on for a case where Erik saw major improvements.