Spark UDFs in Scala

Achilleus shows us how to create a user-defined function for Spark in Scala, as well as the performance drawbacks:

It is pretty straight forward and easy to create it in spark. Let’s say we have this customer data from Central Perk. If you look at the country data, it has a lot of discrepancies but we kinda know its the right country, it’s just that the way it is entered is not typical. Let’s say we need to normalize it to the USA that is similar with the help of a known dictionary.

The performance hit is often too much for me to accept, though that could just be that I write bad functions.

ETL with Spark and Hive

Emrah Mete gives us an example of using Apache Spark for ETL into Apache Hive:

Now let’s go to the construction of the sample application. In the example, we will first send the data from our Linux file system to the data storage unit of the Hadoop ecosystem (HDFS) (for example, Extraction). Then we will read the data we have written here with Spark and then we will apply a simple Transformation and write to Hive (Load). Hive is a substructure that allows us to query the data in the hadoop ecosystem, which is stored in this environment. With this infrastructure, we can easily query the data in our big data environment using SQL language.

Most of the things relational database professionals do are pretty much the same things that you do with Spark and Hive. There are differences in implementation and level of programming familiarity, but they’re pretty similar.

Choosing Between Management Studio and Azure Data Studio

Brent Ozar gives us the lay of the tooling land:

Every time there’s a new release of SQL Server or SQL Server Management Studio, you can grab the latest version of SSMS and keep right on keepin’ on. Your job still functions the same way using the same tool, and the tool keeps getting better.

And it’s free. You don’t have to ask the boss for upgrade money. You can just download it, install it, and take advantage of things like the cool new execution plan est-vs-actual numbers (which also cause presenters all over to curse, knowing that they have to redo a bunch of screenshots.)

I spend a lot of time jumping back & forth between SQL Server and Postgres, and lemme just tell you, the tooling options on the other side of the fence are a hot mess.

Yeah, Management Studio is the best of the bunch. I’m using Azure Data Studio more at home but still need a couple of plugins to use it often at work. And those two beat pretty much every other tool I’ve ever worked with.

Scripting with Variables in Bash

Kevin Feasel

2019-04-26

Linux

Kellyn Pot’vin-Gorman shows how easy it can be to write Bash scripts with variables:

Let’s start with a use case of deploying a Azure database. When a customer is making the decision to build it out, there are specific information needed to deploy and this will continue to change as the Azure catalog is updated with new offerings. For our example, we’ll stick to a very small snippet of code, as the values we dynamically create will be reused throughout the script. This example will skip past the actual server creation, etc. and just focus on the user database creation. The Server, zone and subscription are all set in the default steps earlier on so as not to have to repeat it throughout each resource deployment step.

There’s a lot to Bash and its programming guide is a lot of sheets of paper (ask me how I know), but this is one of those places where you can get a nice benefit easily.

Using dbatools for Inventory Analysis

Andreas Schubert gives us a way to learn more about our SQL Server inventories with dbatools:

With the multitude of environments that I am operating, it’s impossible to remember every server, every database or the multiple different ways they are interacting with each other. Therefore, one of the first things I do when taking over a consulting engagement is mapping out all those different bits of information.

Since the environments usually change pretty fast, my goal is to automate this process as much as possible.

In this series of posts, I will try to show you how I am implementing this. Of course, your requirements or implementations may differ, but hopefully this blog post can give you some ideas about your tasks too.

Click through for a script. There are also some good comments.

READPAST In Action

Erik Darling shows how READPAST is no panacea:

Locking hints can be really handy in these situations, especially the READPAST hint. The documentation for it says that it allows you to skip over row level locks (that means you can’t skip over page or object level locks).

What it leaves out is that your READPAST query may also need to try to take row level shared locks.

Read on for an example as well as an alternative which ends up being better in this case.

SQL Server 2019 CTP 2.5

The SQL Server team has a new CTP out:

We’re excited to announce the monthly release of SQL Server 2019 community technology preview (CTP) 2.5. SQL Server 2019 is the first release of SQL Server to closely integrate Apache Spark™ and the Hadoop Distributed File System (HDFS) with SQL Server in a unified data platform.

This is a big one for me: lots of changes in Big Data Clusters, PolyBase on Linux, and a Java SDK. Looks like I am going to be pretty busy.

Why Optimize for Ad Hoc Workloads

Randolph West explains why optimize for ad hoc workloads should be enabled by default:

Enabling the optimize for ad hoc workloads configuration setting will reduce the amount of memory used by all query plans the first time they are executed. Instead of storing the full plan, a stub is stored in the plan cache. Once that plan executes again, only then is the full plan stored in memory. What this means is that there is a small overhead for all plans that are run more than once, on the second execution.

Read the whole argument. I don’t know that I’ve seen an instance yet where this setting was a really bad choice.

Categories

April 2019
MTWTFSS
« Mar May »
1234567
891011121314
15161718192021
22232425262728
2930