Press "Enter" to skip to content

Curated SQL Posts

Extracting Numerical Data Points From Images

Matt Allington visualizes changes in the Gartner Magic Quadrant for BI tools:

Today Gartner released the 2019 Magic Quadrant for Business Intelligence. As expected (by me, at least), Microsoft is continuing its trailblazing and now has a clear lead over Tableau in both ability to execute and completeness of vision. I thought it would be interesting to see a trend over time for the last 5 years, as this is the time period that I have been a professional Power BI Consultant. I needed some way to extract the numerical data points from the images I had collected. This article shows you how to do that. Here is the final output – a scatter chart with a play axis in Power BI, of course.

I was just commenting the other day about how somebody should do this and Matt went and did it.


Clearing sysssislog

Eduardo Pivaral shows that you should clear out some of the bigger SSIS tables occasionally:

If you have SQL Server Integration Services installed on your server and you left the default configuration, a table named sysssislog is created in the msdb database. It contains logging entries for packages executed on that instance.

If you are not careful enough, this table can grow uncontrollably over time and can make subsequent insertions very slow.

This kind of batched delete works for more than just the Integration Services tables; it’s a good plan wherever you have a large table and need to delete a fairly significant number of records from it.
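By way of illustration, here is a minimal sketch of the batched delete pattern. This is not Eduardo's exact script; the three-month retention cutoff and the batch size are assumptions you would tune for your environment:

DECLARE @BatchSize int = 5000;
DECLARE @RowsDeleted int = 1;

WHILE @RowsDeleted > 0
BEGIN
    -- Delete in small chunks so each transaction stays short and
    -- the transaction log stays manageable.
    DELETE TOP (@BatchSize)
    FROM msdb.dbo.sysssislog
    WHERE starttime < DATEADD(MONTH, -3, GETDATE());

    SET @RowsDeleted = @@ROWCOUNT;
END;

Because each loop iteration commits separately, a long-running purge doesn't hold locks on the whole table or bloat the log the way a single giant DELETE would.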


sp_executesql WITH RECOMPILE

Erik Darling points out that running sp_executesql WITH RECOMPILE doesn’t really change anything:

This’ll give us the key lookup plan you see above. If I re-run the query and use the 2010-12-30 date, we’ll re-use the key lookup plan.

That’s an example of how parameters are sniffed.

Sometimes, that’s not a good thing. Like, if I passed in 2008-12-30, we probably wouldn’t like a lookup too much.

One common “solution” to parameter sniffing is to tack a recompile hint somewhere.

Click through for Erik’s demonstration.
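To make the distinction concrete, here is a hypothetical sketch (my own illustration, not Erik's demo; the dbo.Users table and the dates are stand-ins) contrasting the hint that doesn't help with the one that does:

-- WITH RECOMPILE on the outer EXEC applies to sp_executesql itself;
-- as Erik shows, it doesn't stop the inner statement's sniffed plan
-- from being cached and reused.
EXEC sys.sp_executesql
    N'SELECT COUNT(*) FROM dbo.Users WHERE CreationDate >= @d;',
    N'@d datetime',
    @d = '2010-12-30'
WITH RECOMPILE;

-- A recompile hint inside the statement itself is what actually
-- forces a fresh plan on each execution.
EXEC sys.sp_executesql
    N'SELECT COUNT(*) FROM dbo.Users WHERE CreationDate >= @d OPTION (RECOMPILE);',
    N'@d datetime',
    @d = '2008-12-30';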


Azure Data Studio: Extension Installation

Grant Fritchey shows how easy it is to add an extension to Azure Data Studio:

If you’re even thinking about experimenting with, let alone actively using, Azure Data Studio, you need to plan on installing a few extensions. Buck Woody has a great list that you should look through in this blog post. If you’re just getting started with Azure Data Studio, I have an introduction here.

Depending on the extension, this could be as simple as a mouse click. However, not all the extensions are that easy. Let’s explore this just a little so that when you do start using Azure Data Studio, things are easy.

You can reasonably install Management Studio and never think about adding extensions. Don’t do that with Azure Data Studio, though: a lot of the benefit comes from its extensibility. And Microsoft tends to add things as extensions before bringing them into the base product.


Working With Skewed Data In Pig

Dmitry Tolpeko explains how you can use the Weighted Range Partitioner in Apache Pig to work with highly skewed data:

The problem is that 3,000 map tasks are launched to read the daily data, and there are 250 distinct event types, so the mappers will produce 3,000 * 250 = 750,000 files per day. That’s too much.

Reducing the number of map tasks is not always an option as it increases the execution time, and the number of files is still high (500 tasks will produce 125,000 files per day), while our goal is to have about 500-700 files daily for this single table.

A well-known solution for this problem is to force a reduce phase. But what if the data is highly skewed, for example, when there are a few event types with hundreds of millions of rows, while others have just a few million rows?

This is a combination of an interesting problem and an interesting solution.


Spark Streaming Using DStreams Or DataFrames?

Yaroslav Tkachenko contrasts the two methods for operating on data with Spark Streaming:

Spark Streaming went alpha with Spark 0.7.0. It’s based on the idea of discretized streams, or DStreams. Each DStream is represented as a sequence of RDDs, so it’s easy to use if you’re coming from low-level RDD-backed batch workloads. DStreams underwent a lot of improvements over time, but there were still various challenges, primarily because it’s a very low-level API.

As a solution to those challenges, Spark Structured Streaming was introduced in Spark 2.0 (and became stable in 2.2) as an extension built on top of Spark SQL. Because of that, it takes advantage of Spark SQL code and memory optimizations. Structured Streaming also gives very powerful abstractions like the Dataset/DataFrame APIs as well as SQL. No more dealing with RDDs directly!

For me, it’s DataFrames all day. But Yaroslav has a more nuanced answer which is worth reading. There are also a couple of good examples.


Dealing With Massive Database Counts

Edward Harrison describes some of the issues you run into when working on a SQL Server instance with thousands of databases:

One of the headaches we frequently have to deal with is the excessive amount of time it takes for SQL Servers with the discussed design to go through any planned patches. We saw this while carrying out some planned maintenance for one of our customers: the work simply comprised failing over a SQL Server instance from the active node onto the passive node so that the former active node could be patched. The passive node had had its SQL Server patched earlier in the day while the active node was still on the older patch level.

This SQL Server had 10,000 databases, and all that we needed to do was click the buttons to fail over the SQL Server and then wait for it to come online on the other node. In the end, this failover took nearly four hours to complete because upgrade scripts had to be applied to each of the 10,000 databases.

Even a couple thousand databases on an instance can slow things down considerably. Consider an operation which takes 10 milliseconds per database, like laying out the databases in Object Explorer in SSMS. With 10K databases, that 10ms activity (note that I’m making up the number) would take 100 seconds. And these are often blocking tasks, meaning you’re sitting there spinning your wheels for 100 seconds before you can do the thing you wanted to do. The pain is real. And for server activities, most administrative functions (backups, CHECKDB, index rebuilding, etc.) are done one database at a time, meaning you have a lot of time spent on this stuff.
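As a minimal sketch of that serialization (a hypothetical example, not from Edward's post), here is the classic one-database-at-a-time loop; total runtime grows linearly with the number of databases:

DECLARE @db sysname, @sql nvarchar(max);

DECLARE dbs CURSOR LOCAL FAST_FORWARD FOR
    SELECT name
    FROM sys.databases
    WHERE state_desc = 'ONLINE'
      AND database_id > 4;  -- skip the system databases

OPEN dbs;
FETCH NEXT FROM dbs INTO @db;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Each database is processed strictly in sequence, so 10,000
    -- databases means 10,000 serial executions of this step.
    SET @sql = N'DBCC CHECKDB (' + QUOTENAME(@db) + N') WITH NO_INFOMSGS;';
    EXEC (@sql);

    FETCH NEXT FROM dbs INTO @db;
END;

CLOSE dbs;
DEALLOCATE dbs;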


Installing R From Powershell

Tomaz Kastrun shows us how to install R and RStudio via PowerShell:

For the brevity of this post, I will only download a couple of R packages from the CRAN repository, but the list could go on indefinitely.
There are many ways to retrieve the CRAN packages for a particular R version using PowerShell. I will demonstrate this by using the Invoke-WebRequest cmdlet.
Point the cmdlet to the URL https://cran.r-project.org/bin/windows/contrib/3.5, where the list of all packages for this version is available. But first we need to extract the HTML tag where the information is stored.

There’s quite a bit of code here, but the upside is that you get the ability to automate server installs.


Creating Benchmarks With WorkloadTools

Gianluca Sartori announces a new tool:

Throughout my career, I had to go through the pain of benchmarking often enough to get fed up with all the existing tools and decide to code my own. The result of this endeavor is WorkloadTools: a collection of tools to collect, analyze and replay SQL Server workloads, on premises and in the cloud.

At the moment, the project includes 3 tools:
SqlWorkload – a command line tool to capture, replay and analyze a workload
ConvertWorkload – a command line tool to convert existing workloads (traces and extended events) to the format used by SqlWorkload
WorkloadViewer – a GUI tool to visualize and analyze workload data

Click through for the link to check it out.


Power Query SSIS Source Preview

Joost van Rossum is excited about a new SSIS data source in preview:

First you need to download and install the latest SSDT (version 15.9.0), and since it is still in preview, you can only use it within SSDT or on an Azure Integration Runtime in Azure Data Factory.

If you drag the new Power Query Source to your Data Flow canvas and edit it, you can paste your Power Query script from Power BI (or Excel) into the query textbox. So there is not yet(?) an editor, but this should simplify a lot of tasks that previously could only be solved with .NET scripting.

Read on for more details. Because it’s a preview, there’s still a lot of work yet to do, it seems. But on the bright side, it’s new functionality in Integration Services.
