Press "Enter" to skip to content

Curated SQL Posts

Default Schemas In SQL Server

Daniel Hutmacher looks at specifying default schemas on a database:

If your user is a database owner (i.e., is a member of the db_owner role or has CONTROL permissions on the database), the default schema will always be dbo. This is something you can’t change.

So if your legacy application needs quasi-administrative privileges in the database, you can’t make it a database owner, but you can grant those permissions on the schema instead (which is actually a better idea anyway).

What Daniel is doing is akin to the pre-2005 concept of user spaces, where Bob had a schema and Mary had a schema and Jill had a schema and so forth.
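
To make that concrete, here is a minimal sketch of the pattern Daniel describes; the login, user, and schema names are hypothetical:

-- a minimal sketch; the login, user, and schema names are hypothetical
CREATE SCHEMA LegacyApp;
GO
CREATE USER LegacyAppUser FOR LOGIN LegacyAppLogin WITH DEFAULT_SCHEMA = LegacyApp;
GO
-- quasi-administrative rights over the schema, without db_owner membership
GRANT CONTROL ON SCHEMA::LegacyApp TO LegacyAppUser;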


Automating Temporal Table Creation

Bill Fellows got an increasingly specific set of requirements about data collection:

This post is another in the continuing theme of “making things consistent.” We were voluntold to help another team get their staging environment set up. Piece of cake, SQL Compare made it trivial to snap the tables over.

Oh, we don’t want these tables in Custom schema, we want them in dbo. No problem, SQL Compare again and change owner mappings and bam, out come all the tables.

Oh, can we get this in near real-time? Say every 15 minutes. … Transaction replication to the rescue!

Oh, we don’t know what data we need yet so could you keep it all, forever? … Temporal tables to the rescue?

Yes, temporal tables are perfect. But don’t put the history table in the same schema as the table; put it in this one. And put all of that in its own filegroup.

Click through for a helpful script, and tune in next time, when the other team has Bill move their furniture around.  Maybe move the couch just a hair to the right…no, a little more, oops, too much…
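
For a rough idea of what that end state looks like, here is a sketch under those requirements; the table, schema, and filegroup names are hypothetical, and this is not Bill's script:

-- hypothetical names; assumes a History schema and a HistoryFG filegroup already exist
-- pre-create the history table on its own filegroup...
CREATE TABLE History.Customer
(
    CustomerID   int           NOT NULL,
    CustomerName nvarchar(200) NOT NULL,
    ValidFrom    datetime2     NOT NULL,
    ValidTo      datetime2     NOT NULL
) ON [HistoryFG];
GO
CREATE CLUSTERED INDEX IX_Customer_History
    ON History.Customer (CustomerID, ValidTo, ValidFrom) ON [HistoryFG];
GO
-- ...then point the system-versioned table at it
CREATE TABLE dbo.Customer
(
    CustomerID   int           NOT NULL CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED,
    CustomerName nvarchar(200) NOT NULL,
    ValidFrom    datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo      datetime2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = History.Customer));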


Window Function Basics

Doug Kline has a new series on window functions.  First, he looks at differences between RANK, DENSE_RANK, and ROW_NUMBER:

-- Quick! What’s the difference between RANK, DENSE_RANK, and ROW_NUMBER?

-- in short, they are only different when there are ties…

-- here’s a table that will help show the difference
-- between the ranking functions

-- note the [Score] column,
-- it will be the basis of the ranking
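
To see that difference in miniature, here is a quick sketch with made-up data (not Doug's table):

-- a minimal sketch with made-up data; note the tie on Score = 90
SELECT PlayerName,
       Score,
       RANK()       OVER (ORDER BY Score DESC) AS RankValue,
       DENSE_RANK() OVER (ORDER BY Score DESC) AS DenseRankValue,
       ROW_NUMBER() OVER (ORDER BY Score DESC) AS RowNumberValue
FROM (VALUES ('Ann', 95), ('Bob', 90), ('Cat', 90), ('Dan', 80)) AS s(PlayerName, Score);
-- RANK returns 1, 2, 2, 4; DENSE_RANK returns 1, 2, 2, 3; ROW_NUMBER returns 1, 2, 3, 4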

Then, he starts looking at how to build a window function, starting with the OVER clause:

-- here’s a simple SELECT statement from the Products table

SELECT ProductName,
       UnitPrice
FROM Products
ORDER BY UnitPrice DESC

-- this shows that the highest priced product is Cote de Blaye, productID 38

-- but sometimes the *relative* price is more important than the actual price
-- in other words, we want to know how products *rank*, based on price
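
Picking up where that comment leaves off, a minimal sketch of ranking by price (mine, not Doug's) looks like this:

-- same Products query, with a rank based on price
SELECT ProductName,
       UnitPrice,
       RANK() OVER (ORDER BY UnitPrice DESC) AS PriceRank
FROM Products
ORDER BY PriceRank;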

Doug’s posts consist entirely of T-SQL scripts along with embedded videos.


Azure Data Lake Store File Management With httr

Leila Etaati shows how to make REST API calls in R using httr:

In this post, I am going to share my experiment in how to do file management in ADLS using RStudio.

To do this, you need the following items:

1. An Azure subscription

2. An Azure Data Lake Store account

3. An Azure Active Directory application (for service-to-service authentication)

4. An authorization token from the Azure Active Directory application

It’s pretty easy to do, as Leila shows.


Using R In Azure Data Lake Analytics

David Smith links to a tutorial which shows how to use R against Azure Data Lake Analytics:

The Azure Data Lake store is an Apache Hadoop file system compatible with HDFS, hosted and managed in the Azure Cloud. You can store and access the data within directly via the API, by connecting the filesystem directly to Azure HDInsight services, or via HDFS-compatible open-source applications. And for data science applications, you can also access the data directly from R, as this tutorial explains.

To interface with Azure Data Lake, you’ll use U-SQL, a SQL-like language extensible using C#. The R Extensions for U-SQL allow you to reference an R script from a U-SQL statement, and pass data from Data Lake into the R Script. There’s a 500Mb limit for the data passed to R, but the basic idea is that you perform the main data munging tasks in U-SQL, and then pass the prepared data to R for analysis. With this data you can use any function from base R or any R package. (Several common R packages are provided in the environment, or you can upload and install other packages directly, or use the checkpoint package to install everything you need.) The R engine used is R 3.2.2.

Click through for the details.


Linear Regression With Deducer

Sunil Kappal demonstrates how to use Deducer, a GUI for R, to perform a simple linear regression:

Selecting the variables in the Deducer GUI:

  • Outcome variable: Y, or the dependent variable, should be put on this list

  • As numeric: Independent variables that should be treated as covariates should be put in this section. Deducer automatically converts a factor into a numeric variable, so make sure that the order of the factor level is correct

  • As factor: Categorical independent variables (language, ethnicity, etc.).

  • Weights: This option allows the users to apply sampling weights to the regression model.

  • Subset: Helps to define if the analysis needs to be done within a subset of the whole dataset.

Deducer is open source and looks like a pretty decent way of seeing what’s available to you in R.


Testing Azure Data Lake Store Performance

Zhen Zeng and Govind Kamat stress test Azure Data Lake Store:

Now that we know the read and write throughput characteristics of a single Data Node, we would like to see how per-node performance scales when the number of Data Nodes in a cluster is increased.

The tool we use for scale testing is the Tera* suite that comes packaged with Hadoop.  This is a benchmark that combines performance testing of the HDFS and MapReduce layers of a Hadoop cluster.  The suite is comprised of three tools that are typically executed in sequence:

  • TeraGen, the tool that generates the input data.  We use it to test the write performance of HDFS and ADLS.

  • TeraSort, which sorts the input data in a distributed fashion.  This test is CPU bound and we don’t really use it to characterize the I/O performance of HDFS and ADLS, but it is included for completeness.

  • TeraValidate, the test that reads and validates the sorted data from the previous stage.  We use it to test the read performance of HDFS and ADLS.

It’s an interesting look at how well ADLS scales.  In general, my reading of this is fairly positive for Azure Data Lake Store.


New Management Studio Features

Wayne Sheffield looks at two new features in SQL Server Management Studio 17.3:

For this simple test, it worked pretty well, and it should work well for most of the requirements that you have. Time will tell how reliably this new feature does work.

The Import Flat File is available when connecting to SQL Server version 2005 or higher. I haven’t tried this on a lower version, but I don’t see any reason why it wouldn’t work there either. You can read more about this feature in Microsoft’s documentation.

You can definitely break the Import Flat File feature, but I appreciate it being smoother than the SSIS-based wizard of yore.  Wayne also shares his thoughts on the Extended Events Profiler.


Azure Database-Level Firewall Rules And Geo-Replication

Arun Sirpal explains that you don’t need to create database-level firewall rules in Azure on secondary databases when using Active Geo-Replication:

The main purpose of this post today is to discuss this point – If you have an Azure SQL Database involved in Active Geo Replication and opt to use database level firewall rules do you need to create the rules in both the primary and secondary database?

I thought so, but I was wrong. I connect to my primary database and run the following (obfuscated).

Read on for Arun’s demonstration.
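
For reference, database-level firewall rules are managed with T-SQL system procedures like these; the rule name and IP addresses below are placeholders:

-- the rule name and IP addresses are placeholders
EXECUTE sp_set_database_firewall_rule
    @name = N'AppServerRule',
    @start_ip_address = '203.0.113.10',
    @end_ip_address = '203.0.113.10';

-- list the database-level rules on the current database
SELECT name, start_ip_address, end_ip_address
FROM sys.database_firewall_rules;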
