2018-01-05 – Curated SQL

Streaming Analytics With Kafka

Published 2018-01-05 by Kevin Feasel

Rathnadevi Manivannan shows how to use Kafka SQL to query streaming data:

Kafka SQL, a streaming SQL engine for Apache Kafka by Confluent, is used for real-time data integration, data monitoring, and data anomaly detection. KSQL is used to read, write, and process Citi Bike trip data in real-time, enrich the trip data with other station details, and find the number of trips started and ended in a day for a particular station. It is also used to publish trip data from the source to other destinations for further analysis.

In this article, let’s discuss enriching the Citi Bike trip data and finding the number of trips on a particular day to and from a particular station.
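To make the aggregation described above concrete, here is a minimal Python sketch of counting trips started and ended at one station on one day. It is not KSQL and not taken from the linked tutorial: it runs over an in-memory list rather than a stream, and the field names (`start_station_id`, `end_station_id`, `start_time`, `stop_time`) are assumptions chosen for illustration.

```python
from datetime import date, datetime
from typing import Dict, Iterable, Tuple


def daily_trip_counts(
    trips: Iterable[Dict], station_id: int, day: date
) -> Tuple[int, int]:
    """Return (trips started, trips ended) at station_id on the given day.

    Each trip is a dict with hypothetical keys: start_station_id,
    end_station_id, start_time, and stop_time (the last two are datetimes).
    """
    started = 0
    ended = 0
    for trip in trips:
        # A trip counts as "started" here if it left this station on this day.
        if trip["start_station_id"] == station_id and trip["start_time"].date() == day:
            started += 1
        # A trip counts as "ended" here if it arrived at this station on this day.
        if trip["end_station_id"] == station_id and trip["stop_time"].date() == day:
            ended += 1
    return started, ended
```

A streaming engine would maintain these counts incrementally per station and per day window instead of rescanning the trips on each query.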

Read on for a nice tutorial.


Connecting R To Google Sheets

Published 2018-01-05 by Kevin Feasel

Rob Grant shows how to connect to Google Sheets with R:

That was a quick overview of the most basic functions of the google sheets package.

This is a really useful package. A lot of my work involves reading data in Google Sheets either before or after using R.

Googlesheets means I won’t have to bother with read.csv() or write.csv() as much in the future, saving me time.

Click through for a good tutorial.


Parallelization With Rcpp

Published 2018-01-05 by Kevin Feasel

Blazej Moska demonstrates how to use Rcpp to parallelize R code:

One of the frustrating moments while working with data is when you need results urgently, but your dataset is large enough to make it impossible. This happens often when we need to use algorithm with high computational complexity. I will demonstrate it on the example I’ve been working with.

Suppose we have large dataset consisting of association rules. For some reasons we want to slim it down. Whenever two rules consequents are the same and one rule’s antecedent is a subset of second rule’s antecedent, we want to choose the smaller one (probability of obtaining smaller set is bigger than probability of obtaining bigger set).
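The pruning rule in the excerpt can be sketched in a few lines of Python. This is only an illustration of the rule as quoted, not the author's Rcpp implementation; the representation of a rule as an `(antecedent, consequent)` pair of frozensets and the name `prune_rules` are assumptions made for the example.

```python
from typing import FrozenSet, List, Tuple

# A rule is modeled as (antecedent, consequent); this shape is an assumption.
Rule = Tuple[FrozenSet[str], FrozenSet[str]]


def prune_rules(rules: List[Rule]) -> List[Rule]:
    """Keep only rules that are not made redundant by a smaller rule.

    A rule is dropped when another rule has the same consequent and an
    antecedent that is a strict subset of this rule's antecedent.
    """
    kept = []
    for i, (antecedent, consequent) in enumerate(rules):
        redundant = any(
            j != i
            and other_consequent == consequent
            and other_antecedent < antecedent  # strict subset test
            for j, (other_antecedent, other_consequent) in enumerate(rules)
        )
        if not redundant:
            kept.append((antecedent, consequent))
    return kept
```

Every rule is compared against every other rule, so the work grows quadratically with the number of rules. That cost is the kind of thing that makes a compiled or parallel version attractive on a large rule set.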

Read the whole thing.


Database-Scoped Optimize For Ad Hoc Workloads

Published 2018-01-05 by Kevin Feasel

Joe Sack introduces a new database-scoped configuration option:

SQL Server provides the “optimize for ad hoc workloads” server-scoped option that is used to reduce the memory footprint of single use ad hoc batches and associated plans. When enabled at the SQL Server instance scope, the “optimize for ad hoc workloads” option stores a reduced-memory compiled plan stub on the first execution of an ad hoc batch for any database on the instance. This server option has been available in SQL Server for several years now, but until recently there hasn’t been a way to enable this option in Azure SQL Database for individual databases.

We are now introducing a new database scoped configuration called OPTIMIZE_FOR_AD_HOC_WORKLOADS which enables this behavior at the database scope in Azure SQL Database.

I’m not sure if this will make it to the on-prem product, and if it does, I’m not sure how useful it would be in practice. But it is good that we can use it in Azure SQL Database.


VSS Snapshot: Freeze & Thaw

Published 2018-01-05 by Kevin Feasel

Erik Darling points out that VSS backups aren’t instantaneous and can block queries:

Ah, backups. Why are they so tough to get right?

You start taking them, you find out you’re not taking enough of them, or that they’re not the right kind, or that you’re not using checksums or compression, or that you’re not storing them in the right place, or that the storage isn’t redundant.

It’s just like, why won’t someone make this easy?

Then you read about VSS Snaps, and they look so dead simple. You don’t need your DBA Ph.D to use them.

And look how fast they are! Oh how they blaze.

Read the whole thing.


Azure SQL Analytics

Published 2018-01-05 by Kevin Feasel

Arun Sirpal gives an introduction to Azure SQL Analytics:

Please see the prerequisites section within this document – YOU MUST do this else you will not be able to use this feature. https://docs.microsoft.com/en-us/azure/log-analytics/log-analytics-azure-sql#prerequisites

Once setup it should take approximately 15 minutes to start capturing and rendering back some data. Don’t be surprised if it does take a little longer as was the case for myself.

My biggest complaint is about the visuals; otherwise, this looks like the beginning of a solid monitoring solution within Azure SQL Database.


Fixing Orphaned Users In SQL Server

Published 2018-01-05 by Kevin Feasel

Eitan Blumin shares a couple of methods to fix orphaned users in SQL Server:

The most correct solution for this problem, is to have consistent SIDs to your Logins across all your SQL Servers.
So that even when a database is moved to a different server, it could still use the same SID that it was originally created for.
And also, when you recreate a previously deleted Login, you’d need to create it with the same SID that it originally had.

This is, obviously, not a trivial matter, and not always possible.

But if this is a direction that interests you, then you will find the following very useful:

Read on for the best solution, as well as the second-best solution using sp_change_users_login.


Fun With Meltdown And Spectre

Published 2018-01-05 by Kevin Feasel

Brent Ozar looks at some of the consequences of Meltdown and Spectre for SQL Server DBAs:

That’s because some test results have found big slowdowns when the operating system is patched for Meltdown and/or Spectre. These are big vulnerabilities in the processors themselves, and OS vendors are having to make big changes that aren’t tuned for performance yet. Early benchmarks yesterday were showing 30% drops in PostgreSQL performance, but thankfully newer benchmarks have been showing smaller drops. Red Hat’s benchmarks show 3-7% slower analytics workloads, and 8-12% slower OLTP.

Joey D’Antoni has more:

Will This Impact My Performance?

Probably–especially If you are running on virtual hardware. For workloads on bare metal, the security risk is much lower, so Microsoft is offering a registry option to not include the microcode fixes. Longer term especially if you are audited, or allow application code to run on your database servers, you will need to enable the microcode options.

This will likely get better over time as software patches are released, that are better optimized to make fewer calls. Ultimately, this will need to fixed on the hardware side, and we will need a new generation of hardware to completely solve the security issue with a minimum impact.

Allan Hirt has even more:

There are two bugs which are known as Meltdown and Spectre. The Register has a great summarized writeup here – no need for me to regurgitate. This is a hardware issue – nothing short of new chips will eradicate it. That said, pretty much everyone who has written an OS, hypervisor, or software has (or will have) patches to hopefully eliminate this flaw. This blog post covers physical, virtualized, and cloud-based deployments of Windows, Linux, and SQL Server.

The fact every vendor is dealing with this swiftly is a good thing. The problem? Performance will most likely be impacted. No one knows the extent, especially with SQL Server workloads. You’re going to have to test and reset any expectations/performance SLAs. You’ll need new baselines and benchmarks. There is some irony here that it seems virtualized workloads will most likely take the biggest hit versus ones on physical deployments. Time will tell – no one knows yet.

This will have long-term ramifications. We’ll deal with them like we’ve dealt with other issues in the past, but it does seem that, at least for now, there will be some performance hit from this.


Clustering The Power BI Gateway

Published 2018-01-05 by Kevin Feasel

Craig Porteous shows how to cluster the Power BI Data Gateway to allow for disaster recovery:

I love PowerShell and I even wrote a module with functions to query Power BI metadata but there should always be another way to get this vital information.

The documentation I mentioned earlier points you to a PowerShell module file included in the November update. You can load this file & use the commands they provide to get information about your Gateway cluster and its members or make changes to clusters.

If it’s important enough to use, it’s important enough to include in a disaster recovery plan.



Day: January 5, 2018
