Finding Clusters Of Queries Using R

Kevin Feasel

2017-01-09

R

Tomaz Kastrun shows how to use R to find clusters of queries which behave similarly:

So the R code said that, there are three clusters generating And I used medians to generate data around it. In addition I have also tested the result with Partitioning around medoids (which is opposite to hierarchical clustering) and the results from both techniques yield clean clusters.

Clustering models can be powerful for discovering commonalities, and that might help you find a number of queries which all behave in some sub-optimal way without having to trawl through every procedure’s code.

Submitting A Spark Job On HDInsight

Bharath Venkatesh shows different ways to run a Spark job on HDInsight:

From HDI 3.5 onwards, our clusters come preinstalled with Zeppelin Notebooks. Much like Jupyter notebooks, Zeppelin is a web-based notebook that enables interactive data analytics. It provides built-in Spark intergration that allows for:

  • Automatic SparkContext and SQLContext injection
  • Runtime jar dependency loading from local filesystem or maven repository. Learn more about dependency loader.
  • Canceling job and displaying its progress

This MSDN article provides a quick easy-to-use onboarding guide to help get acclimatized to Zeppelin. You can also try several applications that come pre-installed on your cluster to get hands on experience of Zeppelin.

Zeppelin is probably my favorite method, but there are good reasons to use all of these.

Understanding Data Integration Lifecycle Management

Andy Leonard explains DILM:

Data Integration Lifecycle Management (DILM) is not about data integration development.

DILM is about everything else:

  • Configurations Management
  • Version Management
  • Deployment
  • Execution

Although DILM is not about development, implementing DILM will impact the design of SSIS solutions.

This is the first part in a multi-part series, and covers some of the conceptual basicsbehind DILM.

Reporting Services Mobile Reports

Kathi Kellenberger shows off mobile reports in SQL Server Reporting Services:

Mobile Reports are dashboards that will run on most modern mobile devices as well as within the web portal. They are supported on IOS 9 and later, Android 4.4 or better, and Windows 10. To run Mobile Reports on these devices, the mobile Power BI application must be installed.

At first glance, they are simple to create. There is a new tool to use, the SQL Server Mobile Report Publisher. The tool will look familiar to you if you have worked with Datazen in the past. Microsoft purchased Datazen in 2015.

This is the first major Reporting Services update since 2008 (unless you consider sparkline support in R2 a major update), and could be a good business justification for upgrading to SQL Server 2016.

Columnstore Query Patterns

Niko Neugebauer gives a couple hints on how to make columnstore queries as fast as possible:

On my VM with 4 cores it takes 33 seconds to execute this query on SQL Server 2016 with Service Pack 1, while it burns almost 48 seconds of the CPU Time.
The relevant part of the execution plan can be found below, showing so many performance problems that this query is suffering, such as INNER LOOP JOIN, INDEX SPOOL, besides even worse part that is actually hidden and is identifiable only once you open the properties of any of the lower tree (left side of the LOOP JOIN), seeing that it all runs with the Row Execution Mode actually.

To show you the problem, on the left side you will find the properties of the sort iterator that is to be found in the lower (left) part of the LOOP Join that was executed around 770.000 times in the Row Execution Mode, effectively taking any chances away from this query to be executed in a fast way. One might argue that it might that it might be more effective to do the loop part in Row Mode, but given that we are sorting around 3.1 Million Rows there – for me there is no doubt that it would be faster to do it within a Batch Execution Mode. Consulting the last sort iterator in the execution plan (TOP N SORT), you will find that it is running with the help of the Batch Execution Mode, even though it is processing around 770.000 rows.

There’s some valuable information here.

Changing TDE Certificates

Tom Norman has a script to change the certificate used for TDE:

When you create TDE certificate by default, the certificate is just created for a year. Depending upon your auditor, you may be asked to change this TDE certificate yearly.  In this blog post, you can watch when your TDE certificate expires using policy-based management, https://www.mssqltips.com/sqlservertip/2388/internal-sql-server-certificates-have-expired-is-this-an-issue/.

Click through for the script.

Uses For Binary Data Types

Daniel Hutmacher explains what binary data types are and one use case:

Because binary values are essentially strings, they easily convert to and from character strings, using CAST or CONVERT. To convert the binary value of 0x41 to a plain-text character value, try

SELECT CAST(0x41 AS char(1)); --- 'A'

The binary value 0x41 is equivalent to decimal 65, and CHAR(65) is the letter “A”. Note that I haven’t placed any quotes around 0x41 – that’s because it’s a numeric value (albeit in hex notation) and not a string.

A couple use cases I’ve seen are creating hashes (SHA1 or MD5) for change detection, storing password hashes, and encrypted columns—Always Encrypted uses varbinary data types to store encrypted information, for example.

Getting A Handle On Isolation Levels

Robert Bishop explains some of the basics behind transaction isolation levels:

Lost Updates

It is possible for two or more transactions to modify the same row. While the subsequent transactions are reading the data, the 1st transaction commits its changes. Then the subsequent transaction makes its changes and commits, thereby possibly losing the changes committed by the 1st transaction. In this scenario, the last transaction always wins.

Dirty Read

A dirty read is when a SELECT statement will read against a table that is currently being modified and not yet committed to the database. By default, SQL Server will protect against this on all isolation levels, except for Read Uncommitted or by using NOLOCK in your queries.

This is the first in a series.

Categories

January 2017
MTWTFSS
« Dec Feb »
 1
2345678
9101112131415
16171819202122
23242526272829
3031