Curated SQL Posts

Monitoring Big Data Clusters

Mohammad Darab continues a series on Big Data Clusters:

There are many ways to view the health of your Big Data Cluster. As of CTP 3.0, there are kubectl commands, mssqlctl commands as well as dashboards. For the sake of this series, I will focus on the dashboards. I will blog about some of the useful kubectl and mssqlctl commands in later posts.

The first dashboard is the Microsoft Cluster Administration portal (see the snapshot below). This is a view into the Big Data Cluster Controller. As you can see from the image below, the Overview pane shows the Controller, Master Instance and all the pools. On the left-hand side you can see more details. If you click on the “Service Endpoint” option, you will see a list of endpoints that you can bookmark.

Something I appreciate is that Microsoft thought ahead on what the monitoring story should look like rather than waiting until the end and slapping something together.

Simplifying Columnstore

Monica Rathbun takes us through concepts behind columnstore indexes:

Now, I admit when these first were introduced in SQL Server 2012 I found them very intimidating (additionally, you couldn’t update them directly). For me, anytime you say columnstore, my mind tends to set off alarms saying wait stay away, this is too complicated. So, in this post I am going to try and simplify the feature for you.

To do that, you first need to understand some terminology and the difference between a columnstore index and a rowstore index (the normal kind we all use). Let’s start with the terminology.

There are some interesting complications around columnstore indexes, but for analytical or warehousing queries, they’re excellent.

More Testing of Inline Scalar UDFs

Erik Darling makes a FROIDian slip:

The idea behind FROID is that it removes some restrictions around scalar-valued functions.

1. They can be inlined into the query, not run per-row returned
2. They don’t force serial execution, so you can get a parallel plan

If your functions already run pretty quickly over a small number of rows, and the calling query doesn’t qualify for parallelism, you may not see a remarkable speedup.

Even in that case, Erik argues that you can still get some benefits from SQL Server 2019 bringing those scalar UDFs inline.

SQL Server 2019 CTP 3.1 Released

Anshul Rampal announces CTP 3.1 of SQL Server 2019:

The big data clusters feature continues to add key capabilities for its initial release in SQL Server 2019. This month, the release extends the Apache Spark™ functionality for the feature by supporting the ability to read and write to data pool external tables directly as well as a mechanism to scale compute separately from storage for compute-intensive workloads. Both enhancements should make it easier to integrate Apache Spark™ workloads into your SQL Server environment and leverage each of their strengths. Beyond Apache Spark™, this month’s release also includes machine learning extensions with MLeap where you can train a model in Apache Spark™ and then deploy it for use in SQL Server through the recently released Java extensibility functionality in SQL Server 2019 CTP 3.0. This should make it easier for data scientists to write models in Apache Spark™ and then deploy them into production SQL Server environments for both periodic training and full production against the trained model in a single environment.

Click through to learn more about what has changed.

Shortest Path with SQL Server 2019

Shreya Verma shows off a new feature in SQL Server 2019 CTP 3.1:

We are expanding the graph database capabilities with several new features. In this blog we discuss one of these features that is now available for public preview in SQL Server 2019: Shortest Path, which can be used to find a shortest path between two nodes in a graph. The shortest path function can also be used to compute a transitive closure or for arbitrary length traversals.
 
With CTP 3.1, we are introducing a new function SHORTEST_PATH, which can be used inside MATCH to find a shortest path between any two nodes in a graph or to perform arbitrary length traversals. Users can specify a pattern they would like to search for in the graph using a regular expression style syntax.
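
To make the idea concrete, here is a minimal Python sketch of what a shortest-path search over an unweighted graph computes. This is my own illustration using a plain breadth-first search, not SQL Server's implementation and not the SHORTEST_PATH syntax; the `friends` graph and the `shortest_path` helper are hypothetical names.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Return one shortest path from start to goal as a list of nodes, or None.

    edges maps each node to the nodes it points at (a directed, unweighted graph).
    Breadth-first search visits nodes in order of hop count, so the first time
    we reach the goal we have found a path with the fewest edges.
    """
    previous = {start: None}  # node -> node we arrived from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the "previous" links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in edges.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None  # goal is not reachable from start


# Hypothetical "friend of" edges, purely for illustration.
friends = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Dave"],
    "Carol": ["Dave"],
    "Dave": ["Erin"],
}
path = shortest_path(friends, "Alice", "Erin")
```

The same traversal, run without a goal, would collect every node reachable from the start, which is the transitive closure idea mentioned in the excerpt.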

I’d be interested in seeing how well it performs. But it’s good to see graph functionality fleshed out a bit more.

Best Practices for Query Store

Erin Stellato ties together a number of best practices around Query Store:

I’m a huge fan of Query Store, which regular readers may know, but there’s a need to write a bit more about Query Store best practices. This isn’t a “you must use this feature” post; this is a “here is what you must know if you want to use this feature” post.

I have a lot of content about Query Store, but maybe what’s really important gets lost amongst everything else. Glenn prompted me to write this, after I worked with two customers last week that ran into issues related to their configuration of Query Store. Listed below are the things you must know before you enable Query Store. If you already have Query Store enabled and running, I recommend reviewing the information to ensure you are following best practices.

Click through for the full set of practices and links to additional details.

Lasso and Ridge Regression in Python

Kristian Larsen shows off a few regression techniques using Python:

Variables with a regression coefficient equal to zero after the shrinkage process are excluded from the model. Variables with non-zero regression coefficients are the ones most strongly associated with the response variable. Therefore, when you build a regression model, it can be helpful to run a lasso regression first to gauge how many variables your model should contain. This ensures that your model is not overly complex and prevents over-fitting, which can result in a biased and inefficient model.
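
As a rough illustration of how shrinkage drives a coefficient to exactly zero, here is a small pure-Python lasso sketch using coordinate descent with soft-thresholding. It is my own example rather than Kristian's code (a real analysis would more likely use a library), and the toy data, the penalty value, and the function names are assumptions chosen for the demonstration.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; anything within [-lam, lam] becomes 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1 / (2n)) * ||y - X b||^2 + lam * ||b||_1, one coefficient at a time.

    X is a list of rows, y a list of responses. Assumes no all-zero column.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of column j with the partial residual
            z = 0.0    # squared norm of column j
            for i in range(n):
                others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n)
    return beta


# Toy data: the response depends strongly on the first column and only
# very weakly on the second.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [3 * a + 0.1 * b for a, b in X]

beta = lasso_coordinate_descent(X, y, lam=0.5)
```

With this data, the weak second variable's coefficient is shrunk all the way to zero and drops out of the model, while the strong first variable keeps a non-zero (but reduced) coefficient.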

Read on for demonstrations.

Building a Big Data Cluster

Mohammad Darab continues a series on SQL Server Big Data Clusters in Azure Kubernetes Service:

To kick off the Big Data Cluster “Default configuration” creation, we will execute the following PowerShell command:

mssqlctl cluster create

That will first prompt us to accept the license terms. Type y and press Enter.

Mohammad takes us through the default installation, which requires only a few parameters before it can go on its merry way.

Identity Inserts: One Table at a Time

Bert Wagner shows that only one table in a session can have IDENTITY_INSERT set to ON at a time:

Ok, simple enough to fix: we just need to do what the error message says and SET IDENTITY_INSERT ON for both tables:

SET IDENTITY_INSERT dbo.User_DEV ON;
SET IDENTITY_INSERT dbo.StupidQuestions_DEV ON;

And… it still didn’t work:

IDENTITY_INSERT is already ON for table 'IdentityTest.dbo.User_DEV'. Cannot perform SET operation for table 'dbo.StupidQuestions_DEV'.

Click through for the ramifications and your alternative.

Deadlock Check Frequency

Dave Bland clarifies how frequently deadlock checks occur:

Deadlocks happen when two tasks permanently block each other; without deadlock detection, both processes would simply block forever. Of course, this could never be good in a production system. It is important that these situations be identified and dealt with in some manner. This is where the SQL Server database engine steps in: it frequently searches the lock manager looking for deadlocks.
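
For a sense of what "looking for deadlocks" can mean, here is a small Python sketch of one textbook approach: treat blocked tasks as a wait-for graph and look for a cycle. This is my own illustration and makes no claim about how SQL Server's lock monitor is implemented; the task names and the `find_deadlock` helper are hypothetical.

```python
def find_deadlock(waits_for):
    """Return a list of tasks forming a wait cycle, or None if there is none.

    waits_for maps each task to the tasks it is currently blocked by.
    A cycle (A waits on B, B waits on A) means none of them can ever proceed.
    """
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def visit(task, path):
        state[task] = IN_PROGRESS
        path.append(task)
        for blocker in waits_for.get(task, ()):
            blocker_state = state.get(blocker, UNVISITED)
            if blocker_state == IN_PROGRESS:
                # We came back to a task already on the current path: a cycle.
                return path[path.index(blocker):]
            if blocker_state == UNVISITED:
                cycle = visit(blocker, path)
                if cycle:
                    return cycle
        path.pop()
        state[task] = DONE
        return None

    for task in waits_for:
        if state.get(task, UNVISITED) == UNVISITED:
            cycle = visit(task, [])
            if cycle:
                return cycle
    return None


# Two sessions blocking each other: a deadlock.
deadlocked = find_deadlock({"session_51": ["session_52"], "session_52": ["session_51"]})

# Ordinary blocking: session_51 waits, but session_52 is free to finish.
blocked_only = find_deadlock({"session_51": ["session_52"], "session_52": []})
```

The second case is the important contrast: plain blocking resolves on its own once the blocker finishes, whereas a cycle never does, which is why something has to go looking for it.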

Click through for the answer.
