SQL Server 2019 CTP 3.1 Released

Anshul Rampal announces CTP 3.1 of SQL Server 2019:

The big data clusters feature continues to add key capabilities for its initial release in SQL Server 2019. This month, the release extends the Apache Spark™ functionality for the feature by supporting the ability to read and write to data pool external tables directly as well as a mechanism to scale compute separately from storage for compute-intensive workloads. Both enhancements should make it easier to integrate Apache Spark™ workloads into your SQL Server environment and leverage each of their strengths. Beyond Apache Spark™, this month’s release also includes machine learning extensions with MLeap where you can train a model in Apache Spark™ and then deploy it for use in SQL Server through the recently released Java extensibility functionality in SQL Server CTP 3.0. This should make it easier for data scientists to write models in Apache Spark™ and then deploy them into production SQL Server environments for both periodic training and full production against the trained model in a single environment.

Click through to learn more about what has changed.

Shortest Path with SQL Server 2019

Kevin Feasel

2019-06-27

Graph

Shreya Verma shows off a new feature in SQL Server 2019 CTP 3.1:

We are expanding the graph database capabilities with several new features. In this blog we discuss one of these features that is now available for public preview in SQL Server 2019Shortest Path, which can be used to find a shortest path between two nodes in a graph. The shortest path function can also be used to compute a transitive closure or for arbitrary length traversals.
 
With CTP3.1, we are introducing a new function SHORTEST_PATH, which can be used inside MATCH to find a shortest path between any 2 nodes in a graph or to perform arbitrary length traversals. Users can specify a pattern they would like to search for in the graph using a regular expression style syntax.

I’d be interested in seeing how well it performs. But it’s good to see graph functionality fleshed out a bit more.

Best Practices for Query Store

Erin Stellato ties together a number of best practices around Query Store:

I’m a huge fan of Query Store, which regular readers may know, but there’s a need to write a bit more about Query Store best practices.  This isn’t a “you must use this feature” post, this is a “here is what you must know if you want to use this feature” post.

I have a lot of content about Query Store, but maybe what’s really important gets lost amongst everything else.  Glenn prompted me to write this, after I worked with two customers last week that ran into issues related to their configuration of Query Store.  Listed below are the things you must know before you enable Query Store.  If you already have Query Store enabled and running, I recommend reviewing the information to ensure you are following best practices.

Click through for the full set of practices and links to additional details.

Lasso and Ridge Regression in Python

Kristian Larsen shows off a few regression techniques using Python:

Variables with a regression coefficient equal to zero after the shrinkage process are excluded from the model. Variables with non-zero regression coefficients variables are most strongly associated with the response variable. Therefore, when you conduct a regression model it can be helpful to do a lasso regression in order to predict how many variables your model should contain. This secures that your model is not overly complex and prevents the model from over-fitting which can result in a biased and inefficient model.

Read on for demonstrations.

Building a Big Data Cluster

Mohammad Darab continues a series on SQL Server Big Data Clusters in Azure Kubernetes Service:

To kick off the Big Data Cluster “Default configuration” creation, we will execute the following Powershell command:

mssqlctl cluster create

That will first prompt us to accept the license terms. Type y and Enter. 

Mohammad takes us through the default installation, which requires only a few parameters before it can go on its merry way.

Identity Inserts: One Table at a Time

Bert Wagner shows that you can only insert with IDENTITY_INSERT = ON for one table at a time:

Ok, simple enough to fix: we just need to do what the error message says and SET IDENTITY_INSERT ON for both tables:

SET IDENTITY_INSERT dbo.User_DEV ON; SET IDENTITY_INSERT dbo.StupidQuestions_DEV ON;

And… it still didn’t work:

IDENTITY_INSERT is already ON for table 'IdentityTest.dbo.User_DEV'. Cannot perform SET operation for table 'dbo.StupidQuestions_DEV'.

Click through for the ramifications and your alternative.

Deadlock Check Frequency

Dave Bland clarifies how frequently deadlock checks occur:

Because deadlocks happen when two task permanently block each other, without a deadlock, both process will simply block forever. Of course this could never be good in a production system.  It is important that these situations be identified and dealt with in some manner.  This is where SQL Server database engine steps in, it is frequently searching the lock manager looking for deadlocks.

Click through for the answer.

CONCATENATEX in DAX

Kevin Feasel

2019-06-26

DAX

Alberto Ferrari shows us how the CONCATENATEX function works in DAX:

At this point, extracting the first row from Results which would contain the string to produce in the report, is sufficient. TOPN is the function extracting that row, but there is a major drawback here: in case of a tie, TOPN returns all values involved.

In case TOPN returns multiple rows, it is necessary to show them one after the other so to make it clear that the result is not unique. CONCATENATEX is an iterator that concatenates strings and produces a single string out of a table. 

This is a good demonstration of a useful function.

Controlling Partition and File Counts in Spark

Landon Robinson shows how we can control the number of partitions (and therefore the number of output files) on reduce-style jobs in Spark:

Whatever the case may be, the desire to control the number of files for a job or query is reasonable – within, ahem, reason – and in general is not too complicated. And, it’s often a very beneficial idea.

However, a thorough understanding of distributed computing paradigms like Map-Reduce (a paradigm Apache Spark follows and builds upon) can help understand how files are created by parallelized processes. More importantly, one can learn the benefits and consequences of manipulating that behavior, and how to do so properly – or at least without degrading performance.

There’s good advice in here, so check it out.

Using the Cosmos DB Data Migration Tool

Kevin Feasel

2019-06-25

Cloud

Hasan Savran shows how you can use the Cosmos DB Data Migration Tool to move data from various sources into Cosmos DB:

All you need is the connection string of your database or location of your source files and your CosmosDB keys. If you are using a database as source, you can format the data model pretty easy. You might have issues if the source you use does not have a JSON data type. JSON Array might look like string in CosmosDB because of data type mapping problems.

I am going to use CSV file as source in the following example. You can’t define a column as JSON array in excel. I have the following two columns in my CSV file. If I import these values into CosmosDB as they are, I see coordinates field’s data type will turn into a text field in CosmosDB. You can go back and try to update them in CosmosDB but that will be an expensive solution.

The one downside to this tool is that it doesn’t work with collections defined using the Mongo API.

Categories

June 2019
MTWTFSS
« May Jul »
 12
3456789
10111213141516
17181920212223
24252627282930