The big data clusters feature continues to add key capabilities ahead of its initial release in SQL Server 2019. This month’s release extends the feature’s Apache Spark™ functionality by supporting the ability to read from and write to data pool external tables directly, as well as a mechanism to scale compute separately from storage for compute-intensive workloads. Both enhancements should make it easier to integrate Apache Spark™ workloads into your SQL Server environment and leverage each technology’s strengths. Beyond Apache Spark™, this month’s release also includes machine learning extensions with MLeap, which let you train a model in Apache Spark™ and then deploy it for use in SQL Server through the Java extensibility functionality recently released in SQL Server 2019 CTP 3.0. This should make it easier for data scientists to write models in Apache Spark™ and then deploy them into production SQL Server environments, handling both periodic retraining and scoring against the trained model in a single environment.
Click through to learn more about what has changed.
We are expanding the graph database capabilities with several new features. In this blog we discuss one of these features, now available for public preview in SQL Server 2019: Shortest Path, which can be used to find a shortest path between two nodes in a graph. The shortest path function can also be used to compute a transitive closure or to perform arbitrary-length traversals.
With CTP 3.1, we are introducing a new function, SHORTEST_PATH, which can be used inside MATCH to find a shortest path between any two nodes in a graph or to perform arbitrary-length traversals. Users can specify the pattern they would like to search for in the graph using a regular expression-style syntax.
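As a sketch of the syntax, a shortest-path query over a hypothetical `Person`/`friendOf` graph schema (the table and column names here are invented for illustration, not taken from the post) might look like:

```sql
-- Find the shortest chain of friendOf edges from 'Alice' to every reachable person.
-- Person is a node table and friendOf an edge table in a hypothetical graph schema.
SELECT
    Person1.name AS startNode,
    STRING_AGG(Person2.name, '->') WITHIN GROUP (GRAPH PATH) AS friendPath
FROM
    Person AS Person1,
    friendOf FOR PATH AS fo,
    Person FOR PATH AS Person2
WHERE MATCH(SHORTEST_PATH(Person1(-(fo)->Person2)+))
  AND Person1.name = 'Alice';
```

The `+` quantifier asks for one or more hops; per the documentation, a `{1,n}` quantifier can bound the traversal length instead.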
I’d be interested in seeing how well it performs. But it’s good to see graph functionality fleshed out a bit more.
I’m a huge fan of Query Store, which regular readers may know, but there’s a need to write a bit more about Query Store best practices. This isn’t a “you must use this feature” post, this is a “here is what you must know if you want to use this feature” post.
I have a lot of content about Query Store, but maybe what’s really important gets lost amongst everything else. Glenn prompted me to write this after I worked with two customers last week who ran into issues related to their configuration of Query Store. Listed below are the things you must know before you enable Query Store. If you already have Query Store enabled and running, I recommend reviewing the information to ensure you are following best practices.
Click through for the full set of practices and links to additional details.
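For reference, enabling Query Store with explicitly chosen settings rather than the defaults looks something like the following; the database name and every value here are placeholders, so check the linked post for the actual recommendations:

```sql
-- Hypothetical database name and sample values; tune per the linked guidance.
ALTER DATABASE [YourDatabase] SET QUERY_STORE = ON;
ALTER DATABASE [YourDatabase] SET QUERY_STORE
(
    OPERATION_MODE = READ_WRITE,
    MAX_STORAGE_SIZE_MB = 2048,                        -- cap on-disk footprint
    QUERY_CAPTURE_MODE = AUTO,                         -- skip ad hoc noise
    SIZE_BASED_CLEANUP_MODE = AUTO,                    -- purge before hitting the cap
    CLEANUP_POLICY = (STALE_QUERY_THRESHOLD_DAYS = 30) -- retention window
);
```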
Variables with a regression coefficient equal to zero after the shrinkage process are excluded from the model. Variables with non-zero regression coefficients are the ones most strongly associated with the response variable. Therefore, when you build a regression model, it can be helpful to run a lasso regression first in order to determine how many variables your model should contain. This ensures that your model is not overly complex and prevents over-fitting, which can result in a biased and inefficient model.
Read on for demonstrations.
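A minimal sketch of lasso-based variable selection, using scikit-learn on synthetic data (the data, the `alpha` value, and all names are invented for illustration, not from the article):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
n, p = 200, 10
X = rng.normal(size=(n, p))
# Only the first three predictors truly drive the response; the rest are noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

# The L1 penalty shrinks weak coefficients to exactly zero.
model = Lasso(alpha=0.1).fit(X, y)
kept = [i for i, c in enumerate(model.coef_) if abs(c) > 1e-8]
print("selected predictors:", kept)
```

Predictors whose coefficients survive the shrinkage are the candidates to keep in the final model.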
To kick off the Big Data Cluster “Default configuration” creation, we will execute the following PowerShell command:
mssqlctl cluster create
That will first prompt us to accept the license terms. Type y and press Enter.
Mohammad takes us through the default installation, which requires only a few parameters before it can go on its merry way.
Ok, simple enough to fix: we just need to do what the error message says and SET IDENTITY_INSERT ON for both tables:
SET IDENTITY_INSERT dbo.User_DEV ON;
SET IDENTITY_INSERT dbo.StupidQuestions_DEV ON;
And… it still didn’t work:
IDENTITY_INSERT is already ON for table 'IdentityTest.dbo.User_DEV'. Cannot perform SET operation for table 'dbo.StupidQuestions_DEV'.
Click through for the ramifications and your alternative.
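The error reflects SQL Server’s restriction that IDENTITY_INSERT can be ON for only one table per session at a time, so the loads have to run sequentially. A sketch of that workaround, borrowing the table names from the excerpt (the column lists are hypothetical):

```sql
-- IDENTITY_INSERT may be ON for only one table per session,
-- so switch it off before moving to the next table.
SET IDENTITY_INSERT dbo.User_DEV ON;
INSERT INTO dbo.User_DEV (Id, DisplayName)
SELECT Id, DisplayName FROM dbo.[User];
SET IDENTITY_INSERT dbo.User_DEV OFF;

SET IDENTITY_INSERT dbo.StupidQuestions_DEV ON;
INSERT INTO dbo.StupidQuestions_DEV (Id, Title)
SELECT Id, Title FROM dbo.StupidQuestions;
SET IDENTITY_INSERT dbo.StupidQuestions_DEV OFF;
```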
Because deadlocks happen when two tasks permanently block each other, without intervention both processes would simply block forever. Of course, this could never be good in a production system, so it is important that these situations be identified and dealt with in some manner. This is where the SQL Server database engine steps in: it periodically searches the lock manager looking for deadlocks.
Click through for the answer.
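One way to see the deadlocks the engine has detected is to read the xml_deadlock_report events captured by the built-in system_health Extended Events session; a rough sketch:

```sql
-- Pull captured deadlock graphs from the system_health session's file target.
SELECT CAST(event_data AS XML) AS deadlock_graph
FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL)
WHERE object_name = 'xml_deadlock_report';
```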
At this point, extracting the first row from Results, which contains the string to produce in the report, is sufficient. TOPN is the function that extracts that row, but there is a major drawback here: in case of a tie, TOPN returns all the values involved.
In case TOPN returns multiple rows, it is necessary to show them one after the other, so as to make it clear that the result is not unique. CONCATENATEX is an iterator that concatenates strings and produces a single string out of a table.
This is a good demonstration of a useful function.
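A minimal sketch of the pattern, with invented measure, table, and column names (not the article’s model):

```dax
-- Hypothetical measure: return the top product by sales; on a tie,
-- CONCATENATEX lists every tied name instead of an arbitrary single one.
Top Product Name =
VAR Results =
    TOPN ( 1, VALUES ( Product[Name] ), [Sales Amount], DESC )
RETURN
    CONCATENATEX ( Results, Product[Name], ", " )
```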
Whatever the case may be, the desire to control the number of files for a job or query is reasonable – within, ahem, reason – and in general is not too complicated. And it’s often a very beneficial idea.
However, a thorough understanding of distributed computing paradigms like MapReduce (a paradigm Apache Spark follows and builds upon) can help you understand how files are created by parallelized processes. More importantly, you can learn the benefits and consequences of manipulating that behavior, and how to do so properly – or at least without degrading performance.
There’s good advice in here, so check it out.
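In practice, controlling the output file count usually comes down to adjusting the partition count just before the write. A hedged sketch, assuming an existing SparkSession with a DataFrame `df` (the names, target count, and path are all hypothetical):

```python
# Assumes a live SparkSession and an existing DataFrame df; not standalone.
out = df.repartition(8)       # full shuffle into exactly 8 partitions -> 8 output files
# out = df.coalesce(8)        # alternative: merge down to at most 8 partitions, no full shuffle
out.write.mode("overwrite").parquet("/data/output/")
```

`repartition` costs a shuffle but balances the files; `coalesce` is cheaper but can only reduce the partition count and may leave them skewed.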
All you need is the connection string of your database or the location of your source files, plus your CosmosDB keys. If you are using a database as the source, you can shape the data model pretty easily. You might have issues if the source you use does not have a JSON data type: a JSON array might end up looking like a string in CosmosDB because of data type mapping problems.
I am going to use a CSV file as the source in the following example. You can’t define a column as a JSON array in Excel. I have the following two columns in my CSV file. If I import these values into CosmosDB as they are, the coordinates field’s data type will turn into text in CosmosDB. You can go back and try to update the documents in CosmosDB afterward, but that would be an expensive solution.
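To illustrate the mapping problem, here is a hypothetical document (values invented) showing the same field imported as text versus stored as a proper array:

```json
{
  "imported_as_text": { "id": "1", "coordinates": "[28.97, 41.01]" },
  "desired_array":    { "id": "1", "coordinates": [28.97, 41.01] }
}
```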
The one downside to this tool is that it doesn’t work with collections defined using the Mongo API.