Press "Enter" to skip to content

Author: Kevin Feasel

Miscellany On Java In SQL Server

Niels Berglund continues a series on SQL Server 2019 and Java support:

This post is the fourth post where I look at the Java extension in SQL Server, i.e. the ability to execute Java code from inside SQL Server. The previous three posts are:
SQL Server 2019 Extensibility Framework & Java – Hello World: We looked at installing and enabling the Java extension, as well as some very basic Java code.
SQL Server 2019 Extensibility Framework & Java – Passing Data: In this post, we discussed what is required to pass data back and forth between SQL Server and Java.
SQL Server 2019 Extensibility Framework & Java – Null Values: This Null Values post is a follow-up to the Passing Data post, and we look at how to handle null values in data passed to Java.
This fourth post acts as a “roundup” of miscellaneous “stuff” I did not cover in the three previous posts.

If you haven’t seen the first three, check them out too.
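
On the T-SQL side, the entry point is the same sp_execute_external_script procedure used for R and Python. Here is a minimal sketch, assuming a compiled class has already been deployed where the extension can find it; the package and class names are hypothetical, and the exact conventions the extension expects from the Java class have shifted between CTP builds:

    EXEC sp_execute_external_script
        @language = N'Java',
        @script = N'pkg.HelloWorld',          -- fully qualified name of a deployed Java class (hypothetical)
        @input_data_1 = N'SELECT 1 AS col1';  -- optional result set handed to the Java code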

The State Of Database Scoped Configurations

Niko Neugebauer takes us through the current state of Database Scoped Configurations in SQL Server:

I have already blogged about the first version of the Database Scoped Configurations for SQL Server 2016, with 4 visible options plus the procedure cache cleaning option, but we followed in SQL Server 2017 with 5 (listed) & 9 (in practice – DISABLE_INTERLEAVED_EXECUTION_TVF, DISABLE_BATCH_MODE_ADAPTIVE_JOINS, BATCH_MODE_MEMORY_GRANT_FEEDBACK, BATCH_MODE_ADAPTIVE_JOINS are visible and functioning), and in just another year we have received a huge upgrade to the currently available 21 for SQL Server 2019.

It seems like this is a common route the SQL Server teams are going down, and it makes sense: your settings for Mega-DB probably shouldn’t be the same as for the tiny database in the corner. Oh, and that whole Azure SQL Database thing.
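
For anyone who has not played with these yet, they are set per database rather than per instance. A quick sketch of the moving parts:

    -- Flip a setting for the current database only
    ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;

    -- The procedure cache cleaning option Niko mentions
    ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE;

    -- List current values, including what a readable secondary would use
    SELECT name, value, value_for_secondary
    FROM sys.database_scoped_configurations;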

Querying SSISDB For Errors

Kevin Hill shows us two ways to get at error messages in Integration Services packages running in the SSIS Catalog:

My client makes extensive use of SSIS and deploys the packages to the Integration Services Catalog (ISC), and runs them via hundreds of jobs.
When one of the jobs fails, I have to go get the details.
Job History doesn’t have it.

I’d recommend the query route if you have to do this more than once or twice.
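
If you do go the query route, the SSISDB catalog views are where the details live. Here is a sketch in the spirit of what Kevin describes; the numeric codes are documented values (operation status 4 means failed, message_type 120 means error):

    -- Most recent failed operations in the catalog
    SELECT TOP (10) o.operation_id, o.object_name, o.created_time
    FROM SSISDB.catalog.operations AS o
    WHERE o.status = 4  -- failed
    ORDER BY o.created_time DESC;

    -- Error messages for one of those operations
    SELECT em.message_time, em.package_name, em.message_source_name, em.message
    FROM SSISDB.catalog.event_messages AS em
    WHERE em.operation_id = 12345  -- substitute an operation_id from above
      AND em.message_type = 120    -- errors only
    ORDER BY em.message_time;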

Building A Kubernetes Cluster With Kubespray

Chris Adkin continues a series on Kubernetes clusters:

In essence, Kubespray is a bunch of Ansible playbooks: YAML files that specify what actions should take place against one or more machines listed in a hosts.ini file, which resides in what is known as an inventory. Of all the infrastructure-as-code tools available at the time of writing, Ansible is the most popular and has the greatest traction. Examples of playbooks produced by Microsoft can be found on GitHub for automating tasks in Azure and deploying SQL Server availability groups on Linux. The good news for anyone into PowerShell is that PowerShell modules can be installed via Ansible and PowerShell commands can be executed via Ansible. Also, there are people already using PowerShell desired state configuration with Ansible. Ansible’s popularity is down to the fact that it is easy to pick up and agent-less because it relies on ssh, which is why one of the steps in this post includes the creation of keys for ssh. This free tutorial is highly recommended for anyone wishing to pick up Ansible.

Click through for a step-by-step tutorial.
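
To give a flavor of the inventory Chris describes, here is a minimal hosts.ini sketch; the host names and addresses are made up, and the group names have varied across Kubespray versions:

    node1 ansible_host=192.168.1.11 ip=192.168.1.11
    node2 ansible_host=192.168.1.12 ip=192.168.1.12
    node3 ansible_host=192.168.1.13 ip=192.168.1.13

    [kube-master]
    node1

    [etcd]
    node1

    [kube-node]
    node2
    node3

    [k8s-cluster:children]
    kube-master
    kube-node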

Creating An Azure Storage Account

John Morehouse walks us through setting up an Azure Storage Account through the Azure Portal:

Azure offers a lot of features that enable IT professionals to really enhance their environment. One feature that I really like about Azure is storage accounts. Disk is relatively cheap, and that continues to hold true in the cloud. For less than $100 per month, you could get up to 5TB of storage including redundancy to another Azure region.

Read on to learn how to set up one of these.

Diving Into OPTION(RECOMPILE)

Arthur Daniels explains some of the nuance behind OPTION(RECOMPILE) on T-SQL statements:

SQL Server will compile an execution plan specifically for the statement that the query hint is on. There are some benefits, like something called “constant folding.” To us, that just means that the execution plan might be better than a normal execution plan compiled for the current statement.
It also means that the statement itself won’t be vulnerable to parameter sniffing from other queries in cache. In fact, the statement with OPTION(RECOMPILE) won’t be stored in cache.

Click through for a couple of demos as well as a discussion of positives and negatives regarding its use.
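
To see the sniffing angle concretely, here is a minimal sketch using a local variable (the table and column are made up). Without the hint, the optimizer cannot see the variable’s runtime value and falls back on average-density guesses; with it, the runtime value is embedded into the plan, and that plan is not cached:

    DECLARE @cutoff datetime = '20190101';

    -- Estimate based on density, since @cutoff is unknown at compile time
    SELECT COUNT(*) FROM dbo.Orders WHERE OrderDate >= @cutoff;

    -- Compiled per execution with the runtime value of @cutoff embedded
    SELECT COUNT(*) FROM dbo.Orders WHERE OrderDate >= @cutoff
    OPTION (RECOMPILE);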

MLflow 0.8.1 Released

Aaron Davidson, et al, announce a new version of Databricks MLflow:

When scoring Python models as Apache Spark UDFs, users can now filter UDF outputs by selecting from an expanded set of result types. For example, specifying a result type of pyspark.sql.types.DoubleType filters the UDF output and returns the first column that contains double precision scalar values. Specifying a result type of pyspark.sql.types.ArrayType(DoubleType) returns all columns that contain double precision scalar values. The example code below demonstrates result type selection using the result_type parameter, and a short example notebook illustrates a Spark model being logged and then loaded as a Spark UDF.

Read on for a pretty long list of updates.

File Formats Supported In HDFS

Manoj Pandey covers a few of the file types supported by the Hadoop Distributed File System:

HDFS, or the Hadoop Distributed File System, is the distributed file system provided by the Hadoop Big Data platform. The primary objective of HDFS is to store data reliably even in the presence of node failures in the cluster. This is facilitated with the help of data replication across different racks in the cluster infrastructure. Files stored in HDFS are used for further processing by different data processing engines like Hadoop MapReduce, Hive, Spark, Impala, Pig, etc.

There are a few other formats not included in this list, including RCFile (which has been superseded by both ORC and Parquet), but this hits the highlights.
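
In Hive, at least, the format is a one-line decision at table creation time. A sketch with a made-up table definition:

    -- Same logical table, three on-disk formats
    CREATE TABLE sales_text    (id INT, amount DOUBLE) STORED AS TEXTFILE;
    CREATE TABLE sales_orc     (id INT, amount DOUBLE) STORED AS ORC;
    CREATE TABLE sales_parquet (id INT, amount DOUBLE) STORED AS PARQUET;

    -- Converting is just an insert from one format into another
    INSERT INTO sales_orc SELECT id, amount FROM sales_text;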

SQL Server’s Built-In Monitoring

Jason Brimhall has a three-part series on the types of monitoring built into SQL Server. Part one is an overview and includes the Default Trace:

The default trace by itself is something that can be turned off via a configuration option. There may be good reason to disable the default trace. Before disabling it, please consider what it captures. I will use a query to demonstrate the events and categories that are configured for capture in the default trace.

Part two looks at the system_health Extended Event session:

Beyond being a component of the black box for SQL Server, what exactly is this event session? The system_health session is much as the name implies – it is a “trace” that attempts to gather information about various events that may affect the overall health of the SQL Server instance.
The event session will trap various events related to deadlocks, waits, CLR, memory, schedulers, and reported errors. To get a better grasp of this, let’s take a look at the event session makeup based on the available metadata in the DMVs and catalog views.

Part three is the sp_server_diagnostics stored procedure:

Beyond being a component of the black box for SQL Server, what exactly is this diagnostics process? The sp_server_diagnostics procedure is much as the name implies: a “diagnostics” service that attempts to gather information about various events that may affect the overall health of the SQL Server instance.
The diagnostics process will trap various server-related health (diagnostics) information about the SQL Server instance in an effort to detect potential failures and errors. This diagnostics session/process traps information for five different categories by default. There is a sixth category of information for those special servers that happen to be running an Availability Group.

I’ve used the first two but did not know about the third. Jason goes into good depth on each, showing you the types of information you can get out of these. Read the whole thing.
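
If you want to poke at all three yourself, here is a quick sketch of one way to reach each of them:

    -- 1) Default trace: locate the file and read recent events
    SELECT TOP (100) gt.StartTime, gt.EventClass, gt.DatabaseName, gt.TextData
    FROM sys.traces AS t
    CROSS APPLY sys.fn_trace_gettable(t.path, DEFAULT) AS gt
    WHERE t.is_default = 1
    ORDER BY gt.StartTime DESC;

    -- 2) system_health: pull the ring buffer target as XML
    SELECT CAST(xet.target_data AS xml) AS session_data
    FROM sys.dm_xe_session_targets AS xet
    JOIN sys.dm_xe_sessions AS xes
        ON xes.address = xet.event_session_address
    WHERE xes.name = N'system_health'
      AND xet.target_name = N'ring_buffer';

    -- 3) sp_server_diagnostics: returns one row per health component
    EXEC sys.sp_server_diagnostics;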
