Press "Enter" to skip to content

Day: January 6, 2021

Basic Theory on Correlation Analysis, Using R

Petr Baranovskiy wants to take us through the key concepts of correlation analysis, starting with basic theory:

When I was learning statistics, I was surprised by how few learning materials I personally found to be clear and accessible. This might be just me, but I suspect I am not the only one who feels this way. Also, everyone’s brain works differently, and different people would prefer different explanations. So I hope that this will be useful for people like myself – social scientists and economists – who may need a simpler and more hands-on approach.

These series are based on my notes and summaries of what I personally consider some the best textbooks and articles on basic stats, combined with the R code to illustrate the concepts and to give practical examples. Likely there are people out there whose cognitive processes are similar to mine, and who will hopefully find this series useful.

This is clear and well-written, so check it out even if you feel like you have a solid understanding of the topic.

Comments closed

Choosing an ML Algorithm

Hui Li developed a flow for determining appropriate machine learning algorithms:

Since the cheat sheet is designed for beginner data scientists and analysts, we will make some simplified assumptions when talking about the algorithms.

The algorithms recommended here result from compiled feedback and tips from several data scientists and machine learning experts and developers. There are several issues on which we have not reached an agreement and for these issues we try to highlight the commonality and reconcile the difference.

Additional algorithms will be added in later as our library grows to encompass a more complete set of available methods.

Read the whole thing.

Comments closed

Writing a Python Language Extension for ML Services

Niels Berglund shows how you can bring your own Python 3.9 runtime to SQL Server Machine Learning Services:

When I wrote we’d look at it in a future post I thought to myself; “how hard can it be?”. I had read the steps of how to build a Python language extension for Windows here, and it didn’t seem that hard: some Boost, CMake, compile, and Bob’s your uncle! Well, it turned out it was somewhat more complicated than what I anticipated. So, if you are interested – read on!

I was going to say that the steps seem a bit complicated but not overly terrible, though Niels’s conclusion leaves me wondering.

Comments closed

Performance Impact of Foreign Keys with Non-Default ON UPDATE or ON DELETE

Hugo Kornelis continues a dive into foreign keys:

Welcome to part fifteen of the plansplaining series. In the three previous parts I looked at the operators and properties in an execution plan that check a modification doesn’t violate foreign key constraints. That part is done. But I’m not done with foreign keys yet.

We normally expect foreign keys to throw an error on violations. But that’s actually only the default option: they can also be set to be self-correcting. This is done using the ON UPDATE and ON DELETE clauses, which provide the user with several choices on how to handle child data that would become orphaned, and hence violate the constraint, as a result of a change in the parent table.

Read on to see how these operate in SQL Server.

Comments closed

Contrasting Data Warehouses with Power BI Dataflows

Reza Rad makes a comparison:

Dataflow is the data transformation service in Power BI, and also some other Power Platform services. Data Warehouse is the cloud storage and also compute engine for data. I often get this question that: “Now that we have dataflow in Power BI, should we not use the Data warehouse? What are the differences? which is better? When to use what?” This article and video, explains answer to these questions.

I’m probably a bit lower on self-service BI compared to others. When I see something like Dataflows, it reminds me too much of a mess of Excel spreadsheets on shared drives. There’s a lot of relevant business knowledge embedded in those disbursed locations, and bringing it together becomes as much a forensic exercise as it is architectural.

Comments closed

Don’t Fear the tempdb

Erik Darling puts his pants on one leg at a time and once his pants are on, he makes gold records:

One persistent idea is that tempdb is something to be avoided. Either because it was “slow” or to avoid contention.

Granted, if a query has been around long enough, these may have been valid concerns at some point. In general though, temp tables (the # kind, not the @ kind) can be quite useful when query tuning.

Erik is absolutely right in this post. Ceteris paribus I’d rather not directly use tempdb because I’d prefer one query over multiple queries. But once performance comes into question, working on smaller subsets of data one step at a time will typically give you at least an acceptable solution.

Comments closed

Deploying ADS Database Projects Manually

Elizabeth Noble continues a series of videos on database projects in Azure Data Studio:

This week, we’ll talk about one of the easier ways to deploy your database changes. One of the benefits of database projects is that they can generate data-tier applications (DAC). The data-tier applications can be bundled into what is called a DACPAC. This is a collection of files that can be used to deploy your database.

Click through for the video.

Comments closed