Press "Enter" to skip to content

Curated SQL Posts

Basic Data Tidying

Sarah Dutkiewicz tidies up a data set in R:

Looking at this data, the first thing I thought was untidy. There has to be a better way. When I think of tidy data, I think of the tidyr package, which is used to help make data tidy, easier to work with. Specifically, I thought of the spread() function, where I could break things up. Once data was spread into appropriate columns, I figure I can operate on the data a bit better.

Sarah has also made the data set available in case you’re interested in following along.

Comments closed

Interpreting Regression Coefficients

Steph Locke explains what beta values on parameters in a regression actually signify:

When we read the list of coefficients, here is how we interpret them:

  • The intercept is the starting point – so if you knew no other information it would be the best guess.

  • Each coefficient multiplies the corresponding column to refine the prediction from the estimate. It tells us how much one unit in each column shifts the prediction.

  • When you use a categorical variable, in R the intercept represents the default position for a given value in the categorical column. Every other value then gets a modifier to the base prediction.

Linear regression is easy, but the real value here is Steph’s explanation of logistic regression coefficients.

Comments closed

Azure SQL Data Warehouse Max Size Property

Joey D’Antoni says not t freak out if you look at the Max Size property on an Azure SQL Data Warehouse database:

One of the promises of Azure SQL Data Warehouse is the ability to have petabyte scale. The ability to quickly scale data, and have that data scale independently of compute resources. So when one I my clients emailed me yesterday with this screenshot, needless to say I was concerned.

As you can see, when  the properties screen shows a max size of 5 GB.

Click through for the reason why.

Comments closed

Dealing With 404 Errors In Power BI Query Editor

Callum Green shows how to deal with a scenario when you try to retrieve data for a particular row but get a 404 error:

The error message is a little misleading but let’s save the debugging debate for another day. The key observation is “Guildford” data is not available, simply because it comes after “Camberley” in the list. Whilst we want to see errors in a Query, we do not want them causing data loss.

Resolution

As I mentioned at the beginning of this article, using the Remove Errors function would prevent the loss of Guildford data. However, the user needs to handle errors as Unknown Members and conform to a typical Kimball Data Warehouse.

I am sure there are many ways to fulfil the requirement, but here is how I approached it:

Read on for the resolution.

Comments closed

Creating Graph Objects In SQL Server

Steve Jones creates a simple graph relationship in SQL Server 2017:

What does all that mean? No idea. Clearly there is JSON that’s returned here and can be deserialized to gather meanings. Is this useful? I think graphs solve a certain set of problems very well, and more efficiently than relational systems. Certainly I could implement a graph structure relationally, but at scale I’m not sure the queries would be as easy to write or run as quickly.

I don’t know if I’d use a graph structure in any of the problems we try to solve in the SQLServerCentral app, but who knows. Maybe we would if we could.

Steve leaves this with more questions than answers, but he does give a very simple repro script if you want to futz about with graphs.

Comments closed

Adaptive Join Internals

Dmitry Pilugin digs deep into how adaptive joins work in SQL Server 2017:

We have three types of physical join algorithms in SQL Server: hash, nested loops and merge. Adaptive join allows SQL Server automatically choose an actual physical algorithm on the fly between the first two – hash (HM) and nested loops (NL).

NL has two join strategiesnaive nested loops join (inner loop scans the whole inner table or index) and index nested loops join (index on the join column of the inner table is used to find necessary rows and then those rows are applied to the outer row, also called Nested Loops Apply). Typically, the second one performs very well if you have rather small input on the outer side and indexed rather big input on the inner side.

HM is more universal and uses hash algorithms to match rows, so no indexes are necessary. You may refer to my blog post Hash Join Execution Internals for more details.

Adaptive Join starts execution as a Hash Join. It consumes all the input of the build phase and looks at the adaptive join threshold, if the number of rows is more or equal this threshold it will continue as a hash join. However, if the number of rows is less than this threshold, it will switch to a NL.

If you want to get a better understanding of how adaptive joins works, Dmitry’s post is a great start.

Comments closed

K-Means Clustering In R

Raghavan Madabusi provides an example of how k-means clustering can help segment data points in an understandable manner:

Call Detail Record (CDR) is the information captured by the telecom companies during Call, SMS, and Internet activity of a customer. This information provides greater insights about the customer’s needs when used with customer demographics. Most of the telecom companies use CDR information for fraud detection by clustering the user profiles, reducing customer churn by usage activity, and targeting the profitable customers by using RFM analysis.

In this blog, we will discuss about clustering of the customer activities for 24 hours by using unsupervised K-means clustering algorithm. It is used to understand segment of customers with respect to their usage by hours.

For example, customer segment with high activity may generate more revenue. Customer segment with high activity in the night hours might be fraud ones.

This article won’t really explain k-means clustering in any detail, but it does give you an example to apply the technique using R.

Comments closed

Monitoring Spark And Kafka

Larry Murdock gives some hints on monitoring Kafka topics and their associated Spark jobs:

Besides alerting for the hardware health, monitoring answers questions about the health of the overall distributed data pipeline. The Site Reliability Engineering book identifies “The Four Golden Signals” as the minimum of what you need to be able to determine: latency, traffic, errors, and saturation.

Latency is the time it takes for work to happen. In the case of data pipelines, that work is a message that has gone through many systems. To time it, you need to have some kind of work unit identifier that is reflected in the metrics that happen on the many segments of the workflow. One way to do this is to have an ID on the message, and have components place that ID in their logs. Alternatively, the messaging system itself could manage that in metadata attached to the messages.

Traffic is the demand from external sources, or the size of what is available to be consumed. Measuring traffic requires metrics that either specifically mean a new arrival or a new volume of data to be processed, or rules about metrics that allow you to proxy the measure of traffic.

Errors are particularly tricky to monitor in data pipelines because these systems don’t typically error out on the first sign of trouble. Some errors in data are to be expected and are captured and corrected. However, there are other errors that may be tolerated by the pipeline, but need to be feed into the monitoring system as error events. This requires specific logic in an application’s error capture code to emit this information in a way that will be captured by the monitoring system.

Saturation is the workload consuming all the resources available for doing work. Saturation can be the memory, network, compute, or disk of any system in the data pipeline. The kinds of indicators that we discussed in the previous post on tuning are all about avoiding saturation.

Larry then applies these concepts and gives links to some useful tools.

Comments closed