Kevin Feasel – Page 1203

When Window Functions are Too Slow

Published 2019-04-18 by Kevin Feasel

Bert Wagner shows a scenario where a window function ends up performing poorly:

If you’ve used FIRST_VALUE before, this query should be easy to interpret: for each badge Name, return the first UserId sorted by Date (earliest date to receive the badge) and UserId (pick the lowest UserId when there are ties on Date).
This query was easy to write and is simple to understand. However, the performance is not great: it takes 46 seconds to finish returning results on my machine.

Bert’s response is to rewrite the query using a correlated subquery. My first shot would look at using APPLY though needing to aggregate the “parent” could lead to an awful result there if the join happened before aggregation.

The moral of the story here is to know different ways to write a query, as you can nudge the optimizer to better (or worse) behavior.

Comments closed

Explaining Implicit Conversion

Published 2019-04-18 by Kevin Feasel

Monica Rathbun explains to us what implicit conversion is and when it can go wrong:

Another quick post of simple changes you can make to your code to create more optimal execution plans. This one is on implicit conversions. An implicit conversion is when SQL Server must automatically convert a data type from one type to another when comparing values, moving data or combining values with other values. When these values are converted, during the query process, it adds additional overhead and impacts performance.

Read on for more info, including a common scenario where implicit conversion causes performance degradation.

Comments closed

Custom kubectl Plugin: Connect to SQL Server

Published 2019-04-18 by Kevin Feasel

Andrew Pruski shows how to create custom kubectl plugins:

When I deploy SQL Server to Kubernetes I usually create a load balanced service so that I can get an external IP to connect from my local machine to SQL running in the cluster. So how about creating a plugin that will grab that external IP and drop it into mssql-cli?
Let’s have a go at creating that now.

Click through for two demos including the appropriately-named kubectl prusk.

Comments closed

Conflict Tracking in Merge Replication

Published 2019-04-18 by Kevin Feasel

Ranga Babu shows the two different models for conflict detection with merge replication:

Conflict Detection:
The conflict detection depends on the type of tracking we configure for the article.
– Row-level tracking: If data changes are made to any column on the same row at both ends, then it is considered a conflict.
–Column-level tracking: If data changes are made on the same column at both ends, this change is qualified as a conflict.

Read on for a detailed demonstration of the two.

Comments closed

Pivoting Spark DataFrames

Published 2019-04-17 by Kevin Feasel

Unmesha Sreeveni shows how we can pivot a DataFrame in Apache Spark using one line of code:

A pivot can be thought of as translating rows into columns while applying one or more aggregations.
Lets see how we can achieve the same using the above dataframe.
We will pivot the data based on “Item” column.

Click through for the code. This is an area where dropping back into Scala or Python is a lot more lines-of-code efficient than sticking to SQL.

Comments closed

Bayes’ Theorem In A Picture

Published 2019-04-17 by Kevin Feasel

Stephanie Glen gives us the basics of Bayes’ Theorem in a picture:

Bayes’ Theorem is a way to calculate conditional probability. The formula is very simple to calculate, but it can be challenging to fit the right pieces into the puzzle. The first challenge comes from defining your event (A) and test (B); The second challenge is rephrasing your question so that you can work backwards: turning P(A|B) into P(B|A). The following image shows a basic example involving website traffic. For more simple examples, see: Bayes Theorem Problems.

Click through for the image and related links.

Comments closed

Tidying Video Game Data

Published 2019-04-17 by Kevin Feasel

Arvid Kingl has a fun article analyzing data from an open-source video game and applying tidy data principles to it:

You will learn what key principles a tidy data set adheres to, why it is useful to follow them consequently, and how to clean the data you are given. Tidying is also a great way to get to know a new data set.
Finally, in this tutorial you will learn how to write a function that makes your analysis look much cleaner and allows you to execute repetitive elements in your analysis in a very reproducible way. The function will allow you to load the latest version of the data dynamically into a flexible data scheme, which means that large parts of the code will not have to change when new data is added.

Check it out. Bonus point: tidy data is Boyce-Codd Normal Form which is (potentially) subsequently widened back out to include dimensional information.

Comments closed

Troubleshooting Spark Performance

Published 2019-04-17 by Kevin Feasel

Bikas Saha and Mridul Murlidharan explain some of the basics of performance tuning with Apache Spark:

Our objective was to build a system that would provide an intuitive insight into Spark jobs that not just provides visibility but also codifies the best practices and deep experience we have gained after years of debugging and optimizing Spark jobs. The main design objectives were to be
– Intuitive and easy – Big data practitioners should be able to navigate and ramp quickly
– Concise and focused – Hide the complexity and scale but present all necessary information in a way that does not overwhelm the end user
– Batteries included – Provide actionable recommendations for a self service experience, especially for users who are less familiar with Spark
– Extensible – To enable additions of deep dives for the most common and difficult scenarios as we come across them

The tool looks pretty interesting and I’m hoping it will be part of the open source suite at Cloudera.

Comments closed

Basic Forensic Accounting Techniques

Published 2019-04-17 by Kevin Feasel

I continue my series on forensic accounting techniques:

Growth analysis focuses on changes in ratios over time. For example, you may plot annual revenue, cost, and net margin by year. Doing this gives you an idea of how the company is doing: if costs are flat but revenue increases, you can assume economies of scale or economies of scope are in play and that’s a great thing. If revenue is going up but costs are increasing faster, that’s not good for the company’s long-term outlook.
For our data set, I’m going to use the following SQL query to retrieve bus counts on the first day of each year. To make the problem easier, I add and remove buses on that day, so we don’t need to look at every day or perform complicated analyses.

I get into quite a bit in this post, including a quick tour of multicollinearity, which is only my second-favorite of the three linear regression amigos (heteroskedasticity being my favorite and autocorrelation the hanger-on).

Comments closed

Scaling Power BI Premium Capacity

Published 2019-04-17 by Kevin Feasel

Matt Allington gives us instructions on how to scale Power BI Premium capacity:

This is the third article in my series about how to make Power BI Premium more affordable for small to medium sized enterprises (SMEs). In my first article I explained the problem and the logic behind how to configure a workable solution. In my second article I provided step by step instructions on how to configure Flow to start/stop Power BI Premium capacities. In the article today I am covering a way to scale the capacity up/down either on demand, or on a timed schedule.

The cloud is generally more expensive than on-prem, though it can potentially become cheaper if you are smart about scaling and have more scaling-friendly workloads. Matt even provides a really cool cost analysis to help you figure out what (if anything) you end up saving using this technique.

Comments closed

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

Author: Kevin Feasel