Press "Enter" to skip to content

Curated SQL Posts

The Costs of Specialization within Data Science

Eric Colson argues in favor of data science generalists rather than specialists:

But the goal of data science is not to execute. Rather, the goal is to learn and develop profound new business capabilities. Algorithmic products and services like recommendations systemsclient engagement banditsstyle preference classificationsize matchingfashion design systemslogistics optimizersseasonal trend detection, and more can’t be designed up-front. They need to be learned. There are no blueprints to follow; these are novel capabilities with inherent uncertainty. Coefficients, models, model types, hyper parameters, all the elements you’ll need must be learned through experimentation, trial and error, and iteration. With pins, the learning and design are done up-front, before you produce them. With data science, you learn as you go, not before you go.

In the pin factory, when learning comes first we do not expect, nor do we want, the workers to improvise on any aspect the product, except to produce it more efficiently. Organizing by function makes sense since task specialization leads to process efficiencies and production consistency (no variations in the end product).

I think this article captures the downside risk of specialization, but not the downside risks of generalization: some people simply aren’t very good at some things, leading to huge amounts of technical debt down the road or failing a project due to the lack of requisite knowledge or skills. To give a personal example, I have a generalist team, but I still control the data flows (at the very least doing thorough code reviews of any database changes), my application specialist controls app architecture, my statistician reviews algorithms, etc. I don’t claim that this is the best strategy, but a group of pure generalists will have their own set of problems too.

Comments closed

Deploying SQL Server Versions with Kubernetes

Anthony Nocentino shows how we can upgrade or even downgrade our SQL Server containers using Kubernetes deployment scripts:

There’s a few things I want to point out in our YAML file. First, we’re using a Deployment Controller. This will implement a Replica Set of the desired number of replicas using the container imaged defined. In this case, we’ll have 1 replica using the SQL Server 2017 CU11 Image. A Replica Set will guarantee that a defined set of Pods are running at any given time, here we’ll have exactly one Pod. We’re using a Deployment Controller, which gives us move between versions of Replica Sets based off different container images in a controlled fashion…more on that in a second.

Read the whole thing.

Comments closed

Parameterized Direct Query In Power BI

Gerhard Brueckl shows how you can parameterize your SQL queries in Power BI, letting you switch server or database names and easily target a different location altogether:

I frequently work on projects where we have multiple tiers on which our solution is deployed to using continuous integration / continuous deployment (CI / CD) pipelines in Azure DevOps. Once everything is deployed, you also need to monitor these different environments and check the status of the data or ETL pipelines. My tool of choice is usually Power BI desktop as it allows me to connect to e.g. SQL databases very easily. However, I always ended up creating a multiple Power BI files – one for each environment.

Having multiple files results in a lot of overhead when it comes to maintenance and also managing these files. Fortunately, I came across this little trick when I was investigating in composite models and aggregations that I am going to explain in this blog post.

I have had to do exactly this same thing, so I’m going to have to try it out myself.

Comments closed

Performance Testing Simple Scalar UDF Functions

Wayne Sheffield tries out a simple scalar UDF in SQL Server 2019 to see how it performs:

I recently published a post detailing the new Scalar UDF Inlining feature in SQL 2019 here. That post introduced the new feature in a way that I used to compare performance to the other function types, continuing the performance evaluation of functions that I had previously posted here and here. In the Scalar UDF Inlining post, I used a function to strip all non-numeral values from a string, and to return the result. This used the FOR XML output option.

In thinking about how scalar functions are commonly used, I’ve decided to revisit this feature with a simpler function. I will still compare it to all the other types of functions to see how Scalar UDF Inlining compares to the others.

Wayne’s results are music to the product team’s ears, I’m sure.

Comments closed

SqlServer Module Now with Invoke-Sqlcmd

Aaron Nelson alerts us to a new preview of the SqlServer Powershell module:

If you’re still on an earlier version of PSCore and are unable to install PSCore 6.2 right now, you can still download preview of the SqlServer module to get the latest fixes and new features.  You just won’t be able to use the Invoke-Sqlcmd cmdlet.

Another quick thing to note is that this is like a v.0.0.1 of Invoke-SqlCmd on PSCore; it does not have all the bells & whistles of the version of Invoke-Sqlcmd for [full blown] Windows PowerShell.  Obviously, more features will be added over time, but the basic functionality was ready to for customers to start “kicking the tires”.

Read on for more notes and the link to check this all out.

Comments closed

Comparing STRING_AGG Performance Against FOR XML PATH

Jeffry Schwartz compares the performance of STRING_AGG in SQL Server 2017 versus the tried and true method of FOR XML PATH for string concatenation:

Two different T-SQL methods were used for testing.  The FOR XML PATH solution incorporates the reverse and stuff functions along with FOR XML PATH.  The String_agg solution is much simpler, but is only available on SQL Server 2017 currently.

The FOR XML PATH method has been documented thoroughly over the years in many articles, so these explanations will not be repeated here.  However, two points are worth mentioning.  The first one is that the @strSeparator (in this case ‘ or  ’) string will reside at the end of the constructed string (in this case @strFilter).  Since the last occurrence of @strSeparator must be removed to prevent a syntax error, characters must be dropped from the end of the string.  Although it may seem obvious that four characters (the length of @strSeparator) should be removed, actually only three need be removed because the end of the constructed string is “r” not “ “.  Therefore, the length argument of the left function specification is the length of @strFilter minus three.

I like STRING_AGG a lot, primarily because I can remember it and I can explain it to people relatively new to SQL Server. Read on to see how they both compare in terms of performance.

Comments closed

Showing Totals on Power BI Stacked Column Charts

Reza Rad shows us how to add a totals figure to Power BI stacked column charts:

As you can see, there are data labels for each subcategory (means gender and education), but no data label showing the total of each education category. for example, we want to know how much was the total sales in the High School category. Now that you know the problem, let’s see a way to fix it.

Read on for Reza’s solution to the problem. In general, if people might care about the total, do them a favor and show the total.

Comments closed

Accidentally Building a Population Graph

Neil Saunders shares an example of a newspaper headline which ultimately just shows us population sizes:

Some poking around in the NSW Transport Open Data portal reveals how many people enter every Sydney train station on a “typical” day in 2016, 2017 and 2018. We could manipulate those numbers in various ways to estimate total, unique passengers for FY 2017-18 but I’m going to argue that the value as-is serves as a proxy variable for “station busyness”.

When working with spatial data cases, it’s important to differentiate an effect you see because it’s actually unique or interesting versus an effect you see because that’s where all of the people are.

Comments closed

Aspect-Based Sentiment Analysis

Federico Pascual explains aspect-based sentiment analysis and then shows how to implement it with MonkeyLearn:

Imagine you have a large dataset of customer feedback from different sources such as NPS, satisfaction surveys, social media, and online reviews. Some positive, some negative and others that contain mixed feelings. You’d use sentiment analysis to classify the polarity of each text, right? After all, it’s already proven to be a highly efficient tool.

But, what if you wanted to pick customer feedback apart, hone in on the details, get down to the nitty-gritty of each review for a more accurate analysis of your customers’ opinions?

Cue aspect-based sentiment analysis (ABSA). A text analysis technique that breaks down text into aspects (attributes or components of a product or service) and allocates each one a sentiment level. This technique can help businesses become customer-centric, which means putting their customers at the heart of everything they do. It’s about listening to their customers, understanding their voice, analyzing their feedback and learning more about customer experiences, as well as their expectations for products or services.

Click through for the demo.

Comments closed

Azure Data Studio and T-SQL Notebooks

Constantine Kokkinos takes us through the preview of T-SQL notebooks in Azure Data Studio:

I have been waiting for word about the new Notebook functionality in Azure Data Studio, and when I heard it was available in the insider build, I jumped in to take a look.

Jupyter Notebook is a web application that allows you to host programming languages, run code (often with different programming languages), return results, annotate your data, and importantly, share the source controlled results with your colleagues.

This is an exciting addition; SQL is a great language to combine with notebooks given the exploratory nature of the language. I’m going to wait until it’s officially out before diving too far into it, though.

Comments closed