Press "Enter" to skip to content

Curated SQL Posts

Gradient Boosting for Classification

I have a new video:

In this video, I take a look at an alternative to bootstrap aggregation & random forest: boosting. We cover a brief history of boosting and see how it works in action with XGBoost and LightGBM.

This is probably the video with the single largest number of links in my show notes. It’s also one of the shortest in the series; it’s funny how things work out sometimes.

Comments closed

Authenticating to Fabric APIs via Sempy and Service Principals

Gilbert Quevauvilliers links everything together:

I have been doing a fair amount of work lately with Fabric Notebooks.

I am always conscious to ensure that when I am authenticating using a Service Principal, I can make sure it is as secure as possible. To do this I have found that I can use the Azure Key Vault and Azure identity to successfully authenticate.

Read on for some of the advantages of using Azure Key Vault for this sort of credential management, as well as how to get it all working.

Comments closed

Creating Orchestrators in Azure Data Factory

Martin Schoombee continues a series on building an orchestration framework in Azure Data Factory:

The orchestration layer of the framework is where all the magic happens. It facilitates the execution of processes and/or tasks as defined in the metadata, and needs to do it both seamlessly and efficiently. Ideally you would want to deploy this layer only once, and never have to touch it again. And it is really with that in mind that I designed this layer…to function independently and with minimal dependencies in both directions.

I would have loved for this layer to consist of only one pipeline but there are some nuances in Data Factory that make it impossible, the primary nuance being that you cannot nest ForEach activities. As a result, this layer contains three pipelines that will be covered by the sections below in more detail.

Read on to see what those three pipelines are.

Comments closed

Refreshable Excel Files in the Power BI Service

Kristyna Ferris shows how to refresh a Power BI data source from Excel files in Sharepoint:

Ever since Excel made its debut in the 1980’s, it has been used as a quick way for end users to input and manipulate data on their own without going through the extensive data engineering and data ingestion processes. With Power BI coming on to the scene in 2015, it quickly became the go-to visualization tool for various data sources. These two powerful tools can be used together to drive customized insights for your organization. By uploading your Excel file into SharePoint/OneDrive, you can easily connect and set up a refresh to a Power BI report in the Power BI Service without an on-premises gateway.

Read on to see how it all works.

Comments closed

A Primer on SSIS Package Deployment

Andy Brownsword gives us a blast from the past:

Configurations for Integration Services packages allow us to tailor their execution without needing to redeploy. There are two main ways to manage these configurations – Package Configuration and Project Configuration. In this post we’ll look at the Package Configuration approach.

Package deployment was the original approach, though as Andy points out, it’s no longer the default.

Comments closed

Data Quality Issues in Python-Based Time Series Analysis

Hadi Fadlallah checks out the data:

Time-series data analysis is one of the most important analysis methods because it provides great insights into how situations change with time, which helps in understanding trends and making the right choices. However, there is a high dependence on its quality.

Data quality mistakes in time series data sets have implications that extend over a large area, such as the accuracy and trustworthiness of analyses, as well as their interpretation. For instance, mistakes can be caused by modes of data collection, storage, and processing. Specialists working on these data sets must acknowledge these data quality obstacles.

Read on for several examples of data quality issues you might run into in a time series dataset, as well as their fixes.

Comments closed

Checking if a Column Exists in an R Data Frame

Steven Sanderson takes a peek:

When working with data frames in R, it’s common to need to check whether a specific column exists. This is particularly useful in data cleaning and preprocessing, to ensure your scripts don’t throw errors if a column is missing. Today, we’ll explore several methods to perform this check efficiently in R, and I encourage you to try these methods out with your own data sets.

Read on for four ways to do this.

Comments closed

Including a SQL Query in an Expression in SSRS

Slava Murygin shares a tip:

Why: Most of the time, when you want a flexibility of your SQL query you can use parameterization. However there might be a situation when you’d need to build a dynamic query. In my case I used SQL query within an expression to feed it to multiple data sources targeting different servers with the exact same query.

Read on to see what Slava has fought with in the past.

Comments closed

Optimized Column Order for Indexes

Eitan Blumin talks indexing:

SQL Server performance optimization is not a simple topic, and index design plays a pivotal role in it, determining the efficiency of database queries.

One key aspect that often influences performance is the order of columns in an index.

In this guide, I’ll use my real-world experience from our consulting jobs to examine the thinking process behind selecting the best column sequence for an index, the logic behind the decisions, and offer some practical insights for optimal database performance.

Read on to learn more about indexing strategy. And this is probably a good time to remind people that the missing index DMV’s column order has as its sole basis the column’s ordinal value. You can, and often will, do better than the missing index DMV recommendation by going through Eitan’s exercise.

Comments closed