Press "Enter" to skip to content

Author: Kevin Feasel

Trying out fabric-cicd

Kevin Chant tries a Python package:

In this post I want to cover my initial tests of fabric-cicd, in order to provide some tips for those looking to work with this new offering.

Just so that everybody is aware, fabric-cicd is a Python library that allows you to perform CI/CD of various Microsoft Fabric items into Microsoft Fabric workspaces. At this moment in time there is a limited number of supported item types. However, that list is increasing.

Read on for the test. It currently supports a limited amount of functionality, but it looks promising.
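
To give a sense of the shape of the library, a minimal deployment script following the pattern in the fabric-cicd documentation might look something like the sketch below. Treat the exact parameter and function names as things to verify against the current docs, since the package is still evolving quickly, and the workspace GUID and folder path here are placeholders.

    from fabric_cicd import FabricWorkspace, publish_all_items, unpublish_all_orphan_items

    # Point the deployment at a target workspace and the local repo folder
    # that holds the exported item definitions.
    target_workspace = FabricWorkspace(
        workspace_id="00000000-0000-0000-0000-000000000000",  # placeholder GUID
        repository_directory="./fabric_items",
        item_type_in_scope=["Notebook", "DataPipeline", "Environment"],
    )

    # Publish everything in scope, then remove items that no longer exist in the repo.
    publish_all_items(target_workspace)
    unpublish_all_orphan_items(target_workspace)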


Scraping SQL Server Version Updates

Rod Edwards scrapes patch information:

(just for info, I’ve always referred to the SQL Server Version List (https://sqlserverbuilds.blogspot.com) as Blogspot for some reason)

Doesn’t take long, but it would be nice to not have to bother at all, and I don’t mean by moving everything to the cloud and letting MS handle it either, before anyone says it. 🙂

The details are just websites, so the HTML is just plain text, AND the awesome people who provide Blogspot also add the details to a publicly shared Google doc, in multiple formats. Legends.

So what can we do to make our lives even easier? How about downloading/scraping all of the details that you might ever need and dropping them into tables somewhere? Then, once it’s in SQL, you can choose to trigger off of it however you like.

Read on to learn how.
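
Rod has his own implementation in the post; purely as an illustration of the general idea, a hedged Python sketch (assuming pandas with lxml plus SQLAlchemy/pyodbc, and that the build grid on sqlserverbuilds.blogspot.com still renders as plain HTML tables with a "Build" column) could look like this:

    import pandas as pd
    from sqlalchemy import create_engine

    # Pull every HTML table from the page; the version list is plain HTML.
    tables = pd.read_html("https://sqlserverbuilds.blogspot.com/")

    # Keep only tables that look like build grids (the "Build" column name is an assumption).
    builds = pd.concat([t for t in tables if "Build" in t.columns], ignore_index=True)

    # Land the result in a staging table so SQL-side processes can trigger off it.
    engine = create_engine(
        "mssql+pyodbc://@MyServer/DBA?driver=ODBC+Driver+18+for+SQL+Server&trusted_connection=yes"
    )
    builds.to_sql("SQLServerBuilds", engine, schema="stg", if_exists="replace", index=False)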


Orchestrating Data Pipelines in R with maestro

Will Hipson moves some data:

If you look at data orchestration tools today you are bombarded with a dizzying array of software platforms that claim unsurpassed processing capability, AI-readiness, elegant UIs, etc. Apache Airflow is just one example of a popular orchestration platform that scales to meet virtually any orchestration need. And while these claims may be true, I argue it is rarely the case that these gargantuan platforms are needed in the first place. For most data engineers, you probably only need to process a moderate amount of data at a moderate time scale. Moreover, if you’re an R user, you don’t want to have to define your data pipelines using drag-and-drop tools or learn another programming language. Not only will this reduce cloud costs but also development time costs.

Click through to see why Will developed maestro and how it works. H/T R-Bloggers.


Thoughts on Scaling Elasticsearch

Vivek Kumar can’t stop at one:

As modern applications evolve to serve increasing needs for real-time data processing and retrieval, the demand for scalability grows, too. One such open-source, distributed search and analytics engine is Elasticsearch, which is very efficient at handling large data sets and high-velocity queries. However, the process of effectively scaling Elasticsearch can be nuanced, since it requires a proper understanding of the architecture behind it and of its performance trade-offs.

Click through for those considerations and the trade-offs you might see.
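The article covers the architectural side; just to make one of the most common levers concrete, here is a small sketch (assuming the 8.x elasticsearch Python client, a reachable cluster, and a hypothetical logs-2025 index) of setting shard and replica counts, since the primary shard count is effectively fixed once an index is created:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumes a locally reachable cluster

    # Size primary shards for expected growth up front; changing them later
    # means a split/shrink operation or a full reindex.
    es.indices.create(
        index="logs-2025",
        settings={"number_of_shards": 6, "number_of_replicas": 1},
    )

    # Replicas, by contrast, can be adjusted on a live index for read throughput and resilience.
    es.indices.put_settings(index="logs-2025", settings={"number_of_replicas": 2})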


Table Compaction in Apache Spark

Miles Cole groups things together:

If there’s anything that data engineers agree about, it’s that table compaction is important. Often one of the first big lessons that folks learn early on is that not compacting tables can present serious performance issues: you’ve gotten your lakehouse pilot approved, it’s been running for a couple of months in production, and you find that both reads and writes are getting slower and slower while your data volumes have not increased drastically. Guess what: you almost surely have a “small file problem”.

What engineers won’t always sing the same tune on is how and when to perform table compaction.

Read on for a dive into the power of compaction (converting a large number of small files into a small number of large files) and plenty of tips along the way.
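
Miles digs into the how and when; as a rough sketch of the mechanics (assuming a Delta table named sales and Delta Lake 2.x or later, as on Fabric or Databricks), compaction can be triggered like so:

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()  # already exists in a Fabric/Databricks notebook

    # Bin-pack many small files into fewer, larger ones without changing the data.
    DeltaTable.forName(spark, "sales").optimize().executeCompaction()

    # The SQL form does the same thing on engines that expose it.
    spark.sql("OPTIMIZE sales")

    # For plain Parquet without Delta, the blunt-force alternative is a rewrite with a
    # partition count chosen to hit a sensible target file size (paths here are illustrative).
    df = spark.read.parquet("Files/sales_raw")
    df.repartition(16).write.mode("overwrite").parquet("Files/sales_compacted")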


Step Outputs to Help Troubleshoot Failed SQL Agent Jobs

Jim Evans gives us a reminder:

When troubleshooting SQL Agent jobs, often the Job history output is truncated or poorly formatted, making it hard to read. This is especially true when calling SSIS Packages, running jobs like DBCC CheckDB or when running T-SQL code that returns a lot of output. Are there options to get more readable Job output to aid in troubleshooting?

There are a few settings here that we can use to make troubleshooting SQL Agent jobs a little bit easier. In addition to these, it’s also a good idea to retain more history for longer, especially if you’re not in a position to track those job outputs each day.
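
Jim’s tip covers the SQL Server-side settings (output files, logging to a table, and so on). If you would rather pull the full, untruncated output programmatically, a hedged sketch along these lines can help; it assumes pyodbc, a trusted connection, a hypothetical job name, and that the job steps use the “Log to table” option so msdb.dbo.sysjobstepslogs is populated.

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};SERVER=MyServer;DATABASE=msdb;"
        "Trusted_Connection=yes;TrustServerCertificate=yes"
    )

    # sysjobstepslogs keeps the complete step output when "Log to table" is enabled,
    # unlike the truncated message shown in the job history viewer.
    query = """
    SELECT j.name AS job_name, s.step_id, s.step_name, l.[log] AS full_output, l.date_modified
    FROM msdb.dbo.sysjobs AS j
    JOIN msdb.dbo.sysjobsteps AS s ON s.job_id = j.job_id
    JOIN msdb.dbo.sysjobstepslogs AS l ON l.step_uid = s.step_uid
    WHERE j.name = ?
    ORDER BY s.step_id;
    """

    cursor = conn.cursor()
    for row in cursor.execute(query, "Nightly - DBCC CHECKDB"):
        print(f"Step {row.step_id} ({row.step_name}):\n{row.full_output}\n")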


Microsoft Fabric February 2025 Feature Round-Up

Patrick LeBlanc tells us what’s new:

There are a lot of exciting features for you this month! Here are some highlights: In Power BI, Explore from Copilot visual answers, which lets you do easy ad-hoc exploration. In Data Warehouse, Browse files with OPENROWSET (Preview) and Copilot for Data Warehouse Chat (Preview). For Data Science, AI Skill is now conversational.

These are just some of the great features this month; keep reading to learn about everything that’s happened in Fabric.

Click through for the full report.


Sales KPIs in T-SQL

I have a new video:

In this video, we will take a look at several useful KPIs for measuring sales. We will also demonstrate the utility of the APPLY operator, the WINDOW clause, and the APPROX_PERCENTILE_CONT() function.

I enjoyed putting this one together, because I enjoy it any time I have a chance to talk about the APPLY operator.


Vertical Partitioning Rarely Works

Brent Ozar lays out an argument:

You’re looking at a wide table with 100-200 columns.

Years ago, it started as a “normal” table with maybe 10-20, but over the years, people kept gradually adding one column after another. Now, this behemoth is causing you problems because:

  • The table’s size on disk is huge
  • Queries are slow, especially when they do table scans
  • People are still asking for more columns, and you feel guilty saying yes

You’ve started to think about vertical partitioning: splitting the table up into one table with the commonly used columns, and another table with the rarely used columns. You figure you’ll only join to the rarely-used table when you need data from it.

Read on to understand why this is rarely a good idea and what you can do instead.

I will say that I’ve had success with vertical partitioning in very specific circumstances:

  1. There are large columns, like blobs of JSON, binary data, or very large text strings.
  2. There exists a subset of columns the application (or caller) rarely needs.
  3. Those large columns are in the subset of columns the caller rarely needs, or can be accessed via a point lookup when needed.

For a concrete example, my team at a prior company worked on a product that performed demand forecasting on approximately 10 million products across the customer base. For each product, we had the choice between using a common model (if the sales fit a common pattern) and generating a unique model. Because we were using SQL Server Machine Learning Services, we needed to store those custom models in the database. But each model, even when compressed, could run from kilobytes to megabytes in size. We only needed to retrieve the model during training or inference, not reporting, but we did have reports that tracked what kind of model we were using, whether it was a standard model or custom (and if custom, what algorithm we used). Thus, we had the model binary in its own table, separate from the remaining model data.
