Press "Enter" to skip to content

Curated SQL Posts

Web Page Scraping with R and ML Services

Dave Mason shows how you can scrape webpages with R and pull the resulting data into SQL Server using Machine Learning Services:

For this post, it might make more sense to skip ahead to the end result, and then work our way backwards. Here is a web page with some data: Boston Celtics 2016-2017. It shows two HTML tables (grids) of data for the Boston Celtics, a professional basketball team. The first grid lists the roster of players. We will scrape the web page, and write the data from the “roster” grid to a SQL Server table.

Read on for a demonstration of the process.
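
For a sense of the mechanics, here is a minimal sketch of the ML Services side, assuming the rvest package is installed for R Services and using a placeholder URL and table shape (Dave's actual code may differ):

-- Sketch only: dbo.Roster and the URL are illustrative
INSERT INTO dbo.Roster (Number, Player, Position)
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        library(rvest)
        page <- read_html("https://example.com/celtics-2016-17")
        grids <- html_table(html_nodes(page, "table"), fill = TRUE)
        OutputDataSet <- grids[[1]][, 1:3]  # first grid is the roster
    ';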

Refreshing Power BI Datasets in Azure Data Factory

Meagan Longoria shows us how to refresh a Power BI dataset using Azure Data Factory:

I recently needed to ensure that a Power BI imported dataset would be refreshed after populating data in my data mart. I was already using Azure Data Factory to populate the data mart, so the most efficient thing to do was to call a pipeline at the end of my data load process to refresh the Power BI dataset.

Power BI offers REST APIs to programmatically refresh your data. For Data Factory to use them, you need to register an app (service principal) in AAD and give it the appropriate permissions in Power BI and to an Azure key vault.

Click through for the solution.
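
At its core, the pipeline's Web activity ends up making a call along these lines (the group and dataset IDs are placeholders, and the token comes from AAD via the service principal):

POST https://api.powerbi.com/v1.0/myorg/groups/{groupId}/datasets/{datasetId}/refreshes
Authorization: Bearer {AAD access token}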

Creating Evenly-Sized Batches from Groups

Daniel Hutmacher has a variant on the islands problem as well as the bin-packing problem:

My aim with this post is to split the dataset into batches of roughly 100 rows each.

DECLARE @target_rowcount bigint=100;

I say “roughly”, because we’re not allowed to split a transaction so that a group (grouping_column_1, grouping_column_2) appears in more than one batch, although a batch can obviously contain more than one group. This means that by necessity, some of the batches are going to be slightly under 100 rows and some are going to be slightly over.

Read on for a good solution to the problem. Daniel mentions places where performance could be better, though this feels like the kind of task you don’t necessarily run all that frequently.
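
As a rough illustration of the shape of the problem (this is a naive running-total bucketing, not Daniel's solution, and dbo.transactions is a hypothetical table):

DECLARE @target_rowcount bigint = 100;

WITH group_sizes AS (
    SELECT grouping_column_1, grouping_column_2, COUNT(*) AS cnt
    FROM dbo.transactions
    GROUP BY grouping_column_1, grouping_column_2
), running AS (
    SELECT *, SUM(cnt) OVER (ORDER BY grouping_column_1, grouping_column_2
                             ROWS UNBOUNDED PRECEDING) AS running_total
    FROM group_sizes
)
SELECT grouping_column_1, grouping_column_2,
       (running_total - 1) / @target_rowcount AS batch_no  -- groups never split
FROM running;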

Power BI the Right Way

Paul Turley is starting a series on doing Power BI the right way:

Any attempt to apply universal best practices to Power BI solution design is a slippery slope. The tools are so flexible and powerful, and the requirements of each project are so varied that it is challenging to establish a set of steps or rules that, if followed, will always yield the absolute best design for a given scenario. With that out of the way, I’ll say this: In my job, I see a lot of poorly-designed Power BI projects. I’ve worked with dozens or scores (maybe even hundreds?) of consulting clients who bring us projects – some partially completed, some finished, and many that are just broken – to be fixed or completed. My reactions range from “that’s just downright wrong” to “hmmm… I wouldn’t have done it that way but I guess it will work for the time being”. I try not to cast stones and do, on occasion, realize that others have found a better way to solve a problem. I don’t have all the answers but I do have a lot of experience with Microsoft Business Intelligence solution design. I have learned many good practices and design patterns from community leaders over the past twenty or so years.

This is only the introductory post in the series, but I’m excited to see what Paul has in store.

Areas of Improvement for DROP TABLE

Michael J. Swart points out a few foibles about the DROP TABLE syntax:

I was looking at the docs for DROP TABLE and I noticed this in the syntax: [ ,...n ]. I never realized that you can drop more than one table in a statement.

I think that’s great. When dropping tables one at a time, you always had to be careful about order when foreign keys were involved. Alas, you still have to care about order.

That is a shame. Michael also includes a few other places where DROP TABLE could be made better, so check it out.
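
A quick illustration of the ordering quirk, with a hypothetical parent/child pair:

CREATE TABLE dbo.Parent (Id int PRIMARY KEY);
CREATE TABLE dbo.Child  (Id int PRIMARY KEY,
                         ParentId int REFERENCES dbo.Parent (Id));

DROP TABLE dbo.Parent, dbo.Child;  -- fails: Parent is still referenced by Child
DROP TABLE dbo.Child, dbo.Parent;  -- succeeds: the referencing table goes first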

A Brief Overview of Azure Synapse Analytics

Ginger Grant gives us the nickel tour of why Azure Synapse Analytics is interesting:

In the past few months, I have been examining Azure Synapse and what it can do. When it was first released in November of 2019, the first functionality that was rolled out was an update of Azure SQL DW. For this reason, many people think that Synapse is just an improved version of a cloud data warehouse. Microsoft did improve SQL DW when it moved it to Synapse. The biggest architectural design change is the separation of the compute from the storage, a theme with many web projects, which allows the compute power to be increased when need dictates and scaled down when computing needs change. Within Synapse, resources are allocated as pools, and you can define a SQL pool to run a data warehouse and later change the compute to a different resource. You will still need to partition your DW, as large datasets require partitioning to perform well. Subsequently, Microsoft released Azure Synapse Studio to be a container for a larger environment of tools and notebooks to interact with them.

But it’s more than that. Read on to see what else is available.
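
As one concrete example of that elasticity, resizing a dedicated SQL pool is a single statement (the pool name and service objectives here are illustrative):

ALTER DATABASE MySqlPool MODIFY (SERVICE_OBJECTIVE = 'DW1000c');  -- scale up for a heavy load
ALTER DATABASE MySqlPool MODIFY (SERVICE_OBJECTIVE = 'DW100c');   -- scale back down afterward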

Schema References and Multiple Event Types in a Kafka Topic

Robert Yokota updates some prior knowledge:

In the article Should You Put Several Event Types in the Same Kafka Topic?, Martin Kleppmann discusses when to combine several event types in the same topic and introduces new subject name strategies for determining how Confluent Schema Registry should be used when producing events to an Apache Kafka® topic.

Schema Registry now supports schema references in Confluent Platform 5.5, and this blog post presents an alternative means of putting several event types in the same topic using schema references, discussing the advantages and disadvantages of this approach.

Click through to see how this works out.
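
The gist of the technique: register each event schema under its own subject, then register a top-level Avro union for the topic's value subject that points at the event schemas via references. The registration payload looks roughly like this (names and versions are illustrative):

{
  "schema": "[\"com.example.OrderCreated\", \"com.example.OrderShipped\"]",
  "schemaType": "AVRO",
  "references": [
    { "name": "com.example.OrderCreated", "subject": "order-created-value", "version": 1 },
    { "name": "com.example.OrderShipped", "subject": "order-shipped-value", "version": 1 }
  ]
}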

Performance Tuning for Cloudera’s Operational Database

Liliana Kadar, et al, show us the tools we can use to tune Cloudera’s Operational Database:

A query optimizer determines the most efficient way to run a query. Query optimization helps you to reduce the hardware resources required to run a query and also speeds up your query-response time. Cloudera’s Operational Database provides you with various tools such as plan analyzers to make optimal use of your computing resources.

Cloudera’s OpDB provides various cost-based and rules-based optimizers. You can use different optimizers based on your use cases. OpDB is primarily used for Online Transactional Processing (OLTP) use cases with Apache Phoenix in the OpDB used as a SQL engine. But you can also use Hive and Impala for Online Analytical Processing (OLAP) use cases. 

Read on for recommendations on platform choice as well as indexing and tuning options.
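
For example, with Phoenix as the SQL engine, inspecting a plan and adding a covered secondary index are both plain SQL (table and column names here are hypothetical):

EXPLAIN SELECT * FROM orders WHERE customer_id = 42;  -- see what the optimizer chose

CREATE INDEX idx_orders_customer ON orders (customer_id) INCLUDE (order_total);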

Changing Power BI Models with Visual Studio Code

Phil Seamark has an integration I hadn’t expected to see:

Visual Studio Code is a reasonably new development environment which is lightweight and quick to install and get up and running. There is nothing you can do in VS Code that you can’t also do in another tool using TOM. I just thought it would be fun to show how quick and easy it is to get up and running in very few steps.

The following exercise uses VS Code to connect to and manage a Power BI Desktop model. You can also connect to models hosted in Azure Analysis Services as well as models hosted in Power BI Premium.

Read on to see how to get everything going.

SQL Server Assessment Reports

Dave Bland shares the results of a lot of effort:

When you click the SQL Server 2012 or newer button, you will be taken to a report that will return a great deal of information that will be useful when doing an assessment. This is very similar to the report you will be taken to if you click the 2008 R2 or older button. Since many of these data points are not options in an Azure SQL DB, that button will take you to the Report Library.

Below are the 21 data points that will be returned in just a few seconds.  A number of these will be helpful when doing an assessment for performance reasons.  The boxes will also change color to yellow or red if issues are found, just like the image above.

Click through for samples as well as the download link.
