Press "Enter" to skip to content

Author: Kevin Feasel

Power BI and Azure Synapse Analytics

James Serra gives us some insights on the future of Power BI and how it relates with Azure Synapse Analytics today:

As an example of the speed of each layer, during an Ignite session (view here), there was a Power BI query run against 26 billion rows that was returning a sum of store sales by year. The same query was run three times using a different layer:

1. Using a DirectQuery against tables in SQL DW took 8 seconds
2. Using a DirectQuery against a materialized view in SQL DW took 2.4 seconds.  Note you don’t have to specify that you are using a materialized view in the query, as the SQL DW optimizer will know if it can use it or not
3. Using a Aggregation table that is Imported into Power BI took 0 milliseconds

Keep in mind this is all hidden from user – they just create the report.  If they do a query against a table not in memory in Power BI, it will do a DirectQuery against the data source which could take a while.  However, due to SQL DW result-set caching, repeat DirectQuery’s can be very fast (in the Ignite session they demo’d a DirectQuery that took 42 seconds the first time the query was run, and just 154 milliseconds the second time the query was run that used result-set caching).

There’s some interesting information in here, especially around Power BI eventually taking over Azure Analysis Services’ space in the market.

Comments closed

Azure Data Factory Triggers

Cathrine Wilhelmsen continues a series on Azure Data Factory by looking at triggers:

One important thing to note is that all times are in UTC. And since UTC does not observe daylight saving time… Well, let’s just say that if you need to execute pipelines during the workday and you have business users waiting for data, you may want to plan some trigger maintenance on the days when you fall back or spring forward. I know. Ugh 🙂 I’m hoping for better timezone support in the future 🤞🏻

Schedule triggers and pipelines have a many-to-many relationship. That means that one schedule trigger can execute many pipelines, and one pipeline can be executed by many schedule triggers.

Time-based triggers aren’t the only options, however—Cathrine also looks at the other three possibilities.

Comments closed

Posting SQL Query Results to Teams with Powershell

Michael Bourgon shows how we can post SQL Server result sets to Microsoft Teams channels:

So…. you want to post to a Teams channel automagically.  Should be simple, and it is!  Alas, it means you have to ignore most of the documentation.  Let’s do this!

1) Here’s how to hook it up with your channel.  Note that when I created a brand new “Team”, it took about 5-10 minutes before I was able to add the webhook connector – prior to that, I got a “channel does not exist or has been deleted”.

Michael takes us through it step by step and also includes things you should avoid, including misleading documentation.

Comments closed

Flink 1.8.3 Released

Hequn Cheng announces Flink 1.8.3:

The Apache Flink community released the third bugfix version of the Apache Flink 1.8 series.

This release includes 45 fixes and minor improvements for Flink 1.8.2. The list below includes a detailed list of all fixes and improvements.

We highly recommend all users to upgrade to Flink 1.8.3.

There’s a nice list of bugfixes in the update.

Comments closed

Kryo Serialization in Spark

Pinku Swargiary shows us how to configure Spark to use Kryo serialization:

If you need a performance boost and also need to reduce memory usage, Kryo is definitely for you. The join operations and the grouping operations are where serialization has an impact on and they usually have data shuffling. Now lesser the amount of data to be shuffled, the faster will be the operation.
Caching also have an impact when caching to disk or when data is spilled over from memory to disk.

Also, if we look at the size metrics below for both Java and Kryo, we can see the difference.

Sounds like it’s better overall but requires some custom configuration.

Comments closed

Finding the Max Value Across Multiple Columns

Erik Darling shows a couple techniques for finding the maximum value across several columns, whether they’re in one table or in more than one:

It’s sorta kinda pretty crazy when every major database platform has something implemented, and SQL Server doesn’t.

Geez, even MySQL.

But a fairly common need in databases is to find the max value from two columns.

Maybe even across two tables.

Read on to see how you can do this.

Comments closed

Distributing Notebooks

Grant Fritchey wants to know where to buy notebooks and notebook accessories:

I’m myopically focused at the moment on Azure Data Studio, but there are a lot of other places and ways to create or consume notebooks. However, I’m going to keep my focus.

The issue I’m running into, is distributing the notebooks.

There are a lot of great comments. Before reading them, here’s my answer:

  • GitHub repos, like Grant mentions. They’re good, though I have the same feeling about a production notebook that I do about an SSIS package: notebooks are binaries (after a fashion). For pedagogical purposes, I’ll absolutely slap notebooks into GitHub, typically without data. But for a real data science project, those notebooks can get hefty when you store all of the data in them, and it’s really hard to diff the JSON to understand what changed.
  • Binder and Azure Notebooks are services which let you host notebooks remotely. Binder reads from a GitHub repo and spins up a virtual environment for you. Azure Notebooks lets you run notebooks (including F# notebooks) against free VMs in Azure, or you can use your own VM for more power. Azure Notebooks let you fork projects pretty easily. I haven’t used Google Colab but it looks pretty similar to Azure Notebooks.
  • When you start up Jupyter Notebooks, you’re really starting a server. You can have a server running in your environment with your team’s notebooks. I’d probably still drop them in source control as well.
Comments closed

Column Alteration with Minimal Downtime

Andy Mallon shows how you can turn an integer column into a bigint column without disrupting your users:

Changing a column from int to bigint has gotten a lot easier since I started working on SQL Server back at the turn of the century. SQL Server 2016 introduced the ability to do ALTER TABLE...ALTER COLUMN as an online operation using the WITH (ONLINE=ON) syntax. This wonderful syntax now allows you to alter a column from int to bigint without causing major blocking. The int to bigint conversion is one of the most popular data type changes I see–A developer inevitably creates the table thinking they will never have more than 2 billion rows… then some years or months later 2 billion becomes a reality.

The DBA is left with the task of implementing that data type change, and now that it has almost 2 billion rows, it’s a well-established table and uptime during the change is a major consideration.

This is a great post from Andy. If you want to dig into the concept of near-zero downtime in more detail, I’ve got a series on the topic.

Comments closed

Debugging Azure Data Factory Pipelines

Cathrine Wilhelmsen shows us how to debug Azure Data Factory pipelines:

You debug a pipeline by clicking the debug button:

Tadaaa! Blog post done? 😀

I joke, I joke, I joke. Debugging pipelines is a one-click operation, but there are a few more things to be aware of. In the rest of this post, we will look at what happens when you debug a pipeline, how to see the debugging output, and how to set breakpoints.

Turns out there’s more to it than clicking a button.

Comments closed

Tools for Using SQL Server on Linux

Kellyn Pot’vin-Gorman has a list of tools you can use to make working with SQL Server on Linux a bit easier:

Along with the above versions of Linux distributions, SQL Server 2019 is supported in a container scenario using a Docker image.  Running a SQL Server database inside a Docker engine with Linux offers more flexibility, faster recovery, and quicker deployments, including deployments into the Azure cloud. For those becoming familiar with Linux, Docker for Windows or Mac gives you the option to run a Docker engine on your workstation with SQL Server 2019 on Linux.

Along with Docker technology, orchestration can be achieved, both managing and deploying SQL Server containers on Linux using Red Hat Open shift or Kubernetes. This includes SQL Server 2019 Big Data Clusters (BDC), fully scalable clusters with SQL Server, Spark, and Hadoop File System (HDFS). BDCs provide the ability to read, write, and analyze big data with T-SQL or Spark, and you can combine big data and relational data, too.

The set of tools just happens to be almost exactly the same set of tools as for Windows, but there are a few differences.

Comments closed