Press "Enter" to skip to content

Month: August 2022

Save and Unsafe Simple Parameterization

Paul White puts on the safety glasses:

When a statement passes the earlier parser and decoder checks, it arrives at the trivial plan stage as a prepared (parameterized) statement. The query processor now needs to decide if the parameterization attempt is safe.

Parameterization is considered safe if the query processor would generate the same plan for all possible future parameter values. This might seem like a complex determination to make, but SQL Server takes a practical approach.

Read on to learn more about the process.

Comments closed

Concurrency Control and VACUUM in Postgres

Paul Randal explains how multi-version concurrency control works in Postgres:

PostgreSQL uses an optimistic isolation system known as Multi-Version Concurrency Control (MVCC). MVCC ensures transactions writing data to the database don’t block concurrent transactions needing to read the data being modified. This works through the magic of row-versioning—PostgreSQL creates versions of rows in the database tables to minimize blocking from concurrent access. As more and more versions are generated, a garbage control mechanism called VACUUM must be used to ensure the tables are properly maintained. In this article, I’ll explain how all this works via a series of examples.

This is quite similar to Read Committed Snapshot Isolation in SQL Server but with a couple of twists, including the need to vacuum tuples.

Comments closed

From Azure Data Explorer to Excel

Dany Hoter views data in Excel:

In a previous article Direct Query from Excel to Azure Data Explorer (microsoft.com) I described a way to mimic Direct Query access ala Power BI in Excel.

The method used in this article that allows the user to filter the imported data using values entered into cells in the grid.

In this article I would like to describe a way to really query Kusto data in real time without importing any data and without any volume limitations.

Read on to see how, though there’s a pretty big intermediate step.

Comments closed

“Expensive” Queries

Erik Darling asks, what’s in a name?

When we talk about finding queries to tune, there’s an unfortunate term that gets thrown around: Expensive Queries.

Why is it unfortunate? Well, it reinforces the wrong mindset when it comes to query tuning, and leads people down the wrong path, looking at the wrong metrics.

I disagree on the “bad name” bit but agree on the substance. The term “expensive query” has a very useful connotation: this is a query which requires a significant amount of resources. Where I fully agree with Erik is that “query cost” from the optimizer does not do a great job of describing “significant amount of resources.” There is also a relevant point that expensive queries may not be the most important ones to look at. Reasons why can include:

  • The query runs at a time when there’s little load on the system, so it does not impact anybody else.
  • The query runs within acceptable performance boundaries for customers: it may take 10 minutes to run but it’s a batch process and the relevant business unit might only need it within an hour.
  • The amount of work that the query is doing is such that further optimizations are either not possible at all or they are only possible with a significant restructure that the business is unwilling to accept.

Even so, the term “expensive query” is still very useful. So is “expensive query relative to what it could be,” although we do tend to conflate the latter with the former. But now we’re getting deep into semantics and I forgot my waders.

Comments closed

Power BI Dataflows and On-Prem Data Gateways

Marc Lelijveld keeps getting pulled back in:

Another week, another challenge! Summer holidays is the perfect time to run into challenges that you normally would not encounter or simply do not know about because they are fixed before you know. This blog will elaborate on a case I had at a client, where we run into issues with the on-premises data gateway in Power BI and in particular related to building dataflows based on sources that connect via the gateway.

Read on for the problem as well as Marc’s solution.

Comments closed

Visualizing Data in Python

Mark Litwintschik provides some recommendations:

There are two major phases of data analysis. The first is building up a basic understanding of a new dataset. Once this is done there is a second phase of understanding what’s changing over time and if there are any new outliers.

For the first phase, I find Tableau to be more productive than writing code in a Jupyter Notebook. For the second phase, I like to build periotic Airflow jobs that send charts and Excel files to operational channels on Slack. These are formatted to be mobile-friendly and allow me to do more of my work on a phone rather than being chained to a laptop. This also means access is controlled via Slack rather than a custom web app.

Mark also covers some examples with Altair.

Comments closed

Worrying over Columns or Rows

John Mount explains an attitude difference:

I say: if you are a data scientist or working on an analytics projectworry over columns not rows.

In analytics “rows” are instances, and “columns” are possible measurements. For example: each click on a website might generate a row recording the visit, and this row would be populated with columns describing what was clicked on (and if you are lucky there are more records recording what else was presented and not clicked on).

Read the whole thing. This is also why formats like Parquet and ORC are so popular for data analysis. Same goes for business intelligence people, who reason mostly over columns, leading to columnstore indexes being so useful.

Comments closed

SQL Server 2019 CU 16 and Log Shipping

Lee Markum notes a change in SQL Server 2019 CU 16:

As a data professional managing SQL Servers, you need to be thinking about Windows and SQL Server patching.

You want to stay up to date, but you also have to be careful because sometimes staying fully up to date comes with the risk that a new patch might break a feature you are using. This is possibly the case for SQL Server 2019 CU 16. If you’re using Log Shipping on a database that has TDE enabled and compression is enabled on those backups, then you need to take heed.

I’ll walk you through what can happen and what I’ve seen work to resolve the problem.

This is the tricky part of getting rid of service packs: when I think cumulative update, I don’t think “Here’s a thing that could break backwards compatibility with other SQL Server instances which have not been patched.” Read on to see an example of this in action.

Comments closed

Organizing Data Domains in a Data Mesh

Paul Andrew continues a series on data mesh architecture:

Defining an organisation hierarchy is always hard, even more so for large enterprises with massive amounts of interlock between business functions. In the context of data analytics, we attempt to tackle the problem by creating an organisation dimension as part of our star schema data model. This could include things like region, operating company, branch, department, team etc.

So, my friends, how do we go about handling this when considering a data mesh architecture and the de-centralised domains that support the natural scalability we crave. For me, it feels like we are just frontloading the dimensional modelling problem. Tackling it from the beginning in the very foundations of our data platform. But, with a twist.

Read on for that twist and for some solid guidance on data domains in practice compared to the theory.

Comments closed

Showing Filter Selections on Power BI Reports

Mara Pereira makes a Power BI breadcrumb:

When I’m developing reports, I’m asked multiple times to “hide” the slicers/filters from the report page. Usually this is to make space in the page for other visuals and because customers don’t want to use the filter pane for some reason.

This happened so many times, and only in the last couple of months I decided to try some things out and get a bit creative with Power BI.

I came up with two solutions, which I think work great in these scenarios.

Read on to see the solutions.

Comments closed