
Day: May 3, 2021

Recent Apache NiFi Updates

Pierre Villard has some news for us around Apache NiFi:

Cloudera released a lot of things around Apache NiFi recently! We just released Cloudera Flow Management (CFM) 2.1.1 that provides Apache NiFi on top of Cloudera Data Platform (CDP) 7.1.6. This major release provides the latest and greatest of Apache NiFi as it includes Apache NiFi 1.13.2 and additional improvements, bug fixes, components, etc. Cloudera also released CDP 7.2.9 on all three major cloud platforms, and it also brings Flow Management on DataHub with Apache NiFi 1.13.2 and more. Let’s have a look at the main highlights of these releases.

Click through to see what’s included.


Writing SQL to Query R data.frames

Tomaz Kastrun tries out a package:

There are many R packages for querying SQL databases. Recently, I was looking into the sqldf package (CRAN documentation).

There are so many great advantages (simple running of SQL statements; creating, loading, and deleting data in data.frames; connectivity to many databases; support for SQL functions and data types; and many more), but one that was really a major win was the interaction between data frames and the SQL language.

Between sqldf and dbplyr, you get it both ways: treat a data.frame like a SQL table, or treat a SQL database like R data.frames.
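To make the "treat a data.frame like a SQL table" idea concrete, here is a minimal Python sketch of the same pattern. It is an analogy, not sqldf itself: sqldf by default copies data frames into a temporary SQLite database and runs the statement there, and the sketch does the equivalent with Python's standard-library sqlite3 module. The `query_records` helper, the `cars` data, and the column names are all hypothetical and exist only for illustration.

```python
import sqlite3


def query_records(records, table, sql):
    """Load a list of dicts into an in-memory SQLite table, then run `sql`.

    Hypothetical helper: mirrors the copy-in, query, throw-away pattern,
    not the actual sqldf implementation.
    """
    columns = list(records[0].keys())
    conn = sqlite3.connect(":memory:")
    try:
        col_list = ", ".join(f'"{c}"' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        # SQLite allows untyped columns, so values keep their Python types.
        conn.execute(f'CREATE TABLE "{table}" ({col_list})')
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})',
            [tuple(r[c] for c in columns) for r in records],
        )
        return conn.execute(sql).fetchall()
    finally:
        # The database disappears once the connection is closed.
        conn.close()


# Made-up sample data standing in for a data frame.
cars = [
    {"model": "A", "cyl": 4, "mpg": 30.0},
    {"model": "B", "cyl": 4, "mpg": 26.0},
    {"model": "C", "cyl": 8, "mpg": 15.0},
]

result = query_records(
    cars,
    "cars",
    "SELECT cyl, AVG(mpg) FROM cars GROUP BY cyl ORDER BY cyl",
)
print(result)
```

The in-memory data is queried with ordinary SQL, and nothing persists after the call returns.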


Row Goals and Query Optimizer Estimates

Hugo Kornelis explains how row goals can change query optimizer behavior:

The most simple is when the query literally tells SQL Server that you don’t want to have all rows returned. Everyone knows the TOP clause, which is most commonly used for this. For ANSI portability, and because it adds a few options, you should also be aware of the FETCH and OFFSET modifiers to the ORDER BY clause, that have a similar functionality and are specifically designed to support paging. And there is of course the SET ROWCOUNT option, though I sincerely hope nobody actually uses that. All of these options literally tell SQL Server that we don’t want all results, only a part of them. The execution plan that would produce the entire set the fastest might not necessarily be the fastest way to get the few rows we actually want, so it’s a good thing that the optimizer has a way to come up with a different execution plan for these cases.

But there are plenty of other ways you might get a row goal, so check them out.
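As a toy illustration of why wanting only a few rows can favor a different plan, the Python sketch below compares a streaming pipeline with a blocking one when the consumer stops after five rows. This is not how SQL Server's optimizer works internally; the function names, the row source, and the "plans" are all assumptions made up for the example. It only shows that a plan able to stop early reads far fewer rows than one that must finish its work before returning anything.

```python
from itertools import islice


def scan(rows, counter):
    """Yield rows one at a time, counting how many were actually read."""
    for row in rows:
        counter["read"] += 1
        yield row


def streaming_plan(rows, predicate, counter):
    # Non-blocking: rows are filtered as they arrive, so the consumer
    # can stop as soon as it has enough.
    return (r for r in scan(rows, counter) if predicate(r))


def blocking_plan(rows, predicate, counter):
    # Blocking: every matching row is materialized before the first
    # one is handed back, regardless of how many the consumer wants.
    matches = [r for r in scan(rows, counter) if predicate(r)]
    return iter(matches)


def is_even(value):
    return value % 2 == 0


rows = range(1, 100_001)  # hypothetical 100,000-row table

streaming_counter = {"read": 0}
top_streaming = list(islice(streaming_plan(rows, is_even, streaming_counter), 5))

blocking_counter = {"read": 0}
top_blocking = list(islice(blocking_plan(rows, is_even, blocking_counter), 5))

print(top_streaming, streaming_counter["read"])
print(top_blocking, blocking_counter["read"])
```

Both versions return the same five rows, but the streaming one stops reading as soon as the request is satisfied, while the blocking one reads the whole input first.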


An Introduction to Latches

Paul Randal starts a series on latches:

In some of my previous articles here on performance tuning, I’ve discussed multiple wait types and how they are indicative of various resource bottlenecks. I’m starting a new series on scenarios where a synchronization mechanism called a latch is a performance bottleneck, and specifically non-page latches. In this initial post I’m going to explain why latches are required, what they actually are, and how they can be a bottleneck.

Read on to learn what a latch is, why it is useful, and how latches work at a high level.


Turning On and Off Calculations in Power BI Visuals

Phil Seamark has a clever workaround:

Power BI doesn’t yet have a feature that allows end-users to turn on/off the ability to process calculations for visuals on a report page until they are ready. Most of the time, this is perfectly fine; however, in some instances, it can be handy to prevent long-running and heavy calculations from running while filters/slicers are being selected.

The scenario in which you most likely want this control is when your model uses Direct Query mode against large tables in data sources that charge you for query processing. Even if your Direct Query data source does not charge per query, having a user make quick-fire selections over several slicers can potentially saturate a back-end data source and unnecessarily chew up resources.

Read on to understand how to use calculation groups to do this, as well as the limitations around this solution.


Measuring DirectQuery Performance

Chris Webb shows how you can use the Performance analyzer in Power BI to measure DirectQuery performance:

If you have a slow DirectQuery report in Power BI one of the first questions you need to ask is how long the SQL queries that Power BI generates take to run. This is a more complicated question to answer than you might think, though, and in this post I’ll explain why.

I happen to have access to some of the famous New York taxi data in a Snowflake database, and in there is a table with trip data that has 173 million rows that I have built a Power BI dataset from. The data and the database used are not really important here though – what is important is that it’s DirectQuery and a large-ish amount of data.

Read on for more information on how it all works.
