Press "Enter" to skip to content

Author: Kevin Feasel

Monitoring Backpressure in Apache Flink

Nico Kruber and Piotr Nowosjki explain how you can monitor the flow of your Apache Flink processes:

Probably the most important part of network monitoring is monitoring backpressure, a situation where a system is receiving data at a higher rate than it can process. Such behaviour will result in the sender being backpressured and may be caused by two things:

– The receiver is slow.
This can happen because the receiver is backpressured itself, is unable to keep processing at the same rate as the sender, or is temporarily blocked by garbage collection, lack of system resources, or I/O.

– The network channel is slow.
Even though in such case the receiver is not (directly) involved, we call the sender backpressured due to a potential oversubscription on network bandwidth shared by all subtasks running on the same machine. Beware that, in addition to Flink’s network stack, there may be more network users, such as sources and sinks, distributed file systems (checkpointing, network-attached storage), logging, and metrics. A previous capacity planning blog post provides some more insights.

Read the whole thing. Backpressure is not a topic unique to Flink, but affects any ETL or streaming operation.

Comments closed

Apache Phoenix on Cloudera

Krishna Maheshwari announces that Cloudera will officially support Apache Phoenix on its CDH and its upcoming Cloudera Data Platform:

Cloudera’s CDH releases have included Apache HBase which provides a resilient, NoSQL DBMS for customers operational applications that want to leverage the power of big-data.  These applications have grown into mission important and mission critical applications that drive top-line revenue and bottom-line profitability.  These applications include customer facing applications, ecommerce platforms, risk & fraud detection used behind the scenes at banks or serving AI/ML models for applications and enabling further reinforcement training of the same based on actual outcomes.

However, for many customers, HBase has been too daunting a journey 

Phoenix is one of my favorite examples of Feasel’s Law in action.

Comments closed

Median Calculation with T-SQL

Nisarg Upadhyay shows three ways to calculate the median in T-SQL:

To calculate the median of any dataset, we first need to arrange all values from the dataset in a specific order. After arranging the data, we must determine the middle value of the specified dataset. If the dataset contains an odd number of values, than the middle value of the entire dataset will be considered as a median. Following is the example:

Median (M) = value of ((X + 1)/2) th item. (x is the number of values in the dataset)

Honestly, CLR’s probably the best approach here if you want a fast calculation for a reasonably large number of rows. Using ML Services and R/Python is another alternative, though the launchpad spinup time will probably make it slower than CLR.

Comments closed

Power Bi Dataflows and the Right Tool for the Job

Matthew Roche answers a reader question and waxes philosophical at the same time:

– Power BI dataflows and CDM folders provide capabilities for bridging the low-code/no-code world of self-service BI with managed central corporate BI in Azure.
– Power BI dataflows enable Excel-like composition of ETL processes with linked and computed entities.
– Power BI dataflows can scale beyond the desktop and leverage the power of the cloud to become part of an end-to-end BI application.

But… This is just a list of features.

Read the whole thing.

Comments closed

Using Graph + Spatial to Find Closest Points

Hasan Savran shows how you can combine graph tables with spatial data types in SQL Server to find the nearest thing—in this case, a distribution center:

Today, I want to show you how Graph Processing Tables can make your data models flexible and smart. Let’s say we work in a e-commerce company, we have many users and products just like Amazon. We also have many warehouses, same product might be located in multiple warehouses. Whenever we want to ship a product, we want to pick the closest warehouse to buyer. In this way, we should be able save good amount of money for shipping and products will arrive to our customers locations faster.

Click through for the demo.

Comments closed

Top Products Per Group in Power BI

Marco Russo shows how you can display just the top few products in each grouping using Power BI:

A common approach to this scenario is to create a complex measure that hides the result if the element should not be displayed. In other words, it computes the ranking of the product and it blanks out the result if the ranking is larger than three. Though easy to implement, this approach requires modifying all the measures that should be displayed in a visual and negatively impacts performance. A better solution is to create a specific measure to use in the visual-level filter: that measure is executed only once for each product to define the set of visible items, without requiring any change to the other measures used in the visualization. We will see different options to solve the scenario.

Marco shows off a few techniques to get this done.

Comments closed

PolyBase in SQL Server 2019

Ben Weissman takes us through SQL Server 2019’s PolyBase enhancements:

Isn’t that the same thing, as a linked server?
At first sight, it sure looks like it. But there are a couple of differences. Linked Servers are instance scoped, whereas PolyBase is database scoped, which also means that PolyBase will automatically work across availability groups. Linked Servers use OLEDB providers, while PolyBase uses ODBC. There are a couple more, like the fact that PolyBase doesn’t support integrated security, but the most significant difference from a performance perspective is PolyBase’s capability to scale out – Linked Servers are single-threaded.

Read the whole thing. Ben asks and answers the question of whether PolyBase replaces ETL. You’ll want to read his answer. My answer (and I won’t tell you how close it is to his because I want you to read his article) is that PolyBase will only replace a fraction of total ETL and will act as an ETL process in a larger percentage of cases. I can see a pattern where you virtualize the data as external tables and then connect them together locally to insert into local facts and dimensions, for example. But there are too many things you can do with other ETL platforms which make me say this will never be a full replacement.

Comments closed

Polishing Uncalibrated Models

Nina Zumel takes an uncalibrated random forest model and applies a calibration technique to improve the estimate on one variable:

In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized linear regression, when they make more accurate predictions on individuals. In this article, we’ll demonstrate one ad-hoc method for calibrating an uncalibrated model with respect to specific grouping variables. This “polishing step” potentially returns a model that estimates certain rollups in an unbiased way, while retaining good performance on individual predictions.

This is a great explanation of the process as well as its risks and limitations.

Comments closed

Generating Excel Spreadsheets from Shiny

Richard Hill and Andy Merlino show how you can export data from a Shiny app into Excel:

R is great for report generation. Shiny allows us to easily create web apps that generate a variety of reports with R.

This post details a demo Shiny app that generates an Excel report, a PowerPoint report, and a PDF report:

The full Shiny app source code is available here. Also, we included a more basic Shiny app that generates an Excel report at the end of this post. Follow up posts will include similar simple Shiny apps generating PowerPoint and PDF reports.

Excel is still the most popular business intelligence tool and Excel support tends to be one of the first requests people get with third-party apps, so it’s good to know you can do this in Shiny without too much rigmarole.

Comments closed

When tempdb Spills Attack

Josh Darnell ran into a problem with a SQL Agent job:

One of my colleagues reached out to me recently about a production issue where a SQL Server Agent job had failed with this error message:

Msg 1105, Level 17, State 2, Line 15
Could not allocate space for object ‘dbo.SORT temporary run storage: 140737513062400’ in database ‘tempdb’ because the ‘PRIMARY’ filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.

I fully expected this to be a scheduled maintenance task, like index rebuilds or statistics updates. I’ve seen this error before in those contexts (rebuilding large indexes in tempdb, or updating statistics with FULLSCAN).

But watch as SQL Server subverts Josh’s expectations.

Comments closed