Day: July 24, 2019

MLflow 1.1 Released

Max Allen, et al., announce the release of MLflow 1.1:

We’re excited to announce today the release of MLflow 1.1. In this release, we’ve focused on fleshing out the tracking component of MLflow and improving visualization components in the UI.

Some of the major features include:
– Automatic logging from TensorFlow and Keras
– Parallel coordinate plots in the tracking UI
– Pandas DataFrame based search API
– Java Fluent API
– Kubernetes execution backend for MLflow projects
– Search Pagination

Looks like they’re putting in a lot of work on this.

Comments closed

Monitoring Backpressure in Apache Flink

Nico Kruber and Piotr Nowojski explain how you can monitor the flow of your Apache Flink processes:

Probably the most important part of network monitoring is monitoring backpressure, a situation where a system is receiving data at a higher rate than it can process. Such behaviour will result in the sender being backpressured and may be caused by two things:

– The receiver is slow.
This can happen because the receiver is backpressured itself, is unable to keep processing at the same rate as the sender, or is temporarily blocked by garbage collection, lack of system resources, or I/O.

– The network channel is slow.
Even though in such a case the receiver is not (directly) involved, we call the sender backpressured due to a potential oversubscription on network bandwidth shared by all subtasks running on the same machine. Beware that, in addition to Flink’s network stack, there may be more network users, such as sources and sinks, distributed file systems (checkpointing, network-attached storage), logging, and metrics. A previous capacity planning blog post provides some more insights.

Read the whole thing. Backpressure is not a topic unique to Flink, but affects any ETL or streaming operation.
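
To make the definition in the quote concrete, here is a minimal sketch of backpressure as a bounded buffer between a sender and a receiver. It is not Flink's monitoring mechanism; the function name, the tick-based model, and the rates are all assumptions made for illustration. The sender counts as backpressured in any tick where the buffer cannot accept everything it wants to send.

```python
def simulate_backpressure(produce_rate, consume_rate, capacity, ticks):
    """Return the fraction of ticks in which the sender was blocked.

    Hypothetical model: each tick the receiver drains up to consume_rate
    records, then the sender tries to push produce_rate records into a
    buffer holding at most capacity records.
    """
    buffered = 0
    blocked_ticks = 0
    for _ in range(ticks):
        # The receiver processes what it can from the buffer.
        buffered -= min(buffered, consume_rate)
        # The sender may only fill whatever space is left.
        accepted = min(produce_rate, capacity - buffered)
        buffered += accepted
        if accepted < produce_rate:
            # Some records had to wait: the sender is backpressured.
            blocked_ticks += 1
    return blocked_ticks / ticks


if __name__ == "__main__":
    print("balanced:", simulate_backpressure(5, 5, 10, 100))
    print("slow receiver:", simulate_backpressure(10, 5, 20, 100))
```

When the two rates match, the buffer never fills and the ratio stays at zero. When the receiver is slower, the buffer fills after a few ticks and the sender is blocked on nearly every tick afterward, regardless of how large the buffer is.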

Apache Phoenix on Cloudera

Krishna Maheshwari announces that Cloudera will officially support Apache Phoenix on its CDH and its upcoming Cloudera Data Platform:

Cloudera’s CDH releases have included Apache HBase which provides a resilient, NoSQL DBMS for customers’ operational applications that want to leverage the power of big-data. These applications have grown into mission important and mission critical applications that drive top-line revenue and bottom-line profitability. These applications include customer facing applications, ecommerce platforms, risk & fraud detection used behind the scenes at banks or serving AI/ML models for applications and enabling further reinforcement training of the same based on actual outcomes.

However, for many customers, HBase has been too daunting a journey 

Phoenix is one of my favorite examples of Feasel’s Law in action.

Median Calculation with T-SQL

Nisarg Upadhyay shows three ways to calculate the median in T-SQL:

To calculate the median of any dataset, we first need to arrange all values from the dataset in a specific order. After arranging the data, we must determine the middle value of the specified dataset. If the dataset contains an odd number of values, then the middle value of the entire dataset will be considered as a median. Following is the example:

Median (M) = value of the ((x + 1)/2)th item, where x is the number of values in the dataset.

Honestly, CLR’s probably the best approach here if you want a fast calculation for a reasonably large number of rows. Using ML Services and R/Python is another alternative, though the launchpad spinup time will probably make it slower than CLR.
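
As a language-neutral check on the rule quoted above, here is a short Python sketch (not one of the three T-SQL approaches from the article). The odd-count branch follows the quoted formula. The even-count branch is an assumption on my part: it uses the usual convention of averaging the two middle values, which the excerpt does not cover.

```python
def median(values):
    """Return the median of a non-empty collection of numbers."""
    ordered = sorted(values)  # arrange the values in order first
    count = len(ordered)
    if count == 0:
        raise ValueError("median of an empty dataset is undefined")
    middle = count // 2
    if count % 2 == 1:
        # Odd count: the ((x + 1) / 2)th item in 1-based terms
        # is index x // 2 in 0-based terms.
        return ordered[middle]
    # Even count (assumed convention): average the two middle items.
    return (ordered[middle - 1] + ordered[middle]) / 2


if __name__ == "__main__":
    print(median([5, 1, 3]))
    print(median([4, 1, 3, 2]))
```

Sorting dominates the cost here, which is the same reason the set-based T-SQL options tend to get expensive as row counts grow.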

Power BI Dataflows and the Right Tool for the Job

Matthew Roche answers a reader question and waxes philosophical at the same time:

– Power BI dataflows and CDM folders provide capabilities for bridging the low-code/no-code world of self-service BI with managed central corporate BI in Azure.
– Power BI dataflows enable Excel-like composition of ETL processes with linked and computed entities.
– Power BI dataflows can scale beyond the desktop and leverage the power of the cloud to become part of an end-to-end BI application.

But… This is just a list of features.

Read the whole thing.

Using Graph + Spatial to Find Closest Points

Hasan Savran shows how you can combine graph tables with spatial data types in SQL Server to find the nearest thing—in this case, a distribution center:

Today, I want to show you how Graph Processing Tables can make your data models flexible and smart. Let’s say we work in an e-commerce company, we have many users and products just like Amazon. We also have many warehouses, same product might be located in multiple warehouses. Whenever we want to ship a product, we want to pick the closest warehouse to buyer. In this way, we should be able to save good amount of money for shipping and products will arrive to our customers locations faster.

Click through for the demo.
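
For a feel of the underlying problem outside of SQL Server, here is a rough Python sketch. It is not Hasan's method, which uses graph tables and spatial types. The warehouse names, coordinates, and stock lists are all made up, and great-circle (haversine) distance stands in for a spatial distance function. The "graph" part is reduced to a lookup of which warehouses stock a product, and the "spatial" part picks the nearest of those.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Hypothetical data: warehouse name -> (latitude, longitude).
WAREHOUSES = {
    "Columbus": (39.96, -83.00),
    "Dallas": (32.78, -96.80),
    "Reno": (39.53, -119.81),
}

# Hypothetical edges: product -> warehouses that stock it.
STOCK = {
    "widget": {"Dallas", "Reno"},
    "gadget": {"Columbus", "Dallas", "Reno"},
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest_warehouse(product, buyer_lat, buyer_lon):
    """Return the nearest warehouse stocking the product, or None."""
    candidates = STOCK.get(product, set())
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda name: haversine_km(buyer_lat, buyer_lon, *WAREHOUSES[name]),
    )


if __name__ == "__main__":
    # A buyer near Raleigh, NC (coordinates are illustrative).
    print(closest_warehouse("widget", 35.78, -78.64))
    print(closest_warehouse("gadget", 35.78, -78.64))
```

The same buyer gets a different answer per product because the candidate set changes, which is the relationship the graph tables are modeling in the article.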

Top Products Per Group in Power BI

Marco Russo shows how you can display just the top few products in each grouping using Power BI:

A common approach to this scenario is to create a complex measure that hides the result if the element should not be displayed. In other words, it computes the ranking of the product and it blanks out the result if the ranking is larger than three. Though easy to implement, this approach requires modifying all the measures that should be displayed in a visual and negatively impacts performance. A better solution is to create a specific measure to use in the visual-level filter: that measure is executed only once for each product to define the set of visible items, without requiring any change to the other measures used in the visualization. We will see different options to solve the scenario.

Marco shows off a few techniques to get this done.
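
The idea of computing the visible set once and leaving the other measures alone can be sketched outside of DAX. The Python below is only an analogy with made-up data and function names, not Marco's measures: one function decides which products are in the top three of their group, and a separate, unmodified "measure" is then evaluated only over those rows.

```python
from collections import defaultdict


def top_n_per_group(rows, n=3):
    """Return the (group, product) pairs ranked in the top n by sales.

    rows is an iterable of (group, product, sales) tuples.
    """
    by_group = defaultdict(list)
    for group, product, sales in rows:
        by_group[group].append((sales, product))
    visible = set()
    for group, items in by_group.items():
        # Highest sales first; keep only the first n per group.
        for _sales, product in sorted(items, reverse=True)[:n]:
            visible.add((group, product))
    return visible


def total_sales(rows):
    """An ordinary measure that knows nothing about ranking."""
    return sum(sales for _group, _product, sales in rows)


if __name__ == "__main__":
    # Hypothetical sales data.
    rows = [
        ("Bikes", "A", 100), ("Bikes", "B", 90),
        ("Bikes", "C", 80), ("Bikes", "D", 70),
        ("Helmets", "E", 10),
    ]
    visible = top_n_per_group(rows)  # the filter, computed once
    shown = [r for r in rows if (r[0], r[1]) in visible]
    print(sorted(visible))
    print(total_sales(shown))
```

The ranking logic lives in one place, so adding another measure to the visual means writing another plain function, not another ranking check.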

PolyBase in SQL Server 2019

Ben Weissman takes us through SQL Server 2019’s PolyBase enhancements:

Isn’t that the same thing, as a linked server?
At first sight, it sure looks like it. But there are a couple of differences. Linked Servers are instance scoped, whereas PolyBase is database scoped, which also means that PolyBase will automatically work across availability groups. Linked Servers use OLEDB providers, while PolyBase uses ODBC. There are a couple more, like the fact that PolyBase doesn’t support integrated security, but the most significant difference from a performance perspective is PolyBase’s capability to scale out – Linked Servers are single-threaded.

Read the whole thing. Ben asks and answers the question of whether PolyBase replaces ETL. You’ll want to read his answer. My answer (and I won’t tell you how close it is to his because I want you to read his article) is that PolyBase will only replace a fraction of total ETL and will act as an ETL process in a larger percentage of cases. I can see a pattern where you virtualize the data as external tables and then connect them together locally to insert into local facts and dimensions, for example. But there are too many things you can do with other ETL platforms which make me say this will never be a full replacement.
