K-Means Clustering In R

Raghavan Madabusi provides an example of how k-means clustering can help segment data points in an understandable manner:

Call Detail Record (CDR) is the information captured by the telecom companies during Call, SMS, and Internet activity of a customer. This information provides greater insights about the customer’s needs when used with customer demographics. Most of the telecom companies use CDR information for fraud detection by clustering the user profiles, reducing customer churn by usage activity, and targeting the profitable customers by using RFM analysis.

In this blog, we will discuss about clustering of the customer activities for 24 hours by using unsupervised K-means clustering algorithm. It is used to understand segment of customers with respect to their usage by hours.

For example, customer segment with high activity may generate more revenue. Customer segment with high activity in the night hours might be fraud ones.

This article won’t really explain k-means clustering in any detail, but it does give you an example to apply the technique using R.

Monitoring Spark And Kafka

Larry Murdock gives some hints on monitoring Kafka topics and their associated Spark jobs:

Besides alerting for the hardware health, monitoring answers questions about the health of the overall distributed data pipeline. The Site Reliability Engineering book identifies “The Four Golden Signals” as the minimum of what you need to be able to determine: latency, traffic, errors, and saturation.

Latency is the time it takes for work to happen. In the case of data pipelines, that work is a message that has gone through many systems. To time it, you need to have some kind of work unit identifier that is reflected in the metrics that happen on the many segments of the workflow. One way to do this is to have an ID on the message, and have components place that ID in their logs. Alternatively, the messaging system itself could manage that in metadata attached to the messages.

Traffic is the demand from external sources, or the size of what is available to be consumed. Measuring traffic requires metrics that either specifically mean a new arrival or a new volume of data to be processed, or rules about metrics that allow you to proxy the measure of traffic.

Errors are particularly tricky to monitor in data pipelines because these systems don’t typically error out on the first sign of trouble. Some errors in data are to be expected and are captured and corrected. However, there are other errors that may be tolerated by the pipeline, but need to be feed into the monitoring system as error events. This requires specific logic in an application’s error capture code to emit this information in a way that will be captured by the monitoring system.

Saturation is the workload consuming all the resources available for doing work. Saturation can be the memory, network, compute, or disk of any system in the data pipeline. The kinds of indicators that we discussed in the previous post on tuning are all about avoiding saturation.

Larry then applies these concepts and gives links to some useful tools.

Jupyter + Pandas On Azure Data Lake Store

Amit Kulkarni demonstrates how to access data in Azure Data Lake Store within a Jupyter notebook:

For the rest of this post, I assume that you have some basic familiarity with Python, Pandas and Jupyter.

On your machine, you will need all of the following installed:

  1. Python 2 or 3 with Pip

  2. Pandas

  3. Jupyter

Amit shows two separate methods for retrieving data, so check it out.

Power BI Free Is The Problem

Matt Allington shares his thoughts on the recent Power BI licensing changes:

I think the existence of the Power BI Free product has been the root of the problem here.  The fact that you could do so much for free (including some sharing) really muddied the waters and has taken the focus away from acknowledging that there needs to be a two tier pricing model for users (free is not a pricing tier). Microsoft is addressing one part of the problem by making it clear that Power BI Free is for personal (non sharing) use. However it has not addressed the second part of the problem being the need for a lower priced offering for users that just consume data in a way I would describe as “low involvement”. Microsoft has taken away the “proxy for a low priced sharing tier” without providing a genuine low priced replacement – this had just made the situation worse, not better and it has upset a lot of people.  Power BI Free has been a great product to “try before you buy” but unfortunately its existence prevented Microsoft from realising it was missing a price tier for 2 years!  Power BI Free for personal use (no sharing) is an incredibly generous offering from Microsoft.  It is a shame that it will need a backlash to fill the real gap – a lower priced tier.

Check out the comments as well.  I think Matt has a good point, and my guess is that the Power BI team will make it easier for small to medium sized businesses to use Power BI, but they first wanted to focus on the problem with big customers.

No More Sharing With Power BI Free

Ginger Grant explains an important ramification of the recent Power BI licensing changes:

Included in the recent list of announcements Microsoft made about Power BI Local and Power BI Premium are a series of changes to the Power BI Free version which will go into effect on June 1. The free edition of Power BI will no longer be able to share reports. Currently free users could create reports and share them with others, which will be discontinued.  Only Power BI Pro Editions will be able to share reports.  Currently Power BI Pro users can create reports which can be shared with Free versions as long as no Pro features are used.  This means that if a Power BI report is set to automatically refresh the data, that report cannot be shared as Free versions do not have the ability to create reports which have data refreshed automatically. If the report was recreated to remove the automatic updates and instead refreshed manually, then the report could be shared with Free versions.  Starting June 1, the sharing feature will be removed. No longer can Power BI Pro users share anything to Power BI Free users.  If you have a Power BI Free account, there is no way to share information in the service. The Power BI Desktop will continue to be free but since you cannot print the content within it and sharing a PBIX file means that you will always be sharing the entire data model, this is of limited value.

Read the whole thing.

UNION ALL Ordering

Paul White shows how UNION ALL concatenation has changed since SQL Server 2008 R2:

The concatenation of two or more data sets is most commonly expressed in T-SQL using the UNION ALL clause. Given that the SQL Server optimizer can often reorder things like joins and aggregates to improve performance, it is quite reasonable to expect that SQL Server would also consider reordering concatenation inputs, where this would provide an advantage. For example, the optimizer could consider the benefits of rewriting A UNION ALL B as B UNION ALL A.

In fact, the SQL Server optimizer does not do this. More precisely, there was some limited support for concatenation input reordering in SQL Server releases up to 2008 R2, but this was removed in SQL Server 2012, and has not resurfaced since.

It’s an interesting article about an edge case.

Azure Data Lake Tools For VS Code

Jenny Jiang announces that Azure Data Lake Tools for Visual Studio Code is now generally available:

ADLA Integration

The ADL Tools for VSCode integrate well with ADLA. Azure Data Lake includes the capabilities required to make it easy for developers, data scientists, and analysts to store data of any size, shape, and speed, and do all types of processing and analytics across platforms and languages. U-SQL on ADLA offers Job as a Service with the Microsoft invented U-SQL language. Customers do not have to manage deployment of clusters, but can simply submit their jobs to ADLA, an analytics platform managed by Microsoft.

Click through for the full announcement.

File Management In Containers

Andrew Pruski hows how to copy files into and out from Docker containers:

Last week I was having an issue with a SQL install within a container and to fix I needed to copy the setup log files out of the container onto the host so that I could review.

But how do you copy files out of a container?

Well, thankfully there’s the docker cp command. A really simple command that let’s you copy whatever files you need out of a running container into a specified directory on the host.

I’ll run through a quick demo but I won’t install SQL, I’ll use an existing SQL image and grab its Summary.txt file.

Read on for the demo.

Cardinality Estimation On Memory-Optimzied Table Variables

Jack Li explains that the cardinality estimator works the same for memory-optimized table variables as it does for regular table variables:

In a previous blog, I talked about memory optimized table consumes memory until end of the batch.   In this blog, I want to make you aware of cardinality estimate of memory optimized table as we have had customers who called in for clarifications.  By default memory optimized table variable behaves the same way as disk based table variable. It will have 1 row as an estimate.   In disk based table variable, you can control estimate by using option (recompile) at statement level (see this blog) or use trace flag 2453.

You can control the same behavior using the two approaches on memory optimized table variable if you use it in an ad hoc query or inside a regular TSQL stored procedure.  The behavior will be the same. This little repro will show the estimate is correct with option (recompile).

Jack also explains how this works for natively compiled stored procedures (spoilers:  it doesn’t), so read the whole thing.


May 2017
« Apr