Press "Enter" to skip to content

Month: August 2018

Scaling Kafka With Kafka-Kit

Jamie Alquiza announces Kafka-Kit:

Kafka-Kit is a collection of tools that handle partition to broker mappings, failed broker replacements, storage based partition rebalancing, and replication auto-throttling. The two primary tools are topicmappr and autothrottle.

These tools cover two categories of our Kafka operations: data placement and replication auto-throttling.

It looks like an interesting project, and is available on GitHub.


Getting Started With Azure Databricks

David Peter Hansen has a quick walkthrough of Azure Databricks:

RUN MACHINE LEARNING JOBS ON A SINGLE NODE

A Databricks cluster has one driver node and one or more worker nodes. The Databricks runtime includes commonly used Python libraries, such as scikit-learn. However, these libraries do not distribute their algorithms.

Running an ML job only on the driver might not be what we are looking for. It is not distributed, and we could just as well run it on our own computer or in a Data Science Virtual Machine. However, some machine learning tasks can still take advantage of distributed computation, and it is a good way to take an existing single-node workflow and transition it to a distributed workflow.

This great example notebook that uses scikit-learn shows how this is done.

Read the whole thing.


Considerations When Using Sort By Column In DAX

Marco Russo shows us some things to keep in mind when using Sort By Column in DAX:

A query in MDX automatically inherits the correct column sort order from the data model; the result of an MDX query is always sorted according to the Sort By Column setting. However, DAX does not have any implicit sort order for the columns other than the natural sort order of the underlying data type. For this reason, a DAX query must always specify the sort order in an ORDER BY condition – similarly to a query in SQL. Because DAX requires a column used in ORDER BY to be part of the query result, a Power BI visual that sorts a column always generates a query that includes at least two columns: the column requested in the report and the underlying column used in the Sort By Column setting. In other words, a Power BI visual showing data by Month must generate a query that contains both Month Name and Month Number, whereas MDX only requires the Month Name column.

There’s also an interesting example where Power BI behaves differently from Excel.


Tips For Troubleshooting Code Problems

Bert Wagner shares some techniques he uses to troubleshoot code:

1. Rubber Duck Debugging

The first thing I usually do when I hit a wall like this is talk myself through the problem again.

This technique usually works well for me and is equivalent to those times when you ask someone for help but realize the solution while explaining the problem to them.

To save yourself embarrassment (and to let your coworkers keep working uninterrupted), people often substitute an inanimate object, like a rubber duck, instead of a coworker to try and work out the problem on their own.

Alas, in this case explaining the problem to myself didn’t help, so I moved on to the next technique.

This one works more often than you might expect, and is a big part of the value behind pair programming.


Visualizing Deadlocks In SQL Sentry & Plan Explorer

Aaron Bertrand shows off new functionality in SQL Sentry and SentryOne Plan Explorer around deadlock visualization:

There’s a lot going on there, but much of it is noise. There is a whole bunch of contention on the table SqlPerf.Session — session 342 is trying to perform an update, but it is stuck waiting on shared locks taken by two services. Now, let’s check the Optimize Layout box above, and look at the circular graph again. Simplified, right?

This checkbox is easily the most powerful option to discard noise and help you focus on the crux of the deadlock issue. In the original graph, you can see that many of the elements presented are simply innocent bystanders — waiters that are captured as part of the deadlock activity, but in no way contributing to it. We can detect this in a lot of cases and so, when you check the box, we hide them from view, allowing you to focus much more directly on the key players involved in the deadlock. There is no question that eliminating the noise can really speed up troubleshooting; with those extra nodes removed, I can clearly see that I have some kind of order-of-operations issue on the SqlPerf.Session table, between the transfer service and the processor service.

Very cool.


Aggregations In Power BI

Teo Lachev takes us on a tour of aggregates in Power BI:

During the “Building a data model to support 1 trillion rows of data and more with Microsoft Power BI Premium” presentation at the Business Applications Summit, Microsoft discussed the technical details of how the forthcoming “Aggregations” feature can help you implement fast summarized queries on top of huge datasets. Following incremental refresh and composite models, aggregations are the next “pro” feature to debut in Power BI, and they aim to make it a more attractive option for deploying organizational semantic models. In this blog, I summarize my initial observations of this feature, which should be available for preview in the September release of Power BI.

Aggregations are not a new concept to BI practitioners tackling large datasets. Ask a DBA what’s the next course of action after all tricks are exhausted to speed up massive queries and his answer would be summarized tables. That’s what aggregations are: predefined summaries of data, aimed to speed queries at the expense of more storage. BI pros would recall that Analysis Services Multidimensional (MD) has supported aggregations for a long time. Once you define an aggregation, MD maintains it automatically. When you process the partition, MD rebuilds the partition aggregations. An MD aggregation is tied to the source partition and it summarizes all measures in the partition. You might also recall that designing proper aggregations in MD isn’t easy and that the MD intra-dependencies could cause some grief, such as processing a dimension could invalidate the aggregations in the related partitions, requiring you to reprocess their indexes to restore aggregations. On the other hand, as it stands today, Analysis Services Tabular (Azure AS and SSAS Tabular) doesn’t support aggregations. Power BI takes the middle road. Like MD, Power BI would search for suitable aggregations to answer summarized queries, but it requires more work on your part to set them up.

Aggregates in Power BI aren’t as simple as they were in Analysis Services Multidimensional, but they do exist, and hopefully the Power BI team will improve upon them in future versions.


Query Labels In Azure SQL Data Warehouse

Arun Sirpal demonstrates how to use query labels in Azure SQL Data Warehouse:

Using a query label in Azure SQL DW (Data Warehouse) can be a really handy technique to track queries via DMVs. You might want to do this to see what problematic queries are doing under the covers.

Let’s check out an example. First I will show you how things would look without using a query label. I connect to SQL DW and issue the following basic example query.

It’s an interesting approach and solves a problem I saw in Polybase around figuring out which session details were yours after the fact.


Principal Component Analysis With Faces

Mic at The Beginner Programmer shows us how to create creepy PCA diagrams with human faces:

PCA looks for a new reference system to describe your data. This new reference system is designed in such a way as to maximize the variance of the data across the new axes. The first principal component accounts for as much variance as possible, as does the second, and so on. PCA transforms a set of (typically) correlated variables into a set of uncorrelated variables called principal components. By design, each principal component will account for as much variance as possible. The hope is that a smaller number of PCs can be used to summarise the whole dataset. Note that PCs are a linear combination of the original data.

The procedure simply boils down to the following steps:

  1. Scale (normalize) the data (not necessary but suggested especially when variables are not homogeneous).

  2. Calculate the covariance matrix of the data.

  3. Calculate eigenvectors (also, perhaps confusingly, called “loadings”) and eigenvalues of the covariance matrix.

  4. Choose only the first N biggest eigenvalues according to one of the many criteria available in the literature.

  5. Project your data in the new frame of reference by multiplying your data matrix by a matrix whose columns are the N eigenvectors associated with the N biggest eigenvalues.

  6. Use the projected data (very confusingly called “scores”) as your new variables for further analysis.
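For the curious, here is a minimal sketch of those six steps in base R, using the built-in iris measurements rather than the face data from the post (the projected columns may differ from prcomp's output only in sign, which is expected):

# Rough sketch of the steps above, on the built-in iris measurements
X <- as.matrix(iris[, 1:4])

# 1. Scale (normalize) the data
X_scaled <- scale(X)

# 2. Covariance matrix
C <- cov(X_scaled)

# 3. Eigenvectors (the loadings) and eigenvalues
eig <- eigen(C)

# 4. Keep the first N components
N <- 2
loadings <- eig$vectors[, 1:N]

# 5. Project the data into the new frame of reference
scores <- X_scaled %*% loadings

# 6. Use the projected data (the scores) for further analysis
head(scores)

# Sanity check against R's built-in PCA; columns may differ only in sign
head(prcomp(X, scale. = TRUE)$x[, 1:N])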

I like the explanations provided, and the data set is definitely something I’m not used to seeing with PCA.  H/T R-bloggers


Sorting With data.table Versus dplyr

John Mount shows us that data.table is way faster for sorting than dplyr‘s arrange function:

Notice on the above semi-log plot the run time ratio is growing roughly linearly. This makes sense: data.table uses a radix sort which has the potential to perform in near linear time (faster than the n log(n) lower bound known comparison sorting) for a range of problems (also we are only showing example sorting times, not worst-case sorting times).

In fact, if we divide the y in the above graph by log(rows) we get something approaching a constant.
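If you want a rough, at-home version of the comparison, something along these lines works. This is my own sketch, not John's benchmark code, and the timings will vary by machine and data size:

library(data.table)
library(dplyr)

set.seed(2018)
n <- 5e6
df <- data.frame(x = runif(n), g = sample(letters, n, replace = TRUE))
dt <- as.data.table(df)

# dplyr: arrange() returns a new, sorted copy of the data frame
system.time(sorted_df <- arrange(df, g, x))

# data.table: setorder() sorts in place using its radix sort
system.time(setorder(dt, g, x))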

John has also provided us with a markdown document for comparison.


Matrices In R

Dave Mason continues his perusal of R data types, this time looking at the matrix:

All of the examples so far have consisted of matrices with data elements of the same class. And for good reason: it’s a requirement for a matrix. R will coerce elements with mismatched classes to the same class. Here are two vectors, one of class integer and the other of class character. After combining them into a matrix via rbind(), we see the first row of data elements are of the character class (enclosed in double quotes):

> row1 <- c(1L, 2L, 3L, 4L)
> row2 <- c("a", "b", "c", "d")
> new_matrix <- rbind(row1, row2)
> new_matrix
     [,1] [,2] [,3] [,4]
row1 "1"  "2"  "3"  "4" 
row2 "a"  "b"  "c"  "d"

Matrices drive a large number of statistical techniques, though I tend to deal with them less directly than I would have imagined.
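To make that last point a little more concrete, here is one small example of a matrix doing statistical work behind the scenes: ordinary least squares computed straight from the normal equations, checked against lm(). This is just an illustrative sketch using mtcars as a stand-in dataset:

# beta = (X'X)^-1 X'y
X <- cbind(1, mtcars$wt)   # design matrix: intercept column plus weight
y <- mtcars$mpg

beta <- solve(t(X) %*% X) %*% t(X) %*% y
beta

# Same coefficients from lm()
coef(lm(mpg ~ wt, data = mtcars))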
