Press "Enter" to skip to content

Month: March 2018

Exploratory Analysis With Hockey Data In Power BI

Stacia Varga digs into her hockey data set a bit more:

Once I know whether a variable is numerical or categorical, I can compute statistics appropriately. I’ll be delving into additional types of statistics later, but the very first, simplest statistics that I want to review are:

  • Counts for a categorical variable
  • Minimum and maximum values in addition to mean and median for a numerical value

To handle my initial analysis of the categorical variables, I can add new measures to the modelto compute the count using a DAX formula like this, since each row in the games table is unique:

Game Count = countrows(games)

It’s interesting seeing Stacia use Power BI for exploratory analysis.  My personal preference would definitely be to dump the data into R, but there’s more than one way to analyze a data set.

Comments closed

Defending Pie Charts

Bobby Johnson makes a valiant effort at defending the indefensible:

In the world of data analysis, there are few things more reviled than the pie chart. Among “serious” data people, it is at best trivial and naive, and at worst downright evil.

I do not agree with this. The pie chart is simple, but that is its beauty. It does exactly one thing and it does it well: it shows you how much different parts contribute to a whole. This isn’t the only question you ever have about your data, but when it’s the question you do have, the pie chart is perfect. That is not evil and it is not naive. It is data visualization doing what it should: taking something large and abstract and saying something simple about it that your brain can easily internalize.

I strongly disagree with arguments in the article, but do respect the attempt.  In each of the cases, at least one of a bar chart, stacked 100% bar chart, or dot plot could give at least the same amount of information with less lower mental overhead.

Comments closed

Cassandra To Kafka Connect

Mike Barlotta shows how to feed data into Kafka from Cassandra via Kafka Connect.  Part one involves basic setup:

Modeling data in Cassandra must be done around the queries that are needed to access the data (see this article for details). Typically this means that there will be one table for each query and data (in our case about the pack) will be duplicated across numerous tables.

Regardless of the other tables used for the product, the Cassandra Source connector needs a table that will allow us to query for data using a time range. The connector is designed around its ability to generate a CQL query based on configuration. It uses this query to retrieve data from the table that is available within a configurable time range. Once all of this data has been published, Kafka Connect will mark the upper end of the time range as an offset. The connector will then query the table for more data using the next time range starting with the date/time stored in the offset. We will look at how to configure this later. For now we want to focus on the constraints for the table. Since Cassandra doesn’t support joins, the table we are pulling data from must have all of the data that we want to put onto the Kafka topic. Data in other tables will not be available to Kafka Connect.

Part 2 is around tuning the connector:

One of the problems we initially had with the Cassandra Source connector was how much data it tried to process during one polling cycle. In the original versions (0.2.5 and 0.2.6) the connector would retrieve all of the data that was inserted since the last polling cycle. For systems ingesting large amounts of data this can pose a challenge.

Our logs showed that it took 6 hours to retrieve and publish 6.8 million rows of data.

The problem (or one of them) with this slow rate of ingestion was that the table was continuing to have new data inserted into it while the connector was processing the data it had retrieved. With data being added to the table faster than it was being published the connector was getting behind. Worse there was no opportunity for it to ever catch up, until there was a lull in receiving new data.

If you’re using Cassandra, this looks like a rather useful connector.

Comments closed

Use Cases For Apache Kafka

Amy Boyle shows a few scenarios where New Relic uses Apache Kafka:

The Events Pipeline team is responsible for plumbing some of New Relic’s core data streams-specifically, event data. These are fine-grained nuggets of monitoring data that record a single event at a particular moment in time. For example, an event could be an error thrown by an application, a page view on a browser, or an e-commerce shopping cart transaction.

In this post, we show how we built our Kafka pipeline so that it stitches together microservices and serves as a changelog and “durable cache,” all with the idea of processing data streams as smoothly and effectively as possible at our scale. In an upcoming post, we’ll share thoughts on how we manage topic partitions in this pipeline.

If you’re wondering if Kafka might be right for you, check out this post for several scenarios which fit.

Comments closed

Replication In Azure SQL DB Managed Instances

Tara Kizer looks at transactional replication in Azure SQL Database Managed Instances:

I have a love-hate relationship with replication. Mostly hate due to latency and errors, but it does serve its purpose. Before Availability Groups came out, I used Transactional Replication to copy data from the production OLTP database to another server so that we could offload reports.

Will it work with a Managed Instance?

My relationship is closer to hate-hate-love-hate-hate-love-hate.  Tara also provides some good advice about watching replication latency stats.

Comments closed

Protecting Database Assets From Administrators

Louis Davidson walks through which things are granted to administrators of different levels:

Own is a strange term, because really there is just one user that is listed as owner, but there are there are three users who essentially are owner level, super-powered users in a database:

1. A login using server rights, usually via the sysadmin server role (or a server permission to view all data)
2. The user dbo in a database, acquired either as a sysadmin, or as being the user listed as the owner of the database
3. Members of the db_owner database role

Sometimes, in the context of a database, these all start to blur together. But they are definitely all three independent things. Let’s write some code and see the differences, and one of the cases may be surprising to you. To do this, I will just be using the SELECT permission on a single table, but other rights will generally behave similarly. Note that another tool in your toolbox is Row Level Security in SQL Server 2016+. It is different, in that you can include a predicate that excludes the dbo, which you can read more about here in some of my earlier blogs.

Great read.

Comments closed

Highlighting Scatter Charts In Power BI

Jason Thomas has a great post showing how to implement highlighting of scatter/bubble charts in Power BI:

That said, there is one feature from my previous blog that was not implemented in Power BI – highlighting scatter/bubble charts. In Power BI, the scatter charts are not considered as area charts and hence you can only filter them and not highlight. This feature is useful when you have a lot of data points in your scatter chart and you want to see where a particular data point is with respect to the other data points. That said, you can make use of some nifty DAX and replicate the same behavior.

There are several steps to the process so it’s not point-and-click easy, but Jason has a nice walkthrough showing how to set it up.

Comments closed

Reserved Memory Allocation Waits And Trace Flag 834

Joe Obbish has another post looking at sub-optimal columnstore index performance:

It is possible to see a scalability bottleneck in the form of high average wait time for the RESERVED_MEMORY_ALLOCATION_EXT wait if a highly concurrent workload is run on a server that consumes memory with batch mode operators. I believe that the severity of the bottleneck depends on unknown factors in the server’s initial memory state and the rate of memory actually used by queries to run batch mode operations. This blog post shares a reproduction of the issue along with a call to action.

If you use clustered columnstore indexes, check out Joe’s User Voice entry.

Comments closed

vtreat

John Mount explains the vtreat package that he and Nina Zumel have put together:

When attempting predictive modeling with real-world data you quicklyrun into difficulties beyond what is typically emphasized in machine learning coursework:

  • Missing, invalid, or out of range values.
  • Categorical variables with large sets of possible levels.
  • Novel categorical levels discovered during test, cross-validation, or model application/deployment.
  • Large numbers of columns to consider as potential modeling variables (both statistically hazardous and time consuming).
  • Nested model bias poisoning results in non-trivial data processing pipelines.

Any one of these issues can add to project time and decrease the predictive power and reliability of a machine learning project. Many real world projects encounter all of these issues, which are often ignored leading to degraded performance in production.

vtreat systematically and correctly deals with all of the above issues in a documented, automated, parallel, and statistically sound manner.

That’s immediately going onto my learn-more list.

Comments closed