Simplify Visuals: No Unnecessary Lines

Stephanie Evergreen shows how you can improve your visuals by removing most of the lines:

The Lines section of the Data Visualization Checklist helps us enhance reader interpretability by handling a lot of the junk, or what Edward Tufte called the “noise” in the graph. I’m referring to all of the parts of the graph that don’t actually display data or assist reader cognition. Create more readability by deleting unnecessary lines. 

The default chart, on the left, has black gridlines. These stand out quite a bit because of how well black contrasts against the white chart background. But the gridlines shouldn’t be standing out so much because they are not the most important part of the graph 

I like that Stephanie keeps the gridlines. I’ve seen Tufte advocate removing them altogether but there’s a lot of value in keeping them in; just don’t make them the sharpest focus color.

Parameters in Rmarkdown Files

Neil Saunders shows how you can parameterize Rmarkdown files, consequently making changes easy later:

The reports follow a common template where the major difference is simply the hashtag. So one way to create these reports is to use the previous one, edit to find/replace the old hashtag with the new one, and save a new file.

That works…but what if we could define the hashtag once, then reuse it programmatically anywhere in the document? Enter Rmarkdown parameters.

The example is small but important.

Selecting a List of Columns in Spark

Unmesha SreeVeni shares how to get a specific set of columns by name from a DataFrame in Spark:

Create a List[Column] with column names.

scala> var selectExpr : List[Column] = List("Type","Item","Price")

<console>:25: error: not found: type Column var selectExpr : List[Column] = List("Type","Item","Price") ^

If you are getting the same error Please take a look into this page .

Read on for the demo.

Monitoring Kafka Streams with JMX Metrics

Rishi Khandelwal provides a reference architecture for monitoring a Kafka Streams application using JMX Metrics and pushing the results into Graphite:

Service (application) exposes the JMX metrics at some port which will be captured by Jolokia java agent. Then Jolokia exposes those metrics at some port which is easily accessible through a rest endpoint (we call it Jolokia URL). Then we have JMX2Graphte which polls the metrics from Jolokia URL and push it to Graphite. Then Grafana reads the Graphite metrics and creates a beautiful dashboard for us along with the alerts.

So this is the working of the proposed monitoring solution. Now let’s discuss the components of the monitoring solution.

There’s a bit of code/configuration in here as well, so check it out.

Deleting in Azure Data Factory

Kevin Feasel

2019-03-07

Cloud, ETL

Meagan Longoria is happy that Azure Data Factory v2 now has a Delete activity:

It is a common practice to load data to blob storage or data lake storage before loading to a database, especially if your data is coming from outside of Azure. We often create a staging area in our data lakes to hold data until it has been loaded to its next destination. Then we delete the data in the staging area once our subsequent load is successful. But before February 2019, there was no Delete activity. We had to write an Azure Function or use a Logic App called by a Web Activity in order to delete a file. I imagine every person who started working with Data Factory had to go and look this up.

But now Data Factory V2 has a Delete activity.

Meagan shows how it works, what kinds of parameters you can set, and a couple of gotchas, so check it out.

PolyBase and Pushdown Limitations

I have a post covering something I learned about predicate pushdown against Hadoop in PolyBase:

Before I start, let’s talk about predicate pushdown for a moment. The gist of it is that when you have data in two sources, you have two options for combining the data:

1. Bring the data in its entirety from your remote source to your local target and work as though everything were in the local target to begin with. I’ll call this the streaming approach.

2. Send as much of your query’s filters, projections, and pre-conditions to the remote source, have the remote source perform some of the work, and then have the remote source send its post-operative data to the local target. Then, the local target once more treats this as though it were simply local data. This is the pushdown approach because you push down those predicates (that is, filters, projections, and pre-conditions).

Click through for the unfortunate finding and also vote up my UserVoice feature request if you want to see string columns as filters.

Running Totals in DAX

Kevin Feasel

2019-03-07

DAX

Alberto Ferrari shows how you can calculate running totals in DAX:

A very common calculation in DAX is the year-to-date calculation (YTD), which aggregates values from the beginning of the year all the way to a certain date. A simple implementation uses the predefined DATESYTD function:

Sales YTD :=
CALCULATE (
[Sales Amount],
DATESYTD( 'Date'[Date] )
)

But click through to see when this function stops being useful and what you should replace it with when it does.

CTP 2.3 and Multi-Scoped Temp Tables

Joe Sack announces a performance optimization when you generate temp tables and then work with them in a sub-procedure:

SQL Server 2019 introduces several performance optimizations which will improve performance with minimal changes required to your application code.   In this blog post we’ll discuss one such improvement available in CTP 2.3: reduced recompilations for workloads using temporary tables in multiple scopes.

In order to understand this improvement, we’ll first go over the current behavior in SQL Server 2017 and prior. When referencing a temporary table with a DML statement (SELECT, INSERT, UPDATE, DELETE), if the temporary table was created by an outer scope batch, we will recompile the DML statement each time it is executed.

There’s a pretty big performance improvement here, but architecturally, I really don’t like it. This makes the inner procedure unrunnable unless you know that there should be some temp table(s) and what should be in those temp table(s).

Performance Tuning DAX

Matthew Mowrey has some tips for tuning DAX code:

Optimization has become one of my favorite topics. OK, I’ll admit it: it might be an obsession and a source of joy. My inability to optimize (queries, data models, and DAX) used to be a source of headaches and the occasional heart attack depending on what I was trying to accomplish. I make no claim that I’m an optimization expert, but I’ve spent hours wondering and researching why a data model’s calculation time could go from a couple of seconds to many minutes. This is a rare occurrence, but when it has happened, it’s happened when I’ve been on the verge of something great. Of course, if you’re taking business intelligence to its edge, you’re not adding 2+2—you’re trying to give someone an answer they (and you) thought was impossible. A recalculation that takes many minutes (especially if you’re not the end-user) is unacceptable and may cause your data model to crash—fatal exceptions anyone?

This article focuses on an optimization technique that I couldn’t find anywhere. Before I get into that technique, I’d like to touch on what I consider to be the basic tenets of optimization when working with data modeling and DAX.

Read on for the technique.

When Deletes Increase Data Size

Brent Ozar shows a case where indexes can grow in size as you delete data:

Here’s one way that deletes can cause a table to grow:
– The rows were originally written when the database didn’t have Read Committed Snapshot (RCSI), Snapshot Isolation (SI), or Availability Groups (AGs) enabled
– RCSI or SI was enabled, or the database was added into an Availability Group
– During the deletions, a 14-byte timestamp was added to each deleted row to support RCSI/SI/AG reads

Click through for a demo and takeaways.

Categories

March 2019
MTWTFSS
« Feb Apr »
 123
45678910
11121314151617
18192021222324
25262728293031