Yarn Service Framework Coming

Jian He, et al, announce the Yarn Service Framework:

Apache Hadoop YARN is well known as the general resource-management platform for big-data applications such as MapReduce, Hive / Tez and Spark. It abstracts the complicated cluster resource management and scheduling from higher level applications and enables them to focus solely on their own application specific logic.

In addition to big-data apps, another broad spectrum of workloads we see today are long running services such as HBase, Hive/LLAP and container (e.g. Docker) based services. In the past year, the YARN community has been working hard to build first-class support for long running services on YARN.

This is going to ship with Hadoop 3.1.

PySpark DataFrame Transformations

Vincent-Philippe Lauzon shows how to perform data frame transformations using PySpark:

We wanted to look at some more Data Frames, with a bigger data set, more precisely some transformation techniques.  We often say that most of the leg work in Machine learning in data cleansing.  Similarly we can affirm that the clever & insightful aggregation query performed on a large dataset can only be executed after a considerable amount of work has been done into formatting, filtering & massaging data:  data wrangling.

Here, we’ll look at an interesting dataset, the H-1B Visa Petitions 2011-2016 (from Kaggle) and find some good insights with just a few queries, but also some data wrangling.

It is important to note that about everything in this article isn’t specific to Azure Databricks and would work with any distribution of Apache Spark.

The notebook used for this article is persisted on GitHub.

Read on for explanation, or check out the notebook to work on it at your own pace.

ggplot2 Mappings And Geoms

I continue my ggplot2 series:

We have used a new geom here, geom_smooth.  The geom_smooth function creates a smoothed conditional mean.  Basically, we’re drawing some line as a result of a function based on this input data.  Notice that there are two parameters that I set:  method and se.  The method parameter tells the function which method to use.  There are five methods available, including using a Generalized Additive Model (gam), Locally Weighted Scatterplot Smoothing (loess), and three varieties of Linear Models (lm, glm, and rlm).  The se parameter controls whether we want to see the standard error bar.

I don’t cover all of the mapping options and all of the geoms, but I think it’s enough to get a grip on the concept.

Using FreeTDS To Connect To SQL Server

Steph Locke embraces the pain of FreeTDS:

If you use SQL Server (or Azure SQL DB) as your data store and you need to connect to the databasse from shinyapps.io, you’re presently stuck with FreeTDS. If you have any control over infrastructure I cannot recommend highly enough the actual ODBC Driver on Linux for ease. Alas, shinyapps.io does not let you control the infrastructure. We have to make do with with FreeTDS and it can be pretty painful to get right.

Due to how obtuse the error messages you end up getting back from FreeTDS in your shiny app and the time to deploy an app, you might just want to cry a little. I know I did. Determined to succeed, here is my solution to getting a working database connection that you can also use to test you’re doing it right. If you’re on a particularly old version of SQL Server though, I can’t guarantee this will work for you.

Read on for more.  I also have an older post on working with FreeTDS, though I ended up using TDS_Version = 8.0 instead of 7.4.

Power BI Cumulative Totaling

Martin Schoombee runs into an interesting issue with cumulative totals in Power BI:

A common practice in the data warehousing world is to use a Date Key as unique identifier in a date dimension. This attribute is usually a number in the format yyyymmdd. I’m not going to dive into all the reasons why it is used in data warehouse environments here, but (for fun) let’s change our data model to use the Date Keyattribute in the relationship between the two tables.

If we look at our visualizations again, we see a very different picture. Sales by date still looks the same, but the sales by month seems a little out of whack (image below). If you had cumulative sales at any other aggregated level (quarter, year, etc.) it would also have been incorrect.

The answer is not immediately intuitive, so it’s good to know this ahead of time rather than have to struggle with it later.

Processing Tabular Models With Helpful Information

Ust Oldfeld has a stored procedure which runs a SQL Agent job and provides notice when processing completes:

Recently, at a client, I was challenged to create a stored procedure that would process a tabular model. This stored procedure would then be executed from a web application. The process behind it being: a user enters data into a web application, which gets written to a database. That data then needs to be immediately surfaced up into reports, with additional calculations and measures along the way. Therefore the tabular model, which does all the additional calculation and measures, needs to be processed by a user from the web application.

Click through for the script.

Ungrouped Results In Query Store

Erin Stellato explains why you might see two rows for the same query plan in Query Store’s run-time stats:

You can see that there are two rows for the same query_id and plan_id, but the count_executions is different, as are the avg_duration and avg_logical_io_reads values .  The data is not truly a duplicate.  This behavior occurs because the Query Store data is stored in memory before it is flushed to disk, and when you query the data SQL Server is pulling it from both locations (and doing a UNION ALL) and displaying it in the output.  If I waited a bit and ran the query again, the two rows for that interval would probably disappear – most likely because the in memory data had been flushed to disk.

Read the whole thing.

Wrapping Up SSMS Tricks

Wayne Sheffield wraps up his 31 SSMS tricks in 31 days series.  First, he shows off the presentation settings in SSMS.  Presenter mode isn’t nearly as full-featured as I’d like to see it be, but it’s an improvement over the old version, at least.

He then shows standard reports built into SSMS.

The last post in the series is all about the Activity Monitor:

The Activity Monitor can be useful for seeing a mile-high view of a SQL Server instance. However, leaving it running can be as big a drag on the instance as is the use of SQL Profiler. The Resource Waits section of Activity Monitor, which would have been one of the strongest features, has been dumbed down by the filtering of wait types, and that many others are grouped together into categories. Sure, using the Activity Monitor convenient – but spend the time to develop your own scripts or XE sessions to get this information in a more efficient, with less impact. Overall, refrain from using it… and especially don’t set up SSMS to open it automatically on startup.

Agreed.  For all of Wayne’s tips and tricks, check out his index page.


February 2018
« Jan