
December 11, 2018

Working With Images In Spark 2.4

Tomas Nykodym and Weichen Xu give us an update on working with images in the most recent version of Apache Spark:

An image data source addresses many of these problems by providing a standard representation you can code against, abstracting away the details of a particular image representation.
Apache Spark 2.3 provided the ImageSchema.readImages API (see Microsoft’s post Image Data Support in Apache Spark), which was originally developed in the MMLSpark library. In Apache Spark 2.4, it’s much easier to use because it is now a built-in data source. Using the image data source, you can load images from directories and get a DataFrame with a single image column.
This blog post describes what an image data source is and demonstrates its use in Deep Learning Pipelines on the Databricks Unified Analytics Platform.

If you’re interested in working with convolutional neural networks or otherwise need to analyze image data, check it out.
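For reference, loading a directory of images with the new data source is a one-liner. A minimal PySpark sketch, assuming a placeholder directory of image files:

```python
# Minimal sketch of the Spark 2.4 built-in image data source.
# "/tmp/images" is a placeholder path; the result is a DataFrame with
# one "image" column: a struct of origin, height, width, nChannels,
# mode, and data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("image-source-demo").getOrCreate()

images = spark.read.format("image").load("/tmp/images")
images.printSchema()
images.select("image.origin", "image.height", "image.width").show(truncate=False)
```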


Comparing Streaming Engines

George Vetticaden compares Spark Streaming, Storm, and Kafka Streams:

Before the addition of Kafka Streams support, HDP and HDF supported two stream processing engines: Spark Structured Streaming and Streaming Analytics Manager (SAM) with Storm. So naturally, this begets the following question:
Why add a third stream processing engine to the platform?
With Spark Structured Streaming and SAM with Storm available, customers could pick the right stream processing engine based on their non-functional requirements and use cases. However, neither of these engines addressed the following types of requirements that we saw from our customers:

And this doesn’t even include Samza or Flink, two other popular streaming engines.

My biased answer is: forget Storm. If you have a legacy implementation of it, that’s fine, but I wouldn’t recommend basing new streaming implementations on it. After that, you can compare the two competitors (as well as Samza and Flink) to see which fits your environment better. I don’t think there are many scenarios where you’d completely regret going with, say, Kafka Streams instead of Spark Streaming. Each has its advantages, but they’re not so radically different.


Reporting On Unit Tests In R With covrpage

Maelle Salmon recaps Locke Data’s involvement with the covrpage package:

To read more about getting started with covrpage in your own package in just a few lines of code, we recommend checking out the “get started” vignette. It explains how to set up the Travis deploy, mentions which functions power the covrpage report, and gives more motivation for using covrpage.
And to learn how to interpret the information covrpage provides, see the “How to read the covrpage report” vignette.

Check it out.


Tee-Object In PowerShell

Adam Bertram offers a paean to Tee-Object:

What does Tee-Object do, anyway? Tee-Object basically kills two birds with one stone. This cmdlet redirects output from a PowerShell command, saving it to a file or a variable while also returning the output to the pipeline. It allows a scripter to save the output and send it across the pipeline all in one shot.
Let’s say a scripter wants to send some text to a file and then perform some kind of task on that same text afterward. Perhaps we’re pulling a list of Windows services from a computer and saving that output to a text file to create a snapshot of the before state before we attempt to start the services. This task can be done in two separate steps or in a single one with Tee-Object.

I’ve used it several times in PowerShell, as well as the tee command in Linux. It’s great when you need to do several things with the same data but don’t want to break out of your pipeline.
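The same tee pattern is easy to sketch outside of PowerShell. Here is a minimal Python analogue (the function, file name, and sample data are all illustrative): a generator that snapshots each item to a file while still passing it down the pipeline.

```python
# A toy analogue of Tee-Object / Unix tee: persist each item in a
# pipeline to a file while still yielding it downstream.
def tee(items, path):
    with open(path, "w") as f:
        for item in items:
            f.write(f"{item}\n")  # save the snapshot...
            yield item            # ...and keep the pipeline flowing

# Snapshot a (made-up) list of service names before acting on them.
services = ["Spooler", "W32Time", "BITS"]
for svc in tee(services, "services_before.txt"):
    print(f"starting {svc}")  # stand-in for the follow-up task
```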


A Graph Database Of US Capitals

James Livingston has a graph database to share:

While there are countless relational databases out there for practice, there’s not much in the way of graph databases. It is my intent to share my graph databases with the world in hopes that it removes some of the friction from your learning.
US Capitals is a popular data set for working with graphs. Each node identifies a state capital. An edge connects the capital of one state with the capital of a neighboring state. Only the lower 48 states are present. While the data is readily available, I was unable to find T-SQL scripts to create the graph using SQL Server 2017’s graph database features. I created those scripts and have made them readily available on GitHub.

I’m interested in the forthcoming post on Dijkstra’s algorithm; I think the last time I saw that was in my undergrad days.
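As a refresher, here is a minimal sketch of Dijkstra’s algorithm in Python over a tiny slice of a capitals-style graph. The adjacency list and mileages below are illustrative, not taken from the actual data set:

```python
# Dijkstra's shortest-path algorithm over a small adjacency list.
# Nodes are state capitals; edge weights are rough road miles
# between capitals of neighboring states (illustrative values).
import heapq

edges = {
    "Austin":        [("Oklahoma City", 389), ("Baton Rouge", 429)],
    "Oklahoma City": [("Austin", 389), ("Topeka", 294)],
    "Baton Rouge":   [("Austin", 429), ("Jackson", 169)],
    "Topeka":        [("Oklahoma City", 294)],
    "Jackson":       [("Baton Rouge", 169)],
}

def dijkstra(start):
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path already won
        for neighbor, weight in edges.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist

print(dijkstra("Austin"))  # shortest distances from Austin to each capital
```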


Tips For Migrating SSISDB

Kenneth Fisher shares some thoughts on SSISDB:

We’ve been doing a lot of upgrading recently and at one point had to move an instance from one 2016 server to another. In the process, we found out (the hard way) that it’s not that easy to move SSISDB (the SSIS Catalog that may or may not be named SSISDB). I mean it’s not hard, but it’s definitely not a basic backup/restore. The full BOL instructions on how to do this are here. That said, here are the elements that are involved.

Read on for the list as well as an order of operations.


Experimenting With Power BI Data Privacy Settings

Chris Webb takes us through some of the intricacies of Power BI data privacy settings and what it means when data sets are Private:

Not only does it show just the Package Search endpoint, there is also a warning that says:
“Some data sources may not be listed because of hand-authored queries”
This refers to the output step in the query that calls the Package Show endpoint with the dynamically generated URL.
Closing this dialog and going back to the Query Editor, if you click the Edit Credentials button, you can set credentials for the data source (anonymous access is fine in this case). These credentials can be set at all levels in the path down to https://data.gov.uk/api/3/action/package_search.

Read the whole thing.
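If you want to poke at the endpoint outside of Power BI, it accepts anonymous requests. A quick Python sketch; the search term is arbitrary, and the response handling assumes the standard CKAN envelope of a "result" object containing a "results" list:

```python
# Query the data.gov.uk CKAN package_search endpoint anonymously.
# The "q" term and row count are arbitrary choices for this sketch.
import requests

resp = requests.get(
    "https://data.gov.uk/api/3/action/package_search",
    params={"q": "transport", "rows": 3},
)
resp.raise_for_status()
for package in resp.json()["result"]["results"]:
    print(package["name"])
```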


Using Slicers In Power BI To Filter Chart Categories

Prathy Kamasani shows how you can use slicers in Power BI to filter out specific categories in a line chart:

The logic is to create a table with the DAX function UNION. Each table expression in the UNION function represents one value of the slicer. Apart from that slicer-related value, all the rest of the values are blank. It is key to have them as blanks rather than zeros, so we don’t see any data for the other categories.

In other words, this pivots the table, turning one measure with several different category values into one measure per category. If you know the number of categories (four in this case), this solution can work well for you.
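The reshaping itself is easy to see in miniature. Here is the same idea sketched in pandas with made-up data; NaN plays the role that BLANK() plays in the DAX version:

```python
# Pivot one (category, value) measure into one column per category.
import pandas as pd

df = pd.DataFrame({
    "month":    ["Jan", "Jan", "Feb"],
    "category": ["A", "B", "A"],
    "value":    [10, 20, 15],
})

wide = df.pivot(index="month", columns="category", values="value")
print(wide)  # NaN where a category has no data, akin to DAX BLANK()
```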


Making dbatools Load Faster

Chrissy LeMaire has found a technique to reduce the load time for dbatools:

I don’t know the in-depth internals of Import-Module, but I know that importing a DLL filled with C# cmdlets is extremely fast. For instance, Microsoft’s SqlServer module imports 100 commands in less than a second. Sometimes it’s closer to half a second!
I wondered if we could somehow compile our commands into a C# binary, but that seemed far-fetched. One thing we could do, though, is use compression! It works for SQL Server backups; could it work for PowerShell?

Spoilers:  the answer is “yes.”  Read on to see how much faster the initial import gets.
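The compression idea translates to other ecosystems as well. As a toy analogue (the module name and contents here are made up), Python can import modules straight from a compressed zip archive:

```python
# Bundle a module into a compressed archive, then import it from there.
import importlib
import sys
import zipfile

# Write a throwaway module into a zip (ZIP_DEFLATED = compressed).
with zipfile.ZipFile("bundle.zip", "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("mymodule.py", "def hello():\n    return 'hi'\n")

sys.path.insert(0, "bundle.zip")  # Python can import directly from zips
mymodule = importlib.import_module("mymodule")
print(mymodule.hello())
```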
