ML Algorithm Cheat Sheet

Hui Li has a quick cheat sheet on which algorithms might be useful in a particular situation:

A typical question asked by a beginner, when facing a wide variety of machine learning algorithms, is “which algorithm should I use?” The answer to the question varies depending on many factors, including:

  • The size, quality, and nature of data.
  • The available computational time.
  • The urgency of the task.
  • What you want to do with the data.

Even an experienced data scientist cannot tell which algorithm will perform the best before trying different algorithms. We are not advocating a one and done approach, but we do hope to provide some guidance on which algorithms to try first depending on some clear factors.

Hui then goes into detail on each. h/t Vincent Granville

Data Science Resources

Steph Locke has some resources if you are interested in getting started with data science:

R for Data Science: Import, Tidy, Transform, Visualize, and Model Data is written by Hadley Wickham and Garett Grolemund. You can buy it and you can also access it online.

If you’re interested in learning to actually start doing data science as a practitioner, this book is a very accessible introduction to programming.

Starting gently, this book doesn’t teach you much about the use of R from a general programming perspective. It takes a very task oriented approach and teaches you R as you go along.

This book doesn’t cover the breadth and depth of data science in R, but it gives you a strong foundation in the coding skills you need and gives you a sense of the of the process you’ll go through.

It’s a good starting set of links.

Kafka + Spark Streaming

Kunal Khamar, et al, show how to integrate Apache Kafka with Spark’s structured streaming:

Kafka is a distributed pub-sub messaging system that is popular for ingesting real-time data streams and making them available to downstream consumers in a parallel and fault-tolerant manner. This renders Kafka suitable for building real-time streaming data pipelines that reliably move data between heterogeneous processing systems. Before we dive into the details of Structured Streaming’s Kafka support, let’s recap some basic concepts and terms.

Data in Kafka is organized into topics that are split into partitions for parallelism. Each partition is an ordered, immutable sequence of records, and can be thought of as a structured commit log. Producers append records to the tail of these logs and consumers read the logs at their own pace. Multiple consumers can subscribe to a topic and receive incoming records as they arrive. As new records arrive to a partition in a Kafka topic, they are assigned a sequential id number called the offset. A Kafka cluster retains all published records—whether or not they have been consumed—for a configurable retention period, after which they are marked for deletion.

Read the whole thing.

Twitter Campaign/Brand Management In Power BI

Mindy Curnutt looks at a Power BI solution template for managing Twitter campaigns:

Now you can start poking around and seeing what’s in the Dashboard. Since I opted to not put any handles in for analysis of FROM and TO, the first two tabs in the workbook (Outbound Tweets and Inbound Tweets) will not have any information, this is normal.

But then we get to tab #3 – Author Hashtag Graph.  The gray dots are hashtags and the green dots are accounts that have tweeted. You can see that I made a tweet that had 2 hashtags – #osmf2017 and #mvpbuzz. And boy was @TexasMusicDude busy tweeting up a storm – and using lots of other hashtags in conjunction with his tweets. Other hashtags that were popular appear to be #CampGround, #ShinyRibs, #TexasMusic, #DreamFolk and #Strings. Along the bottom you can see the day/timeline and the quantity of tweets at what time of day. If you click on any of the nodes, the information about what time the tweet(s) took place is highlighted in the timeline. It’s very interactive.

It does require an Azure subscription, but it looks very useful as a model for an advanced set of dashboards as well as a campaign management tool.

FlowFile Continuation In NiFi

Kevin Feasel


ETL, Hadoop

Tim Spann describes one of the more powerful features of Apache NiFi:

Sometimes, you need to backup your current running flow, let that flow run at a later date, or make a backup of what is in-process now. You want this in a permanent storage and want to reconstitute it later like orange juice and add it back into the flow or restart it.

This could be due to failures, for integration testing, for testing new versions of components, as a checkpoint, or for many other purposes. You don’t always want to reprocess the original source or files (they may be gone).

Read on for an explanation of how FlowFile streams can do this.

SandDance Custom Visual

Devin Knight continues his Power BI custom visuals series:

In this module you will learn how to use the SandDance Power BI Custom Visual.  The SandDance visual is an incredibly interactive visual that allows you to create insights to view your data in multiple ways and with animations.

This is a long video for a complicated visual.

Dynamic Data Masking

Andrea Allred has been checking out Dynamic Data Masking in SQL Server 2016:

This is a great time to talk about the different masking functions and what they do.  The four types in 2016 are Default, Email, Random and Custom String.

Default – For numeric and binary it will show a “0” For a date it will show 01/01/1900 and for strings it will show xxxx’s (more or less depending on the size of the field).

Email – It will expose the first letter of the email address and the suffix at the end of the email (.com, .net, .edu etc.) For example  would now be

Random – Number randomly generated between a set range. Kind of like the game, “Pick a number between 1 and 10” but for SQL.

Custom String – Lets you get creative with how much you show or cover and what you use to cover (not stuck with just xxxx’s).

It’s not really a security feature, but it could be useful for protecting sensitive data from snoopers glancing over the shoulder.

Missing JRE, Or Maybe C++

Meagan Longoria went through a frustrating scenario:

On a recent project I used Azure Data Factory (ADF) to retrieve data from an on premises SQL Server 2014 instance and land them in Azure Data Lake Store (ADLS) as ORC files. This required the use of the Data Management Gateway (DMG). Setup was quick and easy in our development environment. We installed the DMG for development on a separate server in the client’s network, where we also installed SQL Server Management Studio (SSMS) for query development and data validation. We set up resource groups in Azure for development and production, and made sure the settings for development and production were the same.  Then we set up a separate server for the production DMG.

Deployment and execution went well in the dev environment. Testing was completed, so we deployed to prod. Deployment went fine, but the pipelines failed execution and returned the following error on the output data sets.

Weird solution, but I’m going to guess that it makes perfect sense if you are able to look at the code.

Availability Group Tips

Derik Hammer has some tips to help you learn about Availability Groups:

3. Use MultiSubnetFailover=true

The Availability Group Listener is technically an optional component of an Availability Group. However, in my opinion it is necessary. By default, your listener will register all IP addresses as DNS A records and it will have multiple IP addresses when your cluster crosses subnets, most commonly when you have disaster recovery between data centers. Using the MultiSubnetFailover=true parameter in your client connection strings will attempt to connect to all IP addresses and completes the connection on the first thread to succeed. The listener ensures that only one IP address is online at a time, therefore you always connect to correct node.

This feature effectively bypasses the limitations of your DNS cache. Traditionally, you would cache the IP address for a DNS record. When you needed the client to connect to a different IP address using the same virtual network name, you would have to wait for the time to live setting to expire. This would delay your recovery time. With the MultiSubnetFailover setting, you can still cache your IP addresses but without the delay that they could induce.

There’s some good reading here.


May 2017
« Apr