Press "Enter" to skip to content

Author: Kevin Feasel

Building A SQL Server 2014 Container

Andrew Pruski shows how to create a Docker container with SQL Server 2014 SP2 installed:

I’m running all of this on my Windows 10 machine but there are a few things you’ll need before we get started: –

Pre-requisites

  • The microsoft/windowsservercore image downloaded from the Docker Hub

  • Windows Server 2016 installation media extracted toC:\Docker\Builds\Windows

  • SQL Server 2014 SP2 Developer Edition installation media extracted to C:\Docker\Builds\SQLServer2014

Or if you prefer, you can just pull his image, but where’s the fun in that?

Comments closed

DatasauRus Lives

Steph Locke shows how to create a package in R:

Then we need to add github repository to our project. I use the git command line for this:

git remote add origin git@github.com:stephlocke/datasauRus.git
git push --set-upstream origin master

With just these things, I have a package that contains the unit test framework, documentation stubs, continuous integration and test coverage, and source control.

That is all you need to do to get things going!

This is great timing for me, as I’m starting to look at packaging internal code.  Also, it’s great timing because it includes dinosaurs.

Comments closed

ML Algorithm Cheat Sheet

Hui Li has a quick cheat sheet on which algorithms might be useful in a particular situation:

A typical question asked by a beginner, when facing a wide variety of machine learning algorithms, is “which algorithm should I use?” The answer to the question varies depending on many factors, including:

  • The size, quality, and nature of data.
  • The available computational time.
  • The urgency of the task.
  • What you want to do with the data.

Even an experienced data scientist cannot tell which algorithm will perform the best before trying different algorithms. We are not advocating a one and done approach, but we do hope to provide some guidance on which algorithms to try first depending on some clear factors.

Hui then goes into detail on each. h/t Vincent Granville

Comments closed

Data Science Resources

Steph Locke has some resources if you are interested in getting started with data science:

R for Data Science: Import, Tidy, Transform, Visualize, and Model Data is written by Hadley Wickham and Garett Grolemund. You can buy it and you can also access it online.

If you’re interested in learning to actually start doing data science as a practitioner, this book is a very accessible introduction to programming.

Starting gently, this book doesn’t teach you much about the use of R from a general programming perspective. It takes a very task oriented approach and teaches you R as you go along.

This book doesn’t cover the breadth and depth of data science in R, but it gives you a strong foundation in the coding skills you need and gives you a sense of the of the process you’ll go through.

It’s a good starting set of links.

Comments closed

Kafka + Spark Streaming

Kunal Khamar, et al, show how to integrate Apache Kafka with Spark’s structured streaming:

Kafka is a distributed pub-sub messaging system that is popular for ingesting real-time data streams and making them available to downstream consumers in a parallel and fault-tolerant manner. This renders Kafka suitable for building real-time streaming data pipelines that reliably move data between heterogeneous processing systems. Before we dive into the details of Structured Streaming’s Kafka support, let’s recap some basic concepts and terms.

Data in Kafka is organized into topics that are split into partitions for parallelism. Each partition is an ordered, immutable sequence of records, and can be thought of as a structured commit log. Producers append records to the tail of these logs and consumers read the logs at their own pace. Multiple consumers can subscribe to a topic and receive incoming records as they arrive. As new records arrive to a partition in a Kafka topic, they are assigned a sequential id number called the offset. A Kafka cluster retains all published records—whether or not they have been consumed—for a configurable retention period, after which they are marked for deletion.

Read the whole thing.

Comments closed

Twitter Campaign/Brand Management In Power BI

Mindy Curnutt looks at a Power BI solution template for managing Twitter campaigns:

Now you can start poking around and seeing what’s in the Dashboard. Since I opted to not put any handles in for analysis of FROM and TO, the first two tabs in the workbook (Outbound Tweets and Inbound Tweets) will not have any information, this is normal.

But then we get to tab #3 – Author Hashtag Graph.  The gray dots are hashtags and the green dots are accounts that have tweeted. You can see that I made a tweet that had 2 hashtags – #osmf2017 and #mvpbuzz. And boy was @TexasMusicDude busy tweeting up a storm – and using lots of other hashtags in conjunction with his tweets. Other hashtags that were popular appear to be #CampGround, #ShinyRibs, #TexasMusic, #DreamFolk and #Strings. Along the bottom you can see the day/timeline and the quantity of tweets at what time of day. If you click on any of the nodes, the information about what time the tweet(s) took place is highlighted in the timeline. It’s very interactive.

It does require an Azure subscription, but it looks very useful as a model for an advanced set of dashboards as well as a campaign management tool.

Comments closed

FlowFile Continuation In NiFi

Tim Spann describes one of the more powerful features of Apache NiFi:

Sometimes, you need to backup your current running flow, let that flow run at a later date, or make a backup of what is in-process now. You want this in a permanent storage and want to reconstitute it later like orange juice and add it back into the flow or restart it.

This could be due to failures, for integration testing, for testing new versions of components, as a checkpoint, or for many other purposes. You don’t always want to reprocess the original source or files (they may be gone).

Read on for an explanation of how FlowFile streams can do this.

Comments closed

Dynamic Data Masking

Andrea Allred has been checking out Dynamic Data Masking in SQL Server 2016:

This is a great time to talk about the different masking functions and what they do.  The four types in 2016 are Default, Email, Random and Custom String.

Default – For numeric and binary it will show a “0” For a date it will show 01/01/1900 and for strings it will show xxxx’s (more or less depending on the size of the field).

Email – It will expose the first letter of the email address and the suffix at the end of the email (.com, .net, .edu etc.) For example Batgirl@DC.com  would now be bxxx@xxxx.com.

Random – Number randomly generated between a set range. Kind of like the game, “Pick a number between 1 and 10” but for SQL.

Custom String – Lets you get creative with how much you show or cover and what you use to cover (not stuck with just xxxx’s).

It’s not really a security feature, but it could be useful for protecting sensitive data from snoopers glancing over the shoulder.

Comments closed