Press "Enter" to skip to content

Author: Kevin Feasel

Using Talend To Build Shape Files

Paul Hernandez has a demo where he uses Talend to look up latitude and longitude pairs in a shape file:

Input data

Customer coordinates: a flat file containing x,y coordinates for every customer.

Municipalities in Austria: a shape file with multi-polygons defining the municipality areas in Austria: source

Goal

The goal was to “look up” the coordinates in the shape file in order to get the municipality code GKZ, which in German stands for “Gemeindekennzahl”.
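The demo itself handles the lookup with Talend components; purely to illustrate the underlying point-in-polygon idea, here is a rough Python sketch using geopandas, where the file and column names (apart from GKZ, mentioned above) are hypothetical:

import geopandas as gpd
import pandas as pd

# Hypothetical inputs standing in for the customer flat file and the Austrian
# municipalities shape file; assumes the x,y pairs are already in the shape file's CRS.
customers = pd.read_csv("customers.csv")                       # customer_id, x, y
municipalities = gpd.read_file("municipalities_austria.shp")   # polygons plus a GKZ column

# Turn each x,y pair into a point geometry.
points = gpd.GeoDataFrame(
    customers,
    geometry=gpd.points_from_xy(customers["x"], customers["y"]),
    crs=municipalities.crs,
)

# Spatial join: find the municipality polygon containing each customer point.
looked_up = gpd.sjoin(points, municipalities[["GKZ", "geometry"]], how="left", predicate="within")
print(looked_up[["customer_id", "GKZ"]].head())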

Check out the demo.

One Surefire Guarantee For Slow Performance

Patrick Keisler troubleshoots an issue where the buffer pool gets flushed each morning:

What is the McShield service? A quick Bing search revealed that it’s one of the services for McAfee VirusScan Enterprise. Could this be the cause? To get a quick look at all the history, I filtered the application log for event IDs 17890 and 5000. Each time McAfee got an updated virus DAT file, SQL Server soon followed that by paging out the entire buffer pool. I checked the application log on several other SQL Servers for the same event IDs, and sure enough, the same events occurred in tandem each morning. I also got confirmation from the security administration team that McAfee is scheduled to check for new DAT files each morning around 8AM. Eureka!

This seems like it could be the cause of our paging, but a little more research is needed. Searching the McAfee knowledge base led me to this article about the “Processes on enable” option.

Enabling this option causes memory pages of running processes to get paged to disk. And the example given is “Oracle, SQL, or other critical applications that need to be memory-resident continually, will have their process address space paged to disk when scan Processes On Enable kicks in”. OUCH! So when the McAfee service starts up or it gets a new DAT file, it will page out all processes.
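If you want to check your own servers for the same pattern, one rough approach is to pull those two event IDs back out of the Application log. A minimal Python sketch, run on the server itself and shelling out to wevtutil, might look like this:

import subprocess

# Filter the Application log for the working-set trim warning (17890) and the
# McShield event (5000) discussed above, newest entries first, up to 50 records.
query = "*[System[(EventID=17890 or EventID=5000)]]"
result = subprocess.run(
    ["wevtutil", "qe", "Application", "/q:" + query, "/f:text", "/c:50", "/rd:true"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)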

Fortunately, this is a setting you can turn off, and Patrick shows how.

Building A SQL Server 2014 Container

Andrew Pruski shows how to create a Docker container with SQL Server 2014 SP2 installed:

I’m running all of this on my Windows 10 machine, but there are a few things you’ll need before we get started:

Pre-requisites

  • The microsoft/windowsservercore image downloaded from the Docker Hub

  • Windows Server 2016 installation media extracted to C:\Docker\Builds\Windows

  • SQL Server 2014 SP2 Developer Edition installation media extracted to C:\Docker\Builds\SQLServer2014

Or if you prefer, you can just pull his image, but where’s the fun in that?
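If you do build the image yourself, the prerequisites are easy to sanity-check up front. Here is a small sketch using the Docker SDK for Python; the docker package and this check are my additions rather than part of Andrew's walkthrough:

import pathlib

import docker  # Docker SDK for Python; an assumption, not part of the original post

client = docker.from_env()

# Pull the Windows Server Core base image listed in the prerequisites.
client.images.pull("microsoft/windowsservercore", tag="latest")

# Confirm the extracted installation media sits where the build will expect it.
for path in (r"C:\Docker\Builds\Windows", r"C:\Docker\Builds\SQLServer2014"):
    print(path, "found" if pathlib.Path(path).exists() else "MISSING")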

DatasauRus Lives

Steph Locke shows how to create a package in R:

Then we need to add the GitHub repository to our project. I use the git command line for this:

git remote add origin git@github.com:stephlocke/datasauRus.git
git push --set-upstream origin master

With just these things, I have a package that contains the unit test framework, documentation stubs, continuous integration and test coverage, and source control.

That is all you need to do to get things going!

This is great timing for me, as I’m starting to look at packaging internal code.  Also, it’s great timing because it includes dinosaurs.

ML Algorithm Cheat Sheet

Hui Li has a quick cheat sheet on which algorithms might be useful in a particular situation:

A typical question asked by a beginner, when facing a wide variety of machine learning algorithms, is “which algorithm should I use?” The answer to the question varies depending on many factors, including:

  • The size, quality, and nature of data.
  • The available computational time.
  • The urgency of the task.
  • What you want to do with the data.

Even an experienced data scientist cannot tell which algorithm will perform the best before trying different algorithms. We are not advocating a one and done approach, but we do hope to provide some guidance on which algorithms to try first depending on some clear factors.
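That “try a few and compare” advice is straightforward to put into practice. As an illustration (my own, not part of Hui’s post), this Python sketch cross-validates a handful of scikit-learn classifiers on a built-in toy dataset:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# A small built-in dataset standing in for your own data.
X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "SVM (RBF kernel)": SVC(),
}

# Quick cross-validated comparison before committing to any single algorithm.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")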

Hui then goes into detail on each. h/t Vincent Granville

Data Science Resources

Steph Locke has some resources if you are interested in getting started with data science:

R for Data Science: Import, Tidy, Transform, Visualize, and Model Data is written by Hadley Wickham and Garrett Grolemund. You can buy it and you can also access it online.

If you’re interested in learning to actually start doing data science as a practitioner, this book is a very accessible introduction to programming.

Starting gently, this book doesn’t teach you much about the use of R from a general programming perspective. It takes a very task-oriented approach and teaches you R as you go along.

This book doesn’t cover the breadth and depth of data science in R, but it gives you a strong foundation in the coding skills you need and gives you a sense of the process you’ll go through.

It’s a good starting set of links.

Kafka + Spark Streaming

Kunal Khamar, et al, show how to integrate Apache Kafka with Spark’s structured streaming:

Kafka is a distributed pub-sub messaging system that is popular for ingesting real-time data streams and making them available to downstream consumers in a parallel and fault-tolerant manner. This renders Kafka suitable for building real-time streaming data pipelines that reliably move data between heterogeneous processing systems. Before we dive into the details of Structured Streaming’s Kafka support, let’s recap some basic concepts and terms.

Data in Kafka is organized into topics that are split into partitions for parallelism. Each partition is an ordered, immutable sequence of records, and can be thought of as a structured commit log. Producers append records to the tail of these logs and consumers read the logs at their own pace. Multiple consumers can subscribe to a topic and receive incoming records as they arrive. As new records arrive to a partition in a Kafka topic, they are assigned a sequential id number called the offset. A Kafka cluster retains all published records—whether or not they have been consumed—for a configurable retention period, after which they are marked for deletion.
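The article goes on to show the Spark side of this in detail; for flavor, a minimal Structured Streaming reader for a Kafka topic looks roughly like the following in PySpark, with the broker address and topic name as placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

# Subscribe to a topic; each record carries its key, value, topic, partition, and offset.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# Kafka keys and values arrive as binary, so cast them to strings before use.
messages = stream.select(col("key").cast("string"), col("value").cast("string"))

# Stream the records to the console as they arrive.
query = messages.writeStream.format("console").outputMode("append").start()
query.awaitTermination()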

Read the whole thing.

Twitter Campaign/Brand Management In Power BI

Mindy Curnutt looks at a Power BI solution template for managing Twitter campaigns:

Now you can start poking around and seeing what’s in the Dashboard. Since I opted not to put any handles in for analysis of FROM and TO, the first two tabs in the workbook (Outbound Tweets and Inbound Tweets) will not have any information; this is normal.

But then we get to tab #3 – Author Hashtag Graph.  The gray dots are hashtags and the green dots are accounts that have tweeted. You can see that I made a tweet that had 2 hashtags – #osmf2017 and #mvpbuzz. And boy was @TexasMusicDude busy tweeting up a storm – and using lots of other hashtags in conjunction with his tweets. Other hashtags that were popular appear to be #CampGround, #ShinyRibs, #TexasMusic, #DreamFolk and #Strings. Along the bottom you can see the day/timeline and the quantity of tweets at what time of day. If you click on any of the nodes, the information about what time the tweet(s) took place is highlighted in the timeline. It’s very interactive.

It does require an Azure subscription, but it looks very useful as a model for an advanced set of dashboards as well as a campaign management tool.

FlowFile Continuation In NiFi

Tim Spann describes one of the more powerful features of Apache NiFi:

Sometimes, you need to back up your current running flow, let that flow run at a later date, or make a backup of what is in-process now. You want this in permanent storage and want to reconstitute it later, like orange juice, and add it back into the flow or restart it.

This could be due to failures, for integration testing, for testing new versions of components, as a checkpoint, or for many other purposes. You don’t always want to reprocess the original source or files (they may be gone).

Read on for an explanation of how FlowFile streams can do this.
