
Author: Kevin Feasel

Exploring Spark

Adnan Masood has photos of slides from a Spark-related meetup:

Apache Spark is a general-purpose cluster computing platform which extends map-reduce to support multiple computation types, including but not limited to stream processing and interactive queries. Last week, IBM’s Moktar Kandil presented at the Tampa Hadoop and Tampa Data Science Group joint meetup on the topic of exploring Apache Spark.

Apache Spark for Azure HD-Insight

Following are some of the slides discussed in the meetup. To play with the ALS recommendation engine notebook, please register at www.datascientistworkbench.com, a free Apache Spark notebook platform for educational purposes.

Check out the links.
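If you want a feel for what an ALS recommender is doing under the hood, here is a toy rank-1 alternating least squares sketch in plain Python. This is not the notebook's code — Spark's MLlib ALS does this at scale with higher-rank factors and regularization — and the ratings below are made up:

```python
# Rank-1 alternating least squares on a tiny (user, item) -> rating map.
# Missing pairs are unobserved; we learn one factor per user and per item.
ratings = {
    (0, 0): 5.0, (0, 1): 3.0,
    (1, 0): 4.0, (1, 2): 1.0,
    (2, 1): 4.0, (2, 2): 2.0,
}
n_users, n_items = 3, 3
u = [1.0] * n_users  # user factors
v = [1.0] * n_items  # item factors

for _ in range(20):
    # Fix item factors; each user factor has a closed-form least-squares solution.
    for a in range(n_users):
        num = sum(r * v[i] for (x, i), r in ratings.items() if x == a)
        den = sum(v[i] ** 2 for (x, i), r in ratings.items() if x == a)
        u[a] = num / den if den else 0.0
    # Fix user factors; solve each item factor the same way.
    for b in range(n_items):
        num = sum(r * u[x] for (x, i), r in ratings.items() if i == b)
        den = sum(u[x] ** 2 for (x, i), r in ratings.items() if i == b)
        v[b] = num / den if den else 0.0

def predict(user, item):
    """Predicted rating is just the product of the two learned factors."""
    return u[user] * v[item]
```

The alternating structure is the whole trick: with one side held fixed, the other side is an ordinary least-squares problem, which is why ALS parallelizes so well on Spark.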

Comments closed

Mapping German Postal Codes With R

Achim Rumberger shows how to map German postal codes using R:

Just at this time, Ari published his webinar about getting shape files into R, which also includes an introduction to shape files to get you going if you are new to them, as I am. I remembered Ari from his mail course introducing his great R package (choroplethr). By the way, this is a terrible name; being a biologist at heart, I always type “chloroplethr”, as in “chlorophyll”, and this is not found by the R package manager. [Editor’s note: I agree!]

Next question: where do I get the shapefiles describing Germany? A major search engine was of great help here: http://www.suche-postleitzahl.org/downloads?download=zuordnung_plz_ort.csv. Germany has some 8700 zip code areas, so expect some time for rendering the file if you do so on your own computer. Right on this site one can also find a dataset which might act as useful warm-up practice for displaying statistical data in a geographical context. Other sources are https://datahub.io/de/dataset/postal-codes-de

This is really cool.

Comments closed

Early Metrics On Warehouse Performance

Sunil Agarwal shows some results from a sample workload indicating that SQL Server 2016 has improved two customers’ performance:

As part of SQL Server 2016 technology adoption program, during development, we work with many customers validating their production-like workload in a test environment and opportunistically take some of these workloads to production running on production-ready preview build.

In one such engagement, we worked with a customer in the health industry who was running an analytics workload on a Sybase IQ 15.4 system. Challenged by exponential data growth and the requirement for running analytics queries even faster for insights, the customer wanted to compare solutions from multiple vendors to see which analytical database could deliver the performance and features they need over the next 3-5 years. After extensive proof-of-concept projects, they concluded SQL Server 2016’s clustered columnstore delivered the best performance. The performance proof-of-concept tested the current database against Sybase IQ 16, MS SQL 2016, Oracle 12c, and SAP Hana using the central tables from the real-life data model filled with synthetic data in a cloud environment. MS SQL Server 2016 came out the clear winner. SAP Hana was second in performance, but also required much higher memory and displayed significant query performance outliers. Other contenders were out-performed by a factor of 2 or more.

Standard disclaimers apply: your mileage may vary; we don’t get raw data; “all other things” are not necessarily equal.

Comments closed

DAX Time Zones With Power BI

Reza Rad shows a few ways to deal with date/time issues related to Power BI being in the cloud:

Power BI is a cloud service, and that means Power BI files are hosted somewhere. Some DAX functions, such as date/time functions, work on the system date/time of the server the file is hosted on. So if you use DAX functions such as TODAY() or NOW(), you will not get your local date/time; you will fetch the server’s date/time. In this blog post I’ll explain methods of solving this issue, so you can use Power BI to resolve your specific time zone’s date and time. If you want to learn more about Power BI, read the Power BI online book, Power BI from Rookie to Rock Star.

This is your daily reminder that “the cloud” is just somebody else’s machine.
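The fix generalizes well beyond DAX: treat the server's clock as UTC and convert to the target time zone explicitly. A minimal Python sketch of the idea, using New Zealand time to mirror Reza's scenario:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# NOW()/TODAY() on a hosted service reflect the server's clock, not yours.
# Capture the instant as UTC, then convert to the report's target time zone.
server_now = datetime(2016, 5, 9, 1, 0, tzinfo=timezone.utc)  # server (UTC) time
local_now = server_now.astimezone(ZoneInfo("Pacific/Auckland"))

print(local_now.isoformat())  # 2016-05-09T13:00:00+12:00
```

The conversion also handles daylight saving transitions for you, which a naive "add 12 hours" offset would get wrong part of the year.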

Comments closed

Kafka And MapR Streams

Ellen Friedman compares and contrasts Apache Kafka with MapR streams:

What’s the difference in MapR Streams and Kafka Streams?

This one’s easy: different technologies for different purposes. There’s a difference between messaging technologies (Apache Kafka, MapR Streams) and tools for processing streaming data (such as Apache Flink, Apache Spark Streaming, and Apache Apex). Kafka Streams is a soon-to-be-released processing tool for simple transformations of streaming data. The more useful comparison is between its processing capabilities and those of more full-service stream processing technologies such as Spark Streaming or Flink.

Despite the similarity in names, Kafka Streams aims at a different purpose than MapR Streams. The latter was released in January 2016. MapR Streams is a stream messaging system that is integrated into the MapR Converged Platform. Using the Apache Kafka 0.9 API, MapR Streams provides a way to deliver messages from a range of data producer types (for instance, IoT sensors, machine logs, and clickstream data) to consumers that include but are not limited to real-time or near real-time processing applications.

This also includes an interesting discussion of how the same term, “broker,” can be used in two different products in the same general product space and mean two distinct things.

Comments closed

Disable Performance Counters

Dan Taylor shows how to disable SQL Server performance counters using a registry setting:

About 6 or 7 years ago, I had an issue where my SQL Server performance counters were not available when looking within Performance Monitor. I went through the steps of unloading and reloading the counters according to Microsoft documentation and by looking at many of the blog posts out there. The performance counters still would not show.

My next step was to dig into the registry and look at the performance counters entry for SQL Server.

I’d have to imagine that the number of good use cases for disabling performance counters is low, but it’s occasionally necessary for troubleshooting.

Comments closed

Finishing The Event Scheduler

Reza Rad has part 3 of his event date and time scheduler up:

This table has three columns: Date, Time, and Duration. I separated the date and time for simplicity in this example. Date is formatted as YYYYMMDD, Time as HHMM, and Duration as an integer value representing hours.

The configuration above means the event starts on the 9th of May 2016, at 1:00 pm New Zealand time (this is my local time), with a duration of 3 hours. I named this table InputData.

This wraps up his series on Power Query for non-BI developers.
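The parsing step is straightforward in any language. Here is a Python sketch of turning one row shaped like the InputData table above into start and end times; the column names and sample values simply follow Reza's example:

```python
from datetime import datetime, timedelta

# One row of the InputData table: Date as YYYYMMDD, Time as HHMM,
# Duration as an integer number of hours.
row = {"Date": "20160509", "Time": "1300", "Duration": 3}

# Concatenate date and time, then parse both at once.
start = datetime.strptime(row["Date"] + row["Time"], "%Y%m%d%H%M")
end = start + timedelta(hours=row["Duration"])

print(start, "->", end)  # 2016-05-09 13:00:00 -> 2016-05-09 16:00:00
```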

Comments closed

Configure SAP HANA With Impala

Sreedhar Bolneni has a walkthrough on integrating SAP HANA with Impala:

Assuming an existing Cloudera Enterprise cluster with Impala services and HANA instances are running, and that the HANA host has access to the Impala daemons, configuring the integration is fairly straightforward:

  1. Install the Impala ODBC driver on the HANA host.

  2. Configure the Impala data source.

  3. Create remote source and virtual tables using SAP HANA Studio; then test.

There are a lot of screenshots and configuration files to help guide you through.
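Step 2 typically amounts to an odbc.ini entry on the HANA host. The sketch below shows the general shape; the driver path, host name, and port are placeholder assumptions, and exact key names can vary by driver version:

```ini
; /etc/odbc.ini on the HANA host -- illustrative values only
[ODBC Data Sources]
Impala=Cloudera ODBC Driver for Impala

[Impala]
Driver=/opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so
HOST=impalad-host.example.com
PORT=21050
Database=default
AuthMech=0
```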

Comments closed

SSIS + Google Distance Matrix API

Terry McCann shows how to use the Google distance matrix API to calculate a distance from a starting point to an ending point:

The most basic use is to calculate the distance from A to B. This is what we will look at using SSIS. The API takes in an origin and a destination. These can either be a lat & lng combo or an address, which will be translated into lat & lng when executed. To demonstrate the process, I have created a database and loaded it with some sample customer and store data. We have 2 customers and 4 stores. These customers are linked to different stores. You can download the example files at the bottom of the page.

I like that this shows just how easy it is to hit an API with an SSIS component.  If I had one wish with this article, I’d use Biml to generate the package rather than talking through the tasks and components.  Regardless, check this out; it’s a great use of the script component.
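For a sense of what the script component has to assemble, here is a Python sketch of building a Distance Matrix request URL. The endpoint and the origins/destinations/units/key parameters follow Google's documented API; the coordinates and key are placeholders:

```python
from urllib.parse import urlencode

# Build a Google Distance Matrix API request URL for one origin/destination
# pair. Coordinates and the API key below are placeholders.
base = "https://maps.googleapis.com/maps/api/distancematrix/json"
params = {
    "origins": "52.5200,13.4050",       # customer lat,lng
    "destinations": "52.5163,13.3777",  # store lat,lng
    "units": "metric",
    "key": "YOUR_API_KEY",
}
url = base + "?" + urlencode(params)
print(url)
```

In the SSIS version, the script component builds a URL like this per row, issues the HTTP request, and parses the distance out of the JSON response.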

Comments closed

“RAID” And Backups

Kenneth Fisher explains that you can set up backup strategies similar to different RAID types:

RAID 0

Splitting the backup data between multiple files.

This is actually a fairly common way to speed up a large backup. I/O is one of the slowest parts of the whole process, so by splitting the backup into multiple files we can reduce our backup time by quite a bit. Having multiple files can also speed up your restore, but you do have to make sure that all of the files are available. If you lose one file, then the whole backup is useless.

This is done by listing multiple locations for the backup.

This is an interesting use of RAID as metaphor.
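To see why losing one file ruins the whole backup, here is a conceptual Python sketch of RAID-0-style striping. SQL Server handles this internally when you list multiple backup files; this is only the metaphor, not the actual backup format:

```python
# RAID-0 metaphor: chunks of the backup stream go round-robin to several
# files, so every file is required to reassemble the original.
CHUNK = 4

def stripe(data: bytes, n_files: int) -> list[bytes]:
    stripes = [bytearray() for _ in range(n_files)]
    for i in range(0, len(data), CHUNK):
        stripes[(i // CHUNK) % n_files] += data[i:i + CHUNK]
    return [bytes(s) for s in stripes]

def reassemble(stripes: list[bytes]) -> bytes:
    # Re-interleave the chunks in the same round-robin order.
    chunks = [[s[i:i + CHUNK] for i in range(0, len(s), CHUNK)] for s in stripes]
    out = bytearray()
    for rnd in range(max(len(c) for c in chunks)):
        for c in chunks:
            if rnd < len(c):
                out += c[rnd]
    return bytes(out)

backup = b"pretend this is a database backup stream"
parts = stripe(backup, 3)
assert reassemble(parts) == backup      # all files present: restore works
assert reassemble(parts[:2]) != backup  # one file missing: backup is useless
```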

Comments closed