Press "Enter" to skip to content

Month: May 2018

Converting Between Time Series Classes In R

Christoph Sax announces a new R library:

tsbox, now freshly on CRAN, provides a set of tools that are agnostic towards existing time series classes. It is built around a set of converters, which convert time series stored as ts, xts, data.frame, data.table, tibble, zoo, tsibble, or timeSeries to each other.

If you have to work with time series data, this will be a useful library.  H/T R-Bloggers

Using AU Analyzer To Lower Data Lake Analytics Costs

Matthew Hicks shows off the Data Lake Analytics AU Analyzer:

The AU Analyzer looks at all the vertices (or nodes) in your job, analyzes how long they ran and their dependencies, then models how long the job might run if a certain number of vertices could run at the same time. Each vertex may have to wait for input or for its spot in line to run. The AU Analyzer isn’t 100% accurate, but it provides general guidance to help you choose the right number of AUs for your job.

You’ll notice that there are diminishing returns when assigning more AUs, mainly because of input dependencies and the running times of the vertices themselves. So, a job with 10,000 total vertices likely won’t be able to use 10,000 AUs at once, since some will have to wait for input or for dependent vertices to complete.

In the graph below, here’s what the modeler might produce when considering the different options. Notice that when the job is assigned 1427 AUs, assigning more won’t reduce the running time; 1427 is the “peak” number of AUs that can be assigned.

I like this kind of tooling, as it provides a realistic assessment of tradeoffs.
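
To build some intuition for why returns diminish, here’s a toy Python sketch (my illustration, not the actual AU Analyzer model): a greedy scheduler that runs each vertex as soon as its prerequisites finish and a slot is free. Once the slot count covers the widest layer of the dependency graph, adding more slots can’t shorten the critical path.

    import heapq

    def makespan(durations, deps, slots):
        """Greedy list scheduling: run each vertex as soon as its prerequisites
        are done and a slot is free; return the total job running time.
        durations: {vertex: running time}; deps: {vertex: set of prerequisites}."""
        remaining = {v: len(deps.get(v, ())) for v in durations}
        dependents = {v: [] for v in durations}
        for v, prereqs in deps.items():
            for p in prereqs:
                dependents[p].append(v)
        ready = [v for v, n in remaining.items() if n == 0]
        running, now = [], 0  # running is a min-heap of (finish_time, vertex)
        while ready or running:
            while ready and len(running) < slots:
                v = ready.pop()
                heapq.heappush(running, (now + durations[v], v))
            now, done = heapq.heappop(running)
            for w in dependents[done]:
                remaining[w] -= 1
                if remaining[w] == 0:
                    ready.append(w)
        return now

    # a made-up job: 1,000 independent extract vertices feed one aggregation vertex
    durations = {f"v{i}": 10 for i in range(1000)}
    durations["agg"] = 50
    deps = {"agg": {f"v{i}" for i in range(1000)}}
    for slots in (100, 500, 1000, 2000):
        print(slots, makespan(durations, deps, slots))
    # past 1,000 slots the running time stops improving: that is the "peak"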

What’s New In Hadoop 3.1?

Wangda Tan, et al., look at some of the new features in Apache Hadoop 3.1:

The diagram below captures the building blocks together at a high level. If you have to tie this back to a fictitious self-flying drone company, the company will collect tons of raw images from the test drones’ built-in cameras for computer vision. Those images can be stored in the Apache Hadoop data lake in a cost-effective (with erasure coding) yet highly available manner (multiple standby namenodes). Instead of providing GPU machines to each of the data scientists, GPU cards are pooled across the cluster for access by multiple data scientists. GPU cards in each server can be isolated for sharing between multiple users.

Support of Docker containerized workloads means that data scientists/data engineers can bring the deep learning frameworks to the Apache Hadoop data lake and there is no need to have a separate compute/GPU cluster. GPU pooling allows the application of the deep learning neural network algorithms and the training of the data-intensive models using the data collected in the data lake at a speed almost 100x faster than regular CPUs.

If the customer wants to pool the FPGA (field programmable gate array) resources instead of GPUs, this is also possible in Apache Hadoop 3.1. Additionally, use of affinity and anti-affinity labels allows us to control how we deploy the microservices in the clusters — some of the components can be set to have anti-affinity so that they are always deployed in separate physical servers.

It’s interesting to see Hadoop evolve over time as the ecosystem solves more real-time problems instead of focusing on giant batch problems.

Using mssql-cli

Prashanth Jayaram shows how to install and use the mssql-cli client:

Switching to the editor mode is pretty simple and straightforward. At the bottom of the screen, we can see the help bar, which guides us through switching between the available editor modes. The options available for instant switching are the multi-line mode, activated by pressing F3, and the Emacs mode, activated by pressing F4.

To run a query in the multi-line mode, terminate it with a semicolon and then press Enter to execute it.

Use the same keys as mentioned above to turn the editor modes on and off: F3 for the multi-line query mode and F4 for the Emacs mode.

If you’re big on command-line interfaces, you’ll probably enjoy this client.

Gaps And Islands With DAX

Philip Seamark answers one of the classic gaps and islands problems with DAX:

A recent post on the Power BI community website asked if it was possible to compress a group of numbers into text that described the sequential ranges contained within the numbers. This might be a group of values such as 1, 2, 3, 4, 7, 8, 9, 12, 13 (note there are gaps), with the expected result grouping the numbers that run in a sequence together to produce text like “1-4, 7-9, 12-13”. Essentially, the task is to identify gaps when creating the text. This seemed like an interesting challenge, and here is how I solved it using DAX.

Read on for the solution, which is conceptually very similar to the T-SQL solution but a bit different in implementation.
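
As a point of reference for the logic (not Seamark’s DAX, just the same gaps-and-islands idea sketched in Python), grouping consecutive values into islands and formatting the ranges looks like this:

    def compress_ranges(values):
        """Collapse integers into gaps-and-islands text, e.g. "1-4, 7-9, 12-13"."""
        nums = sorted(set(values))
        islands, start, prev = [], nums[0], nums[0]
        for n in nums[1:]:
            if n == prev + 1:      # still inside the current island
                prev = n
            else:                  # hit a gap: close this island, start the next
                islands.append((start, prev))
                start = prev = n
        islands.append((start, prev))
        return ", ".join(f"{a}-{b}" if a != b else str(a) for a, b in islands)

    print(compress_ranges([1, 2, 3, 4, 7, 8, 9, 12, 13]))  # 1-4, 7-9, 12-13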

Creating Azure SQL Database Managed Instances Via ARM Templates

Jovan Popovic shows us how to build a Managed Instance of Azure SQL Database using PowerShell and an ARM template:

Values that you need to change in this request are:

  • name – name of your Azure SQL Managed Instance (don’t include the domain).

  • properties/administratorLogin – SQL login that will be used to connect to the instance.

  • properties/subnetId – Azure identifier of the subnet where Azure SQL Managed Instance should be placed. Make sure that you properly configure the network for Azure SQL Managed Instance.

  • location – one of the valid locations for Azure data centers, for example: “westcentralus”.

  • sku/name – GP_Gen4 or GP_Gen5.

  • properties/vCores – number of cores that should be assigned to your instance. Values can be 8, 16, or 24 if you select the GP_Gen4 sku name, or 8, 16, 24, 32, or 40 if you select GP_Gen5.

  • properties/storageSizeInGB – maximum storage space for your instance. It should be a multiple of 32 GB.

  • properties/licenceType – choose BasePrice if you have an on-premises SQL Server licence that you want to use (and get a discount), or LicenceIncluded if you don’t.

  • tags (optional) – key:value pairs that you would use to categorize the instance.

Click through for the template and a quick PowerShell script which shows how to use the template.

The Elitist Shuffle And Recommenders

Rodrigo Agundez shows us a way of displaying fresh recommendations without retraining the recommender system:

Suppose you have 10,000 items in total that can be recommended to your user. You run the recommendation system over all the items, and those 10,000 items get ranked in order of relevance of the content.

The application shows 5 items on the entry screen. The first time the user opens the application after the re-scoring process, the top 5 ranked items are shown. It is decided (based on user control groups, investigation, A/B testing, etc.) that from now on, until the next re-scoring process, the entry screen should not be the same every time, yet should remain relevant for the user.

Based on an investigation by the data scientist, it turns out that somewhat relevant items appear up to item 100. The idea, then, is to somehow shuffle those 100 items such that the top 5 items shown are still relevant but not the same.

Click through for an example in Python and how it compares favorably to a couple other shuffling algorithms.
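
If you want a feel for the mechanics, here’s a rough rank-biased shuffle in Python (assuming numpy; a sketch of the general idea, not necessarily Agundez’s exact algorithm). Weights decay by rank, so sampling without replacement usually keeps the most relevant items near the top while still varying the page between visits.

    import numpy as np

    def rank_biased_shuffle(items, inequality=2.0, seed=None):
        """Sample every position without replacement, weighting by rank.
        Higher `inequality` keeps top-ranked items near the top more often."""
        rng = np.random.default_rng(seed)
        n = len(items)
        # weights fall from ~1 toward 0 by rank, sharpened by `inequality`
        weights = np.linspace(1, 0, n, endpoint=False) ** inequality
        order = rng.choice(n, size=n, replace=False, p=weights / weights.sum())
        return [items[i] for i in order]

    ranked = list(range(100))                        # 0..99, ordered by relevance
    print(rank_biased_shuffle(ranked, seed=42)[:5])  # the 5 items to display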

Scaling Kafka With Consumer Groups

Suhita Goswami explains how to use consumer groups to scale processing from Apache Kafka:

Kafka builds on the publish-subscribe model with the advantages of a message queuing system. It achieves this with:

  • the use of consumer groups
  • message retention by brokers

When consumers join a group and subscribe to a topic, only one consumer from the group actually consumes each message from the topic. The messages are also retained by the brokers in their topic partitions, unlike traditional message queues.

Multiple consumer groups can read from the same set of topics, at different times, catering to different logical application domains. Thus, Kafka provides both the advantage of high scalability via consumers belonging to the same consumer group and the ability to serve multiple independent downstream applications simultaneously.

Consumer groups are a great solution to the problem of long-running consumers when items to process are independent and can run concurrently.
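
To make that concrete, here’s a minimal consumer sketch using the kafka-python package (topic and group names are hypothetical). Run several copies with the same group_id and Kafka spreads the topic’s partitions across them, so each message is handled by exactly one member of the group.

    from kafka import KafkaConsumer  # pip install kafka-python

    consumer = KafkaConsumer(
        "orders",                            # hypothetical topic name
        bootstrap_servers=["localhost:9092"],
        group_id="order-processors",         # same group_id => partitions are shared
        auto_offset_reset="earliest",
        enable_auto_commit=True,
    )

    # each message goes to exactly one consumer in this group; a second group
    # with a different group_id would independently see every message again
    for message in consumer:
        print(f"partition={message.partition} offset={message.offset} "
              f"value={message.value!r}")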
