Press "Enter" to skip to content

Author: Kevin Feasel

Extractors In Scala

Jyoti Sachdeva explains what extractors are in Scala and why they’re useful:

An extractor is an object that has an unapply method. It takes an object as an input and gives back arguments. Custom extractors are created using the unapply method. The unapply method is called extractor because it takes an element of the same set and extracts some of its parts. The apply method, also called injection, acts as a constructor: it takes some arguments and yields an element of a given set.
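
As a quick illustration of the apply/unapply pairing described above, here is a minimal sketch (the EmailAddress object and its fields are my own invention, not from the linked post):

```scala
object EmailAddress {
  // apply ("injection"): build an element of the set from its parts.
  def apply(user: String, domain: String): String = s"$user@$domain"

  // unapply ("extraction"): take an element of the set and pull its parts back out.
  def unapply(address: String): Option[(String, String)] =
    address.split("@") match {
      case Array(user, domain) => Some((user, domain))
      case _                   => None
    }
}

object ExtractorDemo extends App {
  val address = EmailAddress("jane.doe", "example.com") // calls apply

  // Pattern matching calls unapply behind the scenes.
  address match {
    case EmailAddress(user, domain) => println(s"user = $user, domain = $domain")
    case _                          => println("not an email address")
  }
}
```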

Click through for explanatory examples.


The Power Of Predicate Pushdown

Pedro Lopes explains how predicate pushdown helps improve performance on queries:

First, let’s define a few terms, so we can see how to detect whether we’re making good use of our indexes, as they relate to the queries running in our SQL Server.

  1. Whenever you submit a query to SQL Server, if it includes a JOIN and/or WHERE clause, that constitutes a row filtering pattern known as a predicate.
  2. The query optimizer can use that predicate to estimate how best to retrieve only the intended rows. The number of rows it expects to remain after the predicate has been applied surfaces in the query plan as the Estimated Number of Rows.
  3. When that estimated plan is executed, and you look at the actual execution plan, this surfaces as the Actual Number of Rows. Usually, a big difference between Estimated and Actual number of rows indicates a misestimation that may need to be addressed to improve performance: maybe you don’t have the right indexes in place?

These are the two row-related properties you had on every SQL Server plan up to SQL Server 2014.
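
Pedro's post is about SQL Server execution plans, but the core idea of pushing a predicate down is language-agnostic: apply the filter as early as possible so fewer rows flow through the rest of the plan. A rough conceptual sketch of that idea in Scala (the toy Customer/Order data is mine, not from the post):

```scala
object PushdownSketch extends App {
  // Toy "tables": orders joined to customers, then filtered on customer country.
  case class Customer(id: Int, country: String)
  case class Order(id: Int, customerId: Int, amount: Double)

  val customers = (1 to 1000).map(i => Customer(i, if (i % 10 == 0) "FR" else "US"))
  val orders    = (1 to 10000).map(i => Order(i, i % 1000 + 1, i.toDouble))

  // No pushdown: join everything first, apply the predicate afterwards,
  // so every matched order/customer pair flows through the join.
  val filteredLate = orders
    .flatMap(o => customers.find(_.id == o.customerId).map(c => (o, c)))
    .filter { case (_, c) => c.country == "FR" }

  // "Pushdown": apply the predicate while reading customers, so far fewer
  // rows ever reach the join.
  val frCustomers   = customers.filter(_.country == "FR").map(c => c.id -> c).toMap
  val filteredEarly = orders.flatMap(o => frCustomers.get(o.customerId).map(c => (o, c)))

  // Same result either way; the difference is how many rows the later steps touch.
  println(filteredLate.size == filteredEarly.size)
}
```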

Read on to learn how predicate pushdown can make queries faster.


Generating Index Drop And Create Statements

Drew Furgiuele says “Game over, man, game over!” to indexes:

The premise is simple: it will generate a series of DROP and then CREATE INDEX commands for every index. The process is a little more complex in practice, but at a high level it:

  1. Creates a special schema to house a temporary object,
  2. Creates a special stored procedure to run the code,
  3. Calls said stored procedure,
  4. Generates a bunch of PRINT statements that serve as the output (along with new line support for readability),
  5. Cleans up the stored procedure it generated,
  6. And finally deletes the schema it created.

Nifty.

Click through for the script, as well as a bonus PowerShell script. Because hey, it’s only six lines of code.


Instance-To-Instance Migrations With Start-DbaMigration

Chrissy LeMaire touts one of the best parts of dbatools:

dbatools is such a fun toolset to work on, but specifically, I can no longer live without Start-DbaMigration. Even in smaller shops, migrations are often required and they are always a lot of work.

At least they used to be, before I built the command that started it all: Start-DbaMigration. Start-DbaMigration is an instance-to-instance migration command that migrates just about everything. It’s really a wrapper that simplifies nearly 30 other copy commands, including Copy-DbaDatabase, Copy-DbaLogin, and Copy-DbaSqlServerAgent.

Also a bonus shout out to dbachecks.


The Impact Of Auto-Growth Settings For Log Files

Jamie Wick has started a series on log file growth, beginning with a look at auto-growth settings:

For the data file, the impact can be illustrated in the following chain of events:

  1. A new 1MB data file is created that contains no information. (ie. a 1MB data file containing 0MB of data)
  2. Data is written to the data file until it reaches the file size. (ie. the 1MB data file now contains 1MB of data)
  3. The SQL server suspends normal operations to the database while the data file is grown by 1MB. (ie. the data file is now 2MB and contains 1MB of data) If Instant File Initialization (IFI) is enabled, the file is expanded and database operations resume. If IFI is not enabled, the expanded part of the data file must be zeroed before db operations resume, resulting in an additional delay.
  4.  Once the data file has been grown successfully, the server resumes normal database processing. At this point the server loops back to Step 2.

The server will continue this run-pause-run-pause processing until the data file reaches its Maxsize, or the disk becomes full. If the disk that the data file resides on has other files on it (ie. the C drive, or a disk that is shared by several databases), there will be other disk write events happening between the data file growth events. This may cause the data file expansion segments to be non-contiguous, increasing the file fragmentation and further decreasing the database performance.
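
As a back-of-the-envelope illustration of why a tiny fixed auto-growth increment hurts, the sketch below (mine, not from Jamie's post; the sizes and increments are made up) just counts how many grow-and-pause events it takes to reach a given file size:

```scala
object AutoGrowthSketch extends App {
  // Count how many growth (pause) events a file goes through on its way
  // from startMb to targetMb when it grows by a fixed incrementMb each time.
  def growthEvents(startMb: Int, targetMb: Int, incrementMb: Int): Int = {
    var size   = startMb
    var events = 0
    while (size < targetMb) {
      size += incrementMb
      events += 1
    }
    events
  }

  println(growthEvents(1, 10240, 1))   // 1MB increments to reach 10GB: 10239 pauses
  println(growthEvents(1, 10240, 512)) // 512MB increments: 20 pauses
}
```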

This is all to answer the question, “What’s the problem with missing a few log backups?”


Gathering Info On Tables

Raul Gonzales has a script which provides useful information for tables and columns:

Useful information it provides at table level:

  • tableType, to identify HEAP tables
  • row_count, to identify tables with plenty of rows or no rows at all
  • TotalSpaceMB, to identify big tables in size
  • LastUserAccess, to identify tables that are not used
  • TotalUserAccess, to identify tables that are heavily used
  • TableTriggers, to identify tables that have triggers

Useful information it provides at column level:

  • DataType-Size, to identify supersized, incorrect, or deprecated data types
  • Identity, to identify identity columns
  • Mandatory-DefaultValue, to identify NULL/NOT NULL columns or columns with default constraints
  • PrimaryKey, to identify primary key columns
  • Collation, to identify columns whose collation might differ from the database’s
  • ForeignKey-ReferencedColumn, to identify foreign keys and the table.column they reference

Click through for the script.


Installing SQL Server On Ubuntu 18.04

Max Trinidad shows us how to install SQL Server on Ubuntu 18.04, though he leads off with a warning:

This has been an issue for some time until now. I found the following link that helped me install SQL Server on the latest Ubuntu 18.04:

https://askubuntu.com/questions/1032532/how-do-i-install-ms-sql-for-ubuntu-18-04-lts

But there are a few missing steps which, once filled in, help ease the burden of errors. At the same time, the information is a little outdated.

But, it works with the following adjustments.

Please Understand!!  This is NOT approved by Microsoft.  Use this method for Test Only!!

I’m waiting somewhat impatiently for Microsoft and Hortonworks to support Ubuntu 18.04.


Partitioning Data For Performance Improvement In R

John Mount shares a few examples of partitioning and parallelizing data operations in R:

In this note we will show how to speed up work in R by partitioning data and process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base-R and other packages.

For each of the above packages we speed up work by using wrapr::execute_parallel which in turn uses wrapr::partition_tables to partition un-related data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with rquery pipelines.
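
The wrapr/rqdatatable machinery is R-specific (R generally needs separate processes for this), but the underlying pattern — partition independent rows by a key, work on each partition concurrently, then recombine — carries over to other runtimes. A minimal JVM-flavoured sketch in Scala using Futures (the toy Row data and the per-group sum are mine, for illustration only):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object PartitionDemo extends App {
  // Toy rows: each belongs to one of eight independent groups.
  case class Row(group: Int, value: Double)
  val rows = (1 to 100000).map(i => Row(i % 8, i.toDouble))

  // Step 1: partition the rows by a key (roughly what partition_tables does).
  val partitions: Vector[(Int, Seq[Row])] = rows.groupBy(_.group).toVector

  // Step 2: process each partition concurrently (roughly execute_parallel).
  val perGroup = Future.traverse(partitions) { case (g, part) =>
    Future(g -> part.map(_.value).sum)
  }

  // Step 3: collect the per-partition results back into one structure.
  val result: Map[Int, Double] = Await.result(perGroup, Duration.Inf).toMap
  println(result)
}
```

Threads suffice on the JVM where R needs whole processes; the partition-then-combine shape is the part that transfers.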

There were some interesting results.  I expected data.table to be fast, but did not expect dplyr to parallelize so well.


Sharing R Notebooks

Hanyu Cui and Hossein Falaki show how to share a notebook using RMarkdown:

RMarkdown is the dynamic document format RStudio uses. It is normal Markdown plus embedded R (or any other language) code that can be executed to produce outputs, including tables and charts, within the document. Hence, after changing your R code, you can just rerun all code in the RMarkdown file rather than redo the whole run-copy-paste cycle. And an RMarkdown file can be directly exported into multiple formats, including HTML, PDF,  and Word.

Click through for the demo.


Pipe-Friendly Functions In R

William Doane gives some tips on writing pipe-friendly functions in R:

Languages that don’t begin by supporting pipes often eventually implement some version of them. In R, the magrittr package introduced the %>% infix operator as a pipe operator, which is most often pronounced as “then”. For example, “take the mtcars data.frame, THEN take the head of it, THEN…” and so on.

For a function to be pipe friendly, it should at least take a data object (often named .data) as its first argument and return an object of the same type—possibly even the same, unaltered object. This contract ensures that your pipe-friendly function can exist in the middle of a piped workflow, accepting the input from its left-hand side and passing along output to its right-hand side.
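
The contract William describes — take the data first, give back the same shape — is what makes chaining work in any language. For comparison, a small Scala sketch using the standard library’s pipe from scala.util.chaining (Scala 2.13+; the dropMissing/center functions are invented for illustration):

```scala
import scala.util.chaining._

object PipeDemo extends App {
  // "Pipe-friendly" functions: the data is the first (here, only) argument
  // and the return type matches the input type, so calls chain cleanly.
  def dropMissing(xs: Seq[Option[Double]]): Seq[Double] = xs.flatten
  def center(xs: Seq[Double]): Seq[Double] = {
    val mean = xs.sum / xs.size
    xs.map(_ - mean)
  }

  val raw: Seq[Option[Double]] = Seq(Some(1.0), None, Some(4.0), Some(7.0))

  // `pipe` plays the role of magrittr's %>%:
  // "take raw, THEN drop the missing values, THEN center what's left".
  val centered = raw.pipe(dropMissing).pipe(center)
  println(centered) // List(-3.0, 0.0, 3.0)
}
```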

Click through for a couple of examples.  H/T R-Bloggers
