
Month: November 2018

Wat-Provenance And Debugging Distributed Systems

Adrian Colyer reviews an interesting paper on debugging distributed systems:

Why why-provenance doesn’t work

Relational databases have why-provenance, which sounds on the surface exactly like what we’re looking for.

Given a relational database, a query issued against the database, and a tuple in the output of the query, why-provenance explains why the output tuple was produced. That is, why-provenance produces the input tuples that, if passed through the relational operators of the query, would produce the output tuple in question.

One reason that won’t work in our distributed systems setting is that the state of the system is not relational, and the operations can be much more complex and arbitrary than the well-defined set of relational operators why-provenance works with.

Read the whole thing.


The Value Of Power BI Dataflows

Matt Allington gets to the core benefits of Power BI Dataflows:

Dataflows are:

  1. An online service provided by Microsoft as part of Power BI (software as a service, or SaaS).

  2. In effect dataflows are an online data collection and storage tool.

    • Collection:  It uses Power Query to connect to the data at the source and transform that data as needed.
      • You will need to be able to access the data either through a cloud service (such as Dynamics 365) or through a gateway to your PC/network.
      • You can also use Power Query to write queries from scratch, such as my Power BI calendar table.
    • Storage:  Dataflows then store that data in a table in the cloud so it can be used directly inside PowerBI.com, but more importantly (from my view), directly from Power BI Desktop.
  3. Dataflows leverage the Power Query skills you have learnt (or are learning) using other tools (like Power BI Desktop or Power Query for Excel), allowing you to reuse those same skills in this online tool.

  4. Tables that are created as a result of the dataflow are stored in an Azure Data Lake.

    • If you don’t know what that is, don’t worry – I don’t understand it either.  The point is it doesn’t matter because it is all done automatically for you by the tool.
  5. Dataflows include the concept of the Common Data Service (CDS) or Common Data Model directly in the tool, and you don’t have to know what it is, nor care.

    • If you don’t know what that is, don’t worry – it doesn’t matter now/yet.

    • This will become very important in the future, as it will make the process of getting data out of complex databases (such as MS Dynamics 365) much easier.

Click through for more detail as well as some good uses for Dataflows.


Using Snippets In SSMS

Eduardo Pivaral shows us how to use snippets in SQL Server Management Studio:

If you work with SQL Server on a daily basis, it is very likely you have a lot of custom scripts you have to execute frequently. Maybe you have stored them in a folder and open them manually as you need them, maybe you have saved them in a solution or project file, or maybe you execute a custom .bat or PowerShell file to load them when you open SSMS…

Every method has its pros and cons, and in this post I will show you a new method to load your custom scripts into any open query window in SSMS via Snippets.

Click through for more details, including an example.  Snippets are a good tool, implemented adequately in SSMS.  A few third-party extensions make working with snippets even better and genuinely valuable (until you’re stuck on a machine without your snippets).


Removing The Azure Module

Max Trinidad has built a function to remove older versions of the Azure module:

As you probably know by now, the “AzureRM” modules have been renamed to the “Az” module, and Microsoft wants you to use this new module moving forward. Currently, the new release is on version 0.5.0, and you’ll need to remove any previously installed module(s). Information about Azure PowerShell can be found at the following link.

Now, manually removing module dependencies has always been a tedious task, and there’s no exception with the “Az” module.  So we can all take advantage of PowerShell and create a script to work around this limitation.

And below are a few options.

Max provides a couple of other options as well.


A Compendium Of Bad (Or Misleading) Performance Tips

Grant Fritchey responds to a long list of performance tips of greater or (mostly) lesser value:

Index the predicates in JOIN, WHERE, ORDER BY and GROUP BY clauses

What about the HAVING clause? Does the column order matter? Should we use a single-column or a multi-column index? INCLUDE statements? What kind of index: clustered, non-clustered, columnstore, XML, spatial? This piece of advice is benign but so non-specific it’s almost useless. Let me summarize: Indexes can be good.

Do not use sp_* naming convention

So, this one is true because it will add a VERY small amount of overhead as SQL Server searches the master database first for your object. However, for most of us, most of the time, this is so far down the list of worries about our database as to effectively vanish from sight.

There’s a pretty long list of things here, most of which Grant considers either incomplete, irrelevant, or sometimes flat-out wrong.


Controlling Azure Services In R With AzureR

Hong Ooi announces a new set of packages called AzureR:

As background, some of you may remember the AzureSMR package, which was written a few years back as an R interface to Azure. AzureSMR was very successful and gained a significant number of users, but it was never meant to be maintainable in the long term. As more features were added it became more unwieldy until its design limitations became impossible to ignore.

The AzureR family is a refactoring/rewrite of AzureSMR that aims to fix the earlier package’s shortcomings.

The core package of the family is AzureRMR, which provides a lightweight yet powerful interface to Azure Resource Manager. It handles authentication (including automatically renewing when a session token expires), managing resource groups, and working with individual resources and templates. It also calls the Resource Manager REST API directly, so you don’t need to have PowerShell or Python installed; it depends only on commonly used R packages like httr, jsonlite and R6.

This won’t replace the PowerShell libraries, but it looks like it’d be useful for scenarios where you need to set up a VM, train a model, and then shut down the VM.


Game Theory With Apache Spark

Konor Unyelioglu has a four-part series on solving game theoretical problems with Apache Spark.  Part one lays out the scenario:

One application of game theory is finding optimal resource allocation. For example, as discussed in this article, resource management for heterogeneous wireless networks involves sharing network links, e.g. 3G, Wi-Fi, WiMAX, LTE, between mobile devices of different types and different bandwidth needs. In such environments, game theory algorithms can be effectively used to decide which devices must be allocated to which network resources. Similarly, game theory can be used for allocation of cloud computing resources, e.g. CPU, storage, memory or network bandwidth, between resource clients, as discussed in this article (also see here). The concept of Mobile Edge Computing, where mobile devices offload computationally intensive tasks to the small scale computing servers located in the network edge, could utilize game theory concepts for resource allocation, as studied here.

Using game theory for resource allocation is not limited to cloud computing or telecommunications. For example, in a recent study, a technique was developed based on game theory for efficient distribution of water supply to consumers. Optimum decision making for traffic flow control at major traffic intersections can also be modeled using concepts from game theory, as studied in this article.

Part two defines an algorithm for maximizing utility given the finite set of resources:

Consider $Q_i(P)$ defined previously for i = 1, …, N. Let $Q_i^1(P)$ be defined as the K-dimensional vector where the j-th entry is 1 if and only if there exists an element in $Q_i(P)$ where the j-th entry is greater than 0, j = 1, …, K. In other words, if the j-th entry of $Q_i^1(P)$ is 0 then for every element in $Q_i(P)$ the j-th entry must be 0; if the j-th entry of $Q_i^1(P)$ is 1 then for at least one element in $Q_i(P)$ the j-th entry must be 1.

Part 1 starts with the initial price vector at 0, i.e. P = 0, and then at each iterative step finds a new price vector, built on the previous one, that minimizes C(P). At each step, the newly constructed price vector is guaranteed to be no less than the previous one. When the price vector no longer increases, i.e. the newly constructed and previous price vectors are equal, the optimal price $P^o$ has been reached. Along with $P^o$ we also obtain $Q_i^1(P^o)$, i = 1, …, N, which we call the optimal assignments. If the j-th entry of $Q_i^1(P^o)$ is 0 then agent i will not be allocated any units of resource type j. On the other hand, if the j-th entry of $Q_i^1(P^o)$ is 1 then agent i may be allocated some units of resource type j in Part 2 of the algorithm, although not necessarily.
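
The control flow of that first phase is easy to sketch. Below is a minimal, hypothetical Scala skeleton of the ascending-price loop described above; the nextPrice function stands in for the step that minimizes C(P), which the excerpt does not spell out:

    // Hypothetical sketch of the ascending-price iteration described above.
    // `nextPrice` stands in for the step that minimizes C(P) given the current
    // price vector; that step is not detailed in the quoted excerpt.
    def findOptimalPrice(
        numResources: Int,
        nextPrice: Vector[Double] => Vector[Double]): Vector[Double] = {

      // Start with the initial price vector at 0, i.e. P = 0.
      var price = Vector.fill(numResources)(0.0)
      var converged = false

      while (!converged) {
        val updated = nextPrice(price)          // guaranteed to be no less than `price`
        if (updated == price) converged = true  // fixed point: optimal price P^o reached
        else price = updated
      }
      price
    }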

Part three lays out some helper methods for solving the problem in Spark:

For an agent i, the method getMaxUtility() below calculates $V_i(P)$ at price P, i.e. it solves the maximization problem:

$\max_{x \in X_i} \{\, U_i(x) - P \cdot x \,\}$

where $X_i$ is the consumption set of the agent.

Recall that

  • $U_i = [\,u_{i1}\ u_{i2}\ \dots\ u_{iK}\,]^T$
  • $U_i(x) = U_i^T x = \sum_{j=1}^{K} u_{ij}\, x_{ij}$
  • $U_i(x) - P \cdot x = \sum_{j=1}^{K} (u_{ij} - p_j)\, x_{ij}$
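
To make the linear objective concrete, here is a small Scala sketch in the spirit of getMaxUtility(), not the author's implementation. It assumes, purely for illustration, that the consumption set $X_i$ is a simple per-resource upper bound, in which case the maximizer takes the full amount of any resource whose utility exceeds its price and none of the rest:

    // Illustrative only: solves max over x in X_i of sum_j (u_ij - p_j) * x_ij,
    // assuming X_i = { x : 0 <= x_j <= cap_j }. The box-shaped consumption set is
    // an assumption made for this sketch, not part of the original article.
    def maxUtility(
        utilities: Vector[Double],   // u_i1 .. u_iK
        prices: Vector[Double],      // p_1 .. p_K
        caps: Vector[Double]         // per-resource upper bounds defining X_i
      ): (Double, Vector[Double]) = {

      // Take resource j in full only when u_ij - p_j is positive.
      val allocation = utilities.zip(prices).zip(caps).map {
        case ((u, p), cap) => if (u > p) cap else 0.0
      }
      // V_i(P): the maximized surplus at price P.
      val value = allocation.zip(utilities.zip(prices)).map {
        case (x, (u, p)) => (u - p) * x
      }.sum
      (value, allocation)
    }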

Part four shows us the code for the solution and wraps up:

In this article, we discussed an algorithm based on game theory for optimal resource allocation. The algorithm provides a fairness-based equilibrium where every agent (bidder) maximizes its utility and the resource manager (auctioneer) maximizes the price of the resources it is allocating. In addition, all the available units are allocated across all resource types and no agent is forced to take more than it is willing to. The algorithm is based on economist Ausubel’s Efficient Dynamic Auction Method.

We showed via two examples that the algorithm can be applied to different types of resource allocation problems. In one example, we applied the algorithm to allocate cloud computing resources, e.g. CPU, memory, bandwidth, to computing clients. Secondly, we applied the algorithm to a logistics example where various types of goods are transported over shared transportation resources.

If you were to create a parlor game around things guaranteed to show up in Curated SQL, “Game theory with Apache Spark” is way up on the list.  If somebody does a post combining Apache Kafka with agorics, that’s an instant link too.


Quick Spark Notes

Leela Prasad has a few quick notes on concepts in Apache Spark:

Broadcast Variables

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
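
For reference, the basic broadcast pattern in Scala looks like this (a minimal sketch using the standard SparkContext API; the lookup table and keys are made up for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
    val sc = spark.sparkContext

    // Ship a local lookup table to each executor once, instead of serializing
    // it with every task closure.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2, "c" -> 3))

    // Tasks read the cached, deserialized copy through .value.
    val scores = sc.parallelize(Seq("a", "b", "z"))
      .map(key => lookup.value.getOrElse(key, 0))

    scores.collect()  // Array(1, 2, 0)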

There’s some good stuff on accumulators and the SparkSession object in there as well.


Azure Databricks Geospatial Analysis

Jose Mendes gives us an example of using Azure Databricks to perform geospatial analysis:

Magellan is a distributed execution engine for geospatial analytics on big data. It is implemented on top of Apache Spark and deeply leverages modern database techniques like efficient data layout, code generation and query optimization in order to optimize geospatial queries (further details here).

Although it is mentioned on the GitHub page that the 1.0.5 Magellan library is available for Apache Spark 2.3+ clusters, I learned through a very difficult process that the only way to make it work in Azure Databricks is with an Apache Spark 2.2.1 cluster running Scala 2.11. The cluster I used for this experiment consisted of a Standard_DS3_v2 driver type with 14GB memory, 4 cores, and auto-scaling enabled.

In terms of datasets, I used the NYC Taxicab dataset to create the geometry points and the Magellan NYC Neighbourhoods GeoJSON dataset to extract the polygons. Both datasets were stored in blob storage and added to Azure Databricks as a mount point.
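
For a flavour of what the Magellan code looks like, here is a rough Scala sketch of a point-in-polygon join in the style of Magellan's documented DSL. The paths and column names are invented, and the exact imports and operators may differ slightly by library version:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.magellan.dsl.expressions._

    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    // Build a geometry point column from taxi pickup coordinates (hypothetical columns).
    val pickups = spark.read.parquet("/mnt/taxi/yellow_tripdata")
      .select(point($"pickup_longitude", $"pickup_latitude").as("point"))

    // Load the neighbourhood polygons from the mounted GeoJSON dataset.
    val neighbourhoods = spark.read
      .format("magellan")
      .option("type", "geojson")
      .load("/mnt/geo/nyc_neighbourhoods.geojson")
      .select($"polygon", $"metadata")

    // Point-in-polygon join: which neighbourhood does each pickup fall inside?
    val joined = pickups.join(neighbourhoods).where($"point" within $"polygon")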

It sounds like this is much faster than using U-SQL to perform the same task.


Storage Forecasting In SentryOne

Steven Wright shows off a neat feature in the SentryOne product:

With that in mind, SentryOne has been working toward integrating advanced predictive analytics into our products. SentryOne 18.5 includes the first feature set to make use of advanced analytics and machine learning technology—introducing advanced Storage Forecasting.

In its most advanced configuration, SentryOne now applies machine learning algorithms to produce daily usage forecasts for all logical disks on your Windows servers. A forecast is customized for each volume, and each day the system will analyze its previous forecasts to learn how to adjust with each new run. This is done by generating forecasts using multiple algorithms and combining the results into an ensemble forecast. Each day, the system will review how each component forecast performed and weigh it proportionally within the ensemble. Over time, the system learns the unique workload of each volume and fine-tunes its forecasts accordingly.

There are several places within the SentryOne client where you can review and act on this information to prevent outages or perform capacity planning. Let’s start with the Disk Space tab.
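
SentryOne has not published the algorithm, but the general idea of weighting component forecasts by their recent accuracy is easy to illustrate. The following Scala sketch uses inverse-error weighting, which is my assumption for illustration and not a description of the product:

    // Purely illustrative: combine several component forecasts into an ensemble,
    // weighting each model inversely to its recent absolute error. This is an
    // assumed scheme, not SentryOne's implementation.
    def ensembleForecast(
        forecasts: Map[String, Double],     // model name -> forecasted GB used
        recentErrors: Map[String, Double]   // model name -> recent mean absolute error
      ): Double = {

      // Smaller recent error => larger weight.
      val weights = recentErrors.map { case (model, err) => model -> 1.0 / (err + 1e-9) }
      val totalWeight = weights.values.sum

      // Weighted average of the component forecasts.
      forecasts.map { case (model, value) => value * (weights(model) / totalWeight) }.sum
    }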

With my only information being the blog post (and especially the pictures), I’d guess that they’re probably using some variant of ARIMA to forecast disk utilization.
