Author: Kevin Feasel

On Whether Relational Data Belongs In A Data Lake

Melissa Coates debates whether relational data really belongs in a data lake:

For certain types of data, writing it to the data lake is frequently the best choice. This is often true for low-latency IoT data, semi-structured data like logs, and varying structures such as social media data. However, the handling of structured data which originates from a relational database is much less clear.

Most data lake technologies store data as files (like CSV, JSON, or Parquet). This means that when we extract relational data into a file stored in a data lake, we lose valuable metadata from the database such as data types, constraints, foreign keys, etc. I tend to say that we “de-relationalize” data when we write it to a file in the data lake. If we’re going to turn right around and load that data to a relational database destination, is it the right call to write it out to a file in the data lake as an intermediary step?
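To make the “de-relationalize” point concrete, here is a small sketch (a hypothetical orders table, with pandas standing in for the extract step) of what a CSV round trip does to column types:

```python
import io
import pandas as pd

# A hypothetical relational extract: typed columns, as they would come out of a database table.
orders = pd.DataFrame({
    "order_id": pd.array([1, 2, 3], dtype="int32"),
    "order_date": pd.to_datetime(["2018-11-01", "2018-11-02", "2018-11-03"]),
    "amount": pd.array([10.50, 20.00, 7.25], dtype="float64"),
})
print(orders.dtypes)

# Land it in the lake as CSV; the file carries no types, keys, or constraints...
buf = io.StringIO()
orders.to_csv(buf, index=False)
buf.seek(0)

# ...so on the way back out, every type has to be re-inferred or re-declared.
round_trip = pd.read_csv(buf)
print(round_trip.dtypes)  # order_id becomes int64, order_date becomes object, etc.
```

A columnar format like Parquet at least preserves column types, but constraints and foreign keys still stay behind in the source database.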

Click through for considerations on both sides of the fence.

Bias Correction In Standard Deviation Estimates

John Mount explains how to perform bias correction and why it happens so rarely in practice:

The bias in question is falling off at a rate of 1/n (where n is our sample size). So the bias issue loses what little gravity it may ever have had when working with big data. Most sources of noise will be falling off at a slower rate of 1/sqrt(n), so it is unlikely this bias is going to be the worst feature of your sample.

But let’s pretend the sample size correction indeed is an important point for a while.

Under the “no bias allowed” rubric: if it is so vitally important to bias-correct the variance estimate, would it not be equally critical to correct the standard deviation estimate?

The practical answer seems to be: no. The straightforward standard deviation estimate itself is biased (it has to be, as a consequence of Jensen’s inequality). And pretty much nobody cares, corrects it, or teaches how to correct it, as it just isn’t worth the trouble.

This is a good explanation of the topic as well as the reason people make these corrections so rarely.
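For the curious, a quick simulation shows both effects: Bessel’s correction makes the variance estimate unbiased, yet its square root still under-estimates sigma, and the classical c4 factor (for normally distributed data) corrects that. This is just a sketch with made-up parameters:

```python
import math
import numpy as np

rng = np.random.default_rng(42)
sigma = 2.0        # true population standard deviation
n = 5              # small sample size, where the bias is most visible
trials = 200_000

samples = rng.normal(loc=0.0, scale=sigma, size=(trials, n))

# ddof=1 applies Bessel's correction, giving an unbiased variance estimate...
var_unbiased = samples.var(axis=1, ddof=1)
# ...but taking the square root re-introduces bias (Jensen's inequality).
sd_naive = np.sqrt(var_unbiased)

# c4(n): the classical correction factor for normal data, E[s] = c4 * sigma.
c4 = math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

print("mean unbiased variance:", var_unbiased.mean(), " true:", sigma ** 2)
print("mean naive sd:         ", sd_naive.mean(), " true:", sigma)
print("mean corrected sd:     ", (sd_naive / c4).mean())
```

With n = 5 the naive standard deviation comes out around 6% low, which is exactly the sort of small, slowly-vanishing effect the post argues is rarely worth correcting.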

Kerberos Authentication In Apache Cassandra

Justin Cameron announces an open-source Kerberos authenticator for Apache Cassandra:

In conjunction with the Cassandra authenticator, we have also published an open-source Kerberos authenticator plugin for the Cassandra Java driver.

The plugin supports multiple Kerberos quality of protection (QOP) levels, which may be specified directly when configuring the authenticator. The driver’s QOP level must match the QOP level configured for the server authenticator, and is only used during the authentication exchange. If confidentiality and/or integrity protection is required for all traffic between the client and Cassandra, it is recommended that Cassandra’s built-in SSL/TLS be used (note that TLS also protects the Kerberos authentication exchange, when enabled).

An (optional) SASL authorization ID is also supported. If provided, it specifies a Cassandra role that will be assumed once the Kerberos client principal has authenticated, provided the Cassandra user represented by the client principal has been granted permission to assume the role. Access to other roles may be granted using the GRANT ROLE CQL statement.
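The post and repo cover the Java driver plugin; as a rough point of comparison, the DataStax Python driver exposes the same SASL/GSSAPI handshake through its SaslAuthProvider. The host name, service principal, and commented-out authorization ID below are placeholders, and the chosen QOP must match what the server-side authenticator is configured to accept:

```python
from cassandra.cluster import Cluster
from cassandra.auth import SaslAuthProvider

# SASL options for Kerberos (GSSAPI); requires the pure-sasl package and a valid ticket.
sasl_kwargs = {
    "service": "cassandra",     # Kerberos service name of the cluster
    "mechanism": "GSSAPI",
    "qops": ["auth"],           # authentication only; use TLS for confidentiality/integrity
    # "authorization_id": "analytics_role",  # optional SASL authz ID mapped to a Cassandra role
}

auth_provider = SaslAuthProvider(**sasl_kwargs)
cluster = Cluster(["cassandra-node-1.example.com"], auth_provider=auth_provider)
session = cluster.connect()
print(session.execute("SELECT release_version FROM system.local").one())
```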

Click through for more details and check out the GitHub repo.

Wat-Provenance And Debugging Distributed Systems

Adrian Colyer reviews an interesting paper on debugging distributed systems:

Why why-provenance doesn’t work

Relational databases have why-provenance, which sounds on the surface exactly like what we’re looking for.

Given a relational database, a query issued against the database, and a tuple in the output of the query, why-provenance explains why the output tuple was produced. That is, why-provenance produces the input tuples that, if passed through the relational operators of the query, would produce the output tuple in question.

One reason that won’t work in our distributed systems setting is that the state of the system is not relational, and the operations can be much more complex and arbitrary than the well-defined set of relational operators why-provenance works with.
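If why-provenance is new to you, a toy example may help. This sketch (plain Python standing in for the relational engine, with made-up data) finds the witnesses, i.e. the input tuples, behind one output tuple of a simple join:

```python
# Two tiny relations and a join query.
employees = [("alice", "eng"), ("bob", "eng"), ("carol", "sales")]
departments = [("eng", "building-1"), ("sales", "building-2")]

# Query: SELECT name, building FROM employees JOIN departments USING (dept)
output = [(name, building)
          for (name, dept) in employees
          for (d, building) in departments
          if dept == d]

def why(output_tuple):
    """Why-provenance: the combinations of input tuples that produce the output tuple."""
    return [((name, dept), (d, building))
            for (name, dept) in employees
            for (d, building) in departments
            if dept == d and (name, building) == output_tuple]

print(why(("alice", "building-1")))
# [(('alice', 'eng'), ('eng', 'building-1'))]
```

The paper’s point is that once the “query” is an arbitrary distributed system rather than a handful of relational operators, this kind of explanation no longer falls out so neatly.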

Read the whole thing.

The Value Of Power BI Dataflows

Matt Allington gets to the core benefits of Power BI Dataflows:

Dataflows are:

  1. An online service provided by Microsoft as part of Power BI (software as a service, or SaaS).

  2. In effect, dataflows are an online data collection and storage tool.

    • Collection:  It uses Power Query to connect to the data at the source and transform that data as needed.
      • You will need to be able to access the data either through a cloud service (such as Dynamics 365) or on your PC/network via a gateway.
      • You can also use Power Query to write queries from scratch, such as my Power BI calendar table.
    • Storage:  Dataflows then store that data in a table in the cloud so it can be used directly inside PowerBI.com, but more importantly (from my view) directly from Power BI Desktop.
  3. Dataflows leverage the Power Query skills you have learnt (or are learning) using other tools (like Power BI Desktop, Power Query for Excel), allowing you to reuse those same skills in this online tool.

  4. Tables that are created as a result of the dataflow are stored in an Azure Data Lake.

    • If you don’t know what that is, don’t worry – I don’t understand it either.  The point is it doesn’t matter because it is all done automatically for you by the tool.
  5. Dataflows include the concept of the Common Data Service (CDS) or Common Data Model directly in the tool, and you don’t have to know what it is, nor care.

    • If you don’t know what that is, don’t worry – it doesn’t matter now/yet.

    • This will become very important in the future as it will make the process of getting data out of complex databases (such as MS Dynamics 365) much easier.

Click through for more detail as well as some good uses for Dataflows.

Using Snippets In SSMS

Eduardo Pivaral shows us how to use snippets in SQL Server Management Studio:

If you work with SQL Server on a daily basis, it is very likely you have a lot of custom scripts you have to execute frequently. Maybe you have stored them in a folder and open them manually as you need them, or saved them in a solution or project file, or maybe you execute a custom .bat or PowerShell file to load them when you open SSMS…

Every method has its pros and cons, and in this post, I will show you a new method to load your custom scripts into any open query window in SSMS via snippets.

Click through for more details, including an example.  Snippets are a good tool, implemented adequately in SSMS.  A few third-party extensions make working with snippets better and really valuable (until you’re stuck on a machine without your snippets).

Removing The Azure Module

Max Trinidad has built a function to remove older versions of the Azure module:

As you probably know by now, the “AzureRM” modules have been renamed to the “Az” module. Microsoft wants you to start using this module moving forward. Currently, this new release is on version 0.5.0, and you’ll need to remove any previous module(s) installed. Information about Azure PowerShell can be found at the following link.

Now, manually removing module dependencies has always been a tedious task, and there’s no exception with the “Az” module.  So, we can all take advantage of PowerShell and create a script to work around this limitation.

And below are a few options.

Max also provides a couple of other options.

A Compendium Of Bad (Or Misleading) Performance Tips

Grant Fritchey responds to a long list of performance tips of greater or (mostly) lesser value:

Index the predicates in JOIN, WHERE, ORDER BY and GROUP BY clauses

What about the HAVING clause? Does the column order matter? Should we use a single-column or multi-column index? INCLUDE statements? What kind of index: clustered, non-clustered, columnstore, XML, spatial? This piece of advice is benign but so non-specific it’s almost useless. Let me summarize: Indexes can be good.

Do not use sp_* naming convention

So, this one is true because it will add a VERY small amount of overhead as SQL Server searches the master database first for your object. However, for most of us, most of the time, this is so far down the list of worries about our database as to effectively vanish from sight.

There’s a pretty long list of things here, most of which Grant considers either incomplete, irrelevant, or sometimes flat-out wrong.

Controlling Azure Services In R With AzureR

Hong Ooi announces a new set of packages called AzureR:

As background, some of you may remember the AzureSMR package, which was written a few years back as an R interface to Azure. AzureSMR was very successful and gained a significant number of users, but it was never meant to be maintainable in the long term. As more features were added it became more unwieldy until its design limitations became impossible to ignore.

The AzureR family is a refactoring/rewrite of AzureSMR that aims to fix the earlier package’s shortcomings.

The core package of the family is AzureRMR, which provides a lightweight yet powerful interface to Azure Resource Manager. It handles authentication (including automatically renewing when a session token expires), managing resource groups, and working with individual resources and templates. It also calls the Resource Manager REST API directly, so you don’t need to have PowerShell or Python installed; it depends only on commonly used R packages like httr, jsonlite and R6.

This won’t replace the PowerShell libraries, but it looks like it’d be useful for scenarios where you need to set up a VM, train a model, and then shut down the VM.

Game Theory With Apache Spark

Konor Unyelioglu has a four-part series on solving game-theoretic problems with Apache Spark.  Part one lays out the scenario:

One application of game theory is finding optimal resource allocation. For example, as discussed in this article, resource management for heterogeneous wireless networks involves sharing network links, e.g. 3G, Wi-Fi, WiMAX, LTE, between mobile devices of different types and different bandwidth needs. In such environments, game theory algorithms can be effectively used to decide which devices must be allocated to which network resources. Similarly, game theory can be used for allocation of cloud computing resources, e.g. CPU, storage, memory or network bandwidth, between resource clients, as discussed in this article (also see here). The concept of Mobile Edge Computing, where mobile devices offload computationally intensive tasks to the small scale computing servers located in the network edge, could utilize game theory concepts for resource allocation, as studied here.

Using game theory for resource allocation is not limited to cloud computing or telecommunications. For example, in a recent study, a technique was developed based on game theory for efficient distribution of water supply to consumers. Optimum decision making for traffic flow control at major traffic intersections can also be modeled using concepts from game theory, as studied in this article.

Part two defines an algorithm for maximizing utility given the finite set of resources:

Consider Q_i(P) defined previously for i = 1, …, N. Let Q_i^1(P) be defined as the K-dimensional vector where the j-th entry is 1 if and only if there exists an element in Q_i(P) where the j-th entry is greater than 0, j = 1, …, K. In other words, if the j-th entry of Q_i^1(P) is 0 then for every element in Q_i(P) the j-th entry must be 0; if the j-th entry of Q_i^1(P) is 1 then for at least one element in Q_i(P) the j-th entry must be 1.

Part 1 starts with the initial price vector at 0, i.e. P = 0, and then at each iterative step finds a new price vector, built on the previous one, that minimizes C(P). At each step, the newly constructed price vector is guaranteed to be no less than the previous one. When the price vector no longer increases, i.e. the newly constructed and previous price vectors are equal, the optimal price P_o has been reached. Along with P_o we also obtain Q_i^1(P_o), i = 1, …, N, which we call optimal assignments. If the j-th entry of Q_i^1(P_o) = 0 then agent i will not be allocated any units of resource type j. On the other hand, if the j-th entry of Q_i^1(P_o) = 1 then agent i may be allocated some units of resource type j in Part 2 of the algorithm, although not necessarily.

Part three lays out some helper methods for solving the problem in Spark:

For an agent i, the method getMaxUtility() below calculates V_i(P) at price P, i.e. it solves the maximization problem:

max_{x ∈ X_i} { U_i(x) – P * x }

where X_i is the consumption set of the agent.

Recall that

  • U_i = [u_i1 u_i2 … u_iK]^T
  • U_i(x) = U_i^T * x = ∑_{j = 1, 2, …, K} (u_ij * x_ij)
  • U_i(x) – P * x = ∑_{j = 1, 2, …, K} (u_ij – p_j) * x_ij
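The series implements this step in Java on Spark; as a rough Python sketch of the same per-agent maximization, assume (purely for illustration) that the consumption set X_i is a box constraint 0 ≤ x_ij ≤ d_ij, in which case the linear objective is maximized by taking every unit whose utility exceeds its price:

```python
import numpy as np

def get_max_utility(u_i, p, demand_cap):
    """Sketch of max_{x in X_i} { U_i(x) - P * x } for a box-shaped consumption set."""
    u_i, p, demand_cap = map(np.asarray, (u_i, p, demand_cap))
    # With a linear objective, take the full cap wherever unit utility beats the price.
    x = np.where(u_i > p, demand_cap, 0)
    value = float(np.dot(u_i - p, x))
    return x, value

# Agent utilities u_ij for K = 3 resource types, current prices p_j, and demand caps d_ij.
x_opt, v_i = get_max_utility(u_i=[5.0, 2.0, 4.0], p=[3.0, 3.0, 1.0], demand_cap=[2, 4, 1])
print(x_opt, v_i)  # [2 0 1] 7.0
```

The actual consumption sets in the series can be richer than a simple box, so treat this only as intuition for what getMaxUtility() is computing.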

Part four shows us the code for the solution and wraps up:

In this article, we discussed an algorithm based on game theory for optimal resource allocation. The algorithm provides a fairness-based equilibrium where every agent (bidder) maximizes its utility and the resource manager (auctioneer) maximizes the price of the resources it is allocating. In addition, all the available units are allocated across all resource types and no agent is forced to take more than it is willing to. The algorithm is based on economist Ausubel’s Efficient Dynamic Auction Method.

We showed via two examples that the algorithm can be applied to different types of resource allocation problems. In one example, we applied the algorithm to allocate cloud computing resources, e.g. CPU, memory, bandwidth, to computing clients. Secondly, we applied the algorithm to a logistics example where various types of goods are transported over shared transportation resources.

If you were to create a parlor game around things guaranteed to show up in Curated SQL, “Game theory with Apache Spark” is way up on the list.  If somebody does a post combining Apache Kafka with agorics, that’s an instant link too.
