
Day: November 13, 2018

Bias Correction In Standard Deviation Estimates

John Mount explains how to bias-correct standard deviation estimates and why it is so rarely done in practice:

The bias in question is falling off at a rate of 1/n (where n is our sample size). So the bias issue loses what little gravity it may ever have had when working with big data. Most sources of noise will be falling off at a slower rate of 1/sqrt(n), so it is unlikely this bias is going to be the worst feature of your sample.

But let’s pretend the sample size correction indeed is an important point for a while.

Under the “no bias allowed” rubric: if it is so vitally important to bias-correct the variance estimate, would it not be equally critical to correct the standard deviation estimate?

The practical answer seems to be: no. The straightforward standard deviation estimate itself is biased (it has to be, as a consequence of Jensen’s inequality). And pretty much nobody cares, corrects it, or teaches how to correct it, as it just isn’t worth the trouble.

This is a good explanation of the topic as well as the reason people make these corrections so rarely.
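
To make the size of the bias concrete, here is a minimal sketch of the textbook correction factor usually written c4(n). It assumes independent, normally distributed observations, and it is not necessarily the approach used in the linked post. The function names `c4` and `unbiased_std` are illustrative choices, not taken from the source.

```python
import math


def c4(n: int) -> float:
    """Expected value of s / sigma for n i.i.d. normal observations (n >= 2).

    Here s is the usual sample standard deviation with the n - 1 denominator.
    Assumption: the data are normal; other distributions need other factors.
    """
    if n < 2:
        raise ValueError("need at least two observations")
    # c4(n) = sqrt(2 / (n - 1)) * Gamma(n / 2) / Gamma((n - 1) / 2),
    # computed through log-gamma so large n does not overflow.
    log_ratio = math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)


def unbiased_std(sample) -> float:
    """Sample standard deviation divided by c4(n) (normality assumed)."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return s / c4(n)


# Relative bias 1 - c4(n) shrinks roughly like 1 / (4n).
for n in (2, 5, 10, 100, 1000):
    print(n, round(c4(n), 5), round(1 - c4(n), 6))
```

Since c4(n) is below 1 for every finite n, the uncorrected estimate runs low on average, and the gap closes at the 1/n rate mentioned in the quote.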


Kerberos Authentication In Apache Cassandra

Justin Cameron announces an open source Kerberos authenticator for Apache Cassandra:

In conjunction with the Cassandra authenticator, we have also published an open-source Kerberos authenticator plugin for the Cassandra Java driver.

The plugin supports multiple Kerberos quality of protection (QOP) levels, which may be specified directly when configuring the authenticator. The driver’s QOP level must match the QOP level configured for the server authenticator, and is only used during the authentication exchange. If confidentiality and/or integrity protection is required for all traffic between the client and Cassandra, it is recommended that Cassandra’s built-in SSL/TLS be used (note that TLS also protects the Kerberos authentication exchange, when enabled).

An (optional) SASL authorization ID is also supported. If provided, it specifies a Cassandra role that will be assumed once the Kerberos client principal has authenticated, provided the Cassandra user represented by the client principal has been granted permission to assume the role. Access to other roles may be granted using the GRANT ROLE CQL statement.

Click through for more details and check out the GitHub repo.


Wat-Provenance And Debugging Distributed Systems

Adrian Colyer reviews an interesting paper on debugging distributed systems:

Why why-provenance doesn’t work

Relational databases have why-provenance, which sounds on the surface exactly like what we’re looking for.

Given a relational database, a query issued against the database, and a tuple in the output of the query, why-provenance explains why the output tuple was produced. That is, why-provenance produces the input tuples that, if passed through the relational operators of the query, would produce the output tuple in question.

One reason that won’t work in our distributed systems setting is that the state of the system is not relational, and the operations can be much more complex and arbitrary than the well-defined set of relational operators why-provenance works with.

Read the whole thing.
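
As a rough illustration of the relational case described above, the sketch below computes why-provenance for one fixed join query by brute force. The tables, column layout, and the function name `why_provenance` are all hypothetical and are not drawn from the paper or the review.

```python
from itertools import product


def why_provenance(employees, departments, output_row):
    """Return the witnesses for output_row under the (hypothetical) query

        SELECT e.name, d.city
        FROM employees e JOIN departments d ON e.dept = d.dept

    employees holds (name, dept) tuples; departments holds (dept, city) tuples.
    Each witness is a pair of input tuples that, joined and projected,
    yields output_row.
    """
    witnesses = []
    for emp, dep in product(employees, departments):
        joins = emp[1] == dep[0]                 # join condition
        projected = (emp[0], dep[1])             # SELECT list
        if joins and projected == output_row:
            witnesses.append((emp, dep))
    return witnesses


employees = [("ann", "eng"), ("bob", "ops"), ("ann", "ops")]
departments = [("eng", "Oslo"), ("ops", "Oslo"), ("ops", "Lima")]

for witness in why_provenance(employees, departments, ("ann", "Oslo")):
    print(witness)
```

This only works because the query is a small, fixed composition of relational operators over tuples, which is exactly the assumption the quoted passage says does not hold for arbitrary distributed system state.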


The Value Of Power BI Dataflows

Matt Allington gets to the core benefits of Power BI Dataflows:

Dataflows are:

  1. An online service provided by Microsoft as part of Power BI (software as a service, or SaaS).

  2. In effect dataflows are an online data collection and storage tool.

    • Collection:  It uses Power Query to connect to the data at the source and transform that data as needed.
      • You will need to be able to access the data either through a cloud service (such as Dynamics 365) or to your PC/Network via a gateway.
      • You can also use Power Query to write queries from scratch, such as my Power BI calendar table.
    • Storage:  Dataflows then store that data in a table in the cloud so it can be used directly inside PowerBI.com, but more importantly (from my view) directly from Power BI Desktop.
  3. Dataflows leverage the Power Query skills you have learnt (or are learning) using other tools (like Power BI Desktop, Power Query for Excel) allowing you to reuse those same skills in this online tool.

  4. Tables that are created as a result of the dataflow are stored in an Azure Data Lake.

    • If you don’t know what that is, don’t worry – I don’t understand it either.  The point is it doesn’t matter because it is all done automatically for you by the tool.
  5. Dataflows include the concept of the common data service (CDS) or common data model directly in the tool and you don’t have to know what it is, nor care.

    • If you don’t know what that is, don’t worry – it doesn’t matter now/yet.

    • This will become very important in the future as it will make the process of getting data out of complex databases (such as MS Dynamics 365) much easier.

Click through for more detail as well as some good uses for Dataflows.


Using Snippets In SSMS

Eduardo Pivaral shows us how to use snippets in SQL Server Management Studio:

If you work with SQL Server on a daily basis, it is very likely you have a lot of custom scripts you have to execute frequently. Maybe you have stored them in a folder and open them manually as you need them, or have saved them in a solution or project file; maybe you execute a custom .bat or PowerShell file to load them when you open SSMS…

Every method has its pros and cons, and in this post, I will show you a new method to load your custom scripts into any open query window in SSMS via Snippets.

Click through for more details, including an example.  Snippets are a good tool implemented adequately in SSMS.  A few third-party extensions make working with snippets better and really valuable (until you’re stuck on a machine without your snippets).


Removing The Azure Module

Max Trinidad has built a function to remove older versions of the Azure module:

As you probably know by now, the “Azure RM” modules have been renamed to the “Az” module. Microsoft wants you to start using this module moving forward. Currently, this new release is on version 0.5.0, and you’ll need to remove any previous module(s) installed. Information about Azure PowerShell can be found on the following link.

Now, there’s always been a tedious task when manually removing module dependencies, and there’s no exception with the “Az” module.  So, we can all take advantage of PowerShell and create a script to work around this limitation.

And below are a few options.

Max provides a couple of other options as well.


A Compendium Of Bad (Or Misleading) Performance Tips

Grant Fritchey responds to a long list of performance tips of greater or (mostly) lesser value:

Index the predicates in JOIN, WHERE, ORDER BY and GROUP BY clauses

What about the HAVING clause? Does the column order matter? Should we put a single column or multi-column index? INCLUDE statements? What kind of index, clustered, non-clustered, columnstore, XML, spatial? This piece of advice is benign but so non-specific it’s almost useless. Let me summarize: Indexes can be good.

Do not use sp_* naming convention

So, this one is true because it will add a VERY small amount of overhead as SQL Server searches the master database first for your object. However, for most of us, most of the time, this is so far down the list of worries about our database as to effectively vanish from sight.

There’s a pretty long list of things here, most of which Grant considers either incomplete, irrelevant, or sometimes flat-out wrong.
