Press "Enter" to skip to content

Curated SQL Posts

Conjoint Analysis In R

Abhijit Telang introduces the concept of conjoint analysis and shows how you can implement this in R:

We will need to typically transform the problem of utility modeling from its intangible, abstract form to something that is measurable. That is, we wish to assign a numeric value to the perceived utility by the consumer, and we want to measure that accurately and precisely (as much as possible).

This is where survey design comes in, where, as a market researcher, we must design inputs (in the form of questionnaires) to have respondents do the hard work of transforming their qualitative, habitual, perceptual opinions into simplified, summarized aggregate values which are expressed either as a numeric value or on a rank scale.

I tend to shy away from this kind of analysis because it runs a huge risk of trying to turn ordinal utility rankings into cardinal functions.
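To make the survey-to-number step concrete, here is a minimal sketch (mine, not from the post, with made-up data) of how part-worth utilities are often estimated in R: rate a set of product profiles, then fit a linear model whose coefficients approximate the utility of each attribute level.

# hypothetical product profiles with two attributes; the ratings are stand-ins
# for survey responses on a 1-10 scale
profiles <- expand.grid(
  brand = c("A", "B", "C"),
  price = c("low", "medium", "high")
)
set.seed(42)
profiles$rating <- sample(1:10, nrow(profiles), replace = TRUE)

# the model coefficients act as estimated part-worth utilities per attribute level
fit <- lm(rating ~ brand + price, data = profiles)
summary(fit)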


Tips For Using PolyBase With Cloudera QuickStart VM

I have a post on using Cloudera’s QuickStart VM with PolyBase:

Here’s something which tripped me up a little bit while connecting to Cloudera using SQL Server. The data node name, instead of being quickstart.cloudera like the host name, is actually localhost. You can change this in /etc/cloudera-scm-agent/config.ini.

Because PolyBase needs to have direct access to the data nodes, having a node called localhost is a bit of a drag.

I’m used to the Hortonworks Data Platform, so this is a quick compendium of things I noticed were different.
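For reference, the change happens in a short INI file. The exact key below is an assumption on my part rather than something confirmed in the post, so check the file on your VM, but the edit looks something like this:

# /etc/cloudera-scm-agent/config.ini (excerpt)
[General]
# assumed key: listening_hostname makes the agent report the real host name
# instead of localhost; restart the cloudera-scm-agent service after editing
listening_hostname=quickstart.cloudera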


Creating Power Query Custom Connectors

Eugene Meidinger shows how easy it can be to create a custom Power Query connector:

When I heard about custom data connectors for Power Query, I had assumed there would be a lot of work involved. While there is definitely quite a bit of work in implementing advanced features like query folding, creating your very first connector is simple.

So, first you need Visual Studio installed and the Power Query SDK installed as well. Once you do that, you will see Power Query as an option when creating a new project. Visual Studio will also have support for .pq or Power Query files.

Click through for an example of the process in action.
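To give a sense of how little code a first connector needs, here is a rough hello-world .pq sketch along the lines of the SDK templates (the names are placeholders, and this is not Eugene's example):

// HelloWorld.pq -- a minimal custom connector definition
section HelloWorld;

[DataSource.Kind = "HelloWorld", Publish = "HelloWorld.Publish"]
shared HelloWorld.Contents = () => "Hello from a custom connector";

// data source kind definition: no credentials required
HelloWorld = [
    Authentication = [ Anonymous = [] ]
];

// controls how the connector appears in the Get Data dialog
HelloWorld.Publish = [
    Beta = true,
    Category = "Other",
    ButtonText = { "HelloWorld", "A minimal custom connector" }
];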


Disable Lightweight Pooling

Randolph West explains why enabling lightweight pooling in SQL Server is almost always a bad idea:

When can I enable lightweight pooling then?
Don’t. But if you must, these are the conditions under which Microsoft suggests it may be useful:
– Large multi-processor servers are in use.
– All servers are running at or near maximum capacity.
– A lot of context switching occurs (greater than 20,000 per second).

We can measure context switching with the Context Switches/sec performance counter in Performance Monitor on Windows, so the last two items on this list can be monitored.

I’m sure there were a few customers who benefited from this, but I’ve neither seen nor heard of a case where it did actually help.
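If you want to verify the setting (or switch it back off), lightweight pooling is an advanced sp_configure option and needs a service restart to take effect; something along these lines does the job:

-- check the current value and make sure lightweight pooling (fiber mode) stays off
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'lightweight pooling';       -- run_value of 0 means it is disabled
EXEC sp_configure 'lightweight pooling', 0;
RECONFIGURE;                                   -- takes effect after the next service restart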


Displaying Filters In Power BI Tooltips

Marco Russo shows how you can create a filter dump measure to show all active filters as part of your tooltip:

The Tooltips can display a string with multiple lines. This is useful for the DumpFilters measure that creates a new line for every column with a filter. You might wonder why the DumpFilters measure is required considering that Power BI can already display any filters and slicers affecting a visual. The reason is that the DumpFilters measure isolates the filters of a single cell and can show the effects of filters that are not visible in the standard visualization provided by Power BI.

This is interesting reading and a good way of showing users how they got to the current view of the data.
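Marco's DumpFilters measure walks every column in the model; as a much smaller sketch of the same idea (two hypothetical columns hard-coded, so not the real measure), something like this builds one line of tooltip text per filtered column:

DumpFilters Sketch =
// one block per column of interest: emit "Table[Column] = values" when it is filtered
VAR ColorFilter =
    IF (
        ISFILTERED ( 'Product'[Color] ),
        "Product[Color] = "
            & CONCATENATEX ( FILTERS ( 'Product'[Color] ), 'Product'[Color], ", " )
            & UNICHAR ( 10 )
    )
VAR YearFilter =
    IF (
        ISFILTERED ( 'Date'[Calendar Year] ),
        "Date[Calendar Year] = "
            & CONCATENATEX ( FILTERS ( 'Date'[Calendar Year] ), 'Date'[Calendar Year], ", " )
            & UNICHAR ( 10 )
    )
RETURN
    ColorFilter & YearFilter

Drop a measure like this into the visual's Tooltips bucket and it appears alongside the other tooltip values.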


Formatting PowerShell Outputs

Jeffery Hicks shows us how we can format PowerShell output through XML files:

My PSScriptTools module (which you can install from the PowerShell Gallery) now includes a command called New-PSFormatXML. The command is designed to analyze an object and by default create a table view of all properties, although you can specify which properties to include. The format.ps1xml file will autosize the table but you can remove the directive and use the widths which are best guesses. Expect some trial and error when defining a new view.

Read on for a couple demonstrations.
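As a rough usage sketch (the parameter names here are from memory, so check Get-Help New-PSFormatXML before relying on them), the workflow is to point the command at a sample object, write out the format.ps1xml file, and then load it into the session:

# generate a table view definition from a representative object
Get-Process | Select-Object -First 1 |
    New-PSFormatXML -Path .\process.format.ps1xml -Properties Name, Id, WorkingSet

# load the new view; PrependPath lets it take precedence over the built-in Process view
Update-FormatData -PrependPath .\process.format.ps1xml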


Bayesian Modeling Of Hardware Failure Rates

Sean Owen shows how you can use Bayesian statistical approaches with Spark Streaming, using the example of hard drive failure rates:

This data doesn’t arrive all at once, in reality. It arrives in a stream, and so it’s natural to run these kind of queries continuously. This is simple with Apache Spark’s Structured Streaming, and proceeds almost identically.

Of course, on the first day this streaming analysis is rolled out, it starts from nothing. Even after two quarters of data here, there’s still significant uncertainty about failure rates, because failures are rare.

An organization that’s transitioning this kind of offline data science to an online streaming context probably does have plenty of historical data. This is just the kind of prior belief about failure rates that can be injected as a prior distribution on failure rates!

Bayesian approaches work really well with streaming data if you think of the streams as sampling events used to update your prior to a new posterior distribution.
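As a minimal, non-Spark sketch of that update (made-up counts, with a Beta prior standing in for the historical failure-rate belief):

from scipy import stats

# hypothetical prior from historical data: failure rate believed to be around 2%
prior_alpha, prior_beta = 2.0, 98.0

# made-up streaming counts so far: observed failures vs. observations without a failure
failures, survivals = 3, 1200

# conjugate Beta-Binomial update: the posterior is just another Beta distribution
posterior = stats.beta(prior_alpha + failures, prior_beta + survivals)
print(posterior.mean())             # updated point estimate of the failure rate
print(posterior.interval(0.95))     # 95% credible interval, still wide while failures are rare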


Handling Definitional Changes In Predictive Variables

Vincent Granville explains how you can blend two different definitions of a variable of interest together:

The reason scores can become meaningless over time is that data evolves. New features (variables) are added that were not available before, or the definition of a metric suddenly changes (for instance, the way income is measured), resulting in new data that is not compatible with prior data, and in faulty scores. Also, when external data is gathered across multiple sources, each source may compute it differently, resulting in incompatibilities: for instance, when comparing individual credit scores from two people who are customers at two different banks, each bank computes the base metrics used to build the score (income, recency, net worth, and so on) in a different way. Sometimes the issue is caused by missing data, especially when users with missing data are very different from those with full data attached to them.

Click through for a description of the approach and links showing how it works in practice.
