
Month: December 2016

Thinking About Backups

Rob Farley has a set of questions you should ask yourself regarding your backups:

Does your disaster testing include a situation where a well-meaning person has taken an extra backup, potentially spoiling differential or log backups?

Does your disaster testing include random scenarios where your team needs to figure out what’s going on and what needs to happen to get everything back?

Something which might be helpful would be to catalog the reason each time you restore a particular backup (or somebody asks you for a restore that you can’t perform), and then build a plan to handle that scenario in the future.
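
One concrete defense against Rob’s “extra backup” scenario is to make any ad hoc backup copy-only, so it doesn’t disturb the existing backup chain. A minimal T-SQL sketch, with the database name and file path as placeholders:

    -- A COPY_ONLY full backup does not reset the differential base
    -- (and a COPY_ONLY log backup does not truncate the log), so an
    -- ad hoc backup taken this way will not spoil differential or log backups.
    -- Database name and file path below are hypothetical.
    BACKUP DATABASE YourDatabase
    TO DISK = N'D:\Backups\YourDatabase_adhoc.bak'
    WITH COPY_ONLY, CHECKSUM, COMPRESSION;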


Bandit Algorithms

Tanner Thompson describes the use of a multi-armed bandit algorithm to drive conversions:

The functional idea behind a bandit algorithm is that you make an informed decision every time you assign a visitor to a test arm. Several bandit-type algorithms have been proved to be mathematically optimal; that is, they obtain the maximum future revenue given the data they have at any given point. Gittins indexing is perhaps the foremost of these algorithms. However, the trade-off of these methods is that they tend to be very computationally intensive.

This article doesn’t show any code, but it is useful for thinking about the problem.
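
To make the explore-versus-exploit tradeoff concrete, here is a minimal epsilon-greedy sketch in T-SQL — a far simpler strategy than the Gittins indexing the article mentions, with all table and column names invented for illustration:

    -- Hypothetical table tracking assignments and conversions per test arm.
    CREATE TABLE dbo.BanditArm
    (
        ArmId INT PRIMARY KEY,
        Assignments INT NOT NULL DEFAULT 0,
        Conversions INT NOT NULL DEFAULT 0
    );

    -- Epsilon-greedy: explore a random arm 10% of the time; otherwise
    -- exploit the arm with the best observed conversion rate so far.
    DECLARE @Epsilon FLOAT = 0.1;
    DECLARE @ChosenArm INT;

    IF RAND() < @Epsilon
    BEGIN
        -- Explore: pick any arm uniformly at random.
        SELECT TOP (1) @ChosenArm = ArmId
        FROM dbo.BanditArm
        ORDER BY NEWID();
    END
    ELSE
    BEGIN
        -- Exploit: best observed conversion rate, untried arms first.
        SELECT TOP (1) @ChosenArm = ArmId
        FROM dbo.BanditArm
        ORDER BY CASE WHEN Assignments = 0 THEN 1.0
                      ELSE 1.0 * Conversions / Assignments END DESC;
    END;

    -- Record the assignment; increment Conversions later if the visitor converts.
    UPDATE dbo.BanditArm SET Assignments += 1 WHERE ArmId = @ChosenArm;

Epsilon-greedy gives up some revenue relative to the mathematically optimal methods Tanner describes, but it costs almost nothing to compute per visitor, which is exactly the trade-off the article is weighing.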


Data Science Languages

Alessandro Piva provides preliminary metrics on language usage among self-described data scientists:

Programming is one of the five main competence areas at the base of the skill set for a Data Scientist, even if it is not the most relevant in terms of expertise (see What is the right mix of competences for Data Scientists?). Considering the results of the survey, which has involved more than 200 Data Scientists worldwide to date, there isn’t a prevailing choice among the programming languages used during data science activities. However, the choice appears to be confined mainly to a limited set of alternatives: almost 96% of respondents affirm that they use at least one of R, SQL, or Python.

These results don’t surprise me much.  R has slightly more traction than Python, but the percentage of people using both is likely to increase.  SQL, meanwhile, is vital for getting data, and as we’re seeing in the Hadoop space, as data platform products get more mature, they tend to gravitate toward a SQL or SQL-like language.  Cf. Hive, Spark SQL, Phoenix, etc.


Interactive Decision Trees

Longhow Lam describes the interactive decision tree in Microsoft R Server 9.0:

Despite all the more modern machine learning algorithms, a good old single decision tree can still be useful. Moreover, in a business analytics context they can still keep up in predictive power. In the last few months I have created different predictive response and churn models. I usually just try different learners: logistic regression models, single trees, boosted trees, several neural nets, random forests. In my experience a single decision tree is usually ‘not bad’, often with only slightly less predictive power than the fancier algorithms.

An important thing in analytics is that you can ‘sell’ your predictive model to the business. A single decision tree is a good way to do just that, and with an interactive decision tree (created by Microsoft R) this becomes even easier.

I’d like the labels in Longhow’s tree to be a little clearer, but I do like this from the perspective of giving end users something to experience.


Power BI Drillthrough

Ginger Grant explains how to create and use hierarchies in Power BI:

Finding where to create hierarchies is the hardest part of creating them in Power BI, especially if one has ever created hierarchies in Excel Power Pivot, as they are not in the same place. Hierarchies are not in the Relationships data view; instead, they are found in the Report view. Right-clicking on the ellipsis next to any field in a table displays a menu, and the second item on the menu is New hierarchy. Hierarchies can also be created by clicking and dragging a field on top of another field. Once the hierarchy has been created, to add another field to it, drag a new value on top of the value with the hierarchy icon. If the value is not added to the location you want, click on the ellipsis next to the field name and move the field up or down as you wish.

Ginger also shows how to create drillthrough reports once you have hierarchies in place.


SQL Order Of Operations

Lukas Eder explains order of operations in a SQL query:

If you’re not a frequent SQL writer, the syntax can indeed be confusing. Especially GROUP BY and aggregations “infect” the rest of the entire SELECT clause, and things get really weird. When confronted with this weirdness, we have two options:

  • Get mad and scream at the SQL language designers
  • Accept our fate, close our eyes, forget about the syntax, and remember the logical operations order

I generally recommend the latter, because then things start making a lot more sense, including the beautiful cumulative daily revenue calculation below, which nests the daily revenue (the SUM(amount) aggregate function) inside of the cumulative revenue (the SUM(...) OVER (...) window function).

Lukas explains things from an Oracle perspective, so not all of this matches T-SQL, but it’s close enough for comparison.
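
The cumulative revenue query Lukas refers to didn’t survive this excerpt, but the pattern is easy to reconstruct. A T-SQL approximation, with the payments table and its columns assumed:

    -- Daily revenue comes from the GROUP BY aggregate SUM(amount); the
    -- window function SUM(...) OVER (...) then accumulates those daily
    -- sums, which works because window functions are evaluated logically
    -- after GROUP BY. Table and column names are assumptions.
    SELECT
        CAST(payment_date AS DATE) AS payment_day,
        SUM(amount) AS daily_revenue,
        SUM(SUM(amount)) OVER
        (
            ORDER BY CAST(payment_date AS DATE)
            ROWS UNBOUNDED PRECEDING
        ) AS cumulative_revenue
    FROM dbo.payments
    GROUP BY CAST(payment_date AS DATE)
    ORDER BY payment_day;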


NOLOCK, No Problem?

Arun Sirpal explains that NOLOCK not only takes locks, but also lets you read invalid data:

A Sch-S (schema stability) lock is taken.  This is a lightweight lock; the only lock that can conflict with it is a Sch-M (schema modification) lock. This means that a NOLOCK query can actually block against, for example, an ALTER TABLE command.

I would lean heavily toward turning on Read Committed Snapshot Isolation instead of using NOLOCK in most environments.  It’s something you’d need to test, but it does come with fewer bad ramifications.
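
For reference, RCSI is a single database-level switch. A sketch, with the database name as a placeholder:

    -- With READ_COMMITTED_SNAPSHOT on, read committed queries see the last
    -- committed version of each row from the version store instead of
    -- blocking on (or dirty-reading around) writers' locks.
    -- WITH ROLLBACK IMMEDIATE terminates other open transactions so the
    -- setting can be applied; run this in a maintenance window.
    ALTER DATABASE YourDatabase
    SET READ_COMMITTED_SNAPSHOT ON
    WITH ROLLBACK IMMEDIATE;

The main cost is row-versioning overhead in tempdb, which is part of why you’d want to test before flipping the switch.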


Parameterized Visibility In SSRS Reports

Monica Rathbun shows how to show or hide clusters of columns in Reporting Services reports:

Ever had users come to you and request another version of a report just to add another field and group data differently? Today was such a day for me. I really don’t like having multiple versions of the same report out there. So, I got a little fancy with the current version of the report, added a parameter, and then used expressions to group the data differently and hide columns. For those new to SSRS, I’ve embedded some links to MSDN to help you along the way.

This is an easy-to-follow, step-by-step guide.


Synchronous Or Asynchronous Stats Updates

SQL Scotsman explains synchronous versus asynchronous stats updates:

With that said, it would seem that asynchronous statistics are better suited to OLTP environments and synchronous statistics are better suited to OLAP environments.  As synchronous statistics are the default, though, people are reluctant to change this setting without good reason.  I’ve worked exclusively in OLTP environments over the last few years and have never seen asynchronous statistics rolled out as the default.  I have personally been bitten by synchronous statistics updates on large tables causing query timeouts, which I resolved by switching to asynchronous statistics updates.

This is an interesting, nuanced take on the issue.  My bias is toward asynchronous stats updates because I have been burned, but it’s instructive to read someone thinking through the implications of this seemingly simple choice.
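
For completeness, flipping to asynchronous updates is one database-level setting (the database name is a placeholder):

    -- Asynchronous auto-update only matters when automatic statistics
    -- updates are enabled in the first place.
    ALTER DATABASE YourDatabase SET AUTO_UPDATE_STATISTICS ON;

    -- Queries no longer wait on a triggered statistics update; they compile
    -- against the existing (stale) statistics while the update runs in the
    -- background.
    ALTER DATABASE YourDatabase SET AUTO_UPDATE_STATISTICS_ASYNC ON;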


Passing Parameters To SQL Queries Via Power BI

Chris Webb shows how to use the Value.NativeQuery() function to pass parameters to SQL Server queries:

It looks like, eventually, this will be the way that any type of ‘native’ query (i.e., a query that you write and give to Power Query, rather than a query that is generated for you) is run against any kind of data source – instead of the situation we have today, where different M functions are needed to run queries against different types of data source. I guess at some point the UI will be updated to use this function. I don’t think it’s ‘finished’ yet either, because it doesn’t work on Analysis Services data sources, although it may work with other relational data sources – I haven’t tested it on anything other than SQL Server and SSAS. There’s also a fourth parameter for Value.NativeQuery() that can be used to pass data-source-specific options, but I have no idea what these could be, and I don’t think there are any supported for SQL Server. It will be interesting to see how it develops over the next few releases.

It’s good to know that you can parameterize queries now.
