

Plus Equals Operator

Andy Mallon shows off the += operator in T-SQL:

This is logically equivalent to the first version of the code, but I find it makes for more readable code. It just looks cleaner.

For those of us who are lazy (er, looking to maximize efficiency), this could save a whole lot of keystrokes.

This is true, but if you’re on SQL Server 2012 or later, check out CONCAT for concatenation, as it handles NULL values more elegantly.
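
For a quick illustration of both points, here's a minimal T-SQL sketch (not from Andy's post; the variable names are mine):

    DECLARE @greeting varchar(100) = 'Hello';

    -- Compound assignment: shorthand for SET @greeting = @greeting + ', world';
    SET @greeting += ', world';

    -- With the + operator, concatenating a NULL yields NULL (under default settings),
    -- while CONCAT (SQL Server 2012+) treats NULL as an empty string.
    DECLARE @maybeNull varchar(100) = NULL;
    SELECT @greeting + @maybeNull        AS PlusResult,    -- NULL
           CONCAT(@greeting, @maybeNull) AS ConcatResult;  -- 'Hello, world'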


Custom Visualizations

Ginger Grant is good with using custom visualizations in Power BI:

Now since Power BI Custom Visualizations are not provided by Microsoft, they feel compelled to give you a warning message letting users know this. Here is the message box you get in Power BI Desktop when using a custom visualization. Notice that I clicked on the check box next to the text "Don’t show this dialog again." As words mean things, checking this box means the warning message never appears again. When you import the visualization into Power BI, no warning messages. Now I can use and propose custom visualizations to clients because they really are neat, and now they contain no warnings. Thanks so much to the Power BI Product team for fixing this major issue.

This is good news.


Interactive Heatmaps

Sahir Bhatnagar uses heatmaply to generate heatmaps:

In every statistical analysis, the first thing one should do is try and visualise the data before any modeling. In microarray studies, a common visualisation is a heatmap of gene expression data.

In this post I simulate some gene expression data and visualise it using the heatmaply package in R by Tal Galili. This package extends the plotly engine to heatmaps, allowing you to inspect certain values of the data matrix by hovering the mouse over a cell. You can also zoom into a region of the heatmap by drawing a rectangle over an area of your choice.

This went way past my rudimentary heatmap skills, so it’s nice to see what an advanced user can do.
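
As a rough sketch of the idea (not Sahir's code; the simulated matrix here is just toy data):

    # Simulate a small "gene expression" matrix and plot it interactively.
    # install.packages("heatmaply")  # if not already installed
    library(heatmaply)

    set.seed(42)
    expr <- matrix(rnorm(20 * 10), nrow = 20, ncol = 10,
                   dimnames = list(paste0("gene", 1:20), paste0("sample", 1:10)))

    # Hover over a cell to see its value; drag a rectangle to zoom in.
    heatmaply(expr)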


PowerShell ETL, Part 2

Max Trinidad has part 2 of his PowerShell ETL series:

If you notice, in the above cmdlet's where-clause I'm selecting the Column1 property instead of a reasonable label. In my scenario the data in the CSV file contains variable columns for its different data types, such as Info, Error, and System. So, it was easy to identify the total number of columns to be 15.

Now, using the “Import-Csv” cmdlet with the “-Header” parameter, you can define a list of columns when you build the $Logdata object. We create the $header variable with the column names separated by commas.

Keep an eye out for part 3.  In the meantime, check out part 1 if you haven’t already.
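
As a rough sketch of the technique (not Max's exact code; the file path and column names below are made up, and his file had 15 columns rather than five):

    # Define the column names up front, then apply them while importing a
    # headerless CSV, so later filters can use real property names.
    $header  = 'TimeStamp','Severity','Source','Message','Detail'
    $Logdata = Import-Csv -Path 'C:\Temp\AppLog.csv' -Header $header

    # Filter on a named property instead of Column1
    $errors = $Logdata | Where-Object { $_.Severity -eq 'Error' }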


New XQueryPlanPath Update

Richie Lee has an update of XQueryPlanPath:

I’ve made a few changes to the XQueryPlanPath project. The project parses query plans into XML and then uses XPath to find the value of one or more nodes. This could then be used in testing to verify that any changes made to a query retain a query plan that is considered optimal; if any changes break the test, you can verify whether the change has a sub-optimal effect on your query.

There was, however, one issue: query plans are like opinions; every SQL Server instance has one, and none of them think that theirs stinks. So running a test on a dev box will potentially produce a different query plan from that on the build server, or that of production, etc. This is broadly because of three reasons:

Check it out, especially if your XML parsing skills aren’t top-notch.
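
To get a feel for the underlying idea, here's plain T-SQL (not the project's code) that points an XPath-style expression at a cached plan's XML and pulls out a single value:

    -- Grab the first physical operator name from a cached query plan.
    WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
    SELECT TOP (1)
           qp.query_plan.value('(//RelOp/@PhysicalOp)[1]', 'nvarchar(128)') AS FirstPhysicalOp
    FROM sys.dm_exec_cached_plans AS cp
    CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
    WHERE qp.query_plan IS NOT NULL;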


Hadoop 3

Alex Woodie covers some upcoming changes with Hadoop version 3:

Hadoop 3, as it currently stands (which is subject to change), won’t look significantly different from Hadoop 2, Ajisaka said. Made generally available in the fall of 2013, Hadoop 2 was a very big deal for the open source big data platform, as it introduced the YARN scheduler, which effectively decoupled the MapReduce processing framework from HDFS, and paved the way for other processing frameworks, such as Apache Spark, to process data on Hadoop simultaneously. That has been hugely successful for the entire Hadoop ecosystem.

It appears the list of new features in Hadoop 3 is slightly less ambitious than the Hadoop 2 undertaking. According to Ajisaka’s presentation, in addition to support for erasure coding and bug fixes, Hadoop 3 currently calls for new features like:

  • shell script rewrite;
  • task-level native optimization;
  • the capability to derive heap size or MapReduce memory automatically;
  • elimination of old features;
  • and support for more than two NameNodes.

The big benefit to erasure coding is that you can potentially cut data usage requirements in half, so that can help in very large environments.  Alex also notes that the first non-beta version of Hadoop 3 is expected to release by the end of the year.
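
For a rough sense of the math behind that claim: with standard triple replication, six blocks of data occupy eighteen blocks of raw storage (3x); with a Reed-Solomon (6, 3) erasure coding scheme, those same six data blocks plus three parity blocks occupy nine (1.5x), roughly halving the footprint while still tolerating the loss of any three blocks.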


Screenshot Interview Questions

Brent Ozar has the latest in his interview series:

They didn’t give you parameter 26837 – I’m just giving you that so you can see an execution plan.

You don’t have to talk me through the query itself, or what you’d want to do to fix it. In fact, I want you to avoid that altogether.

Instead, tell me what things you need to know before you start tuning, and explain how you’re going to get them.

I think, based on the noise in the comments section, that this is a good question.  Good interview questions are separating in equilibrium (as opposed to pooling).  The question itself is straightforward, but people have such a tendency to jump the gun that they try to answer a question which isn’t being asked.  Then, among the people who do read the question, the sets of steps and processes they follow are interesting because of how much they differ.

Bonus question:  take your interview answer (“I would do X and Y and Z and then A and B and C and maybe D.”) and apply it to the last time you had this scenario come up.  How many of [A-DX-Z] did you actually do?


New SQL Server Connector Preview

Rebecca Zhang notes that there is a new public preview of the SQL Server Connector:

For those not familiar with the SQL Server Connector, it enables SQL Server to use Azure Key Vault as an Extensible Key Management (EKM) Provider for its SQL encryption keys. This means that you can use your own encryption keys and protect them in Azure Key Vault, a cloud-based external key management system which offers central key management, leverages hardware security modules (HSMs), and allows separation of management of keys and data, for additional security. This is available for the SQL encryption keys used in Transparent Data Encryption (TDE), Column Level Encryption (CLE), and Backup encryption.

When using these SQL encryption technologies, your data is encrypted with a symmetric key (called the database encryption key) stored in the database. Traditionally (without Azure Key Vault), a certificate that SQL Server manages would protect this data encryption key (DEK). With Azure Key Vault integration for SQL Server through the SQL Server Connector, you can protect the DEK with an asymmetric key that is stored in Azure Key Vault. This way, you can assume control over the key management, and have it be in a separate key management service outside of SQL Server.

Check it out, as it might be a solution to some key management issues.
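
For a rough sense of what the flow looks like in T-SQL, here's a sketch in which the DLL path, key names, and database name are placeholders, and the Azure Key Vault and credential setup is omitted:

    USE master;

    -- Register the connector as an EKM provider
    CREATE CRYPTOGRAPHIC PROVIDER AzureKeyVault_EKM
    FROM FILE = 'C:\EKM\Microsoft.AzureKeyVaultService.EKM.dll';

    -- Reference an asymmetric key that lives in the vault
    CREATE ASYMMETRIC KEY TdeAkvKey
    FROM PROVIDER AzureKeyVault_EKM
    WITH PROVIDER_KEY_NAME = 'ContosoTdeKey',
         CREATION_DISPOSITION = OPEN_EXISTING;

    -- Protect the database encryption key with the vault-held key
    USE MyDatabase;
    CREATE DATABASE ENCRYPTION KEY
    WITH ALGORITHM = AES_256
    ENCRYPTION BY SERVER ASYMMETRIC KEY TdeAkvKey;

    ALTER DATABASE MyDatabase SET ENCRYPTION ON;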


Data Frames

Saravanan Subramanian has an introduction to data frames in R:

The R data frame is a high-level data structure which is equivalent to a table in database systems.  It is highly useful for working with machine learning algorithms, and it’s very flexible and easy to use.

By the standard definition, data frames are “tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R’s modeling software.”

Data frames are a powerful abstraction and make R a lot easier for database professionals than for application developers, who are used to thinking iteratively and working with one object at a time.
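
A minimal sketch of that table-like feel, using a toy data frame of my own:

    # A data frame behaves much like a table: named columns, one row per record.
    patients <- data.frame(
      patient_id = 1:4,
      sex        = c("M", "F", "M", "F"),
      visits     = c(3, 1, 7, 2)
    )

    str(patients)                      # one variable per column
    patients[patients$sex == "M", ]    # filter rows, much like a WHERE clause
    mean(patients$visits)              # aggregate a column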


Warehouses Will Live On

Jesse Seymour argues that in-memory analysis solutions will not entirely replace data warehouses:

The big reason that dimensional modeling increases clarity is that the dimensional model seeks to flatten data as much as possible.  Let’s compare two examples.  Both of these examples are for a fictional health clinic.

The first example is that we want a report on how many male patients were treated with electric shock therapy by provider, grouped monthly and spanning a year-to-date range.

Those big Kimball-style warehouses do a great job of making it easier for people who are not database specialists to query data and get meaningful, consistent results to known business questions.  The trick to understanding data platforms is that they tend to be complements rather than substitutes:  introducing Spark-R in your environment does not replace your Kimball-style warehouse; it complements it by letting analysts find trends more easily.  Similarly, a Hadoop cluster potentially lets you complement an existing data warehouse in a few ways:  acting as a data aggregator (which allows you to push some ETL work off onto the cluster), a data collector (especially for information which is useful but doesn’t really fit in your conformed warehouse), and a data processor (particularly for those gigantic queries which are not time-sensitive).
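
To make the flattening concrete, a hypothetical star-schema query for the clinic example above might look something like this (all table and column names are invented):

    SELECT pr.ProviderName,
           d.CalendarMonth,
           COUNT(*) AS TreatmentCount
    FROM fact.Treatment AS t
        JOIN dim.Patient       AS pat ON pat.PatientKey      = t.PatientKey
        JOIN dim.Provider      AS pr  ON pr.ProviderKey      = t.ProviderKey
        JOIN dim.TreatmentType AS tt  ON tt.TreatmentTypeKey = t.TreatmentTypeKey
        JOIN dim.Date          AS d   ON d.DateKey           = t.TreatmentDateKey
    WHERE pat.Sex = 'M'
      AND tt.TreatmentTypeName = 'Electric shock therapy'
      AND d.CalendarYear = YEAR(GETDATE())   -- year to date
      AND d.FullDate <= GETDATE()
    GROUP BY pr.ProviderName, d.CalendarMonth
    ORDER BY pr.ProviderName, d.CalendarMonth;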
