
Curated SQL Posts

Polybase With Azure Blob Storage

I look at using Polybase to read data from Azure Blob Storage:

To this point, I have focused my Polybase series on interactions with on-premises Hadoop, as it’s the use case most apropos to me.  I want to start expanding that out to include other interaction mechanisms, and I’m going to start with one of the easiest:  Azure Blob Storage.

Ayman El-Ghazali has a great blog post on the topic, which he turned into a full-length talk.  As such, this post will fill in the gaps rather than start from scratch.  In today’s post, my intention is to retrieve data from Azure Blob Storage and get an idea of what’s happening.  From there, we’ll spend a couple more posts on Azure Blob Storage, looking a bit deeper into the process.  That said, my expectation going into this series is that much of what we do with Azure Blob Storage will mimic what we did with Hadoop, as there are no Polybase core concepts unique to Azure Blob Storage, at least none of which I am aware.

Spoilers:  I’m still not aware of any core concepts unique to Azure Blob Storage.
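
If you want to play along at home, the rough shape of the T-SQL is below.  Take it as a minimal sketch rather than the exact setup from the series:  the storage key, container and account names, folder, and column layout are all placeholders.

-- A minimal sketch; the secret, names, and columns below are placeholders.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword>';

CREATE DATABASE SCOPED CREDENTIAL AzureBlobCredential
WITH IDENTITY = 'user', SECRET = '<storage account access key>';

CREATE EXTERNAL DATA SOURCE AzureBlobStorage
WITH
(
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@myaccount.blob.core.windows.net',
    CREDENTIAL = AzureBlobCredential
);

CREATE EXTERNAL FILE FORMAT CsvFileFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', USE_TYPE_DEFAULT = TRUE)
);

-- Files under /data/ in the container back this table; query it like any local table.
CREATE EXTERNAL TABLE dbo.BlobData
(
    Id INT,
    SomeValue VARCHAR(50)
)
WITH
(
    LOCATION = '/data/',
    DATA_SOURCE = AzureBlobStorage,
    FILE_FORMAT = CsvFileFormat
);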


In-Memory OLTP For Reporting

Daniel Janek shows that memory-optimized tables aren’t just for OLTP scenarios:

I wouldn’t have thought that Hekaton could take my report query down from 30+ minutes to 3 seconds, but in the end it did. *Note that the source data is static and repopulated just twice a week. With that said, I didn’t bother looking into any limitations that “report style” queries may cause for OLTP operations. I’ll leave that to you.

With SQL Server 2016 (an important caveat), memory-optimized tables can work great for reporting scenarios.  The important factor is having enough RAM to store the data.
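
For reference, this is roughly what a durable memory-optimized table looks like on SQL Server 2016.  Consider it a sketch:  the filegroup, file path, schema, and bucket count are illustrative rather than anything from Daniel’s environment.

-- The database needs a memory-optimized filegroup before creating such tables (path is a placeholder).
ALTER DATABASE CURRENT ADD FILEGROUP imoltp_fg CONTAINS MEMORY_OPTIMIZED_DATA;
ALTER DATABASE CURRENT ADD FILE (NAME = 'imoltp_data', FILENAME = 'C:\Data\imoltp_data')
    TO FILEGROUP imoltp_fg;

-- SCHEMA_AND_DATA keeps the data durable across restarts;
-- size BUCKET_COUNT near the expected number of rows.
CREATE TABLE dbo.ReportFact
(
    FactId INT NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    ReportDate DATE NOT NULL,
    Amount DECIMAL(18, 2) NOT NULL,
    INDEX ix_ReportFact_ReportDate NONCLUSTERED (ReportDate)
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);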


Transactional Replication To Azure SQL Database

Jes Borland has a five-part series on replicating a series of databases in an Availability Group to Azure SQL Database.  Part 1 involves planning:

There are tasks you’ll need to take care of in SQL Server, the AG, and the SQL DB before you can begin.

This blog series assumes you already have an AG set up – it won’t go through the setup of that. It also assumes you have an Azure SQL server and a SQL Database created – it won’t go through that setup either.

Ideally, the publishers, distributor, and subscribers will all be the same version and edition of SQL Server. If not, you have to configure from the highest-version server, or you will get errors.

Part 2 prepares the replication distributor:

The first step in this process is to set up the remote distributor. As I mentioned in the first blog, you do not want your distribution database on one of the AG replicas. You need to set this up on a server that is not part of the AG.

Start by logging on to the distributor server – in this demo, SQL2014demo.
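
In rough T-SQL terms, the distributor setup boils down to the calls below.  This is a hedged sketch rather than Jes’s actual script:  the password, publisher name, and share path are placeholders, with SQL2014demo taken from the demo.

-- Run on the distributor server, which sits outside the AG.
EXEC sp_adddistributor
    @distributor = N'SQL2014demo',
    @password = N'<StrongDistributorPassword>';

EXEC sp_adddistributiondb
    @database = N'distribution',
    @security_mode = 1;

-- Register each AG replica (publisher) against this remote distributor.
EXEC sp_adddistpublisher
    @publisher = N'AGReplica1',
    @distribution_db = N'distribution',
    @working_directory = N'\\SQL2014demo\repldata';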

Stay tuned for the remainder of the series.


Introducing Query Store

Thomas LeBlanc takes a brief look at Query Store:

Here, you will see the four default reports that come with this option:

  1. Regressed Queries – shows query history and changes in statistics

  2. Overall Resource Consumption – history of resources used in the database

  3. Top Resource Consuming Queries – the top X queries using the most resources

  4. Tracked Queries – enables you to see multiple query plans for a T-SQL statement and compare the plans or force a plan

For DBAs, this is one of the biggest reasons to upgrade to 2016.
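
Turning Query Store on is simple enough.  A minimal sketch follows; the database name, settings, and ids are purely illustrative.

-- Enable Query Store on a SQL Server 2016 database and tune a few capture settings.
ALTER DATABASE MyDatabase SET QUERY_STORE = ON;

ALTER DATABASE MyDatabase SET QUERY_STORE
(
    OPERATION_MODE = READ_WRITE,
    MAX_STORAGE_SIZE_MB = 100,
    QUERY_CAPTURE_MODE = AUTO
);

-- Once the reports surface a regressed query, pin a known-good plan by its ids (stand-in values here).
EXEC sp_query_store_force_plan @query_id = 42, @plan_id = 7;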


Clustered Columnstore Index Updates

Koen Verbeeck discusses what updates do to clustered columnstore indexes:

Turns out the majority of the rows belonged to the second scenario. Whoops. The initial run took a little over 20 hours. Not exactly rocket speed. The problem was that for each period, a large number of rows in the clustered columnstore index (CCI) had to be updated, just to set the range of the interval. Updates in a CCI are expensive, as they are split into inserts and deletes. Doing so many updates resulted in a heavily fragmented CCI and with possibly too many rows in the delta storage (which is row storage).

Read the whole thing.  Koen links to a Niko Neugebauer post, which you should also read.  After that, read my warning on trickle loading.  The major querying benefits you get from clustered columnstore indexes are great, but they do come at a cost when you aren’t simply inserting new rows.
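
If you want to see that delete-and-insert behavior in your own CCI, the queries below are a starting point; the table and index names are placeholders.

-- Check row group state and deleted-row counts for a clustered columnstore index (SQL Server 2016).
SELECT rg.row_group_id, rg.state_desc, rg.total_rows, rg.deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats AS rg
WHERE rg.object_id = OBJECT_ID('dbo.FactTable');

-- REORGANIZE compresses delta-store row groups and removes logically deleted rows.
ALTER INDEX CCI_FactTable ON dbo.FactTable
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);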


vtreat

John Mount introduces vtreat, an R package for data preparation:

Our group is distributing a detailed write up of the theory and operation behind our R realization of a set of sound data preparation and cleaning procedures called vtreat here: arXiv:1611.09477 [stat.AP]. This is where you can find out what vtreat does, decide if it is appropriate for your problem, or even find a specification allowing the use of the techniques in non-R environments (such as Python/Pandas/scikit-learn, Spark, and many others).

We have submitted this article for formal publication, so it is our intent you can cite this article (as it stands) in scientific work as a pre-print, and later cite it from a formally refereed source.

Or alternately, below is the tl;dr (“too long; didn’t read”) form.

Read more about vtreat on the package page or the vtreat vignette.


Functional Programming In R

Bruno Rodrigues is working on a book that hits two of my favorite topics:

The book I’ve been working on these past months (you can read about it here, and read it for free here) is now available on Leanpub! You can grab a copy and read it on your ebook reader or on your computer, and what’s even better is that it is available for free (but you can also decide to buy it if you really like it). Here is the link on Leanpub.

In the book, I show you the basics of functional programming, unit testing and package development for the R programming language. The end goal is to make your data tidy in a reproducible way!

Looks like I have a book to add to my queue.


Show Hidden Cubes

Chris Webb explains how to show hidden Analysis Services cubes in Management Studio:

Unfortunately this doesn’t make any objects in the cube that are not visible, like measures or dimensions, visible again – it just makes the cube itself visible. However, if you’re working on the Calculations tab of the Cube Editor in SSDT it is possible to make all hidden objects visible as I show here.

Read on for the command and watch out for that caveat.


Building A Multi-Node Hadoop Cluster With Spark

Rao Swati has a step-by-step guide to setting up a multi-node cluster with Hadoop 2.7.3 and Spark 1.6.2:

Important Notes:

  1. start-dfs.sh will start the NameNode, SecondaryNameNode, and DataNode on the master and a DataNode on each slave node.
  2. start-yarn.sh will start the NodeManager and ResourceManager on the master node and a NodeManager on the slaves.
  3. Perform hadoop namenode -format only once; otherwise, you will get an incompatible cluster_id exception. To resolve this error, clear the DataNode’s temporary data location, i.e., remove the files present in the $HADOOP_HOME/dfs/name/data folder.
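
To make note 3 concrete, here is a shell sketch of the once-only format plus the startup sequence.  The data directory path comes from the note above; your hdfs-site.xml may point elsewhere.

# Format the NameNode exactly once, before the first start.
hadoop namenode -format

# Bring up HDFS and YARN from the master node.
start-dfs.sh
start-yarn.sh

# If a re-format produces the incompatible cluster_id error, clear each
# DataNode's data directory and restart the daemons.
rm -rf $HADOOP_HOME/dfs/name/data/*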

If you’d like to set up your own Hadoop cluster rather than using one of the big vendors (Hortonworks, Cloudera, MapR) or a PaaS solution like HDInsight or ElasticMapReduce, this will give you a head start.
