Remember chemistry class in high school or college? You might remember having to keep a lab notebook for your experiments. The purpose of this notebook was two-fold: first, so you could remember what you did and why you did each step; second, so others could repeat what you did. A well-done lab notebook has all you need to replicate an experiment, and independent replication is a huge part of what makes hard sciences “hard.”
Take that concept and apply it to statistical analysis of data, and you get the type of notebook I’m talking about here. You start with a data set, perform cleansing activities, potentially prune elements (e.g., getting rid of rows with missing values), calculate descriptive statistics, and apply models to the data set.
I didn’t realize just how useful notebooks were until I started using them regularly.
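As a rough illustration of the workflow described above (start with data, prune rows with missing values, calculate descriptive statistics), here is a minimal Python sketch using only the standard library. The measurements and variable names are made up for the example; a real notebook would more likely use R or a data frame library.

```python
import statistics

# Hypothetical raw measurements; None marks a missing value.
raw_values = [2.0, None, 4.0, 6.0]

# Prune step: drop rows with missing values.
cleaned = [value for value in raw_values if value is not None]

# Descriptive statistics on the cleaned data.
summary = {
    "count": len(cleaned),
    "mean": statistics.mean(cleaned),
    "median": statistics.median(cleaned),
    "stdev": statistics.stdev(cleaned),  # sample standard deviation
}
print(summary)
```

Keeping each step in its own cell, with a note on why you did it, is what makes the analysis repeatable by somebody else.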
I like to think of Pig as a high-level Map/Reduce commands pipeline. As a former SQL programmer, I find it quite intuitive, and at my organization our Hadoop jobs are still mostly developed in Pig.
Pig has a lot of qualities: it is stable, scales very well, and integrates natively with the Hive metastore HCatalog. By describing each step atomically, it minimizes conceptual bugs that you often find in complicated SQL code.
But sometimes, Pig has limitations that make it a poor fit for your needs.
Philippe includes a couple of examples in Pig, PySpark, and SparkSQL. Even if you aren’t familiar with Pig, this is a good article to help familiarize yourself with Spark.
Because we’ve added OR conditions into the mix, we’re forced to use the Nested Loop join, which loops over table B for every single row in A. That’s a lot of index scans and it comes with a hefty price tag.
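To make the cost difference concrete, here is a small Python sketch (not SQL Server's actual implementation) comparing a nested loop join on an OR predicate with a hash join on a single equality key. The table contents and column names are hypothetical.

```python
from collections import defaultdict

def nested_loop_join(left, right, predicate):
    """Compare every left row with every right row; return matches and comparison count."""
    matches, comparisons = [], 0
    for left_row in left:
        for right_row in right:
            comparisons += 1
            if predicate(left_row, right_row):
                matches.append((left_row, right_row))
    return matches, comparisons

def hash_join(left, right, key):
    """Equality join on one key: build a hash table on right, probe once per left row."""
    buckets = defaultdict(list)
    for right_row in right:
        buckets[right_row[key]].append(right_row)
    matches = []
    for left_row in left:
        for right_row in buckets.get(left_row[key], []):
            matches.append((left_row, right_row))
    return matches

# Hypothetical tables A and B with two candidate join columns.
table_a = [{"id": i, "alt_id": i + 100} for i in range(50)]
table_b = [{"id": i, "alt_id": i + 100} for i in range(0, 100, 2)]

# OR predicate: no single hash key applies, so every pair of rows is compared.
or_matches, or_comparisons = nested_loop_join(
    table_a,
    table_b,
    lambda a, b: a["id"] == b["id"] or a["alt_id"] == b["alt_id"],
)

# Plain equality predicate: one probe per row in A.
eq_matches = hash_join(table_a, table_b, "id")

print(len(or_matches), or_comparisons, len(eq_matches))
```

With 50 rows on each side, the nested loop performs 2,500 comparisons, and that number grows with the product of the two table sizes.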
Here’s an absolutely eye-wateringly beautiful pattern that I found on the Interwebs (though I forgot where) the other day.
This is an interesting use of INTERSECT. Check it out.
In such a case, if a developer were to implement Dijkstra’s algorithm to compute the shortest path within the database using T-SQL, then they could use approaches like the one at Hans Oslov’s blog. Hans offers a clever implementation using recursive CTEs, which functionally does the job well. This is a fairly complex problem for the T-SQL language, and Hans’ implementation does a great job of modelling a graph data structure in T-SQL. However, given that T-SQL is mostly a transaction and query processing language, this implementation isn’t very performant.
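For comparison, Dijkstra’s algorithm is fairly compact in a general-purpose language. The following is a minimal Python sketch using a binary heap; it is not Hans’ T-SQL implementation, and the small graph is invented for the example.

```python
import heapq

def dijkstra(graph, source):
    """Return the shortest distance from source to every reachable node.

    graph maps each node to a list of (neighbor, non-negative weight) pairs.
    """
    distances = {source: 0}
    heap = [(0, source)]
    while heap:
        distance, node = heapq.heappop(heap)
        # Skip stale heap entries that were superseded by a shorter path.
        if distance > distances.get(node, float("inf")):
            continue
        for neighbor, weight in graph.get(node, []):
            candidate = distance + weight
            if candidate < distances.get(neighbor, float("inf")):
                distances[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return distances

# Hypothetical directed graph with weighted edges.
graph = {
    "A": [("B", 4), ("C", 1)],
    "B": [("D", 3)],
    "C": [("B", 2), ("D", 7)],
    "D": [],
}
distances = dijkstra(graph, "A")
print(distances)
```

The priority queue and the in-place distance updates are the parts that are awkward to express in a set-based language.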
The important thing to remember is that these technologies tend to complement each other rather than supplant one another.
In brief, SQL DW is a fully managed data-warehouse-as-a-service that you can provision in minutes and scale up in seconds. With SQL DW, storage and compute scale independently. You can dynamically deploy, grow, shrink, and even pause compute, allowing for cost savings. Also, SQL DW uses the power and familiarity of T-SQL so you can integrate query results across relational data in your data warehouse and non-relational data in Azure blob storage or Hadoop using PolyBase. SQL DW offers an availability SLA of 99.9%, making it the only public cloud data warehouse service that offers an availability SLA to customers.
The pricing calculator now reflects GA prices.
Digging a little deeper on this one. I would really love to see an enhancement to Resource Governor. Not just any enhancement will do. I need it to be enhanced so it will also affect the reporting services engine and the integration services engine in addition to the database engine. I want to be able to use RG to prevent certain reports from overconsuming resources within the SSRS engine. Or for that matter, I want to make sure certain SSIS packages do not consume too much memory. If I can implement constraints on resources for these two engines it would be a huge improvement.
Check it out.
Unlike most SQL databases, which default to weaker isolation levels for performance reasons, VoltDB chooses to provide strong serializable isolation by default: the combination of serializability’s multi-object atomicity, and linearizability’s real-time constraints.
Serializability is the strongest of the four ANSI SQL isolation levels: transactions must appear to execute in some order, one at a time. It prohibits a number of consistency anomalies, including lost updates, dirty reads, fuzzy reads, and phantoms.
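To show what one of those anomalies looks like, here is a small deterministic Python sketch of a lost update. It does not model VoltDB or any real database engine; it simply replays two hypothetical read-modify-write transactions against one shared balance, first serially and then interleaved.

```python
def run_schedule(schedule, balance=100):
    """Replay (transaction, operation, amount) steps against one shared balance.

    A "read" copies the current balance into the transaction's local snapshot;
    a "write" stores that snapshot plus the amount back to the shared balance.
    """
    snapshots = {}
    for transaction, operation, amount in schedule:
        if operation == "read":
            snapshots[transaction] = balance
        elif operation == "write":
            balance = snapshots[transaction] + amount
    return balance

# One transaction at a time: both deposits are applied.
serial = [
    ("T1", "read", None), ("T1", "write", 10),
    ("T2", "read", None), ("T2", "write", 20),
]

# Both transactions read before either writes: T1's deposit is overwritten.
interleaved = [
    ("T1", "read", None), ("T2", "read", None),
    ("T1", "write", 10), ("T2", "write", 20),
]

serial_balance = run_schedule(serial)
interleaved_balance = run_schedule(interleaved)
print(serial_balance, interleaved_balance)
```

The serial schedule ends at 130, while the interleaved one ends at 120 because T2 overwrites T1's deposit. A serializable system must never produce that second result, since no one-at-a-time ordering of T1 and T2 yields 120.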
If you use VoltDB, it sounds like upgrading to 6.4 is a good idea.
An important facet of the provider is that it behaves like a file system. How many of us have deleted files from the command line? Have you also used a PowerShell one-liner to delete old files, like backups? If you have, you might be familiar with two parameters: -WhatIf and -Confirm. These two switch parameters are extremely helpful because they can keep you from cutting yourself with that sharp PowerShell knife.
With the July 2016 update, the SQL Server provider now supports the use of these two switches. While using them may not be a common situation, it’s good to know that they are there. After all, it could be handy if you wanted to clean up some items from the command line, like maybe a junk database. Now you can check what you’re going to do before you do it, and you get a confirmation prompt when you go for the actual delete.
I am a huge fan of the -WhatIf switch, so that gets a thumbs up from me.
One of the biggest frustrations that people find with SQL DW is that you need (or rather, needed) to use SSDT to connect to it. You couldn’t use SSMS. And let’s face it – while the ‘recommended’ approach may be to use SSDT for all database development, most people I come across tend to use SSMS.
But now with the July 2016 update of SSMS, you can finally connect to SQL DW using SQL Server Management Studio. Hurrah!
…except that it’s perhaps not quite that easy. There are a few gotchas to be conscious of, plus a couple of things that caused me frustrations perhaps more than I’d’ve liked.
Yes, it’s never quite that easy… Read the whole thing.
Well obviously something somewhere is different. Start by comparing everything on both servers and both databases down to… hang on, here, we’ll write a PowerShell script and then….
Wait, wait, wait!
You have the execution plans? Before we start digging through all the properties everywhere and comparing everything to everything, updating statistics 14 times, and all the rest, what if we look at the execution plans? They’re different, so we should start looking at scans & indexes & statistics &….
Sometimes, it’s the little things that matter.