The heart of his critique is this: data science is changing very fast, and any tool that you learn will eventually become obsolete.
This is absolutely true.
Every tool has a shelf life.
Every. single. one.
Moreover, it’s possible that tools will become obsolete more rapidly than in the past. We can’t be certain, but if the world has entered a period of rapid technological change, it seems plausible that toolset changes will become more frequent.
The thing I would tie this to is George Stigler’s work on the economics of information. There’s a cost of knowing, which the commenter notes, but there’s also a cost of search, even assuming you know where to look. Being effective in any role, be it data scientist or anything else, involves understanding the marginal benefit of pieces of information. This blog post gives you a concrete example of that in the realm of data science.
Choose the script that works for you. You can:
Use a simple Extended Events trace to get deadlock graphs via the sqlserver.xml_deadlock_report event
Use a Server Side SQL Trace to get deadlock graphs (for older versions of SQL Server, or people who like SQL Trace)
Use a (much more verbose) Extended Events trace to get errors, completed statements, and deadlock graphs. You only need something like this if the input buffer showing in the deadlock graph isn’t enough, and you need to collect the other statements involved in the transactions. You do this by matching the transaction id for statements to the xactid for each item in the Blocked Process Report. Warning, this can generate a lot of events and slow performance.
I’d default to script #1 and look at #3 in extreme scenarios.
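For reference, script #1 boils down to a very small Extended Events session. Here is a minimal sketch; the session name, file name, and sizing options are mine, not the linked script’s:

```sql
-- Minimal Extended Events session capturing deadlock graphs
-- (session and file names are illustrative)
CREATE EVENT SESSION [capture_deadlocks] ON SERVER
ADD EVENT sqlserver.xml_deadlock_report
ADD TARGET package0.event_file
    (SET filename = N'capture_deadlocks.xel', max_file_size = (10))
WITH (STARTUP_STATE = ON);
GO

ALTER EVENT SESSION [capture_deadlocks] ON SERVER
    STATE = START;
```

The event_file target keeps the graphs around across restarts, which the default system_health session only does for a limited window.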
To this point, I have focused my Polybase series on interactions with on-premises Hadoop, as it’s the use case most apropos to me. I want to start expanding that out to include other interaction mechanisms, and I’m going to start with one of the easiest: Azure Blob Storage.
Ayman El-Ghazali has a great blog post on the topic, which he turned into a full-length talk. As such, this post will fill in the gaps rather than start from scratch. In today’s post, my intention is to retrieve data from Azure Blob Storage and get an idea of what’s happening. From there, we’ll spend a couple more posts on Azure Blob Storage, looking a bit deeper into the process. That said, my expectation going into this series is that much of what we do with Azure Blob Storage will mimic what we did with Hadoop, as there are no Polybase core concepts unique to Azure Blob Storage, at least none of which I am aware.
Spoilers: I’m still not aware of any core concepts unique to Azure Blob Storage.
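The basic shape of a Polybase connection to Azure Blob Storage looks much like the Hadoop case. A rough sketch, where the credential, container, storage account, and table names are all placeholders:

```sql
-- All object names, the container, and the storage account are placeholders.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'blobuser',   -- identity string is arbitrary for blob storage
     SECRET = '<storage account access key>';

CREATE EXTERNAL DATA SOURCE AzureBlob
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

CREATE EXTERNAL TABLE dbo.BlobData
(
    Id INT,
    Payload NVARCHAR(100)
)
WITH (
    LOCATION = '/data/',
    DATA_SOURCE = AzureBlob,
    FILE_FORMAT = CsvFormat
);
```

Note that even for blob storage, the external data source type is HADOOP; that’s part of why so little here is unique to Azure Blob Storage.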
There is not much to say except that this supports pipeline input. Take a close look. PowerShell folks don’t read text; they just focus on the code. So, here it is.
Click through for this code.
I wouldn’t have thought that Hekaton could take my report query down from 30+ minutes to 3 seconds, but in the end it did. Note that the source data is static and repopulated just twice a week. With that said, I didn’t look into any limitations that “report style” queries may impose on OLTP operations. I’ll leave that to you.
With SQL Server 2016 (an important caveat), memory-optimized tables can work great for reporting scenarios. The important factor is having enough RAM to store the data.
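For context, a memory-optimized table declaration is a small change from a regular table. A hedged sketch (the table, columns, and bucket count are illustrative, and the database needs a MEMORY_OPTIMIZED_DATA filegroup first):

```sql
-- Illustrative memory-optimized table for static reporting data;
-- requires a MEMORY_OPTIMIZED_DATA filegroup in the database.
CREATE TABLE dbo.SalesReport
(
    SaleId INT NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    SaleDate DATE NOT NULL,
    Amount MONEY NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
```

SCHEMA_AND_DATA durability keeps the data across restarts, which matters even for a twice-weekly reload scenario.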
Jes Borland has a five-part series on replicating a series of databases in an Availability Group to Azure SQL Database. Part 1 involves planning:
There are tasks you’ll need to take care of in SQL Server, the AG, and the SQL DB before you can begin.
This blog series assumes you already have an AG set up – it won’t go through the setup of that. It also assumes you have an Azure SQL server and a SQL Database created – it won’t go through that setup either.
Ideally, the publishers, distributor, and subscribers will all be the same version and edition of SQL Server. If not, you have to configure from the highest-version server, or you will get errors.
The first step in this process is to set up the remote distributor. As I mentioned in the first blog, you do not want your distribution database on one of the AG replicas. You need to set this up on a server that is not part of the AG.
Start by logging on to the distributor server – in this demo, SQL2014demo.
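The distributor setup itself comes down to a couple of system stored procedure calls. A minimal sketch, run on the distributor; the password is a placeholder:

```sql
-- Run on the distributor (SQL2014demo in Jes's demo).
-- The password is a placeholder; it secures remote publisher connections.
EXEC sp_adddistributor
    @distributor = N'SQL2014demo',
    @password = N'StrongPasswordHere';

-- Create the distribution database on that server.
EXEC sp_adddistributiondb
    @database = N'distribution',
    @security_mode = 1;   -- Windows authentication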
Stay tuned for the remainder of the series.
Here, you will see the four default reports that come with this option.
Regressed Queries – shows query history and changes in statistics
Overall Resource Consumption – history of resources used in the database
Top Resource Consuming Queries – the top X queries using the most resources
Tracked Queries – enables you to see multiple query plans for a T-SQL statement and compare the plans or force a plan
For DBAs, this is one of the biggest reasons to upgrade to 2016.
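These reports all draw from Query Store, which you have to turn on per database. A minimal sketch of enabling it (the capture-mode choice here is mine):

```sql
-- Enable Query Store, the data source behind these four reports
ALTER DATABASE CURRENT SET QUERY_STORE = ON;
ALTER DATABASE CURRENT
    SET QUERY_STORE (OPERATION_MODE = READ_WRITE, QUERY_CAPTURE_MODE = AUTO);
```

AUTO capture mode skips insignificant queries, which keeps the store from filling up with one-off ad hoc statements.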
Turns out the majority of the rows belonged to the second scenario. Whoops. The initial run took a little over 20 hours. Not exactly rocket speed. The problem was that for each period, a large number of rows in the clustered columnstore index (CCI) had to be updated, just to set the range of the interval. Updates in a CCI are expensive, as they are split into deletes and inserts. Doing so many updates resulted in a heavily fragmented CCI, possibly with too many rows in the delta store (which is rowstore).
Read the whole thing. Koen links to a Niko Neugebauer post, which you should also read. After that, read my warning on trickle loading. The major querying benefits you get from clustered columnstore indexes are great, but they do come at a cost when you aren’t simply inserting new rows.
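If you do end up with a fragmented CCI after heavy updates, index maintenance is the cleanup path. A hedged sketch, with illustrative index and table names:

```sql
-- Clean up deleted rows and compress open delta rowgroups after heavy updates.
-- Index and table names are illustrative.
ALTER INDEX CCI_Fact ON dbo.FactTable
    REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);

-- Or, more aggressively, rebuild the whole index:
ALTER INDEX CCI_Fact ON dbo.FactTable REBUILD;
```

REORGANIZE is the online, lighter-weight option; REBUILD gives the cleanest result but is a heavier operation.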