Using Xgboost In Azure ML Studio

Koos van Strien wants to use the xgboost model in Azure ML Studio:

Because the high-level path of bringing trained R models from the local R environment towards the cloud Azure ML is almost identical to the Python one I showed two weeks ago, I use the same four steps to guide you through the process:

  1. Export the trained model

  2. Zip the exported files

  3. Upload to the Azure ML environment

  4. Embed in your Azure ML solution

Read the whole thing.

Integrating Spark With Hive

Rahul Kumar wants to write Scala code to access the Hive datastore:

Hello geeks, we have discussed how to start programming with Spark in Scala. In this blog, we will discuss how we can use Hive with Spark 2.0.

When you start to work with Hive, you need HiveContext (inherits SqlContext), core-site.xml,hdfs-site.xml, and hive-site.xml for Spark. In case you don’t configure hive-site.xml then the context automatically creates metastore_db in the current directory and creates warehousedirectory indicated by HiveConf(which defaults user/hive/warehouse).

Rahul has made his demo code available on GitHub.

Choosing A Data Platform

Lukas Eder discusses when to use a relational database versus some non-relational database:

This question obviously assumes that you’re starting out with an RDBMS, which is classically the database system that solves pretty much any problem decently enough not to be replaced easily. What does this mean? Simply put:

  • RDBMS have been around forever, so they have a huge advantage compared to “newcomers” in the market, who don’t have all the excellent tooling, community, support, maturity yet
  • E.F. Codd’s work may have been the single biggest influence on our whole industry. There has hardly been anything as revolutionary as the relational model ever since. It’s hard for an alternative database to be equally universal, i.e. they’re mostly solving niche problems

Having said so, sometimes you do have a niche problem. For instance a graph database problem. In fact, a graph is nothing fundamentally different from what you can represent in the relational model. It is easy to model a graph with a many-to-many relationship table.

If you want a checklist, here’s how I would approach this question (ceteris paribus and limiting myself to about 100 words):

  1. Are you dealing with streaming millions of rows per second, or streaming from tens of thousands of endpoints concurrently?  Kafka and the Hadoop streaming stack.
  2. Is your problem something that you’ve already solved with a relational database, and your solution works well enough?  Relational database.
  3. Are there multiple “paths” to get to interesting data?  Relational database.
  4. Shopping carts (write-heavy, focused on availability over consistency)?  Riak/Cassandra/Dynamo at large scale, else relational database.
  5. Type duplication?  Relational database.
  6. Petabytes of data being analyzed asynchronously?  Hadoop.
  7. Other data platforms if they fit specific niche requirements around data structure.

There’s a lot more to this discussion than a simple numbered list, but I think it’s reasonable to start with relational databases and move away if and only if there’s a compelling reason.

SQL Server Configuration Section On Azure VM

Jack Li diagnoses an issue in which the SQL Server Configuration section of an Azure Virtual Machine only appeared under certain circumstances:

If you created an SQL Server VM via azure portal, there will be a section called “SQL Server Configuration” which was introduced via blog “Introducing a simplified configuration experience for SQL Server in Azure Virtual Machines”. Here is a screenshot of that setting.  It allows you to configure various things like auto backup, patching or storage etc.

I got a customer who created a SQL VM via powershell.  But that VM doesn’t have the section “SQL Server Configuration”.   Using his powershell script, I was able to reproduce the behavior.  When I created via portal UI, I got the “SQL Server Configuration”.

Read on for the solution.

Bug Scripting Sequences

Daniel Janik notices a small bug in how Management Studio scripts out sequences:

Have you ever scripted an object from SQL Server Management Studio (SSMS)? It does a really good job. You get nice cleanly formatted scripts that start with USE statements to select the database. They even have some simple comments?

Have you ever written a Sequence? Turns out if you script one you’ll notice that Microsoft left you an extra surprise. Double USE statements. Does it matter much? No. Should they fix it? Yes. I noticed this behavior when sequences were first released and it still exists in the latest version of SSMS for SQL 2016 (13.0.1500.23) as of this posting.

Yeah, that’s a tiny bug, but I can see it being annoying.

Traversing Foreign Keys Using Biml

Kevin Feasel


Biml, ETL

Ben Weissman has a two-part series on loading a set of tables based on foreign key constraints.  Part 1 is linear loads:

All our previous posts were running data loads in parallel, ignoring potential foreign key constraints. But in real life scenarios, your datawarehouse may actually have tables refering to each other using such, meaning that it is crucial to create and populate them in the right order.

In this blog post, we’ll actually solve 2 issues at once: We’ll provide a list of tables, will then identify any tables that our listed tables rely on (recursively) and will then create and load them in the right order.

In this sample, we’ll use AdventureWorksDW2014 as our source and transfer the FactInternetSales-table as well as all tables it is connected to through foreign key constraints. Eventually, we will create all these tables including the constraints in a new database AdventureWorksDW2014_SalesOnly (sorting them so we get no foreign key violations) and eventually populate them with data.

Part 2 is parallel loads:

After the first excitment about how easy it actually was to take care of that topology, you might ask yourself: Why does it have to run linear? That takes way too long. And you’re right – and it doesn’t have to.

All we need to do is:

– Create a list of all the tables that we’ve already loaded (which will be empty at that point)
– Identify all tables that do not reference any other tables
– Load these tables, each followed by all tables that only reference this single table – recursively and add them to list of loaded tables
– Once that is done, load all tables that are referencing multiple tables where all required tables have been loaded before – and again, add them to the list
– Repeat this until no table is left to load (or for a maximum of 10 times in this example)
– If, for whichever reason, any tables are left, load them sequentially using the TopoSort function:

This is a very interesting way of using Biml to traverse the foreign key tree.  I’ve normally used recursive CTEs in T-SQL to do the same, but I’ll have to play around with this method.

Docker On Windows Server

Elton Stoneman walks us through how to run Docker on Windows Server 2016:

There are two Windows Base images on the Docker Hub – microsoft/nanoserver andmicrosoft/windowsservercore. We’ll be using an IIS image shortly, but you should start with Nano Server just to make sure all is well – it’s a 250MB download, compared to 4GB for Server Core.

docker pull microsoft/nanoserver 

Check the output and if all is well, you can run an interactive container, firing up PowerShell in a Nano Server container:

Docker will also run on Windows 10 Pro, Enterprise, or Education editions.  That’s sad news for people who upgraded for free to Home Edition.

Sort Spill To Level 15K

Paul White explains spill levels:

At this point, you might be wondering what combination of tiny memory grant and enormous data size could possibly result in a level 15,000 sort spill. Trying to sort the entire Internet in 1MB of memory? Possibly, but that is way too hard to demo. To be honest, I have no idea if such a genuinely high spill level is even possible in SQL Server. The goal here (a cheat, for sure) is to get SQL Server to report a level 15,000 spill.

The key ingredient is partitioning. Since SQL Server 2012, we have been allowed a (convenient) maximum of 15,000 partitions per object (support for 15,000 partitions is also available on 2008 SP2 and 2008 R2 SP1, but you have to enable it manually per database, and be aware of all the caveats).

Read the whole thing.

Context Switching

Ewald Cress has finally snapped:

You start with a blank sheet,
three nuts and a bolt,
a strong sense of fairness,
a large can of Jolt.

And you try to imagine,
as best as you’re able
a rulerless kingdom
that won’t grow unstable.

That’s what happens when you dig into internals for too long.

SSMS Query Shortcuts

Andrew Pruski shows where to set Management Studio query shortcuts:

Following on from my last post Changing connection colours in SSMS I thought I’d write another quick about this cool but also often unused feature in SSMS.

These shortcuts allow you to run pre-determined queries by assigning a hot key within SSMS. To do this in SSMS go to Tools > Options > Environment > Keyboard

I love query shortcuts.  I have three dedicated to different sp_whoisactive commands (one to get everything going on; one to get everything related to my username; and one to get everything going on plus query plans, which I don’t always use because of the additional overhead).


July 2018
« Jun