Levenshtein Distances

Peter Coates provides an extremely fast estimate of Levenshtein Distance:

If your application requires a precise LD value, this heuristic isn’t for you, but the estimates are typically within about 0.05 of the true distance, which is more than enough accuracy for such tasks as:

  • Confirming suspected near-duplication.

  • Estimating how much two documents vary.

  • Filtering through large numbers of documents to look for a near-match to some substantial block of text.

The estimation process is pretty interesting.  Worth a read.
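
For reference, here is the exact calculation that the heuristic approximates. This is a minimal Python sketch of the textbook dynamic-programming algorithm, not Coates's estimation method, and the function name is my own. Its cost grows with the product of the two string lengths, which is why an estimate is attractive for large documents.

```python
def levenshtein(a: str, b: str) -> int:
    """Exact Levenshtein distance using the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

The two-row layout keeps memory proportional to the shorter string, but every pair of characters is still compared once.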


Koos van Strien moves from Python to R to run an xgboost algorithm:

Note that the parameters of xgboost used here fall in three categories:

  • General parameters

    • nthread (number of threads used, here 8 = the number of cores in my laptop)
  • Booster parameters

    • max.depth (of tree)
    • eta
  • Learning task parameters

    • objective: type of learning task (softmax for multiclass classification)
    • num_class: needed for the “softmax” algorithm: how many classes to predict?
  • Command Line Parameters

    • nround: number of rounds for boosting

Read the whole thing.
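
For anyone coming from the Python side, the same grouping might look roughly like this. It is a sketch only: the parameter names are the ones the Python xgboost package uses, but every value below is a placeholder I made up, and the training call is left as a comment because it needs the package and a training matrix.

```python
# All values are hypothetical placeholders; only the parameter names are real.
params = {
    # General parameters
    "nthread": 8,
    # Booster parameters
    "max_depth": 6,  # written max.depth in the R excerpt
    "eta": 0.3,
    # Learning task parameters
    "objective": "multi:softmax",
    "num_class": 3,
}

# The number of boosting rounds is passed to the training call, not the dict.
num_boost_round = 50

# With xgboost installed and a DMatrix named dtrain (not shown here):
# import xgboost as xgb
# model = xgb.train(params, dtrain, num_boost_round=num_boost_round)
```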

TCP Chimney Offloading

Wayne Sheffield offers advice on TCP Chimney Offloading:

With all of these changes to the OS, which setting should we use for SQL Server? In general, for all of these operating systems, I recommend that TCP Chimney Offload be disabled – because you can see odd connectivity problems in any other state. Notice in the above quote that Microsoft says that this feature is best used for applications with long-lived connections that transfer large amounts of data – hopefully your OLTP database is performing lots of short-lived connections and they are not transferring large amounts of data (if they are, I can help you with that!).

Definitely worth a read.

HDP 2.5 Sandbox

Kevin Feasel



The Hortonworks Data Platform 2.5 sandbox is now available:


1) Download HDP Sandbox as a VM image (VMware and VirtualBox) or Docker.
2) Set up and start the VM image.
3) Try a Sandbox tutorial, check out the list of free tutorials, or jump directly into a Hello to HDP hands-on tutorial.
4) Need more help? Visit the Hortonworks Community Connection (HCC) and interact directly with the community and our development team.

It looks like they’ve bumped up the RAM requirements to 8 GB and have added new tutorials.

Specifying Columns In Entity Framework

Richie Rump shows how to limit the number of columns returned in an Entity Framework query:

This one’s a bit more tricky but let’s walk through it. We’re getting data from the Posts table where the Tags column equals “<sql-server>” and selecting every column from both the Posts and PostTags tables. We can tell because there are no specified properties in the Select. Even though this statement looks more complex it’s only three lines and looks somewhat like a SQL statement. But it’s really a LINQ (Language Integrated Query) statement, specifically a LINQ to Entities statement. This LINQ statement will be translated into this SQL statement:

Read the whole thing.

SSAS Timezone Conversions

Meagan Longoria notes that the last processed date is in UTC but the Properties page shows local time:

The datetime returned by this query is in UTC. My query returns 9/19/2016 7:43:03 PM.

If I go into the properties of my SSAS database, I can see this same info, but the timezone conversion has already been done for me (this server is in Central time zone).

I think that on net, that’s the best way to do it:  store everything in UTC and use the presentation layer to convert those to local times.
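
As a minimal sketch of that pattern in Python, using the timestamp from the quote. The function name is mine, and I'm assuming the server's Central time zone corresponds to the IANA zone America/Chicago.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def utc_to_central(utc_naive: datetime) -> datetime:
    """Label a naive timestamp as UTC, then convert it to US Central time."""
    try:
        central = ZoneInfo("America/Chicago")
    except ZoneInfoNotFoundError:
        # No time zone database available: fall back to a fixed UTC-5 offset,
        # which is only right while daylight saving time is in effect.
        central = timezone(timedelta(hours=-5), "CDT")
    return utc_naive.replace(tzinfo=timezone.utc).astimezone(central)


last_processed = datetime(2016, 9, 19, 19, 43, 3)  # stored value, in UTC
local = utc_to_central(last_processed)
print(local.strftime("%m/%d/%Y %I:%M:%S %p %Z"))
```

The stored value never changes; only the display step knows about the viewer's zone, so 7:43:03 PM UTC shows up as 2:43:03 PM Central.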

SQL Server 2014 SMO And TruncateData

Max Trinidad finds a version inconsistency in SMO:

Hmm! I just found out, while migrating from SQL Server 2005 to SQL Server 2014 (SP2 installed), that one of my PowerShell scripts (which I’ve been using for a long time) that uses SMO to truncate tables no longer works. When running it against a SQL Server 2014 database, I’m getting an error:

“..this property is not available on SQL Server 2014.”

To my surprise, I ran the same PowerShell script against SQL Server 2016 and it works fine.

That seems rather odd.  If this affects you, vote up his UserVoice item.

Changing Identity Start Value

Kenneth Fisher has a good post on what happens when you change the seed value of an identity column:

Well Paul told me this wasn’t the case. Now when Paul tells me something I believe him, but I also like to run tests. So I decided to use sys.fn_PhysLocCracker(%%physloc%%). %%physloc%% returns a varbinary that gives you the location of the row. When passed to sys.fn_PhysLocCracker(%%physloc%%) it returns the database file, page in the file, and slot number where the row can be found. So to start with I create an identity(1,1) and I run 20 inserts, one at a time, checking row locations each time. This is to confirm I’m right about this part.

Clicking through is worth it for the hypnotizing animated GIFs.

SSMS Connection Colors

Andrew Pruski shows how to change window bar colors within SQL Server Management Studio:

A simple but effective setting in SQL Server Management Studio is using custom colours to identify which server you are about to execute a query on. It’s simple to set up, but not everyone who uses SSMS is aware of it, so I thought I’d quickly run through the steps here.

This is a nice visual way of figuring out you’re in production before you run that truncate table script.

Azure SQL Data Warehouse Setup

Arun Sirpal configures a new instance of Azure SQL Data Warehouse:

The information shown here is the DSQL (Distributed SQL) plan. When you send a SQL query to SQL Data Warehouse, the Control node processes the query and converts the code to DSQL, then sends the command to run on each of the compute nodes.

The returned query plan depicts sequential SQL statements; when the query runs it may involve parallelized operations, so some of the sequential statements shown may run at the same time. More information can be found at the following URL: https://msdn.microsoft.com/en-us/library/mt631615.aspx.

Arun also looks at running a simple Power BI report off of Azure SQL Data Warehouse; click through for that.


September 2016