Press "Enter" to skip to content

Day: June 11, 2018

Understanding Binary Trees

Robert Maclean has a couple of posts on binary trees.  In the first post, he explains the basics of a binary tree:

Because a binary tree has some flexibility in it, a number of classifications have come up to provide a consistent way to discuss binary trees. Common classifications are:

  • Full binary tree: Each node in a binary tree can have zero, one, or two child nodes. In a full binary tree, each node can have only zero or two child nodes.
  • Perfect binary tree: This is a full binary tree with the additional condition that all leaf nodes (i.e. nodes with no children) are at the same level/depth.
  • Complete binary tree: A complete binary tree is one where every level except possibly the last is completely filled, and the nodes in the last level are as far left as possible.
  • Balanced binary tree: A balanced binary tree is a tree where the height of the tree is as small as possible.

Then, he looks at binary search trees:

So, why should we care about a BST? We should care because searching in one is really performant: each time you move down a level, you eliminate approximately 50% of the remaining nodes.

So, for example, if we wanted to find the item in our example with the key 66, we could start at the root (50) and move right. At that point, we have immediately eliminated 8 possible nodes. The next step is to move left from the node with value 70 (12 possible nodes eliminated in total). Next is to the right of the node with value 65, and then to 66, which is to the left of 67. So we found the node in 5 steps.

In Big O notation, this means we achieved performance close to O(log n). The worst case is O(n), which happens when the tree is unbalanced.

Binary search trees are an easy-to-understand, reasonably efficient model for searching data. Even when there are better options, this is an easy algorithm to implement and is often good enough to solve the problem.
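
To make that concrete, here is a minimal Python sketch (not Robert's code) of a binary search tree with insert and search; the keys are chosen so the lookup for 66 follows the same kind of path as the example above, though the original post's tree may differ:

class Node:
    """A single node in a binary search tree."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert a key, keeping smaller keys to the left and larger keys to the right."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def search(root, key):
    """Walk down the tree; each comparison eliminates one whole subtree."""
    steps = 0
    node = root
    while node is not None:
        steps += 1
        if key == node.key:
            return node, steps
        node = node.left if key < node.key else node.right
    return None, steps

# Build a small tree and look for 66.
root = None
for k in [50, 30, 70, 20, 40, 65, 80, 60, 67, 66]:
    root = insert(root, k)

found, steps = search(root, 66)
print(found.key, steps)  # 66 found in 5 steps: 50 -> 70 -> 65 -> 67 -> 66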


Pareto Efficiency And Mario Kart

The folks at Civis Analytics answer one of the more important questions in life:

Mario Kart was a staple of my childhood — my friends and I would spend hours after school as Mario, Luigi, and other characters from the Nintendo universe racing around cartoonish tracks and lobbing pixelated bananas at each other. One thing that always vexed our little group of would-be speedsters was the question of which character was best. Some people swore by zippy Yoshi, others argued that big, heavy Bowser was the best option. Back then there were only eight options to choose from; fast forward to the current iteration of the Mario Kart franchise and the question is even more complicated because you can select different karts and tires to go with your character. My Mario Kart reflexes aren’t what they used to be, but I am better at data science than I was as a fourth grader, so in this post I’ll use data to finally answer the question “Who is the best character in Mario Kart?”

This post also acts as a primer on Pareto Efficiency, an important concept in economics.
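
As a quick sketch of the concept (not the Civis analysis): a configuration is Pareto efficient if no other configuration is at least as good on every stat and strictly better on at least one. With made-up speed and acceleration numbers, finding the Pareto frontier might look like this:

# Hypothetical (speed, acceleration) pairs for a few character/kart combinations.
# These numbers are invented purely to illustrate the idea.
configs = {
    "A": (4.5, 2.0),
    "B": (3.0, 4.0),
    "C": (2.5, 3.5),   # dominated by B (worse on both stats)
    "D": (4.0, 3.0),
}

def is_dominated(point, others):
    """True if some other point is at least as good on every stat and strictly better on one."""
    return any(
        all(o >= p for o, p in zip(other, point)) and any(o > p for o, p in zip(other, point))
        for other in others
    )

pareto_front = {
    name: stats
    for name, stats in configs.items()
    if not is_dominated(stats, [s for n, s in configs.items() if n != name])
}
print(pareto_front)  # A, B, and D survive; C is dominated by B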


Flattening JSON Data With Databricks

Ivan Vazharov gives us a Databricks notebook to parse and flatten JSON using PySpark:

With Databricks you get:

  • An easy way to infer the JSON schema and avoid creating it manually
  • Subtle changes in the JSON schema won’t break things
  • The ability to explode nested lists into rows in a very easy way (see the Notebook below)
  • Speed!

Following is an example Databricks Notebook (Python) demonstrating the above claims. The JSON sample consists of an imaginary JSON result set, which contains a list of car models within a list of car vendors within a list of people. We want to flatten this result into a dataframe.

Click through for the notebook.
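
As a rough sketch of the pattern the notebook demonstrates (not Ivan's code; the file path and field names here are invented), inferring the schema and exploding the nested lists in PySpark might look like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

# Let Spark infer the JSON schema instead of declaring it by hand.
# The path and the people -> vendors -> models field names are hypothetical.
df = spark.read.option("multiLine", "true").json("/mnt/data/people.json")

# Each explode() turns one element of a nested array into its own row.
flat = (
    df.select(col("name"), explode(col("vendors")).alias("vendor"))
      .select(col("name"), col("vendor.vendorName"), explode(col("vendor.models")).alias("model"))
)

flat.show()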


Microsoft R Open 3.5.0 Released

David Smith announces that Microsoft R Open 3.5.0 is now available:

Microsoft R Open 3.5.0 is now available for download for Windows, Mac and Linux. This update includes the open-source R 3.5.0 engine, which is a major update with many new capabilities and improvements to R. In particular, it includes a major new framework for handling data in R, with some major behind-the-scenes performance and memory-use benefits (and with further improvements expected in the future).

Microsoft R Open 3.5.0 points to a fixed CRAN snapshot taken on June 1, 2018. This provides a reproducible experience when installing CRAN packages by default, but you can always change the default CRAN repository or use the built-in checkpoint package to access snapshots of packages from an earlier or later date.

It’s nice to see Microsoft keeping pace with R changes; they look like they’re averaging about 6-8 weeks from an R point release to an MRO release.


Smarter Indexes Based On Column Cardinality

Eric Blinn has a function which organizes columns in the missing index DMV by cardinality:

Bryan Rebok and Brent Ozar recently opened my eyes to something I didn’t know. When SQL Server recommends missing indexes to you, it puts the columns in the order in which they are found in the table. That’s it. I always thought there was more logic to it. But there isn’t. Upon reading this, I had a terrible realization that I’ve made a lot of awful indexes in my time. I owe the world an apology. I hope this post can serve as that apology.

I’ve written a function that accepts the equality column list from dm_db_missing_index_details as a parameter and spits those columns back out in order by their cardinality.  This won’t necessarily be the proper order for the columns in every index, but it is far more likely to be correct than the initial result from the DMV.

I’m amazed that the missing index DMV generates column names in such a simplistic manner.
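
The original is a T-SQL function, but the ordering logic itself is simple. Here is a toy Python sketch (with invented column names and distinct counts) of sorting equality columns from most to fewest distinct values:

# Hypothetical distinct-value counts per equality column; the real function
# would pull these from the table itself.
distinct_counts = {
    "StatusCode": 5,
    "CustomerID": 250_000,
    "RegionID": 40,
}

# Order as the missing index DMV returned the columns.
equality_columns = ["StatusCode", "CustomerID", "RegionID"]

# Most selective (highest-cardinality) columns first.
reordered = sorted(equality_columns, key=lambda c: distinct_counts[c], reverse=True)
print(reordered)  # ['CustomerID', 'RegionID', 'StatusCode']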


Figuring Out Azure Analysis Services Costs

Chris Webb explains that Azure Analysis Services might not be quite as expensive as you’d first think:

What does this mean for the cost of Azure Analysis Services? Basically, if you’re taking advantage of these features you won’t pay one of the monthly prices quoted on the pricing page linked to at the top of this post. Instead you may do things like:

  • Scale up for one hour every day when you need to process your SSAS database, just to get the extra memory and QPUs needed, then scale down when processing has finished
  • Scale out only on certain days, or certain times of day, to handle increased numbers of users
  • Pause your instance when you are sure that no-one needs to run queries

How do you then calculate the likely cost? For my Azure Analysis Services precon at SQLBits a few months ago I built an Excel workbook that shows how to go about this.
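
As a back-of-the-envelope sketch of that sort of calculation (the hourly rates below are invented, not actual Azure pricing), the idea is just to weight each tier's hourly rate by the hours you actually run it:

# Hypothetical hourly rates (not actual Azure Analysis Services pricing).
rate_small = 1.00    # tier used for normal querying, per hour
rate_large = 4.00    # tier used only while processing, per hour

hours_processing_per_day = 1      # scaled up for processing
hours_paused_per_day = 8          # paused overnight, no charge
hours_normal_per_day = 24 - hours_processing_per_day - hours_paused_per_day

daily_cost = (hours_processing_per_day * rate_large
              + hours_normal_per_day * rate_small)
monthly_cost = daily_cost * 30

print(f"~${monthly_cost:.2f}/month vs ${rate_large * 24 * 30:.2f} if left at the large tier all month")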

There are some good questions in the comments section, so check those out as well.


BCP And Multiple SQL Server Instances

Manoj Pandey investigates an interesting issue with BCP:

I observed one thing here with BCP (Bulk Copy Program) when you have 2 versions of SQL Server installed on your PC or server. I had SQL Server 2014 & 2016 installed on one of my DEV servers.
So even if you are executing the query from the SQL 2016 instance, it was inserting records into the SQL 2014 instance:

exec master..xp_cmdshell 'BCP AdventureWorks2014.Person.Address2 IN d:\PersonAddressByQuery.txt -T -c'

But even if you use the 2016 version of BCP, it was still inserting into the SQL 2014 instance.

Read on for the reason as well as how to specify which instance you want to use.
