Given a room of 23 random people, what are chances that two or more of them have the same birthday?
This problem is a little different from the earlier ones, where we actually knew what the probability in each situation was.
What are chances that two people do NOT share the same birthday? Let us exclude leap years for now..chances that two people do not share the same birthday is 364/365, since one person’s birthday is already a given. In a group of 23 people, there are 253 possible pairs (23*22)/2. So the chances of no two people sharing a birthday is 364/365 multiplied 253 times. The chances of two people sharing a birthday, then, per basics of probability, is 1 – this.
The funny thing for me is that I’ve had the Birthday problem explained three separate times using as a demo the 20-30 people in the classroom. In none of those three cases was there a match, so although I understand that it is correct and how it is correct, the 100% failure to replicate led a little nagging voice in the back of my mind to discount it.
As you can see from the animation, the algorithm is quite simple:
- First, we identify the user and ‘current’ song to start with (red line)
- Next, we identify the other users who have also listened to this song (green line)
- Then we find the other songs which those other users have also listened to (blue, dotted line)
- Finally, we direct the current user to the top songs from those other songs, prioritized by the number of times they were listened to (this is represented by the thick violet line.)
The algorithm above is quite simple, but as you will see it is quite effective in meeting our requirement. Now, let’s see how to actually implement this in SQL Server 2017.
Click through for animated images as well as an actual execution plan and recommendations for graph query optimization (spoilers: columnstore all the things). They also link to the GitHub project where you can try it out yourself.
SQL Graph is a similar concept to what is described above, but built in to the core SQL Server engine. This means 2 new table types NODE and EDGE and a few new TSQL functions in particular MATCH(). SQL Graph at the time of writing is only available in SQL 2017 ctp 2.0. You can read more and download ctp2.0 here https://blogs.technet.microsoft.com/dataplatforminsider/2017/04/19/sql-server-2017-community-technology-preview-2-0-now-available/. Once ctp 2.0 is installed there is nothing else you need to do to enable the new graph syntax and storage.
There is an example you can download from Microsoft which is a similar set up to the example in the image above. However I have used some real data shredded from IMDB the internet movie database. This data is available to download from Kaggle https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset
Click through for a video demonstration as well.
Graph extensions are fully integrated in the SQL Server engine. Node and edge tables are just new types of tables in the database. The same storage engine, metadata, query processor, etc., is used to store and query graph data. All security and compliance features are also supported. Other cutting-edge technologies like columnstore, ML using R Services, HA, and more can also be combined with graph capabilities to achieve more. Since graphs are fully integrated in the engine, users can query across their relational and graph data in a single system.
This is interesting. One concern I have had with graph databases is that graphs are storing the same information as relations but in a manner which requires two distinct constructs (nodes and edges) versus one (relations). This seems to be a hybrid approach, where the data is stored as a single construct (relations) but additional syntax elements allow you to query the data in a more graph-friendly manner. I have to wonder how it will perform in a production scenario compared to Neo4j or Giraph.
That is not really going to work out for us…
So I’m not liking the look of this, and going through the results, it seems to me that these results are just not useful. This isn’t the computers fault – it’s done exactly what I’ve told it to do – but a more useful result would be a list of columns and then either a simple ‘Yes’, or a ‘No’.
There’s syntax for this…
This is helpful for normalizing a bunch of wide, related tables into a subclass/superclass pattern.
An easier way to do it is to use the normal distribution, or central limit theorem. My post on the theorem illustrates that a sample will follow normal distribution if the sample size is large enough. We will use that as well as the rules around determining probabilities in a normal distribution, to arrive at the probability in this case.
Problem: I have a group of 100 friends who are smokers. The probability of a random smoker having lung disease is 0.3. What are chances that a maximum of 35 people wind up with lung disease?
Click through for the example.
It looks a bit like someone has fired a shotgun at a wall but is there a relationship between the two variables? If so, what is it? There seems to be a weak positive linear relationship between the two variables here so we can be fairly confident of plotting a trendline.
Here is the data, and we will proceed to calculate the slope and intercept. We will also calculate the correlation.
It’s good to know that this is possible, but I’d switch to R or Python long before.
The mind-blowing part to me is the <> operator – that seemed utterly crazy to me.
And if you add another table, and join to it, guess what happens:
The answer might be a bit surprising.
Kendall’s Tau rank correlation is a handy way of determining how correlated two variables are, and whether this is more than chance. If you just want a measure of the correlation then you don’t have to assume very much about the distribution of the variables. Kendall’s Tau is popular with calculating correlations with non-parametric data. Spearman’s Rho is possibly more popular for the purpose, but Kendall’s tau has a distribution with better statistical properties (the sample estimate is close to a population variance) so confidence levels are more reliable, but in general, Kendall’s tau and Spearman’s rank correlation coefficient are very similar. The obvious difference between them is that, for the standard method of calculation, Spearman’s Rank correlation required ranked data as input, whereas the algorithm to calculate Kendall’s Tau does this for you. Kendall’s Tau consumes any non-parametric data with equal relish.
Kendall’s Tau is easy to calculate on paper, and makes intuitive sense. It deals with the probabilities of observing the agreeable (concordant) and non-agreeable (discordant) pairs of rankings. All observations are paired with each of the others, A concordant pair is one whose members of one observation are both larger than their respective members of the other paired observation, whereas discordant pairs have numbers that differ in opposite directions. Kendall’s Tau-b takes tied rankings into account.
I appreciate Phil putting this series together. I’d probably stick with R, but it’s good to have options.
FYI I’ve tried this at the column and schema levels and it didn’t work.
Using this you can hide the object from SSMS object explorer without restricting its use in any way.
I’m curious if there are any other hidden uses of extended properties. I haven’t been able to find any documentation so if you’ve seen any please let me know!
I don’t think I’ve ever had cause to hide objects from Management Studio, but if you’re looking for next year’s April Fools prank, maybe?