Month: May 2019

Defining TF-IDF

Bruno Stecanella explains the concept behind TF-IDF:

TF-IDF was invented for document search and information retrieval. It works by increasing proportionally to the number of times a word appears in a document, but is offset by the number of documents that contain the word. So, words that are common in every document, such as this, what, and if, rank low even though they may appear many times, since they don’t mean much to that document in particular.

However, if the word Bug appears many times in a document, while not appearing many times in others, it probably means that it’s very relevant. For example, if what we’re doing is trying to find out which topics some NPS responses belong to, the word Bug would probably end up being tied to the topic Reliability, since most responses containing that word would be about that topic.

This makes the technique useful for natural language processing, especially in classification problems.
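To make the weighting in the quote concrete, here is a minimal sketch in plain Python. It assumes one common formulation (term frequency as a share of the document's tokens, inverse document frequency as the log of total documents over documents containing the term); libraries often use smoothed variants. The function name `tf_idf` and the sample responses are hypothetical and do not come from the linked article.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    tf  = occurrences of the term in the document / tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Count, for each term, how many documents contain it at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical survey responses, invented for illustration only.
responses = [
    "the app has a bug and the bug crashes the app",
    "the price is too high",
    "the support team is great",
]
scores = tf_idf(responses)
```

In this sketch, "the" appears in every response, so its inverse document frequency is log(1) = 0 and it scores zero everywhere, while "bug" appears twice in one response and nowhere else, so it ends up among the highest-scoring terms for that response.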

Comments closed

Puzzling Through Older Problems

Kenneth Fisher shares a couple of interview puzzles:

The year is 2004. You’re taking a tech test as an interview for a SQL development job. They have a page in their application that displays up to 20 rows of information. They need a piece of code that will return the rows from a given page. Oh, and it may not always be 20 rows per page. You need to write a piece of code where they can pass in a page number and page size and get back results. So for example, if the page size is 20 and the page is 3 then you need to return back rows 41 to 60.
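As a quick check of the arithmetic in that example (and only the arithmetic, not the SQL the puzzle asks for), the row range for a page can be sketched as follows. The function name `page_bounds` is hypothetical, and the sketch assumes 1-based page numbers and 1-based row numbers, as the example implies.

```python
def page_bounds(page_number, page_size):
    """Return the (first_row, last_row) pair for a 1-based page number."""
    if page_number < 1 or page_size < 1:
        raise ValueError("page_number and page_size must be at least 1")
    first_row = (page_number - 1) * page_size + 1
    last_row = page_number * page_size
    return first_row, last_row


# Page 3 with 20 rows per page covers rows 41 through 60.
bounds = page_bounds(3, 20)
```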

The answers aren’t on the page, but then again, that’s the point of a puzzle.

The SSMS Magic 8 Ball

Bert Wagner has fun with SSMS:

As a kid, I found Magic 8 Balls alluring. There is something appealing about a who-knows-how-many-sides die emerging from the depths of a mysterious inky blue fluid to help answer life’s most difficult questions.

I never ended up buying a magic eight ball of my own though, so today I’m going to build and animate one in SQL Server Management Studio.

Now you can finally answer those important life questions without leaving Management Studio.

RANKX Across Multiple Columns

Jason Baldessari continues a series on RANKX in DAX:

What happens when we need to rank using multiple criteria? In the example below, we are going to look at resellers by name and key in a SalesTerritoryGroup, and we are going to rank them based on the number of items they have sold.

Let’s start with the model. I did slightly modify Rob’s original 3 table model approach. I now have 5 tables here, but the model conceptually still works the same. I have a sales table, a date table, a resellers table, and two lookup tables, one for geography and one for sales territory.
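The idea of ranking resellers inside each sales territory group can be sketched outside of DAX as well. The Python below is not the article's RANKX formula; it is an illustration that assumes descending order by items sold and that tied resellers share a rank with the next rank skipped. The function name, column names, and rows are all hypothetical.

```python
from collections import defaultdict


def rank_within_group(rows, group_key, value_key):
    """Add a 'rank' to each row: 1 + number of rows in the same group
    with a strictly larger value (ties share a rank, next rank is skipped)."""
    values_by_group = defaultdict(list)
    for row in rows:
        values_by_group[row[group_key]].append(row[value_key])

    ranked = []
    for row in rows:
        higher = sum(1 for v in values_by_group[row[group_key]] if v > row[value_key])
        ranked.append({**row, "rank": higher + 1})
    return ranked


# Hypothetical resellers, identified by key and name, with invented totals.
resellers = [
    {"reseller_key": 1, "reseller": "A", "territory_group": "Europe", "items_sold": 120},
    {"reseller_key": 2, "reseller": "B", "territory_group": "Europe", "items_sold": 200},
    {"reseller_key": 3, "reseller": "C", "territory_group": "Europe", "items_sold": 120},
    {"reseller_key": 4, "reseller": "D", "territory_group": "North America", "items_sold": 90},
    {"reseller_key": 5, "reseller": "E", "territory_group": "North America", "items_sold": 300},
]
ranked = rank_within_group(resellers, "territory_group", "items_sold")
ranks = {row["reseller"]: row["rank"] for row in ranked}
```

Here resellers A and C tie at rank 2 within Europe, and each territory group gets its own rank 1.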

Read the whole thing.

Causing Error 666 When Loading Into Columnstore Index

Joe Obbish has moved into Erik Darling’s Internet Basement and has a doozy of a first post there:

I need to find a relatively efficient way to advance the CSILOCATOR because I need to do it over 2 billion times, if my theory is correct about the maximum allowed value. Both updating all of the rows in a delta rowgroup and deleting and reinserting advance the CSILOCATOR. I expected that small batch sizes would work best, and they did. For my table’s schema, the sweet spot for updates is about 275 rows and the sweet spot for delete/inserts is about 550 rows. Delete/inserts appeared to be faster than updates for the purpose of constantly reloading the same rows over and over.

Great post, Brent!

Embedding Notebooks on a Website

Eduardo Pivaral shows how to embed the results of a Jupyter notebook created in Azure Data Studio on a website:

Notebooks are a feature of Azure Data Studio that allows you to create and share documents that may contain text, code, images, and query results. These documents are helpful for sharing database insights and creating runbooks that you can share easily.

Are you new to notebooks? Not sure what they are used for? Want to know how to create your first notebook? Then you can get started with ADS notebooks by checking my article for MSSQLTips.com here.

Once you have created your first notebooks and shared them with your team, you may want to share them on your website or blog for public access. Although you can share the file for download, you can also embed it in the HTML code.

Be sure to read the comments too. Rendering notebooks is…an imperfect operation.

Recompile Hints and Query Store

Erin Stellato answers two questions relating to recompile hints and whether those queries show up in Query Store:

Last week in our IEPTO2 class I was asked about queries with OPTION (RECOMPILE) and Query Store. Specifically: Do queries that have the OPTION (RECOMPILE) hint go into Query Store, AND do queries in a stored procedure created with the RECOMPILE option go into Query Store? I knew the answer to the first question, and was pretty sure I knew the answer to the second one, but I wanted to test to verify. Let’s take a look.

Erin gives you a tl;dr version but I’m going to ask you to read the whole thing anyhow.

Defining Tidy Data

John Mount shares thoughts about the concept of tidy data:

A question is: is such a data set “tidy”? The paper itself claims the above definitions are “Codd’s 3rd normal form.” So, no, the above table is not “tidy” under that paper’s definition. The winner’s date of birth is a fact about the winner alone, and not a fact about the joint row keys (the tournament plus year) as required by the rules of Codd’s 3rd normal form. The critique being: this data presentation does not express the intended data invariant that Al Fredrickson must have the same “Winner Date of Birth” in all rows.

My spin on it is that tidy data is Boyce-Codd Normal Form but may subsequently be denormalized. This may reintroduce violations of 3NF (as in Mount’s example) and sometimes 2NF, but does not change the shape of the variables themselves—that is, a variable still represents a single thing and exists per observation.
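The invariant Mount describes, that a winner has one date of birth no matter how many rows mention them, is something a denormalized table cannot enforce on its own, but it can be checked. The sketch below is one way to do that in plain Python. The function name, column names, and every value in the rows (including the dates) are placeholders invented for illustration, not data from the post.

```python
from collections import defaultdict


def dependency_violations(rows, determinant, dependent):
    """Return {determinant_value: set_of_dependent_values} for every
    determinant value that maps to more than one dependent value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return {key: values for key, values in seen.items() if len(values) > 1}


# Placeholder rows keyed by (tournament, year); all values are made up.
winners = [
    {"tournament": "Tournament X", "year": 1998, "winner": "Al Fredrickson", "winner_dob": "1975-01-01"},
    {"tournament": "Tournament Y", "year": 1999, "winner": "Al Fredrickson", "winner_dob": "1975-01-01"},
    {"tournament": "Tournament X", "year": 1999, "winner": "Someone Else", "winner_dob": "1980-06-15"},
]
clean = dependency_violations(winners, "winner", "winner_dob")

# Simulate an update anomaly: one row drifts out of sync with the other.
drifted = [dict(row) for row in winners]
drifted[1]["winner_dob"] = "1976-01-01"
broken = dependency_violations(drifted, "winner", "winner_dob")
```

With the consistent rows the check returns an empty dict. After the simulated edit it reports Al Fredrickson with two different dates of birth, which is exactly the kind of inconsistency a normalized winners table would rule out by construction.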

Visualizing Earthquake Data

Giorgio Garziano continues a series on analyzing earthquake data:

This is the third part of our post series about the exploratory analysis of a publicly available dataset reporting earthquakes and similar events within a specific 30-day time span. In this post, we are going to show static, interactive, and animated earthquake maps of different flavors by using the functionality provided by a pool of R packages, as explained below.

Giorgio looks at 9 separate R mapping packages, so you get your money’s worth here.
