# Day: August 9, 2018

### Naive Bayes in the Industry

Now that you have an idea of what exactly Naive Bayes is and how it works, let’s see where it is used in the industry.

Our first industrial use case is news categorization, or we can use the term ‘text classification’ to broaden the spectrum of this algorithm. News on the web is growing rapidly, and each news site has its own layout and categorization for grouping news. Companies use a web crawler to extract useful text from the HTML pages of news articles to construct a Full Text RSS feed. The contents of each news article are tokenized. In order to achieve better classification results, we remove the less significant words (i.e., stop words) from the document. We then apply the naive Bayes classifier to classify news content based on news code.
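
For context, the decision rule behind a classifier like this (the standard naive Bayes formulation, not anything specific to the linked implementation) picks the category $c$ that maximizes the prior times the per-token likelihoods:

$$\hat{c} = \arg\max_{c}\; P(c) \prod_{i=1}^{n} P(w_i \mid c)$$

where $w_1, \dots, w_n$ are the document’s tokens after stop word removal.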

It’s a good overview of the topic and of a particular implementation in Python.  Naive Bayes is a technique you want in your bag:  there are a lot of techniques which tend to be better in specific domains, but Naive Bayes is easy to implement and usually provides acceptable performance.

Dave Mason continues his quest to learn R, focusing on vectors.  First, he looks at vector-based mathematical operations:

Now we can determine the number of customers gained vs number of customers lost (plus/minus) for each month of the quarter by subtracting one vector from another. Each vector has the same number of elements (three), and the result is also a vector of three elements:

```
> net_customer_gain <- new_customers - customers_lost
> net_customer_gain
Jan Feb Mar
-15  30   3
```

The sum() function can be used to add up all the elements of a vector. Below, we get the total number of new customers and lost customers for the first quarter:

```
> sum(new_customers)
[1] 270
> sum(customers_lost)
[1] 252
```

To extract multiple elements from a vector, pass an integer-class vector to the square brackets. The values of the integer vector correspond to the elements to be extracted. Here we will extract the first, third, and fourth elements of the jersey_numbers vector:

```
> jersey_numbers[c(1,3,4)]
Pierce  Rondo  Allen
    34      9     20
```

The values of the integer vector can be in any order:

```
> jersey_numbers[c(4,1,3)]
 Allen Pierce  Rondo
    20     34      9
```

Vectors are a critical part of understanding R.

Derek Jones recently discussed a possible future for the `R` ecosystem in “StatsModels: the first nail in R’s coffin”.

This got me thinking on the future of `CRAN` (which I consider vital to `R`, and vital in distributing our work) in the era of super-popular meta-packages. Meta-packages are convenient, but they have a profoundly negative impact on the packages they exclude.

I’m not really sold on Jones’s argument, but I do think Mount has a good point.

Now let’s talk about the key benefit of this delay. In the scenario where a user inadvertently drops a table, we can recover the data quickly from the secondary database, as long as the delay period has not elapsed. In this example, we drop the table Sales.OrderLines on BOTH databases and verify that the table no longer exists in BOTH databases.
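
A sketch of the recovery step, assuming the delay window is still open and the delayed secondary is reachable from the primary as a linked server (the server, database, and recovery table names here are hypothetical):

```
-- Copy the still-intact rows back from the delayed secondary before
-- the delay elapses and the DROP is applied there as well.
SELECT *
INTO   Sales.OrderLines_Recovered
FROM   [SecondaryServer].[WideWorldImporters].Sales.OrderLines;
```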

It’s a fairly expensive technique but potentially still cheaper than restoring from backups.

As you can see, the groups are obtained by scanning the index on the groups table, and the aggregate is obtained by applying a seek in the index on the main table. The higher the density of the grouping set, the more efficient this plan is compared to the default strategy for the grouped query.

Just like we did earlier for the default scan strategy, let’s estimate the number of logical reads and plan cost for the seeks strategy. The estimated number of logical reads is the number of reads for the single execution of the Index Scan operator that retrieves the groups, plus the reads for all of the executions of the Index Seek operator.

The estimated number of logical reads for the Index Scan operator is negligible compared to the seeks; still, it’s CEILING(1e0 * @numgroups / @rowsperpage). Take Query 4 as an example; say the index idx_sid fits about 600 rows per leaf page (actual number depends on actual shipperid values since the datatype is VARCHAR(5)). With 5 groups, all rows fit in a single leaf page. If you had 5,000 groups, they would fit in 9 pages.
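
To make the arithmetic concrete, here is that formula evaluated with the example’s figures (5,000 groups at roughly 600 rows per leaf page):

```
-- Estimated pages read by the Index Scan that retrieves the groups:
DECLARE @numgroups INT = 5000, @rowsperpage INT = 600;

SELECT CEILING(1e0 * @numgroups / @rowsperpage) AS estimated_pages;  -- 9
```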

Plus some love for the APPLY operator.  Read the whole thing.

The scripts I ran to edit the rest of the databases looked similar to the below:
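
A minimal sketch of that kind of script, with a hypothetical database name and target paths:

```
-- Repoint each file at its new location. This is a metadata-only change
-- which takes effect the next time the database comes online.
ALTER DATABASE SampleDB
    MODIFY FILE (NAME = SampleDB_Data, FILENAME = N'E:\NewData\SampleDB.mdf');

ALTER DATABASE SampleDB
    MODIFY FILE (NAME = SampleDB_Log, FILENAME = N'F:\NewLogs\SampleDB_log.ldf');
```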

Once I finished altering all of my database files to their new locations, I stopped the SQL Server service in the Services console. I copied and pasted all of the MDF and LDF files to their corresponding new destinations, and then started the SQL Server service once more.

That’s when I ran into the interesting issue of databases in a “Recovery Pending” state. Some digging and sleuthing brought me back to my scripts.

Read on for those causes.

The other day, I had a problem with some data that I never dreamed I would ever see. In a case-insensitive database, in a table column that was case-insensitive, the customer was using the data as case-sensitive. Firstly, let’s just go ahead and say it: “This was a sucky implementation.” But as is common in my typical role as a data architect on the data warehousing team, I get to learn all sorts of interesting techniques for finding and dealing with “data” that has been used in “interesting” ways.

What is kind of interesting is actually figuring out what that duplicated data was. The case I was dealing with wasn’t a useful packed surrogate value, where you might use a base-62 number with a-z, A-Z, and 0-9 as characters: 1, 2, …, 9, 0, a, b, c, …, x, y, z, A, B, and so on. In that sequence, 1A1 is a different value than 1a1, and is greater. It’s a neat technique, and one that I have been threatening to develop using a SEQUENCE object, since you can pack a lot of sequential data into a small number of bytes. No, this wasn’t a useful case like that: here, one value was lower case and another had leading capitals. So perhaps “active customer” and “Active Customer”. Yeah, seriously, they meant different things.
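
A minimal sketch of one way to surface these values, assuming a hypothetical dbo.Customer table whose Status column uses a case-insensitive collation:

```
-- Group case-insensitively (the column's own collation), then count how
-- many distinct case-sensitive spellings collapse into each group.
SELECT Status,
       COUNT(DISTINCT Status COLLATE Latin1_General_BIN2) AS case_variants
FROM   dbo.Customer
GROUP BY Status
HAVING COUNT(DISTINCT Status COLLATE Latin1_General_BIN2) > 1;
```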

Louis shows some of the nuance required in making this work.

You just need to select a destination database connection and table name, and the export starts. Be warned that it doesn’t default to the current database connection. I’ve fallen for that and overwritten the data in a table with the same name on a different SQL instance – whoops!

If the option is greyed out when you open the menu, it may be that your event data is still loading. If you look closely at the above screenshot, you can see I have over 8 million events captured by this session, so it took a while to load before I was able to export.

There are a few gotchas that Matthew shows, but it’s a useful technique.

### Modifications to table variables

UPDATE, INSERT and DELETE on table variables cause a completely serial plan. SELECT statements, on the other hand, don’t necessarily.
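
A quick repro sketch, assuming a reasonably large helper table dbo.Numbers exists; compare the actual execution plans of the two statements:

```
DECLARE @t TABLE (n INT NOT NULL);

-- The INSERT into the table variable gets a fully serial plan...
INSERT INTO @t (n)
SELECT n FROM dbo.Numbers;

-- ...while a SELECT that merely reads the table variable can still
-- participate in a parallel plan.
SELECT COUNT(*)
FROM @t AS t
JOIN dbo.Numbers AS num ON num.n = t.n;
```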

### Scalar functions

Completely serial. Even when they’re used in computed columns in one of the tables. Even when you’re not referencing that actual column.

Not all computed columns generate serial plans – only those with scalar functions.
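
A sketch of that computed-column case, with hypothetical objects:

```
CREATE FUNCTION dbo.PlusOne (@i INT)
RETURNS INT
AS
BEGIN
    RETURN @i + 1;
END;
GO

CREATE TABLE dbo.SerialDemo
(
    id        INT NOT NULL,
    idPlusOne AS dbo.PlusOne(id)   -- scalar UDF in a computed column
);
GO

-- Per the post, even a query that never references idPlusOne
-- is limited to a serial plan:
SELECT id
FROM dbo.SerialDemo;
```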

Read on for a number of other places.  It turns out that this set is pretty stable from SQL Server 2012 through 2017.