Press "Enter" to skip to content

Day: September 5, 2017

Learning Naive Bayes

Sunil Ray explains the Naive Bayes algorithm:

What are the Pros and Cons of Naive Bayes?

Pros:

  • It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
  • When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and you need less training data.
  • It performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).

Cons:

  • If a categorical variable has a category (in the test data set) which was not observed in the training data set, then the model will assign a zero probability and will be unable to make a prediction. This is often known as “Zero Frequency”. To solve this, we can use a smoothing technique; one of the simplest is called Laplace estimation (sketched just after this list).

  • On the other side, Naive Bayes is also known to be a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.

  • Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors which are completely independent.
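
For reference on that smoothing fix, Laplace (add-one) smoothing simply adds one to each category count so that no category ever gets a zero probability. For a feature x_i with K distinct categories, the smoothed estimate is:

```latex
\hat{P}(x_i \mid c) = \frac{\operatorname{count}(x_i, c) + 1}{\operatorname{count}(c) + K}
```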

Read the whole thing.  Naive Bayes is such an easy algorithm, yet it works remarkably well for categorization problems.  It’s typically not the best solution, but it’s a great first solution.  H/T Data Science Central

Diving Into Spark’s Cost-Based Optimizer

Ron Hu, et al, explain how Spark’s cost-based optimizer works:

At its core, Spark’s Catalyst optimizer is a general library for representing query plans as trees and sequentially applying a number of optimization rules to manipulate them. A majority of these optimization rules are based on heuristics, i.e., they only account for a query’s structure and ignore the properties of the data being processed, which severely limits their applicability. Let us demonstrate this with a simple example. Consider a query shown below that filters a table t1 of size 500GB and joins the output with another table t2 of size 20GB. Spark implements this query using a hash join by choosing the smaller join relation as the build side (to build a hash table) and the larger relation as the probe side. Given that t2 is smaller than t1, Apache Spark 2.1 would choose the right side as the build side without factoring in the effect of the filter operator (which in this case filters out the majority of t1’s records). Choosing the incorrect side as the build side often forces the system to give up on a fast hash join and turn to sort-merge join due to memory constraints.

Click through for a very interesting look at this query optimizer.
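
If you want to experiment with the cost-based optimizer yourself, it ships (off by default) with Spark 2.2. A minimal sketch in Spark SQL, reusing the t1/t2 tables from the example above (the column names are hypothetical):

```sql
-- Turn on the cost-based optimizer (off by default)
SET spark.sql.cbo.enabled = true;

-- The CBO only reasons about tables that have statistics, so collect them
ANALYZE TABLE t1 COMPUTE STATISTICS;
ANALYZE TABLE t2 COMPUTE STATISTICS;

-- Column-level statistics let the optimizer estimate the filter's selectivity
ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, value;
```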

Connecting To Kafka Via SSL

Harikiran Nayak shows how to work with secure Kafka connections:

First, the Kafka broker must be configured to accept client connections over SSL. Please refer to the Apache Kafka documentation to configure your broker. If your Kafka cluster is already SSL-enabled, you can look up the port number in your Kafka broker configuration file (or the broker logs). Look for the listeners=SSL://host.name:port configuration option. To ensure that the Kafka broker is correctly configured to accept SSL connections, run the following command from the same host that you are running SDC (StreamSets Data Collector) on. If SDC is running from within a Docker container, log in to that Docker container and run the command.

Read on for more.
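
For orientation, the broker side of that setup lives in server.properties; a rough sketch of an SSL listener, with placeholder host, paths, and passwords (my example values, not taken from the article):

```properties
# Hypothetical values: substitute your own host, port, paths, and passwords
listeners=SSL://kafka-broker.example.com:9093
ssl.keystore.location=/etc/kafka/ssl/broker.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
ssl.truststore.location=/etc/kafka/ssl/broker.truststore.jks
ssl.truststore.password=changeit
```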

Finding Out Whodunnit Using The Transaction Log

David Fowler shows us how to figure out which user made a bad data change when you don’t have auditing mechanisms in place:

So it’s looking like things are in a bad way. Obviously we could go to a backup and get the old values back, but that’s never going to tell us who made the change.  So, to that transaction log again: how do we actually go about getting our hands dirty and having a look at it?

Well, there’s a nice little undocumented function called fn_dblog.  Let’s try giving that a go and see what we get back. By the way, the two parameters are the first and last LSNs that you want to look between.  Leaving them as NULL will return the entire log.

This is great unless you have connection pooling and the problem happened through an application.  In that case, the returned username will be the application’s username.

CHECKDB On Azure SQL Database

Arun Sirpal ponders running DBCC CHECKDB on Azure SQL Database:

I was exchanging messages with Azure Support and, even though I didn’t get a concrete answer to confirm this, I ended up asking the question within a Microsoft-based Yammer group, and yes, they do automatically carry out consistency checks.

This is great, as it is one less thing for me to worry about, and if there is serious corruption (you know, potential data loss, which would be rare), then they will definitely tell you and work with you.

However, it doesn’t mean you can’t run it. I was curious, so I ran DBCC CHECKDB on my Azure SQL Databases; as with any other consistency check, though, it is best to do it during off-peak hours. I would probably take it a step further and wouldn’t even bother running it.
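
For anyone who does want to run it off-peak, the command itself is the same as on-premises; a minimal sketch (in Azure SQL Database it runs against the current database):

```sql
-- Full logical and physical checks, suppressing informational messages
DBCC CHECKDB WITH NO_INFOMSGS;

-- A lighter-weight alternative if you are worried about load
DBCC CHECKDB WITH PHYSICAL_ONLY;
```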

It’s an interesting post, reminding us that administering an Azure database isn’t the same as on-prem.

Two Mobile Reporting Stories On The Microsoft Stack

Paul Turley talks about the two competing mobile stories on the Microsoft stack, namely SSRS with Datazen and Power BI:

This is the story of two products – or rather one product that is now a service and another product that is now a component of another product.  A few years ago, Microsoft began to formulate a mobile usability story among many fragmented tools.  They had a really good reporting product: SSRS, and they had a pretty good self-service BI capability offered as a bunch of Excel add-ins; namely: Power Pivot, Power Query and Power View – but it didn’t do mobile.  They bought Datazen, which was a decent mobile reporting and dashboard tool, designed primarily for IT developers and semi-tech-savvy business pros to quickly create mobile dashboards using traditional data sources.  Datazen wasn’t really a self-service BI tool and wasn’t really designed to work with BI data in the true sense.  It was a good power user report tool but was young and needed to be refined and matured as a product.  Datazen became “Reporting Services Mobile Reports” and was integrated into the SSRS platform as a separate reporting experience with a separate design tool, optimized exclusively for use on mobile devices using platform-specific mobile phone and tablet apps.  Since its initial roll-out, product development has stalled; the product has not changed at all since it was released with SQL Server 2016 Enterprise Edition.

Paul gives us his current advice, as well as a hint at where things could be going.

New E-Mail Course For Unit Testing T-SQL Code

Ed Elliott has a free e-mail course available to learn how to use tSQLt:

Unit testing helps us to write better code and make rapid changes to our code, and it has been generally seen as a good idea for about 10 years. Writing tests for T-SQL code is made much easier by using tSQLt, but there is quite a high barrier to entry, both in terms of the technical skills needed to get tSQLt running and in terms of how to approach large code bases of, sometimes, unfriendly T-SQL code and tame that code with unit tests.

I have successfully unit tested T-SQL code in a number of different environments, including clean greenfield environments as well as legacy projects, and I have written this course to help people get started with unit testing, but also to help them turn unit testing into a part of their development process that they can use every day to improve the quality of their work and the speed at which deployments can be made.
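
To give a flavor of what a tSQLt test looks like (a minimal sketch; dbo.GetCustomerName and the Customer table are hypothetical, not taken from the course):

```sql
EXEC tSQLt.NewTestClass 'CustomerTests';
GO
CREATE PROCEDURE CustomerTests.[test GetCustomerName returns the matching name]
AS
BEGIN
    -- Arrange: swap the real table for an empty fake and seed one row
    EXEC tSQLt.FakeTable 'dbo.Customer';
    INSERT INTO dbo.Customer (CustomerID, CustomerName)
    VALUES (1, N'Ada');

    -- Act
    DECLARE @actual NVARCHAR(100) = dbo.GetCustomerName(1);

    -- Assert
    EXEC tSQLt.AssertEquals @Expected = N'Ada', @Actual = @actual;
END;
GO
-- Run every test in the class
EXEC tSQLt.Run 'CustomerTests';
```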

Click through to sign up.

Protecting Stored Procedures Against SQL Injection

Bert Wagner has a two-part series on SQL injection.  In the first post, he shows how to use sp_executesql to parameterize queries:

The important thing to note in the query above is that we are generating a dynamic SQL statement; that is, we are building the SQL query string, and then we are executing it.

Imagine this stored procedure is running in order to display a “Welcome <Full Name>!” message in our app — a website visitor types in their @ParmUserName and we execute the stored procedure to return their full name.
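
The parameterized version of that pattern looks roughly like this (a sketch with hypothetical table and column names, not Bert's exact procedure):

```sql
CREATE PROCEDURE dbo.GetFullName
    @ParmUserName NVARCHAR(100)
AS
BEGIN
    DECLARE @query NVARCHAR(MAX) =
        N'SELECT FullName FROM dbo.Users WHERE UserName = @UserName;';

    -- sp_executesql binds @UserName as a true parameter, so its value is
    -- never parsed as part of the SQL statement itself
    EXEC sp_executesql
        @query,
        N'@UserName NVARCHAR(100)',
        @UserName = @ParmUserName;
END;
```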

In his second post, Bert shows what to do if you need to run a query off of a dynamically-selected table:

Unfortunately we have to fall back on SQL’s EXEC command.

However, like we discussed last week, we need to be vigilant about what kind of user input we allow to be built as part of our query.

Assuming our app layer is already sanitizing as much of the user input as possible, here are some precautions we can take on the SQL side of the equation:

Read on for more.
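
One such precaution, as a sketch (my example, not necessarily one from Bert's list): validate the incoming table name against sys.tables and wrap it in QUOTENAME before building the string:

```sql
CREATE PROCEDURE dbo.SelectFromTable
    @TableName SYSNAME
AS
BEGIN
    -- Whitelist: only proceed if the name matches a real table in this database
    IF NOT EXISTS (SELECT 1 FROM sys.tables WHERE name = @TableName)
    BEGIN
        RAISERROR(N'Unknown table.', 16, 1);
        RETURN;
    END;

    -- QUOTENAME brackets the name and escapes any embedded ] characters
    DECLARE @query NVARCHAR(MAX) =
        N'SELECT * FROM dbo.' + QUOTENAME(@TableName) + N';';

    EXEC (@query);
END;
```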

Get Security Update List In Powershell

Jana Sattainathan builds a detailed CSV of Microsoft monthly security updates using Powershell:

Once I understood the data well, I realized that the raw data had to be flattened out, expanding collections (like KB) at the row level into their own rows, so that everything has a single value in each row. Then, the grouping is easy.

It made more sense to allow grouping not just by KB but by other columns, like Product or CVE. Group-Object works fine for most cases, but since there will be duplicates after the data is grouped, it is easier to just do it with hashtables.

Jana provides the entire solution on his site.  When reading it, I felt the urge to switch to a language which offers easier pivoting and aggregation, but the code was clear and understandable.
