
Day: April 23, 2019

Naive Bayes in R

Zulaikha Lateef takes us through the Naive Bayes algorithm and implementations in R:

Naive Bayes is a Supervised Machine Learning algorithm based on the Bayes Theorem that is used to solve classification problems by following a probabilistic approach. It is based on the idea that the predictor variables in a Machine Learning model are independent of each other, meaning that the outcome of a model depends on a set of independent variables that have nothing to do with each other.

Naive Bayes is one of the simplest algorithms available and yet it works pretty well most of the time. It’s almost never the best solution but it’s typically good enough to give you an idea of whether you can get a job done.
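To make that independence assumption concrete: for a class C and predictors x_1 through x_n, Bayes' Theorem plus independence reduces the posterior to a product of per-predictor terms,

P(C | x_1, …, x_n) ∝ P(C) · P(x_1 | C) · P(x_2 | C) · … · P(x_n | C)

so each conditional probability can be estimated separately from the training data, and the classifier simply picks the class with the largest product.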


Flink: Batch as a Special Case of Streaming

Fabian Hueske and Aljoscha Krettek describe streaming versus batch processing in Apache Flink:

The Apache Flink project has for a long time followed the philosophy of taking a unified approach to batch and stream data processing, building on the core paradigm of “continuous processing of unbounded data streams.” If you think about it, carrying out offline processing of bounded data sets naturally fits the paradigm: these are just streams of recorded data that happen to end at some point in time.

Flink is not alone in this: there are other projects in the open source community that embrace “streaming first, with batch as a special case of streaming,” such as Apache Beam; and this philosophy has often been cited as a powerful way to greatly reduce the complexity of data infrastructures by building data applications that generalize across real-time and offline processing.

Check it out. At the end, the authors also describe Blink, a fork of Flink which supports this paradigm and is (slowly) being merged back in.


Multi-Tenant Security in Kudu + Impala

Grant Henke shows how you can combine Apache Impala’s fine-grained authorization with Apache Kudu’s coarse-grained authorization for multi-tenant scenarios:

Kudu supports coarse-grained authorization of client requests based on the authenticated client Kerberos principal. The two levels of access which can be configured are:
1. Superuser – principals authorized as a superuser are able to perform certain administrative functionality such as using the kudu command line tool to diagnose or repair cluster issues.
2. User – principals authorized as a user are able to access and modify all data in the Kudu cluster. This includes the ability to create, drop, and alter tables as well as read, insert, update, and delete data.

Access levels are granted using whitelist-style Access Control Lists (ACLs), one for each of the two levels. 

Read on to see how to tie it all together.
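For a rough sense of the fine-grained half, Impala (backed by Sentry) handles per-database and per-table privileges with GRANT statements, while Kudu’s coarse-grained side is configured through ACLs on the masters and tablet servers. The role, group, and object names below are made up for illustration:

-- Hypothetical names; assumes Impala with Sentry authorization enabled.
-- Each tenant's role sees only its own database of Kudu-backed tables.
CREATE ROLE tenant_a_role;
GRANT ROLE tenant_a_role TO GROUP tenant_a;
GRANT ALL ON DATABASE tenant_a_db TO ROLE tenant_a_role;

-- Or narrower: read-only access to a single table.
GRANT SELECT ON TABLE tenant_a_db.events TO ROLE tenant_a_role;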


Load Testing Tools For SQL Server

Brent Ozar shares a list of load testing tools for SQL Server:

One thing I need you to understand first: you have to provide the database and the queries. Almost all of the tools in this post, except the last one, are designed to help you run queries, but they don’t include the queries. The whole idea with load testing is that you’re trying to mimic your own workloads. If you’re just trying to test a server with generic workloads, start with my post, “How to Check Performance on a New SQL Server.”

Click through for a list of tools. I’d also throw in Pigdog from Mark Wilkinson (one of my co-workers). This helped replicate a few issues in SQL Server 2017 around tempdb performance.


Dropping Database Objects with Aplomb

Pamela Mooney has a two-part series on dropping database objects. Part one includes a big setup script:

Some months ago, a fellow DBA came to me and expressed her concern over the normal housecleaning process that occurred at her company.  Developers or product owners would submit changes for objects which were no longer used, only to find that sometimes, they were.  No damage had been done, but the possibility of creating an emergent situation was there.

I could well understand her concern.  Most of us who have been DBAs for any length of time have probably come across a similar scenario.  I thought it would be fun to write a process that reduced the risk of dropping database objects and made rollbacks a snap.

Part 2 handles the actual drops:

Now, the objects in the table will be dropped after 120 days.  But what if you need them to be dropped before (or after)?  Both options work, but I’ll show you the before, and also what happens if there is a problem with the drop.

Check it out and drop with impunity. Or at least the bare minimum of punity.
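The core idea is worth spelling out: capture the object’s definition before you drop it, so a rollback is just re-running the saved script. Here is a minimal sketch of that pattern (not Pamela’s actual process; the table and procedure names are invented):

-- Minimal sketch of the save-before-drop idea; names are hypothetical.
CREATE TABLE dbo.DroppedObjectLog
(
    ObjectName sysname NOT NULL,
    ObjectDefinition nvarchar(max) NULL,  -- NULL for objects with no module (e.g., tables)
    DroppedOn datetime2(0) NOT NULL DEFAULT SYSDATETIME()
);

-- Save the definition, then drop; rollback = re-run the saved definition.
INSERT dbo.DroppedObjectLog (ObjectName, ObjectDefinition)
SELECT 'dbo.OldProc', OBJECT_DEFINITION(OBJECT_ID('dbo.OldProc'));

DROP PROCEDURE dbo.OldProc;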


Importing a Private Key From VARBINARY

Solomon Rutzky tries out various methods of loading certificates and private keys in SQL Server:

These results confirm that:
1. You can import a certificate from a VARBINARY literal
2. You can import a private key when creating a certificate from a VARBINARY literal
3. You cannot import a private key when creating a certificate from an assembly
4. Except when creating a certificate from an assembly, any combination of sources for the certificate (i.e. public key and meta-data) and the private key should be valid

It’s a long post with a lot of detail and quite a few tests, so check it out.
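For reference, the VARBINARY-literal form that the first two findings describe looks roughly like this. The hex values are truncated placeholders rather than a real certificate and key, so treat it as the shape of the syntax, not a runnable script:

-- Placeholder hex values; substitute real ASN.1-encoded certificate and key bytes.
CREATE CERTIFICATE ImportedCert
    FROM BINARY = 0x308202EC...        -- public portion of the certificate
    WITH PRIVATE KEY (
        BINARY = 0x1EF1B5B0...,        -- encrypted private key bytes
        DECRYPTION BY PASSWORD = N'SomePassword'
    );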


SARGability and Date Functions

Erik Darling shows why you don’t want to use YEAR() or MONTH() in the WHERE clause when querying a large table:

If you’ve been query tuning for a while, you probably know about SARGability, and that wrapping columns in functions is generally a bad idea.

But just like there are slightly different rules for CAST and CONVERT with dates, the repercussions of the function also vary.

The examples I’m going to look at are for YEAR() and MONTH().

Read the whole thing. Maybe “go to brunch” in the middle of it for maximum effect.
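To see the problem and the standard fix side by side, here’s a minimal sketch against a hypothetical dbo.Orders table: wrapping the column in YEAR()/MONTH() hides it from the index and forces a scan, while an equivalent half-open range on the bare column can use an index seek.

-- Non-SARGable: the functions prevent an index seek on OrderDate.
SELECT COUNT(*)
FROM dbo.Orders
WHERE YEAR(OrderDate) = 2019
  AND MONTH(OrderDate) = 4;

-- SARGable rewrite: a half-open range on the bare column can seek.
SELECT COUNT(*)
FROM dbo.Orders
WHERE OrderDate >= '20190401'
  AND OrderDate <  '20190501';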


Data Breach Causes

Grant Fritchey takes us through some of the immediate causes of data breaches:

Fine, let’s talk about a business then. How about 24 million loan records, including bank account information, email, phones, social security numbers and all the rest. Yeah, that was sitting on an Elasticsearch database with no password of any kind. Oh, and the S3 storage was completely open too. Security? Is that still a thing?

How about exposing your entire client list because you left the password off the database (Elasticsearch again; is it hard to add a password to Elasticsearch?). How about stacks of resumes (Elasticsearch, again, and MongoDB).

Those are just breaches from this year. If we go back, we can find more and more. Please, put a password on your systems. 

The OWASP Top 10 list of application security risks is out there and provides a lot of useful information on how to prevent the problems Grant mentions.
