Author: Kevin Feasel

Event Sourcing On Kafka

Published 2018-03-15 by Kevin Feasel

Adam Warski shows how you can use Apache Kafka as your event sourcing data source:

There’s a number of great introductory articles, so this is going to be a very brief introduction. With event sourcing, instead of storing the “current” state of the entities that are used in our system, we store a stream of events that relate to these entities. Each event is a fact, it describes a state change that occurred to the entity (past tense!). As we all know, facts are indisputable and immutable. For example, suppose we had an application that saved a customer’s details. If we took an event sourcing approach, we would store every change made to that customer’s information as a stream, with the current state derived from a composition of the changes, much like a version control system does. Each individual change record in that stream would be an immutable, indisputable fact.

Having a stream of such events, it’s possible to find out what’s the current state of an entity by folding all events relating to that entity; note, however, that it’s not possible the other way round — when storing the current state only, we discard a lot of valuable historical information.

Event sourcing can peacefully co-exist with more traditional ways of storing state. A system typically handles a number of entity types (e.g. users, orders, products, …), and it’s quite possible that event sourcing is beneficial for only some of them. It’s important to remember that it’s not an all-or-nothing choice, but an additional possibility when it comes to choosing how state is managed in our application.

It’s a helpful article and works hand in hand with a CQRS pattern.

Comments closed

The Basics Of Kafka Security

Published 2018-03-15 by Kevin Feasel

Stephane Maarek has a nice post covering some of the basics of securing an Apache Kafka cluster:

Once your Kafka clients are authenticated, Kafka needs to be able to decide what they can and cannot do. This is where Authorization comes in, controlled by Access Control Lists (ACL). ACL are what you expect them to be: User A can(‘t) do Operation B on Resource C from Host D. Please note that currently with the packaged SimpleAclAuthorizer coming with Kafka, ACL are not implemented to have Groups rules or Regex-based rules. Therefore, each security rule has to be written in full (with the exception of the * wildcard).

ACL are great because they can help you prevent disasters. For example, you may have a topic that needs to be writeable from only a subset of clients or hosts. You want to prevent your average user from writing anything to these topics, hence preventing any data corruption or deserialization errors. ACLs are also great if you have some sensitive data and you need to prove to regulators that only certain applications or users can access that data.

He also has links to additional resources, which is helpful when you want to dig in further.

Comments closed

Wrapping Up A Data Science Project

Published 2018-03-15 by Kevin Feasel

I have finished my series on launching a data science project. First, I have a post on deploying models as microservices:

The other big shift is a shift away from single, large services which try to solve all of the problems. Instead, we’ve entered the era of the microservice: a small service dedicated to providing a single answer to a single problem. A microservice architecture lets us build smaller applications geared toward solving the domain problem rather than trying to solve the integration problem. Although you can definitely configure other forms of interoperation, most microservices typically are exposed via web calls and that’s the scenario I’ll discuss today. The biggest benefit to setting up a microservice this way is that I can write my service in R, you can call it from your Python service, and then some .NET service could call yours, and nobody cares about the particular languages used because they all speak over a common, known protocol.

One concern here is that you don’t want to waste your analysts time learning how to build web services, and that’s where data science workbenches and deployment tools like DeployRcome into play. These make it easier to deploy scalable predictive services, allowing practitioners to build their R scripts, push them to a service, and let that service host the models and turn function calls into API calls automatically.

But if you already have application development skills on your team, you can make use of other patterns. Let me give two examples of patterns that my team has used to solve specific problems.

Then, I talk about the iterative nature of post-deployment life:

At this point in the data science process, we’ve launched a product into production. Now it’s time to kick back and hibernate for two months, right? Yeah, about that…

Just because you’ve got your project in production doesn’t mean you’re done. First of all, it’s important to keep checking the efficacy of your models. Shift happens, where a model might have been good at one point in time but becomes progressively worse as circumstances change. Some models are fairly stable, where they can last for years without significant modification; others have unstable underlying trends, to the point that you might need to retrain such a model continuously. You might also find out that your training and testing data was not truly indicative of real-world data, especially that the real world is a lot messier than what you trained against.

The best way to guard against unbeknownst model shift is to take new production data and retrain the model. This works best if you can keep track of your model’s predictions versus actual outcomes; that way, you can tell the actual efficacy of the model, figuring out how frequently and by how much your model was wrong.

This was a fun series to write and will be interesting to come back to in a couple of years to see how much I disagree with the me of now.

Comments closed

Character Columns And MAX Vs TOP+ORDER Differences

Published 2018-03-15 by Kevin Feasel

Kendra Little digs into a tricky performance problem:

Most of the time in SQL Server, the MAX() function and a TOP(1) ORDER BY DESC will behave very similarly.

If you give them a rowstore index leading on the column in question, they’re generally smart enough to go to the correct end of the index, and — BOOP! — just pluck out the data you need without doing a big scan.

I got an email recently about a case when SQL Server was not smart enough to do this with MAX() — but it was doing just fine with a TOP(1) ORDER BY DESC combo.

The question was: what’s the problem with this MAX?

It took me a while to figure it out, but I finally got to the bottom of the case of the slow MAX.

Great story and very interesting sleuthing work on Kendra’s part.

Comments closed

DISTINCT, GROUP BY, And Transaction Isolation Levels

Published 2018-03-15 by Kevin Feasel

Rob Farley has an interesting post where two similar-looking queries can provide different outputs given certain transaction isolation levels:

Now, it’s been pointed out, including by Adam Machanic (@adammachanic) in a tweet referencing Aaron’s post about GROUP BY v DISTINCT that the two queries are essentially different, that one is actually asking for the set of distinct combinations on the results of the sub-query, rather than running the sub-query across the distinct values that are passed in. It’s what we see in the plan, and is the reason why the performance is so different.

The thing is that we would all assume that the results are going to be identical.

But that’s an assumption, and isn’t a good one.

Rob starts out with READ UNCOMMITTED but then gets into the “normal” READ COMMITTED transaction isolation level that most places use.

Comments closed

Reading Giant Text Files With Powershell

Published 2018-03-15 by Kevin Feasel

Mark Broadbent has a quick Powershell script to read a selection from a giant text file:

I recently ran into this very same problem in which the bcp error message stated:

An error occurred while processing the file “D:\MyBigFeedFile.txt” on data row 123798766555678.

Given that the text file was far in excess of notepad++ limits (or even easy human interaction usability limits), I decided not to waste my time trying to find an editor that could cope and simply looked for a way to pull out a batch of lines from within the file programmatically to compare the good rows with the bad.

Click through for a precise and useful Powershell snippet. The .NET file stream libraries are also good for this kind of thing.

Comments closed

The TIME Data Type

Published 2018-03-15 by Kevin Feasel

Randolph West covers the TIME data type in SQL Server:

This week, we look at the TIME data type. It is formatted as HH:mm:ss.fffffff, where HH is hours between 0 and 23, mmis minutes between 0 and 59, ss is seconds between 0 and 59, and f represents 0 or more fractional seconds, up to a maximum of seven decimal places.

With a maximum length of 5 bytes, TIME can store a value with a granularity of up to 100 nanoseconds.

I tend not to use TIME very often. It’s useful if you need it, but I rarely find myself needing a dateless time.

Comments closed

Working With Azure SQL Managed Instances

Published 2018-03-15 by Kevin Feasel

Jovan Popovic has a couple of posts covering configuration for Azure SQL Managed Instances. First, he looks at how to configure tempdb:

One limitation in the current public preview is that tempdb don’t preserves custom settings after fail-over happens. If you add new files to tempdb or change file size, these settings will not be preserved after fail-over, and original tempdb will be re-created on the new instance. This is a temporary limitation and it will be fixed during public preview.

However, since Managed Instance supports SQL Agent, and SQL Agent can be configured to execute some script when SQL Agent start, you can workaround this issue and create a SQL Agent job that will pre-configure your tempdb.

SQL Agent will start whenever Managed Instance fail-over and the job that contains script above can increase tempdb size before you start running your workload on the new instance.

Then, he covers network configuration:

Managed Instance is your dedicated resource that is placed in Azure Virtual network with assigned private IP address. Before you create Managed Instance, you need to create Azure Virtual network using Azure portal, PowerShell, or Azure CLI.

If you are using Azure portal, make sure that you use Resource Manager deployment model when you create Virtual Network. Classic Virtual Networks are not supported. Once you start creating Virtual Network, make sure that Service Endpoints option is Disabled in Creating Virtual Network Blade (this is default option so don’t change it).

If you want to have only one subnet in your Virtual Network (Virtual Network blade will enable you to define first subnet called default), you need to know that Managed Instance subnet can have between 16 and 256 addresses. Therefore, use subnet masks /28 to /24 when defining your subnet IP ranges for default subnet. If you know how many instances you will have make sure that you have at least 2 addresses per instance + 5 system addresses in the default subnet.

Both posts are useful if you’re interested in getting started with a managed instance.

Comments closed

Flushing The Authentication Cache

Published 2018-03-15 by Kevin Feasel

Arun Sirpal describes an Azure SQL DB-only DBCC command:

This command only applies to Azure SQL Database, at a high level it empties the database authentication cache for logins and firewall rules for the current USER database.

In Azure SQL Database the authentication cache makes a copy of logins and server firewall rules which are in the master database and puts them into memory within the user database. The Database Engine attempts re-authorisation using the originally submitted password and no user input is required.

If this still doesn’t make sense, then an example will really help.

Click through for the helpful example.

Comments closed

Levels And Unique In R

Published 2018-03-14 by Kevin Feasel

Eric Cai demonstrates the difference between levels() and unique() when dealing with factors in R:

The new data set “iris2” does not have any rows containing “setosa” as a possible value of “Species”, yet the levels() function still shows “setosa” in its output.

According to the user G5W in Stack Overflow, this is a desirable behaviour for the levels() function. Here is my interpretation of the intent behind the creators of base R: The possible values of a factor are fundamental attributes of that variable, which should not be altered because of changes in the data.

There’s some back-and-forth in the comments; my takeaway is that both are useful functions depending upon what, exactly, you want to learn.

Comments closed