Avro Schema Compatibility In Kafka

Kevin Feasel

2018-03-27

Hadoop

Neha Bhardwaj walks us through an error in Kafka:

You might have come across a similar exception while working with AVRO schemas.

Kafka throws this exception due to a compatibility issue since the current schema is not compatible with the earlier schema registered on this topic.

You can check the current schema(s) on the topic using:
curl -X GET <a href=”http://localhost:8081/subjects//versions/”&gt;http://localhost:8081/subjects//versions/

Read on to understand what this error means and how you can fix it if you see it.

Making Hadoop A Relational Database

Kevin Feasel

2018-03-27

Hadoop

Alex Woodie reports on an offering by a company named Splice Machine:

“Everybody is struggling to figure out how to expose what’s been put on the data lake to the business,” Splice Machine founder and CEO Monte Zweben told Datanami at the recent Strata Data Conference. “Our opinion is that you can take infrastructure that people understand, like relational database management systems, and run them directly on the data lake.”

That’s essentially the message that Splice has been pushing since the peak of the Hadoop frenzy in the 2013-2015 timeframe, and it’s the same message that it’s pushing today. The big difference, according to Zweben, is the maturity level. Splice Machine’s open source technology that essentially turns Hadoop into a distributed ACID-compliant relational database is now ready for primetime. Wells Fargo is arguably its biggest paying customer and production use case, but it has dozens more across financial services, healthcare and other industries.

Feasel’s Law strikes again.

Comparing Hadoop On EC2 To EMR

Mark Litwintschik looks at EC2 versus EMR in terms of performance, specifically targeting a solution at less than $3 per hour:

The $3.00 price point was driven by the first method: running a single-node Hadoop installation. I wanted to make sure the dataset used in this benchmark could easily fit into memory.

The price set the limit for the second method: AWS EMR. This is Amazon’s Hadoop Platform offering. It has a huge feature set but the key one is that it lets you setup Hadoop clusters with very little instruction. The $3.00 price limit includes the service fee for EMR.

Note I’ll be running everything in AWS’ Irish region and the prices mentioned are region- and in the case of spot prices, time-specific.

I was a bit surprised at which service won.

Automated Cleanup With Query Store

Grant Fritchey discusses Query Store’s automated cleanup and also looks at an interesting question:

Query Store has mechanisms for automatically cleaning your data. It is possible to cause them to break down. While presenting a session about the Query Store recently, I was asked what happened if you set the size of the Query Store below the amount of data currently in the store. I didn’t know the answer, so we tried it. Things got a little weird.

Click through to see how weird.

Unique Indexes Versus Unique Constraints

Greg Low argues that you should create unique constraints instead of unique indexes whenever possible:

The CREATE INDEX statement is used to do exactly what its name says, it creates an index. But when you say CREATE UNIQUE INDEX, you are doing more than that; you are enforcing a business rule that involves uniqueness.

I have a simple rule on this. Wherever possible business rules like uniqueness, check values, etc. should be part of the design of the table, and not enforced in an external object like an index.

So, rather than a unique index, I’d rather see a unique constraint on the underlying table.

But that’s where real life steps in. I see two scenarios that lead me to occasionally use CREATE UNIQUE INDEX.

Here’s a third:  creating constraints can cause blocking issues.  If you already have a large table and Enterprise Edition, creating a unique index can be an online operation (unless you have a clustered columnstore index on the table), but a unique constraint is always a blocking activity.

Backing Up Azure SQL Databases

Arun Sirpal enumerates the options we have for backups of Azure SQL Databases:

If you have a business requirement which has a need to retain database backups for longer than 35 days, then you have an option to use long-term backup retention. This feature utilises the Azure Recovery Services Vault where you can store up to 10 years’ worth of backups for up to 1000 databases per vault and 25 vaults per subscription.

There are some guidelines that you need to follow to successful set this up:

  • Your vault MUST be in the same region, subscription and resource group as your logical SQL Server, if not then you will not be able to set this up.

  • Register the vault to the server.

  • Create a protection policy.

  • Apply the above policy to the databases that require long-term backup retention.

Arun also looks at restoration options.

The Data Exploration Process

Stacia Varga takes a step back from analyzing NHL data to explore it a little more:

As I mentioned in my last post, I am currently in an exploratory phase with my data analytics project. Although I would love to dive in and do some cool predictive analytics or machine learning projects, I really need to continue learning as much about my data as possible before diving into more advanced techniques.

My data exploration process has the following four steps:

  1. Assess the data that I have at a high level

  2. Determine how this data is relevant to the analytics project I want to undertake

  3. Get a general overview of the data characteristics by calculating simple statistics

  4. Understand the “middles” and the “ends” of your numeric data points

There’s some good stuff in here.  I particularly appreciate Stacia’s consideration of data exploration as an iterative process.

The Blame Game

Kenneth Fisher has a board game for us:

It’s Monday morning and your manager Brent has called his usual emergency all-employee meeting. He looks more than a little bit unhappy, and this time it’s not because someone stole his cruller. Over the weekend he was demonstrating the new anatomy program Mr. Body to some investors and frankly the performance was miserable! Now Brent has only one question.

Who killed Mr. Body’s performance?

We all know Andy Mallon did it.

Check Constraints To Block Leading And Trailing Spaces

Louis Davidson shows how to use check constraints to block people from inserting records with leading or trailing spaces:

Then, let’s say the requirements are as follows:

1. No values that are either empty or only spaces
2. No leading spaces
3. No trailing spaces
4. Allow NULL if column allows NULL

Let’s look at how we could implement all of these independently, as there certainly are cases where you may wish to allow any or all of the situations in a column.

Click through for the scripts, as well as a time comparison to see how much overhead you’re adding.

Categories

March 2018
MTWTFSS
« Feb Apr »
 1234
567891011
12131415161718
19202122232425
262728293031