2018-01-19 – Curated SQL

Kafka Topic Reuse

The common wisdom (according to several conversations I’ve had, and according to a mailing list thread) seems to be: put all events of the same type in the same topic, and use different topics for different event types. That line of thinking is reminiscent of relational databases, where a table is a collection of records with the same type (i.e. the same set of columns), so we have an analogy between a relational table and a Kafka topic.

The Confluent Avro Schema Registry has traditionally reinforced this pattern, because it encourages you to use the same Avro schema for all messages in a topic. That schema can be evolved while maintaining compatibility (e.g. by adding optional fields), but ultimately all messages have been expected to conform to a certain record type. We’ll revisit this later in the post, after we’ve covered some more background.

For some types of streaming data, such as logged activity events, it makes sense to require that all messages in the same topic conform to the same schema. However, some people are using Kafka for more database-like purposes, such as event sourcing, or exchanging data between microservices. In this context, I believe it’s less important to define a topic as a grouping of messages with the same schema. Much more important is the fact that Kafka maintains ordering of messages within a topic-partition.
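To make the ordering point concrete, here is a minimal sketch in plain Python that simulates a single topic carrying several event types for the same entity. It does not use a Kafka client: the `partition_for` helper, the partition count, and the event names are all hypothetical, and the CRC32 hash is only a stand-in for whatever partitioner a real producer would use.

```python
from collections import defaultdict
from zlib import crc32

NUM_PARTITIONS = 3  # hypothetical partition count for this sketch


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (a stand-in, not Kafka's partitioner)."""
    return crc32(key.encode("utf-8")) % num_partitions


def publish(topic: dict, key: str, event_type: str) -> None:
    """Append an event to the partition chosen by its key."""
    topic[partition_for(key)].append((key, event_type))


# One simulated topic holding several event types, keyed by customer.
customer_events = defaultdict(list)
for key, event_type in [
    ("customer-42", "CustomerCreated"),
    ("customer-7", "CustomerCreated"),
    ("customer-42", "AddressChanged"),
    ("customer-42", "CustomerDeleted"),
]:
    publish(customer_events, key, event_type)

# Every event for customer-42 sits in one partition, in publish order.
history_42 = [
    event_type
    for key, event_type in customer_events[partition_for("customer-42")]
    if key == "customer-42"
]
print(history_42)  # ['CustomerCreated', 'AddressChanged', 'CustomerDeleted']
```

Because all three event types share a key and a topic, a consumer of that partition sees them in the order they happened. Had each event type gone to its own topic, nothing would guarantee that the deletion is read after the creation.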

Read the whole thing.

“Pretty But Useless” Visuals

Published 2018-01-19 by Kevin Feasel

I continue my dashboard visualization series with a bit of an extended rant:

The best use of a pie chart is to show a simple share of a static total. Here, we can see that Daredevil has almost half of the critics’ reviews, and that The Punisher and Jessica Jones are split.

This simple pie chart also shows some of the problems of pie charts. The biggest issue is that people have trouble with angle, making it hard to distinguish relative slices. For example, is Jessica Jones’s slice larger or is The Punisher’s? It’s really hard to tell in this case, and if that difference is significant, you’re making life harder for your viewers.

Second, as slice percentages get smaller, it becomes harder to differentiate slices. In this case, we can see all three pretty clearly, but if we start getting 1% or 2% slices, they end up as slivers on the pie, making it hard to distinguish one slice from another.

Third, pie charts usually require one color per slice. This can lead to an explosion of color usage. Aside from potential risks of using colors which in concert are not CVD-friendly, adding all of these colors has yet another unintended consequence. If you use the same color in two different pie charts to mean different things, you can confuse people, as they will associate color with some category, and so if they see the same color twice, they will implicitly assign both things the same category. That leads to confusion. Yes, careful reading of your legend disabuses people of that notion, but by the time they see the legend, they’ve already implicitly mapped out what this color represents.

Fourth, pie charts often require legends, which increases eye scanning.

Click through to read me complain about other types of visuals, too.

Anomaly Detection With Python

Published 2018-01-19 by Kevin Feasel

Robert Sheldon continues his SQL Server Machine Learning Series:

As important as these concepts are to working with Python and MLS, the purpose in covering them was meant only to provide you with a foundation for doing what’s really important in MLS, that is, using Python (or the R language) to analyze data and present the results in a meaningful way. In this article, we start digging into the analytics side of Python by stepping through a script that identifies anomalies in a data set, which can occur as a result of fraud, demographic irregularities, network or system intrusion, or any number of other reasons.

The article uses a single example to demonstrate how to generate training and test data, create a support vector machine (SVM) data model based on the training data, score the test data using the SVM model, and create a scatter plot that shows the scoring results.

Click through to see the scenario that Robert has laid out as an example.

AWS Glue Now Supports Scala

Published 2018-01-19 by Kevin Feasel

Mehul Shah, et al., announce that AWS Glue officially supports Scala:

We are excited to announce AWS Glue support for running ETL (extract, transform, and load) scripts in Scala. Scala lovers can rejoice because they now have one more powerful tool in their arsenal. Scala is the native language for Apache Spark, the underlying engine that AWS Glue offers for performing data transformations.

Beyond its elegant language features, writing Scala scripts for AWS Glue has two main advantages over writing scripts in Python. First, Scala is faster for custom transformations that do a lot of heavy lifting because there is no need to shovel data between Python and Apache Spark’s Scala runtime (that is, the Java virtual machine, or JVM). You can build your own transformations or invoke functions in third-party libraries. Second, it’s simpler to call functions in external Java class libraries from Scala because Scala is designed to be Java-compatible. It compiles to the same bytecode, and its data structures don’t need to be converted.

To illustrate these benefits, we walk through an example that analyzes a recent sample of the GitHub public timeline available from the GitHub archive. This site is an archive of public requests to the GitHub service, recording more than 35 event types ranging from commits and forks to issues and comments.

Functional languages tend to be very good for ETL tasks, and Scala is a great choice due to its relationship with Spark.

Dealing With String Parsing In T-SQL

Published 2018-01-19 by Kevin Feasel

Andy Mallon has written a T-SQL function to parse file paths from strings:

Writing & reading code is easier if you understand the logic before attacking the code. I find this to be particularly important when you anticipate complicated code. SQL Server sucks at parsing strings, so I anticipate complicated code.

How do you identify the directory from a file path? That’s just everything up to the last slash–and I like to include that final slash to make it clear it’s a directory.

How do you identify the file name from a file path? That’s just everything after the final slash.

The key here is going to be identifying that final slash, and grabbing the text on either side.
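The logic described above can be sketched in a few lines. This is Python rather than T-SQL, so it is an illustration of the "find the final slash" idea and not Andy's function; the `split_path` name and the sample path are hypothetical.

```python
from typing import Tuple


def split_path(path: str) -> Tuple[str, str]:
    """Return (directory, file_name), splitting on the final slash of either kind."""
    # rfind returns -1 when the character is absent, so a path with no
    # slash at all yields an empty directory and the whole string as the name.
    last_slash = max(path.rfind("\\"), path.rfind("/"))
    directory = path[: last_slash + 1]  # keep the final slash on the directory
    file_name = path[last_slash + 1 :]
    return directory, file_name


directory, file_name = split_path(r"C:\Temp\Reports\sales.csv")
print(directory)  # C:\Temp\Reports\
print(file_name)  # sales.csv
```

Keeping the trailing slash on the directory follows the preference stated in the quote. In T-SQL the same idea usually means reversing the string to locate the last slash, which is where the complexity comes in.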

Read on for the function.

Automatic Partition Splitting

Published 2018-01-19 by Kevin Feasel

Marlon Ribunal has a script to split partitioned tables automatically:

So, let’s pretend it’s the month of April 2017 and this is the partition currently populated. Based on the query above, aside from the current partition bucket, we also have another available bucket month for May.

Say we want to maintain 3 available buckets at any given time. The next available bucket is May, so that means we need 2 more partitions to cover for June and July.
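The bucket arithmetic in that example can be checked with a short sketch. It is written in Python and only computes which monthly boundaries are missing; it does not touch SQL Server, and the `add_months` and `months_to_add` helpers are hypothetical names, not part of Marlon's script.

```python
from datetime import date
from typing import List


def add_months(month_start: date, months: int) -> date:
    """Return the first day of the month that is `months` after month_start."""
    index = month_start.year * 12 + (month_start.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def months_to_add(current: date, last_available: date, buckets_to_keep: int = 3) -> List[date]:
    """List the month starts that still need a partition.

    `current` is the month being populated, `last_available` is the latest
    month that already has an empty bucket.
    """
    target = add_months(current, buckets_to_keep)  # furthest month we want covered
    missing = []
    next_month = add_months(last_available, 1)
    while next_month <= target:
        missing.append(next_month)
        next_month = add_months(next_month, 1)
    return missing


# April 2017 is current and May already exists, so June and July are missing.
new_buckets = months_to_add(date(2017, 4, 1), date(2017, 5, 1))
print([d.isoformat() for d in new_buckets])  # ['2017-06-01', '2017-07-01']
```

Each date in the result would correspond to one new boundary value for the partition function, assuming monthly partitions that start on the first of the month.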

Read on for more, including some scripts that you can automate.

SQL Operations Studio January Release

Published 2018-01-19 by Kevin Feasel

Alan Yu announces a new release of SQL Operations Studio:

The January release includes several major repo updates and feature releases, including:

Enable the HotExit feature to automatically reopen unsaved files.
Add the ability to access saved connections from Connection Dialog.
Set the SQL editor tab color to match the Server Group color.
Fix the broken Run Current Query command.
Fix the broken pinned Windows Start Menu icon.

Click through for the download link.

A Story Of Database Corruption

Published 2018-01-19 by Kevin Feasel

Jason Brimhall tells a tale of a corrupt database:

Calmly, you settle in and check the server and eventually find your way to the error logs to see the following:

Msg 823, Level 24, State 2, Line 1

The operating system returned error 1(Incorrect function.) to SQL Server during a read at offset 0x0000104c05e000 in file ‘E:\Database\myproddb.mdf’. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Suddenly you understand and feel the collective fear and paranoia. What do you do now that the world has seemingly come to an end for your database?

Definitely worth a read.
