Press "Enter" to skip to content

Author: Kevin Feasel

Aggregating And Joining Using Kafka Streams

Michael Noll digs into Kafka Streams, showing how to enrich data and collect aggregates:

The stream of user click events is considered to be a record stream, where each data record represents a self-contained datum.  In contrast, the stream of user geo-location updates is interpreted as a changelog stream, where each data record represents an update (i.e. any previous data records having the same record key will be replaced by the latest update).  In Kafka Streams, a record stream is represented via the so-called KStream interface and a changelog stream via the KTable interface.  Going from the high-level view to the technical view, this means that our streaming application will demonstrate how to perform a join operation between a KStream and a KTable, i.e. it is an example of a stateful computation.  This KStream-KTable join also happens to be Kafka Streams’ equivalent of performing a table lookup in a streaming context, where the table is updated continuously and concurrently.  Specifically, for each user click event in the KStream, we will lookup the user’s region (e.g. “europe”) in the KTable in order to subsequently compute the total number of user clicks per region.

Let’s showcase the beginning (input) and the end (expected output) of this data pipeline with some example data.

This article is fairly detailed, but it covers a rather interesting topic in a good way.

Comments closed

Using Sqoop To Transfer Data

Sai Sriparasa shows how to use Sqoop to transfer data from a Hadoop cluster into a relational database:

Here are a few best practices for exporting with Sqoop:

  • Options file—As commands with Sqoop export and Sqoop import tend to be bigger in size, I recommend storing the commands in an options file. By keeping it in an options file, you can even make it part of a version control pipeline to monitor changes to the command.

  • Field termination—With Sqoop export, I recommend providing field termination metadata using the “–fields-terminated-by” option. Also, other formatting options such as “lines-terminated-by”, “enclosed-by”, “escaped-by”, etc., can be used as required.

  • Mapper tuning—When an export job is submitted, Sqoop creates a Java class and submits a MapReduce job based on input splits; then, each mapper connects to the database to export the data. The default number of mappers is 4, so I recommend tuning the number of mappers depending on the availability of processors on the cluster. Too many mappers might cause the load to increase on the database. We recommend that you monitor the number of connections and keep track of processlist on MySQL.

  • Staging table—The Sqoop export job is broken down into multiple transactions based on the mappers. Each transaction is therefore atomic and does not have any dependencies on other transactions. I recommend using the “–staging-table” option that acts as the buffer table for the separate transactions. After all transactions have been committed, a single transaction move is made to move the data to the final destination. Use the “–clear-staging-table” option to clean up the staging table after the export job.

There’s a lot in here which is Amazon-specific and there are a couple of things you’d have to change to deploy to SQL Server, but there’s a lot of useful information here.  I like that Sai shows how to use the Hadoop credential API instead of doing something silly like saving your password in plaintext.

Comments closed

Getting Started With Spark

Denny Lee announces a new Spark intro guide:

We are proud to introduce the Getting Started with Apache Spark on Databricks Guide. This step-by-step guide illustrates how to leverage the Databricks’ platform to work with Apache Spark. Our just-in-time data platform simplifies common challenges when working with Spark: data integration, real-time experimentation, and robust deployment of production applications.

Databricks provides a simple, just-in-time data platform designed for data analysts, data scientists, and engineers. Using Databricks, this step-by-step guide helps you solve real-world Data Sciences and Data Engineering scenarios with Apache Spark. It will help you familiarize yourself with the Spark UI, learn how to create Spark jobs, load data and work with Datasets, get familiar with Spark’s DataFrames and Datasets API, run machine learning algorithms, and understand the basic concepts behind Spark Streaming.

If you are at all interested in distributed databases, Spark is a must-learn.

Comments closed

Buffer Pools And Availability Groups

Joey D'Antoni answers the question of what happens to pages in the buffer pool when you have an AG failover:

Recently at SQL Saturday Philadelphia, we started discussing failover as it relates to mirroring and Always On Availability Groups. Specifically, we were wondering what would happen if you had a relatively busy readable secondary replica (which would have a lot of pages in the buffer pool on the secondary instance) and if those pages would be flushed from cache or anything like that. So I reached out to the product group and Kevin Farlee from Microsoft was extremely helpful.

The answer is not what I was expecting.
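If you want to see the effect for yourself, one generic way (this is my own sketch, not from Joey's post) is to compare per-database buffer pool usage on the readable secondary before and after a failover:

    -- Roughly how much of the buffer pool each database is using (8 KB pages).
    SELECT
        DB_NAME(database_id) AS database_name,
        COUNT(*) * 8 / 1024 AS buffer_pool_mb
    FROM sys.dm_os_buffer_descriptors
    GROUP BY database_id
    ORDER BY buffer_pool_mb DESC;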

Comments closed

Kill Remote Processes

Michael Bourgon has a Powershell script to kill remote processes:

Simple problem: we had to replace the config files for an app that hits the database, pointing it at a CNAME.

However, the file could be in use because the app was running.  And the app only loads the file on startup.  So we came up with this, which worked quite well.  Note that we use WMI (see my other posts on it!) to grab the application info then a WMI method to terminate the app.

Click through for the script.

Comments closed

Filtered Indexes For Uniqueness

Shane O’Neill answers one of my favorite interview questions:

I used to think that this would be a complex requirement, possibly requiring a TRIGGER or two to check the inserted value against whatever is already there; but there is a way to have this functionality and have it the way that SQL Server normally would enforce a uniqueness on a column; by using a UNIQUE INDEX.

In case you’re thinking…

“Oh, a unique index doesn’t check what’s already there, is that it?”

I’m afraid that’s not the case.

This is one of my favorite uses of filtered indexes:  “limited” uniqueness.  In other words, I’m okay with an unlimited number of NULL values but all non-NULL values need to be unique.
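As a quick sketch (the table and column names here are hypothetical, not from Shane's post), the pattern is a unique index with a filter that excludes NULLs:

    CREATE TABLE dbo.Employee
    (
        EmployeeID INT IDENTITY(1,1) PRIMARY KEY,
        NationalIDNumber VARCHAR(15) NULL
    );

    -- Uniqueness is enforced only on rows the filter covers, so any number
    -- of NULLs can coexist while duplicate non-NULL values are rejected.
    CREATE UNIQUE NONCLUSTERED INDEX UQ_Employee_NationalIDNumber
        ON dbo.Employee (NationalIDNumber)
        WHERE NationalIDNumber IS NOT NULL;

With that in place, inserting any number of rows with a NULL NationalIDNumber succeeds, while a second row with the same non-NULL value fails with a duplicate key error.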

Comments closed

New Powershell Cmdlets

Rob Sewell looks into some new Powershell cmdlets for SQL Server management:

Chrissy LeMaire has written about the new SQL Agent cmdlets

Aaron Nelson has written about the new Get-SqlErrorLog cmdlet

Laerte Junior has written about Invoke-SQLCmd

All four of us will be presenting a webinar on the new cmdlets via the PowerShell Virtual Chapter on Wed, Jul 06 2016, 12:00 Eastern Daylight Time. If you can't make it, a recording will be made available on YouTube on the VC channel: https://sqlps.io/video

There are 17 new Always Encrypted cmdlets and 25 new cmdlets in total.

Comments closed

Hive With LLAP

Carter Shanklin looks at Hive 2’s performance improvements:

LLAP KEY BENEFITS

  • LLAP enables as fast as sub-second query in Hive by keeping all data and servers running and in-memory all the time, while retaining the ability to scale elastically within a YARN cluster.

  • LLAP, along with Apache Ranger enables fine-grained security for the Hadoop ecosystem, including data masking and filtering, by providing interfaces for external clients like Spark to read.

  • LLAP is great for cloud because it caches data in memory and keeps it compressed, overcoming long cloud storage access times and stretching the amount of data you can fit in RAM.

This sounds very much like a response to Spark.

Comments closed

Biml Text Nuggets

Meagan Longoria has started a series on Biml code nuggets, beginning with text nuggets:

Text nuggets evaluate the expression they contain and then replace the text nugget with the string representation of the value of the expression. I use them often to switch out names of packages, tasks, and components as well as source and destination tables in SSIS development when creating packages based upon a design pattern.

Text nuggets start with <#= and end with #>. Notice there is an equals sign at the beginning of the text nugget but not at the end.

Text nuggets are very useful. You can include complex business logic in the expressions. And the expression result can be any data type. The BimlScript compiler will automatically convert it to text before replacing the code nugget with the result. Like all code nuggets, text nuggets can be a single line or multiple lines.

Read the whole thing.

Comments closed

Dealing With Foreign Keys

Kenneth Fisher discusses foreign key constraints:

I have to be certain to delete from OrderDetail first, then Order (to maintain the RI) and then load Order first then OrderDetail. No big deal in this simple example, but what if I’m dealing with a dozen tables? Or I’m only re-loading the parent (Order)?

The easiest thing to do is to disable the foreign key, load your data, and then re-enable the foreign key. You might be tempted to skip that last step but don't. RI is very important and in fact a trusted foreign key can be used by the optimizer to improve your query plan. It's easy to say that the application doesn't have bugs that cause problems with referential integrity. It's not like you're ever going to insert an OrderDetail without an Order, right? Unfortunately it's far too easy for mistakes to happen. Maybe not in the application, it might be a mistake in an update meant to fix something else. Our job is to protect the data, and RI is an important part of that. So by all means disable a foreign key to help with a load but make sure you turn it back on when you are done.

If a foreign key constraint isn't trusted, the optimizer can't assume referential integrity holds, so it's possible that the optimizer will make sub-par choices when joining tables with a foreign key constraint.
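As a rough sketch of that load pattern (constraint and table names here are hypothetical, following the Order/OrderDetail example), the re-enable step uses WITH CHECK so existing rows are validated and the constraint comes back trusted:

    -- Disable the foreign key so parent and child tables can load in any order.
    ALTER TABLE dbo.OrderDetail NOCHECK CONSTRAINT FK_OrderDetail_Order;

    -- ... load dbo.[Order] and dbo.OrderDetail here ...

    -- Re-enable WITH CHECK: existing rows are validated and the constraint is
    -- marked trusted again (sys.foreign_keys.is_not_trusted returns to 0).
    ALTER TABLE dbo.OrderDetail WITH CHECK CHECK CONSTRAINT FK_OrderDetail_Order;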

Comments closed