Press "Enter" to skip to content

Day: June 23, 2017

Kafka Offset Management With Spark Streaming

Guru Medasani and Jordan Hambleton explain how to perform Kafka offset management when using Spark Streaming:

Enabling Spark Streaming’s checkpoint is the simplest method for storing offsets, as it is readily available within Spark’s framework. Streaming checkpoints are purposely designed to save the state of the application, in our case to HDFS, so that it can be recovered upon failure.

Checkpointing the Kafka Stream will cause the offset ranges to be stored in the checkpoint. If there is a failure, the Spark Streaming application can begin reading the messages from the checkpoint offset ranges. However, Spark Streaming checkpoints are not recoverable across applications or Spark upgrades and hence not very reliable, especially if you are using this mechanism for a critical production application. We do not recommend managing offsets via Spark checkpoints.

The authors give several options, so check it out and pick the one that works best for you.
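To make the checkpoint approach from the excerpt concrete, here is a minimal PySpark sketch (not code from the linked post); the checkpoint directory, broker address, and topic name are placeholders, and the API shown is the Spark 2.x direct-stream one:

```python
# Minimal sketch of the checkpoint-based approach described above (not the
# authors' code). Paths, broker, and topic are placeholders; API is the
# Spark 2.x KafkaUtils direct stream.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

CHECKPOINT_DIR = "hdfs:///tmp/streaming-checkpoint"  # placeholder HDFS path

def create_context():
    sc = SparkContext(appName="offset-checkpoint-demo")
    ssc = StreamingContext(sc, 10)  # 10-second batches
    ssc.checkpoint(CHECKPOINT_DIR)  # offset ranges are saved with the checkpoint

    stream = KafkaUtils.createDirectStream(
        ssc,
        topics=["events"],                                    # placeholder topic
        kafkaParams={"metadata.broker.list": "broker:9092"})  # placeholder broker
    stream.map(lambda kv: kv[1]).count().pprint()
    return ssc

# On restart, getOrCreate rebuilds the context (and its stored offset ranges)
# from the checkpoint instead of calling create_context() again.
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()
```

The recovery behavior lives in StreamingContext.getOrCreate: the setup function only runs when no checkpoint exists, which is exactly the property (and the fragility across upgrades) the excerpt describes.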


Updates In Apache Kafka

Yeva Byzek announces that Apache Kafka 0.11.0.0 is shipping soon:

We are very excited for the GA for Kafka release 0.11.0.0 which is just days away. This release is bringing many new features as described in the previous Log Compaction blog post.

The most notable new feature is Exactly Once Semantics (EOS).  Kafka’s EOS capabilities provide more stringent idempotent producer semantics with exactly once, in-order delivery per partition, and stronger transactional guarantees with atomic writes across multiple partitions. Together, these strong semantics make writing applications easier and expand Kafka’s addressable use cases. You can learn more about EOS in the online talk on June 29, 2017.

“Exactly once,” if done right, would be crazy—there’s a reason most brokers are either “at least once” or “best effort.”
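As a rough illustration of what the idempotent, transactional producer looks like from client code (my sketch, not Confluent's; the Python client gained this API in later releases), with broker address, topics, and transactional.id as placeholders:

```python
# Hedged sketch of Kafka's transactional producer via the confluent-kafka
# Python client (transactions arrived in later client versions). Broker,
# topics, and transactional.id below are placeholders.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker:9092",   # placeholder
    "enable.idempotence": True,           # exactly-once, in-order per partition
    "transactional.id": "orders-writer",  # stable id across producer restarts
})

producer.init_transactions()
producer.begin_transaction()
try:
    # Writes across multiple topics/partitions commit or abort atomically.
    producer.produce("orders", key="o-1", value="created")
    producer.produce("audit", key="o-1", value="order o-1 created")
    producer.commit_transaction()
except Exception:
    producer.abort_transaction()
    raise
```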


Spark And H2O

Avkash Chauhan shows how to use sparklyr and rsparkling to tie Spark together with the H2O library in R:

In order to work with Spark H2O using rsparkling and sparklyr in R, you must first ensure that you have both sparklyr and rsparkling installed.

Once you’ve done that, you can check out the working script, the code for testing the Spark context, and the code for launching H2O Flow. All of this information can be found below.

It’s a short post, but it does show how to kick off a job.
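The post works in R with sparklyr and rsparkling; for comparison, a rough Python analogue using H2O's pysparkling package (my assumption about the equivalent flow, not something the post covers) looks roughly like this, with the exact H2OContext call varying by Sparkling Water version:

```python
# Rough Python analogue of the R workflow (sparklyr + rsparkling), using
# H2O Sparkling Water's pysparkling package. Not from the post; the exact
# H2OContext.getOrCreate signature varies by Sparkling Water version.
from pyspark.sql import SparkSession
from pysparkling import H2OContext
import h2o

spark = SparkSession.builder.appName("spark-h2o-demo").getOrCreate()

# Start (or attach to) an H2O cluster running on the Spark executors.
hc = H2OContext.getOrCreate(spark)

# Push a Spark DataFrame into H2O to confirm the two are talking.
df = spark.range(1000).toDF("x")
frame = hc.asH2OFrame(df)
print(frame.dim)   # [1000, 1]
```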


Securing S3 Credentials In Spark Jobs

Jason Pohl shows how to protect credentials for connecting to Amazon Web Services S3 buckets when building Spark jobs:

Since Apache Spark separates compute from storage, every Spark Job requires a set of credentials to connect to disparate data sources. Storing those credentials in the clear can be a security risk if not stringently administered. To mitigate that risk, Databricks makes it easy and secure to connect to S3 with either Access Keys via DBFS or by using IAM Roles. For all other data sources (Kafka, Cassandra, RDBMS, etc.), the sensitive credentials must be managed by some other means.

This blog post will describe how to leverage an IAM Role to map to any set of credentials. It will leverage AWS’s Key Management Service (KMS) to encrypt and decrypt the credentials so that your credentials are never in the clear at rest or in flight. When a Databricks Cluster is created using the IAM Role, it will have privileges to both read the encrypted credentials from an S3 bucket and decrypt the ciphertext with a KMS key.

That’s only one data source, but an important one.
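To give a feel for the encrypt/decrypt half of that pattern, here is a small boto3 sketch (mine, not the post's Databricks notebook); the KMS key alias, bucket, and object key are placeholders, and the calling role is assumed to have kms:Encrypt/kms:Decrypt plus read access to the bucket:

```python
# Sketch of the pattern described above: encrypt a credential with KMS, store
# only the ciphertext in S3, decrypt it in memory at job startup. Key alias,
# bucket, and object key are placeholders.
import base64
import boto3

kms = boto3.client("kms")
s3 = boto3.client("s3")

# One-time setup: encrypt the secret and park the ciphertext in S3.
ciphertext = kms.encrypt(
    KeyId="alias/credentials-key",            # placeholder KMS key alias
    Plaintext=b"cassandra_password=s3cr3t",
)["CiphertextBlob"]
s3.put_object(Bucket="my-secrets-bucket",     # placeholder bucket
              Key="creds/cassandra.enc",
              Body=base64.b64encode(ciphertext))

# At job startup: read the ciphertext back and decrypt it in memory only.
blob = s3.get_object(Bucket="my-secrets-bucket",
                     Key="creds/cassandra.enc")["Body"].read()
plaintext = kms.decrypt(CiphertextBlob=base64.b64decode(blob))["Plaintext"]
```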


Power BI Supports Interactive R Visuals

David Smith reports on a great update to Power BI:

The above chart was created with the plotly package, but you can also use htmlwidgets or any other R package that creates interactive graphics. The only restriction is that the output must be HTML, which can then be embedded into the Power BI dashboard or report. You can also publish reports including these interactive charts to the online Power BI service to share with others. (In this case though, you’re restricted to those R packages supported in Power BI online.)

Power BI now provides four custom interactive R charts, available as add-ins:

I’d avoided doing too much with R visuals in Power BI because the output was so discordant—Power BI dashboards are often lively things, but the R visual would just sit there, limp and lifeless.  I’m glad to see that this has changed.


Performance Comparison: Comparing Column Differences

Shane O’Neill has a column difference showdown:

The original post for this topic garnered the attention of a commenter who pointed out that the same result could be gathered using a couple of UNION ALLs and those lovely set-based EXCEPT and INTERSECT keywords.

I personally think that both options work and whatever you feel comfortable with, use that.

It did play on my mind, though, what the performance differences would be…what would the difference in STATISTICS IO and TIME be? What would the difference in Execution Plans be? Would there even be any difference between the two, or are they the same thing? How come it’s always the things I tell myself not to forget that I end up forgetting?

This may not be the most important thing to test, but it does show you a technique.
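The comparison itself is T-SQL, but if the set-based framing is unfamiliar, here is a tiny Python illustration (mine, not Shane's code) of what EXCEPT/INTERSECT-style logic does when hunting for columns whose values differ between two versions of a row:

```python
# Illustration of the set-based idea only (the post's actual comparison is
# T-SQL with UNION ALL + EXCEPT/INTERSECT). Sample rows are made up.
row_before = {"id": 1, "name": "Shane", "city": "Dublin", "title": "DBA"}
row_after  = {"id": 1, "name": "Shane", "city": "Galway", "title": "DBA"}

pairs_before = set(row_before.items())
pairs_after = set(row_after.items())

unchanged = pairs_before & pairs_after                       # INTERSECT-like
differing = {col for col, _ in pairs_before ^ pairs_after}   # EXCEPT in both directions

print(sorted(differing))   # ['city']
```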


Using dbatools To Back Up SQL Logins

Claudio Silva has a post showing how to use the Export-SqlLogin cmdlet to back up SQL Server logins, along with their permissions across all databases, for a set of instances:

With a database restore, the users are within the database, and if their SID matches the SQL Login you are ready to go. But with the logins it is a different story!
If you have to reinstall the engine because your master database backup is corrupt, or someone has changed the login password and you want to put it back, or even – maybe the most common scenario – you want to keep track of the login permissions, you need to have them saved somewhere.

Imagine that you have to re-create a login and grant all the permissions that it had, and imagine that this login has multiple roles on different databases. Did you know that beforehand? Do you keep track of it? How much time would it take to gather all that information from the application owner? How much time will you have to spend resolving all the permission issues until everything is running smoothly? How many tickets will need to be raised? How many users will this be affecting?

Read on for Claudio’s easy solution.


SSMS Keyboard Tricks

Andrew Kelly has a few tricks up his sleeve when it comes to Management Studio shortcuts:

I was giving an internal talk on SSMS productivity tips, and there were a few that I believe are seldom used but can be real time or keystroke savers that I would like to mention. To the best of my knowledge, these will work in any version of SSMS from at least 14.0 onward, and likely earlier, but I can’t verify older versions at this time. Since these are basically Visual Studio shortcuts, they also work in Visual Studio and SSDT to the best of my knowledge. The first is the ability to select text in a vertical fashion, as shown in the examples below. Before we get to how to do that, I know your first question is why you would want to do that. Well, I will leave all the possibilities up to you, since everyone has slightly different techniques and circumstances. However, I am confident that after seeing a few examples it will spark interest in many of you, and you will immediately think of times when this will help you code more efficiently.

Read on for five tips.
