Batch Consumption from Kafka with Spark

Swapnil Chougule shares a few tips on performing batch processing of a Kafka topic using Apache Spark:

Spark as a compute engine is very widely accepted by most industries. Most of the old data platforms based on MapReduce jobs have been migrated to Spark-based jobs, and some are in the phase of migration. In short, batch computation is being done using Spark. As a result, organizations’ infrastructure and expertise have been developed around Spark.

So now the question is: can Spark solve the problem of batch consumption of data inherited from Kafka? The answer is yes.

The advantages of doing this are: having a unified batch computation platform, reusing existing infrastructure, expertise, monitoring, and alerting.

Click through to get to the starting point on this as well as a few tips to avoid stumbling blocks.

Building a VPC with AWS

Priyaj Kumar takes us through the process of building a Virtual Private Cloud in AWS:

AWS provides a lot of services, and these services are sufficient to run your architecture. The backbone for the security of this architecture is VPC (Virtual Private Cloud). VPC is basically a private cloud in the AWS environment that helps you to use all the services offered by AWS in your defined private space. You have control over the virtual network and you can also restrict the incoming traffic using security groups.

Overall, VPC helps you to secure your environment and gives you complete authority over incoming traffic. There are two types of VPCs: the Default VPC, which Amazon creates by default, and the Non-Default VPC, which you create to meet your security needs.

Now that you have an idea of how VPC works, I will take you through the different services offered by Amazon VPC.

Read on to see how to set one up.

Counting Working Days with DAX

Alberto Ferrari shows how we can ignore weekends in date calculations with DAX:

How is it possible to compute the difference between the two dates, only computing working days and skipping weekends and holidays? Simple math is no longer useful here, and DAX does not offer a predefined function.

A solution to this scenario requires a date table – more details here – with a specific column, IsWorkingDay, which indicates whether that particular day is a working day or not. The following figure shows an example:

Another good use of date tables (AKA calendar tables), which are also quite useful in T-SQL queries.
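
To make the idea concrete outside of DAX, here is a minimal Python sketch of the same approach: build a date table with an is-working-day flag, then count the flagged rows between two dates. This is not Alberto's code. The function names, the inclusive date range, and the holiday date are all assumptions made up for illustration.

```python
from datetime import date, timedelta


def build_date_table(start, end, holidays=frozenset()):
    """Return one (day, is_working_day) row per calendar day in [start, end]."""
    rows = []
    day = start
    while day <= end:
        # Monday is 0 and Sunday is 6, so values below 5 are weekdays.
        is_working_day = day.weekday() < 5 and day not in holidays
        rows.append((day, is_working_day))
        day += timedelta(days=1)
    return rows


def working_days_between(date_table, start, end):
    """Count rows flagged as working days, including both endpoints."""
    return sum(
        1
        for day, is_working_day in date_table
        if start <= day <= end and is_working_day
    )


# Hypothetical holiday list, purely for illustration.
HOLIDAYS = frozenset({date(2019, 3, 15)})
DATE_TABLE = build_date_table(date(2019, 3, 1), date(2019, 3, 31), HOLIDAYS)

if __name__ == "__main__":
    print(working_days_between(DATE_TABLE, date(2019, 3, 1), date(2019, 3, 31)))
```

The point of keeping the flag in the table is that holidays become data rather than logic: changing the holiday list changes the answer without touching the counting function.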

PolyBase and Hive Shim Errors

I ran into a problem with Hive 3 and PolyBase:

My initial plan was to google things. The specific error: java.lang.IllegalArgumentException: Unrecognized Hadoop major version number. That pops up HIVE-15326 and HIVE-15016 but gave me no immediate joy.

After reaching out to James Rowland-Jones, we (by which I mean he) eventually figured out the issue.

Click through for the solution.

Formatting Lists of Values with DAX

Alberto Ferrari and Patrick LeBlanc have a great video on formatting lists of filter values in DAX like 2003, 2005-2007, 2009:

Alberto Ferrari joins Patrick to walk through how you can use DAX to format a list of values within Power BI Desktop. This takes the concatenate values quick measure to the next level.

Transmuting Adam into Alberto shows Patrick’s ultimate power.
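
For a sense of the logic involved, here is a rough Python sketch of collapsing consecutive values into ranges, producing output like 2003, 2005-2007, 2009. This is not the DAX from the video. The function name and the choice to sort and de-duplicate the input first are my own assumptions.

```python
def format_ranges(values):
    """Collapse runs of consecutive integers into 'start-end' ranges."""
    ordered = sorted(set(values))
    if not ordered:
        return ""

    def render(start, end):
        # A run of length one is shown as a single value.
        return str(start) if start == end else f"{start}-{end}"

    parts = []
    start = prev = ordered[0]
    for value in ordered[1:]:
        if value == prev + 1:
            # Still inside the current run of consecutive values.
            prev = value
            continue
        # The run ended, so emit it and start a new one.
        parts.append(render(start, prev))
        start = prev = value
    parts.append(render(start, prev))
    return ", ".join(parts)


if __name__ == "__main__":
    print(format_ranges([2003, 2005, 2006, 2007, 2009]))
```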

Triggering KB 4462481

Joe Obbish shows how you can recreate the error described in KB 4462481:

Consider a query execution that meets all of the following criteria:

1. A parallel INSERT INTO… SELECT into a columnstore table is performed

2. The SELECT part of the query contains a batch mode hash join

3. The query can’t immediately get a memory grant, hits the 25 second memory grant timeout and executes with required memory

The query may appear to get stuck. 

Click through for Joe’s demo. The fix? Update to SQL Server 2017 CU11.

Workload Capture with WorkloadTools

Gianluca Sartori continues a series on WorkloadTools:

Last week I showed you how to use WorkloadTools to analyze a workload. As you have seen, using SqlWorkload to extract performance data from your workload is extremely easy and it just takes a few keystrokes in your favorite text editor to craft the perfect .json configuration file.

Today I’m going to show you how to capture a workload and save it to a file. If you’ve ever tried to perform this task with any other traditional benchmarking tool, like RML Utilities or Distributed Replay, your palms are probably sweaty already, but fear not: no complicated traces to set up, no hypertrophic scripts to create extended events captures. WorkloadTools makes it as easy as it can get.

Saving a workload to a file might look superfluous when you think that WorkloadTools has the ability to perform replays in real-time (I’ll discuss this feature in a future post), but there are situations when you want to replay the same exact workload multiple times, maybe changing something in the target database between each benchmark to see precisely what performance looks like under different conditions.

Gianluca’s technique does seem a lot less fussy than the Microsoft techniques.

SQL Server and Ubuntu 18.04

Randolph West confirms that SQL Server on Linux will run on Ubuntu 18.04 even though it is not (yet) supported:

Although these screenshots show SQL Server 2019 preview CTP 2.3, this also applies to SQL Server 2017 on 18.04.2, because that’s what I had installed before upgrading the SQL Server version. However, as my friend Jay Falck pointed out on Twitter, Microsoft has stated publicly that it is not yet certified for production use:

Important: this does not change the support state of SQL Server 2017 on Ubuntu 18.04. Work to certify Ubuntu 18.04 with SQL Server 2017 is planned and we will announce when it will be supported for production use on this page. Until such an announcement occurs, SQL Server 2017 on Ubuntu 18.04 should be considered experimental and for non-production use only.

Read on for Randolph’s thoughts on the issue.

