
Month: October 2018

Whither Running Kafka On Kubernetes

Gwen Shapira walks through some of the costs and benefits of using Kubernetes to host your Apache Kafka brokers:

First, if you are running most of your other applications and microservices on Kubernetes, it becomes the organizational path of least resistance. This is just like how organizations who standardized on VMs have found it very difficult to allocate physical machines with local disks for Kafka.

I see situations with larger organizations where deploying Kafka outside of Kubernetes causes significant organizational headache that involves many approvals. When this is the case, I usually say that this isn’t a good hill to die on. It is possible to run Kafka on Kubernetes, so just do it. You’ll get your environment allocated faster and will be able to use your time to do productive work rather than fight an organizational battle.
And if things go wrong, you’ll get much better service from your internal infrastructure teams, because you’ll be running in an environment that is familiar to them.

Read on for more benefits as well as a few drawbacks.


Medium-Term Effects Of The Cloudera-Hortonworks Merger

Alex Woodie describes some of the ramifications of Cloudera’s merger with Hortonworks:

Whatever camp you sit in, the merger undoubtedly caught the attention of the 2,500 organizations that have adopted Cloudera’s Distribution of Hadoop (CDH) or the Hortonworks Data Platform (HDP) over the years — not to mention the thousands of other companies that have adopted open source Apache Hadoop platforms or Hadoop ecosystem components in the cloud. These Global 2000 companies have invested billions of dollars into building giant clusters to store and process many exabytes worth of data, and they’re not going to just turn them off overnight because the two biggest players suddenly decided to merge.

At the same time, these customers need to be reassured that Cloudera has a plan to maintain the investments they’ve already made in HDP and CDH platforms, both in a short-term or tactical sense, as well as in terms of Cloudera’s long-range strategy to evolve its platform to meet emerging future compute and storage needs.

Read on for more detail.


Spark Streaming On Azure Databricks

Tristan Robinson shows us how to run Spark Streaming within Azure Databricks:

Real-time stream processing is becoming more prevalent on modern-day data platforms, and with a myriad of processing technologies out there, where do you begin? Stream processing involves consuming messages from queues or files, doing some processing in the middle (querying, filtering, aggregation), and then forwarding the result to a sink – all with minimal latency. This is in direct contrast to batch processing, which usually occurs on an hourly or daily basis. As is often the case, both of these will need to be combined to create a new data set.
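
As a rough illustration of that source-to-transform-to-sink pattern, here is a minimal sketch using sparklyr's Structured Streaming bindings. The local Spark session, the folder paths, and the status column are all assumptions for illustration rather than details from Tristan's post, and the watched folder needs at least one CSV file in it so the schema can be inferred.

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Source: watch a folder for newly arriving CSV files (path is illustrative)
events <- stream_read_csv(sc, path = "input-events/")

# Transform: simple filtering in the middle of the pipeline
errors <- events %>% filter(status == "error")

# Sink: forward the filtered stream to another folder as CSV
query <- stream_write_csv(errors, path = "output-errors/")

stream_stop(query)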

In terms of options for real-time stream processing on Azure you have the following:

  • Azure Stream Analytics

  • Spark Streaming / Storm on HDInsight

  • Spark Streaming on Databricks

  • Azure Functions

Click through for more.


Data Modeling In Cassandra

Charmy Garg walks us through some of the basics of modeling tables in Cassandra:

Two basic goals in Cassandra which we should keep in mind:

  • Spread data evenly around the cluster – You want every node in the cluster to have roughly the same amount of data. Rows are spread around the cluster based on a hash of the partition key, which is the first element of the PRIMARY KEY. So, the key to spreading data evenly is this: pick a good primary key.

  • Minimize the number of partitions read – Partitions are groups of rows that share the same partition key. When you issue a read query, you want to read rows from as few partitions as possible. Why is this important? [Each partition may reside on a different node. The coordinator will generally need to issue separate commands to separate nodes for each partition you request. This adds a lot of overhead and increases the variation in latency. Furthermore, even on a single node, it’s more expensive to read from multiple partitions than from a single one due to the way rows are stored.]
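
As a toy illustration of that first point, the short R sketch below uses a crude character-sum hash (not Cassandra's actual Murmur3 partitioner) to show how hashing a high-cardinality partition key spreads rows roughly evenly across a handful of nodes:

keys <- paste0("user_", 1:10000)                          # hypothetical partition key values
crude_hash <- vapply(keys, function(k) sum(utf8ToInt(k)), numeric(1))
node <- (crude_hash %% 4) + 1                             # pretend the cluster has four nodes
table(node)                                               # row counts per node come out roughly even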

Charmy also has a couple of pitfalls that people used to the relational database model may hit.


Enhancements To Polybase In SQL Server 2019

Rajendra Gupta has a multi-part series on Polybase enhancements with SQL Server 2019.  Part one covers installation of SQL Server 2019 and Azure Data Studio:

You need to install Oracle JRE 7 update 51 or higher to install Polybase. If it is not installed, you will get the below error message while checking the rules for installation.

To fix this error, go to ‘Java SE Runtime Environment 8 Downloads’ and download Java SE Runtime Environment 8u191E. Double-click on the setup file to install it.

Part two shows us how to install Oracle Express Edition and query it via SQL Server:

As discussed so far, below are the requirements to access an Oracle database using PolyBase with Azure Data Studio:

  • SQL Server 2019 preview 4

  • Azure Data Studio with SQL Server 2019 extension

  • Oracle Data Source

  • Polybase services should be running with SQL Server database services
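
Once those pieces are in place and the external table is defined, it behaves like any other SQL Server table, so any client can query it. As a hedged example, here is a sketch of querying such a table from R via DBI and odbc; the connection details and the dbo.OracleCustomers table name are hypothetical.

library(DBI)
library(odbc)

# Connect to the SQL Server 2019 instance hosting the PolyBase external table
con <- dbConnect(odbc::odbc(),
                 Driver   = "ODBC Driver 17 for SQL Server",
                 Server   = "localhost",
                 Database = "PolyBaseDemo",
                 Trusted_Connection = "Yes")

# The external table is queried with ordinary T-SQL
dbGetQuery(con, "SELECT TOP (10) * FROM dbo.OracleCustomers")

dbDisconnect(con)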

Part three is forthcoming, as Rajendra mentions at the end of part 2.


Tying The Power BI Threads Together

Eugene Meidinger has a corkboard with a bunch of pushpins connecting photographs and newspaper articles together with string:

Part of that announcement was them talking about the Common Data Service. When I first heard about CDS months ago, I was again confused. It sounded like some weird semantic layer for the data in Dynamics CRM. Maybe useful if your data lives in Dynamics 365, otherwise who the heck cares.

Oooooh boy was I wrong. Microsoft is aiming for something much, much more ambitious than an awkward pseudo-database layer for people who don’t like SQL. They are aiming for a common shape for all of your business data. They want to create a lingua franca for all of your business data, no matter where it is. Especially if it’s hiding in Salesforce.

Now, do I expect them to succeed? I’m not sure. I’ve learned the hard way that every business is a unique snowflake, even two businesses in the exact same industry. But if anyone can do it, Microsoft has a good shot. They’ve been buying up CRM / ERP solutions for decades.

There’s some good stuff in here, including the realization that Power BI is not strictly intended for database developers.


Azure SQL Managed Instance Prerequisites

Frank Gill has started a series on Azure SQL Managed Instances and has two posts up already.  First, an introduction:

The drawbacks of Azure SQL Database make it difficult to migrate existing applications, because of the number of application changes required.  Azure SQL Database is designed to be used for new development in Azure and for multi-tenant environments, where each tenant requires their own copy of a database.

The benefits of SQL Server on an Azure VM make it much easier to migrate an existing application to Azure.  However, the VMs underlying the application still have to be managed by the client.  This fails to take advantage of the management of resources in Azure, and uses Azure as a VM host.

A third option, Azure SQL Managed Instance, was released at the beginning of October 2018.  Managed Instance combines the best of the previous options.  With Managed Instance, the infrastructure is fully managed and the majority of the SQL Server feature set is available.  The full list of differences between a traditional install of SQL Server and Managed Instance can be found here.  A number of the most dramatic differences are listed below.

Then a post covering prerequisites:

Before creating an Azure SQL Managed Instance, a number of prerequisite resources must be provisioned.  These are:

  • An Azure Virtual Network

  • A dedicated subnet for Managed Instances

  • A route table

It looks like this is part of a longer series Frank is building out, so stay tuned.


Creating A Panel For Slicers In Power BI

Matt Allington shows us how to create a collapsible panel in Power BI:

There is nothing worse than having a Power BI report that has 50% of the space taken up with slicers.  When this happens, you only get half the page to visualise the actual data.  But on the flip side, if you don’t have the slicers it can be harder for the report users to filter the data they want to see.  Many users don’t like using the built in filter pane on the right hand side.  All is not lost – there is a great way that you can have the best of both worlds by creating a collapsible slicer pane that you can show and hide on demand.

Now I didn’t invent this concept – I learnt it from looking at what others have done, such as Amanda Cofsky, Miguel Myers, Mike and Seth from http://powerbi.tips and also Adam and Patrick from GuyInACube.  There are lots of great resources out there to learn tricks like this, so you should check those out.

You can see one simple interpretation of this solution below. The user can show and hide the slicer pane by using the arrows (#1 and #2 below).

Click through for the demo.


Visualizing A Correlation Matrix With corrplot

Kristian Larsen demonstrates the corrplot package in R:

First, we need to read the packages into the R library. For descriptive statistics of the dataset we use the skimr package, and for visualization of the correlation matrix we use the corrplot package. We will work with the windspeed dataset from the bReeze package:

# Read packages into R library
library(bReeze)
library(corrplot)
library(skimr)
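
To get a sense of where the demo goes, a sketch of the rest of the workflow might look like the following. The winddata dataset name and the use of every numeric column are my assumptions, not necessarily what Kristian does in the post.

# Load an example wind dataset bundled with bReeze (dataset name assumed)
data("winddata", package = "bReeze")

# Descriptive statistics with skimr
skim(winddata)

# Correlation matrix of the numeric columns, visualized with corrplot
num_cols <- winddata[vapply(winddata, is.numeric, logical(1))]
M <- cor(num_cols, use = "pairwise.complete.obs")
corrplot(M, method = "circle", type = "upper")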

Click through for the demo.


Getting The Right R Version For Packages

Colin Gillespie shows a couple methods for figuring out the minimum version of R needed for a set of packages:

In R, there is a handy function called available.packages() that returns a matrix of details corresponding to packages currently available at one or more repositories. Unfortunately, the format isn’t initially amenable to manipulation. For example, consider the readr package

library(dplyr) # provides %>%, as_tibble(), and filter() used below
readr_desc = available.packages() %>% as_tibble() %>% filter(Package == "readr")

I immediately converted the data to a tibble, as that

  • changed the rownames to a proper column

  • changed the matrix to a data frame/tibble, which made selecting easier
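
From there, the minimum R version sits in the Depends column of that tibble. A rough sketch of extracting it, assuming the usual "R (>= x.y.z)" form (my assumption rather than Colin's exact script):

# Pull the R version requirement out of the Depends field, e.g. "R (>= 3.1), ..."
r_requirement <- sub(".*R \\(>= ([0-9.]+)\\).*", "\\1", readr_desc$Depends)
r_requirement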

There’s a good use of R functionality to delve into package requirements, as well as a script to try it out yourself.
