Category: Hadoop

2 — You shouldn’t send large messages or payloads through Kafka
According to Apache Kafka, for better throughput, the max message size should be 10KB. If the messages are larger than this, it is better to check the alternatives or find a way to chop the message into smaller parts before writing to Kafka. Best practice to do so is using a message key to make sure all chopped messages will be written to the same partition.

Read the whole thing.

Comments closed

Spark And Splitting DataFrames

Published 2019-01-28 by Kevin Feasel

Giovanni Lanzani explains that one technique to split a data frame doesn’t quite work as expected:

Recently I was delivering a Spark course. One of the exercises asked the students to split a Spark DataFrame in two, non-overlapping, parts.
One of the students came up with a creative way to do so.
He started by adding a monotonically increasing ID column to the DataFrame. Spark has a built-in function for this, monotonically_increasing_id — you can find how to use it in the docs.

Read on to see how this didn’t quite work right, why it didn’t work as expected, and one alternative.

Comments closed

A Functional Approach To PySpark

Published 2019-01-24 by Kevin Feasel

Tristan Robinson shows us how we can implement a transform function which makes Python code look a little bit more functional:

After a small bit of research I discovered the concept of monkey patching (modifying a program to extend its local execution) the DataFrame object to include a transform function. This function is missing from PySpark but does exist as part of the Scala language already.
The following code can be used to achieve this, and can be stored in a generic wrapper functions notebook to separate it out from your main code. This can then be called to import the functions whenever you need them.

Things which make Python more of a functional language are fine by me. Even though I’d rather use Scala.

Comments closed

AMD vs Intel CPUs For Data Processing Jobs

Published 2019-01-23 by Kevin Feasel

Hariharan Iyer and Abhishek Srivastava run some tests against AWS’s new AMD-powered EC2 instances:

Our summary findings from TPCDS benchmarks are as follows:
– TPCDS queries are not as sensitive to local disk performance (and hence to EBS volume sizes)
– r5 (Intel) instances are consistently faster than r5a (AMD) instances. However, the speedup depends on the engine and the speedup for r5 (Intel) is lower for Spark (10%) than for Hive (25%).
– r5 instances are also either cheaper (by about 10% for Hive) or the same cost (for Spark)

At least for Hadoop and Spark work, Intel CPUs are a bit better, but there is some nuance in the story so check it out.

Comments closed

Azure Databricks And Active Directory

Published 2019-01-22 by Kevin Feasel

Tristan Robinson wraps up a two-parter on Azure Databricks security:

With the addition of Databricks runtime 5.1 which was released December 2018, comes the ability to use Azure AD credential pass-through. This is a huge step forward since there is no longer a need to control user permissions through Databricks Groups / Bash and then assigning these groups access to secrets to access Data Lake at runtime. As mentioned previously – with the lack of support for AAD within Databricks currently, ACL activities were done on an individual basis which was not ideal. By using this feature, you can now pass the authentication onto Data Lake, and as we know one of the advantages of Data Lake is the tight integration into Active Directory so this simplifies things. Its worth noting that this feature is currently in public preview but having tested it thoroughly, am happy with the implementation/limitations. The feature also requires a premium workspace and only works with high concurrency clusters – both of which you’d expect to use in this scenario.

It looks like this is the way to go forward with securing Azure Databricks. Read the whole thing.

Comments closed

Azure Databricks Security

Published 2019-01-21 by Kevin Feasel

Tristan Robinson looks at what’s currently available in terms of security on Azure Databricks:

You’ll notice that as part of this I’m retrieving the secrets/GUIDS I need for the connection from somewhere else – namely the Databricks-backed secrets store. This avoids exposing those secrets in plain text in your notebook – again this would not be ideal. The secret access is then based on an ACL (access control list) so I can only connect to Data Lake if I’m granted access into the secrets. While it is also possible to connect Databricks up to the Azure Key Vault and use this for secrets store instead, when I tried to configure this I was denied based on permissions. After research I was unable to overcome the issue. This would be more ideal to use but unfortunately there is limited support currently and the fact the error message contained spelling mistakes suggests to me the functionality is not yet mature.

To be charitable, there appears to be room for implementation improvement.

Comments closed

Testing Kafka Streams Applications

Published 2019-01-17 by Kevin Feasel

Yeva Byzek continues her series on testing Kafka-based streaming applications:

When you create a stream processing application with Kafka’s Streams API, you create a Topologyeither using the StreamsBuilder DSL or the low-level Processor API. Normally, the topology runs with the KafkaStreams class, which connects to a Kafka cluster and begins processing when you call start(). For testing though, connecting to a running Kafka cluster and making sure to clean up state between tests adds a lot of complexity and time.
Instead, developers can unit test their Kafka Streams applications with utilities provided by kafka-streams-test-utils. Introduced in KIP-247, this artifact was specifically created to help developers test their code, and it can be added into your continuous integration and continuous delivery (CI/CD) pipeline.

Streaming applications need tested just like any other.

Comments closed

Auto ML With SQL Server 2019 Big Data Clusters

Published 2019-01-16 by Kevin Feasel

Marco Inchiosa has a model scenario for using Big Data Clusters to scale out a machine learning problem:

H2O provides popular open source software for data science and machine learning on big data, including Apache Spark^TM integration. It provides two open source python AutoML classes: h2o.automl.H2OAutoML and pysparkling.ml.H2OAutoML. Both APIs use the same underlying algorithm implementations, however, the latter follows the conventions of Apache Spark’s MLlib library and allows you to build machine learning pipelines that include MLlib transformers. We will focus on the latter API in this post.
H2OAutoML supports classification and regression. The ML models built and tuned by H2OAutoML include Random Forests, Gradient Boosting Machines, Deep Neural Nets, Generalized Linear Models, and Stacked Ensembles.

The post only has a few lines of code but there are a lot of working parts under the surface.

Comments closed

Erasure Coding In Hadoop

Published 2019-01-15 by Kevin Feasel

Guy Shilo explains erasure coding, a new feature in Hadoop 3:

The benefits are, of course, space-saving, and for large files also improved performance (blocks striped across datanodes can be read in parallel, and less blocks are written because there is no x3 replication). The larger the file the more notable is the performance gain.
Erasure encoding is disabled by default and you can enable it for only certain directories in HDFS. Some articles like this one suggest thatbest practice is to enable Erasure coding only for “cold” data that you do not write often, and for “hot” data use regular replication. However, in my tests I did not witness any problem dealing with hot data (maybe it’s evident in larger scales).

Click through for the full story on how it works.

Comments closed

Converting CSV To ORC

Published 2019-01-15 by Kevin Feasel

Mark Litwintschik investigates whether Spark is faster at converting CSV files to ORC format than Hive or Presto:

Spark, Hive and Presto are all very different code bases. Spark is made up of 500K lines of Scala, 110K lines of Java and 40K lines of Python. Presto is made up of 600K lines of Java. Hive is made up of over one million lines of Java and 100K lines of C++ code. Any libraries they share are out-weighted by the unique approaches they’ve taken in the architecture surrounding their SQL parsers, query planners, optimizers, code generators and execution engines when it comes to tabular form conversion.
I recently benchmarked Spark 2.4.0 and Presto 0.214 and found that Spark out-performed Presto when it comes to ORC-based queries. In this post I’m going to examine the ORC writing performance of these two engines plus Hive and see which can convert CSV files into ORC files the fastest.

The results surprised me.

Comments closed