Press "Enter" to skip to content

Category: Hadoop

Copying Cassandra Data to HDFS

Landon Robinson shows how you can use Spark to extract data from Cassandra and move it into HDFS:

Cassandra is a great open-source solution for accessing data at web scale, thanks in no small part to its low-latency performance. And if you’re a power user of Cassandra, there’s a high probability you’ll want to analyze the data it contains to create reports, apply machine learning, or just do some good old-fashioned digging.

However, Cassandra can prove difficult to use as an analytical warehouse, especially if you’re using it to serve data in production around the clock. But one approach you can take is quite simple: copy the data to Hadoop (HDFS).

Read on to learn how.
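As a rough sketch of the idea (not Landon's exact code), reading a Cassandra table through the DataStax spark-cassandra-connector and landing a snapshot in HDFS as Parquet looks something like this; the connection host, keyspace, table, and output path are all placeholders:

```scala
import org.apache.spark.sql.SparkSession

object CassandraToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cassandra-to-hdfs")
      .config("spark.cassandra.connection.host", "cassandra-host") // placeholder host
      .getOrCreate()

    // Read the Cassandra table through the connector's data source
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
      .load()

    // Write a point-in-time copy to HDFS as Parquet for analytical work
    df.write
      .mode("overwrite")
      .parquet("hdfs:///data/snapshots/my_table")

    spark.stop()
  }
}
```

Running this on a schedule gives you an analytics-friendly copy without adding query load to the production Cassandra cluster.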


Troubleshooting Kafka Listeners

Robin Moffatt has some tips for configuring listeners in Kafka:

Apache Kafka® is a distributed system. Data is read from and written to the leader for a given partition, which could be on any of the brokers in a cluster. When a client (producer/consumer) starts, it will request metadata about which broker is the leader for a partition—and it can do this from any broker. The metadata returned will include the endpoints available for the leader broker for that partition, and the client will then use those endpoints to connect to the broker to read/write data as required.

It’s these endpoints that cause people trouble. On a single machine, running bare metal (no VMs, no Docker), everything might be the hostname (or just localhost), and it’s easy. But once you move into more complex networking setups and multiple nodes, you have to pay more attention to it.

Click through for more tips.
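As a quick sanity check to go along with Robin's tips, you can ask a broker what endpoints it actually hands back to clients. This is my own sketch rather than anything from the post, using the Kafka AdminClient; the bootstrap address is a placeholder:

```scala
import java.util.Properties
import org.apache.kafka.clients.admin.AdminClient

object ListAdvertisedEndpoints {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // placeholder; use whatever your clients use

    val admin = AdminClient.create(props)
    try {
      // The broker list returned here reflects the endpoints the cluster advertises to clients
      admin.describeCluster().nodes().get().forEach { node =>
        println(s"Broker ${node.id()} advertises ${node.host()}:${node.port()}")
      }
    } finally {
      admin.close()
    }
  }
}
```

If the host:port pairs printed here aren't reachable from where your clients run, that's the listener configuration problem Robin describes.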


Scala 2.13 Changes

Anmol Sarna takes us through what’s new in Scala 2.13:

Last, but not least, the team has invested heavily in compiler speedups during the 2.13 cycle, which resulted in some major changes with respect to the compiler.

Compiler performance in 2.13 is 5-10% better compared to 2.12, thanks mainly to the new collections.

There are a lot of changes in this version. I wonder how long before Spark supports it fully.
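To make that concrete, here are a few of the collection-level changes you'd notice in a 2.13 REPL (a quick sketch, not an exhaustive list):

```scala
object Scala213Samples {
  // Stream is deprecated in 2.13 in favour of LazyList
  val lazyNats: LazyList[Int] = LazyList.from(0)

  // Converting between collection types now uses to(Target) rather than to[Target]
  val asVector: Vector[Int] = List(1, 2, 3).to(Vector)

  // New helpers such as groupMapReduce collapse common group-then-aggregate patterns
  val lengthsByFirstLetter: Map[Char, Int] =
    List("spark", "scala", "kafka", "hdfs")
      .groupMapReduce(_.head)(_.length)(_ + _) // Map('s' -> 10, 'k' -> 5, 'h' -> 4)
}
```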


Thoughts on Hadoop’s Future

Mark Litwintschik ties together a set of thoughts on the present and future of Hadoop:

At no point in Hadoop’s history has there been such a rich variety of features being offered as today and never before has it been so stable and battle-tested.

Hadoop projects are made up of millions of lines of code which have been written by thousands of contributors. In any given week there are hundreds of developers working on the various projects. Most commercial database offerings are lucky to have a handful of engineers making any significant improvements to their code bases every week.

Mark takes a broad ecosystem approach (which I fully endorse) and so he sees the glass as more than half-full.


SQL Server 2019 CTP 3.1 Released

Anshul Rampal announces CTP 3.1 of SQL Server 2019:

The big data clusters feature continues to add key capabilities for its initial release in SQL Server 2019. This month, the release extends the Apache Spark™ functionality for the feature by supporting the ability to read and write to data pool external tables directly as well as a mechanism to scale compute separately from storage for compute-intensive workloads. Both enhancements should make it easier to integrate Apache Spark™ workloads into your SQL Server environment and leverage each of their strengths. Beyond Apache Spark™, this month’s release also includes machine learning extensions with MLeap where you can train a model in Apache Spark™ and then deploy it for use in SQL Server through the recently released Java extensibility functionality in SQL Server CTP 3.0. This should make it easier for data scientists to write models in Apache Spark™ and then deploy them into production SQL Server environments for both periodic training and full production against the trained model in a single environment.

Click through to learn more about what has changed.
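The big data clusters feature ships its own Spark connector for the data pool external tables, so treat this purely as a shape-of-the-thing sketch: a generic Spark JDBC write from a DataFrame into a SQL Server table, with the server, database, table, source path, and credentials all as placeholders:

```scala
import org.apache.spark.sql.SparkSession

object WriteToSqlServer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-to-sql-server").getOrCreate()

    // Placeholder source data prepared by an earlier Spark job
    val df = spark.read.parquet("hdfs:///data/clickstream")

    // Generic JDBC write; the dedicated data pool connector has its own format name
    df.write
      .format("jdbc")
      .option("url", "jdbc:sqlserver://mssql-master:1433;databaseName=sales")
      .option("dbtable", "dbo.web_clickstreams")
      .option("user", "spark_writer")
      .option("password", sys.env.getOrElse("SQL_PASSWORD", ""))
      .mode("append")
      .save()

    spark.stop()
  }
}
```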


Controlling Partition and File Counts in Spark

Landon Robinson shows how we can control the number of partitions (and therefore the number of output files) on reduce-style jobs in Spark:

Whatever the case may be, the desire to control the number of files for a job or query is reasonable – within, ahem, reason – and in general is not too complicated. And, it’s often a very beneficial idea.

However, a thorough understanding of distributed computing paradigms like Map-Reduce (a paradigm Apache Spark follows and builds upon) can help you understand how files are created by parallelized processes. More importantly, one can learn the benefits and consequences of manipulating that behavior, and how to do so properly – or at least without degrading performance.

There’s good advice in here, so check it out.
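The core idea, as a minimal sketch (paths and the target partition count are placeholders): the number of output files tracks the number of partitions at write time, so adjust the partitions before writing.

```scala
import org.apache.spark.sql.SparkSession

object ControlOutputFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("control-output-files").getOrCreate()

    val df = spark.read.parquet("hdfs:///data/raw/events")

    // coalesce(n) narrows to n partitions without a full shuffle - cheap, but a very
    // small n can concentrate the write on a handful of executors
    df.coalesce(8).write.mode("overwrite").parquet("hdfs:///data/out/events_coalesced")

    // repartition(n) does a full shuffle into n roughly even partitions - more expensive,
    // but it avoids badly skewed output files
    df.repartition(8).write.mode("overwrite").parquet("hdfs:///data/out/events_repartitioned")

    spark.stop()
  }
}
```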


Creating an Azure Databricks Cluster

Brad Llewellyn shows how you can create an Azure Databricks cluster:

There are three major concepts for us to understand about Azure Databricks: Clusters, Code, and Data. We will dig into each of these in due time. For this post, we’re going to talk about Clusters. Clusters are where the work is done. Clusters themselves do not store any code or data. Instead, they operate the physical resources that are used to perform the computations. So, it’s possible (and even advised) to develop code against small development clusters, then leverage the same code against larger production-grade clusters for deployment. Let’s start by creating a small cluster.

Read on for an example.


Databricks Runtime 5.4

Todd Greenstein announces Databricks Runtime 5.4:

We’ve partnered with the Data Services team at Amazon to bring the Glue Catalog to Databricks.   Databricks Runtime can now use Glue as a drop-in replacement for the Hive metastore. This provides several immediate benefits:
– Simplifies manageability by using the same Glue Catalog across multiple Databricks workspaces.
– Simplifies integrated security by using IAM Role Passthrough for metadata in Glue.
– Provides easier access to metadata across the Amazon stack and access to data catalogued in Glue.

There are some interesting changes in here.
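For a sense of what that looks like day to day: once a cluster is pointed at Glue as its metastore (Databricks exposes a cluster Spark config for this; check their docs for the exact key), Glue-catalogued tables behave like ordinary metastore tables in a notebook, where spark is predefined. The database and table names below are placeholders.

```scala
// Assumes the cluster was created with the Glue metastore config enabled,
// e.g. something along the lines of spark.databricks.hive.metastore.glueCatalog.enabled true
spark.sql("SHOW DATABASES").show()
spark.sql("SELECT COUNT(*) FROM glue_db.page_views").show()
```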


Running Confluent Platform with .NET

Niels Berglund shows how you can install Confluent Platform as a Docker container and use the .NET client against it:

What we see in Figure 16 are the various project-related files, including the source file Program.cs. What is missing now is a Kafka client. For .NET there exist a couple of clients, and theoretically, you can use any one of them. However, in practice, there is only one, and that is the Confluent Kafka DotNet client. The reason I say this is that it has the best parity with the original Java client. The client has NuGet packages, and you install it via VS Code’s integrated terminal: dotnet add package Confluent.Kafka --version 1.0.1.1:

Definitely use the Confluent client. The others were from a time when there was no official driver; most aren’t even maintained anymore.


When Not to Use Spark

Ramandeep Kaur gives us several cases when it makes sense not to use Apache Spark:

There can be use cases where Spark would be the inevitable choice. Spark is considered an excellent tool for use cases like ETL over a large dataset, analyzing large sets of data files, machine learning and data science on large datasets, connecting BI/visualization tools, etc.
But it’s no panacea, right?

Let’s consider the cases where using Spark would be no less than a nightmare.

No tool is perfect at everything. Click through for a few use cases where the Spark experience degrades quickly.
