Press "Enter" to skip to content

Category: Hadoop

Copying Cassandra Data to HDFS

Landon Robinson shows how you can use Spark to extract data from Cassandra and move it into HDFS:

Cassandra is a great open-source solution for accessing data at web scale, thanks in no small part to its low-latency performance. And if you’re a power user of Cassandra, there’s a high probability you’ll want to analyze the data it contains to create reports, apply machine learning, or just do some good old-fashioned digging.

However, Cassandra can prove difficult to use as an analytical warehouse, especially if you’re using it to serve data in production around the clock. But one approach you can take is quite simple: copy the data to Hadoop (HDFS).

Read on to learn how.
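As a rough sketch of the idea (not Landon's exact code), reading a Cassandra table through the DataStax spark-cassandra-connector and landing a snapshot in HDFS as Parquet looks something like this; the connection host, keyspace, table, and output path are all placeholders:

```scala
import org.apache.spark.sql.SparkSession

object CassandraToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cassandra-to-hdfs")
      .config("spark.cassandra.connection.host", "cassandra-host") // placeholder host
      .getOrCreate()

    // Read the Cassandra table through the connector's data source
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
      .load()

    // Write a point-in-time copy to HDFS as Parquet for analytical work
    df.write
      .mode("overwrite")
      .parquet("hdfs:///data/snapshots/my_table")

    spark.stop()
  }
}
```

Running this on a schedule gives you an analytics-friendly copy without adding query load to the production Cassandra cluster.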


Troubleshooting Kafka Listeners

Robin Moffatt has some tips for configuring listeners in Kafka:

Apache Kafka® is a distributed system. Data is read from and written to the leader for a given partition, which could be on any of the brokers in a cluster. When a client (producer/consumer) starts, it will request metadata about which broker is the leader for a partition—and it can do this from any broker. The metadata returned will include the endpoints available for the leader broker for that partition, and the client will then use those endpoints to connect to the broker to read/write data as required.

It’s these endpoints that cause people trouble. On a single machine, running bare metal (no VMs, no Docker), everything might be the hostname (or just localhost), and it’s easy. But once you move into more complex networking setups and multiple nodes, you have to pay more attention to it.

Click through for more tips.
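As a quick sanity check to go along with Robin's tips, you can ask a broker what endpoints it actually hands back to clients. This is my own sketch rather than anything from the post, using the Kafka AdminClient; the bootstrap address is a placeholder:

```scala
import java.util.Properties
import org.apache.kafka.clients.admin.AdminClient

object ListAdvertisedEndpoints {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // placeholder; use whatever your clients use

    val admin = AdminClient.create(props)
    try {
      // The broker list returned here reflects the endpoints the cluster advertises to clients
      admin.describeCluster().nodes().get().forEach { node =>
        println(s"Broker ${node.id()} advertises ${node.host()}:${node.port()}")
      }
    } finally {
      admin.close()
    }
  }
}
```

If the host:port pairs printed here aren't reachable from where your clients run, that's the listener configuration problem Robin describes.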


Scala 2.13 Changes

Anmol Sarna takes us through what’s new in Scala 2.13:

Last, but not least, the team has invested heavily in compiler speedups during the 2.13 cycle, which resulted in some major changes with respect to the compiler.

Compiler performance in 2.13 is 5-10% better compared to 2.12, thanks mainly to the new collections.

There are a lot of changes in this version. I wonder how long before Spark supports it fully.
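To make that concrete, here are a few of the collection-level changes you'd notice in a 2.13 REPL (a quick sketch, not an exhaustive list):

```scala
object Scala213Samples {
  // Stream is deprecated in 2.13 in favour of LazyList
  val lazyNats: LazyList[Int] = LazyList.from(0)

  // Converting between collection types now uses to(Target) rather than to[Target]
  val asVector: Vector[Int] = List(1, 2, 3).to(Vector)

  // New helpers such as groupMapReduce collapse common group-then-aggregate patterns
  val lengthsByFirstLetter: Map[Char, Int] =
    List("spark", "scala", "kafka", "hdfs")
      .groupMapReduce(_.head)(_.length)(_ + _) // Map('s' -> 10, 'k' -> 5, 'h' -> 4)
}
```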


Thoughts on Hadoop’s Future

Mark Litwintschik ties together a set of thoughts on the present and future of Hadoop:

At no point in Hadoop’s history has there been such a rich variety of features being offered as today and never before has it been so stable and battle-tested.

Hadoop projects are made up of millions of lines of code which have been written by thousands of contributors. In any given week there are hundreds of developers working on the various projects. Most commercial database offerings are lucky to have a handful of engineers making any significant improvements to their code bases every week.

Mark takes a broad ecosystem approach (which I fully endorse) and so he sees the glass as more than half-full.


SQL Server 2019 CTP 3.1 Released

Anshul Rampal announces CTP 3.1 of SQL Server 2019:

The big data clusters feature continues to add key capabilities for its initial release in SQL Server 2019. This month, the release extends the Apache Spark™ functionality for the feature by supporting the ability to read and write to data pool external tables directly as well as a mechanism to scale compute separately from storage for compute-intensive workloads. Both enhancements should make it easier to integrate Apache Spark™ workloads into your SQL Server environment and leverage each of their strengths. Beyond Apache Spark™, this month’s release also includes machine learning extensions with MLeap where you can train a model in Apache Spark™ and then deploy it for use in SQL Server through the recently released Java extensibility functionality in SQL Server CTP 3.0. This should make it easier for data scientists to write models in Apache Spark™ and then deploy them into production SQL Server environments for both periodic training and full production against the trained model in a single environment.

Click through to learn more about what has changed.
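The big data clusters feature ships its own Spark connector for the data pool external tables, so treat this purely as a shape-of-the-thing sketch: a generic Spark JDBC write from a DataFrame into a SQL Server table, with the server, database, table, source path, and credentials all as placeholders:

```scala
import org.apache.spark.sql.SparkSession

object WriteToSqlServer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-to-sql-server").getOrCreate()

    // Placeholder source data prepared by an earlier Spark job
    val df = spark.read.parquet("hdfs:///data/clickstream")

    // Generic JDBC write; the dedicated data pool connector has its own format name
    df.write
      .format("jdbc")
      .option("url", "jdbc:sqlserver://mssql-master:1433;databaseName=sales")
      .option("dbtable", "dbo.web_clickstreams")
      .option("user", "spark_writer")
      .option("password", sys.env.getOrElse("SQL_PASSWORD", ""))
      .mode("append")
      .save()

    spark.stop()
  }
}
```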


Controlling Partition and File Counts in Spark

Landon Robinson shows how we can control the number of partitions (and therefore the number of output files) on reduce-style jobs in Spark:

Whatever the case may be, the desire to control the number of files for a job or query is reasonable – within, ahem, reason – and in general is not too complicated. And, it’s often a very beneficial idea.

However, a thorough understanding of distributed computing paradigms like Map-Reduce (a paradigm Apache Spark follows and builds upon) can help you understand how files are created by parallelized processes. More importantly, one can learn the benefits and consequences of manipulating that behavior, and how to do so properly – or at least without degrading performance.

There’s good advice in here, so check it out.
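The core idea, as a minimal sketch (paths and the target partition count are placeholders): the number of output files tracks the number of partitions at write time, so adjust the partitions before writing.

```scala
import org.apache.spark.sql.SparkSession

object ControlOutputFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("control-output-files").getOrCreate()

    val df = spark.read.parquet("hdfs:///data/raw/events")

    // coalesce(n) narrows to n partitions without a full shuffle - cheap, but a very
    // small n can concentrate the write on a handful of executors
    df.coalesce(8).write.mode("overwrite").parquet("hdfs:///data/out/events_coalesced")

    // repartition(n) does a full shuffle into n roughly even partitions - more expensive,
    // but it avoids badly skewed output files
    df.repartition(8).write.mode("overwrite").parquet("hdfs:///data/out/events_repartitioned")

    spark.stop()
  }
}
```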


Creating an Azure Databricks Cluster

Brad Llewellyn shows how you can create an Azure Databricks cluster:

There are three major concepts for us to understand about Azure Databricks: Clusters, Code, and Data. We will dig into each of these in due time. For this post, we’re going to talk about Clusters. Clusters are where the work is done. Clusters themselves do not store any code or data. Instead, they operate the physical resources that are used to perform the computations. So, it’s possible (and even advised) to develop code against small development clusters, then leverage the same code against larger production-grade clusters for deployment. Let’s start by creating a small cluster.

Read on for an example.


Databricks Runtime 5.4

Todd Greenstein announces Databricks Runtime 5.4:

We’ve partnered with the Data Services team at Amazon to bring the Glue Catalog to Databricks.   Databricks Runtime can now use Glue as a drop-in replacement for the Hive metastore. This provides several immediate benefits:
– Simplifies manageability by using the same Glue Catalog across multiple Databricks workspaces.
– Simplifies integrated security by using IAM Role Passthrough for metadata in Glue.
– Provides easier access to metadata across the Amazon stack and access to data catalogued in Glue.

There are some interesting changes in here.
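For a sense of what that looks like day to day: once a cluster is pointed at Glue as its metastore (Databricks exposes a cluster Spark config for this; check their docs for the exact key), Glue-catalogued tables behave like ordinary metastore tables in a notebook, where spark is predefined. The database and table names below are placeholders.

```scala
// Assumes the cluster was created with the Glue metastore config enabled,
// e.g. something along the lines of spark.databricks.hive.metastore.glueCatalog.enabled true
spark.sql("SHOW DATABASES").show()
spark.sql("SELECT COUNT(*) FROM glue_db.page_views").show()
```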


Running Confluent Platform with .NET

Niels Berglund shows how you can install Confluent Platform as a Docker container and use the .NET client against it:

What we see in Figure 16 are the various project-related files, including the source file Program.cs. What is missing now is a Kafka client. For .NET there exist a couple of clients, and theoretically, you can use any one of them. However, in practice, there is only one, and that is the Confluent Kafka DotNet client. The reason I say this is that it has the best parity with the original Java client. The client has NuGet packages, and you install it via VS Code’s integrated terminal: dotnet add package Confluent.Kafka --version 1.0.1.1:

Definitely use the Confluent client. The others were from a time when there was no official driver; most aren’t even maintained anymore.


When Not to Use Spark

Ramandeep Kaur gives us several cases when it makes sense not to use Apache Spark:

There can be use cases where Spark would be the inevitable choice. Spark is considered an excellent tool for use cases like ETL over a large dataset, analyzing large sets of data files, machine learning and data science on large datasets, connecting BI/visualization tools, etc.
But it’s no panacea, right?

Let’s consider the cases where using Spark would be no less than a nightmare.

No tool is perfect at everything. Click through for a few use cases where the Spark experience degrades quickly.
