Category: Hadoop

Let’s look at the application domain in more detail. In the previous blog series on Kongo, a Kafka-focussed IoT logistics application, we used Kafka Connect to persist business “violations” to Cassandra for future use. For example, we could have used the data in Cassandra to check and certify that a delivery was free of violations across its complete storage and transportation chain.

An appropriate scenario for a Platform application involving Kafka and Cassandra has the following characteristics:

Large volumes of streaming data are ingested into Kafka (at variable rates)
Data is sent to Cassandra for long term persistence
Streams processing is triggered by the incoming events in real-time
Historic data is requested from Cassandra
Historic data is retrieved from Cassandra
Historic data is processed, and
A result is produced.
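
The seven steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: a `deque` stands in for the Kafka topic, a dictionary stands in for the Cassandra table, and the "differs from the historic mean by more than a threshold" rule is a hypothetical placeholder, not the detection method used in the original series.

```python
from collections import defaultdict, deque

# In-memory stand-ins (assumptions for illustration, not real client APIs).
kafka_topic = deque()                # plays the role of a Kafka topic
cassandra_table = defaultdict(list)  # plays the role of a Cassandra table keyed by sensor


def ingest(events):
    """Step 1: streaming data arrives in the 'topic'."""
    kafka_topic.extend(events)


def process_stream(threshold=2.0):
    """Steps 2-7: persist each event and check it against its history."""
    results = []
    while kafka_topic:
        # Step 3: each incoming event triggers processing.
        sensor, value = kafka_topic.popleft()
        # Steps 4-5: request and retrieve the historic data for this key.
        # History is read before the write so it excludes the current event.
        history = list(cassandra_table[sensor])
        # Step 2: long-term persistence of the new event.
        cassandra_table[sensor].append(value)
        # Step 6: process the history (hypothetical rule: distance from mean).
        if history:
            mean = sum(history) / len(history)
            is_anomaly = abs(value - mean) > threshold
        else:
            is_anomaly = False
        # Step 7: produce a result.
        results.append((sensor, value, is_anomaly))
    return results


ingest([("s1", 10.0), ("s1", 10.5), ("s1", 9.5), ("s1", 20.0)])
results = process_stream()
```

In a real deployment the two stand-ins would be replaced by a Kafka consumer and a Cassandra session, and the historic read would normally be bounded (for example, the most recent N rows per key).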

It looks like he’s focusing on changepoint detection, which is one of several good techniques for generalized anomaly detection. I’ll be interested in following this series.

Comments closed

Troubleshooting KSQL

Published 2018-09-28 by Kevin Feasel

Robin Moffatt walks us through a few scenarios where KSQL queries aren’t returning any data:

Probably the most common question in the Confluent Community Slack group’s #ksql channel is:

Why isn’t my KSQL query returning data?

That is, you’ve run a CREATE STREAM, but when you go to query it…
ksql> SELECT * FROM MY_FIRST_KSQL_STREAM;

…nothing happens. And because KSQL queries are continuous, your KSQL session appears to “hang.” That’s because KSQL is continuing to wait for any new messages to show you. So if you run a KSQL SELECT and get no results back, what could be the reasons for that?
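
One plausible cause is the query's starting position in the topic. The toy model below is a sketch in plain Python, not KSQL itself: a list stands in for the topic, and the hypothetical `run_query` function returns what a continuous query would emit when it starts from the end of the log versus the beginning. In KSQL the corresponding setting is the `auto.offset.reset` property, which can be set to `earliest` for a session.

```python
def run_query(log, start_from="latest", new_messages=()):
    """Return the rows a continuous query would emit.

    `log` holds messages already in the topic when the query starts;
    `new_messages` arrive afterwards. Both are plain sequences (hypothetical).
    """
    if start_from == "earliest":
        position = 0           # replay everything already in the topic
    elif start_from == "latest":
        position = len(log)    # only see messages that arrive from now on
    else:
        raise ValueError("start_from must be 'earliest' or 'latest'")
    full_log = list(log) + list(new_messages)
    return full_log[position:]


existing = ["order-1", "order-2"]
rows_latest = run_query(existing, "latest")
rows_earliest = run_query(existing, "earliest")
rows_after_new = run_query(existing, "latest", ["order-3"])
```

With `latest`, the two existing messages are never shown and the query only emits rows once something new arrives, which is exactly what looks like a hung session.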

Robin gives us five reasons why this might be.

Apache Pulsar Now A Top-Level Project

Published 2018-09-27 by Kevin Feasel

George Leopold reports on Apache Pulsar:

Apache Pulsar is touted as a highly scalable, low-latency messaging platform running on commodity hardware. Besides Yahoo (NASDAQ: AABA), current enterprise users include Zhaopin Ltd., the Chinese online recruitment service. Zhaopin said Apache Pulsar addresses “the shortcomings of existing messaging systems, such as message durability, low latency.”

Other early enterprise users said they are using the messaging system as a bridge between public and private clouds as they roll out hybrid cloud strategies. Other early uses include stream processing and analysis of industrial Internet of Things sensor data. Most emerging use cases seek to move beyond slow batch processing, Pulsar supporters said.

Now that it’s a top-level Apache project, it’ll be interesting to see if it eats away at Kafka’s market share.

Hadoop + SQL Server In 2019

Published 2018-09-27 by Kevin Feasel

Travis Wright shows off a big part of what the SQL Server team has been working on the last couple of years:

SQL Server 2019 big data clusters provide a complete AI platform. Data can be easily ingested via Spark Streaming or traditional SQL inserts and stored in HDFS, relational tables, graph, or JSON/XML. Data can be prepared by using either Spark jobs or Transact-SQL (T-SQL) queries and fed into machine learning model training routines in either Spark or the SQL Server master instance using a variety of programming languages, including Java, Python, R, and Scala. The resulting models can then be operationalized in batch scoring jobs in Spark, in T-SQL stored procedures for real-time scoring, or encapsulated in REST API containers hosted in the big data cluster.

SQL Server big data clusters provide all the tools and systems to ingest, store, and prepare data for analysis as well as to train the machine learning models, store the models, and operationalize them.
Data can be ingested using Spark Streaming, by inserting data directly to HDFS through the HDFS API, or by inserting data into SQL Server through standard T-SQL insert queries. The data can be stored in files in HDFS, or partitioned and stored in data pools, or stored in the SQL Server master instance in tables, graph, or JSON/XML. Either T-SQL or Spark can be used to prepare data by running batch jobs to transform the data, aggregate it, or perform other data wrangling tasks.

Data scientists can choose either to use SQL Server Machine Learning Services in the master instance to run R, Python, or Java model training scripts or to use Spark. In either case, the full library of open-source machine learning libraries, such as TensorFlow or Caffe, can be used to train models.

Lastly, once the models are trained, they can be operationalized in the SQL Server master instance using real-time, native scoring via the PREDICT function in a stored procedure in the SQL Server master instance; or you can use batch scoring over the data in HDFS with Spark. Alternatively, using tools provided with the big data cluster, data engineers can easily wrap the model in a REST API and provision the API + model as a container on the big data cluster as a scoring microservice for easy integration into any application.

I’ve wanted Spark integration ever since 2016 and we’re going to get it.

Writing To Elasticsearch With Spark Streaming

Published 2018-09-25 by Kevin Feasel

Anuj Saxena has an example of writing data from a Spark Streaming pipeline out to Elasticsearch:

We have been working with streaming data for a long time, and Apache Spark can be very convenient for it. Spark provides two APIs for streaming data: Spark Streaming, which is a separate library provided by Spark, and Structured Streaming, which is built upon the Spark SQL library. We will discuss the trade-offs and differences between these two libraries in another blog. But today we’ll focus on saving streaming data to Elasticsearch using Spark Structured Streaming. Elasticsearch added support for Spark Structured Streaming 2.2.0 onwards in version 6.0.0 of the “Elasticsearch For Apache Hadoop” dependency. We will be using these versions or higher to build our sbt-scala project.

Click through for an example.

Databricks Delta Now Available On Azure

Published 2018-09-25 by Kevin Feasel

Cihan Biyikoglu and Singh Garewal announce the availability of Databricks Delta on Azure Databricks:

Using an innovative new table design, Delta supports both batch and streaming use cases with high query performance and strong data reliability while requiring a simpler data pipeline architecture:

Increased query performance – Able to deliver 10 to 100 times faster performance than Apache Spark(™) on Parquet through the use of key enablers such as compaction, flexible indexing, multi-dimensional clustering and data caching.

Improved data reliability – By employing ACID (“all or nothing”) transactions, schema validation / enforcement, exactly once semantics, snapshot isolation and support for UPSERTS and DELETES.

Reduced system complexity – Through the unification of batch and streaming in a common pipeline architecture – being able to operate on the same table also means a shorter time from data ingest to query result. Schema evolution provides the ability to infer schema from input data making it easier to deal with changing business needs.

The Azure version of Databricks is quickly reaching parity with the classic AWS-hosted version.

It’s All ETL (Or ELT) In The End

Published 2018-09-20 by Kevin Feasel

Robin Moffatt notes that ETL (and ELT) doesn’t go away in a streaming world:

In the past we used ETL techniques purely within the data-warehousing and analytic space. But, if one considers why and what ETL is doing, it is actually a lot more applicable as a broader concept.

Extract: Data is available from a source system

Transform: We want to filter, cleanse or otherwise enrich this source data

Load: Make the data available to another application

There are two key concepts here:

Data is created by an application, and we want it to be available to other applications

We often want to process the data (for example, cleanse and apply business logic to it) before it is used

Thinking about many applications being built nowadays, particularly in the microservices and event-driven space, we recognize that what they do is take data from one or more systems, manipulate it and then pass it on to another application or system. For example, a fraud detection service will take data from merchant transactions, apply a fraud detection model and write the results to a store such as Elasticsearch for review by an expert. Can you spot the similarity to the above outline? Is this a microservice or ETL process?
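
The fraud-detection example maps directly onto the three ETL stages. The sketch below is a minimal illustration in plain Python: the record layout, the `limit` threshold used as a stand-in "model", and the list used as the review store are all assumptions made for the example, not details from the original article.

```python
def extract(transactions):
    """Extract: read raw records from the source system."""
    yield from transactions


def transform(records, limit=1000.0):
    """Transform: cleanse the data and apply business logic.

    Records without an amount are dropped; the remaining ones are scored
    with a hypothetical rule (amount above `limit` is suspicious).
    """
    for record in records:
        amount = record.get("amount")
        if amount is None:
            continue
        yield {**record, "suspected_fraud": amount > limit}


def load(records, store):
    """Load: make the processed data available to another application."""
    for record in records:
        store.append(record)
    return store


transactions = [
    {"id": 1, "amount": 25.0},
    {"id": 2, "amount": None},    # malformed record, removed by transform
    {"id": 3, "amount": 5000.0},
]
review_store = load(transform(extract(transactions)), [])
```

Whether this runs as a nightly batch or as a service consuming events one at a time, the shape of the code is the same, which is the point of the excerpt.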

Things like this are reason #1 why I expect data platform jobs (administrator and developer) to be around decades from now. The set of tools expands, but the nature of the job remains similar.

Flint: Time Series With Spark

Published 2018-09-17 by Kevin Feasel

Li Jin and Kevin Rasmussen cover the concepts of Flint, a time-series library built on Apache Spark:

Time series analysis has two components: time series manipulation and time series modeling.

Time series manipulation is the process of manipulating and transforming data into features for training a model. Time series manipulation is used for tasks like data cleaning and feature engineering. Typical functions in time series manipulation include:

Joining: joining two time-series datasets, usually by the time

Windowing: feature transformation based on a time window

Resampling: changing the frequency of the data

Filling in missing values or removing NA rows.

Time series modeling is the process of identifying patterns in time-series data and training models for prediction. It is a complex topic; it includes specific techniques such as ARIMA and autocorrelation, as well as all manner of general machine learning techniques (e.g., linear regression) applied to time series data.

Flint focuses on time series manipulation. In this blog post, we demonstrate Flint functionalities in time series manipulation and how it works with other libraries, e.g., Spark ML, for a simple time series modeling task.
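
To make the four manipulation functions listed above concrete, here is a small sketch in plain Python. It does not use the Flint API; the function names, the `(timestamp, value)` tuple layout, and the integer timestamps are assumptions chosen for illustration, and the join shown is a backward "as-of" join, which is one common way of joining by time.

```python
from bisect import bisect_right


def asof_join(left, right):
    """Joining: attach to each left row the latest right value at or before its time."""
    right_times = [t for t, _ in right]
    joined = []
    for t, value in left:
        i = bisect_right(right_times, t)
        match = right[i - 1][1] if i > 0 else None
        joined.append((t, value, match))
    return joined


def rolling_mean(series, window):
    """Windowing: mean of the values whose time lies in (t - window, t]."""
    out = []
    for t, _ in series:
        in_window = [v for ts, v in series if t - window < ts <= t]
        out.append((t, sum(in_window) / len(in_window)))
    return out


def resample(series, step):
    """Resampling: lower the frequency by averaging values per bucket of `step`."""
    buckets = {}
    for t, v in series:
        buckets.setdefault(t - t % step, []).append(v)
    return [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]


def forward_fill(series):
    """Filling: replace missing values (None) with the last seen value."""
    out, last = [], None
    for t, v in series:
        if v is not None:
            last = v
        out.append((t, last))
    return out


prices = [(1, 10.0), (2, 12.0), (3, 11.0), (4, 13.0)]
quotes = [(0, 9.5), (3, 11.5)]

joined = asof_join(prices, quotes)
windowed = rolling_mean(prices, window=2)
resampled = resample(prices, step=2)
filled = forward_fill([(1, 1.0), (2, None), (3, 3.0)])
```

These single-machine versions scan lists in memory; a library built on Spark has to do the same operations across partitions while keeping rows ordered by time.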

Basho went all-in on a time-series product for Riak and it did not work out well for them. I’ll be curious to see if Flint has more staying power.

ElasticMapReduce And RStudio

Published 2018-09-14 by Kevin Feasel

Tanzir Musabbir demonstrates how to set up Amazon ElasticMapReduce to include an RStudio edge node:

RStudio Server provides a browser-based interface for R and is a popular tool among data scientists. Data scientists use Apache Spark clusters running on Amazon EMR to perform distributed training. In a previous blog post, the author showed how you can install RStudio Server on an Amazon EMR cluster. However, in certain scenarios you might want to install it on a standalone Amazon EC2 instance and connect to a remote Amazon EMR cluster. Benefits of running RStudio on EC2 include the following:

Running RStudio Server on an EC2 instance, you can keep your scientific models and model artifacts on the instance. You might have to relaunch your EMR cluster to meet your application requirements. By running RStudio Server separately, you have more flexibility and don’t have to depend entirely on an Amazon EMR cluster.

Installing RStudio on the master node of Amazon EMR requires sharing of resources with the applications running on the same node. By running RStudio on a standalone Amazon EC2 instance, you can use resources as you need without having to share the resources with other applications.

You might have multiple Amazon EMR clusters in your environment. With RStudio on an edge node, you have the flexibility to connect to any EMR cluster in your environment.

There is one major difference between running RStudio Server on an Amazon EMR cluster vs. running it on a standalone Amazon EC2 instance. In the latter case, the instance needs to be configured as an Amazon EMR client (or edge node). By doing so, you can submit Apache Spark jobs and other Hadoop-based jobs from an instance other than the EMR master node.

Click through for detailed, step-by-step instructions on how to do this.

Hortonworks Data Analytics Studio

Published 2018-09-14 by Kevin Feasel

Will Xu and Syed Mahmood announce Hortonworks Data Analytics Studio:

DAS leverages open-source technologies such as Apache Hive to share and extend the value of a modern data architecture in heterogeneous environments. It helps infrastructure administrators manage and optimize the performance of their Hive workloads by delivering visibility into query patterns and storage hotspots. DAS improves performance by uncovering inhibitors to query speed as well as providing recommendations to improve its efficiency.

In the past, Hive view did not provide full auto-complete capability during authoring time. We’ve addressed this shortcoming in DAS. This is not a trivial task, especially on large databases; however, through a number of caching optimizations we were able to make it work smoothly even with thousands of tables.

This product feels more like Management Studio or SQL Operations Studio than prior Hive UIs. That’s definitely a good thing.
