Press "Enter" to skip to content

Category: Hadoop

Comparing Impala To Redshift

Mostafa Mokhtar, et al, have a comparison of Apache Impala to Amazon Redshift:

For this analysis, we used TPC-DS on a 3TB dataset and selected 70 out of the 99 queries that run without any modifications, or use variants, on both Redshift and Impala. We wanted to use a larger dataset (similar to what we’ve used in previous benchmarks), but due to Redshift’s data load times, we had to reduce the data size. (Note: This benchmark is derived from the TPC-DS benchmark and, as such, is not directly comparable to published TPC-DS results.)

This is coming from one of the two vendors, so take it with however many grains of salt you’d like.

Comments closed

HDP 2.5 Sandbox

The Hortonworks Data Platform 2.5 sandbox is now available:

GET STARTED IN 4 STEPS

1) Download the HDP Sandbox as a VM image (VMware, VirtualBox, or Docker).
2) Set up and start the VM image.
3) Try a Sandbox tutorial: check out the list of free tutorials, or jump directly into the Hello to HDP hands-on tutorial.
4) Need more help? Visit the Hortonworks Community Connection (HCC) and interact directly with the community and our development team.

It looks like they’ve bumped up the RAM requirements to 8 GB and have added new tutorials.

Comments closed

Clickstream Anomaly Detection

Chris Marshall shows how to perform anomaly detection using AWS Kinesis Analytics:

The RANDOM_CUT_FOREST function greatly simplifies the programming required for anomaly detection.  However, understanding your data domain is paramount when performing data analytics.  The RANDOM_CUT_FOREST function is a tool for data scientists, not a replacement for them.  Knowing whether your data is logarithmic, circadian rhythmic, linear, etc. will provide the insights necessary to select the right parameters for RANDOM_CUT_FOREST.  For more information about parameters, see the RANDOM_CUT_FOREST Function.

Fortunately, the default values work in a wide variety of cases. In this case, use the default values for all but the subSampleSize parameter.  Typically, you would use a larger sample size to increase the pool of random samples used to calculate the anomaly score; for this post, use 12 samples so as to start evaluating the anomaly scores sooner.

Your SQL query outputs one record every ten seconds from the tumbling window, so you’ll have enough evaluation values after two minutes to start calculating the anomaly score.  You are also using the WHERE clause as a cutoff, so records are only output to “DESTINATION_SQL_STREAM” if the anomaly score from the function is greater than 2. To help visualize the cutoff point, here are the data points from a few runs through the pipeline using the sample Python script:
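
The chart itself is in the original post.  If you want to pull the flagged records back out of the destination stream yourself, a rough sketch with the AWS SDK for Java looks like the following; the stream name and shard ID are placeholders, not anything from the post:

    import com.amazonaws.services.kinesis.AmazonKinesis;
    import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
    import com.amazonaws.services.kinesis.model.GetRecordsRequest;
    import com.amazonaws.services.kinesis.model.GetRecordsResult;
    import com.amazonaws.services.kinesis.model.GetShardIteratorRequest;
    import com.amazonaws.services.kinesis.model.Record;
    import java.nio.charset.StandardCharsets;

    public class AnomalyReader {
        public static void main(String[] args) throws Exception {
            AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

            // Hypothetical stream wired up as the application's DESTINATION_SQL_STREAM output.
            String iterator = kinesis.getShardIterator(new GetShardIteratorRequest()
                    .withStreamName("clickstream-anomaly-output")
                    .withShardId("shardId-000000000000")
                    .withShardIteratorType("LATEST"))
                .getShardIterator();

            while (iterator != null) {
                GetRecordsResult result = kinesis.getRecords(new GetRecordsRequest()
                    .withShardIterator(iterator)
                    .withLimit(100));
                for (Record r : result.getRecords()) {
                    // Each record is a JSON row the SQL query emitted; the WHERE clause
                    // has already filtered it down to anomaly scores greater than 2.
                    System.out.println(StandardCharsets.UTF_8.decode(r.getData()).toString());
                }
                iterator = result.getNextShardIterator();
                Thread.sleep(1000); // avoid hammering the shard
            }
        }
    }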

This kind of scenario is pretty cool—you could also do things like detecting service outages in streams (fewer than X events in a window, where X is some very small number relative to your overall data) or changes in advertising campaigns.

Comments closed

Deadlocks In Apache Ignite

Prachi Garg discusses Deadlock-Free Transactions in Apache Ignite:

When transactions in Ignite are performed with concurrency mode OPTIMISTIC and isolation level SERIALIZABLE, locks are acquired during transaction commit with an additional check allowing Ignite to avoid deadlocks. This also prevents cache entries from being locked for extended periods and avoids “freezing” of the whole cluster, thus providing high throughput. Furthermore, during commit, if Ignite detects a read/write conflict or a lock conflict between multiple transactions, only one transaction is allowed to commit. All other conflicting transactions are rolled back and an exception is thrown, as explained in the section below.
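
To make that concrete, here is a minimal sketch using the Ignite Java API (the cache name and key are invented).  The key point is that a conflict surfaces as an exception at commit time rather than as a deadlock:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.CacheAtomicityMode;
    import org.apache.ignite.configuration.CacheConfiguration;
    import org.apache.ignite.transactions.Transaction;
    import org.apache.ignite.transactions.TransactionConcurrency;
    import org.apache.ignite.transactions.TransactionIsolation;
    import org.apache.ignite.transactions.TransactionOptimisticException;

    public class DeadlockFreeTx {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                CacheConfiguration<Integer, Long> cfg = new CacheConfiguration<>("accounts");
                cfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL); // transactions require a transactional cache
                IgniteCache<Integer, Long> cache = ignite.getOrCreateCache(cfg);
                cache.put(1, 100L);

                try (Transaction tx = ignite.transactions().txStart(
                        TransactionConcurrency.OPTIMISTIC, TransactionIsolation.SERIALIZABLE)) {
                    // No locks are taken here; versions are validated at commit.
                    Long balance = cache.get(1);
                    cache.put(1, balance + 50);
                    tx.commit();
                } catch (TransactionOptimisticException e) {
                    // A read/write or lock conflict was detected at commit time;
                    // the transaction has been rolled back and can simply be retried.
                    System.out.println("Conflict detected, retry the transaction: " + e.getMessage());
                }
            }
        }
    }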

This sounds pretty similar to how SQL Server’s In-Memory OLTP works.

Comments closed

Apache Hadoop 3 Alpha

Andrew Wang looks at the new alpha for Hadoop:

Support for Multiple Standby NameNodes

HDFS NameNode high availability with QuorumJournalManager uses a Paxos quorum to store the NameNode edit log. With a three-node quorum, this change means we can tolerate the loss of any one node and still continue operation.

However, business-critical deployments may wish to run with higher levels of fault-tolerance, e.g. a five-node quorum to be able to tolerate the loss of any two nodes.

QuorumJournalManager already supports an arbitrary number of nodes, but fault tolerance was limited since HDFS was only able to run a single active and single standby NameNode. Hadoop 3 eliminates this restriction by supporting running multiple standby NameNodes. This improves the fault tolerance of HDFS.
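
As a rough sketch of what that looks like (not taken from the post), the extra standbys are just additional entries in the existing HA settings.  The property names below are the standard HDFS HA keys, though they would normally live in hdfs-site.xml rather than be set in code, and the host names are placeholders:

    import org.apache.hadoop.conf.Configuration;

    public class MultiStandbyConfig {
        public static Configuration build() {
            Configuration conf = new Configuration();
            conf.set("dfs.nameservices", "mycluster");
            // Three NameNodes: one active plus two standbys (only two total were allowed before Hadoop 3).
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2,nn3");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn3", "nn3.example.com:8020");
            // A five-node JournalNode quorum, so the edit log survives the loss of any two nodes.
            conf.set("dfs.namenode.shared.edits.dir",
                    "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485;"
                  + "jn4.example.com:8485;jn5.example.com:8485/mycluster");
            return conf;
        }
    }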

This one is huge to me.  It was a sad day when I learned that the “secondary” NameNode was no such thing.

Comments closed

Improving HBase Cluster Restart Time

Nitin Verma explains how to re-create an HBase cluster a bit faster:

When the flush ‘table’ operation is triggered, all the regions belonging to that table flush independently. Once the HFile corresponding to a region is flushed, it records the max sequence ID in metadata and notifies the WAL corresponding to the RegionServer. The WAL maintains a mapping of regions to their corresponding flushed sequence IDs. When the HBase cluster restarts, the HMaster distributes the flushed sequence IDs per region to the recovery threads splitting the WAL, so that they can skip the edits which have already been persisted in HFiles.
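
For reference, the same flush can be triggered from the HBase Java API rather than the shell; here is a minimal sketch (the table name is made up):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class FlushBeforeShutdown {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                // Flush every region of the table so its memstore contents land in HFiles.
                // On restart, WAL splitting can then skip edits at or below each region's
                // recorded flushed sequence ID instead of replaying them.
                admin.flush(TableName.valueOf("my_table"));
            }
        }
    }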

This is particularly important for clusters which frequently spin up and down, a feature of Platform-as-a-Service solutions like HDInsight.

Comments closed

S3 Or EBS?

Devadutta Ghat, et al, compare Amazon S3 versus Elastic Block Storage (EBS) on the basis of cost and Apache Impala performance:

EBS is attached to the AWS compute node as a fully-functional filesystem (similar to an attached SSD on an on-premise node), and Impala makes use of several filesystem features to deliver higher throughput and lower latency. These features include:

  • HDFS short-circuit reads to bypass HDFS and read files directly from the filesystem
  • OS buffer cache to read frequently accessed files directly from the cache instead of fetching them again
  • Fixed-cost file renames through metadata operations

In contrast, S3 is an object store that is accessed over the network. However, with S3, throughput is better than simple network-attached storage because of its dedicated, high-performance networks. In Cloudera’s internal benchmark testing (detailed below), on an r3.2xlarge, we saw a consistent throughput of about 100MB/s. Furthermore, in S3, there is currently no equivalent to HDFS short-circuit reads. Move/rename operations for data stored in S3 are a copy followed by a delete, while a file move on HDFS is a metadata operation; the former is usually problematic for ETL workloads, as they create a large number of small files that are typically moved.
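
The rename point is easy to see through the Hadoop FileSystem API: the call looks identical for both stores, but against s3a it becomes a server-side copy plus delete, while against HDFS it is a NameNode metadata update. A hedged sketch, with the bucket and paths as placeholders:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RenameCostDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Against HDFS: a metadata-only operation on the NameNode, effectively fixed cost.
            FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
            hdfs.rename(new Path("/staging/part-00000"), new Path("/warehouse/part-00000"));

            // Against S3 via s3a: the same call is implemented as a copy followed by a delete,
            // so its cost grows with object size, which hurts ETL jobs that move many files.
            FileSystem s3 = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
            s3.rename(new Path("s3a://my-bucket/staging/part-00000"),
                      new Path("s3a://my-bucket/warehouse/part-00000"));
        }
    }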

It looks like EBS is a solid choice for many workloads.

Comments closed

Change Data Capture With Apache NiFi

Satish Bomma uses Apache NiFi to perform change data capture on a MySQL database:

The main things to configure are the DBCPConnection Pool and the Maximum-value Columns.

Choose this to be the date-time stamp column that can serve as a cumulative change-management column.

This is the main limitation of the processor: it is not true CDC and relies on a single column. If data is reloaded with older values in that column, it will not be replicated into HDFS or any other destination.

This processor does not rely on transaction logs or redo logs the way Attunity or Oracle GoldenGate do. For a complete CDC solution, use a product like Attunity or Oracle GoldenGate.
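
To see why reloaded rows with older timestamps get skipped, here is a rough (non-NiFi) Java/JDBC illustration of the max-value-column pattern the processor uses; the table and column names are invented:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Timestamp;

    public class MaxValueColumnPull {
        // High-water mark persisted between runs (NiFi keeps this in processor state).
        private static Timestamp lastSeen = Timestamp.valueOf("1970-01-01 00:00:00");

        public static void pullNewRows(String jdbcUrl, String user, String pass) throws Exception {
            try (Connection conn = DriverManager.getConnection(jdbcUrl, user, pass);
                 PreparedStatement ps = conn.prepareStatement(
                     "SELECT * FROM orders WHERE last_updated > ? ORDER BY last_updated")) {
                ps.setTimestamp(1, lastSeen);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // Ship the row downstream (HDFS, Kafka, etc.), then advance the mark.
                        // Any row re-inserted with an older last_updated value never clears
                        // the high-water mark, so it is silently skipped; that is the
                        // limitation called out above.
                        lastSeen = rs.getTimestamp("last_updated");
                    }
                }
            }
        }
    }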

That last paragraph in the snippet is key:  it’s not a true replacement for CDC-friendly products.  It is, however, a good example for showing how to use NiFi to connect to a relational database and pump data out of it.

Comments closed

Data Masking And Row-Level Filtering In Hadoop

Syed Mahmood and Srikanth Venkat discuss two security features in Apache Ranger:

Dynamic data masking via Apache Ranger enables security administrators to ensure that only authorized users can see the data they are permitted to see, while for other users or groups the same data is masked or anonymized to protect sensitive content. The process of dynamic data masking does not physically alter the data or make a copy of it. The original sensitive data also does not leave the data store; rather, the data is obfuscated when presented to the user. Apache Ranger 0.6, included with HDP 2.5, introduces a new type of authorization policy called a “Masking Policy” that can be used to define which specific data fields are masked and the rules for how to anonymize or pseudonymize that data. For example, a security administrator may choose to mask credit card numbers when displayed to customer service personnel, such that only the last four digits are rendered, in the form XXXX-XXXX-XXXX-0123. The same is true of sensitive data such as social security numbers or email addresses, which are masked and rendered in different formats based on data masking rules.
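
Purely as an illustration of the rendering rule described above (this is not Ranger code), the “show last four” credit card mask amounts to something like:

    public class MaskExample {
        // Illustration only: render a card number as XXXX-XXXX-XXXX-0123, keeping just the
        // last four digits. Ranger applies rules like this at query time via masking
        // policies; the data at rest is untouched.
        static String maskCardNumber(String cardNumber) {
            String digits = cardNumber.replaceAll("\\D", "");
            String lastFour = digits.length() >= 4 ? digits.substring(digits.length() - 4) : digits;
            return "XXXX-XXXX-XXXX-" + lastFour;
        }

        public static void main(String[] args) {
            System.out.println(maskCardNumber("4111 1111 1111 0123")); // XXXX-XXXX-XXXX-0123
        }
    }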

This is part one of a two-part series; part two will dig into the technical details.  I have to wonder if Ranger’s dynamic data masking is as easy to circumvent as SQL Server’s.

Comments closed

Pulsar

Alex Woodie reports that Yahoo is open sourcing Pulsar:

Pulsar uses Apache Bookkeeper (committed by Yahoo to open source in 2011) as its durable storage mechanism. “With Bookkeeper, applications can create many independent logs, called ledgers,” Pulsar’s project page on GitHub says. “A ledger is an append-only data structure with a single writer that is assigned to multiple storage nodes (or bookies) and whose entries are replicated to multiple of these nodes.”

Pulsar uses brokers to serve topics. Each topic is assigned to a broker, and an individual broker can serve thousands of topics, Yahoo says. “The broker accepts messages from writers, commits them to a durable store, and dispatches them to readers,” Yahoo says.
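
For a feel of the write/read path described there, here is a minimal sketch using the current Pulsar Java client; the service URL, topic, and subscription name are placeholders:

    import java.nio.charset.StandardCharsets;
    import org.apache.pulsar.client.api.Consumer;
    import org.apache.pulsar.client.api.Message;
    import org.apache.pulsar.client.api.Producer;
    import org.apache.pulsar.client.api.PulsarClient;

    public class PulsarHello {
        public static void main(String[] args) throws Exception {
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650")
                    .build();

            // The broker that owns this topic accepts the message, persists it to
            // BookKeeper ledgers, and dispatches it to subscribers.
            Producer<byte[]> producer = client.newProducer().topic("clicks").create();
            producer.send("hello pulsar".getBytes(StandardCharsets.UTF_8));

            Consumer<byte[]> consumer = client.newConsumer()
                    .topic("clicks")
                    .subscriptionName("reader-1")
                    .subscribe();
            Message<byte[]> msg = consumer.receive();
            System.out.println(new String(msg.getData(), StandardCharsets.UTF_8));
            consumer.acknowledge(msg);

            producer.close();
            consumer.close();
            client.close();
        }
    }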

My biases would lead me to still go with Kafka over Pulsar, but it’d be interesting to see a good comp between the two.

Comments closed