Category: Hadoop

Cloudera Data Platform

Alex Woodie reports on the business plan of the newly merged Cloudera:

“Once we’ve delivered that and got past it, we then want to get to a second subsequent version, which you can start to upgrade and migrate to, and that will be the go-forward platform,” he said. “Obviously the key part of CDP is delivering not just the workloads you have today but new and intuitive experiences around key workloads such as data warehousing, data flow, the edge or streaming, AI and machine learning.”
The company also announced that CDH 5.x and 6.x and HDP 3.x will be supported through January 2022, which is in line with previous guidance the company has given. The company believes that three years is plenty of time for customers to plan their migration paths from older CDH and HDP versions to the unified CDP product. Support for HDP 2.x will end before that time.

Also of interest: the integration of Hortonworks Data Flow into CDH and Cloudera Data Science Workbench into HDP.

Comments closed

Data Transformation Tools In The Azure Space

James Serra gives us an overview of the major tools you would use for ETL and ELT in Azure:

If you are building a big data solution in the cloud, you will likely be landing most of the source data into a data lake.  And much of this data will need to be transformed (i.e. cleaned and joined together – the “T” in ETL).  Since the data lake is just storage (i.e. Azure Data Lake Storage Gen2 or Azure Blob Storage), you need to pick a product that will be the compute and will do the transformation of the data.  There is good news and bad news when it comes to which product to use.  The good news is there are a lot of products to choose from.  The bad news is there are a lot of products to choose from :-).  I’ll try to help your decision-making by talking briefly about most of the Azure choices and the best use cases for each when it comes to transforming data (although some of these products also do the Extract and Load part).

The only surprise is the non-mention of Azure Data Lake Analytics, and there is a good conversation in the comments section explaining why.
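Whichever Azure service ends up doing the compute, the “T” itself boils down to cleaning and joining. Here is a toy, service-agnostic sketch of that step in plain Python (the datasets and field names are made up for illustration; in practice this logic would be expressed in Spark, Data Factory data flows, or whichever engine you pick):

```python
# Two "raw" datasets as they might land in the lake.
customers = [
    {"id": 1, "name": "  Alice "},
    {"id": 2, "name": "Bob"},
]
orders = [
    {"customer_id": 1, "amount": "10.50"},
    {"customer_id": 2, "amount": "7.25"},
]

# Clean: trim whitespace from names, cast string amounts to numbers.
by_id = {c["id"]: c["name"].strip() for c in customers}
cleaned = [{"customer_id": o["customer_id"], "amount": float(o["amount"])}
           for o in orders]

# Join: attach the customer name to each order.
joined = [{**o, "name": by_id[o["customer_id"]]} for o in cleaned]
```

The point is that the transformation logic is the same everywhere; the product choice is about where that compute runs and at what scale.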

Comments closed

Apache Airflow Now A Top-Level Project

Fokko Driesprong announces that Apache Airflow is now a top-level Apache project:

Today is a great day for Apache Airflow as it graduates from incubating status to a Top-Level Apache project. This is the next step of maturity for Airflow. For those unfamiliar, Airflow is an orchestration tool to schedule and orchestrate your data workflows. From ETL to training of models, or any other arbitrary tasks. Unlike other orchestrators, everything is written in Python, which makes it easy to use for both engineers and scientists. Having everything in code means that it is easy to version and maintain.

Airflow has been getting some hype lately, especially in the AWS space.
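The “everything is code” idea the announcement mentions can be illustrated with a toy dependency-driven task runner (this is not the Airflow API — a real Airflow DAG uses `airflow.DAG` and operators — just a minimal stdlib sketch of workflows defined and versioned as Python):

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks, deps):
    """Run callables in an order that respects the dependency graph."""
    order = list(TopologicalSorter(deps).static_order())
    results = {name: tasks[name]() for name in order}
    return order, results

# Three hypothetical tasks: extract, transform, load.
tasks = {
    "extract": lambda: [3, 1, 2],
    "transform": lambda: sorted([3, 1, 2]),
    "load": lambda: "loaded",
}
# "transform" depends on "extract"; "load" depends on "transform".
deps = {"transform": {"extract"}, "load": {"transform"}}
order, results = run_pipeline(tasks, deps)
```

Because the pipeline is ordinary code, it can live in version control and be reviewed like anything else — which is the maintainability argument made above.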

Comments closed

Databricks Library Utilities For Notebooks

Srinath Shankar and Todd Greenstein announce a new feature in Databricks Runtime 5.1:

We can see that there are no libraries installed and scoped specifically to this notebook.  Now I’m going to install a later version of SciPy, restart the Python interpreter, and then run that same helper function we ran previously to list any libraries installed and scoped specifically to this notebook session. When using the list() function, PyPI libraries scoped to this notebook session are displayed as <library_name>-<version_number>-<repo>, and (empty) indicates that the corresponding part has no specification. This also works with wheel and egg install artifacts, but for the sake of this example we’ll just be installing the single package directly.

This does seem easier than dropping to a shell and installing with Pip, especially if you need different versions of libraries.
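The `<library_name>-<version_number>-<repo>` display format the post describes is easy to pick apart if you want to inspect what a notebook has installed. A small sketch of a parser for that format (the parser itself is my illustration, not part of the Databricks utilities; in a notebook you would be working with the output of the `dbutils.library` helpers):

```python
def parse_library_entry(entry):
    """Split a '<name>-<version>-<repo>' entry as described in the post.
    '(empty)' marks a part with no specification; map it to None.
    rsplit keeps hyphenated package names (e.g. scikit-learn) intact."""
    name, version, repo = entry.rsplit("-", 2)
    norm = lambda part: None if part == "(empty)" else part
    return {"name": norm(name), "version": norm(version), "repo": norm(repo)}
```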

Comments closed

A Compendium Of Kafka Links

Manas Dash shares some interesting Kafka-related articles, case studies, and books:

Articles
1. Kafka in a Nutshell. Published on September 25, 2015, by Kevin Sookocheff. Kevin’s article is all about Kafka in a nutshell. He says “Kafka is quickly becoming the backbone of many organization’s data pipelines — and with good reason. By using Kafka as a message bus we achieve a high level of parallelism and decoupling between data producers and data consumers, making our architecture more flexible and adaptable to change.” If you have not read about Kafka yet, you must go through it. This is more like an executive summary of the what, where, and why of Kafka.

Read on for several more articles, as well as a few case studies and two books.

Comments closed

Generating Test Data In Kafka

Yeva Byzek takes us through the Kafka Connect Datagen connector:

Short of using real data from a real source, you do have a few options on how to generate more interesting test data for your topics. One option is to write your own client. Kafka has many programming language options—you choose: Java, Python, Go, .NET, Erlang, Rust—the list goes on. You can write your own Kafka client applications that produce any kind of records to a Kafka topic, and then you’re set.
But wouldn’t it be great if you could generate data locally to just fill topics with messages? Fortunately, you’re in luck! Because we have those data generators.

Click through for a demonstration.
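The “write your own client” option is often just a few lines: generate fake records, then hand them to whatever Kafka client your language has. A minimal sketch of the data-generation half (the record shape and field names here are made up; the producer call in the comment is illustrative, not specific to any one client library):

```python
import json
import random

def generate_orders(n, seed=42):
    """Yield (key, value) pairs shaped like simple test records.
    Seeded so the fake data is reproducible across runs."""
    rng = random.Random(seed)
    products = ["book", "lamp", "mug"]
    for i in range(n):
        key = f"order-{i}"
        value = json.dumps({
            "order_id": i,
            "product": rng.choice(products),
            "quantity": rng.randint(1, 5),
        })
        yield key, value

records = list(generate_orders(3))
# Each pair could then be sent with any Kafka client, e.g.
# producer.send("test-topic", key=key.encode(), value=value.encode())
```

The Datagen connector in the post does essentially this for you, driven by a schema, without writing client code at all.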

Comments closed

Cloudera And Hortonworks Officially Merged

Arun Murthy gives the used-to-be-Hortonworks perspective on the now-official merger of Cloudera and Hortonworks:

Our merger did not arise out of the blue. Our respective missions were well aligned, and together the new Cloudera has the scale it needs to service the constantly changing needs of the world’s most demanding organizations and to grow even more dominant in the market.
New open-source standards such as Kubernetes, container technology and the growing adoption of cloud-native architectures are major parts of Cloudera’s strategy.  Our primary initiative out of the gate is to deliver a 100-percent open-source unified platform, which leverages the best features of Hortonworks Data Platform (HDP) 3.0 and Cloudera’s CDH 6.0. Cloud-native and built for any cloud – with a public cloud experience across all clouds – the unified platform embodies our shared “cloud everywhere” vision.

I’m more a fan of the Hortonworks tooling like Ambari than I am of Cloudera’s alternatives, so it will be interesting to see what happens going forward. The good news for recalcitrant types like me is that HDP will be around for a couple of years yet.

Comments closed

Kafka And Exactly-Once Delivery

Rahul Agarwal explains what “exactly-once” means in terms of message-passing systems:

Until recently most organizations have been struggling to achieve the holy grail of message delivery, the exactly-once delivery semantic. Although this has been an out-of-the-box feature since Apache Kafka 0.11, people are still slow in picking up this feature. Let’s take a moment in understanding exactly-once semantics. What is the big deal about it and how does Kafka solve the problem?
Apache Kafka offers the following delivery guarantees. Let’s understand what this really means:

In a distributed system, having true exactly-once processing is extremely difficult to achieve.
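On the producer side, Kafka’s exactly-once machinery comes down to two standard producer properties: idempotent writes (the broker de-duplicates retried sends) plus transactions (atomic writes across partitions). A sketch of the relevant configuration, using the standard Kafka producer property names (the broker address and transactional id are placeholder values):

```python
producer_config = {
    "bootstrap.servers": "localhost:9092",   # assumed broker address
    "enable.idempotence": True,  # broker de-duplicates retried sends
    "acks": "all",               # required when idempotence is enabled
    "transactional.id": "my-app-tx-1",  # enables atomic multi-partition writes
}
# With a real client (e.g. confluent-kafka) you would then call
# init_transactions(), begin_transaction(), produce(...), and
# commit_transaction() around each batch of writes.
```

Note that this covers the write path; true end-to-end exactly-once also requires consumers to read with `isolation.level=read_committed` and to commit offsets inside the same transaction.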

Comments closed

Choosing Azure Data Lake Analytics Versus Azure Databricks

Ginger Grant helps us make the decision between using Azure Data Lake Analytics and Azure Databricks:

Databricks is a recent addition to Azure that is greatly influencing the technology choices that people are making when determining how to process data.  Prior to the introduction of Databricks to Azure in March of 2018, if you had a lot of unstructured data which was stored in HDFS clusters, and wanted to analyze it in a scalable fashion, the choice was Data Lake and using U-SQL with Data Lake Analytics.  With the introduction of Databricks, there is now a choice between Data Lake Analytics and Databricks for analyzing data.

Click through for the comparison.

Comments closed