Hadoop – Page 52 – Curated SQL

You might expect that the total number of vCores available to YARN limits the number of containers you can run concurrently, but that’s not true in some cases.
Let’s consider one of them – Capacity Scheduler with DefaultResourceCalculator (Memory only).

The name “Memory only” does give away the game a bit.
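
To make the point concrete, here is a toy model of a memory-only calculator. It is a sketch, not YARN's actual code: the `Node` class, the function name, and all of the numbers are made up for illustration. The assumption is simply that containers are admitted while memory fits and that vCores are tracked but never checked.

```python
from dataclasses import dataclass


@dataclass
class Node:
    memory_mb: int
    vcores: int


def schedule_memory_only(node, container_memory_mb, container_vcores, requested):
    """Toy model of a memory-only resource calculator (not YARN's real code).

    Containers are admitted while memory fits; vCores are tracked
    but never checked, so the available count can go negative.
    """
    free_memory = node.memory_mb
    free_vcores = node.vcores
    running = 0
    for _ in range(requested):
        if container_memory_mb > free_memory:
            break  # memory is the only limit that is enforced
        free_memory -= container_memory_mb
        free_vcores -= container_vcores  # bookkeeping only
        running += 1
    return running, free_memory, free_vcores


# Hypothetical node: 16 GB of memory but only 4 vCores.
node = Node(memory_mb=16384, vcores=4)
running, free_memory, free_vcores = schedule_memory_only(
    node, container_memory_mb=1024, container_vcores=1, requested=20
)
print(running, free_memory, free_vcores)  # 16 0 -12
```

Sixteen 1 GB containers fit in memory, so all sixteen run even though the node has only four vCores, and the available vCore count ends up below zero.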

Don’t Install Hadoop on Windows

Published 2020-05-11 by Kevin Feasel

Hadi speaks truth:

A few days ago, I published the installation guides for Hadoop, Hive, and Pig on Windows 10. And yesterday, I finished installing and configuring the ecosystem. The only conclusion I have is: “Think 1000 times before installing Hadoop and related technologies on Windows!”

The biggest problem is that Microsoft got flaky about this. Back in 2012-2013, they backed running Hadoop on Windows as part of getting HDInsight up and running. I even remember the HDInsight emulator which could run on a local desktop. By 2014 or so, they shifted directions and decided it wasn’t worth the effort. Because Apache Spark (which does have pretty decent Windows support, at least for development) really wants Hive, you can fake it with winutils.
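
If you go the winutils route, a quick sanity check can save some head-scratching. The sketch below assumes the usual convention that the Hadoop libraries look for `winutils.exe` under `%HADOOP_HOME%\bin`; the function names are my own, and nothing here downloads or installs anything.

```python
import os
from pathlib import Path


def winutils_path(hadoop_home):
    """Return the path where winutils.exe is expected under HADOOP_HOME."""
    return Path(hadoop_home) / "bin" / "winutils.exe"


def check_winutils(env=None):
    """Report whether HADOOP_HOME is set and contains bin/winutils.exe."""
    env = os.environ if env is None else env
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        return "HADOOP_HOME is not set"
    path = winutils_path(hadoop_home)
    if not path.is_file():
        return f"winutils.exe not found at {path}"
    return f"found {path}"


print(check_winutils())
```

Run it before starting a local Spark session on Windows; if it reports a missing variable or file, that is the first thing to fix.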

Project Metamorphosis: Elastic Kafka Clusters

Published 2020-05-08 by Kevin Feasel

Jay Kreps explains what Confluent has been up to lately:

What is Project Metamorphosis?
Let me try to explain. I think there are two big shifts happening in the world of data right now, and Project Metamorphosis is an attempt to bring those two things together.
The first one, and the one that Confluent is known for, is the move to event streaming.
Event streams are a real revolution in how we think about and use data, and we think they are going to be at the core of one of the most important data platforms in a modern company. Our goal at Confluent is to build the infrastructure that makes that possible and help the world take advantage of it. That’s why we exist.
But event streaming isn’t the only paradigm shift we’re in the midst of. The other change comes from the movement to the cloud.

Click through for the high-level overview. I can see this even more directly competing with Kinesis and Event Hubs.

Technology Choices for Streaming Pipelines

Published 2020-05-08 by Kevin Feasel

The Hadoop in Real World team takes us through different tools available when working on streaming pipelines:

Businesses want to get insights as quickly as possible and do not want to wait a day, as before, for a report on what happened up to yesterday. They require a more proactive approach that helps them act immediately when something significant happens and prevent faults or downtime before they occur. Imagine you are buying a product from an e-retailer, you have reached the payment step, and something causes the payment to fail. At that very moment, you are having second thoughts about whether to buy the product now or later. If the business only gets a report of this occurrence the next day, it is not of much use, as the customer will already have bought the product somewhere else or decided against it. This is where real-time events and insights come in. With a real-time report, the team could have called the customer and saved the purchase by offering a discount, which in turn could have changed the customer’s mind.

Click through for a high-level discussion of these tools.

Security Practices for Azure Databricks

Published 2020-05-07 by Kevin Feasel

Abhinav Garg and Anna Shrestinian walk us through good security practices when using Azure Databricks:

Azure Databricks is a Unified Data Analytics Platform that is a part of the Microsoft Azure Cloud. Built upon the foundations of Delta Lake, MLflow, Koalas and Apache Spark™, Azure Databricks is a first party PaaS on Microsoft Azure cloud that provides one-click setup, native integrations with other Azure cloud services, interactive workspace, and enterprise-grade security to power Data & AI use cases for small to large global customers. The platform enables true collaboration between different data personas in any enterprise, like Data Engineers, Data Scientists, Business Analysts and SecOps / Cloud Engineering.
In this article, we will share a list of cloud security features and capabilities that an enterprise data team could utilize to bake their Azure Databricks environment as per their governance policy.

Much of this is fairly straightforward, but it is nice to have it all in one place.

Dynamic File Pruning on Delta Lake

Published 2020-05-06 by Kevin Feasel

Ali Afroozeh, et al., take us through Dynamic File Pruning in Databricks Runtime 6.1:

In addition to eliminating data at partition granularity, Delta Lake on Databricks dynamically skips unnecessary files when possible. This can be achieved because Delta Lake automatically collects metadata about data files managed by Delta Lake and so, data can be skipped without data file access. Prior to Dynamic File Pruning, file pruning only took place when queries contained a literal value in the predicate but now this works for both literal filters as well as join filters. This means that Dynamic File Pruning now allows star schema queries to take advantage of data skipping at file granularity.

There are some interesting performance results here. I’d also be curious to see how robust the results are as queries get more complicated.
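
The difference between literal-based and join-based file skipping is easy to see in a toy model. This is only a sketch under simplifying assumptions: each data file carries a min/max for one key column, and the file names, key values, and function names are all hypothetical. It is not how Databricks implements the feature.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    min_key: int
    max_key: int


def prune_by_literal(files, value):
    """Static pruning: keep files whose [min, max] range can contain the literal."""
    return [f for f in files if f.min_key <= value <= f.max_key]


def prune_by_join_keys(files, join_keys):
    """Dynamic pruning: keep files whose range can contain any runtime join key."""
    keys = set(join_keys)
    return [f for f in files if any(f.min_key <= k <= f.max_key for k in keys)]


# Hypothetical per-file statistics for a fact table.
fact_files = [
    FileStats("part-0", 1, 100),
    FileStats("part-1", 101, 200),
    FileStats("part-2", 201, 300),
]

# Keys that only become known at run time, after filtering a dimension table.
dimension_keys = [150, 160, 250]

print([f.path for f in prune_by_literal(fact_files, 42)])
print([f.path for f in prune_by_join_keys(fact_files, dimension_keys)])
```

The literal filter can be resolved at planning time, whereas the join filter needs the dimension side evaluated first. That is why the second kind of pruning has to happen dynamically.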

Kafka and Zookeeper

Published 2020-05-04 by Kevin Feasel

Ramandeep Kaur explains what Apache Kafka uses Apache Zookeeper to do:

ZooKeeper allows developers to focus on the core application logic, and it implements various protocols on the cluster so that the applications need not implement them on their own. These services are used in some form or another by distributed applications.

Ramandeep hits on KIP-500 at the end of her post as well.

Database Integrity in Cloudera Data Platform

Published 2020-05-01 by Kevin Feasel

Gokul Kamaraj and Liliana Kadar continue a series on operational database tooling in Hadoop:

Referential integrity is supported through the implementation of ‘constraints’ as well as enforcing business rules for attributes in the table.
Constraints are configurable, and you can use them across different tables. Keep in mind that you have to choose a behavior depending on the specific configuration given to that constraint.

This is rather underdeveloped compared to relational database platforms, but it’s still an improvement over the olden days, in which referential integrity was “write code which does that after the fact.”

Reading Query Plans in Spark

Published 2020-05-01 by Kevin Feasel

Daniel Ciocirlan has a primer on query plans in Apache Spark:

Let’s go over some examples of query plans and how to read them. Let’s go back to the one we’ve just shown:
== Physical Plan ==
*(1) Project [(id#0L * 5) AS id#2L]
+- *(1) Range (1, 1000000, step=1, splits=6)
We read this plan backwards, bottom to top:

Spark does have some UI components which make this a bit easier, but you’ll probably end up in a situation where you need to read it in this format.
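
As a small illustration of the bottom-to-top reading, the sketch below turns the text of a plan into a list of operators in execution order. The helper name is my own, and it assumes a linear plan with one operator per line and no branches (no joins or unions), which is a big simplification of real plans.

```python
import re


def execution_order(plan_text):
    """List operators of a linear physical plan in execution order.

    Assumes one operator per line and no branches, so execution
    order is simply the printed order reversed.
    """
    operators = []
    for line in plan_text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("=="):
            continue  # skip blank lines and the "== Physical Plan ==" header
        # Drop tree markers ("+-", ":-") and the whole-stage codegen tag "*(n)".
        line = re.sub(r"^[+:\-\s]*", "", line)
        line = re.sub(r"^\*\(\d+\)\s*", "", line)
        operators.append(line.split(" ", 1)[0])
    return list(reversed(operators))


plan = """
== Physical Plan ==
*(1) Project [(id#0L * 5) AS id#2L]
+- *(1) Range (1, 1000000, step=1, splits=6)
"""
print(execution_order(plan))  # ['Range', 'Project']
```

For the plan in the excerpt, that means Spark generates the range of numbers first and then applies the projection that multiplies each id by 5.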

Developing for Databricks with VS Code

Published 2020-04-29 by Kevin Feasel

Gerhard Brueckl tells us what comes after notebooks for users with development backgrounds:

For those users Databricks has developed Databricks Connect (Azure docs) which allows you to work with your local IDE of choice (Jupyter, PyCharm, RStudio, IntelliJ, Eclipse or Visual Studio Code) but execute the code on a Databricks cluster. This is awesome and provides a lot of advantages compared to the standard notebook UI. The two most important ones are probably the proper integration into source control / git and the ability to extend your IDE with tools like automatic formatters, linters, custom syntax highlighting, …
While Databricks Connect solves the problem of local execution and debugging, there was still a gap when it came to pushing your local changes back to Databricks to be executed as part of a regular ETL or ML pipeline. So far you had to either “deploy” your changes by manually uploading them via the Databricks UI again or write a script that uploads it via the REST API (Azure docs).

Gerhard has a nice extension for Visual Studio Code which helps with this. I’m also a huge fan of the DatabricksPS module, so I’ll happily plug that here.
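
For a sense of what the “write a script that uploads it via the REST API” option involves, here is a rough sketch that builds a Workspace API import request for a single notebook. It relies on my understanding of the `/api/2.0/workspace/import` endpoint and its `path`, `format`, `language`, `overwrite`, and `content` fields, so check the official docs before depending on it. The workspace URL, token, and target path are placeholders, and the request is built but not sent.

```python
import base64
import json
import urllib.request


def build_import_request(host, token, local_source, workspace_path, language="PYTHON"):
    """Build (but do not send) a Workspace API import request for one notebook."""
    payload = {
        "path": workspace_path,
        "format": "SOURCE",
        "language": language,
        "overwrite": True,
        # The API expects the file content as a base64-encoded string.
        "content": base64.b64encode(local_source.encode("utf-8")).decode("ascii"),
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/workspace/import",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_import_request(
    host="https://example.azuredatabricks.net",  # hypothetical workspace URL
    token="dapi-placeholder",                    # hypothetical access token
    local_source="print('hello from VS Code')\n",
    workspace_path="/Shared/etl/hello",          # hypothetical target path
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it; left out so the sketch runs offline.
```

Doing this by hand for every file in a project gets old quickly, which is the gap that tooling like the extension fills.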
