Press "Enter" to skip to content

Category: Hadoop

Scaling Hadoop Beyond 10,000 Nodes

Keqiu Hu, et al., take us through a problem of scale:

At LinkedIn, we use Hadoop as our backbone for big data analytics and machine learning. With an exponentially growing data volume, and the company heavily investing in machine learning and data science, we have been doubling our cluster size year over year to match the compute workload growth. Our largest cluster now has ~10,000 nodes, one of the largest (if not the largest) Hadoop clusters on the planet. Scaling Hadoop YARN has emerged as one of the most challenging tasks for our infrastructure over the years.

In this blog post, we will first discuss the YARN cluster slowdowns we observed as we approached 10,000 nodes and the fixes we developed for these slowdowns. Then, we will share the ways we proactively monitored for future performance degradations, including a now open-sourced tool we wrote called DynoYARN, which reliably forecasts performance for YARN clusters of arbitrary size. Finally, we will describe Robin, an in-house service which enables us to horizontally scale our clusters beyond 10,000 nodes.

Read on to learn about the problems they experienced and how they resolved them.


Order, Sort, Cluster, and Distribute in Hive

The Hadoop in Real World team gives us three methods (and one synonym) to organize results in Hive:

Hive provides several options to order or sort the records in a result – order by, sort by, cluster by, and distribute by. Which option you choose has performance implications, so it is important to understand the difference between the options and choose the right one for the use case at hand.

Click through for a high-level overview of the techniques.
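
Since the behavioral differences are easy to miss, here is a minimal sketch of the four clauses. It runs them through Spark SQL (which accepts the same HiveQL syntax) from PySpark; the sales view and its columns are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-ordering-clauses").getOrCreate()

# Hypothetical data standing in for a Hive table.
spark.createDataFrame(
    [(1, "east", 250.0), (2, "west", 75.0), (3, "east", 410.0)],
    "sale_id INT, region STRING, amount DOUBLE",
).createOrReplaceTempView("sales")

# ORDER BY: one total ordering over the full result set (a single reducer in Hive).
spark.sql("SELECT * FROM sales ORDER BY amount DESC").show()

# SORT BY: rows are sorted within each reducer/partition, not globally.
spark.sql("SELECT * FROM sales SORT BY amount DESC").show()

# DISTRIBUTE BY: rows with the same key land in the same reducer/partition,
# with no ordering guarantee inside it.
spark.sql("SELECT * FROM sales DISTRIBUTE BY region").show()

# CLUSTER BY region is shorthand for DISTRIBUTE BY region SORT BY region.
spark.sql("SELECT * FROM sales CLUSTER BY region").show()
```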


Data Lakehouse Point-of-Sale Analytics Demo

Bryan Smith and Rob Saker share a pattern:

Disruptions in the supply chain – from reduced product supply and diminished warehouse capacity – coupled with rapidly shifting consumer expectations for seamless omnichannel experiences are driving retailers to rethink how they use data to manage their operations. Prior to the pandemic, 71% of retailers named lack of real-time visibility into inventory as a top obstacle to achieving their omnichannel goals. The pandemic only increased demand for integrated online and in-store experiences, placing even more pressure on retailers to present accurate product availability and manage order changes on the fly. Better access to real-time information is the key to meeting consumer demands in the new normal.

In this blog, we’ll address the need for real-time data in retail, and how to overcome the challenges of moving real-time streaming of point-of-sale data at scale with a data lakehouse.

It’s a cool scenario, at the least.


Databricks Serverless SQL

Nikhil Jethava and Kevin Clugage announce serverless SQL on Databricks:

Databricks SQL already provides a first-class user experience for BI and SQL directly on the data lake, and today, we are excited to announce another step in making data and AI simple with Databricks Serverless SQL. This new capability for Databricks SQL provides instant compute to users for their BI and SQL workloads, with minimal management required and capacity optimizations that can lower overall cost by an average of 40%. This makes it even easier for organizations to expand adoption of the lakehouse for business analysts who are looking to access the rich, real-time datasets of the lakehouse with a simple and performant solution.

Under the hood of this capability is an active server fleet, fully managed by Databricks, that can transfer compute capacity to user queries, typically in about 15 seconds. The best part? You only pay for Serverless SQL when users start running reports or queries.

Things are getting interesting between Databricks and Azure Synapse Analytics, as both now have serverless SQL and Spark offerings. Synapse Analytics has the better implementation of serverless SQL and Databricks the superior Spark implementation, so it becomes a question of which weakness you're willing to accept in order to gain the corresponding strength.


LATERAL VIEW in Hive

The Hadoop in Real World team provides a quick example of a powerful feature in Apache Hive:

Lateral view is used in conjunction with user-defined table generating functions such as explode(). A UDTF generates zero or more output rows for each input row. 

Click here if you'd like to know the difference between UDF, UDAF, and UDTF.

A lateral view first applies the UDTF to each row of the base table and then joins resulting output rows to the input rows to form a virtual table having the supplied table alias.

In other words, LATERAL joins are the SQL standard's equivalent of Microsoft's CROSS APPLY operator. I normally dislike having different names for the same thing due to the risk of confusion, but in fairness to Microsoft on this one, my recollection is that the standard name came along after SQL Server 2005, which already had CROSS APPLY and OUTER APPLY.
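
To make that concrete, here is a small, hypothetical example of LATERAL VIEW with explode(), run through Spark SQL from PySpark (the syntax is the same in Hive); the orders view and its array column are made up for the demo.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lateral-view-demo").getOrCreate()

# Hypothetical table: one row per order, with an array of item names.
spark.createDataFrame(
    [(1, ["apples", "pears"]), (2, ["milk"])],
    "order_id INT, items ARRAY<STRING>",
).createOrReplaceTempView("orders")

# explode() is a UDTF: it emits one output row per array element. LATERAL VIEW
# joins those generated rows back to the source row, so order_id repeats.
spark.sql("""
    SELECT o.order_id, item
    FROM orders o
    LATERAL VIEW explode(o.items) item_table AS item
""").show()
# Result: (1, apples), (1, pears), (2, milk)
```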


Databricks Autologging

Corey Zumar and Kasey Uhlenhuth announce a new product:

Machine learning teams require the ability to reproduce and explain their results – whether for regulatory, debugging or other purposes. This means every production model must have a record of its lineage and performance characteristics. While some ML practitioners diligently version their source code, hyperparameters and performance metrics, others find it cumbersome or distracting from their rapid prototyping. As a result, data teams encounter three primary challenges when recording this information: (1) standardizing machine learning artifacts tracked across ML teams, (2) ensuring reproducibility and auditability across a diverse set of ML problems and (3) maintaining readable code across many logging calls.

Read on to see how Databricks Autologging addresses these challenges.
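
For a sense of what this looks like in practice, here is a minimal sketch using MLflow's autologging with scikit-learn. On Databricks, autologging is enabled by default for supported frameworks; elsewhere you call mlflow.autolog() yourself, as below. The dataset and model are arbitrary choices for the example.

```python
import mlflow
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# One call turns on automatic logging of params, metrics, and model artifacts
# for supported libraries (scikit-learn here).
mlflow.autolog()

X, y = load_diabetes(return_X_y=True)

with mlflow.start_run():
    # No explicit mlflow.log_param / log_metric calls needed: the fit() call
    # is captured automatically, including hyperparameters and training metrics.
    RandomForestRegressor(n_estimators=100, max_depth=5).fit(X, y)
```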


Creating a Kafka Producer and Consumer with C#

Jim Galasyn shows how to use the Confluent.Kafka NuGet package to connect to a Kafka cluster from C#:

Sometimes you’d like to write your own code for producing data to an Apache Kafka® topic and connecting to a Kafka cluster programmatically. Confluent provides client libraries for several different programming languages that make it easy to code your own Kafka clients in your favorite dev environment.

One of the most popular dev environments is .NET and Visual Studio (VS) Code. This blog post shows you step by step how to use .NET and C# to create a client application that streams Wikipedia edit events to a Kafka topic in Confluent Cloud. Also, the app consumes a materialized view from ksqlDB that aggregates edits per page. The application runs on Linux, macOS, and Windows, with no code changes.

Now, if only the .NET package supported a bunch of the features which have come out over the past few years (the big one being Kafka Streams)… That's no knock on the maintainers, mind you – they've done a good job given the available resources – but it's still unfortunate. At least there's an unofficial implementation, and hey, the original Confluent.Kafka .NET package started out as one of those too.
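
The post itself walks through C#, but for a rough sketch of the overall produce/consume flow, here is the same shape using Confluent's Python client (confluent-kafka); the broker address, topic name, and payload are placeholders, and a real Confluent Cloud connection would also need auth settings.

```python
from confluent_kafka import Producer, Consumer

conf = {"bootstrap.servers": "localhost:9092"}  # placeholder broker address

# Produce a message to a (hypothetical) topic of Wikipedia edit events.
producer = Producer(conf)
producer.produce("wikipedia.edits", key="Apache_Kafka", value='{"bytes_changed": 42}')
producer.flush()

# Consume from the same topic.
consumer = Consumer({**conf, "group.id": "edit-readers", "auto.offset.reset": "earliest"})
consumer.subscribe(["wikipedia.edits"])
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        print(msg.key(), msg.value())
finally:
    consumer.close()
```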


Thoughts on Topic Replication in Kafka

Jeffrey Carter elaborates on a pair of concepts related to topic replication in Apache Kafka:

Apache Kafka, by default, allows a topic to be created on a broker automatically when a message is written to a topic name that does not yet exist. This can be very helpful in early development or prototyping, where code, topic names, and schemas are in flux. However, past that early stage, it is recommended that Kafka be configured to disable the auto-creation of topics from messages, for a few reasons. In this article I am going to touch on two of these reasons, which are also core principles of Kafka partitions and Kafka topic replication.

Read on to understand what partitions and replication factor have to do with all of this.
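
As a concrete illustration, here is a minimal sketch of creating a topic deliberately, with an explicit partition count and replication factor, via Confluent's Python AdminClient – the kind of step you take once auto.create.topics.enable is set to false on the brokers. The broker address, topic name, and settings are illustrative.

```python
from confluent_kafka.admin import AdminClient, NewTopic

# With auto.create.topics.enable=false on the brokers, topics must be created
# deliberately, so partition count and replication factor are explicit choices
# rather than broker defaults.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder address

futures = admin.create_topics(
    [NewTopic("orders", num_partitions=6, replication_factor=3)]
)

for topic, future in futures.items():
    try:
        future.result()  # raises if the broker rejected the request
        print(f"Created topic {topic}")
    except Exception as exc:
        print(f"Failed to create topic {topic}: {exc}")
```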
