Hadoop – Page 70 – Curated SQL

MLflow 1.1 Released

Published 2019-07-24 by Kevin Feasel

Max Allen, et al, announce the release of MLflow 1.1:

We’re excited to announce today the release of MLflow 1.1. In this release, we’ve focused on fleshing out the tracking component of MLflow and improving visualization components in the UI.
Some of the major features include:
– Automatic logging from TensorFlow and Keras
– Parallel coordinate plots in the tracking UI
– Pandas DataFrame based search API
– Java Fluent API
– Kubernetes execution backend for MLflow projects
– Search Pagination

Looks like they’re putting in a lot of work on this.

Comments closed

Monitoring Backpressure in Apache Flink

Published 2019-07-24 by Kevin Feasel

Nico Kruber and Piotr Nowosjki explain how you can monitor the flow of your Apache Flink processes:

Probably the most important part of network monitoring is monitoring backpressure, a situation where a system is receiving data at a higher rate than it can process. Such behaviour will result in the sender being backpressured and may be caused by two things:
– The receiver is slow.
This can happen because the receiver is backpressured itself, is unable to keep processing at the same rate as the sender, or is temporarily blocked by garbage collection, lack of system resources, or I/O.
– The network channel is slow.
Even though in such case the receiver is not (directly) involved, we call the sender backpressured due to a potential oversubscription on network bandwidth shared by all subtasks running on the same machine. Beware that, in addition to Flink’s network stack, there may be more network users, such as sources and sinks, distributed file systems (checkpointing, network-attached storage), logging, and metrics. A previous capacity planning blog post provides some more insights.

Read the whole thing. Backpressure is not a topic unique to Flink, but affects any ETL or streaming operation.

Comments closed

Apache Phoenix on Cloudera

Published 2019-07-24 by Kevin Feasel

Krishna Maheshwari announces that Cloudera will officially support Apache Phoenix on its CDH and its upcoming Cloudera Data Platform:

Cloudera’s CDH releases have included Apache HBase which provides a resilient, NoSQL DBMS for customers operational applications that want to leverage the power of big-data. These applications have grown into mission important and mission critical applications that drive top-line revenue and bottom-line profitability. These applications include customer facing applications, ecommerce platforms, risk & fraud detection used behind the scenes at banks or serving AI/ML models for applications and enabling further reinforcement training of the same based on actual outcomes.
However, for many customers, HBase has been too daunting a journey

Phoenix is one of my favorite examples of Feasel’s Law in action.

Comments closed

Databricks Runtime 5.5

Published 2019-07-22 by Kevin Feasel

Bilal Aslam and Yifan Cao announce Databricks Runtime 5.5:

Secrets API in R notebooks
The Databricks Secrets API [Azure|AWS] lets you inject secrets into notebooks without hardcoding them. As of Databricks Runtime 5.5, this API is available in R notebooks in addition to existing support for Python and Scala notebooks. You can use the dbutils.secrets.get function to obtain secrets. Secrets are redacted before printing to a notebook cell.

There are some good updates in this release. Read on for the full list.

Comments closed

Hooking SQL Server to Kafka

Published 2019-07-17 by Kevin Feasel

Niels Berglund has an interesting scenario for us:

We see how the procedure in Code Snippet 2 takes relevant gameplay details and inserts them into the dbo.tb_GamePlay table.
In our scenario, we want to stream the individual gameplay events, but we cannot alter the services which generate the gameplay. We instead decide to generate the event from the database using, as we mentioned above, the SQL Server Extensibility Framework.

Click through for the scenario in depth and how to use Java to tie together SQL Server and Kafka.

Comments closed

Notebooks in Azure Databricks

Published 2019-07-16 by Kevin Feasel

Brad Llewellyn takes us through Azure Databricks notebooks:

Azure Databricks Notebooks support four programming languages, Python, Scala, SQL and R. However, selecting a language in this drop-down doesn’t limit us to only using that language. Instead, it makes the default language of the notebook. Every code block in the notebook is run independently and we can manually specify the language for each code block.

Before we get to the actually coding, we need to attach our new notebook to an existing cluster. As we said, Notebooks are nothing more than an interface for interactive code. The processing is all done on the underlying cluster.

Read on to learn how Databricks uses the notebook metaphor heavily in how you interact with it.

Comments closed

Reading and Writing CSV Files with spark-dotnet

Published 2019-07-16 by Kevin Feasel

Ed Elliott continues a series on Spark for .NET:

How do you read and write CSV files using the dotnet driver for Apache Spark?
I have a runnable example here:
https://github.com/GoEddie/dotnet-spark-examples
Specifcally:
https://github.com/GoEddie/dotnet-spark-examples/tree/master/examples/split-csv

The quoted links will take you straight to the code, but click through to see Ed’s commentary.

Comments closed

How .NET Code Talks to Spark

Published 2019-07-15 by Kevin Feasel

Ed Elliott has a great diagram showing how user-written .NET code communicates with Spark over the Java VM:

4. Spark-dotnet Java driver listens on tcp port
The spark-dotnet Java driver listens on a TCP socket. This socket is used to communicate between the Java VM and the dotnet code, the dotnet code doesn’t run in the Java VM but is in a separate process communitcating with the Java VM via that TCP postrt. The year is 2019, we serialize and deserialize data all the time and don’t even know it, hell notepad probably even does it.

It’s serialization & deserialization as well as TCP sockets all the way down.

Comments closed

Cloudera and 100% Open Source Software

Published 2019-07-12 by Kevin Feasel

Alex Woodie notes a change at Cloudera:

The old Cloudera developed and distributed its Hadoop stack using a mix of open source and proprietary methods and licenses. But the new Cloudera will be 100% open source, just like Hortonworks, its one-time Hadoop rival that it acquired in January. But will developing its data platform completely in the open differentiate it from cloud competitors?
In a blog post published yesterday under the title “Our Commitment to Open Source Software,” Cloudera executives Charles Zedlewski and Arun Murthy laid out the company’s new plan to develop and distribute everything in the open.

This was one of the big reasons I preferred Hortonworks over Cloudera when they were separate companies: Hortonworks had this model. Hopefully it leads Cloudera to success.

Comments closed

Kafka Docker on Kubernetes

Published 2019-07-11 by Kevin Feasel

Bill Ward gives us a step-by-step set of instructions for installing Kafka Docker on Kubernetes:

In this ultimate guide I will give you a simple step-by-step tutorial on installing Kafka Docker on Kubernetes. This post includes a complete video walk-through.
There has been a lot of interest lately about deploying Kafka to a Kubernetes cluster. If you are wanting to take the deep dive yourself then you found the right article. Now that we have Kafka Docker, deploying a Kafka cluster to Kubernetes is a snap.

This makes it even easier to get started with Kafka in a development environment.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Hadoop