The future is not clear for either Cloudera and MapR. While there are similarities in the two companies’ positions, there are big differences too.
Cloudera does not have a permanent CEO at the moment, and it still hasn’t shipped the new converged Hadoop distribution, dubbed Cloudera Data Platform (CDP), that will replace the old Cloudera and Hortonworks distributions. During its first quarter ended April 30, Cloudera said customers are holding off investing in the old Hadoop products since they know the new CDP is due by the end of the year. That fact led Cloudera to dramatically lower its revenue expectations for the year, which upset stockholders, who pushed Cloudera’s stock (NYSE: CLDR) down 40% the following day.
The way I’m phrasing it is that the Hadoop ecosystem is strong (with the successes of companies like Databricks and Confluent), but core Hadoop companies are struggling.
Today we are excited to announce the release of MLflow 1.0. Since its launch one year ago, MLflow has been deployed at thousands of organizations to manage their production machine learning workloads, and has become generally available on services like Managed MLflow on Databricks. The MLflow community has grown to over 100 contributors, and the MLflow PyPI package download rate has reached close to 600K times a month. The 1.0 release not only marks the maturity and stability of the APIs, but also adds a number of frequently requested features and improvements.
The release is publicly available starting today. Install MLflow 1.0 using PyPl, read our documentation to get started, and provide feedback on GitHub. Below we describe just a few of the new features in MLflow 1.0. Please refer to the release notes for a full list.
And it looks like they’re going to keep pushing on it from there.
To perform the steps below, I set up a single Ubuntu 16.04 machine on AWS EC2 using local storage. In real-life scenarios you will probably have all these components running on separate machines.
I started the instance in the public subnet of a VPC and then set up a security group to enable access from anywhere using SSH and TCP 5601 (for Kibana). Finally, I added a new elastic IP address and associated it with the running instance.
The example logs used for the tutorial are Apache access logs.
This is a great walkthrough on setup and basic configuration. If you don’t have something in place to manage logs, the ELK stack is fine.
Starting with CDH 6.2, Cloudera now includes the ability to use Intel’s newly released Optane Memory as an alternate destination for the 2nd tier of the bucket cache. This deployment configuration enables you to have ~3x the size of the cache for constant cost (as compared to off-heap cache on DRAM). It does incur some additional latency compared to the traditional off-heap configuration, but our testing indicates that by allowing more (if not all) of the data’s working set to fit in the cache the set up results in a net performance improvement when the data is ultimately stored on HDFS (using HDDs).
When deploying to the cloud or using on-prem object storage, the performance improvement will be even better as object storage tends to be very expensive for random reads of small amounts of data.
There aren’t too many changes to HBase in the blog post, but the two mentioned are pretty good ones.
If you do define your Spark DataFrames well, you get a much happier result. Here’s me creating a better-looking DataFrame in Spark:
INT(SUMLEV) AS SummaryLevel,
INT(COUNTY) AS CountyID,
INT(PLACE) AS PlaceID,
BOOLEAN(PRIMGEO_FLAG) AS IsPrimaryGeography,
NAME AS Name,
POPTYPE AS PopulationType,
INT(YEAR) AS Year,
INT(POPULATION) AS Population
POPULATION <> 'A'
It’s not all perfect, though: I also cover driver problems that I ran into here with Spark and Hive.
So, what is Azure Databricks? To answer this question, let’s start all the way at the bottom of the hole and climb up. So, what is Hadoop? Apache Hadoop is an open-source, distributed storage and computing ecosystem designed to handle incredibly large volumes of data and complex transformations. It is becoming more common as organizations are starting to integrate massive data sources, such as social media, financial transactions and the Internet of Things. However, Hadoop solutions are extremely complex to manage and develop. So, many people have worked together to create platforms that layer on top of Hadoop to provide a simpler way to solve certain types of problems. Apache Spark is one of these platforms. You can read more about Apache Hadoop here and here.
It’s Hadoop turtles all the way down.
One of the useful features of EMR Notebooks is the separation of the notebook environment from your underlying cluster infrastructure. The separation makes it easy for you to execute notebook code against transient clusters without worrying about deploying or configuring your notebook infrastructure every time you bring up a new cluster. You can create multiple serverless notebooks from the AWS Management Console for EMR and access the notebook UI without spending time setting up SSH access or configuring your browser for port-forwarding. Each notebook you create is launched instantly with its own Spark context. This capability enables you to attach multiple notebooks to a single shared cluster and submit parallel jobs without fear of job conflicts in a multi-tenant environment. This way you make efficient use of your clusters.
You can also connect EMR Notebooks to an EMR cluster as small as a one node. This gives you a budget-friendly sandbox environment to develop your Spark application.
Notebooks are everywhere. And for good reason.
Mistake #5: Configuring different names for the schemas topic in different Schema Registry instances
There is a commit log with all the schema information, which gets written to a Kafka topic. All Schema Registry instances should be configured to use the same schemas topic, whose name is set by the configuration parameter
kafkastore.topic. This topic is the schema’s source of truth, and the primary instances read the schemas from this topic. The name of this topic defaults to
_schemas, but sometimes customers choose to rename it. This has to be the same for all Schema Registry instances, otherwise it may result in different schemas with the same ID.
Read on for sixteen more.
Apache Kafka is a distributed open source messaging bus that was written in Java and Scala. The software implements a publish and subscribe messaging system that’s capable of moving large amounts of event data from sources to sinks, in a high-throughput manner with minimal latency and strong consistency guarantees. The software relies on Apache Zookeeper for management of the underlying cluster.
Kafka is based on the concept of producers and consumers. Event data originating from producers is stored timestamped partitions that are housed within Kafka topics. Meanwhile, consumer processes can read the data stored in Kafka partitions. Kafka automatically replicates partitions across multiple brokers (or nodes in the cluster), which allows Kafka to scale its message streaming service in a fault-tolerant manner.
Click through for descriptions of several good options. And if you want a big list, queues.io has one for you.
Modern Microservices are all about making systems event-driven: instead of making remote requests and waiting for the response (services and components calling each other and tell each other what to do), we can send notifications to related microservices when an event occurs.
These events are facts about the business. For example, an ATM or online transaction, a new log entry, or a customer registering for a new mobile plan. They are the data points collected by organizations that make their datasets. The good thing is, we can store these events in the very same infrastructure that we use to broadcast them: Apache Kafka. The better thing is we can even process them in the same infrastructure with Stream Processing applications. This means our applications and systems are linked via this central data pipeline, that is capable of real time data broadcast and processing and all data sources are shared via this data pipeline.
Read the whole thing.