How old is really old?
The Azure Data Lake Store binaries have been broadly certified for Hadoop distributions 3.0 and above. We are really in uncharted territory for lower versions, so the farther you go below 3.0, the higher the likelihood of them not working. My personal recommendation is to go no lower than 2.6. Beyond that, your mileage may really vary.
This is a good article, and do check it out. A very small mini-rant follows: Hadoop version 2.6 is not old. Nor is 2.7. 2.7 is the most recent production-worthy branch and 3.0 isn’t expected to go GA until August.
YARN introduces the notion of opportunistic containers in addition to the current guaranteed containers. An opportunistic container is queued at the NodeManager waiting for resources to become available, and run opportunistically so long as resources are available. They are preempted, if and when needed, to make room for guaranteed containers. Running opportunistic containers between the completion of a guaranteed container and the allocation of a new one should improve cluster utilization.
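As a rough sketch, opportunistic containers are governed by a couple of `yarn-site.xml` properties in recent Hadoop releases (the exact property names and defaults should be verified against your distribution's documentation):

```xml
<!-- yarn-site.xml: enable opportunistic containers (verify names against your Hadoop release) -->
<property>
  <name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- how many opportunistic containers each NodeManager may queue while waiting for resources -->
  <name>yarn.nodemanager.opportunistic-containers-max-queue-length</name>
  <value>10</value>
</property>
```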
There are a couple other new features, including support for Azure Data Lake Store.
So to give a concrete example, if the default file system was adl://amitadls.azuredatalakestore.net/, then /user/filename.txt would resolve to adl://amitadls.azuredatalakestore.net/user/filename.txt.
Why does the default file system matter? The first answer to this is pure convenience. It is a heck of a lot easier to simply write /user/filename.txt than adl://amitadls.azuredatalakestore.net/user/filename.txt in code and configurations. Secondly, many components in Hadoop use relative paths by default. For instance, there are a fixed set of places, specified by relative paths, where various applications generate their log files. Finally, many ISV applications running on Hadoop specify important locations by relative paths.
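The default file system is set in `core-site.xml`. A minimal sketch, using the account name from the article (substitute your own; the implementation class names come from the hadoop-azure-datalake module):

```xml
<!-- core-site.xml: make the ADLS account the default file system,
     so relative paths like /user/filename.txt resolve against it -->
<property>
  <name>fs.defaultFS</name>
  <value>adl://amitadls.azuredatalakestore.net</value>
</property>
<property>
  <name>fs.adl.impl</name>
  <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.adl.impl</name>
  <value>org.apache.hadoop.fs.adl.Adl</value>
</property>
```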
Read on to see how.
Hadoop was developed for deployment over Linux running on bare metal. Cloud deployment implies virtual machines, and for Hadoop it’s a huge difference.
As detailed in other articles (for instance, Your Cluster Is an Appliance or Understanding Hadoop Hardware Requirements), bare-metal deployments have an inherent advantage over virtual machine deployments. The biggest of these is that they can use direct attached storage, i.e., local disks.
Not every Hadoop workload is storage I/O bound, but most are, and even when Hadoop seems to be CPU bound, much of the CPU activity is often either directly in service of I/O (marshaling, unmarshaling, compression, etc.) or in service of avoiding I/O (building in-memory tables for map-side joins, for example).
Read the whole thing.
Apache HBase is an open-source NoSQL Hadoop database: a distributed, scalable big data store. It provides real-time read/write access to large datasets. HDInsight HBase is offered as a managed cluster that is integrated into the Azure environment. HBase provides many features as a big data store, but in order to use it, customers first have to load their data into HBase.
There are multiple ways to get data into HBase, such as using the client APIs, a MapReduce job with TableOutputFormat, or entering the data manually in the HBase shell. Many customers are interested in using Apache Phoenix, a SQL layer over HBase, for its ease of use. This post describes how to use Phoenix bulk load with HDInsight clusters.
Phoenix provides two methods for loading CSV data into Phoenix tables – a single-threaded client loading tool via the psql command, and a MapReduce-based bulk load tool.
Anunay explains both methods, allowing you to choose based on your data needs.
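As a rough command sketch of the two approaches (the table name, file names, and install paths are illustrative; actual paths vary by distribution, and the commands assume a running cluster):

```shell
# Single-threaded client load via psql: suitable for smaller data sets.
# Loads data.csv into the EXAMPLE table through the given ZooKeeper quorum.
/usr/hdp/current/phoenix-client/bin/psql.py -t EXAMPLE zookeeper-host data.csv

# MapReduce-based bulk load: suitable for large data sets.
# The input file must be on the cluster's distributed storage, not local disk.
hadoop jar /usr/hdp/current/phoenix-client/phoenix-client.jar \
  org.apache.phoenix.mapreduce.CsvBulkLoadTool \
  --table EXAMPLE --input /tmp/data.csv
```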
The data access and ingestion tests were on a cluster composed of 14 physical machines, each equipped with:
- 2 x 8 cores @2.60GHz
- 64GB of RAM
- 2 x 24 SAS drives
The Hadoop cluster was installed from the Cloudera Data Hub (CDH) distribution, version 5.7.0, which includes:
- Hadoop core 2.6.0
- Impala 2.5.0
- Hive 1.1.0
- HBase 1.2.0 (configured JVM heap size for region servers = 30GB)
- (not from CDH) Kudu 1.0 (configured memory limit = 30GB)
Apache Impala (incubating) was used as the data ingestion and data access framework in all of the tests presented later in this report.
I would have liked to have seen ORC included as a file format for testing. Regardless, I think this article shows that there are several file formats for a reason, and you should choose your file format based on most likely expected use. For example, Avro or Parquet for “write-only” systems or Kudu for larger-scale analytics.
This release also adds support for Spark 2, including Spark 2.1. Zeppelin now links to the Spark History Server UI so users can more easily track Spark jobs. The Livy interpreter now supports specifying packages with the job.
The major security improvement in Zeppelin 0.7.0 is using Apache Knox's LDAP Realm to connect to LDAP. The Zeppelin home page now lists only the notes the user is authorized to access. Zeppelin also now supports PAM-based authentication.
The full list of improvements is available here.
This visualization platform is growing up nicely.
Using sparklyr enables you to analyze big data on Amazon S3 with R smoothly. You can build a Spark cluster easily with Cloudera Director. sparklyr makes Spark a backend for dplyr, so you can create tidy data from huge, messy data, plot complex maps from this big data the same way as with small data, and build a predictive model from big data with MLlib. I believe sparklyr helps all R users perform exploratory data analysis faster and easier on large-scale data. Let’s try!
You can see the Rmarkdown of this analysis on RPubs. With RStudio, you can share Rmarkdown easily on RPubs.
Sparklyr is an exciting technology for distributed data analysis.
In this post, I go through the process of setting up the encryption of data at multiple levels using security configurations with EMR. Before I dive deep into encryption, here are the different phases where data needs to be encrypted.
Data at rest
- Data residing on Amazon S3—S3 client-side encryption with EMR
- Data residing on disk—the Amazon EC2 instance store volumes (except boot volumes) and the attached Amazon EBS volumes of cluster instances are encrypted using Linux Unified Key Setup (LUKS)
Data in transit
- Data in transit from EMR to S3, or vice versa—S3 client-side encryption with EMR
- Data in transit between nodes in a cluster—in-transit encryption via Secure Sockets Layer (SSL) for MapReduce and Simple Authentication and Security Layer (SASL) for Spark shuffle encryption
- Data being spilled to disk or cached during a shuffle phase—Spark shuffle encryption or LUKS encryption
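For illustration, an EMR security configuration covering these phases might look roughly like the JSON below (the KMS key ARN and S3 certificate path are placeholders; check the EMR documentation for the exact schema and supported encryption modes):

```json
{
  "EncryptionConfiguration": {
    "EnableAtRestEncryption": true,
    "AtRestEncryptionConfiguration": {
      "S3EncryptionConfiguration": {
        "EncryptionMode": "CSE-KMS",
        "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
      },
      "LocalDiskEncryptionConfiguration": {
        "EncryptionKeyProviderType": "AwsKms",
        "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
      }
    },
    "EnableInTransitEncryption": true,
    "InTransitEncryptionConfiguration": {
      "TLSCertificateConfiguration": {
        "CertificateProviderType": "PEM",
        "S3Object": "s3://mybucket/certs/my-certs.zip"
      }
    }
  }
}
```

A configuration like this can then be registered with `aws emr create-security-configuration` and referenced when launching the cluster.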
Turns out this is rather straightforward.
Security Best Practice #2: Require Authentication for All Services. While it’s important for ports to be accessible exclusively from the network segment(s) that require access, you need to go a step further to ensure that only specific users are authorized to access the services running on these ports. All MapR services — regardless of their accessibility — should require authentication. A good way to enforce this for MapR platform components is by turning on security. Note that MapR is the only big data platform that allows for username/password-based authentication with the user registry of your choice, obviating the need for Kerberos and all the complexities that Kerberos brings (e.g., setting up and managing a KDC). MapR supports Kerberos, too, so environments that already have it running can use it with MapR if preferred.
There’s nothing here which is absolutely groundbreaking, but they are good practices.