Press "Enter" to skip to content

Category: Hadoop

Using Polybase To Insert Into HDFS

I have a post on writing to HDFS using Polybase:

What’s interesting is the error message itself is correct, but could be confusing.  Note that it’s looking for a path with this name, but it isn’t seeing a path; it’s seeing a file with that name.  Therefore, it throws an error.

This proves that you cannot control insertion into a single file by specifying the file at create time.  If you do want to keep the files nicely packed (which is a good thing for Hadoop!), you could run a job on the Hadoop cluster to concatenate all of the results of the various files into one big file and delete the other files.  You might do this as part of a staging process, where Polybase inserts into a staging table and then something kicks off an append process to put the data into the real tables.

Sometime in the future, I plan to see how it scales:  with multiple files writing to a multi-node Hadoop cluster, do I get better write performance with a Polybase scaleout cluster?  And if so, how close to linear scale can I get?

Comments closed

Connecting Apache Drill To Power BI

Bryan Smith shows how to connect Apache Drill to Power BI:

Clicking Next takes me to the From ODBC dialog.  Here, I click on the Advanced options item, ignoring the Data Source Name (DSN) drop-down, and enter a connection string with the appropriate substitution for the host parameter:

driver={MapR Drill ODBC Driver};connectiontype=Direct;;port=31010;authenticationtype=No Authentication

Notice the connection string employs a Direct connection type, indicating that the app will speak directly to one of the nodes in the cluster (as identified by the host parameter) and not to the ZooKeeper service. ZooKeeper is in use on the cluster but is not exposed externally, given the network security group changes made during my earlier deployment.  Even if ZooKeeper were exposed, it tracks the nodes of the cluster using their internal names so that any app outside the virtual network containing the cluster would not be able to leverage the information in ZooKeeper to form a connection.  The only option that works here is the Direct connection type.

It’s worth reading the whole thing, as well as checking out the UserVoice suggestion for implementing full Apache Drill support.

Comments closed

Debugging Spark In HDInsight

Sajib Mahmood gives various methods for debugging Spark applications running on an HDInsight cluster:

Spark Application Master

To access Spark UI for the running application and get more detailed information on its execution use the Application Master link and navigate through different tabs containing more information on jobs, stages, executors and so on.

These methods also apply for on-prem Spark clusters, although the resource locations might be a little different.

Comments closed

Hadoop And Active Directory

RK Kuppala explains how to integrate a Hadoop cluster with Active Directory:

This post explains kerberizing an existing Hadoop cluster using Ambari. Kerberos helps with the Authentication part of enterprise security (while authorization, auditing and data protection being the remaining parts).

HDP uses Kerberos, which is an industry standard for authenticate users and resources and providing strong identity for users. Apache Ambari can kerberize an existing cluster by using an existing MIT key distribution center (KDC) or Microsoft’s Active Directory.

This was a lot easier than I expected.

Comments closed

Taxi Rides And Amazon Athena

Mark Litwintschik looks at using Amazon Athena to process the New York City taxi rides data set:

It’s important to note that Athena is not a general purpose database. Under the hood is Presto, a query execution engine that runs on top of the Hadoop stack. Athena’s purpose is to ask questions rather than insert records quickly or update random records with low latency.

That being said, Presto’s performance, given it can work on some of the world’s largest datasets, is impressive. Presto is used daily by analysts at Facebook on their multi-petabyte data warehouse so the fact that such a powerful tool is available via a simple web interface with no servers to manage is pretty amazing to say the least.

Athena is Amazon’s response to Azure Data Lake Analytics.  Check out Mark’s blog post for a good way of getting started with Athena.

Comments closed

Hortonworks Data Flow 2.1

Wei Wang and Haimo Liu announce Hortonworks Data Flow version 2.1:

In the release of HDF 2.1, data flow administrators within the enterprise can identify that in order for certain potential processors to be added to a working data flow system, additional authorization would be required.

In addition, HDF 2.1 supports over 180 processors including newly introduced Connect/Listen/PutWebSocket, Put/FetchElasticsearch5, ValidateCsv, etc.

HDF is Hortonworks’s big play on simplifying streaming operations in Hadoop.

Comments closed

ETL With Spark

Eric Maynard demonstrates that moving data across Hadoop clusters can be sped up by using Spark:

By leveraging Spark for distribution, we can achieve the same results much more quickly and with the same amount of code. By keeping data in HDFS throughout the process, we were able to ingest the same data as before in about 36 seconds. Let’s take a look at Spark code which produced equivalent results as the bash script shown above — note that a more parameterized version of this code code and of all code referenced in this article can be found down below in the Resources section.

Read the whole thing.

Comments closed

Log Aggregation With Kafka And Redis

Asaf Yigal has a two-part series on comparing Apache Kafka and Redis for moving log events into Elasticsearch.  Part 1 explains the technologies:

Redis is a bit different from Kafka in terms of its storage and various functionalities. At its core, Redis is an in-memory data store that can be used as a high-performance database, a cache, and a message broker. It is perfect for real-time data processing.

The various data structures supported by Redis are strings, hashes, lists, sets, and sorted sets. Redis also has various clients written in several languages which can be used to write custom programs for the insertion and retrieval of data. This is an advantage over Kafka since Kafka only has a Java client. The main similarity between the two is that they both provide a messaging service. But for the purpose of log aggregation, we can use Redis’ various data structures to do it more efficiently.

Part 2 compares the two technologies and explains which works better when:

Kafka heavily relies on the machine memory (RAM). As we see in the previous graph, utilizing the memory and storage is an optimal way to maintain a steady throughput. Its performance depends on the data consumption rate. In the case that consumers don’t consume data fast enough, Kafka will have to read from a disk and not from memory which will slow down its performance.

As you might expect, the answer for which technology to use is “it depends.”

Comments closed

Apache Ranger On ElasticMapReduce

Varun Rao explains role-based access control using Apache Ranger on Amazon ElasticMapReduce:

Using the HUE SQL Editor, execute the following query.

These queries use external tables, and Hive leverages EMRFS to access the data stored in S3. Because HiveServer2 (where Hue is submitting these queries) is checking with Ranger to grant or deny before accessing any data in S3, you can create fine-grained SQL-based permissions for users even though there is a single EC2 role specified for the cluster (which is used by all requests the cluster makes to S3). For more information, see Additional Features of Hive on Amazon EMR.

If your job includes securing a Hadoop cluster, this is a nice read, even if you don’t use EMR.

Comments closed

Polybase MapReduce Containers

I have a post looking at how Polybase generates MapReduce containers:

Once we did that and I restarted all of the services, I ended up getting an interesting error message from SQL Server:

Msg 7320, Level 16, State 110, Line 2
Cannot execute the query “Remote Query” against OLE DB provider “SQLNCLI11” for linked server “(null)”. EXTERNAL TABLE access failed due to internal error: ‘Java exception raised on call to JobSubmitter_SubmitJob: Error [org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=1536, maxMemory=512

The error message is pretty clear:  the Polybase service wants to create containers that are 1536 MB in size, but the maximum size I’m allowing is 512 MB.  Therefore, the Polybase MapReduce operation fails.

Long story short, I needed enough RAM to be able to give 4 1/2 GB to YARN for creating MapReduce containers in order to run my query.

Comments closed