Press "Enter" to skip to content

Category: Cloud

Using Hive LLAP On ElasticMapReduce

Jigar Mistry shows how to configure and use Hive LLAP on AWS’s ElasticMapReduce:

With many options available in the market (Presto, Spark SQL, etc.) for doing interactive SQL  over data that is stored in Amazon S3 and HDFS, there are several reasons why using Hive and LLAP might be a good choice:

  • For those who are heavily invested in the Hive ecosystem and have external BI tools that connect to Hive over JDBC/ODBC connections, LLAP plugs in to their existing architecture without a steep learning curve.

  • It’s compatible with existing Hive SQL and other Hive tools, like HiveServer2, and JDBC drivers for Hive.

  • It has native support for security features with authentication and authorization (SQL standards-based authorization) using HiveServer2.

  • LLAP daemons are aware of the columns and records that are being processed, which enables you to enforce fine-grained access control.

  • It can use Hive’s vectorization capabilities to speed up queries, and Hive has better support for Parquet file format when vectorization is enabled.

  • It can take advantage of a number of Hive optimizations like merging multiple small files for query results, automatically determining the number of reducers for joins and groupbys, etc.

  • It’s optional and modular so it can be turned on or off depending on the compute and resource requirements of the cluster. This lets you run other YARN applications concurrently without reserving a cluster specifically for LLAP.

Read on for more details, including the bootstrap action you need to take and how to use LLAP once you have it configured.
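On the JDBC point specifically, here is a minimal sketch of querying Hive through HiveServer2 (which fronts the LLAP daemons) from Python using PyHive. The host name, credentials, and table are placeholder assumptions rather than anything from Jigar's post, so adjust them for your own EMR cluster.

# A minimal sketch, assuming PyHive is installed (pip install "pyhive[hive]")
# and HiveServer2 is listening on the EMR master node's default port 10000.
from pyhive import hive

# Placeholder host and user -- replace with your cluster's values.
conn = hive.connect(host='emr-master-node', port=10000, username='hadoop')
cursor = conn.cursor()

# The same HiveQL your BI tools already send works unchanged; with LLAP
# enabled, the daemons serve the scans and cache hot data.
cursor.execute('SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page LIMIT 10')
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()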


Going Back From The Cloud

Arun Sirpal notes that you can take a cloud database back to on-premises:

The Challenge: I am going to write about a way to move from Azure SQL Database (Platform as a Service) back to a local SQL Server. I did encounter errors along the way, but more importantly, I have written up how to avoid/solve them.

Another key point: I made sure that there were no connections to the database when doing the below, as I didn’t want in-flight data movement whilst doing it. If you can’t do this, then you should create a copy of the database and work from that.

It’s not a trivial operation, but Arun does walk us through the steps.
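For a rough idea of the mechanics, here is a hedged sketch of one way to do the round trip with SqlPackage.exe driven from Python: export the Azure SQL Database to a .bacpac, then import it into the local instance. The server names, credentials, and paths are placeholders, not Arun's actual setup, and his post covers the errors you may hit along the way.

# A minimal sketch, assuming SqlPackage.exe is on the PATH and the placeholder
# server names, database names, and credentials are replaced with real values.
import subprocess

# Step 1: export the Azure SQL Database to a .bacpac file.
subprocess.run([
    'SqlPackage.exe', '/Action:Export',
    '/SourceServerName:myserver.database.windows.net',
    '/SourceDatabaseName:MyAzureDb',
    '/SourceUser:sqladmin', '/SourcePassword:<password>',
    r'/TargetFile:C:\temp\MyAzureDb.bacpac',
], check=True)

# Step 2: import the .bacpac into the local SQL Server instance.
subprocess.run([
    'SqlPackage.exe', '/Action:Import',
    r'/SourceFile:C:\temp\MyAzureDb.bacpac',
    '/TargetServerName:localhost',
    '/TargetDatabaseName:MyAzureDb',
], check=True)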


More On S3Guard

Aaron Fabbri describes how S3Guard works:

Although Apache Hadoop has support for using Amazon Simple Storage Service (S3) as a Hadoop filesystem, S3 behaves differently than HDFS. One of the key differences is in the level of consistency provided by the underlying filesystem. Unlike HDFS, S3 is an eventually consistent filesystem. This means that changes made to files on S3 may not be visible for some period of time.

Many Hadoop components, however, depend on HDFS consistency for correctness. While S3 usually appears to “work” with Hadoop, there are a number of failures that do sometimes occur due to inconsistency:

  • FileNotFoundExceptions. Processes that write data to a directory and then list that directory may fail when the data they wrote is not visible in the listing.  This is a big problem with Spark, for example.

  • Flaky test runs that “usually” work. For example, our root directory integration tests for Hadoop’s S3A connector occasionally fail due to eventual consistency. This is due to assertions about the directory contents failing. These failures occur more frequently when we run tests in parallel, increasing stress on the S3 service and making delayed visibility more common.

  • Missing data that is silently dropped. Multi-step Hadoop jobs that depend on output of previous jobs may silently omit some data. This omission happens when a job chooses which files to consume based on a directory listing, which may not include recently-written items.

Worth reading if you’re looking at using S3 to store data for Hadoop.  Also check out an earlier post on the topic.
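To make the list-after-write problem concrete, here is a small hedged sketch using boto3; the bucket and key names are placeholder assumptions. It mimics what a multi-step Hadoop job does when one step writes output and the next step lists the "directory" to decide what to read, which is exactly where a stale listing can silently drop data.

# A minimal sketch, assuming boto3 is installed, credentials are configured,
# and the placeholder bucket exists and is writable.
import boto3

s3 = boto3.client('s3')
bucket = 'my-example-bucket'

# Step 1 of a multi-step job: write a few output "files" under a prefix.
for i in range(3):
    s3.put_object(Bucket=bucket, Key=f'job-output/part-{i}.txt', Body=b'some data')

# Step 2: a downstream job lists the prefix to decide which files to consume.
# With an eventually consistent listing, this may return fewer than 3 keys,
# and the downstream job would silently skip the missing data -- the failure
# mode S3Guard's consistent metadata store is designed to prevent.
listing = s3.list_objects_v2(Bucket=bucket, Prefix='job-output/')
keys = [obj['Key'] for obj in listing.get('Contents', [])]
print(f'Listing returned {len(keys)} of 3 expected keys: {keys}')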


Auto-Install Docker And SQL Tools On Linux

Andrew Pruski has a script on GitHub:

So I’ve created a repository on GitHub that pulls together the code from Docker to install the Community Edition and the code from Microsoft to install the SQL command line tools.

The steps it performs are:

  • Installs the Docker Community Edition

  • Installs the SQL Server command line tools

  • Pulls the latest SQL Server on Linux image from the Docker Hub

Read on for more details and some limitations.
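If you want to script the image pull and container startup yourself, here is a hedged sketch using the Docker SDK for Python; the image tag, SA password, and port mapping are placeholders, and Andrew's repository does the equivalent with shell scripts.

# A minimal sketch, assuming the Docker SDK for Python is installed
# (pip install docker) and the Docker daemon is running locally.
import docker

client = docker.from_env()

# Pull the SQL Server on Linux image from the Docker Hub (tag is a placeholder).
client.images.pull('microsoft/mssql-server-linux', tag='latest')

# Run a container; the SA password here is only an example -- use your own.
container = client.containers.run(
    'microsoft/mssql-server-linux:latest',
    environment={'ACCEPT_EULA': 'Y', 'SA_PASSWORD': 'P@ssw0rd!Example'},
    ports={'1433/tcp': 1433},
    name='sql-on-linux',
    detach=True,
)
print(f'Started container {container.short_id}')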


Web Editor For Azure Analysis Services

James Serra shows off a preview of the Azure Analysis Services web designer for tabular models:

Microsoft has released a preview of the Azure Analysis Services web designer. This is a browser-based experience that will allow developers to start creating and managing Azure Analysis Services (AAS) semantic models quickly and easily. SQL Server Data Tools (SSDT) and SQL Server Management Studio (SSMS) will still be the primary tools for development, but this new designer gives you another option for creating a new model or doing things such as adding a new measure to a development or production AAS model.

A highly requested feature is that you can import a Power BI Desktop file (.pbix) into an Analysis Services database. And once imported, you can reverse engineer it to Visual Studio. Note that for PBIX import, only Azure SQL Database, Azure SQL Data Warehouse, Oracle, and Teradata are supported at this time, and Direct Query models are not yet supported for import (Microsoft will be adding new connection types for import every month).

Read on for more details.


Azure SQL Database Deployment Account Errors

Steve Jones troubleshoots an issue with Azure SQL Database:

I’ve had most builds work really well. I tried a number of things, but kept getting a few items in the build. There were login errors or network errors, both of which bothered me since I could manually log in with SSMS from the same machine as my build agent.

I suspected a few things here, one of which was the use of named pipes for the Shadow database and TCP for Azure SQL Database.

Eventually, I decided to fall back to msbuild, ignoring VSTS, and to make sure all my parameters were correct.

Read on for the rest of the story.
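As a rough picture of what that fallback can look like, here is a hedged sketch that builds the database project with msbuild and then publishes the resulting dacpac to Azure SQL Database with SqlPackage.exe, driven from Python. The project name, output path, and connection string are placeholders rather than Steve's actual parameters.

# A minimal sketch, assuming msbuild and SqlPackage.exe are on the PATH and the
# placeholder project path, dacpac path, and connection string are replaced.
import subprocess

# Build the SSDT project locally to produce a .dacpac, taking VSTS out of the picture.
subprocess.run(
    ['msbuild', 'MyDatabase.sqlproj', '/t:Build', '/p:Configuration=Release'],
    check=True)

# Publish the .dacpac straight to the Azure SQL Database to confirm that the
# connection parameters themselves are correct.
subprocess.run([
    'SqlPackage.exe', '/Action:Publish',
    '/SourceFile:bin/Release/MyDatabase.dacpac',
    '/TargetConnectionString:Server=tcp:myserver.database.windows.net,1433;'
    'Database=MyDb;User ID=builduser;Password=<password>;Encrypt=True;',
], check=True)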


Multi-Migration To Azure

Kevin Hill shows how to migrate multiple databases to Azure SQL Database:

Since I might be doing a similar thing in the near future, I decided to see if I could push 4-5 small test databases to Azure SQLDB at the same time (a SQL on Azure VM test may come later). If this happens with my client, it will be hundreds of databases, with the same schema but wildly varying sizes.

The “Plan”

  • Create 5 sample databases

  • Create .bacpacs on local drive

  • Create Azure Logical Server if not already done (see step 4)

  • Create a Notepad of SQLPackage.exe commands, one for each .bacpac

  • Run with “start” in CMD to fire off separate windows

  • Wait.

  • Enjoy.

Worth reading the whole thing.
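Here is a hedged sketch of steps 4 and 5 done from Python instead of Notepad and CMD: it builds one SqlPackage.exe import command per .bacpac and launches them all without waiting, which is roughly what firing off separate windows with start achieves. The server name, credentials, and folder are placeholders.

# A minimal sketch, assuming SqlPackage.exe is on the PATH, the .bacpac files sit
# in the placeholder folder, and the Azure logical server and credentials are yours.
import glob
import os
import subprocess

processes = []
for bacpac in glob.glob(r'C:\bacpacs\*.bacpac'):
    db_name = os.path.splitext(os.path.basename(bacpac))[0]
    # One import per .bacpac, launched without waiting (like "start" in CMD).
    proc = subprocess.Popen([
        'SqlPackage.exe', '/Action:Import',
        f'/SourceFile:{bacpac}',
        '/TargetServerName:myserver.database.windows.net',
        f'/TargetDatabaseName:{db_name}',
        '/TargetUser:sqladmin', '/TargetPassword:<password>',
    ])
    processes.append((db_name, proc))

# Wait for every import to finish, then report the results.
for db_name, proc in processes:
    proc.wait()
    print(f'{db_name}: import finished with exit code {proc.returncode}')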


Stream Analytics Into Power BI

Rolf Tesmer shows off how to use Azure Stream Analytics to push data in real time via the Power BI API into your Power BI dashboard:

You can push data to the Power BI streaming dataset API in a few ways… but they generally boil down to these 3 options

  1. Directly call the API from code
  2. Directly call the API from an Azure Logic App
  3. Use Azure Stream Analytics to push data into the API

This blog post extends my previous post, and thus I will be leveraging Option #3 above.

Definitely worth checking out if you are interested in real-time Power BI dashboards.
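Option 1 is the easiest to sketch. Below is a hedged example of pushing rows directly to a Power BI streaming dataset's push URL with the requests library; the URL and the row schema are placeholders that depend on how you defined the dataset, and Rolf's post uses Option #3 (Stream Analytics) rather than this direct call.

# A minimal sketch, assuming a Power BI streaming dataset already exists and its
# push URL (from the dataset's API info page) replaces the placeholder below.
import datetime
import requests

PUSH_URL = 'https://api.powerbi.com/beta/<workspace-id>/datasets/<dataset-id>/rows?key=<key>'

# The rows must match the schema you defined when creating the streaming dataset.
rows = [{
    'timestamp': datetime.datetime.utcnow().isoformat(),
    'sensor': 'device-01',
    'value': 42.0,
}]

response = requests.post(PUSH_URL, json=rows)
response.raise_for_status()
print(f'Pushed {len(rows)} row(s), status {response.status_code}')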


Data Lake Analysis With Excel And Power BI

Sachin C Sheth announces support for Azure Data Lake Store within Excel and Power BI:

Until now, if you had to analyze data stored in ADLS with Excel, you would have to copy it into a relational data store like Azure SQL Data Warehouse or download the data onto a machine, and then use Excel to analyze that data. This was rather cumbersome, involving additional cost and time. With this new support, you can now access files stored in ADLS with Excel in-place, without having to copy them to other stores or locations. You can quickly get advanced insights into raw or prepared data. Models and queries you have created using Excel that ran against local data can be run seamlessly against data stored in ADLS.

Security capabilities of ADLS allow administrators to control access to the data stored in ADLS in a discretionary manner. With this, you can limit the access that Excel users have to the data in ADLS. In this manner, data in the ADLS-based data lake continues to be the single source of truth, with no redundant copies, and can be analyzed by analytics tools of your own choice.

Click through for a demo video.


Azure SQL Database Performance Tips

Arun Sirpal looks at some performance issues in Azure SQL Database:

Let’s assume that you are not driven by logins, workers, and session counts; how does one select the right level? What exactly do DTUs (Database Transaction Units) mean? I suggest reading this post by Andy Mallon: https://sqlperformance.com/2017/03/azure/what-the-heck-is-a-dtu

I am going to undersize my database and create an S0 database and run some day-to-day tasks; let’s see what happens. I will open up connections and issue some queries via my application. I would not class these queries as bad; what I am trying to drive at here is getting the sizing right for your workload.

This is one of the trickier things to get right, I think. We’re taking an existing workload and want to make sure it doesn’t fall over… but we aren’t measuring in terms of DTUs locally. I know that there are some tools that help with the conversion process, but if you’re starting a new product or don’t have a great handle on normal workload, it’s really easy to fall into the Scylla and Charybdis of undersizing and overpaying.
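One way to take some of the guesswork out of sizing is to watch the resource DMV after running a representative workload against the tier you picked. Here is a hedged sketch that queries sys.dm_db_resource_stats from Python with pyodbc; the connection string is a placeholder, and the percentages returned are relative to the tier's limits, so sustained values near 100% suggest the S0 is undersized.

# A minimal sketch, assuming pyodbc and an ODBC driver for SQL Server are
# installed, and the placeholder connection string points at the database under test.
import pyodbc

conn_str = (
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=myserver.database.windows.net;DATABASE=MyDb;'
    'UID=sqladmin;PWD=<password>;Encrypt=yes;'
)

# sys.dm_db_resource_stats keeps roughly an hour of history, one row per ~15 seconds.
query = """
SELECT TOP (12)
       end_time,
       avg_cpu_percent,
       avg_data_io_percent,
       avg_log_write_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;
"""

with pyodbc.connect(conn_str) as conn:
    for row in conn.cursor().execute(query):
        print(row.end_time, row.avg_cpu_percent,
              row.avg_data_io_percent, row.avg_log_write_percent)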
