Press "Enter" to skip to content

Category: Cloud

Azure Data Lake Store Best Practices

Ust Oldfield provides recommendations on how to size and lay out files in Azure Data Lake Store:

The format of the file has a huge implication for storage and parallelisation. Splittable formats – files which are row oriented, such as CSV – are parallelisable, as data does not span extents. Non-splittable formats, however – files which are not row oriented, where data is often delivered in blocks, such as XML or JSON – cannot be parallelised, as data spans extents and can only be processed by a single vertex.

In addition to the storage of unstructured data, Azure Data Lake Store also stores structured data in the form of row-oriented, distributed clustered index storage, which can also be partitioned. The data itself is held within the “Catalog” folder of the data lake store, but the metadata is contained in Azure Data Lake Analytics. For many, working with the structured data in the data lake is very similar to working with SQL databases.

This is the type of thing that you can easily forget about, but it makes a huge difference down the line.


HDInsight 3.6 Available

Ashish Thapliyal points out some Hive improvements in HDInsight 3.6:

2. Create a new Hive table from scratch or alter an existing table

Create a new table by clicking the ‘+’ icon, which opens the create table wizard. Enter the table name and column names, and choose a data type from the dropdown. You can pick the following advanced Hive settings directly from the UI:

  • Transactional: Turn on transaction support in Hive by checking this flag. Note that the table must be bucketed and stored using an ACID-compliant format (such as ORC).

  • Location: Hive stores the data for managed tables in the Hive warehouse directory in HDFS, which is configured in hive-site.xml with the property hive.metastore.warehouse.dir. The default location is /apps/hive/warehouse; it can be changed using the Location text field.

  • File Format: The default file format for the CREATE TABLE statement is ORC. Choose a different format from the file format dropdown if needed.

  • Row Format: Select row format options such as the field terminator, lines terminator, and stored file type.

  • Tables can be altered to add new columns or to change a column’s name or data type.

  • Tables can also be renamed and altered.
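To make those settings concrete, here is a minimal sketch of the sort of transactional, ORC-backed table definition the wizard produces, submitted here through the PyHive library. The host, table name, and bucket count are hypothetical, and the wizard itself requires no code at all:

```python
# A minimal sketch, assuming a reachable HiveServer2 endpoint and the PyHive package.
# Host, credentials, table name, and bucketing choices are hypothetical examples.
from pyhive import hive

conn = hive.connect(host="my-hive-headnode", port=10000, username="admin")
cursor = conn.cursor()

# Transactional tables must be bucketed and stored in an ACID-compliant format
# such as ORC; LOCATION overrides the default warehouse directory.
cursor.execute("""
    CREATE TABLE sales_acid (
        id INT,
        amount DECIMAL(10, 2)
    )
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    LOCATION '/apps/hive/warehouse/sales_acid'
    TBLPROPERTIES ('transactional' = 'true')
""")
```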

Read on for more improvements, including a graphical plan viewer and improved autocomplete.


Taking Advantage Of Azure Elasticity

Arun Sirpal migrated a number of Azure SQL Databases into an elastic pool and configured a series of elastic jobs to support them:

I want to show you how I went from having multiple single SQL databases in Azure to a database elastic pool within a new dedicated SQL Server. Once set up, I create and use elastic jobs. This post is long, but I am sure you will find it useful.

APPROACH TAKEN

  • Create a new “logical” SQL Server.

  • Create a new elastic pool within this logical SQL Server.

  • Move the data from the old single SQL databases to the above elastic pool (there are a couple of ways to do this, but I used built-in backups).

  • Confirm application connection.

  • Decommission single SQL databases.

  • Create / setup an elastic job.

  • Check the controller database.
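As a rough illustration of the move step: a database that already lives on the target logical server can also be assigned to an elastic pool with a single ALTER DATABASE statement. The sketch below runs that statement through pyodbc; the server, credentials, database, and pool names are made up, and Arun used point-in-time restores rather than this approach:

```python
# A minimal sketch, assuming pyodbc and the ODBC Driver for SQL Server are installed.
# Server, credentials, database, and pool names are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=mylogicalserver.database.windows.net;"
    "DATABASE=master;UID=serveradmin;PWD=...",
    autocommit=True,  # ALTER DATABASE cannot run inside a transaction
)

# Assign an existing database on this logical server to an elastic pool.
conn.execute(
    "ALTER DATABASE [SalesDb] "
    "MODIFY (SERVICE_OBJECTIVE = ELASTIC_POOL(name = MyElasticPool))"
)
```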

Definitely worth reading if you are looking at hosting multiple databases in Azure.


Handling Runbook Alerts

Grant Fritchey shows how to set up alerting when an Azure automation job fails:

Believe it or not, there’s not an immediately obvious “Oh, you had an error in your Automation script, here’s how you alert someone” setting in the Azure portal. Now, you could simply put error handling in your PowerShell script. In fact, it’s probably not at all a bad idea to do that as well. However, what you would not get by setting things up that way is a mechanism for managing the alerts, history, and additional possible responses (like firing off another Runbook, although there is a way to do that from PowerShell too). Instead, what I want is a way to manage alerts through the Azure fabric.

If you do a search, there is an Azure Alert service. However, it didn’t seem to be really what I was looking for. Further, I found it extremely difficult (OK, I couldn’t make it work) to connect the alerts directly to the Jobs related to my Runbooks. Instead, after quite a bit of research, what I found is that a combination of Azure Log Analytics with the Operations Management Suite (OMS) will do exactly what I’m looking for.

Click through to read how to set this up.


Learning Azure

Grant Fritchey notes that web searches won’t always take you to the latest version of documentation:

If you’re learning Azure and you research things using a search engine, then I strongly recommend you use the ability to limit your searches to the last year. Otherwise, you may be getting incomplete or incorrect data. At this precise moment, I’d say you need to limit your searches to Google (although I honestly hate recommending one of these tools over the other, let’s keep the competition fierce) because I was able to easily get the correct information within a couple of mouse clicks.

Grant’s post makes sense, and so does the search engine behavior: in Grant’s case, those older cmdlet documentation links have been around longer, and older resources tend to have a larger number of relevant linkbacks and clicks. That’s also visible in SQL Server documentation, where sometimes you’ll land on the 2008 R2 or 2012 version of a page rather than 2016 or vNext.

Meanwhile, Victoria Holt has a bunch of resources for the Azure curious:

Here are a whole set of links to kick start your learning of Microsoft Azure services.

Introduction video

Changes to computer thinking – Stephen Fry explains cloud computing

That’s a good set of starting links.


Scalable Data Analytics

David Smith covers a recent Microsoft Data Science team talk at Strata:

The tutorial covers many different techniques for training predictive models at scale and deploying the trained models as predictive engines within production environments. Among the technologies you’ll use are Microsoft R Server running on Spark, the SparkR package, the sparklyr package, and H2O (via the rsparkling package). It also touches on some non-Spark methods, like the bigmemory and ff packages for R (and various other packages that make use of them), and using the foreach package for coarse-grained parallel computations. You’ll also learn how to create prediction engines from these trained models using the mrsdeploy package.

Check out the post as well as the tutorial David links.


Encrypting Kinesis Records

Temitayo Olajide shows how to use Amazon’s Key Management Service to encrypt and decrypt Kinesis messages:

In this post you build encryption and decryption into sample Kinesis producer and consumer applications using the Amazon Kinesis Producer Library (KPL), the Amazon Kinesis Client Library (KCL), AWS KMS, and the aws-encryption-sdk. The methods and techniques used in this post to encrypt and decrypt Kinesis records can easily be replicated in your architecture. Some constraints:

  • AWS charges for the use of KMS API requests for encryption and decryption; for more information, see AWS KMS Pricing.

  • You cannot use Amazon Kinesis Analytics to query Amazon Kinesis Streams with records encrypted by clients in this sample application.

  • If your application requires low-latency processing, note that encryption and decryption introduce a slight latency hit.
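The sample applications in the AWS post are built around the KPL and KCL; as a rough Python sketch of the same idea on the producer side (the stream name, KMS key ARN, and the 1.x-style aws-encryption-sdk API are assumptions on my part), you envelope-encrypt each record with a KMS-backed master key and then put the ciphertext onto the stream:

```python
# A minimal sketch using boto3 and the 1.x-style aws-encryption-sdk API;
# the stream name and KMS key ARN are hypothetical.
import boto3
import aws_encryption_sdk

key_provider = aws_encryption_sdk.KMSMasterKeyProvider(
    key_ids=["arn:aws:kms:us-east-1:111122223333:key/example-key-id"]
)
kinesis = boto3.client("kinesis", region_name="us-east-1")

def put_encrypted_record(payload: bytes, partition_key: str) -> None:
    # Envelope-encrypt the payload; KMS protects the generated data key.
    ciphertext, _header = aws_encryption_sdk.encrypt(
        source=payload, key_provider=key_provider
    )
    kinesis.put_record(
        StreamName="encrypted-orders",
        Data=ciphertext,
        PartitionKey=partition_key,
    )

put_encrypted_record(b'{"order_id": 42, "amount": 19.99}', partition_key="42")
```

A consumer would reverse the process, calling aws_encryption_sdk.decrypt on each record’s data before handing it to the application.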

Check it out, especially if you’re thinking about streaming sensitive data.


Introduction To Amazon Kinesis

Jen Underwood describes Amazon Kinesis:

Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. Amazon Kinesis is ideal for Internet of Things (IoT) use cases. It can collect and process hundreds of terabytes of data per hour from hundreds of thousands of sources, allowing you to easily write applications that process information in real time from sources such as website clickstreams, Raspberry Pi gadgets, devices, social media, operational logs, metering data, and more.

With Amazon Kinesis, you can build real-time dashboards, capture exceptions, execute algorithms, and generate alerts. With point-and-click menus, you can ingest data, query it and then send output to a variety of destinations including but not limited to Amazon S3, Amazon EMR, Amazon DynamoDB, or Amazon Redshift.
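If you just want to poke at a stream before committing to the KPL/KCL, a minimal boto3 sketch like the one below shows the basic put/get round trip; the stream name, region, and record contents are hypothetical, and the stream is assumed to already exist:

```python
# A minimal sketch, assuming boto3 is configured with AWS credentials
# and a Kinesis stream named "clickstream" already exists.
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Producer: write one record, partitioned by a user identifier.
kinesis.put_record(
    StreamName="clickstream",
    Data=b'{"user": "u123", "page": "/pricing"}',
    PartitionKey="u123",
)

# Consumer: read from the start of the first shard.
shard_id = kinesis.describe_stream(StreamName="clickstream")[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="clickstream",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator)["Records"]:
    print(record["Data"])
```

For anything beyond experimentation, the KCL handles shard iteration, checkpointing, and resharding for you rather than this raw polling loop.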

Kinesis is powerful, especially if you’re already locked into the AWS platform.  My preference is Apache Kafka, but Kinesis is definitely worth learning about.


doAzureParallel

JS Tan announces a new R package:

For users of the R language, scaling up their work to take advantage of cloud-based computing has generally been a complex undertaking. We are therefore excited to announce doAzureParallel, a lightweight R package built on Azure Batch that allows you to easily use Azure’s flexible compute resources right from your R session. The doAzureParallel package complements Microsoft R Server and provides the infrastructure you need to run massively parallel simulations on Azure directly from R.

The doAzureParallel package is a parallel backend for the popular foreach package, making it possible to execute multiple processes across a cluster of Azure virtual machines with just a few lines of R code. The package helps you create and manage the cluster in Azure, and register it as a parallel backend to be used with foreach.

It’s an interesting alternative to building beefy R servers.
