Azure Data Lake Store Best Practices

Ust Oldfield provides recommendations on how to size and lay out files in Azure Data Lake Store:

The format of the file has a huge implication for the storage and parallelisation. Splittable formats – files which are row oriented, such as CSV – are parallelizable as data does not span extents. Non-splittable formats, however, – files what are not row oriented and data is often delivered in blocks, such as XML or JSON – cannot be parallelized as data spans extents and can only be processed by a single vertex.

In addition to the storage of unstructured data, Azure Data Lake Store also stores structured data in the form of row-oriented, distributed clustered index storage, which can also be partitioned. The data itself is held within the “Catalog” folder of the data lake store, but the metadata is contained in the data lake analytics. For many, working with the structured data in the data lake is very similar to working with SQL databases.

This is the type of thing that you can easily forget about, but it makes a huge difference down the line.

Related Posts

Running H2O In R On Azure HDInsight

Daisy Deng shows how to configure HDInsight to be able to run the H2O package in R rather than Python or Scala: We provide a few script actions for installing rsparkling on Azure HDInsight. When creating the HDInsight cluster, you can run the following script action for header node: https://bostoncaqs.blob.core.windows.net/scriptaction/scriptaction-head.sh And run the following action […]

Read More

Bacpacing In Azure

Derik Hammer shows how to use a bacpac file to deploy an existing database to Azure SQL Database: The recommended method for working with Azure is always PowerShell. The Azure portal and SSMS are tools there for your convenience but they do not scale well. If you have multiple databases to migrate, potentially from multiple […]

Read More

Categories

April 2017
MTWTFSS
« Mar May »
 12
3456789
10111213141516
17181920212223
24252627282930