Michael Rys has a boatload of new updates for Azure Data Lake:
The top items include expanding our built-in support for standard file formats with native Parquet support for extractors and outputters (in public preview) and ORC (in private preview)!
In addition, since the fast file set feature now has been generally released, we can consume hundreds of thousands of such files in bulk in a single EXTRACT statement. We will publish a blog at a later date to give you much more detailed information on how this capability helps you to process so many files efficiently in a scalable way.
Important aspects of processing files at scale include:
-
the ability to generate many files from a rowset in a single statement, providing a way to dynamically partition the data for future use with Hadoop or Spark, or to provide individual files for customers. This has been our top customer ask on the ADL Feedback forum –and now it is in private preview!
-
the ability to handle many small files. We recommend that you make your files large enough for the processing to be efficient (300MB to 4GB is a good range), but often, your file formats (e.g., images) or data ingestion pipelines (e.g., EventHub archives) are not able to reach that size. Thus, we are adding the ability to group several files into a vertex to increase efficiency and lower cost of your job (we have seen 10 to 30 times improvement in some customer jobs!).
Read on for the full changelog.