Data Lake Organization Tips

Melissa Coates has some great advice for people working with data lakes:

Q: Partitioning by date is common. Where should the dates go in the folder hierarchy?

Almost always, you will want the dates to be at the end of the folder path. This is because we often need to set security at specific folder levels (such as by subject area), but we rarely set up security based on time elements.

Optimal for folder security: \SubjectArea\DataSource\YYYY\MM\DD\FileData_YYYY_MM_DD.csv

Tedious for folder security: \YYYY\MM\DD\SubjectArea\DataSource\FileData_YYYY_MM_DD.csv
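To make the "dates at the end" layout concrete, here is a minimal Python sketch that builds such a path. The function name `lake_path`, the `stem` parameter, and the sample values `Sales` and `PointOfSale` are hypothetical, chosen only for illustration. It also uses forward slashes via `PurePosixPath`, where the example paths above use backslashes.

```python
from datetime import date
from pathlib import PurePosixPath


def lake_path(subject_area: str, data_source: str, day: date,
              stem: str = "FileData") -> PurePosixPath:
    """Build a data lake path with the date parts at the end.

    Subject area and data source come first so that folder-level
    security can be applied there, above the date folders.
    """
    # File name mirrors the FileData_YYYY_MM_DD.csv pattern.
    filename = f"{stem}_{day:%Y_%m_%d}.csv"
    return PurePosixPath(
        subject_area,
        data_source,
        f"{day:%Y}",  # four-digit year
        f"{day:%m}",  # zero-padded month
        f"{day:%d}",  # zero-padded day
        filename,
    )


# Hypothetical subject area and data source names.
print(lake_path("Sales", "PointOfSale", date(2019, 1, 15)))
# prints Sales/PointOfSale/2019/01/15/FileData_2019_01_15.csv
```

With this ordering, a permission granted on the `Sales` folder covers every date folder beneath it, so no per-date security setup is needed as new days arrive.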

Click through for all of Melissa’s advice in FAQ form.
