Press "Enter" to skip to content

Category: Data Lake

Azure Data Lake ACLs

Saveen Reddy introduces file and folder level Access Control Lists for Azure Data Lake storage:

We’ve emphasized that Azure Data Lake Store is compatible with WebHDFS. Now that ACLs are fully available, it’s important to understand the ACL model in WebHDFS/HDFS because they are POSIX-style ACLs and not Windows-style ACLs.  Before we five deep into the details on the ACL model, here are key points to remember.

  • POSIX-STYLE ACLs DO NOT ALLOW INHERITANCE. For those of you familiar with POSIX ACLs, this is not a surprise. For those coming from a Windows background this is very important to keep in mind. For example, if Alice can read files in folder /foo, it does not mean that she can rad files in /foo/bar. She must be granted explicit permission to /foo/bar. The POSIX ACL model is different in some other interesting ways, but this lack of inheritance is the most important thing to keep in mind.

  • ADDING A NEW USER TO DATA LAKE ANALYTICS REQUIRES A FEW NEW STEPS. Fortunately, a portal wizard automates the most difficult steps for you.

This is an interesting development.

Comments closed

Reading From The Data Lake

Bill Vorhies discusses technologies to analyze and use data in a data lake:

So the takeaway that many DB developers would have you believe is ‘Hadoop Good’, ‘RDBMS Bad’.

But wait.  RDBMS EDW hasn’t gone away and won’t. That’s where we keep our single version of the truth, the business data that record legal transactions with customers, suppliers, and employees.  We also get strong SLAs, strong fault tolerance, and highly curated data based on strong ETL, provenance, and governance.  Those are all things that are missing in our Data Lake.

Anybody who sells you on one technology to solve all problems is shilling snake oil.  Bill’s answer is an Adjunct Data Warehouse, which sits separate from the Enterprise Data Warehouse.  You go to the EDW when you risk getting fired or going to jail if the data’s wrong; you go to the ADW when you need data not in your EDW, or when you need larger-scale analytics in which it’s okay to be 1% off.

Comments closed

Visiting Production

Randolph West discusses production access:

During a recent client meeting about a database migration, I realised that I have never logged into a SQL Server on their production environment. My involvement has been strictly dealing with setting up the new environment and log shipping the backups.

I get that I’m not a full-service DBA for this client, but it got me wondering about the many security discussions I’ve seen and participated in, in the past: that not even a junior DBA might need access to production database systems, if it’s not within the scope of his or her work.

Limiting production access is a smart move, but it’s important to realize the downstream consequences:  the people who still have access to production will (at least in the short term) have to perform a lot of the tasks that others were doing previously, including data fixes, research, etc.  It’s important to be prepared for that.

Comments closed

Overlapping Ranges Using U-SQL

Michael Rys explains how to merge overlapping ranges of data using U-SQL:

If you look at the problem, you will at first notice that you want to define something like a user-defined aggregation to combine the overlapping time intervals. However, if you look at the input data, you will notice that since the data is not ordered, you will either have to maintain the state for all possible intervals and then merge disjoint intervals as bridging intervals appear, or you need to preorder the intervals for each user name to make the merging of the intervals easier.

The ordered aggregation is simpler to scale out, but U-SQL does not provide ordered user-defined aggregators (UDAGGs) yet. In addition, UDAGGs normally produce one row per group, while in this case, I may have multiple rows per group if the ranges are disjoint.

Luckily, U-SQL provides a scalable user-defined operator called a reducer which gives us the ability to aggregate a set of rows based on a grouping key set using custom code.

There are some good insights here, so read the whole thing.

Comments closed

Relational Data In Data Lakes

Shankar Selvam discusses one company’s tool for bringing relational data into a data lake:

The next step in building this pipeline is to configure the sink or destination for the imported data. Hydrator provides capabilities to store data in time-partitioned directories via a built-in CDAP Dataset called Time-partitioned File Set.  Once the data is stored in the fileset, CDAP automatically adds a partition which can be queried using Hive.

In this use case we will configure a Time-partitioned File Set that stores data in Avro format by usingTPFSAvro as the sink.

I like the fact that there’s a UI for this.  Between this tool and NiFi, the Hadoop ecosystem is getting some tools to make data migration easier to understand, and I think that will help adoption.

Comments closed

Data Lakes

Jen Stirrup has a great primer on data lakes and factors to consider before you jump into the idea:

The organization will need to take a step back to understand better their existing status. Are they just starting out? Are other departments which are doing the same thing, perhaps in the local organization or somewhere else in the world? Once the organization understands their state better, they can start to broadly work out the strategy that the Data Lake is intended to provide.

As part of this understanding, the objective of the Data Lake will need to be identified. Is it for data science? Or, for example, is the Data Lake simply to store data in a holding pattern for data discovery? Identifying the objective will help align the vision and the goals, and set the scene for communication to move forward.

I would like to popularize the term Data Swamp for “that place you store a whole bunch of data of dubious origin and value.”  It’s the place that you promise management of course you can get the data back…as long as they never actually ask for it or are okay with reading terabytes of flat files from backup tapes.  The Data Swamp is the Aristotelian counterpart to the Data Lake, Goofus to its Gallant.  It will also, to my estimate, be the more common version.

Comments closed

Get Started With U-SQL

Microsoft is pushing U-SQL pretty hard.  Here’s a tutorial by Jonathan Gao to whet your appetite:

U-SQL is a language that unifies the benefits of SQL with the expressive power of your own code to process all data at any scale. U-SQL’s scalable distributed query capability enables you to efficiently analyze data in the store and across relational stores such as Azure SQL Database. It enables you to process unstructured data by applying schema on read, insert custom logic and UDF’s, and includes extensibility to enable fine grained control over how to execute at scale. To learn more about the design philosophy behind U-SQL, please refer to this Visual Studio blog post.

You do need Data Lake Tools for Visual Studio, but it looks like you can run it locally.

The VS blog had something a month ago on the topic.  I’m not saying get on it…yet…

Comments closed