Press "Enter" to skip to content

Category: Warehousing

S3 and Redshift Data Movement with Role Chaining

Sudipta Mitra, et al., talk AWS security:

This post presents an approach that you can apply at scale to achieve fine-grained access controls to resources in S3 buckets and Amazon Redshift schemas for tenants, including groups of users belonging to the same business unit down to the individual user level. This solution provides tenant isolation and data security. In this approach, we use the bridge model to store data and control access for each tenant at the individual schema level in the same Amazon Redshift database. We utilize ASSUMEROLE and role chaining to provide fine-grained access control when data is being copied and unloaded between Amazon Redshift and Amazon S3, so the data flows within each tenant’s namespace. Role chaining also streamlines the new tenant onboarding process.

Read on for an overview and tutorial.
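
For a rough idea of what this looks like in practice, here is a minimal sketch (the role ARNs, group, bucket, and table names are all hypothetical, not taken from the post): Redshift lets you grant the ASSUMEROLE privilege on a comma-separated chain of role ARNs, and the same chain then goes into the COPY or UNLOAD statement.

    -- Allow a tenant's user group to assume this two-role chain, but only for COPY
    GRANT ASSUMEROLE
        ON 'arn:aws:iam::111122223333:role/RedshiftHopRole,arn:aws:iam::444455556666:role/Tenant1S3Role'
        TO GROUP tenant1_group
        FOR COPY;

    -- Load tenant data; Redshift assumes each role in the chain from left to right
    COPY tenant1_schema.sales
    FROM 's3://tenant1-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftHopRole,arn:aws:iam::444455556666:role/Tenant1S3Role'
    FORMAT AS PARQUET;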


Understanding the Data Lakehouse

Tom Jordan explains what data lakehouses are:

When we are thinking about data platforms, there are many different services and architectures that can be used – sometimes this can be a bit overwhelming! Data warehouses, data models, data lakes and reports are all typical components of an enterprise data platform, which have different uses and skills required. However, in the past few years a new architecture has been rising: the data lakehouse. This is an architecture that borrows ideas and concepts from several different areas, which we will be exploring in greater detail in this blog.

Click through to learn more about the origin of this term and how it both draws from and differs from a data lake and a data warehouse.


Role-Based Access Controls in Redshift

Milind Oke, et al., describe RBAC in Amazon Redshift:

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With Amazon Redshift, you can analyze all your data to derive holistic insights about your business and your customers. One of the challenges with security is that enterprises don’t want to have a concentration of superuser privileges amongst a handful of users. Instead, enterprises want to design their overarching security posture based on the specific duties performed via roles and assign these elevated privilege roles to different users. By assigning different privileges to different roles and assigning these roles to different users, enterprises can have more granular control of elevated user access.

In this post, we explore the role-based access control (RBAC) features of Amazon Redshift and how you can use roles to simplify managing the privileges required by your end-users. We also cover new system views and functions introduced alongside RBAC.

Read on to learn about system-defined roles as well as creating user-customizable roles.
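
If you want a feel for the syntax before diving in, here is a minimal sketch (the role, schema, and user names are invented for illustration): you create a role, grant privileges to the role, and then grant the role to users or to other roles.

    -- Create a role and give it read-only access to one schema
    CREATE ROLE sales_read_only;
    GRANT USAGE ON SCHEMA sales TO ROLE sales_read_only;
    GRANT SELECT ON ALL TABLES IN SCHEMA sales TO ROLE sales_read_only;

    -- Assign the role to a user; roles can also nest inside other roles
    GRANT ROLE sales_read_only TO alice;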


Lakehouse, Mesh, and Fabric

James Serra is back in blue:

(NOTE: I have returned to Microsoft and am working as a Solution Architect in Microsoft Industry Solutions, formerly known as Microsoft Consulting Services (MCS), where I help customers build solutions on Azure. Contact your Microsoft account executive for more info. That being said: the views and opinions in this blog are mine and not those of Microsoft).

There certainly has been a lot of discussion lately on the topic of Data Lakehouse, Data Mesh, and Data Fabric, and how they compare to the Modern Data Warehouse. There is no clear definition of all these data architectures, and I have created a presentation using my own take that I have been presenting frequently internally at Microsoft and externally to customers and at conferences. Hopefully these presentations, blog posts, and videos can help clarify all these data architectures for you:

Click through for several useful resources to help differentiate these topics.


get_json_object and json_tuple in Hive

The Hadoop in Real World team looks at a pair of Hive functions:

Both the get_json_object and json_tuple functions are meant to work with JSON data in Hive.

Let’s create a table with some JSON data to work with. We have created a table named hirw_courses and loaded the JSON text below into the table.

Click through for that script, to see how to use each of get_json_object() and json_tuple(), and to find out which might be better suited for your purposes.
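
As a quick sketch of the difference (hirw_courses is the table from the post, but the json_col column and the JSON keys are assumptions), get_json_object() extracts one value per call using a JSONPath expression, while json_tuple() pulls several keys in a single pass, typically via LATERAL VIEW:

    -- One key per call with get_json_object
    SELECT get_json_object(json_col, '$.title')  AS title,
           get_json_object(json_col, '$.author') AS author
    FROM hirw_courses;

    -- Several keys in one pass with json_tuple
    SELECT t.title, t.author
    FROM hirw_courses
    LATERAL VIEW json_tuple(json_col, 'title', 'author') t AS title, author;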


Bucketing Data in Hive

Chitra Sapkal explains why bucketing in Hive can be so powerful:

When a column has a high cardinality, we can’t perform partitioning on it. A very high number of partitions will generate too many Hadoop files, which would increase the load on the node. That’s because the node will have to keep the metadata of every partition, and that would affect the performance of that node.

In simple words, you can use bucketing if you need to run queries on columns with huge amounts of data, which makes it difficult to create partitions.

Click through to see how bucketing works and examples of how you can use it to make queries faster.
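
For a sense of what that looks like (the table definition here is hypothetical), bucketing hashes a high-cardinality column into a fixed number of buckets, each stored as a file rather than a directory:

    -- Hash user_id into 32 buckets; buckets are files, unlike partition directories
    CREATE TABLE page_views (
        user_id  BIGINT,
        page_url STRING,
        view_ts  TIMESTAMP
    )
    CLUSTERED BY (user_id) INTO 32 BUCKETS
    STORED AS ORC;

    -- On older Hive versions, bucketing must be enforced explicitly at insert time
    SET hive.enforce.bucketing = true;
    INSERT INTO TABLE page_views
    SELECT user_id, page_url, view_ts FROM raw_page_views;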


Static versus Dynamic Partitioning in Hive

The Hadoop in Real World team explains the difference between two partitioning strategies:

The difference between static and dynamic partitioning exists only while the partitions are being created, based on how the partitions are added to the table. Once the partitions are created, the tables show no difference between static and dynamic partitions. All partitions are treated as one and the same.

Click through for the difference.
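
To make the distinction concrete, here is a small sketch (table and column names invented for illustration): with static partitioning you spell out the partition value yourself, while with dynamic partitioning Hive derives it from the data.

    -- Static: the partition value is hard-coded in the statement
    INSERT INTO TABLE sales PARTITION (sale_date = '2022-06-01')
    SELECT order_id, amount
    FROM staging_sales
    WHERE sale_date = '2022-06-01';

    -- Dynamic: Hive creates partitions from values in the final SELECT column
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;
    INSERT INTO TABLE sales PARTITION (sale_date)
    SELECT order_id, amount, sale_date
    FROM staging_sales;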


Dedicated SQL Pool Index, Distribution, and Partition Guidance

I have a write-up on the specific value of distributions, indexes, and partitions in Azure Synapse Analytics dedicated SQL pools:

Not too long ago, I ended up taking the DP-203 certification exam for sundry reasons. On that exam, they ask a lot about Azure Synapse Analytics, including indexing, distribution, and partitioning strategies. Because these can be a bit different from on-premises SQL Server, I wanted to cover what options are available and when you might choose them. Let’s start with distributions, as that’s the biggest change in thought process.

Read on for the guidance.
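
As a brief sketch of how all three choices come together in a single CREATE TABLE (the table, columns, and boundary values are hypothetical):

    -- Hash-distribute on the common join key, columnstore for large scans,
    -- and partition by month for easy data lifecycle management
    CREATE TABLE dbo.FactSales
    (
        SaleKey     BIGINT NOT NULL,
        CustomerKey INT    NOT NULL,
        SaleDate    DATE   NOT NULL,
        Amount      DECIMAL(18, 2)
    )
    WITH
    (
        DISTRIBUTION = HASH(CustomerKey),
        CLUSTERED COLUMNSTORE INDEX,
        PARTITION (SaleDate RANGE RIGHT FOR VALUES ('2022-01-01', '2022-02-01', '2022-03-01'))
    );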


Synapse vs Snowflake

Travis Manning has a throw-down:

Data warehousing has become a hot topic for most organizations as data volume grows exponentially, yet the capacity to manage it all manually diminishes. The ecosystem is replete with options, each with a host of features and integrations. In this article, we will discuss two of the most common (and commonly discussed!) data warehousing services, Azure Synapse and Snowflake Data Warehouse (DW). For this article, we will try to focus on use cases and on which option is appropriate in each context.

Click through for the product comparison. One big difference not covered is pricing uncertainty. If you have a good understanding of the number of executions and the computational complexity of your queries, as well as your data quantities, Snowflake can be very competitively priced. But what can happen is that the competitive price turns into a much less competitive price by the time you’re fully up to speed.
