Press "Enter" to skip to content

Category: Warehousing

Query Scheduling with Apache Hive

Zoltan Haindrich and Jesus Camacho Rodriguez walk us through scheduled queries in Apache Hive:

To fulfill that purpose, recently Apache Hive introduced a new feature called scheduled queries. Using SQL statements, users can schedule Hive queries to run on a recurring basis, monitor their progress, and optionally disable a query schedule.

In a nutshell, every scheduled query in Hive consists of (i) a unique name to identify the schedule, (ii)  the actual SQL statement to be executed, and (iii) the schedule at which the query should be executed defined by a Quartz cron expression. In addition, a scheduled query belongs to a namespace, i.e., a collection of HiveServer2 instances that are responsible to execute the query.

Read on for examples of how you might create, use, and learn about scheduled queries running on a system.

Leave a Comment

Role-Playing Dimensions in Power BI

Martin Schoombee explains the concept of role-playing dimensions and then explains how that works in the Power Bi world:

In technical terms a role-playing dimension is when a dimension table has multiple (foreign key) relationships to the same fact table. In more non-technical terms, it is when you have the same attribute (“Date” for instance) that can relate to the same metrics in different ways.

If you look at the data model below as an example, you can see the Date entity could be used to reference either the Invoice Date or Delivery Date from the Sales entity, changing the perspective of the metrics we’re looking at.

It’s a little surprising to me that there isn’t an easier way to handle this concept. Role-playing dimensions are a core part of the Kimball model, and they’re common enough that you’d expect support to be a bit simpler.

Leave a Comment

Thoughts on Snowflake Database Provisioning

David Stelfox takes us through some thoughts on provisioning instances of Snowflake:

For this example, I’ve chosen an open dataset of 2017 taxi rides in New York City. There are a few options for interacting with Snowflake: a dialog box approach in the web-based GUI, using SQL statements in the Worksheets tab in the GUI or a CLI called SnowSQL. For this example, I used SQL statements as I find them easier to follow what’s happening. Once you have set up your account (or trial) and logged in, you need to create your first database.

Click through for some how-to as well as thoughts about cost and performance.

Leave a Comment

The Key Concepts of Azure Synapse Analytics

Simon Whiteley takes a look at what Azure Synapse Analytics really is:

You might have seen that I’ve been pretty busy recently, digging into the new Azure Synapse Analytics preview, announced back at Microsoft Build 2020. I’ve explored the spark engine, SQL serverless/On-Demand and various other bits… but I’m still getting the same question of “Cool!…. but what actually is it?”. One of the problems here is that Azure SQL Data Warehouse was rebranded as “Azure Synapse Analytics”… but it’s not the same as the full workspace. Having two products, both talked about in Marketing, one generally available, one still in preview – it’s no wonder people are still confused!

Simon also has a video, which I recommend so that you can enjoy the funny way he pronounces “Synapse.” That said, next time I’m in the UK, it’ll be just as fair for someone to point out the funny way I pronounce “Synapse.” Also, you should watch the video because Simon knows the topic cold and does a great job of explaining things.

Comments closed

The Power BI Pro’s Guide to Azure Synapse Analytics

Brett Powell gives a quick overview of a new whitepaper:

The Power BI Professional’s Guide to Azure Synapse Analytics, a white paper I wrote describing the Synapse Analytics platform and its benefits and use cases for Power BI professionals, was published a couple weeks ago. This post discusses a few themes from this paper and also shares a couple notes that weren’t included.

There are some interesting notes in here, so check those out and also get ahold of the whitepaper to understand how Power BI relates to the artist formerly known as Azure SQL Data Warehouse.

Comments closed

Azure Synapse Analytics in Preview

Simon Whiteley clarifies a Build announcement:

Today’s the day! There’s much buzz & excitement as we FINALLY get to see Azure Synapse Analytics in public preview, ready for us all to get our hands on it. There’s a raft of other announcements that come hand & hand with it too.

What’s that? You thought Azure Synapse Analytics was already available? You’ve been using all year and don’t see what the fuss is about??

I’m expecting this to be the common reaction. The marketing story for Synapse has been… interesting… to say the least. I’ve been asked several times in the last week exactly what the new story is and, given today’s news, I thought I’d clarify.

The big picture is the version of Azure Synapse Analytics I’ve been interested in for a bit, so it’s nice to see the movement here.

Comments closed

Monitoring Azure Synapse Analytics SQL Pools with Power BI

Brett Powell has a pair of Power BI templates for monitoring Azure Synapse Analytics:

Upon clicking ‘Load’ you’ll either need to provide your credentials for this source (if you don’t have this data source saved from previous use) or the queries will execute and the following report pages will be available:

– Executions
– Waits
– Sessions
– Waits Detail
– Execution Detail
– Memory
– ExecutionDrillThrough (hidden)

Click through to see what the templates look like and how to obtain them.

Comments closed

Workload Isolation in Azure Synapse Analytics

Niko Neugebauer explains how resource governance works with Azure Synapse Analytics SQL Pools:

Carrying on with the Azure Synapse series on the workload identification, classification and isolation started with
Query Identification in Azure SQL DW (Synapse Analytics), in this post I wanted to focus on the workload groups and the workload isolation (aka Resource Governance).

Before advancing and looking into Azure Synapse Analytics “Resource Governor” (my own naming, my fault – and yeah, I shall keep it naming properly), we need to look at the resource classes in Azure Synapse Analytics.
But even before that et me start with WTH – Where is the Heck of Resource Governance in Azure SQL Database ? (Don’t throw at me those Managed Instances, which is a SQL Server with Availability Group running in tuned VM in the background – I want & need the Azure SQL Database to have the proper Resource Governance.

Click through for an explanation plus demonstration.

Comments closed

The Flink-Hive Integration

Bowen Li takes us through Apache Flink 1.10’s integration with Apache Hive:

On the other hand, Apache Hive has established itself as a focal point of the data warehousing ecosystem. It serves as not only a SQL engine for big data analytics and ETL, but also a data management platform, where data is discovered and defined. As business evolves, it puts new requirements on data warehouse.

Thus we started integrating Flink and Hive as a beta version in Flink 1.9. Over the past few months, we have been listening to users’ requests and feedback, extensively enhancing our product, and running rigorous benchmarks (which will be published soon separately). I’m glad to announce that the integration between Flink and Hive is at production grade in Flink 1.10 and we can’t wait to walk you through the details.

Click through to see how it works.

Comments closed

Sort Keys and Join Types in Amazon Redshift

Derik Hammer takes us through query tuning a nasty job on Amazon Redshift:

My team built a process to load from a couple of base tables, in our Amazon Redshift enterprise data warehouse, into an other table which would act as a data mart entity. The data was rolled up and it included some derived fields. The SQL query had some complicity [complexity?, ed.] to it.

This process ran daily and was being killed by our operations team after running for 22 hours.

I stepped in to assist with performance tuning and discovered that join choices, such as INNER vs. OUTER joins have a big impact on whether Redshift can use its sort keys or not.

Click through for more details and what Derik ended up doing.

Comments closed