Press "Enter" to skip to content

Day: October 26, 2021

The Architecture of Apache Hive

The Hadoop in Real World team explains what the Apache Hive architecture looks like:

Metastore database

Metastore database is not part of HiveServer2 (and it is not shown in the picture). Every Hive installation needs to have an RDBMS like Derby (good for dev environments only), Oracle or MySQL.

Hive stores the metadata of the tables and database that is managed by Hive in the metastore database. Note that this database doesn’t hold the actual data. The data will reside in HDFS.

Click through for the full architecture.

Comments closed

Sixth Normal Form to Avoid NULLs

I have a response to an article:

I linked to this on Curated SQL, where I’d started to write out a response. After about four paragraphs of that, I decided that maybe it’d make sense to turn this into a full blog post rather than a mini-commentary, as I think it deserves the lengthier treatment. I’m going to assume that you’ve read Aaron’s post first, and it’s a well-done apologia in support of using NULLs pragmatically. I’ll start my response with a point of agreement, but then move to differences and alternatives before laying out where I see additional common ground between Aaron’s and my thoughts on the matter.

One explicit assumption in here is that you don’t really have a large number of nullable (or NULLable, as long-form blogging me wants to write) columns on a given table. 6NF-style tables for nullable attributes is a lot less tenable when you have 15 or 20 distinct nullable columns on a table, but at that point I have to ask, is your data model actually correct if you have that many missing attributes?

Comments closed

The Joys and Pains of Filtered Indexes

Chad Callihan takes us through filtered indexes:

We would all be happy if tables in our databases only contained the data we need. In reality, we tend to have data that needs kept for legal purposes or “just in case” something happens. Are you querying a table where queries only care about a fraction of the data?

When optimizing a query that is written to return a specific set of data, you may improve performance by creating a filtered index. Let’s look at an example of a filtered index and a warning that you may come across when using it.

I have a distinct love-hate relationship with filtered indexes. They’re so useful but so much of that utility comes with significant strings attached.

Comments closed

Resolving Collation Conflicts

Steve Stedman fights with collations:

One possible cause of the “Cannot resolve the collation conflict” error message is that your database collation doesn’t match the TempDB Collation.

[…]

Msg 468, Level 16, State 9, Line 11

Cannot resolve the collation conflict between “SQL_Latin1_General_CP437_CI_AS” and “Latin1_General_CI_AS” in the equal to operation.

Read on to see an example of how you can cause this problem and what you can do about it. I agree with Steve’s general advice to keep collations the same across an instance if you can.

Comments closed

All about Synchronous Stats Updates

Paul Randal shares some thoughts about synchronous stats updates:

The SQL Server query optimizer makes use of statistics during query compilation to help determine the optimal query plan. By default, if the optimizer notices a statistic is out-of-date because of too many changes to a table, it will update the statistic immediately before query compilation can continue (only the statistics it needs, not all the statistics for the table).

Note that “too many” is non-specific because it varies by version and whether trace flag 2371 is enabled – see the AUTO_UPDATE_STATISTICS section of this page for details.

Read on to learn more, including the problems that synchronous stats updates can cause, what you can do to avoid them, and ways you can tell that synchronous stats updates are a problem in your environment.

Comments closed

Power BI Push Datasets and Real-Time Dashboards

Marco Russo and Alberto Ferrari don’t have time to wait:

How many times have you heard an executive request a panel with the company’s sales data in real time? How frequently has this single request – which is more often a preference than an important business requirement – affected the overall architecture of your analytical solution?

In the Power BI world, requirements for real time often drive the creation of a pure DirectQuery model, with no aggregations to avoid data latency. This choice is incredibly expensive: the computational cost of each individual query is borne by the data source, which is often a relational database like SQL Server. On top of its cost, with this approach you will face scalability, performance, and modeling issues. Indeed, the relational database on top of which DirectQuery runs is mostly designed for transactional processing instead of being optimized for the workload of analytical processes. Optimizing the model is both difficult and expensive. Finally, using DirectQuery creates specific modeling constraints and the need for modeling workarounds to obtain good performance.

Creating an entire model using DirectQuery for the sole purpose of achieving a few real-time dashboards is definitely excessive. The primary scenario where relying on DirectQuery makes sense is when it is not feasible to import data quickly enough to satisfy the latency requirements for the majority of the reports. When the entire model can be in import mode, and a small number of dashboards require DirectQuery, there are better options available.

Definitely worth the read.

Comments closed

Page Allocation Reports in SSMS

Eitan Blumin has updated an open source project:

Back in April 2020, I created an open-source project called “SQL Server Page Allocation Reports“. It consisted of a set of SQL queries and some Power BI reports that can be used for visualizing the size and locations of your data and transaction log pages.

Well, recently I also added SSMS Custom Reports into the mix. So, it’s time to revisit this project and see what’s new!

Click through to see what’s new.

Comments closed