Press "Enter" to skip to content

Day: November 4, 2020

Real-Time Data Warehousing in Cloudera

Justin Hayes gives us an overview of using Cloudera Data Platform for real-time data warehousing:

The simplest way to describe a RTDW is that it looks and feels like a normal data warehouse, but everything is faster even while massive scale is maintained. It is a type of data warehouse modernization that lets you have “small data” semantics and performance at “big data” scale.

– the data arrives into the warehouse faster – think streams of many millions of events per second constantly arriving

– the time it takes for the data to be optimally queryable is faster – query immediately upon arrival with no need for processing or aggregation or compaction

– the speed at which queries run is faster – small, selective queries are measured in 10s or 100s of milliseconds; large, scan- or compute-heavy queries are processed at very high bandwidth

– mutations of the data, when needed, are fast – if data needs to be corrected or updated for whatever reason, this can be done in place without large rewrites

Read on for more.

Comments closed

Join Execution in Apache Spark

Ajay Gupta takes us through join operations in Apache Spark:

Join operations are often used in a typical data analytics flow in order to correlate two data sets. Apache Spark, being a unified analytics engine, has also provided a solid foundation to execute a wide variety of Join scenarios.

At a very high level, Join operates on two input data sets and the operation works by matching each of the data records belonging to one of the input data sets with every other data record belonging to another input data set. On finding a match or a non-match (as per a given condition), the Join operation could either output an individual record, being matched, from either of the two data sets or a Joined record. The joined record basically represents the combination of individual records, being matched, from both the data sets.

Click through for more information on the mechanics of joining, including trade-offs between types of physical join operators.

Comments closed


Eric Cobb diagnoses an ugly issue:

I recently ran into a situation where a new SQL Server would crash hard every time it would get under a load.

Here is a synopsis of what we were seeing:

This is a physical server and has 512GB of RAM installed. We have SQL Server 2016 installed, and fully patched (SP2 CU15 at this time). When load testing the server, it would start throwing errors such as:

“Failed allocate pages: FAIL_PAGE_ALLOCATION”


“There is insufficient system memory in resource pool ‘default’ to run this query.”


“Failed to allocate BUFs”

It would then write a memory dump to the log, and in most cases the server would become completely unresponsive and would have to be rebooted.

Read on to learn under what conditions this happens as well as the solution to the problem.

Comments closed

Considerations Before using SQL Server on Containers

Joy George Kunjikkur wants you to slow your roll a little:

It is easy to get started on development and simple testing using SQL containers. It was discussed in the previous post. But before putting into production and start developing real applications we had to make sure the below things at least.

Read on for those considerations. I think they are reasonable and generally agree with the bottom-line conclusion.

Comments closed

Migrating SSIS to Azure Data Factory

Koen Verbeeck has some articles for us:

For quite some time now, there’s been the possibility to lift-and-shift your on-premises SSIS project to Azure Data Factory. There, they run in an Integration Runtime, a cluster of virtual machines that will execute your SSIS packages. In the beginning, you only had the option to use the project deployment model and host your SSIS catalog in either an Azure SQL DB, or in a SQL Server Managed Instance.

But over time, features were added and now the package deployment model has been supported for quite some time as well. Even more, the “legacy SSIS package store” is also supported. For those who still remember this, it’s the SSIS service where you can log into with SSMS and see which packages are stored in the service (either the file system or the MSDB database) and which are currently running.

Read on for much more detail on the topic.

Comments closed

External Tables vs T-SQL Views in Synapse

James Serra explains the differences between external tables and T-SQL views in Azure Synapse Analytics when querying from Data Lake Storage:

A question that I have been hearing recently from customers using Azure Synapse Analytics (the public preview version) is what is the difference between using an external table versus a T-SQL view on a file in a data lake?

Note that a T-SQL view and an external table pointing to a file in a data lake can be created in both a SQL Provisioned pool as well as a SQL On-demand pool.

Here are the differences that I have found:

Click through for the differences.

Comments closed

Date and Time Functions in Cosmos DB

Hasan Savran walks us through date and time functions in Azure Cosmos DB:

Json does not have datetime data type, you need to keep the datetime information in string. This can be a problem for database engines specially if user needs to search by date or sort by date. Cosmos DB team introduced bunch of datetime functions to the Azure Cosmos DB database engine this month. You can read my older post about DateTime in CosmosDB if you like to know how Azure CosmosDB saves the datetime in documents. I will cover the new datetime functions in this post. Here is the list of the functions 

Click through for those functions and how you can use them.

Comments closed

Geospatial Analysis with Azure Data Explorer

Chris Webb continues along a theme:

Since last week’s blog post about dynamic M parameters generated so much interest, this week I thought I’d give you another example of something cool you can do with them when you’re using Azure Data Explorer (ADX) as a DirectQuery source in Power BI: geospatial analysis.

Let’s say you work for a chain of supermarkets and want to use Power BI see what other competing stores are close to one of your stores.

Read on for the rest of the story.

Comments closed