Press "Enter" to skip to content

Curated SQL Posts

Using Spark in CDP’s Operational Database Experience

Gokul Kamaraj, et al., take us through using Apache Spark in Cloudera Data Platform’s Operational Database Experience:

Apache Spark is a very popular analytics engine used for large-scale data processing. It is widely used for many big data applications and use cases. CDP Operational Database Experience (COD) is a CDP Public Cloud service that lets you create and manage operational database instances, and it is powered by Apache HBase and Apache Phoenix.

To know more about Apache Spark in CDP and CDP Operational Database Experience, see Apache Spark Overview and CDP Operational Database Experience Overview.

Apache Spark enables you to connect directly to databases that support JDBC. When integrating Apache Spark with Apache Phoenix in COD, you can leverage capabilities provided by Apache Phoenix to save and query data across multiple worker nodes, and use column selection and predicate pushdown for filtering.

In this blog post, let us look at how you can read and write data to COD from Apache Spark. We are going to use a COD instance and the Apache Spark available in the Cloudera Data Engineering experience.

Read on for the process.
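
As a rough PySpark sketch of the JDBC route the excerpt mentions (the URL, driver class, and table names below are placeholders rather than values from the post, and the write side assumes the Phoenix-Spark connector, whose exact format name and options depend on your CDP release):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cod-spark-sketch").getOrCreate()

# Placeholder connection details -- in practice these come from the COD
# instance's client configuration in the CDP console.
jdbc_url = "jdbc:phoenix:<zookeeper-quorum>:2181:/hbase"

# Read from a Phoenix table over JDBC. The selected columns and the simple
# filter are pushed down to Phoenix instead of being evaluated in Spark.
orders = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
    .option("dbtable", "ORDERS")
    .load()
    .select("ORDER_ID", "CUSTOMER_ID", "TOTAL")
    .filter("TOTAL > 100")
)

# Writes usually go through the Phoenix-Spark connector rather than plain
# JDBC (Phoenix uses UPSERT, not INSERT); the format name and options here
# are assumptions that vary with the connector version shipped in CDP.
(
    orders.write.format("phoenix")
    .option("table", "BIG_ORDERS")
    .option("zkUrl", "<zookeeper-quorum>:2181")
    .mode("overwrite")
    .save()
)
```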


Change Data Capture in Delta Lake

Surya Sai Turaga and John O’Dwyer take us through change data capture in Delta Lake:

Change data capture (CDC) is a use case that we see many customers implement in Databricks – you can check out our previous deep dive on the topic here. Typically we see CDC used in an ingestion-to-analytics architecture called the medallion architecture. The medallion architecture takes raw data landed from source systems and refines the data through bronze, silver, and gold tables. CDC and the medallion architecture provide multiple benefits to users, since only changed or added data needs to be processed. In addition, the different tables in the architecture allow different personas, such as Data Scientists and BI Analysts, to use the correct up-to-date data for their needs. We are happy to announce the exciting new Change Data Feed (CDF) feature in Delta Lake, which makes this architecture simpler to implement and is made possible by the MERGE operation and log versioning of Delta Lake!

Read on to gain an understanding of how it works.
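
If you want a feel for what the Change Data Feed looks like from Spark, here is a minimal PySpark sketch; the table name and version numbers are placeholders, and it assumes a Delta Lake build with CDF available:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdf-sketch").getOrCreate()

# Turn the change data feed on for an existing Delta table
# ("silver_customers" is just a placeholder name).
spark.sql("""
    ALTER TABLE silver_customers
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read only what changed between two table versions. The feed adds
# _change_type, _commit_version and _commit_timestamp columns alongside
# the table's own columns.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)
    .option("endingVersion", 5)
    .table("silver_customers")
)

# For a MERGE into a downstream table you would typically drop the
# pre-update images and keep inserts, deletes, and post-update rows.
changes.filter("_change_type != 'update_preimage'").show()
```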


Changing Power BI Evaluation Container Numbers

Chris Webb shows how we can optimize the number of evaluation containers in Power BI:

Last week I showed how the new MaxEvaluationWorkingSetInMB registry setting could increase the performance of memory-hungry Power Query queries in Power BI Desktop. In this post I’ll show how the other new registry setting, ForegroundEvaluationContainerCount, can also help performance. Before I carry on I recommend you read the documentation on these new registry settings if you haven’t done so already.

To illustrate the effect of this setting I created ten identical Power Query queries feeding an Import mode dataset in a new .pbix file, each of which reads data from the same 150MB CSV file, applies a filter, and then counts the number of rows returned.

I don’t think I like having to modify a registry setting each time; that’s leading me to believe I should rarely (or never) mess with this.


Persisted Computed Columns and Columnstore Indexes

Erik Darling found a way to do something interesting:

If you read the documentation for columnstore indexes, it says that columnstore indexes can’t be created on persisted computed columns.

And that’s true. If we step through this script, creating the columnstore index will fail.

But it turns out that if there’s a will, there’s a way. Even if this is something you shouldn’t wish to do because who knows what it will mess up.


Creating Power BI Themes

Adam Aspin takes us through creating Power BI themes:

Power BI has taken the world by storm when it comes to creating attention-grabbing dashboards that empower users. It has come to dominate the analytics arena with its ease of use, wide range of connectivity options, and the variety of available visuals. However, formatting (and reformatting) dashboard visuals can prove time-consuming and repetitive – as can standardizing the presentation of multiple dashboards to create a unified look and feel for a suite of reports. Most users would rather spend their time analyzing and delivering meaningful insights as opposed to applying colors and font choices to charts and tables. This article explains how Power BI themes can help you avoid that tedious work!

This is where the creation and application of Power BI themes comes in. A theme is a standardized definition of some – or all – of the formatting of a dashboard. This can range from defining a color palette and a selection of font choices to the detailed specification of each and every visual. Applying a theme allows you to format virtually every visual in a dashboard instantly. What is more, any changes that you subsequently make to a theme can be reapplied in a few clicks to update your dashboard’s presentation. Themes can be created once, then applied to dozens or even hundreds of Power BI dashboards to guarantee a coherent and rigorously standardized presentation style across a department or even an entire organization.

Read on to see some of the things you can do with customizing Power BI.
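
For a sense of what a theme actually is under the covers, here is a minimal sketch written out from Python; the keys shown (name, dataColors, foreground, background, tableAccent) are the basic ones from Microsoft’s theme-file documentation, while the file name and color values are arbitrary examples:

```python
import json

# A minimal report theme. The keys below are the basic ones from
# Microsoft's theme-file documentation; every color value is an
# arbitrary example rather than a recommended palette.
theme = {
    "name": "Corporate Standard",
    "dataColors": ["#1F77B4", "#FF7F0E", "#2CA02C", "#D62728"],
    "foreground": "#252423",
    "background": "#FFFFFF",
    "tableAccent": "#1F77B4",
}

# Save the file, then import it in Power BI Desktop from the View ribbon
# (Themes > Browse for themes).
with open("corporate-standard-theme.json", "w") as f:
    json.dump(theme, f, indent=2)
```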


What’s New in data_algebra

John Mount has an update on a Python package:

The data algebra is a modern realization of elements of Codd’s 1969 relational model for data wrangling (see also Codd’s 12 rules).

The idea is: most data manipulation tasks can usefully be broken down into a small number of fundamental data transforms plus composition. In Codd’s initial writeup, composition was expressed using standard mathematical operator notation. For “modern” realizations one wants to use a composition notation that is natural for the language you are working in. For Python the natural composition notation is method dispatch.

Click through to see how it works and what’s new in the latest version.
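
To make the composition idea concrete, here is a toy sketch in plain Python and pandas; it is deliberately not the data_algebra API, just an illustration of building a pipeline from a few fundamental transforms via method chaining and applying it to data later:

```python
import pandas as pd


class Pipeline:
    """Toy operator-composition pipeline: each method returns a new
    pipeline, so transforms compose by ordinary method chaining."""

    def __init__(self, steps=None):
        self.steps = steps or []

    def _with(self, step):
        return Pipeline(self.steps + [step])

    def extend(self, **new_cols):
        # Derive new columns from expressions over existing ones.
        return self._with(
            lambda df: df.assign(**{k: df.eval(v) for k, v in new_cols.items()})
        )

    def select_rows(self, predicate):
        # Keep only rows matching a predicate expression.
        return self._with(lambda df: df.query(predicate))

    def order_rows(self, *cols):
        return self._with(lambda df: df.sort_values(list(cols)))

    def transform(self, df):
        # Apply the composed steps, in order, to concrete data.
        for step in self.steps:
            df = step(df)
        return df


# Build the pipeline once, as an object...
ops = Pipeline().extend(z="x + y").select_rows("z > 5").order_rows("z")

# ...then apply it to data.
d = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
print(ops.transform(d))
```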


Visio Licensing Changes and Power BI

Chris Webb ties a new Visio announcement to Power BI:

There was an interesting announcement today regarding Visio:

https://www.microsoft.com/en-us/microsoft-365/blog/2021/06/09/bringing-visio-to-microsoft-365-diagramming-for-everyone/

In summary there will soon be a lightweight, web-based version of Visio available to anyone with a Microsoft 365 Business, Office 365 E1/E3/E5, F3, A1, A3 or A5 subscription. Previously Visio was not part of the main M365 plans and was only available as a separate purchase.

So what? As a Power BI user, why should I care? 

Read on for Chris’s answer. If the web-based version of Visio is good, I’m reasonably excited by this prospect.


A Checklist for Database Post-Restoration

Randolph West wants you to keep some things in mind after you restore that database:

Whenever I restore a database — especially one I obtained outside of my regular environment (for example a customer database, a development database, or even a sample database like WideWorldImporters) — there are a few things I like to check to make sure it’s configured for peak performance.

Note that some of this advice may apply only to non-production databases.

Click through for the list.


Defining the Data Fabric

James Serra explains a concept:

Another buzzword that you may have been hearing a lot about lately is Data Fabric. In short, a data fabric is a single environment consisting of a unified architecture, with services and technologies running on that architecture that help a company manage their data. It enables accessing, ingesting, integrating, and sharing data in an environment where the data can be batched or streamed and be in the cloud or on-prem. The ultimate goal of data fabric is to use all your data to gain better insights into your company and make better business decisions. If you are thinking this sounds a lot like a modern data warehouse, which I posted a video on recently at Modern Data Warehouse explained, well, I would argue it basically is the same thing, except that a data fabric expands on that architecture.

Read on for James’s explanation.


CROSS and OUTER APPLY

Kenneth Fisher takes us through CROSS versus OUTER APPLY:

I love CROSS APPLY. I also love OUTER APPLY. What’s the difference though? The same difference as between an INNER and an OUTER JOIN. CROSS APPLY only includes rows where there is a match, while OUTER APPLY includes all rows, even if there isn’t a match. I’ve found over time that I have a lot easier time using an example for this rather than trying to explain it in any detail. I’m going to use STRING_SPLIT for my example because it’s easy.

Click through for the example.
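
Since the T-SQL example lives on Kenneth’s post, here is a loose pandas analogy of the same distinction (my own sketch, not Kenneth’s code): the “cross apply” version keeps only rows whose split produced a value, while the “outer apply” version keeps every source row, with NaN where nothing split out:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "tags": ["red,blue", "green", None],  # row 3 has nothing to split
})

# One row per split-out value.
split = (
    df.assign(tag=df["tags"].str.split(","))
      .explode("tag")[["id", "tag"]]
)

# CROSS APPLY-ish: only rows whose split produced a value survive.
cross_apply_like = split.dropna(subset=["tag"])

# OUTER APPLY-ish: every source row survives, with NaN where there was
# nothing to split (id 3 here).
outer_apply_like = df[["id"]].merge(cross_apply_like, on="id", how="left")

print(cross_apply_like)
print(outer_apply_like)
```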
