Press "Enter" to skip to content

Month: July 2020

Downsides to Optimization in Spark SQL

Anuj Saxena takes us through some of the pros and cons of using the Catalyst Optimizer in Spark, including a couple of issues:

I am sure the optimizations make the calculation time very short, and these optimizations are implemented in such a way that you just have to provide the logic and everything else will be done in abstraction. But as my friend and colleague Ramandeep says, “Abstract features come with abstract issues.” So the following are a few issues I have faced in my recent interactions with Spark SQL:

1. Too large of a query to be stored in memory
2. Implicit optimizations interfere with partitioning

Click through for examples of this.

R and the TIOBE Index

Alex Woodie notices a change in fortunes for R:

Don’t look now, but R, which some had written off as a language in terminal decline in the face of Python’s immense and growing popularity, appears to be staging a furious comeback the likes of which IT has rarely seen.

According to the TIOBE Index, which tracks the popularity of programming languages (as expressed in Web searches), R has risen an unprecedented 12 spots, up from number 20 in the summer of 2019 to number 8 on its list today.

I’m happy to see this, as frankly, I think R’s a better language for statistical analysis and data visualization than Python and it’s not close. That’s the advantage of being a DSL: you get to focus on doing one or two things really well, and for R that’s statistical analysis and data visualization.

Web Page Scraping with R and ML Services

Dave Mason shows how you can scrape webpages with R and pull the resulting data into SQL Server using Machine Learning Services:

For this post, it might make more sense to skip ahead to the end result, and then work our way backwards. Here is a web page with some data: Boston Celtics 2016-2017. It shows two HTML tables (grids) of data for the Boston Celtics, a professional basketball team. The first grid lists the roster of players. We will scrape the web page, and write the data from the “roster” grid to a SQL Server table.

Read on for a demonstration of the process.
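
Not Dave’s code, but here is a rough sketch of the shape of such a call, assuming the rvest package is installed for the ML Services R runtime, the server allows outbound HTTP, and the URL is a placeholder:

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        library(rvest)
        page  <- read_html("https://www.example.com/boston-celtics-2016-2017")
        grids <- html_table(html_nodes(page, "table"))
        OutputDataSet <- as.data.frame(grids[[1]])   # first grid: the roster
    '
WITH RESULT SETS UNDEFINED;

From there, an INSERT INTO ... EXEC over that call (or a typed WITH RESULT SETS clause) lands the scraped rows in a SQL Server table.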

Refreshing Power BI Datasets in Azure Data Factory

Meagan Longoria shows us how to refresh a Power BI dataset using Azure Data Factory:

I recently needed to ensure that a Power BI imported dataset would be refreshed after populating data in my data mart. I was already using Azure Data Factory to populate the data mart, so the most efficient thing to do was to call a pipeline at the end of my data load process to refresh the Power BI dataset.

Power BI offers REST APIs to programmatically refresh your data. For Data Factory to use them, you need to register an app (service principal) in AAD and give it the appropriate permissions in Power BI and to an Azure key vault.

Click through for the solution.

Creating Evenly-Sized Batches from Groups

Daniel Hutmacher has a variant on the islands problem as well as the bin-packing problem:

My aim with this post is to split the dataset into batches of roughly 100 rows each.

DECLARE @target_rowcount bigint=100;

I say “roughly”, because we’re not allowed to split a transaction so that a group (grouping_column_1, grouping_column_2) appears in more than one batch, although a batch can obviously contain more than one group. This means that by necessity, some of the batches are going to be slightly under 100 rows and some are going to be slightly over.

Read on for a good solution to the problem. Daniel mentions places where performance could be better, though this feels like the kind of task you don’t necessarily run all that frequently.
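
Daniel’s approach is more refined than this, but as a rough sketch of the idea (with a hypothetical dbo.Transactions table standing in for the source data), you can assign batch numbers from a running total of group sizes, so that a group never spans two batches:

DECLARE @target_rowcount bigint = 100;

-- Rough sketch, not Daniel's code: each group gets exactly one batch number,
-- so groups never split; batch sizes land roughly around the target.
WITH group_sizes AS (
    SELECT grouping_column_1, grouping_column_2, COUNT(*) AS group_rows
    FROM dbo.Transactions          -- hypothetical source table
    GROUP BY grouping_column_1, grouping_column_2
)
SELECT grouping_column_1,
       grouping_column_2,
       group_rows,
       1 + (SUM(group_rows) OVER (ORDER BY grouping_column_1, grouping_column_2
                                  ROWS UNBOUNDED PRECEDING) - 1) / @target_rowcount AS batch_no
FROM group_sizes;

Joining that back to the source rows gives each transaction its batch; Daniel’s post covers the edge cases and the performance angle properly.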

Power BI the Right Way

Paul Turley is starting a series on doing Power BI the right way:

Any attempt to apply universal best practices to Power BI solution design is a slippery slope. The tools are so flexible and powerful, and the requirements of each project are so varied that it is challenging to establish a set of steps or rules that, if followed, will always yield the absolute best design for a given scenario. With that out of the way, I’ll say this: In my job, I see a lot of poorly-designed Power BI projects. I’ve worked with dozens or scores (maybe even hundreds?) of consulting clients who bring us projects – some partially completed, some finished, and many that are just broken – to be fixed or completed. My reactions range from “that’s just downright wrong” to “hmmm… I wouldn’t have done it that way but I guess it will work for the time being”. I try not to cast stones and do, on occasion, realize that others have found a better way to solve a problem. I don’t have all the answers but I do have a lot of experience with Microsoft Business Intelligence solution design. I have learned many good practices and design patterns from community leaders over the past twenty or so years.

This is only the introductory post in the series, but I’m excited to see what Paul has in store.

Areas of Improvement for DROP TABLE

Michael J. Swart points out a few foibles about the DROP TABLE syntax:

I was looking at the docs for DROP TABLE and I noticed this in the syntax: [ ,...n ]. I never realized that you can drop more than one table in a statement.

I think that’s great. When dropping tables one at a time, you always had to be careful about order when foreign keys were involved. Alas, you still have to care about order.

That is a shame. Michael also includes a few other places where DROP TABLE could be made better, so check it out.
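
To make the point concrete with hypothetical table names, a parent/child pair linked by a foreign key still dictates the order within the list:

-- Parent-first trips over the foreign key, because dbo.OrderLines still
-- references dbo.Orders at the moment SQL Server tries to drop it:
-- DROP TABLE dbo.Orders, dbo.OrderLines;

-- Child-first works: the referencing table goes away before its parent.
DROP TABLE dbo.OrderLines, dbo.Orders;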

A Brief Overview of Azure Synapse Analytics

Ginger Grant gives us the nickel tour of why Azure Synapse Analytics is interesting:

In the past few months, I have been examining Azure Synapse and what it can do. When it was first released in November of 2019, the first functionality that was rolled out was an update of Azure SQL DW. For this reason, many people think that Synapse is just an improved version of a cloud data warehouse. Microsoft did improve SQL DW when it moved it to Synapse. The biggest architectural design change is the separation of the code from the compute, a theme with many web projects, which allows the compute power to be increased when need dictates and scaled down when computing needs change. Within Synapse, resources are allocated as Pools, and you can define a SQL pool to run the data warehouse and later change the compute to a different resource. You will still need to partition your DW, as large datasets require partitioning to perform well. Subsequently, Microsoft released Azure Synapse Studio to be a container for a larger environment of tools and notebooks to interact with them.

But it’s more than that. Read on to see what else is available.
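
For a small taste of the dedicated SQL pool side, here is a hedged sketch of a hash-distributed, partitioned fact table; the table and column names are made up for illustration:

-- Hypothetical fact table in a dedicated SQL pool: hash-distributed across
-- compute nodes and partitioned by date, since large tables still need both.
CREATE TABLE dbo.FactSales
(
    SaleKey     bigint         NOT NULL,
    CustomerKey int            NOT NULL,
    SaleDate    date           NOT NULL,
    SaleAmount  decimal(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (SaleDate RANGE RIGHT FOR VALUES ('2019-01-01', '2020-01-01'))
);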

Schema References and Multiple Event Types in a Kafka Topic

Robert Yokota updates some prior knowledge:

In the article Should You Put Several Event Types in the Same Kafka Topic?, Martin Kleppmann discusses when to combine several event types in the same topic and introduces new subject name strategies for determining how Confluent Schema Registry should be used when producing events to an Apache Kafka® topic.

Schema Registry now supports schema references in Confluent Platform 5.5, and this blog post presents an alternative means of putting several event types in the same topic using schema references, discussing the advantages and disadvantages of this approach.

Click through to see how this works out.

Performance Tuning for Cloudera’s Operational Database

Liliana Kadar, et al, show us the tools we can use to tune Cloudera’s Operational Database:

A query optimizer determines the most efficient way to run a query. Query optimization helps you to reduce the hardware resources required to run a query and also speeds up your query-response time. Cloudera’s Operational Database provides you with various tools such as plan analyzers to make optimal use of your computing resources.

Cloudera’s OpDB provides various cost-based and rules-based optimizers. You can use different optimizers based on your use cases. OpDB is primarily used for Online Transactional Processing (OLTP) use cases with Apache Phoenix in the OpDB used as a SQL engine. But you can also use Hive and Impala for Online Analytical Processing (OLAP) use cases. 

Read on for recommendations on platform choice as well as indexing and tuning options.
