Month: January 2022

Bucketing Data in Hive

Chitra Sapkal explains why bucketing in Hive can be so powerful:

When a column has high cardinality, we can’t partition on it. A very high number of partitions will generate too many Hadoop files, which would increase the load on the node. That’s because the node has to keep the metadata of every partition, and that would affect the performance of that node.

In simple terms, you can use bucketing when you need to query columns with so many distinct values that creating partitions is impractical.
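To make the idea concrete, here is a minimal Python sketch of hash bucketing: each row goes to bucket `hash(key) mod N`, so the number of files stays fixed no matter how many distinct values the column has. The function names, the sample rows, and the use of `zlib.crc32` are my own assumptions for illustration; Hive uses its own hash function and storage layout.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, fixed up front


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket number with a deterministic hash.

    crc32 is only a stand-in here; it is not the hash Hive uses.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, key_column, num_buckets=NUM_BUCKETS):
    """Group rows by the bucket number of their key column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[key_column], num_buckets)].append(row)
    return buckets


# 1,000 distinct user IDs: far too many for one partition per value,
# but they still land in at most NUM_BUCKETS groups.
rows = [{"user_id": i, "amount": i * 10} for i in range(1000)]
buckets = bucketize(rows, "user_id")
print({b: len(r) for b, r in sorted(buckets.items())})
```

Because the same key always hashes to the same bucket, a lookup or join on `user_id` only needs to read one bucket rather than the whole table.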

Click through to see how bucketing works and examples of how you can use it to make queries faster.

Graph Analysis with NetworkX

Tori Tompkins introduces us to a Python package:

NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex graphs. It’s a really cool package that contains heaps of graph algorithms for all different uses. In this tutorial, I will cover how to create a graph from an edge list and different ways we can query it.
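As a rough illustration of the workflow described above (this is my own sketch, not Tori's code), here is how building a graph from an edge list and querying it might look. The edge list and node names are invented for the example, and it assumes the `networkx` package is installed.

```python
import networkx as nx

# Hypothetical edge list: each tuple is an undirected connection.
edges = [
    ("Alice", "Bob"),
    ("Bob", "Carol"),
    ("Carol", "Dana"),
    ("Bob", "Dana"),
    ("Dana", "Eve"),
]

# Build an undirected graph; nodes are created implicitly from the edges.
G = nx.Graph()
G.add_edges_from(edges)

# A few basic queries.
print(G.number_of_nodes(), G.number_of_edges())
print(sorted(G.neighbors("Bob")))
print(nx.shortest_path(G, "Alice", "Dana"))
print(dict(G.degree()))
```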

Unsure what a graph is exactly? Check out my Data Science Moments video which introduces graphs and their uses in 5 minutes:

Click through for that video, as well as a way to load, process, and display graph data.

Bidirectional Transactional Replication and Server Names

Mousa Janini points out a requirement of bidirectional transactional replication:

The steps to create bidirectional replication are simple and similar to the steps for configuring transactional replication, with the extra step of enabling the @loopback_detection parameter of sp_addsubscription to ensure that changes are only sent to the Subscriber and do not result in the change being sent back to the Publisher.

The most common issue with bidirectional replication is loopback detection not working as expected, which results in data conflicts and primary key violations.

Read on to see what causes this problem and what you can do to solve it.

The DIFFERENCE() and SOUNDEX() Functions

Hadi Fadlallah looks at two methods of string distance:

Soundex is a phonetic algorithm developed by Robert C. Russell and Margaret King Odell in the early 1900s. This algorithm is used to index names as they are pronounced in English. The main goal of such an algorithm is to encode homophones to the same representation to be matched even if there are some slight spelling differences. As an example, consider the names “Smith” and “Smyth”, or “Mohamad” and “Mouhammad”. Soundex mainly encodes consonants and only encodes a vowel if it is the first letter of the name.

Being one of the most popular phonetic algorithms, Soundex was implemented in multiple database engines such as Oracle, SQL Server, MySQL, SQLite, and PostgreSQL.
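For a feel of how the encoding works, here is a small Python sketch of the commonly described American Soundex rules. The `CODES` table and `soundex` function are my own names, and database engines can differ from this in edge cases (handling of H and W, for instance), so treat it as an illustration rather than a match for any one engine's SOUNDEX().

```python
# Consonant groups share a digit; vowels, H, W, and Y get no digit.
CODES = {}
for digit, letters in {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a four-character Soundex code (first letter plus three digits)."""
    letters = [c for c in name.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (encoded + "000")[:4]  # pad with zeros, then truncate


for a, b in [("Smith", "Smyth"), ("Mohamad", "Mouhammad")]:
    print(a, soundex(a), b, soundex(b))
```

Both pairs from the excerpt collapse to the same code, which is the point of the algorithm: slight spelling differences disappear once the vowels are dropped.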

These two methods are not perfect, and they do limit you to one word (or a small word grouping), but they are useful.

Diving into Vertipaq Compression

Ed Pollack explains how Vertipaq compression works to make columnstore indexes so efficient:

Columnstore compression is an impressive array of algorithms that can take large analytic tables and significantly reduce their storage footprint. In doing so, IO is also reduced, and query performance dramatically improved.

This article dives into one aspect of columnstore compression that tends to get buried in all of the hoopla surrounding how awesome columnstore indexes are: Vertipaq optimization. This is a critical component of the columnstore compression process, and understanding how it works can significantly improve the performance of analytic workloads while reducing the computing resources required for the underlying data.

Click through for the steps of the process.

Page and Bookmark Navigation in Power BI

Kristi Cantor notes a Power BI update:

Hello P3 Adaptive Nation! Happy New Year, ring out the old and ring in the new! Speaking of ringing in the new, with all the hustle and bustle associated with the holiday season and the excitement of welcoming the new year, did anybody happen to notice the new feature quietly rolled out in Power BI back in November to take the edge off creating and maintaining custom pages and bookmarks?

Read on to see what has changed.

Measuring Cost Threshold for Parallelism Effectiveness

Michael Swart gets out the ruler:

The configuration setting cost threshold for parallelism has a default value of 5. As a default value, it’s probably too low and should be raised. But what benefit are we hoping for? And how can we measure it?

The database that I work with is a busy OLTP system with lots of very frequent, very inexpensive queries and so I don’t like to see any query that needs to go parallel.

What I’d like to do is raise the cost threshold setting to something larger and look at the queries that have gone from multi-threaded to single-threaded. I want to see that these queries become cheaper on average. By cheaper, I mean that they consume less CPU. I expect the average duration of these queries to increase.

Read on for Michael’s results, and I appreciate somebody actually testing and measuring rather than pulling a number from a hat.

Using Query Hints in Query Store

Deepthi Goguri takes us through hint usage in Query Store:

Query Store hints are another amazing feature, available in preview in Azure SQL Database, managed instances, elastic pools, and Hyperscale databases as well. Query Store hints are just like the hints you use in your queries, but the difference is that you do not have to change the query code to apply them. Instead, you use a stored procedure designed specifically for Query Store: sys.sp_query_store_set_hints. Cool, right?

Yes. Yes it is.

Profiling Python Code

Adrian Tam shows how you can test the performance of calls in Python:

Profiling is a technique to figure out how time is spent in a program. With these statistics, we can find the “hot spots” of a program and think about ways to improve them. Sometimes, a hot spot in an unexpected location may hint at a bug in the program as well.

In this tutorial, we will see how we can use the profiling facility in Python. Specifically, you will see:

– How we can compare small code fragments using the timeit module

– How we can profile the entire program using the cProfile module

– How we can invoke a profiler inside an existing program

– What the profiler cannot do
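Here is a short sketch of the first and third items using only the standard library. The two string-building functions are hypothetical stand-ins of my own, not examples from Adrian's post, and the timings will vary by machine.

```python
import cProfile
import io
import pstats
import timeit


def join_with_loop(n):
    """Build a string by repeated concatenation."""
    out = ""
    for i in range(n):
        out += str(i)
    return out


def join_with_builtin(n):
    """Build the same string with str.join."""
    return "".join(str(i) for i in range(n))


# timeit: compare two small fragments by running each many times.
loop_time = timeit.timeit(lambda: join_with_loop(1000), number=200)
builtin_time = timeit.timeit(lambda: join_with_builtin(1000), number=200)
print(f"loop: {loop_time:.4f}s  join: {builtin_time:.4f}s")

# cProfile: start and stop a profiler around one region of a running program.
profiler = cProfile.Profile()
profiler.enable()
join_with_loop(50_000)
profiler.disable()

# Format the five most expensive entries by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

To profile a whole script instead, the second item on the list, you can run it as `python -m cProfile your_script.py`.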

Read on for those techniques.
