Using Hive Hooks

Pushkar Gujar shows us how to use Hive hooks, which behave a bit like triggers in relational databases:

To understand how data is consumed, we need to figure out answers to some basic questions like:

  • Which datasets (tables/views/DBs) are accessed frequently?
  • When are the queries run most frequently?
  • Which users or applications are heavily utilizing the resources?
  • What type of queries are running frequently?

The most accessed object can easily benefit from optimization like compression, columnar file format, or data decomposition. A separate queue can be assigned to heavy-resource-utilizing apps or users to balance the load on a cluster. Cluster resources can be scaled up during the timeframe when a large number of queries are mostly run to meet SLAs and scaled down during low usage tide to save cost.

Hive Hooks are convenient ways to answer some of the above questions and more!

Read on to learn how.

Untangling Kafka APIs

Stephane Maarek helps us make sense of when to use which Kafka API:

I identify 5 types of workloads in Apache Kafka, and in my opinion each corresponds to a specific API:

  • Kafka Producer API: Applications directly producing data (ex: clickstream, logs, IoT).

  • Kafka Connect Source API: Applications bridging between a datastore we don’t control and Kafka (ex: CDC, Postgres, MongoDB, Twitter, REST API).

  • Kafka Streams API / KSQL: Applications wanting to consume from Kafka and produce back into Kafka, also called stream processing. Use KSQL if you think you can write your real-time job as SQL-like, use Kafka Streams API if you think you’re going to need to write complex logic for your job.

  • Kafka Consumer API: Read a stream and perform real-time actions on it (e.g. send email…)

  • Kafka Connect Sink API: Read a stream and store it into a target store (ex: Kafka to S3, Kafka to HDFS, Kafka to PostgreSQL, Kafka to MongoDB, etc.)

Stephane then goes into detail on each of these.

Deciding Whether To Clean Up Temp Tables

Grant Fritchey looks at what difference explicitly dropping temporary tables in a procedure makes:

I then set up Extended Events to capture the query metrics and I executed each of the queries multiple times (also, just for the test, I discarded the results because I didn’t want that process mucking with my measurements). After executing both procedures 500 times, the results were quite simple. The average execution time with an explicit drop was 8,672 microseconds. Meanwhile, the average for not dropping the temporary table was 8,530 microseconds. That’s about a 1% difference across hundreds of executions. Reads were identical and so were writes.

In short, the behavior is the same.

What about the impact on the system? Could I see changes in memory or I/O as these different processes ran?

Grant didn’t notice any difference but check Allen White and Jay Robinson’s answers in the comments.  Temp table reuse can happen (if you follow the rules) and can make a difference when a procedure is called frequently enough.

Using IO Cache To Speed Up Spark Jobs

Chris Seferlis looks at what the HDInsight team has done to speed up Apache Spark jobs:

The big news here is the recently released preview of HDInsight IO Cache, which is a new transparent data caching feature that provides customers with up to 9X performance improvement for Spark jobs, without an increase in costs.

There are many open source caching products that exist in the ecosystem: Alluxio, Ignite, and RubiX to name a few big ones. The IO Cache is also based on RubiX and what differentiates RubiX from other comparable caching products is its approach of using SSD and eliminating the need for explicit memory management. While other comparable caching products leverage the reservation of operating memory for caching the data.

Read on for more details.

Power BI: Choosing Between Default Or Custom Date Dimensions

Reza Rad looks at the default date dimension available in Power BI and compares it to what you can build yourself:

Oh yes! Of course like any other table structures in the Power BI in-memory based structure, every date table, consumes memory. But it would do the same even if you create your custom date dimension! Whenever you do the role-playing dimension scenario you are also consuming even more memory! The main difference is that Power BI default Date dimension will be created even if you do not want to do the date-based analysis on a date field! For example, even if you don’t use DueDate in your date-based analysis, still Power BI creates a date dimension for it. You cannot stop it for one field. You have to either stop the default creation of the Date dimension for the entire model or use it for the entire model, you cannot customize it per field. But with the custom date dimension, you can.

Reza does a good job of giving us the trade-offs between these two and explaining when we should use each.

Building Custom R Visuals In Power BI

Brad Lewellyn shows us how to create custom R visuals within Power BI:

Over the last few posts, we’ve shown how to use custom R visuals built by others.  Today, we’re going to build our own using the Custom R Visual available in Power BI Desktop.  If you haven’t read the second post in this series, Getting Started with R Scripts, it is highly recommended you do so now, as it provides necessary context for how to link Power BI to your local R ISE.

In the previous post, we created a bunch of log-transformed measures to find good predictors for Revenue.  We’re going to use these same measures today to create a basic linear regression model to predict Revenue.  If you want to follow along, the dataset can be found here.  Here’s the custom DAX we used to create the necessary measures.

Click through for the example.

Cannot Rollback TRUNCATE In Redshift

Derik Hammer notes that you cannot rollback a TRUNCATE TABLE operation in Redshift:

In SQL Server, or PostgreSQL for that matter, the TRUNCATE command is allowed in a transaction and it will commit or rollback like any other DML operation. In all of the scripts, below, I will do the following.

  1. Check my row counts

  2. Begin a transaction

  3. TRUNCATE the table

  4. INSERT one row

  5. Check my row counts from within the transaction

  6. ROLLBACK

  7. Check to see if  my row checks revert or not

That’s an important behavioral difference when loading data using a truncate-and-reload pattern.

How SQL Server Implements Index Spooling

Hugo Kornelis has a long article which dives into the way SQL Server handles index spooling:

A Table Spool operator stores its data in a worktable that is structured as a clustered index. The index is not built on any of the columns in the data, nor on any artificially added columns. It is structured on zero columns. As is normal for a clustered index on a set of columns that is not unique for the set, a 4-byte uniqueifier is then added to the data to give each row a unique internal address.

The worktable for an Index Spool operator is also structured as a clustered index. However, this operator does actually index actual columns from its data instead of just relying on a uniqueifier. The indexed columns are chosen to effectively satisfy the Seek Predicate property. The statement in the Microsoft’s documentation that a nonclustered index is used for Index Seek is not correct.

A stack spool is represented in execution plans as a combination of an Index Spool and a Table Spool, both with the With Stack property present and set to True. This is misleading because it is actually a different type of spool. The worktable it uses is built as a clustered index on a single column, representing the nesting level. Because this is not unique, a uniqueifier is added where needed.

This is a deep look at some operators which people tend to gloss over but can have huge performance impacts.

Categories

October 2018
MTWTFSS
« Sep Nov »
1234567
891011121314
15161718192021
22232425262728
293031