To understand how data is consumed, we need to figure out answers to some basic questions like:
- Which datasets (tables/views/DBs) are accessed frequently?
- When are the queries run most frequently?
- Which users or applications are heavily utilizing the resources?
- What type of queries are running frequently?
The most accessed object can easily benefit from optimization like compression, columnar file format, or data decomposition. A separate queue can be assigned to heavy-resource-utilizing apps or users to balance the load on a cluster. Cluster resources can be scaled up during the timeframe when a large number of queries are mostly run to meet SLAs and scaled down during low usage tide to save cost.
Hive Hooks are convenient ways to answer some of the above questions and more!
Read on to learn how.
I identify 5 types of workloads in Apache Kafka, and in my opinion each corresponds to a specific API:
Kafka Producer API: Applications directly producing data (ex: clickstream, logs, IoT).
Kafka Connect Source API: Applications bridging between a datastore we don’t control and Kafka (ex: CDC, Postgres, MongoDB, Twitter, REST API).
Kafka Streams API / KSQL: Applications wanting to consume from Kafka and produce back into Kafka, also called stream processing. Use KSQL if you think you can write your real-time job as SQL-like, use Kafka Streams API if you think you’re going to need to write complex logic for your job.
Kafka Consumer API: Read a stream and perform real-time actions on it (e.g. send email…)
Kafka Connect Sink API: Read a stream and store it into a target store (ex: Kafka to S3, Kafka to HDFS, Kafka to PostgreSQL, Kafka to MongoDB, etc.)
Stephane then goes into detail on each of these.
I then set up Extended Events to capture the query metrics and I executed each of the queries multiple times (also, just for the test, I discarded the results because I didn’t want that process mucking with my measurements). After executing both procedures 500 times, the results were quite simple. The average execution time with an explicit drop was 8,672 microseconds. Meanwhile, the average for not dropping the temporary table was 8,530 microseconds. That’s about a 1% difference across hundreds of executions. Reads were identical and so were writes.
In short, the behavior is the same.
What about the impact on the system? Could I see changes in memory or I/O as these different processes ran?
Grant didn’t notice any difference but check Allen White and Jay Robinson’s answers in the comments. Temp table reuse can happen (if you follow the rules) and can make a difference when a procedure is called frequently enough.
The big news here is the recently released preview of HDInsight IO Cache, which is a new transparent data caching feature that provides customers with up to 9X performance improvement for Spark jobs, without an increase in costs.
There are many open source caching products that exist in the ecosystem: Alluxio, Ignite, and RubiX to name a few big ones. The IO Cache is also based on RubiX and what differentiates RubiX from other comparable caching products is its approach of using SSD and eliminating the need for explicit memory management. While other comparable caching products leverage the reservation of operating memory for caching the data.
Read on for more details.
Oh yes! Of course like any other table structures in the Power BI in-memory based structure, every date table, consumes memory. But it would do the same even if you create your custom date dimension! Whenever you do the role-playing dimension scenario you are also consuming even more memory! The main difference is that Power BI default Date dimension will be created even if you do not want to do the date-based analysis on a date field! For example, even if you don’t use DueDate in your date-based analysis, still Power BI creates a date dimension for it. You cannot stop it for one field. You have to either stop the default creation of the Date dimension for the entire model or use it for the entire model, you cannot customize it per field. But with the custom date dimension, you can.
Reza does a good job of giving us the trade-offs between these two and explaining when we should use each.
Over the last few posts, we’ve shown how to use custom R visuals built by others. Today, we’re going to build our own using the Custom R Visual available in Power BI Desktop. If you haven’t read the second post in this series, Getting Started with R Scripts, it is highly recommended you do so now, as it provides necessary context for how to link Power BI to your local R ISE.
In the previous post, we created a bunch of log-transformed measures to find good predictors for Revenue. We’re going to use these same measures today to create a basic linear regression model to predict Revenue. If you want to follow along, the dataset can be found here. Here’s the custom DAX we used to create the necessary measures.
Click through for the example.
In SQL Server, or PostgreSQL for that matter, the TRUNCATE command is allowed in a transaction and it will commit or rollback like any other DML operation. In all of the scripts, below, I will do the following.
Check my row counts
Begin a transaction
TRUNCATE the table
INSERT one row
Check my row counts from within the transaction
Check to see if my row checks revert or not
That’s an important behavioral difference when loading data using a truncate-and-reload pattern.
A Table Spool operator stores its data in a worktable that is structured as a clustered index. The index is not built on any of the columns in the data, nor on any artificially added columns. It is structured on zero columns. As is normal for a clustered index on a set of columns that is not unique for the set, a 4-byte uniqueifier is then added to the data to give each row a unique internal address.
The worktable for an Index Spool operator is also structured as a clustered index. However, this operator does actually index actual columns from its data instead of just relying on a uniqueifier. The indexed columns are chosen to effectively satisfy the Seek Predicate property. The statement in the Microsoft’s documentation that a nonclustered index is used for Index Seek is not correct.
A stack spool is represented in execution plans as a combination of an Index Spool and a Table Spool, both with the With Stack property present and set to True. This is misleading because it is actually a different type of spool. The worktable it uses is built as a clustered index on a single column, representing the nesting level. Because this is not unique, a uniqueifier is added where needed.
This is a deep look at some operators which people tend to gloss over but can have huge performance impacts.