Press "Enter" to skip to content

Month: January 2017

Partitioned Views With Polybase

I look at using SQL 2000-style partitioning with a set of external tables:

Using a view, we were able to create a “partitioned” Polybase experience, similar to what we had in SQL Server 2000.  This form of poor man’s partitioning allows us to segment out data sets and query them independently, something which can be helpful when storing very large amounts of data off-site and only occasionally needing to query it.  The thing to remember, though, is that if you store this in Azure Blob Storage, you will need to pull down the entire table’s worth of data to do any processing.

This leads to a concept I first heard from Ginger Grant:  pseudo-StretchDB.  Instead of paying for what Stretch offers, you get an important subset of the functionality at a much, much lower price.  If you do store the data in Azure Blob Storage, you’re paying pennies per gigabyte per month.  For cold storage, like a scenario in which you need to keep data around to keep the auditors happy but your main application doesn’t use that information, it can work fine.  But if you need to query this data frequently, performance might be a killer.

For Polybase tables without the ability to perform external pushdown, coming up with a good partitioning strategy is probably one of the two best ways to improve performance; creating a Polybase scale-out cluster is the other.
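To make the idea concrete, here is a minimal T-SQL sketch of such a partitioned view, assuming two external tables (the table, view, and column names here are hypothetical) have already been defined over files in Azure Blob Storage, one per year of archived data:

CREATE VIEW dbo.SalesArchive
AS
    SELECT SaleID, SaleDate, Amount
    FROM dbo.Sales2015External
    WHERE SaleDate >= '2015-01-01' AND SaleDate < '2016-01-01'
    UNION ALL
    SELECT SaleID, SaleDate, Amount
    FROM dbo.Sales2016External
    WHERE SaleDate >= '2016-01-01' AND SaleDate < '2017-01-01';
GO

-- Queries go against the view rather than the individual external tables.
SELECT SUM(Amount)
FROM dbo.SalesArchive
WHERE SaleDate >= '2016-06-01' AND SaleDate < '2016-07-01';

Whether the optimizer can skip a branch entirely depends on it recognizing the contradiction between that branch's filter and the query's predicate; even then, any external table the query does touch still gets pulled down from Blob Storage in full before local filtering happens.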


Nested Sets

Nate Johnson explains the nested sets model:

Put another way, the #3 rule is that you should always operate on the tree (CrUD ops) using stored-procedures and/or triggers that encapsulate all the nitty-gritty details of maintaining the correct position values during said insert/update/delete operations.  Of course, somebody is responsible for writing those stored-procs.  Any volunteers?  Easy now, don’t raise your hands all at once!  Generally, this responsibility falls to the DBA(s) or DBDev(s).

The problem at hand, in my current situation, was that of “moving a sub-tree”, i.e. taking a node and all its descendants, and moving it to place it under another “parent” node.  In some models, and/or in some languages, this is a simple recursive operation.  However, SQL is not spectacular at recursion — after all, we’re working in a relational engine — so let’s try to play to its strengths:

This is a straightforward look at one of the major hierarchical models in relational design.  Well worth a look.
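As a quick illustration of the model itself, here is a minimal T-SQL sketch (table and column names are hypothetical) showing the left/right boundary values that nested sets rely on, plus the classic no-recursion-needed descendants query:

CREATE TABLE dbo.OrgChart
(
    NodeID   INT          NOT NULL PRIMARY KEY,
    NodeName VARCHAR(100) NOT NULL,
    Lft      INT          NOT NULL,
    Rgt      INT          NOT NULL
);

-- Every descendant of a node sits strictly inside that node's Lft/Rgt range.
DECLARE @NodeID INT = 1;

SELECT child.NodeID, child.NodeName
FROM dbo.OrgChart AS parent
    INNER JOIN dbo.OrgChart AS child
        ON child.Lft > parent.Lft
       AND child.Rgt < parent.Rgt
WHERE parent.NodeID = @NodeID;

Moving a sub-tree means recalculating Lft and Rgt values for the moved nodes and for everything between the old and new positions, which is exactly the sort of fiddly set-based update the post recommends hiding behind stored procedures.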


TempDB And Parallelism

Kendra Little looks at cases when a query uses multiple tempdb data files:

As you might guess, things may not always get evenly accessed, even if you have evenly sized tempdb files. One of my queries did a select into a temp table. Although it used all four tempdb files whether or not it went parallel, there were more file_read events against the first tempdb file than against the other three.

It’s an interesting look at this specific question, but also an interesting piece of pedagogical technique.
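For a quick-and-dirty look at the same question without tracking individual file_read events, a DMV query along these lines (a sketch, not Kendra's method) shows cumulative read and write activity per tempdb data file since the last restart:

SELECT
    mf.name AS file_name,
    mf.physical_name,
    vfs.num_of_reads,
    vfs.num_of_writes,
    vfs.num_of_bytes_read,
    vfs.num_of_bytes_written
FROM sys.dm_io_virtual_file_stats(DB_ID('tempdb'), NULL) AS vfs
    INNER JOIN sys.master_files AS mf
        ON mf.database_id = vfs.database_id
       AND mf.file_id = vfs.file_id
WHERE mf.type_desc = 'ROWS'  -- data files only; leave out the log
ORDER BY mf.file_id;

Because these numbers blur together everything since startup, the per-query approach in the post remains the better tool for answering “what did this one query do?”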


BatchMode Execution

Sunil Agarwal describes BatchMode execution with columnstore indexes:

You may be wondering what this magic number of 900 rows within a batch is. Well, when executing a query in BatchMode, SQL Server allocates a 64k-byte structure to group the rows. The number of rows in this structure can vary between 64 and 900, depending upon the number of columns selected. For the example above, there are two columns that are referenced, and X marks the rows that qualified in the BatchMode structure shown in the picture below. If SCAN is part of a bigger query execution tree, the pointer to this structure is passed to the next operator for further processing. Not all operators can be executed in BatchMode. Please refer to Industry leading analytics query performance for details on BatchMode operators.

Under the right circumstances, BatchMode execution can be a major performance benefit.
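As a rough sketch of how to see this locally (the table name here is hypothetical), adding a columnstore index and running a simple aggregate is usually enough to get batch mode on SQL Server 2016; the “Actual Execution Mode” property on the scan and aggregate operators in the actual execution plan shows whether it happened:

-- Nonclustered columnstore index over the columns the query touches.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_SalesOrderDetail
    ON dbo.SalesOrderDetail (ProductID, OrderQty, LineTotal);
GO

-- An aggregation over the columnstore; check the actual plan for
-- "Actual Execution Mode = Batch" on the scan and hash aggregate.
SELECT
    ProductID,
    SUM(LineTotal) AS TotalSales,
    SUM(OrderQty)  AS TotalQuantity
FROM dbo.SalesOrderDetail
GROUP BY ProductID;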


Virtual Function Calls

Ewald Cress is thinking about virtual function calls:

A virtual function call, on the other hand, is only resolved at runtime. The compiler literally does not know what address is going to get called, and neither does the runtime except in the heat of the moment, because that is going to depend on the type of the object instance that the function is called on. Bear with me, I’ll try and simplify.

A C++ object is just a little chunk of memory: a bunch of related instance variables if you like. All objects of the same class have the same structure in this regard. If you’re wondering about functions (a.k.a. methods), these belong to the class, or put differently, to ALL objects of that class. Once compiled, each method is a chunk of memory with a known address, containing the compiled instructions.

From there, it’s a harrowing journey through bigger layers of indirection.


Spark 2.1

Reynold Xin announces Apache Spark 2.1:

  • Structured Streaming

    Introduced in Spark 2.0, Structured Streaming is a high-level API for building continuous applications. The main goal is to make it easier to build end-to-end streaming applications, which integrate with storage, serving systems, and batch jobs in a consistent and fault-tolerant way.

    • Event-time watermarks: This change lets applications hint to the system when events are considered “too late” and allows the system to bound the internal state used to track late events.

    • Support for all file-based formats and all file-based features: With these improvements, Structured Streaming can read and write all file-based formats, e.g. JSON, text, Avro, CSV. In addition, all file-based features—e.g. partitioned files and bucketing—are supported on all formats.

    • Apache Kafka 0.10: This adds native support for Kafka 0.10, including manual assignment of starting offsets and rate limiting.

This is a pretty hefty release.  Click through to read the whole thing.


Ten Notes On SparkR

Neil Dewar has a notebook with ten important things when migrating from R to SparkR:

  1. Apache Spark Building Blocks. A high-level overview of Spark describes what is available for the R user.

  2. SparkContext, SQLContext, and SparkSession. In Spark 1.x, SparkContext and SQLContext let you access Spark. In Spark 2.x, SparkSession becomes the primary method.

  3. A DataFrame or a data.frame? Spark’s distributed DataFrame is different from R’s local data.frame. Knowing the differences lets you avoid simple mistakes.

  4. Distributed Processing 101. Understanding the mechanics of Big Data processing helps you write efficient code—and not blow up your cluster’s master node.

  5. Function Masking. Like all R libraries, SparkR masks some functions.

  6. Specifying Rows. With Big Data and Spark, you generally select rows in DataFrames differently than in local R data.frames.

  7. Sampling. Sample data in the right way, and use it as a tool for converting between big and small data.

  8. Machine Learning. SparkR has a growing library of distributed ML algorithms.

  9. Visualization. It can be hard to visualize big data, but there are tricks and tools which help.

  10. Understanding Error Messages. For R users, Spark error messages can be daunting. Knowing how to parse them helps you find the relevant parts.

I highly recommend checking out the notebook.


Non-Trusted Foreign Keys

Daniel Janik explains what happens when you don’t have trusted foreign key constraints:

Why is it untrusted? Perhaps we disabled the check to load data and neglected to re-enable it?

No matter what the reason is, the next part is not as simple. This is for two reasons.

  1. The data in the child table may not be valid. Since the key was not being checked I may have data in my table that isn’t represented in the parent.

  2. The syntax is a bit silly. As Mike Byrd in Austin, TX says, Microsoft stutters. The syntax to re-enable is “CHECK CHECK”. Let’s look at how we re-enable the Address key check.

Read on for pros and cons of disabling (or not trusting) foreign key constraints.
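If you want to see the “stutter” in action, here is a minimal sketch (the table and constraint names are hypothetical) of finding untrusted keys and re-enabling one with validation so that it becomes trusted again:

-- Foreign keys that are disabled or enabled but untrusted.
SELECT name, is_disabled, is_not_trusted
FROM sys.foreign_keys
WHERE is_disabled = 1 OR is_not_trusted = 1;

-- WITH CHECK validates existing rows; the second CHECK re-enables the
-- constraint.  Using WITH NOCHECK instead would re-enable it without
-- validation and leave it untrusted.
ALTER TABLE dbo.CustomerAddress
    WITH CHECK CHECK CONSTRAINT FK_CustomerAddress_Address;

If any child rows violate the key, the ALTER fails and the data has to be cleaned up first, which is the first issue the post calls out.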
