
Author: Kevin Feasel

Nested Sets

Nate Johnson explains the nested sets model:

Put another way, the #3 rule is that you should always operate on the tree (CrUD ops) using stored-procedures and/or triggers that encapsulate all the nitty-gritty details of maintaining the correct position values during said insert/update/delete operations.  Of course, somebody is responsible for writing those stored-procs.  Any volunteers?  Easy now, don’t raise your hands all at once!  Generally, this responsibility falls to the DBA(s) or DBDev(s).

The problem at-hand, in my current situation, was that of “moving a sub-tree”, i.e. taking a node and all its descendants, and moving it to place it under another “parent” node.  In some models, and/or in some languages, this is a simple recursive operation.  However, SQL is not spectacular at recursion — after all, we’re working in a relational engine — so let’s try to play to its strengths:

This is a straightforward look at one of the major hierarchical models in relational design.  Well worth a look.
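
To make the "move a sub-tree" problem concrete, here is a minimal sketch of one common set-based approach. It is not Nate's stored procedure: the `node` table, its `lft`/`rgt` columns, and the helper names are all hypothetical, and it runs against SQLite through Python's `sqlite3` module rather than SQL Server. The idea is to park the sub-tree by negating its positions, close the gap it leaves, open a gap under the new parent, and then shift the sub-tree into place, all with plain UPDATE statements and no recursion.

```python
import sqlite3


def build_demo_tree():
    """Create an in-memory table holding a small nested-set tree (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node (name TEXT PRIMARY KEY, lft INTEGER, rgt INTEGER)")
    conn.executemany(
        "INSERT INTO node VALUES (?, ?, ?)",
        [("root", 1, 10), ("A", 2, 5), ("A1", 3, 4), ("B", 6, 9), ("B1", 7, 8)],
    )
    conn.commit()
    return conn


def snapshot(conn):
    """Return every node as (name, lft, rgt), ordered by left position."""
    return conn.execute("SELECT name, lft, rgt FROM node ORDER BY lft").fetchall()


def move_subtree(conn, name, new_parent):
    """Move `name` and all its descendants to be the last child of `new_parent`."""
    lft, rgt = conn.execute(
        "SELECT lft, rgt FROM node WHERE name = ?", (name,)
    ).fetchone()
    (dest,) = conn.execute(
        "SELECT rgt FROM node WHERE name = ?", (new_parent,)
    ).fetchone()
    if lft <= dest <= rgt:
        raise ValueError("new parent lies inside the sub-tree being moved")
    width = rgt - lft + 1

    with conn:  # commit all steps together, or roll back if any of them fails
        # 1. Park the sub-tree out of the way by negating its positions.
        conn.execute(
            "UPDATE node SET lft = -lft, rgt = -rgt WHERE lft BETWEEN ? AND ?",
            (lft, rgt),
        )
        # 2. Close the gap the sub-tree left behind.
        conn.execute("UPDATE node SET lft = lft - ? WHERE lft > ?", (width, rgt))
        conn.execute("UPDATE node SET rgt = rgt - ? WHERE rgt > ?", (width, rgt))
        if dest > rgt:
            dest -= width  # the destination itself shifted left in step 2
        # 3. Open a gap of the same width just inside the new parent's right edge.
        conn.execute("UPDATE node SET lft = lft + ? WHERE lft >= ?", (width, dest))
        conn.execute("UPDATE node SET rgt = rgt + ? WHERE rgt >= ?", (width, dest))
        # 4. Un-park the sub-tree, shifted so that it fills the new gap.
        offset = dest - lft
        conn.execute(
            "UPDATE node SET lft = -lft + ?, rgt = -rgt + ? WHERE lft < 0",
            (offset, offset),
        )


conn = build_demo_tree()
move_subtree(conn, "A", "B1")
for row in snapshot(conn):
    print(row)
```

Negating the positions is only one way to mark the rows being moved; a separate flag column or a temp table would do the same job. In a real system this logic would live in a stored procedure, as the excerpt recommends.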


TempDB And Parallelism

Kendra Little looks at cases when a query uses multiple tempdb data files:

As you might guess, things may not always get evenly accessed, even if you have evenly sized tempdb files. One of my queries did a select into a temp table. Although it used all four tempdb files whether or not it went parallel, there were more file_read events against the first tempdb file than against the other three.

It’s an interesting look at this specific question, and also a good example of pedagogical technique.


BatchMode Execution

Sunil Agarwal describes BatchMode execution with columnstore indexes:

You may be wondering: what is this magic number of 900 rows within a batch? Well, when executing a query in BatchMode, SQL Server allocates a 64 KB structure to group the rows. The number of rows in this structure can vary from 64 to 900 depending upon the number of columns selected. For the example above, there are two columns that are referenced and X marks the rows that qualified in the BatchMode structure shown in the picture below. If SCAN is part of a bigger query execution tree, the pointer to this structure is passed to the next operator for further processing. Not all operators can be executed in BatchMode. Please refer to Industry leading analytics query performance for details on BatchMode Operators.

Under the right circumstances, BatchMode execution can be a major performance benefit.


Virtual Function Calls

Ewald Cress is thinking about virtual function calls:

A virtual function call, on the other hand, is only resolved at runtime. The compiler literally does not know what address is going to get called, and neither does the runtime except in the heat of the moment, because that is going to depend on the type of the object instance that the function is called on. Bear with me, I’ll try and simplify.

A C++ object is just a little chunk of memory: a bunch of related instance variables if you like. All objects of the same class have the same structure in this regard. If you’re wondering about functions (a.k.a. methods), these belong to the class, or put differently, to ALL objects of that class. Once compiled, each method is a chunk of memory with a known address, containing the compiled instructions.

From there, it’s a harrowing journey through bigger layers of indirection.
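
The first layer of that indirection can be modeled in a few lines. The sketch below is in Python rather than C++, and it is only an analogy for the mechanism Ewald describes, not what a C++ compiler actually emits: each "object" is a chunk of instance data plus a pointer to a per-class table of functions, and a "virtual call" looks the function up through that pointer at runtime. Every name in it (`vptr`, `virtual_call`, the shape classes) is made up for illustration.

```python
import math


# Each "method" is an ordinary function that receives the object explicitly,
# just as a compiled C++ method receives a hidden `this` pointer.
def circle_area(obj):
    return math.pi * obj["radius"] ** 2


def square_area(obj):
    return obj["side"] ** 2


# One table per class, shared by ALL instances of that class (stands in for a vtable).
CIRCLE_VTABLE = {"area": circle_area}
SQUARE_VTABLE = {"area": square_area}


def make_circle(radius):
    # The object carries only its instance variables plus a pointer to its class's table.
    return {"vptr": CIRCLE_VTABLE, "radius": radius}


def make_square(side):
    return {"vptr": SQUARE_VTABLE, "side": side}


def virtual_call(obj, method_name):
    """Resolve the target at call time by following the object's table pointer."""
    target = obj["vptr"][method_name]
    return target(obj)


shapes = [make_circle(1.0), make_square(3)]
for shape in shapes:
    # The call site is identical for both; which function runs depends on the instance.
    print(virtual_call(shape, "area"))
```

The call site never names `circle_area` or `square_area`, which is why the target cannot be known until the object is in hand.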


Spark 2.1

Reynold Xin announces Apache Spark 2.1:

  • Structured Streaming

    Introduced in Spark 2.0, Structured Streaming is a high-level API for building continuous applications. The main goal is to make it easier to build end-to-end streaming applications, which integrate with storage, serving systems, and batch jobs in a consistent and fault-tolerant way.

    • Event-time watermarks: This change lets applications hint to the system when events are considered “too late” and allows the system to bound internal state tracking late events.

    • Support for all file-based formats and all file-based features: With these improvements, Structured Streaming can read and write all file-based formats, e.g. JSON, text, Avro, CSV. In addition, all file-based features—e.g. partitioned files and bucketing—are supported on all formats.

    • Apache Kafka 0.10: This adds native support for Kafka 0.10, including manual assignment of starting offsets and rate limiting.

This is a pretty hefty release.  Click through to read the whole thing.
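
For a feel of what an event-time watermark does, here is a toy sketch in plain Python. It is not Spark's implementation and does not use the Spark API; the class name, the integer timestamps, and the rule "watermark = latest event time seen minus allowed lateness" are simplifying assumptions. It shows the two effects named in the announcement: events behind the watermark are treated as too late, and state for windows the watermark has passed can be released.

```python
class WindowedCounter:
    """Counts events per tumbling window, using a watermark to bound state."""

    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.max_event_time = None
        self.open_windows = {}  # window start -> running count (state still held)
        self.finalized = {}     # window start -> final count (state released)

    @property
    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_lateness

    def add(self, event_time):
        """Return True if the event was counted, False if it was dropped as too late."""
        wm = self.watermark
        if wm is not None and event_time < wm:
            return False

        start = event_time - event_time % self.window_size
        self.open_windows[start] = self.open_windows.get(start, 0) + 1

        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

        # Any window that ends at or before the watermark can no longer receive
        # events, so its state is moved out of the open set.
        wm = self.watermark
        for s in [s for s in self.open_windows if s + self.window_size <= wm]:
            self.finalized[s] = self.open_windows.pop(s)
        return True


counter = WindowedCounter(window_size=10, allowed_lateness=5)
for t in [1, 12, 3, 8, 16, 9]:
    print(t, counter.add(t))
print(counter.finalized, counter.open_windows)
```

Without the watermark, `open_windows` would have to keep every window forever in case a straggler arrived.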


Ten Notes On SparkR

Neil Dewar has a notebook covering ten important things to know when migrating from R to SparkR:

  1. Apache Spark Building Blocks. A high-level overview of Spark describes what is available for the R user.

  2. SparkContext, SQLContext, and SparkSession. In Spark 1.x, SparkContext and SQLContext let you access Spark. In Spark 2.x, SparkSession becomes the primary method.

  3. A DataFrame or a data.frame? Spark’s distributed DataFrame is different from R’s local data.frame. Knowing the differences lets you avoid simple mistakes.

  4. Distributed Processing 101. Understanding the mechanics of Big Data processing helps you write efficient code—and not blow up your cluster’s master node.

  5. Function Masking. Like all R libraries, SparkR masks some functions.

  6. Specifying Rows. With Big Data and Spark, you generally select rows in DataFrames differently than in local R data.frames.

  7. Sampling. Sample data in the right way, and use it as a tool for converting between big and small data.

  8. Machine Learning. SparkR has a growing library of distributed ML algorithms.

  9. Visualization. It can be hard to visualize big data, but there are tricks and tools which help.

  10. Understanding Error Messages. For R users, Spark error messages can be daunting. Knowing how to parse them helps you find the relevant parts.

I highly recommend checking out the notebook.


Non-Trusted Foreign Keys

Daniel Janik explains what happens when you don’t have trusted foreign key constraints:

Why is it untrusted? Perhaps we disabled the check to load data and neglected to re-enable it?

No matter what the reason is, the next part is not as simple. This is for two reasons.

  1. The data in the child table may not be valid. Since the key was not being checked I may have data in my table that isn’t represented in the parent.

  2. The syntax is a bit silly. As Mike Byrd in Austin, TX says, Microsoft stutters. The syntax to reenable is “CHECK CHECK”. Let’s look at how we reenable the Address key check.

Read on for pros and cons of disabling (or not trusting) foreign key constraints.
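
The first of those two problems, child rows with no matching parent, is easy to reproduce. The sketch below uses SQLite through Python's `sqlite3` module as a stand-in, so it is an analogy rather than SQL Server behavior, and the `person` and `address` tables are hypothetical. Enforcement is switched off during a load, an orphan slips in, and switching enforcement back on does not go back and validate the existing rows. An anti-join finds what was missed.

```python
import sqlite3


def build_demo_db():
    """Create parent and child tables with an enforced foreign key (hypothetical schema)."""
    # isolation_level=None means autocommit, so the PRAGMA calls take effect immediately.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE address ("
        " id INTEGER PRIMARY KEY,"
        " person_id INTEGER NOT NULL REFERENCES person(id))"
    )
    conn.execute("INSERT INTO person (id) VALUES (1)")
    conn.execute("INSERT INTO address (id, person_id) VALUES (10, 1)")
    return conn


def load_without_checking(conn, rows):
    """Simulate a bulk load done with the key check disabled, then switched back on."""
    conn.execute("PRAGMA foreign_keys = OFF")
    conn.executemany("INSERT INTO address (id, person_id) VALUES (?, ?)", rows)
    conn.execute("PRAGMA foreign_keys = ON")  # existing rows are NOT re-validated


def find_orphans(conn):
    """Return child rows whose parent does not exist."""
    return conn.execute(
        "SELECT a.id, a.person_id"
        " FROM address AS a"
        " LEFT JOIN person AS p ON p.id = a.person_id"
        " WHERE p.id IS NULL"
        " ORDER BY a.id"
    ).fetchall()


conn = build_demo_db()
load_without_checking(conn, [(11, 99)])
print(find_orphans(conn))
```

In SQL Server, orphans like these have to be cleaned up before `ALTER TABLE ... WITH CHECK CHECK CONSTRAINT ...` can succeed and the constraint can be trusted again.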


Tracking Applications

Andy Levy explains how to use connection strings to track which application is hogging database resources:

Fortunately, the .NET SqlClient (and other ODBC drivers as well) has a built-in solution. Your application’s connection string has quite a few parameters available to provide configuration and information, and one that seems to get overlooked is Application Name. This one does exactly what it says on the tin – it lets you specify a name that will be displayed to anyone looking for it in SQL Server, including sp_whoisactive. Anyplace you have the ability to write a connection string, you can use this. It costs you nothing!

You can also get fancy with Resource Governor, segmenting pools based on application name.
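
As a small illustration, a connection string with the name filled in might be assembled like this. The sketch is in Python and only builds the string; the helper name, server, database, and application name are placeholders, and the keywords follow the SqlClient style that Andy describes. Other drivers may spell the keyword differently (the Microsoft ODBC driver, for example, uses `APP`).

```python
def build_connection_string(server, database, application_name):
    """Assemble a SqlClient-style connection string that identifies the calling application."""
    parts = {
        "Server": server,
        "Database": database,
        "Integrated Security": "true",
        "Application Name": application_name,
    }
    for key, value in parts.items():
        # Keep the sketch simple: refuse values that would need quoting.
        if ";" in value or "=" in value:
            raise ValueError(f"{key} contains a character that needs quoting: {value!r}")
    return ";".join(f"{key}={value}" for key, value in parts.items())


print(build_connection_string("db01", "Sales", "NightlyInvoiceJob"))
```

On the server side, the value surfaces in the `program_name` column of `sys.dm_exec_sessions` and through the `APP_NAME()` function, which is what a Resource Governor classifier function could key on.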


Where Azure Analysis Services Fits

Melissa Coates explains where Azure Analysis Services fits in common BI architectures:

(2) Data Sources

  • From a single source such as a data warehouse. This is the most traditional path for BI development, and still has a very valid place in many BI/analytics deployments. This scenario puts the work of data integration on the ETL process into the data warehouse, which is the most appropriate place.

  • Directly from various systems.  This can be done, but works well only in specific cases – it definitely won’t work well if there are a lot of highly normalized tables, or if there’s not a straightforward way to relate the disparate data together. Trying to go directly to the source systems & skip an intermediary data warehouse puts the “integration” burden on the data source view in Analysis Services, so plan for plenty of time testing if you’re going to try this route (i.e., it can be much harder, not easier). Note that this option only makes sense if the data is stored in Analysis Services because it needs to be related together somehow (i.e., DirectQuery mode, discussed next in #3, with > 1 data source won’t work if a user tries to combine data sources because the data is not inherently related).

If you’re thinking about Azure Analysis Services, this post is a good one.
