Press "Enter" to skip to content

Curated SQL Posts

GIS Capabilities in R

Lionel Hertzog shows off spatial capabilities in R:

All of these operations follow the same logic, st_operation(A, B) checks for each combinations of the geometries in A and B whether A operation B is true or false. For instance st_within(A, B) checks whether the geometries in A are within B, this is similar to st_contains(B, A), the difference between the two being the shape of the returned object. If A has n geometries and B has m, st_contains(B, A) returns a list of length m where each elements contains the row IDs (numbers between 1 and n) of the geometries in A satisfying the operation. By using sparse=FALSE the functions returns matrices, like st_within(A, B, sparse=FALSE) returns a n x m matrix, st_within(B, A, sparse=FALSE) returns a m x n matrix. Note that running st_operation(A, A) checks the operation between all geometries of the object, so returning a n x n matrix.

Click through for part 1 of the series.

Comments closed

A Primer on Multi-Server SQL Agent Administration

Mikey Bronowski gives us a glimpse at the power of MSX/TSX:

The MSX (master server) can be used to define SQL Agent jobs that then will be pushed over to one or more TSX (target servers). It is possible to see the server is either MSX or TSX when looking at SQL Agent in SQL Server Management Studio Object Explorer. The master server can be identified by (MSX) next to the SQL Server Agent. It will also have two subfolders under the Job folder: Local Jobs (for regular jobs) and Multi-Server Jobs.

This is something woefully few database administrators take advantage of. As the number of servers you have to mange increases, ensuring consistency in SQL Agent jobs becomes more and more difficult. MSX/TSX admittedly has some rough edges, but the tool does a lot.

Comments closed

Statistics Aren’t Guarantees

Brent Ozar lays out an important point:

You might say, “But SQL Server has statistics on those columns, and it knows what the top values are!” Well, that’s true, but…data can change without the statistics being updated. For example, say that one user logs in right now, and then we run the MAX query again:

Statistics tell the engine what they learned at the time they were ran. If you need guarantees, that’s what constraints are for.

Comments closed

Caching and Statistics in Synapse Dedicated SQL Pools

Tsuyoshi Matsuzaki takes us through statistics and caching in Azure Synapse Analytics Dedicated SQL Pools:

In Synapse Analytics, several database objects (such as, compiled procedure, plan, …) will be cached in some conditions.
For instance, CCI tables (see my previous post “Azure Synapse Analytics : Choose Right Index and Partition” for CCI) will locally cache the recently-used columnstore segments on distributed compute nodes, which is called columnar cache. The local disk-based cache is used on Gen2 caching.
You cannot manually control these caching activities. (These are automatically applied to improve performance in Synapse Analytics.) See team blog “Adaptive caching powers Azure SQL Data Warehouse performance gains” for underlying architecture which improves caching in Gen2.

Dedicated SQL Pool behavior is close enough to on-premises SQL Server that it’s easy to expect everything to be the same, but there are some nuances.

Comments closed

Triggering a Refresh with Power BI’s API

Martin Schoombee continues a series on automating Power BI deployments:

At times you may want to refresh a Power BI dataset from outside the portal, either on-demand or as part of another process (think DevOps for instance). In those cases the API provides the ideal mechanism to do so. Just remember that you are still limited to 8 refreshes a day if you don’t have a Premium workspace, and using this method will not work beyond the number of allowed refreshes. You also cannot count on the API to return a useful error message in that case.

Read on to see how.

Comments closed

Testing Format String Expressions

Gilbert Quevauvilliers shows us a neat way to check that the Format String Expression for a calculation group works:

Currently the only way to create and or edit the Format String Expression is in Tabular Editor. Whilst I love Tabular Editor it does not have any DAX Intellisense.

I kept on editing the Format String Expression and I honestly must have had to edit it 4 times and it still was wrong.

Below is a quick tip on what I did to make sure my Format String Expression was valid and would indeed work!

Read on to see how.

Comments closed

Real-Time Data Warehousing in Cloudera

Justin Hayes gives us an overview of using Cloudera Data Platform for real-time data warehousing:

The simplest way to describe a RTDW is that it looks and feels like a normal data warehouse, but everything is faster even while massive scale is maintained. It is a type of data warehouse modernization that lets you have “small data” semantics and performance at “big data” scale.

– the data arrives into the warehouse faster – think streams of many millions of events per second constantly arriving

– the time it takes for the data to be optimally queryable is faster – query immediately upon arrival with no need for processing or aggregation or compaction

– the speed at which queries run is faster – small, selective queries are measured in 10s or 100s of milliseconds; large, scan- or compute-heavy queries are processed at very high bandwidth

– mutations of the data, when needed, are fast – if data needs to be corrected or updated for whatever reason, this can be done in place without large rewrites

Read on for more.

Comments closed

Join Execution in Apache Spark

Ajay Gupta takes us through join operations in Apache Spark:

Join operations are often used in a typical data analytics flow in order to correlate two data sets. Apache Spark, being a unified analytics engine, has also provided a solid foundation to execute a wide variety of Join scenarios.

At a very high level, Join operates on two input data sets and the operation works by matching each of the data records belonging to one of the input data sets with every other data record belonging to another input data set. On finding a match or a non-match (as per a given condition), the Join operation could either output an individual record, being matched, from either of the two data sets or a Joined record. The joined record basically represents the combination of individual records, being matched, from both the data sets.

Click through for more information on the mechanics of joining, including trade-offs between types of physical join operators.

Comments closed