
Curated SQL Posts

Alerting In Azure SQL Database

Arun Sirpal shows how to set up an alert for an Azure SQL Database:

I keep things simple and like to look at certain performance-based metrics, but before talking about what metrics are available, let’s step through an example.

For this post I want to set up an alert on CPU percentage utilised: when it is greater than 50% over the last 5 minutes, I would like to know about it. The first step is to navigate to your Azure SQL Database.
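
If you want to sanity-check that metric outside the portal, sys.dm_db_resource_stats keeps roughly an hour of 15-second samples inside an Azure SQL Database. A minimal sketch of my own, not part of Arun’s walkthrough:

```sql
-- Average CPU over the last 5 minutes, from inside the database itself
SELECT AVG(avg_cpu_percent) AS avg_cpu_last_5_min
FROM sys.dm_db_resource_stats
WHERE end_time >= DATEADD(MINUTE, -5, SYSUTCDATETIME());
```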

Click through for a screenshot-driven guided tour.


Indexing Tradeoffs

Jeff Schwartz continues his series on index tuning, including a section on what happens when you have too many indexes on a table:

Larger numbers of indices create exponentially more query plan possibilities. When too many choices exist, the Optimizer will give up partway through and just pick the best plan thus far. As more indices are added, the problem worsens and compilation times, i.e., processor times, increase to a point.

This can be illustrated best by reviewing an actual customer example. In this case, one table had 144 indices attached to it and several others had between 20 and 130 indices. The queries were quite complex, with as many as fifteen joins, many of which were outer joins. Query and index tuning were impossible because query performance was often counterintuitive and sometimes nonsensical. Adding an index that addressed a specific query need often made the query run worse one time and better the next. Note: cached query plan issues, e.g., parameter sniffing or plan reuse, were not problems in this case.

The only solution was to tear down the ENTIRE indexing structure and rebuild it with SQL Server’s guidance and nine days’ worth of production queries. Table 5 summarizes the results of the index restructuring project. The performance of 98 percent of the queries was comparable to or better than it was when the large numbers of indices were present.
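
If you want to know whether any of your tables are drifting toward that 144-index mark, a quick count per table is a reasonable first look. A minimal sketch of my own, not from Jeff’s post:

```sql
-- Tables carrying a suspicious number of indexes
SELECT t.name            AS table_name,
       COUNT(i.index_id) AS index_count
FROM sys.tables AS t
JOIN sys.indexes AS i
    ON i.object_id = t.object_id
WHERE i.type > 0                 -- skip the heap entry
GROUP BY t.name
HAVING COUNT(i.index_id) > 10    -- arbitrary threshold; adjust to taste
ORDER BY index_count DESC;
```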

Don’t be that company.


Connect(); Announcements, Including Azure Databricks

James Serra has a wrapup of Microsoft Connect(); announcements around the data platform space:

Microsoft Connect(); is a developer event from Nov 15-17, where plenty of announcements are made.  Here is a summary of the data platform related announcements:

  • Azure Databricks: In preview, this is a fast, easy, and collaborative Apache Spark based analytics platform optimized for Azure. It delivers one-click setup, streamlined workflows, and an interactive workspace all integrated with Azure SQL Data Warehouse, Azure Storage, Azure Cosmos DB, Azure Active Directory, and Power BI.  More info

  • Azure Cosmos DB with Apache Cassandra API: In preview, this enables Cassandra developers to simply use the Cassandra API in Azure Cosmos DB and enjoy the benefits of Azure Cosmos DB with the familiarity of the Cassandra SDKs and tools, with no code changes to their application.  More info.  See all Cosmos DB announcements

  • Microsoft joins the MariaDB Foundation: Microsoft is a platinum sponsor – MariaDB is a community-developed fork of the MySQL relational database management system, and Microsoft will be actively contributing to MariaDB and the MariaDB community.  More info

Click through for more.  And if you want more info on Azure Databricks, Matei Zaharia and Peter Carlin have more information:

So how is Azure Databricks put together? At a high level, the service launches and manages worker nodes in each Azure customer’s subscription, letting customers leverage existing management tools within their account.

Specifically, when a customer launches a cluster via Databricks, a “Databricks appliance” is deployed as an Azure resource in the customer’s subscription. The customer specifies the types of VMs to use and how many, but Databricks manages all other aspects. In addition to this appliance, a managed resource group is deployed into the customer’s subscription that we populate with a VNet, a security group, and a storage account. These are concepts Azure users are familiar with. Once these services are ready, users can manage the Databricks cluster through the Azure Databricks UI or through features such as autoscaling. All metadata (such as scheduled jobs) is stored in an Azure Database with geo-replication for fault tolerance.

I’ve been a huge fan of the Databricks Community Edition.  We’ll see if there will be a Community Edition version for Azure as well.


Custom SQL Operations Studio Dashboard Widgets

Drew Furgiuele shows how easy it is to create a dashboard widget in SQL Operations Studio:

Before we go on, now’s your chance to name your widget. In my code above, I highlighted some of the changes I made. Also notice that this widget has a path to the query file we created; if this file doesn’t exist (or you can’t read from the path it exists on), it’ll stop working. Just a heads-up.

I named it something meaningful (but hey, do you), and I changed the vertical and horizontal size of the widget to be a little easier to read. Once you’re done, it’s time to add your widget to the dashboard.
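
For reference, the query file a widget points at is just an ordinary T-SQL script that returns a result set. A hypothetical example of such a file (the name and query are mine, not Drew’s):

```sql
-- Hypothetical widget query file, e.g. database-sizes.sql
-- One row per database with its total file size in MB
SELECT d.name                   AS database_name,
       SUM(mf.size) * 8 / 1024  AS size_mb   -- size is stored in 8 KB pages
FROM sys.databases AS d
JOIN sys.master_files AS mf
    ON mf.database_id = d.database_id
GROUP BY d.name
ORDER BY size_mb DESC;
```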

I think people are going to like this product once it matures a bit.  This kind of flexibility without having to drop into .NET is great for DBAs for whom C# is a little intimidating.


Service Broker Security

Colleen Morrow is back with a new item in her Service Broker series, this time on securing Service Broker implementations:

There are 2 types of security in Service Broker: dialog and transport. Dialog security establishes a secure, authenticated connection between Service Broker Services or dialog endpoints. Transport security establishes an authenticated network connection between SQL Server instances or Service Broker endpoints. Clear as mud, right? Don’t worry, these are easily mixed up by both novice and experienced Service Broker admins.

To illustrate, let’s go back to our taxes scenario. You’ve completed your forms, stamped your envelope and you’re ready to mail it in. You drop it in your nearest mailbox and what happens next? A postal worker will pick it up, it gets loaded into a truck and shipped between various sorting facilities (as you might have noticed, I have no clue how the USPS works) until it is finally delivered to the IRS via yet another postal worker.

Now, those postal workers all have the authority to transport your tax return from point to point. However, they do not have the authority to open up and read your return. That’s what transport security is. The IRS agent on the other end, though, he does have the authority to read your return. That’s dialog security.

It’s also worth noting that transport security is only needed in a distributed environment. Just like if the IRS agent lived with you, you wouldn’t need to go through the USPS. But that’s just weird.
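
To hang rough T-SQL on that distinction: transport security lives on the Service Broker endpoint, while dialog security is set up with a remote service binding. A hedged sketch with made-up names, not from Colleen’s post:

```sql
-- Transport security: authenticate the instance-to-instance connection
CREATE ENDPOINT BrokerEndpoint
    STATE = STARTED
    AS TCP (LISTENER_PORT = 4022)
    FOR SERVICE_BROKER (AUTHENTICATION = CERTIFICATE TransportCert,
                        ENCRYPTION = REQUIRED);

-- Dialog security: authenticate and encrypt the conversation itself
CREATE REMOTE SERVICE BINDING TaxReturnBinding
    TO SERVICE 'TargetService'
    WITH USER = RemoteServiceUser, ANONYMOUS = OFF;
```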

This wraps up Colleen’s Service Broker series.  If you do find yourself interested in Service Broker, this is a great way to get your feet wet.


Building Dynamic Row Headers With ML Services

Dave Mason tries to get around his RESULT SETS limitation when using SQL Server Machine Learning Services:

The columns in the data frame clearly have names, but SQL Server isn’t using them. The data frame columns have types in R too (more on this in a moment). Now that makes me wonder about the data types for the data returned by SQL. How is that determined? If SQL isn’t using the column names, can I assume it isn’t making use of the R column types either?

For a point of reference, let’s run some more R code to show the column names and types. As before, the rvest package is used to scrape a web page, with each HTML <table> found becoming a data frame in the “tables” list (line 3). A data frame of table metadata is created by calling data.frame(). The first parameter is a vector of column names (line 4), the second parameter is a vector of column classes (line 5), and the third parameter causes the row “names” to be incrementing digits (line 6).
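
For context, the pattern Dave is wrestling with looks roughly like this: sp_execute_external_script with a WITH RESULT SETS clause, where the column names and types SQL Server uses come from the T-SQL declaration rather than from the R data frame. A hedged sketch of my own, not Dave’s code:

```sql
EXEC sp_execute_external_script
    @language = N'R',
    @script   = N'
        # The data frame carries its own column names and classes...
        OutputDataSet <- data.frame(col_name  = c("a", "b"),
                                    col_class = c("character", "integer"),
                                    stringsAsFactors = FALSE)'
WITH RESULT SETS
(
    -- ...but SQL Server only uses the names and types declared here
    (ColumnName NVARCHAR(128), ColumnClass NVARCHAR(128))
);
```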

This is a work in progress as Dave continues his series.


Getting Started With Zeppelin

Sangeeta Gulia shows us how to get started building notebooks with Apache Zeppelin on top of Spark:

There are 3 interpreter modes available in Zeppelin.

1) Shared Mode

In Shared mode, a single SparkContext and a single Scala REPL are shared among all interpreters in the group, so every Note shares that SparkContext and Scala REPL. In this mode, if NoteA defines variable ‘a’, then NoteB is able not only to read variable ‘a’ but also to override it.

2) Scoped Mode

In Scoped mode, each Note has its own Scala REPL, so a variable defined in one Note cannot be read or overridden in another Note. However, a single SparkContext still serves all the interpreter groups: all jobs are submitted to this SparkContext, and the fair scheduler schedules them. This can be useful when a user does not want to share a Scala session but does want to keep a single Spark application and leverage its fair scheduler.

3) Isolated Mode

In Isolated mode, each Note has its own SparkContext and Scala REPL.

The default mode of the %spark interpreter is ‘Globally Shared’.

This is mostly a step-by-step on installing Zeppelin, but does go into some detail on how Zeppelin works.


Everyone’s Data Is Dirty

Chirag Shivalker hits the highlights on dirty data:

It might sound a bit abrupt, but clean data is a myth. If your data is dirty, so is everyone else’s. Enterprises are more than dependent on data these days, and that is going to stay the same in the coming years. They need to collect data in order to analyze it, and that data will not necessarily be 100% clean, pristine, or perfect in nature.

Nearly all companies face the challenge of dirty data in the form of a lot of duplicates, incorrect fields, and missing values. This happens due to an omnichannel data influx, followed by hundreds, if not thousands, of employees wrestling and torturing that data to derive professional outcomes and insights. Don’t forget that even the best data has a tendency to decay within a few weeks.
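
Duplicates and missing values, at least, are cheap to measure. A minimal T-SQL sketch against a hypothetical Customers table (the table and column names are mine):

```sql
-- Duplicate email addresses in a hypothetical Customers table
SELECT Email, COUNT(*) AS duplicate_count
FROM dbo.Customers
GROUP BY Email
HAVING COUNT(*) > 1;

-- Rows missing a phone number
SELECT COUNT(*) AS rows_missing_phone
FROM dbo.Customers
WHERE Phone IS NULL OR Phone = '';
```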

The saying goes that any analytics project is about 80% data cleansing and feature extraction.  I’d say that number’s probably closer to 90-95%, and dirty data is a big part of that.


Query Store Plan Forcing: You Can’t Always Get What You Want

Kendra Little shows an example where trying to force a Query Store plan results in an oddity:

This is not considered a “failure”

When I check the Query Store DMVs, force_failure_count is 0. The last_force_failure_reason_desc is NONE.

Query Store didn’t fail to apply the narrow plan. Instead, it’s just deciding not to give it to me, now that I’ve forced that plan.

Seems kinda like an adolescent, doesn’t it?
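
The columns Kendra mentions live in sys.query_store_plan; here is a quick way to eyeball forced plans and their failure columns (a minimal sketch, not her exact query):

```sql
-- Forced plans and whether any forcing attempt has failed
SELECT p.query_id,
       p.plan_id,
       p.force_failure_count,
       p.last_force_failure_reason_desc
FROM sys.query_store_plan AS p
WHERE p.is_forced_plan = 1;
```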

The answer remains a bit of a mystery, but read on to see how Kendra troubleshoots this.


What To Do With A Database In Source Control

Ed Elliott with Database Source Control 102:

This post is for a specific type of person if you are:

  • New to source control
  • Are getting started on your path to the continuous delivery nirvana
  • Have been able to get your database into some sort of source control system
  • Have more than one person checking in code to the database source
  • Are unsure what to do next

Then this post is for you!

This is a nice post with some next-steps for when you have a database in source control but aren’t quite sure what to do next.
