Author: Kevin Feasel

Event Hub Performance Tips

Published 2018-06-07 by Kevin Feasel

Vincent-Philippe Lauzon has a few tips for improving Azure Event Hub performance:

Here are some recommendations in the light of the performance and throughput results:

If we send many events: always reuse connections, i.e. do not create a connection only for one event. This is valid for both AMQP and HTTP. A simple Connection Pool pattern makes this easy.

If we send many events & throughput is a concern: use AMQP.

If we send few events and latency is a concern: use HTTP / REST.

If events naturally come in batches of many events: use the batch API.

If events do not naturally come in batches: simply stream them. Do not try to batch them unless network IO is constrained.

If a latency of 0.1 seconds is a concern: move the call to Event Hubs away from your critical performance path.

Let’s now look at the tests we did to come up with those recommendations.

Read the whole thing.
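
The connection-reuse tip above can be sketched in a few lines. This is a minimal illustration of a generic connection pool, not code from the linked post: ConnectionPool, FakeConnection and send_all are hypothetical names, and FakeConnection stands in for a real AMQP or HTTP client.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical fixed-size pool: connections are created once and reused."""

    def __init__(self, factory, size=4):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self):
        conn = self._idle.get()      # blocks if every connection is in use
        try:
            yield conn
        finally:
            self._idle.put(conn)     # hand the connection back for reuse


class FakeConnection:
    """Stand-in for a real client; it only records what was sent."""

    def __init__(self):
        self.sent = []

    def send(self, event):
        self.sent.append(event)


def send_all(pool, events):
    """Send every event over a pooled connection instead of opening a new one."""
    for event in events:
        with pool.connection() as conn:
            conn.send(event)
```

With a pool of two connections, sending ten events creates two connections in total rather than ten.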


Privacy-Preserving Data Mining

Published 2018-06-07 by Kevin Feasel

Duncan Greaves gives us a few options for mining data while maintaining user anonymity:

In pseudonymisation, matching data sets at individual row level is done using key fields, which are then pseudonymised for consumption. Candidates for key fields include those combinations that are most often used to match the datasets, e.g. DoB/Gender/Postcode, credit card numbers, IP addresses or email identifiers. Allocation of persistent pseudonyms is used to build up profiles over time, allowing data mining to happen in a privacy-sensitive way.

All methods for privacy aware data mining carry additional complexity associated with creating pools of data from which secondary use can be made, without compromising the identity of the individuals who provided the data. Pseudonymisation can act as the best compromise between full anonymisation and identity in many scenarios where it is essential that the identity is preserved, whilst minimising the risks of re-identification beyond reasonable means.

Read the whole thing.
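
As a rough illustration of persistent pseudonyms, one common approach is a keyed hash over the normalised key fields. This is an assumption for the sake of example, not necessarily the method the linked post describes; pseudonymise and the sample field values are hypothetical.

```python
import hashlib
import hmac


def pseudonymise(secret_key: bytes, *key_fields: str) -> str:
    """Derive a stable pseudonym from key fields using HMAC-SHA256.

    The same fields and key always give the same pseudonym, so rows can
    still be matched across data sets without exposing the raw values.
    """
    # Normalise so that trivial differences in case or padding still match.
    normalised = "|".join(field.strip().lower() for field in key_fields)
    return hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical usage: DoB / gender / postcode as the matching key.
key = b"keep-this-secret-and-out-of-the-data-set"
pseudonym = pseudonymise(key, "1980-01-31", "F", "AB1 2CD")
```

Without the secret key, the pseudonym cannot be recomputed from guessed inputs, which is why the key has to be held separately from the mined data.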


Aim For Database-First Designs

Published 2018-06-07 by Kevin Feasel

Lukas Eder explains why database-first design patterns tend to work better than code-first design:

The real “truth” of your database schema, and the “sovereignty” over it, resides with your database. The database is the only place where the schema is defined, and all clients have a copy of the database schema, not vice versa. The data is in your database, not in your client, so it makes perfect sense to enforce the schema and its integrity in the database, right where the data is.

This is old wisdom, nothing new. Primary and unique keys are good. Foreign keys are good. Check constraints are good. Assertions (when they’re finally implemented) are good.

And that’s not where it ends. For instance, if you’re using Oracle, you may want to specify:

In what tablespace your table resides

What PCTFREE value it has

What the cache size of your sequence (behind the identity) is

Maybe all of this doesn’t matter in small systems, but you don’t have to go “big data” before you can profit from vendor-specific storage optimisations such as the above. None of the ORMs I’ve ever seen (including jOOQ) will allow you to use the full set of DDL options that you may want to use on your database. ORMs offer some tools to help you write DDL.

But ultimately, a well-designed schema is hand written in DDL. All generated DDL is only an approximation of that.

It’s a great post. Also check out Lukas’s responses in the comments section.


Starting And Stopping SQL Server From The Command Line

Published 2018-06-07 by Kevin Feasel

Marek Masko shows us how to start and stop SQL Server instances in Windows and Linux from the command line:

On Linux, we don’t yet have named instances, so all commands are executed against the default instance. SQL Server can be managed using the systemctl command.

To check the current state of SQL Server instance you can run this command:

sudo systemctl status mssql-server

Click through for more.
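
For scripting, the same systemctl call can be wrapped in a small helper. This is a sketch under assumptions: mssql_command and run_mssql are hypothetical names, and it presumes a Linux host where the mssql-server unit is installed and sudo is available.

```python
import subprocess
from typing import List

# systemctl verbs this sketch allows for the mssql-server unit.
ALLOWED_ACTIONS = ("status", "start", "stop", "restart")


def mssql_command(action: str) -> List[str]:
    """Build the argument list for a systemctl call against mssql-server."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    return ["sudo", "systemctl", action, "mssql-server"]


def run_mssql(action: str) -> int:
    """Run the command and return its exit code (0 means success)."""
    return subprocess.run(mssql_command(action), check=False).returncode
```

Building the argument list separately from running it keeps the command easy to inspect or log before anything is executed.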


The CXCONSUMER Wait Type

Published 2018-06-07 by Kevin Feasel

Paul Randal explains the CXCONSUMER wait type:

Back in September 2016 I created a Connect item (3102145) to split the CXPACKET wait into benign and actionable waits, so that when a CXPACKET wait happens, there’s really something to investigate. Unfortunately the Connect website has been shut down by Microsoft so I can’t point you to the original request I made (I also tried in the Internet Archive but couldn’t find it).

The reason I wanted this change is that CXPACKET waits have always been registered by both producer and consumer threads for query plan operators where some threads produce data (i.e. producer threads) and some threads consume the produced data (i.e. consumer threads). However, the waits for the consumer threads are not actionable, because it’s the *producer* threads that are the cause of the consumer thread waits, and so it’s the producer thread waits that are actionable. By splitting the consumer waits out, the number of CXPACKET waits should be reduced, and those that are left should be investigated.

During the PASS Summit in 2017, my friend Pedro Lopes (b|t) on the Tiger Team announced that they’d made the change. The new wait type, CXCONSUMER, was initially added to SQL Server 2017 RTM CU3 and was added to SQL Server 2016 SP2 in April. You can read Pedro’s blog post here. Note that there was initially a bug with the implementation which was fixed in SQL Server 2017 RTM CU4.

It’s nice to see this split, as CXPACKET has been one of the most ambiguous wait types for a while, due to it having multiple potential meanings.


Making Power BI Reports Accessible

Published 2018-06-07 by Kevin Feasel

Meagan Longoria has a checklist for ensuring that your Power BI reports are accessible to all users:

Tooltips

Don’t use tooltips to convey important information. Users with motor issues and users who do not use a mouse will have difficulties accessing them.
Do add tooltips to charts as ancillary information. It is included in the accessible Show Data table for each visual.

There’s a lot of good information here.


Using RAISERROR Instead Of PRINT

Published 2018-06-07 by Kevin Feasel

Randolph West recommends using RAISERROR WITH NOWAIT rather than PRINT for printing messages:

Read that last line again. It’s saying that the PRINT command needs to fill some sort of memory buffer before you see any output. Sometimes you’re lucky, and sometimes you’re not. Erland Sommarskog says the same thing in the series Error and Transaction Handling in SQL Server (Part 1: Jumpstart Error Handling | Part 2: Commands and Mechanisms | Part 3: Implementation).

If you want immediate output from your long-running process, use RAISERROR instead of PRINT, and make use of the WITH NOWAIT parameter.

Naturally I can’t leave you hanging with a 150-word blog post, most of which is quoting Erik, so I decided to do some digging. The question I posed myself was “How big is the PRINT buffer?”

I always use error level 10, as that’s the minimum level guaranteed to print to the console immediately and it does not write to error logs or stop operations.


Databricks MLflow

Published 2018-06-06 by Kevin Feasel

Matei Zaharia announces a new Databricks offering:

MLflow is inspired by existing ML platforms, but it is designed to be open in two senses:

Open interface: MLflow is designed to work with any ML library, algorithm, deployment tool or language. It’s built around REST APIs and simple data formats (e.g., a model can be viewed as a lambda function) that can be used from a variety of tools, instead of only providing a small set of built-in functionality. This also makes it easy to add MLflow to your existing ML code so you can benefit from it immediately, and to share code using any ML library that others in your organization can run.

Open source: We’re releasing MLflow as an open source project that users and library developers can extend. In addition, MLflow’s open format makes it very easy to share workflow steps and models across organizations if you wish to open source your code.

MLflow is currently in alpha, but we believe that it already offers a useful framework to work with ML code, and we would love to hear your feedback. In this post, we’ll introduce MLflow in detail and explain its components.

Even in alpha, it looks nice.


The Basics Of Azure Stream Analytics

Published 2018-06-06 by Kevin Feasel

Chris Seferlis gives us an overview of Azure Stream Analytics:

Here’s how it works. It starts with a data source such as Event Hub, IoT Hub or Azure Blob Storage, and it uses a SQL-like query language that allows transformation on the fly. It helps you process operations like filtering, sorting, aggregating and joining the data together to make it more useable, turning data into information.

From there, once you identify the data that you want or need to use, you can send it downstream to a queue for triggering workflows or further processing. You can also send that data to Power BI for real-time visualization. For example, let’s say you’re looking at a data quality stream and you want to pull certain key words out of Twitter to see how they’re used. By connecting to the Twitter API, you can capture that data, stream it, and then report from it with a Power BI report.

Chris also has a video which you can watch.


Lookups And Conditionals In Azure Data Factory V2

Published 2018-06-06 by Kevin Feasel

Alex Whittles shows us how to perform lookups and operations with IF clauses in Azure Data Factory V2:

Azure Data Factory v2 (ADFv2) has some significant improvements over v1, and we now consider ADF as a viable platform for most of our cloud based projects. But things aren’t always as straightforward as they could be. I’m sure this will improve over time, but don’t let that stop you from getting started now.

This post provides a walk through of using the ‘Lookup’ and ‘If Condition’ activities to do some basic conditional logic depending on the results of a database query.

Assumptions: You already have an ADF pipeline created. If you want to hook into SSIS then you’ll also need the SSIS Integration Runtime set up, although this is not needed for the If Condition alone.

Read on for an example.

