Press "Enter" to skip to content


Tuning Kafka And Spark Data Pipelines

Larry Murdock explains the tuning options available for Kafka and Spark Streaming:

Kafka is not the Ferrari of messaging middleware, rather it is the salt flats rocket car. It is fast, but don’t expect to find an AUX jack for your iPhone. Everything is stripped down for speed.

Compared to other messaging middleware, the core is simpler and handles fewer features. It is a transaction log and its job is to take the message you sent asynchronously and write it to disk as soon as possible, returning an acknowledgement once it is committed via an optional callback. You can force a degree of synchronicity by chaining a get to the send call, but that is kind of cheating Kafka’s intention. It does not send it on to a receiver. It only does pub-sub. It does not handle back pressure for you.

I like this as a high-level overview of the different options available.  Definitely gets a More Research Is Required tag, but this post helps you figure out where to go next.


Splitting A Small Database

Brent Ozar explains why he recommended a client break out a small database:

Listen, I can explain. Really.

We had a client with a 5GB database, and they wanted it to be highly available. The data powered their web site, and that site needed to be up and running in short order even if they lost the server – or an entire data center – or a region of servers.

The first challenge: they didn’t want to pay a lot for this muffler database. They didn’t have a full time DBA, and they only had licensing for a small SQL Server Standard Edition.

Read on for the full explanation.  Given the constraints and expectations, it makes sense, and this is a good example of figuring out how expected future growth can change the bottom line for a DBA.


Azure Networking

Joshua Feierman has an article on how Azure Networking works, particularly from the viewpoint of a DBA:

The connecting thread between an Azure virtual machine and a virtual network is a Virtual Network Interface Card, or VNic for short. These are resources separate and distinct from the virtual machine and the network itself, and they can be assigned to a given virtual machine.

If you go to the “All Resources” screen and sort by the “Type” column, you will find a number of network interface resources.

There’s some good information in here.


SQL Server On VMware Guide

David Klee announces an update to VMware’s SQL Server best practices guide:

I am proud to announce that we contributed to the latest revision of the Microsoft SQL Server on VMware best practices guide, freely available at this address. This document outlines some of the common VM-level tweaks and adjustments that are made when running enterprise SQL Server VMs on VMware platforms. This guide is considered a must-read if you manage these sorts of SQL Servers, which cannot be treated as general purpose virtual machines.

This guide was recently updated for vSphere 6.5, and we consider it an absolute must for your enterprise management library!

If you manage SQL Server instances on VMware, it’s definitely worth the read.


Entity Framework Slow, News At 11

Jovan Popovic shows that Entity Framework is slow and Dapper is fast:

To set up the test, you can go to the StackExchange/Dapper GitHub repository and download the source code. The tests are created as a C# solution (Dapper.sln). When you open this solution, you can find the Dapper.Tests project. You might need to change two things:

  1. Connection strings are hard-coded in the Tests.cs file with values like "Server=(local)\SQL2014;Database=tempdb;User ID=sa;Password=Password12!". You might need to change these to match your own connection info.
  2. The project is compiled using dotnet SDK 1.0.0-preview2-003121, so you might get compilation errors if you don't have a matching framework. I removed the line "sdk": { "version": "1.0.0-preview2-003121" } from global.json to fix this.

Now you will be able to build the project and run the tests.

Nothing’s going to be faster than hand-crafted, well-tuned statements from people who know what they’re doing.  Micro-ORMs like Dapper and FSharp.Data.SqlClient accept a small speed hit in exchange for developer niceties.  Heavier frameworks like Entity Framework and NHibernate add a lot more, but tend to be significantly slower.


Supersized Tables

Deborah Melkin tells a story of a design battle she lost:

The programmers came to me and said we need to add a large number of columns to this table for one piece of functionality. It would more than double the total number of columns on the table. Oh, and all of the new columns would be NULL since we would only need to populate them if they were using that functionality and even then, not all of them would require data. The final result would be that 65-75% of the table would end up having nullable fields with the majority of those having NULL for the value.

I said what I think any sane DBA would say to this request: No.

Click through for the rest of the tale.
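For what it’s worth, the usual counter-proposal in this situation is vertical partitioning: keep the base table narrow and move the optional, feature-specific columns into a one-to-one extension table that only gets a row when that functionality is actually in use. A hypothetical sketch (the table and column names are mine, not Deborah’s):

    -- Hypothetical sketch of the narrow-table alternative.  The base
    -- table keeps its original columns; the extension table holds the
    -- feature-only columns and is populated only for rows that use the
    -- new functionality.
    CREATE TABLE dbo.Widget
    (
        WidgetID   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        WidgetName varchar(100)      NOT NULL
    );

    CREATE TABLE dbo.WidgetFeatureDetail
    (
        WidgetID   int         NOT NULL PRIMARY KEY
                   REFERENCES dbo.Widget (WidgetID),
        Attribute1 varchar(50) NULL,
        Attribute2 varchar(50) NULL
        -- ...the remaining feature-specific columns
    );

Queries that need the extra attributes LEFT JOIN to the extension table; the 65-75% of rows that never use the feature never touch it.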


Serverless Azure

Christos Matskas has an article on Azure Functions, Service Fabric, and Batch:

This service is the hidden gem of HPC (high performance computing) within the Azure Compute service family. As the name implies, Azure Batch is designed to run large-scale and high-performance computing applications efficiently in the cloud. When you’re faced with large workloads, all you have to do is to use Azure Batch to define compute resources to execute your applications in parallel and at the desired scale. A good use-case for Azure Batch would be to perform financial risk modelling, climate data analysis or stress testing. What makes Batch so useful is the fact that you don’t need to manually manage the node cluster, virtual networks or scheduling because all this is handled by the service. You need to define a job, any associated data and the number of nodes you want to utilise. It makes no difference if you need to run on one, a hundred or even thousands of nodes. The service is designed to scale according to the workload needs.

The cheapest server may very well be no server, and we’re at the point where relatively simple services could just run as Azure Functions or AWS Lambda functions.


That 53rd Week

Jens Vestergaard notes that you can sometimes have a 53rd week in the year:

There are a lot of great examples out there on how to build your own custom Time Intelligence into Analysis Services (MD). Just have a look at this, this, this, this and this. All good sources for solid Time Intelligence in SSAS.
One thing they have in common, though, is that they all make the assumption that there are and always will be 52 weeks in a year. The data set I am currently working with is built on the ISO 8601 standard. In short, this means there is a recurring 53rd full week, as opposed to only 52 in the Gregorian version, which is defined by: 1 Gregorian calendar year = 52 weeks + 1 day (2 days in a leap year).

The 53rd week occurs approximately every five to six years, though not on a fixed schedule. The last few times we saw 53 weeks in a year were 1998, 2004, 2009, and 2015. The next time will be in 2020. This gives you enough time to either forget about the hacks and hard-coded fixes in place to mitigate the issue OR bring your code into a good state, ready for the next time.

Dates and currency are hard problems.
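If you want to see the 53rd week for yourself, SQL Server’s DATEPART function has an ISO_WEEK option that follows the ISO 8601 numbering; a quick sanity check:

    -- DATEPART(ISO_WEEK, ...) uses the ISO 8601 definition, so the last
    -- day of the year tells you whether that year carried a 53rd week.
    SELECT DATEPART(ISO_WEEK, '20151231') AS LastDayOf2015,   -- 53
           DATEPART(ISO_WEEK, '20161231') AS LastDayOf2016,   -- 52
           DATEPART(ISO_WEEK, '20201231') AS LastDayOf2020,   -- 53
           DATEPART(ISO_WEEK, '20160101') AS FirstDayOf2016;  -- 53: it still belongs to 2015's final ISO week

Note that there is no matching ISO_YEAR datepart, so if you build an ISO calendar dimension, you have to derive the week’s owning year yourself.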


Finding Candidate Keys

Daniel Hutmacher explains ways to find candidate keys:

Let’s assume we have a temp table heap called #table, with 9 columns and no indexes at all. Some columns are integers, one is a datetime, and a few are numeric. As I’m writing this post, my test setup has about 14.4 million rows.

In the real world, when you’re investigating a table for primary key candidates, there are a few things you’ll be looking for that are beyond the scope of this post. For instance, it’s a fair assumption that a numeric or float column is not going to be part of a primary key, varchar columns are less probable candidates than integer columns, and so on. Other factors you would take into consideration are naming conventions; column names ending with “ID” and/or columns that you can tell are foreign keys would also probably be good candidates.

It’s useful to think of all the candidate keys, as getting to Boyce-Codd Normal Form or 4th/5th NF involves dealing with all potential primary keys, not just the one you selected.  Daniel’s post gives you several different methods of searching existing data; combine that with domain knowledge and a bit of logic and you have a pretty decent start at finding candidate keys.
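The core of the simplest test (though not necessarily how Daniel implements it) is just comparing distinct counts to the table’s row count; a minimal sketch with hypothetical column names:

    -- If a column (or combination of columns) has as many distinct
    -- values as the table has rows, it is a candidate key for the data
    -- at hand.  Column names here are hypothetical.
    SELECT COUNT(*)                                           AS TotalRows,
           COUNT(DISTINCT CustomerID)                         AS DistinctCustomerIDs,
           COUNT(DISTINCT CONCAT(CustomerID, '|', OrderDate)) AS DistinctPairs
    FROM #table;

Bear in mind that COUNT(DISTINCT ...) ignores NULLs and CONCAT folds them into empty strings, so nullable columns need extra scrutiny (and are disqualified from primary keys anyway).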


Leading Wildcard Seek Triggers

Aaron Bertrand demonstrates the triggers you could use if you wanted to build leading wildcard seek tables:

In my last post, “One way to get an index seek for a leading wildcard,” I mentioned that you would need triggers to deal with maintaining the fragments I recommended. A couple of people have contacted me to ask if I could demonstrate those triggers.

To simplify from the previous post, let’s assume we have the following tables – a set of Companies, and then a CompanyNameFragments table that allows for pseudo-wildcard searching against any substring of the company name.

Read on for the triggers.
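As a taste of what is involved, here is a simplified insert-only sketch (not Aaron’s actual implementation; updates and deletes need handling too, and the Name column is assumed from the post’s description):

    -- On insert, store every suffix of the company name so that a
    -- search for '%term%' can be rewritten as an index seek on
    -- Fragment LIKE 'term%'.
    CREATE TRIGGER dbo.Companies_BuildFragments
    ON dbo.Companies
    AFTER INSERT
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT dbo.CompanyNameFragments (CompanyID, Fragment)
        SELECT i.CompanyID,
               SUBSTRING(i.Name, n.n, LEN(i.Name))  -- the suffix starting at position n
        FROM inserted AS i
        CROSS APPLY
        (
            SELECT TOP (LEN(i.Name))
                   ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
            FROM sys.all_objects
        ) AS n;
    END;

The CROSS APPLY is a quick stand-in for a proper numbers table: it generates one row per character position, so each suffix of the name gets its own fragment row.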
