Press "Enter" to skip to content


Cassandra Monitoring and Data Modeling

Instaclustr has put up a couple of interesting posts on Cassandra. First, Anup Shirolkar explains how we can monitor Cassandra installations:

Cassandra is developed in Java and is a JVM based system. Each Cassandra node runs a single Cassandra process. JVM based systems are enabled with JMX (Java Management Extensions) for monitoring and management. Cassandra exposes various metrics using MBeans which can be accessed through JMX. Cassandra monitoring tools are configured to scrape the metrics through JMX and then filter, aggregate, and render the metrics in the desired format. There are a few performance limitations in the JMX monitoring method, which are referred to later. 

The metrics management in Cassandra is performed using the Dropwizard library. The metrics are collected per node in Cassandra. However, those can be aggregated by the monitoring system. 
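If you want to kick the tires on that yourself, here is a minimal sketch of scraping one of those per-node MBeans from Python. It assumes the node's JVM runs a Jolokia agent (a JMX-over-HTTP bridge) on its default port; the MBean name is the standard client read-latency metric, but the rest of the setup is hypothetical.

```python
# A minimal sketch, assuming a Jolokia agent (JMX-over-HTTP bridge) is
# attached to the Cassandra JVM on its default port 8778.
import requests

# Dropwizard-backed MBean for client read-request latency on this node.
MBEAN = "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency"

def read_latency_metrics(host: str = "localhost", port: int = 8778) -> dict:
    """Fetch all attributes of the read-latency MBean from a single node."""
    url = f"http://{host}:{port}/jolokia/read/{MBEAN}"
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    return response.json()["value"]  # Count, Mean, 99thPercentile, ...

if __name__ == "__main__":
    # These numbers are per node; aggregation is the monitoring system's job.
    for attribute, value in read_latency_metrics().items():
        print(f"{attribute}: {value}")
```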

On the development side, the Instaclustr team walks us through data modeling guidelines:

The ultimate goal of Cassandra data modeling and analysis is to develop a complete, well organized, and high performance Cassandra cluster. Following the five Cassandra data modeling best practices outlined will hopefully help you meet that goal:

1. Cassandra is not a relational database, don’t try to model it like one
2. Design your model to meet 3 fundamental goals for data distribution
3. Understand the importance of the Primary Key in the overall data structure 
4. Model around your queries but don’t forget about your data
5. Follow a six step structured approach to building your model. 

Because Cassandra’s query language (CQL) looks so much like SQL, it’s easy to forget that data is stored completely differently and that design decisions are quite different from what we see in the relational world.
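To make points 2 and 3 concrete, here is a quick sketch using the DataStax Python driver. The keyspace, table, and query are made up for illustration, but they show how the partition key drives distribution and the clustering column drives sort order.

```python
# A sketch of query-first table design with the DataStax Python driver.
# The keyspace, table, and cluster address are hypothetical.
import uuid
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# The partition key (user_id) controls how data is distributed across nodes;
# the clustering column (added_at) controls sort order within each partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.videos_by_user (
        user_id   uuid,
        added_at  timestamp,
        video_id  uuid,
        title     text,
        PRIMARY KEY ((user_id), added_at)
    ) WITH CLUSTERING ORDER BY (added_at DESC)
""")

# The table is built to answer one query efficiently:
# "latest videos for a given user."
some_user_id = uuid.uuid4()
rows = session.execute(
    "SELECT video_id, title FROM demo.videos_by_user WHERE user_id = %s LIMIT 10",
    (some_user_id,),
)
```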


Service Broker Conversations and Messages

Chris Johnson is working on a series on Service Broker fundamentals. This post covers conversations and messages:

First, each conversation has its own unique dialog_handle (not conversation handle, despite everything calling it a conversation from now on, score one for consistency Microsoft). We need to capture this handle in a UNIQUEIDENTIFIER variable, as we will need it later on to send messages across the conversation. In fact, the statement will error if you don’t supply a variable to capture the handle.

Second, we need to supply both FROM and TO services. These tell the conversation which service is the source and which is the target. Remember, each service is attached to a queue, and can have one or more contracts attached to it. The source service is a database object, but the target service is an NVARCHAR. This allows the target service to live outside the database, which is something that I will cover at some point in the Service Broker 201 series.

This is a nice explanation of the process, so if you aren’t particularly familiar with Service Broker, check out Chris’s series, starting with the first and second posts.
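If you would rather see the moving parts than read about them, here is a minimal, hypothetical sketch of that first step (begin a dialog, capture the handle, send a message), driven from Python via pyodbc. The service, contract, and message type names are placeholders that would already need to exist in the database.

```python
# A minimal sketch of opening a conversation and sending one message,
# run from Python via pyodbc. The service, contract, and message type
# names are hypothetical and must already exist in the database.
import pyodbc

T_SQL = """
DECLARE @handle UNIQUEIDENTIFIER;  -- captures the dialog handle

BEGIN DIALOG CONVERSATION @handle
    FROM SERVICE [//Demo/InitiatorService]   -- a database object
    TO SERVICE N'//Demo/TargetService'       -- an NVARCHAR, may live elsewhere
    ON CONTRACT [//Demo/Contract]
    WITH ENCRYPTION = OFF;

SEND ON CONVERSATION @handle
    MESSAGE TYPE [//Demo/RequestMessage] (N'Hello from the initiator');
"""

connection = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=BrokerDemo;Trusted_Connection=yes;",
    autocommit=True,
)
connection.cursor().execute(T_SQL)
```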


Setting Drive Allocation Unit Size using PowerShell

Eric Cobb has a tiny script for us:

There seems to be some ongoing debate around whether or not formatting your data and log drives with 64KB allocation unit size even matters anymore. I would encourage you to do your own research to determine if you want to do this or not. My personal take on it is: if it doesn’t hurt, and it may help, and it only takes 2 seconds to click the “go” button on my PowerShell script, then I would rather go ahead and do it and not need it than not do it and wish I had later down the road.

I don’t have a strong opinion in that debate, myself, so I’ll just say that if you want to see how to do this in a couple of lines of PowerShell code, check out Eric’s post.
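If you just want to verify what a drive is currently formatted with before deciding, a quick (hypothetical) Python check against Windows' fsutil tool does the trick:

```python
# A quick check (not Eric's PowerShell script) of a drive's current
# allocation unit size, using the Windows fsutil tool from an elevated prompt.
import subprocess

def allocation_unit_size(drive: str = "D:") -> int:
    """Return the NTFS bytes-per-cluster value reported by fsutil."""
    output = subprocess.run(
        ["fsutil", "fsinfo", "ntfsinfo", drive],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in output.splitlines():
        if "Bytes Per Cluster" in line:
            value = line.split(":", 1)[1].strip().split()[0]
            return int(value.replace(",", ""))
    raise RuntimeError(f"Could not parse fsutil output for {drive}")

print(allocation_unit_size("D:"))  # 65536 on a drive formatted with 64KB units
```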


Generating SQL Server Data Tools Solutions from Templates

Sander Stad walks us through creating a template for building SSDT solutions:

Yes, templates. But how are we going to create a template for an SSDT solution in such a way that it can be reused?

That’s where the PowerShell module called “PSModuleDevelopment” comes in. PSModuleDevelopment is part of the PSFramework PowerShell module.

The PSModuleDevelopment module enables you to create templates for files but also entire directories. Using placeholders, you can replace values in the template, making it possible to have the name and other variables set.

This is where the SSDT template comes in. I have created a template for SSDT that contains two projects. One project is meant for the data model and the other project is meant for the unit tests.

Read the whole thing and check out Sander’s GitHub repo.
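The placeholder concept itself is simple enough to mock up anywhere. Here is a tiny Python illustration of the idea; this is emphatically not PSModuleDevelopment, and the __NAME__ token and folder names are made up.

```python
# Not PSModuleDevelopment; just an illustration of the placeholder idea:
# copy a template folder, substituting a project name in both file names
# and file contents. The "__NAME__" token and folder names are hypothetical.
from pathlib import Path

def render_template(template_dir: str, target_dir: str, project_name: str) -> None:
    """Copy template_dir to target_dir, replacing every __NAME__ placeholder."""
    for source in Path(template_dir).rglob("*"):
        relative = str(source.relative_to(template_dir)).replace("__NAME__", project_name)
        destination = Path(target_dir) / relative
        if source.is_dir():
            destination.mkdir(parents=True, exist_ok=True)
        else:
            destination.parent.mkdir(parents=True, exist_ok=True)
            destination.write_text(source.read_text().replace("__NAME__", project_name))

# One template, many solutions: a data model project plus a unit test project.
render_template("ssdt-template", "MyNewSolution", "MyNewSolution")
```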


Proper Ways to Store Currency Data in SQL Server

Randolph West thinks about ways to store money values in SQL Server:

I completely agree with this statement. Never store values used in financial calculations as floating point values, because a floating point is an approximate representation of a decimal value, stored as binary. In most cases it is inaccurate as soon as you store it. You can read more in this excellent — if a little dry — technical paper.

With that out of the way, we get into an interesting discussion about the correct data type to store currency values.

Randolph makes an argument for why DECIMAL(19,4) is fine. And it’s great for most cases, though the one “real” financial system I’ve worked with stored money as integer types (with SQL Server, that’d be BIGINT) because of precision, especially when working with exchange rates. But for most cases—especially when you’re not building the system of record for financial transactions or accounts—I agree with Randolph that DECIMAL is fine. Dave Wentzel has a great comment explaining even further the reasoning behind integer values for certain monetary columns.
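A two-minute way to see the difference in practice (in Python, but the same logic applies to FLOAT vs. DECIMAL vs. BIGINT in SQL Server):

```python
# Why floats are risky for money, plus the two alternatives discussed:
# exact decimals (DECIMAL(19,4)-style) and whole integer units (BIGINT-style).
from decimal import Decimal

# Binary floating point cannot represent most decimal fractions exactly.
print(0.1 + 0.2)                           # 0.30000000000000004
print(0.1 + 0.2 == 0.3)                    # False

# Exact decimal arithmetic, analogous to DECIMAL(19,4).
print(Decimal("0.1") + Decimal("0.2"))     # 0.3

# Integer approach: store the smallest unit you care about (here,
# ten-thousandths of a currency unit) and only format at the edges.
amount_in_ten_thousandths = 199_950        # 19.9950 currency units
print(amount_in_ten_thousandths / 10_000)  # 19.995
```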


Alerting when Power BI Tenant Settings Change

Melissa Coates walks us through how to track Power BI tenant settings changes:

This post discusses two methods of receiving an alert when a tenant setting has changed in the Power BI Service: one using Cloud App Security, and the other using the M365 Security & Compliance Center.

Tenant settings in the Power BI Service are among the most important things to get right. Once they are set the way you want, the objective is to ensure all changes are well-controlled. Tip: Check section 10 in the Planning a Power BI Enterprise Deployment whitepaper—I included some tenant setting recommendations in the latest version.

Read on to see how.


Genetic Algorithms in Python

Abhinav Choudhary walks us through building a genetic algorithm library in Python:

Here are quick steps for how the genetic algorithm works:

1. Initial Population– Initialize the population randomly based on the data.
2. Fitness function– Find the fitness value of each of the chromosomes (a chromosome is a set of parameters which defines a proposed solution to the problem that the genetic algorithm is trying to solve)
3. Selection– Select the best-fitted chromosomes as parents to pass their genes to the next generation and create a new population
4. Cross-over– Create a new set of chromosomes by combining the parents and add them to the new population set
5. Mutation– Perform mutation, which alters one or more gene values in a chromosome in the new population set generated. Mutation helps in getting more diversity. The obtained population will be used in the next generation.

I’m a sucker for genetic algorithms (and even more so for their cousin, genetic programming). And there are still good use cases for genetic algorithms, especially in creating scoring functions for neural networks.
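If you have never written one, the whole loop fits in a few dozen lines. Here is a toy sketch of those five steps that maximizes the number of 1s in a bitstring; every constant in it is arbitrary.

```python
# A minimal sketch of the five steps above (initialize, score fitness,
# select, cross over, mutate), maximizing the count of 1s in a bitstring.
import random

GENES, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.02

def fitness(chromosome):
    return sum(chromosome)  # step 2: fitness function (count the 1s)

def select(population):
    # step 3: keep the better-fitted half of the population as parents
    ranked = sorted(population, key=fitness, reverse=True)
    return ranked[: len(ranked) // 2]

def crossover(parent_a, parent_b):
    # step 4: single-point crossover combining two parents
    point = random.randint(1, GENES - 1)
    return parent_a[:point] + parent_b[point:]

def mutate(chromosome):
    # step 5: randomly flip genes to keep the population diverse
    return [1 - gene if random.random() < MUTATION_RATE else gene
            for gene in chromosome]

# step 1: random initial population
population = [[random.randint(0, 1) for _ in range(GENES)]
              for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    parents = select(population)
    population = [mutate(crossover(*random.sample(parents, 2)))
                  for _ in range(POP_SIZE)]

print("best fitness:", max(fitness(c) for c in population), "out of", GENES)
```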


Avoiding Overfitting and Underfitting in Neural Networks

Manas Narkar provides some advice on optimizing neural network models:

Adding Dropout

Dropout is considered one of the most effective regularization methods. Dropout is basically randomly zeroing out or dropping out features from your layer during the training process, or introducing some noise in the samples. The key thing to note is that this is only applied at training time. At test time, no values are dropped out. Instead, they are scaled. The typical dropout rate is between 0.2 and 0.5.

Click through for a demo on dropout, as well as coverage of several other techniques.
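For reference, this is roughly what the quoted advice looks like in Keras (the layer sizes here are arbitrary): dropout only fires during training, and the rate sits in the suggested 0.2 to 0.5 range.

```python
# A small sketch of the quoted advice using Keras; layer sizes are arbitrary.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(784,)),
    layers.Dropout(0.3),                 # randomly zeroes 30% of activations
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(...) applies dropout; model.evaluate() and model.predict() do not.
```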


Row Counts and Arrow Widths, Continued

Hugo Kornelis finishes a series on row counts and arrow widths with a look at Compute Scalar operators:

The Compute Scalar operator is probably the most common of all operators. I hardly ever see an execution plan that doesn’t have at least a few occurrences of this operator. The task of the Compute Scalar operator is a simple one: to use some of the data in its input and, based on that, produce new data that is then added as extra columns in its output.

Because of the simplicity of this task, the actual execution of that task is often done by one of the other operators in the execution plan, and the Compute Scalar operator itself doesn’t actually execute. A side effect is that it can’t track how many rows it processes, because it doesn’t process anything at all. The result is that, even in an execution plan with run-time statistics (aka “actual execution plan”), no run-time statistics will be reported by a Compute Scalar operator when all its computations are performed by other operators. (See also the note in this (retired) Books Online article).

But then Hugo head-fakes us and shows us the real conclusion:

I already described, in a previous post, how sometimes the optimizer can create an execution plan that uses a Filter operator to evaluate a specific predicate, but then a post-optimization rewrite finds a way to push that predicate down into another operator, as a Predicate property, and then removes the Filter operator. When this happens with a bitmap filter, the Estimated Number of Rows is not adjusted, which can be quite confusing.

But for the issue in this post, the root cause was the same, but the error surfaces completely differently.

This has been a fun series to read, showing how an extremely useful signal can nonetheless exhibit problems in many edge cases.


Merge Replication: Subscriber and Publisher Versions

Steve Stedman gives us the proper order for upgrading SQL Server when you’re using merge replication:

Working on a recent SQL Server merge replication project we needed to update some of the servers in a merge replication scenario without upgrading all of them. Consider a merge replication setup with a publisher, a distributor and 2 or more subscribers all on the same version of SQL Server, and you need to upgrade the SQL Server version on the subscriber to a newer version like SQL Server 2019.

Before doing any type of upgrade, I wanted to confirm that things would or would not work. First, checking some Microsoft documentation, it appears that replication from a SQL 2012, SQL 2014, SQL 2016, or any older version of a publisher is not supported to a subscriber running on SQL Server 2019. Or, more specifically, the subscriber needs to be on the same or an older version than the publisher.

Read on for a demo, as well as an interesting caveat.
