Press "Enter" to skip to content

Month: September 2016

Deploying SSDT Scripts

Richie Lee has concerns with database deployments:

At any rate, the script is generated and maybe reviewed… so then what? In SSDT there is no way to create and deploy a script in one step; they are two distinct steps. And even if they were one step, this would still not resolve the issue that troubles me. So what is this issue?

The issue is that by creating a script, and then running the deploy, you cannot be sure that the database is in the exact same state that it was when the initial script was generated. If you don’t already know, SSDT runs a deploy entirely in memory, so as mentioned there is no script created. You have to explicitly create the script as part of the process. Or, if you have already created one, you have to re-create the script.

I’m on the fence here.  In simpler environments, I think Richie has a good point.  But in a complex environment, I wouldn’t even use auto-generated deployment scripts; when you’re changing hundreds of database objects (including adding and modifying columns, backfilling data, modifying indexes, etc.), that automated deployment script is almost guaranteed to fail.  And if it does fail, it could leave you in a state of irreparable harm.

Comments closed

Regular Expressions Against Large Data Sets

Liz Bennett explains types of regular expressions which do not scale:

With recursive backtracking based regex engines, it is possible to craft regular expressions that match in exponential time with respect to the length of the input, whereas the Thompson NFA algorithm will always match in linear time. As the name would imply, the slower performance of the recursive backtracking algorithm is caused by the backtracking involved in processing input. This backtracking has serious consequences when working with regexes at a high scale because an inefficient regex can take orders of magnitude longer to match than an efficient regex. The standard regex engines in most modern languages, such as Java, Python, Perl, PHP, and JavaScript, use this recursive backtracking algorithm, so almost any modern solution involving regexes will be vulnerable to poorly performing regexes. Fortunately, though, in almost all cases, an inefficient regex can be optimized to be an efficient regex, potentially resulting in enormous savings in terms of CPU cycles.

There’s a significant performance difference, so if you work frequently with regular expressions, check this out.
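
To make the backtracking point concrete, here is a quick Python sketch (mine, not Liz’s) comparing a pattern with nested quantifiers against an equivalent pattern without them.  Python’s re module uses a backtracking engine, so the nested version blows up on a near-miss input; its running time roughly doubles with each additional “a” you add to the string, while the simpler pattern stays effectively instant.

# Toy illustration of catastrophic backtracking in a backtracking regex engine.
import re
import time

text = "a" * 25 + "!"          # no possible match, forcing exhaustive backtracking

fast = re.compile(r"^a+$")     # same match semantics, linear time
slow = re.compile(r"^(a+)+$")  # nested quantifiers: exponential backtracking

for name, pattern in [("fast", fast), ("slow", slow)]:
    start = time.perf_counter()
    pattern.match(text)
    print(name, round(time.perf_counter() - start, 4), "seconds")

The usual fix is what Liz describes: rewrite the pattern to remove the ambiguity (here, the nested quantifier) so the engine has only one way to consume each character.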

Comments closed

HBase Performance Tips

Ashish Thapliyal has nine tips for optimizing HBase performance:

Do your RowKeys look like 1, 2, 3… or 00000001, 00000002, 00000003, or do you have a RowKey that starts with a date-time (beginning with the year)? If you answered yes, the bad news is that HBase will not scale for you; you have many options to improve HBase performance, but nothing will compensate for a bad rowkey design.

When the rowkey is in sorted order, all the writes go to the same region and the other regions sit idle doing nothing. You will see that one of your nodes is very stressed trying to cope with all the writes, whereas the other nodes are thanking you for not giving them enough work. So, always salt your keys by adding random numbers or characters to the row key prefix.

If you are using Phoenix on top of HBase, Phoenix provides a way to transparently salt the row key with a salting byte for a particular table. You specify this at table creation time via the “SALT_BUCKETS” table property; typical practice is to set SALT_BUCKETS to the number of region servers.

I think the biggest one is to design your data structures correctly.  This is particularly important if you’re coming at it from a relational background and are thinking in terms of what makes relational databases fast.
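
As a rough illustration of the salting idea (my sketch, not Ashish’s code), you can derive a small salt prefix from a hash of the natural key so that sequential keys spread across regions; Phoenix’s SALT_BUCKETS does essentially this for you transparently.

# Rough sketch of manually salting a monotonically increasing rowkey
# so sequential writes spread across regions instead of hot-spotting.
import hashlib

SALT_BUCKETS = 8  # assumption: roughly the number of region servers

def salted_rowkey(natural_key: str) -> bytes:
    """Prefix the key with a stable one-byte salt derived from its hash."""
    salt = hashlib.md5(natural_key.encode("utf-8")).digest()[0] % SALT_BUCKETS
    return bytes([salt]) + b"-" + natural_key.encode("utf-8")

# Sequential, time-based keys now land in different salt buckets:
for key in ["2016-09-26-000001", "2016-09-26-000002", "2016-09-26-000003"]:
    print(salted_rowkey(key))

The trade-off is that range scans on the natural key now have to fan out across every salt bucket, which is exactly the bookkeeping Phoenix hides from you.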

Comments closed

Distributed File System Replication And Backups

James Anderson discusses an interesting setting within Distributed File System Replication:

Ideas of a cmd job step (after the backup step) that renamed the .bak files to .BTFU started to form, but a quick search showed that there is a default filter on DFSR folders.

  • Files starting with ~ (temporary files created by programs like Word)

  • Files with a .tmp extension

  • Files with a .bak extension

Read on to learn what you can do to remove extension filters within DFSR.

Comments closed

Pearson’s Correlation Coefficient

Mala Mahadevan explains correlation coefficients:

The statistical definition of Pearson’s R Coefficient, as it is called, can be found in detail here for those interested. A value of 1 indicates that there is a strong positive correlation (the two variables in question increase together), 0 indicates no correlation between them, and -1 indicates a strong negative correlation (as one variable increases, the other decreases). But you rarely get a perfect -1, 0, or 1. Most values are fractional and interpreted as follows:
High correlation: .5 to 1.0 or -0.5 to -1.0.
Medium correlation: .3 to .5 or -0.3 to -0.5.
Low correlation: .1 to .3 or -0.1 to -0.3.

Mala includes R and T-SQL code so you can follow along.
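
If you would rather see it in Python, here is a minimal numpy sketch of the same calculation (the data is made up on my part), computing r straight from the definition and checking it against np.corrcoef.

# Pearson's r computed from its definition, checked against numpy's built-in.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # made-up data, roughly linear in x

# Covariance term divided by the product of the two standard deviation terms
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / (
    np.sqrt(((x - x.mean()) ** 2).sum()) * np.sqrt(((y - y.mean()) ** 2).sum())
)

r_builtin = np.corrcoef(x, y)[0, 1]
print(r_manual, r_builtin)   # both close to 1: a high positive correlation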

Comments closed

SSPI

Kevin Hill diagnoses an SSPI error:

Apparently, the account was either locked out from our failed logon attempts, or had been disabled in Active Directory due to its age.  They do that sometimes.  Most likely the issue was the lockout.

We restarted the SQL Server (O/S restart) and that resolved it once the AD group unlocked it.

My assumption is that the lockout either blocked Kerberos authentication due to SPN no longer being valid, or the SPN itself got corrupted.  It was still there, just not working.   Verified its existence through running SetSPN -L with the account name.

This is on my top five list of least helpful error messages.  Even if it is literally true, it does not help you diagnose and correct the issue.  There are a number of potential causes and it’s up to you to troubleshoot each one (assuming you even know that it could be an issue) until it just works again.

Comments closed

Throwing Hardware At The Problem

Erik Darling says get more RAM:

I’m not saying you need a 1:1 relationship between data and memory all the time, but if you’re not caching the stuff users are, you know, using, in an efficient way, you may wanna think about your strategy here.

  • Option 1: Buy some more RAM
  • Option 2: Buy an all flash array

You’ll still need to blow some development time on tuning queries and indexes, but hardware can usually bridge the gap if things are already critical.

Looking at hardware is a reasonable approach.  The best bet is to satisfy the most pressing need at the margin.  Sometimes that means more (or better) hardware, sometimes it means tuning queries, and sometimes it means application-level changes to retrieve data differently.

Comments closed

Group Workspaces

Ginger Grant shows how to use Group Workspaces within Power BI to allow a team to work together on the same files without stepping on each other’s feet:

Much of the functionality people associate with source control programs lives inside the group OneDrive which is created for Power BI. Looking at the picture of the group screen, which was created when a Power BI Workspace was created, you will see that this group contains 7 members and four files. The members of this group are the only ones who have access to the files. The file AcmeThree.pbix is selected, and clicking on the ellipsis (…) brings up a menu for the file. Notice one of the options is Check Out. If I check out a file, the icon next to the name changes, providing a visual cue to all who wish to edit the file that it is being worked on. The menu option for me would change to Check In, providing the ability to check the file in to the directory, allowing others to check out the file and work on it. Notice Version History also exists. This feature allows previous versions of the file to be loaded, which means that changes made to a file can be rolled back.

It’s good that this is available, and I’d make use of it.  For Power BI Desktop, it seems prudent to continue using source control.

Comments closed

Cache Eviction Policies

Dan Luu has a great article from a couple years ago on when a random cache eviction policy might be preferable to Least Recently Used:

Once upon a time, my computer architecture professor mentioned that using a random eviction policy for caches really isn’t so bad. That random eviction isn’t bad can be surprising – if your cache fills up and you have to get rid of something, choosing the least recently used (LRU) is an obvious choice, since you’re more likely to use something if you’ve used it recently. If you have a tight loop, LRU is going to be perfect as long as the loop fits in cache, but it’s going to cause a miss every time if the loop doesn’t fit. A random eviction policy degrades gracefully as the loop gets too big.

In practice, on real workloads, random tends to do worse than other algorithms. But what if we take two random choices and just use LRU between those two choices?

Here are the relative miss rates we get for SPEC CPU with a Sandy Bridge-like cache (8-way associative, 64k, 256k, and 2MB L1, L2, and L3 caches, respectively). These are ratios (algorithm miss rate : random miss rate); lower is better. Each cache uses the same policy at all levels of the cache.

Dan writes at a depth I appreciate and on topics I often don’t understand (particularly when he gets into CPU engineering details).
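
For a feel of the “2-random” policy Dan describes, here is a toy Python sketch (the structure is my assumption, not his code): when the cache is full, pick two resident keys at random and evict whichever was used longer ago.

# Toy "2-random" eviction: sample two random entries, evict the least recently used.
import random

class TwoRandomLRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.last_used = {}     # key -> logical timestamp of last access
        self.clock = 0

    def access(self, key):
        self.clock += 1
        if key not in self.last_used and len(self.last_used) >= self.capacity:
            a, b = random.sample(list(self.last_used), 2)
            victim = a if self.last_used[a] <= self.last_used[b] else b
            del self.last_used[victim]
        self.last_used[key] = self.clock

cache = TwoRandomLRUCache(capacity=4)
for key in [1, 2, 3, 4, 5, 1, 2, 6]:
    cache.access(key)
print(sorted(cache.last_used))   # whichever four keys survived eviction

A real hardware cache only compares lines within a set rather than keeping a global timestamp per key, but the eviction decision is the same idea.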

Comments closed

MapReduce

I talk about Hadoop a good bit on Curated SQL.  Therefore, I think it’s worth mentioning the original MapReduce paper that Jeffrey Dean and Sanjay Ghemawat published in 2004:

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

If you’ve never read this paper before, today might be a good day to do so.
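
The canonical example from the paper is word count; here is what the shape of the two user-supplied functions looks like in a single-process Python sketch, with the grouping step standing in for the shuffle that the framework would perform across machines.

# Word count, sketched locally to show the map and reduce function shapes.
from collections import defaultdict

def map_fn(document: str):
    # Emit an intermediate (word, 1) pair for each word in the input.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word: str, counts):
    # Merge all intermediate values associated with the same key.
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle phase: group intermediate values by key (the framework does this).
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_fn(doc):
        grouped[word].append(count)

results = [reduce_fn(word, counts) for word, counts in grouped.items()]
print(sorted(results))   # [('brown', 1), ('dog', 1), ('fox', 2), ...]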

Comments closed