Press "Enter" to skip to content

Author: Kevin Feasel

Restoring With Standby

Kenneth Fisher describes the WITH STANDBY option for database restorations:

It then leaves the database in a read-only state. So now you can restore the database back to a point in time (even mid log backup). Check the data. Restore a few minutes into the future. Check the data again. Over and over again until you are where you need to be. It’s still going to be tedious but better than doing the full restore over again each time you need to check, right?

On top of that we can use the same idea and combine it with log shipping. Now you not only have a spare in case of a DR situation but that spare can be read only most of the time (except when actually restoring). You can use it to run reports, ad-hoc queries, etc. (Don’t use it for CHECKDBs.)

Those are a few good uses of the WITH STANDBY option.

Comments closed

CISL 1.3.0

Niko Neugebauer has released version 1.3.0 of his Columstore script library:

I am extremely proud to share with everyone the news that the long-awaited and quite overdue release of the CISL – Columnstore Indexes Scripts Library is finally public – the 1.3.0 version. The most important part of this release is the support of the SQL Server 2016 & Azure SQLDatabase – all 3 scenarios (Nonclustered Columnstore, Disk-based Clustered Columnstore & the Memory-Optimised Clustered Columnstore Indexes) are included.
You will be able to explore all the important new architecture objects, such as Deleted Buffer & Deleted Table, plus the scripts for every version supports the new output results, even though there were no In-Memory tables in SQL Server 2012 for example.

If you use columnstore indexes, check this out.

Comments closed

Help With Extended Events

Jason Brimhall has two recent blog posts on figuring out Extended Events information.  First is a republication of an older article:

First let’s tackle the problem of discovery.  When we want to use extended events to try and troubleshoot a problem or to capture more information, it is really good to know if such an event exists.  There are many events that capture data for various different things within SQL Server.  More and more events are being added with each release.  More and more data is being made available to the DBA to help perform a better job and to help the DBA better understand what is really happening within the database environment.

In order to determine if there might be an event, that can provide the data for that one “thing” that may be happening within your environment, we could start by querying the SQL Server Internals.  This next query will do just that for us.

After you read that article and check out the queries there, Jason has another post on finding the right event:

In my previous article I demonstrated how to find an event based solely on the name or description of the event. This is fantastic if the event name (or description) contains one of the magical words you have used. What if the event name or description has nothing to do with the terms you selected? Or, what if the data you seek may be attached to the event but wouldn’t necessarily stand out as a description for that event (by name or description details for that event)?

Now comes the more difficult task right? If the name or description of the event doesn’t relate to the search terms then you just might overlook a few events and be stuck trying to troubleshoot a problem. An equally big problem this could cause is yet another invisible barrier to using Extended Events. It would be easy to slide down the slippery slope and not transition to Extended Events just because an event, applicable to the problem at hand, could not be found.

Check out both of these posts.

Comments closed

Clustering With Spark

Konur Unyelioglu shows how to implement k-means and Guassian clustering techniques in Apache Spark using MLlib:

Clustering is the task of assigning entities into groups based on similarities among those entities. The goal is to construct clusters in such a way that entities in one cluster are more closely related, i.e. similar to each other than entities in other clusters. As opposed to classification problems where the goal is to learn based on examples, clustering involves learning based on observation. For this reason, it is a form of unsupervised learning task.

There are many different clustering algorithms and a central notion in all of those is the definition of ’similarity’ between the entities that are being grouped. Different clustering algorithms may have different ways of measuring the similarity. In many clustering algorithms, another common notion is the so-called cluster center, which is a basis to represent the cluster. For example, in K-means clustering algorithm, the cluster center is the arithmetic mean position of all the points in that cluster.

This is a fairly lengthy article but if you want to get into machine learning with Spark, it’s a good one.

Comments closed

Rename A Primary Key Constraint

Steve Jones shows how to rename a primary key constraint:

When you compare that with the same table in another database, what’s the likelihood that you’ll have the PK named PK__OrderDet__D3B9D30C7D677BB4? Probably pretty low.

This means that if you are looking to deploy changes, and perhaps compare the deployment from one database to the next, you’ll think you have different indexes. Most comparison tools will then want to change the index on your target server, which might be using this technique. Or the choice might be something that performs much worse.

What we want to do is get this named the same on all databases. In this case, the easiest thing to do with rename the constraint on all systems. This is easy to do with sp_rename, which is better than dropping and rebuilding the index.

Do read this and avoid renaming a constraint the bad way.

Comments closed

Statistic Column Sort Order

Shaun J. Stuart points out an inconsistency in display order for columns on a statistic:

What’s going on? Why are the columns in the statistic not in the same order as the columns in the index? Well, it turns out, they are. If we look on the Details page, we see the density vector is, in fact, created as Col2, Col1, Col3, which is the order of the columns in the index:

Read the whole thing to avoid confusion next time you look at the statistics GUI.

Comments closed

Cluster Rebalancing

Peter Coates discusses cluster rebalancing in Hadoop:

After adding new racks to our 70 node cluster, we noticed that it was taking several hours per terabyte to rebalance the nodes. You can copy a terabyte of data across a 10GbE network in under half an hour with SCP, so why should HDFS take several hours?

It didn’t take long to discover the cause—the configuration parameterdfs.datanode.balance.bandwidthPerSecond controls how much bandwidth each node is allowed to use for rebalancing, and it defaults to a conservative value of 10Mb/sec/node, which is 1.25MB/sec. If you have 70 nodes (the number we started with before adding new ones), that’s 87.5MB/second. One terabyte, i.e., a million MB, divided 87.5MB/sec, equals 11,428 sec, or 3.17 hours per TB. The more nodes in the original cluster, the faster it will write.

On the development side, “it’ll automatically rebalance without us having to worry” is a great thing.  On the administrative side, we’re paid to worry about these things…

Comments closed

Data Cleansing Outside Of Excel

Lee Baker shows some free alternatives to Excel for data cleansing:

Another issue is that some Excel functions operate on selected data, whereas others act on the whole worksheet. If you select a column of data and use Find to identify certain characters, it will identify only those characters in your chosen column. If you now use Replace it will change all such characters in the entire worksheet – which is probably not what you wanted to do, and you may have unwittingly introduced new errors into your data without being aware of it.

The safest way to clean your data in Excel is to copy an individual column to a separate worksheet, perform all your cleaning operations in isolation until you’re happy with the result, then copy your cleaned data to your original sheet (or better still, to a new sheet that stores only clean data). The repeated use of Copy, Paste and using multiple worksheets to clean your data can become extremely messy.

Lee recommends three free tools, and they look like they’re worth trying out.

Comments closed

StackLite Dataset

David Robinson reports on a new Stack Exchange data set available to the public:

For each Stack Overflow question asked since the beginning of the site, the dataset includes:

  • Question ID
  • Creation date
  • Closed date, if applicable
  • Deletion date, if applicable
  • Score
  • Owner user ID (except for deleted questions)
  • Number of answers
  • Tags

This is ideal for performing analyses such as:

  • The increase or decrease in questions in each tag over time

  • Correlations among tags on questions

  • Which tags tend to get higher or lower scores

  • Which tags tend to be asked on weekends vs weekdays

  • Rates of question closure or deletion over time

  • The speed at which questions are closed or deleted

This is pretty exciting.  Getting good, high-quality data sets for demonstration and pedagogical purposes is time-consuming, so the fact that the Stack Exchange people are tossing one out our way could be a major time-saver.

Comments closed