
Month: September 2016

Changing Identity Start Value

Kenneth Fisher has a good post on what happens when you change the seed value of an identity column:

Well Paul told me this wasn’t the case. Now when Paul tells me something I believe him, but I also like to run tests. So I decided to use sys.fn_PhysLocCracker(%%physloc%%). %%physloc%% returns a varbinary that gives you the location of the row. When passed to sys.fn_PhysLocCracker(%%physloc%%) it returns the database file, page in the file, and slot number where the row can be found. So to start with I create an identity(1,1) and I run 20 inserts, one at a time, checking row locations each time. This is to confirm I’m right about this part.
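For reference, here is a minimal sketch of the kind of test Kenneth describes; the table and column names below are made up for illustration, not taken from his post.

-- Hypothetical test table with an identity(1,1) column.
CREATE TABLE dbo.IdentityTest
(
    Id INT IDENTITY(1,1) PRIMARY KEY,
    Payload CHAR(100) NOT NULL DEFAULT 'x'
);
GO
-- Insert 20 rows one batch at a time (GO 20 repeats the batch in SSMS).
INSERT INTO dbo.IdentityTest DEFAULT VALUES;
GO 20

-- %%physloc%% returns each row's physical location as a varbinary;
-- sys.fn_PhysLocCracker splits it into file, page, and slot.
SELECT t.Id, loc.file_id, loc.page_id, loc.slot_id
FROM dbo.IdentityTest AS t
CROSS APPLY sys.fn_PhysLocCracker(%%physloc%%) AS loc
ORDER BY t.Id;

-- Change the identity seed, insert more rows, and re-run the query above
-- to see where the new rows land.
DBCC CHECKIDENT ('dbo.IdentityTest', RESEED, 1000);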

Clicking through is worth it for the hypnotizing animated GIFs.


SSMS Connection Colors

Andrew Pruski shows how to change window bar colors within SQL Server Management Studio:

A simple but effective setting in SQL Server Management Studio is using custom colours to identify which server you are about to execute a query on. It’s simple to set up but not everyone who uses SSMS is aware of it so I thought I’d quickly run through the steps here.

This is a nice visual way of figuring out you’re in production before you run that truncate table script.


Azure SQL Data Warehouse Setup

Arun Sirpal configures a new instance of Azure SQL Data Warehouse:

The information shown here is the DSQL (Distributed SQL) plan – When you send a SQL query to SQL Data Warehouse, the Control node processes a query and converts the code to DSQL then the Control node sends the command to run in each of the compute nodes.

The returned query plan depicts sequential SQL statements; when the query runs it may involve parallelized operations, so some of the sequential statements shown may run at the same time. More information can be found at the following URL https://msdn.microsoft.com/en-us/library/mt631615.aspx.
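If you want to look at that DSQL plan yourself, prefixing a query with EXPLAIN in Azure SQL Data Warehouse returns the distributed plan as XML instead of executing the query. The table and column names below are placeholders, not from Arun's example.

EXPLAIN
SELECT c.Region, SUM(s.SalesAmount) AS TotalSales
FROM dbo.FactSales AS s
JOIN dbo.DimCustomer AS c
    ON s.CustomerKey = c.CustomerKey
GROUP BY c.Region;
-- The result is an XML document describing the DSQL steps (for example
-- SHUFFLE_MOVE, BROADCAST_MOVE, RETURN) that the Control node hands to the
-- compute nodes.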

Arun also looks at running a simple Power BI report off of Azure SQL Data Warehouse; click through for that.


Reporting On Unstructured Data

Jen Underwood discusses a tool which helps provide structure to text documents:

Although there are a variety of ways to extract unstructured data from files, one tried-and-true, fast and simple approach is to use Datawatch Monarch. Years ago I used this tool when building Department of Defense digital contract reporting projects. At that time, the process to define data regions and extract unstructured data required a bit of field mapping experimentation. With the latest version of Monarch Auto Define, that process is intelligently automated today.

Looks like an interesting tool.


One-Sample T Tests

Mala Mahadevan shows how to perform one-sample T Tests:

For this post I decided to go with a simple example of how many steps I walked per day for the month of August. My goal is 10,000 steps per day – that has been my average over the year but is this true of the data I gathered in August? I have a simple table with two columns – day and steps. Each record has how many steps I took in August per day, for 30 days. So – SELECT AVG(steps) FROM [dbo].[mala-steps] gives me 8262 as my average number of steps per day in August. I want to know if I am consistently underperforming my goal, or if this is a result of my being less active in August alone. Let me state my problem first – or state what is called the ‘null hypothesis’:

I walk 10,000 steps on an average per day.

Read on for T test operations in T-SQL (although not all operations are available) and R.
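As a rough sketch of the T-SQL side: the one-sample t statistic is just (sample mean - hypothesized mean) / (standard deviation / sqrt(n)), which aggregates nicely in a single query. The table name here is a stand-in for Mala's, and T-SQL can only take you so far.

DECLARE @mu FLOAT = 10000;  -- hypothesized mean (the 10,000-step goal)

SELECT
    AVG(CAST(steps AS FLOAT))                   AS sample_mean,
    STDEV(steps)                                AS sample_stdev,
    COUNT(*)                                    AS n,
    (AVG(CAST(steps AS FLOAT)) - @mu)
        / (STDEV(steps) / SQRT(COUNT(*)))       AS t_statistic
FROM dbo.DailySteps;  -- stand-in for the [mala-steps] table

-- T-SQL has no built-in t distribution, so compare t_statistic against a
-- critical value (roughly 2.045 for a two-tailed test at alpha = 0.05 with
-- 29 degrees of freedom) or hand the value to R for the p-value.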


Clickstream Anomaly Detection

Chris Marshall shows how to perform anomaly detection using AWS Kinesis Analytics:

The RANDOM_CUT_FOREST function greatly simplifies the programming required for anomaly detection.  However, understanding your data domain is paramount when performing data analytics.  The RANDOM_CUT_FOREST function is a tool for data scientists, not a replacement for them.  Knowing whether your data is logarithmic, circadian rhythmic, linear, etc. will provide the insights necessary to select the right parameters for RANDOM_CUT_FOREST.  For more information about parameters, see the RANDOM_CUT_FOREST Function.

Fortunately, the default values work in a wide variety of cases. In this case, use the default values for all but the subSampleSize parameter.  Typically, you would use a larger sample size to increase the pool of random samples used to calculate the anomaly score; for this post, use 12 samples so as to start evaluating the anomaly scores sooner.

Your SQL query outputs one record every ten seconds from the tumbling window so you’ll have enough evaluation values after two minutes to start calculating the anomaly score.  You are also using a cutoff value where records are only output to “DESTINATION_SQL_STREAM” if the anomaly score from the function is greater than 2 using the WHERE clause. To help visualize the cutoff point, here are the data points from a few runs through the pipeline using the sample Python script:

This kind of scenario is pretty cool—you could also do things like detecting service outages in streams (fewer than X events in a window, where X is some very small number relative to your overall data) or changes in advertising campaigns.
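For a sense of what that looks like in practice, here is a rough sketch of a Kinesis Analytics SQL application along the lines described above. The input stream and column names (SOURCE_SQL_STREAM_001, EVENT_COUNT) are assumptions rather than the ones from Chris's pipeline, and EVENT_COUNT is presumed to come from an upstream tumbling-window aggregation.

-- Output stream holding only the anomalous records.
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "EVENT_COUNT"   INTEGER,
    "ANOMALY_SCORE" DOUBLE
);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM "EVENT_COUNT", "ANOMALY_SCORE"
FROM TABLE (
    RANDOM_CUT_FOREST(
        CURSOR (SELECT STREAM "EVENT_COUNT" FROM "SOURCE_SQL_STREAM_001"),
        100,     -- numberOfTrees (default)
        12,      -- subSampleSize, lowered so scoring starts sooner
        100000,  -- timeDecay (default)
        1        -- shingleSize (default)
    )
)
WHERE "ANOMALY_SCORE" > 2;  -- the cutoff mentioned above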


Deadlocks In Apache Ignite

Prachi Garg discusses Deadlock-Free Transactions in Apache Ignite:

When transactions in Ignite are performed with concurrency mode OPTIMISTIC and isolation level SERIALIZABLE, locks are acquired during transaction commit with an additional check allowing Ignite to avoid deadlocks. This also prevents cache entries from being locked for extended periods and avoids “freezing” of the whole cluster, thus providing high throughput. Furthermore, during commit, if Ignite detects a read/write conflict or a lock conflict between multiple transactions, only one transaction is allowed to commit. All other conflicting transactions are rolled back and an exception is thrown, as explained in the section below.

This sounds pretty similar to how SQL Server’s In-Memory OLTP works.


Asynchronous Mirroring And Asynchronous AGs

Tara Kizer contrasts asynchronous database mirroring with asynchronous Availability Groups:

ASYNCHRONOUS DATABASE MIRRORING

For asynchronous Database Mirroring, all we need is two servers: the principal at the primary site and the secondary at the DR site. Set up async mirroring between the two, and you’re done. If the secondary server goes down, production still continues. The transaction log isn’t clearing when the log backups occur because the principal still needs to send those log records to the secondary. As long as you have enough disk space where the transaction log resides to support this until the secondary server comes back online, production still continues. Of course if you run out of disk space, users will start receiving errors. But that can take some time and oftentimes is enough time to get the secondary back online.
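For context, asynchronous mirroring really is just a couple of ALTER DATABASE statements once the mirroring endpoints exist and the backups have been restored WITH NORECOVERY on the mirror; the server and database names below are hypothetical. The log-clearing behavior Tara mentions also shows up in sys.databases.

-- On the mirror server (database restored WITH NORECOVERY):
ALTER DATABASE SalesDB SET PARTNER = 'TCP://principal.contoso.com:5022';

-- On the principal server:
ALTER DATABASE SalesDB SET PARTNER = 'TCP://mirror.contoso.com:5022';
ALTER DATABASE SalesDB SET PARTNER SAFETY OFF;  -- asynchronous (high-performance) mode

-- If the mirror is down, the principal's log cannot clear even after log backups:
SELECT name, log_reuse_wait_desc
FROM sys.databases
WHERE name = N'SalesDB';  -- DATABASE_MIRRORING indicates unsent log records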

Availability Groups are more difficult to implement than database mirroring, but the flipside is that they’re much more powerful.


Apache Hadoop 3 Alpha

Andrew Wang looks at the new alpha for Hadoop:

Support for Multiple Standby NameNodes

HDFS NameNode high availability with QuorumJournalManager uses a Paxos quorum to store the NameNode edit log. With a three-node quorum, this means we can tolerate the loss of any one node and still continue operation.

However, business-critical deployments may wish to run with higher levels of fault-tolerance, e.g. a five-node quorum to be able to tolerate the loss of any two nodes.

QuorumJournalManager already supports an arbitrary number of nodes, but fault tolerance was limited since HDFS was only able to run a single active and single standby NameNode. Hadoop 3 eliminates this restriction by supporting running multiple standby NameNodes. This improves the fault tolerance of HDFS.

This one is huge to me.  It was a sad day when I learned that the “secondary” NameNode was no such thing.


Replication Raiserror Procedure Change

Matt Slocum documents a change to a replication stored procedure in SQL Server 2014 SP2:

I ran into this replication error the other day and I was completely stumped.
Procedure or function sp_MSreplraiserror has too many arguments specified.
 
We started getting that error message shortly after we had applied SP2 for SQL Server 2014 to a server that is a replication Publisher (source of replicated data).

We dug into the commands that were being replicated and found that there were missing rows in the table on the replication Subscriber (destination for replicated data).  Once the rows were populated the errors stopped.  However, after digging in a bit more, we found that this error has an explainable source.
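If you hit the same mismatch, one common remedy (not necessarily the exact fix Matt describes) is to regenerate the subscriber-side custom replication procedures so they match the post-SP2 signature of sp_MSreplraiserror; the publication name below is a placeholder.

-- Run in the published database on the publisher; it scripts out fresh
-- sp_MSins_/sp_MSupd_/sp_MSdel_ procedures to apply at the subscriber.
EXEC sys.sp_scriptpublicationcustomprocs @publication = N'MyPublication';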

Read on for Matt’s solution to the issue.
