Press "Enter" to skip to content

Month: October 2016

RTVS 0.5

David Smith notes that R Tools for Visual Studio has hit version 0.5:

RTVS also makes it easy to run R code as a SQL Server 2016 stored procedure. (This is a great way to make share the results of R code with other database users while making use of the power of the database for R computations.) The new SQL R Stored Procedure file works with SQL Server R Services to create a stored procedure that embeds R code you create, edit and test within RTVS. This greatly simplifies the process of running R code within SQL Server 2016 as you can see below:

The RTVS team is making good progress.  If you passed on RTVS early on, it might be time to take another look.

Comments closed

Phantom Reads

Wayne Sheffield discusses phantom reads:

Run both code scripts again (Code Script 1 first, and Code Script 2 within 10 seconds). This time, you will see that Code Script 2 completes immediately without being blocked, and when Code Script 1 finishes, it has spawned additional data in its result set. A phantom read in action.

If you want to hide from phantom reads completely, then you’ll need to use either the serializable or snapshot transaction isolation levels. Both of these have the same concurrency effects: No dirty reads, non-repeatable reads, or phantom reads. The difference is in how they are implemented: the serializable transaction isolation level will block all other transactions affecting this data, while the snapshot isolation level utilizes row versions to create connection-specific versions of the table for the transaction – all of these row versions will cause increased activity in the tempdb database. Let’s take a look at how the snapshot isolation level will eradicate the phantom reads. First off, we need to modify the database to accept this isolation level.

After reading Wayne’s post, if you want a more academic (i.e., less fun) read, you can also go back to the Microsoft Research isolation levels paper, which describes most of the isolation levels we have in place today for SQL Server.

Comments closed

Temporary Stored Procedures

Jana Sattainathan discusses temporary stored procedures:

The real benefit of these procedures is when they contain lot of logic that you need on a temporary basis but do not want to clutter the existing stored procedure list. You could even have multiple temporary procedures that call each other. I would not go overboard with this. It is just a convenience.

I don’t often see these in use; when I’ve seen them, they’re in environments in which normal stored procedure create rights are locked down and you want to do something as a one-off (like testing an operation against production data).  In other words, those sketchy things that we don’t admit to each other that we do…

1 Comment

BCP For UTF-8

Sanjay Mishra notes that SQL Server 2016 and 2014 SP2 support UTF-8 for BCP and BULK INSERT:

This requirement has been addressed in SQL Server 2016 (and backported to SQL Server 2014 SP2). To test this, I obtained a UTF-8 dataset from http://www.columbia.edu/~fdc/utf8/. The dataset is translation of the sentence “I can eat glass and it doesn’t hurt me” in several languages. A few lines of sample data are shown here:

(As an aside, it is entirely possible to load Unicode text such as above into SQL Server even without this improvement, as long as the source text file uses a Unicode encoding other than UTF-8.)

I ran into this problem before, where developers wanted to bulk load UTF-8 but had to settle for an inferior solution.

Comments closed

Page Free Space On Heaps

Raul Gonzalez shows a case in which writing to a heap can be more efficient than writing to a clustered index:

Lots have been written, including myself, about the advantage and disadvantages of the different types of tables that exist in SQL Server, Clustered and Heaps.

Today I will again, because at the end of the day, a table is the most basic structure in a database and we need them to store the data, which is indeed the purpose of having a database, right?

Interesting reading.  Check it out.

Comments closed

New Samsung Drives

Joe Chang looks at the Samsung 960 PRO SSD:

All the previous PCI-E x4 gen3 NVMe SSDs were rated between 2,000-2,500MB/s in large block read. The 960 Pro is rated for 3,500MB/s read. This is pretty much the maximum possible bandwidth for PCI-E x4 gen3. Each PCI-E gen3 lane is 8Gbit/s, but the realizable bandwidth is less. In earlier generation products, an upper bound of 800MB/s realizable per 8Gbit/s nominal signaling rate was typical.

Presumably there was a reason why every PCI-E x4 was in the 2000-2500MB/s bandwidth. It could be that these were 8-channel controllers and the NAND interface was 333-400MB/s. Even though 8 x 400MB/s = 3,200MB/s, it is expected that excess bandwidth is necessary on the downstream side. The could be other reasons as well, perhaps the DRAM for caching NAND meta-data. Intel had an 18-channel controller, which produced 2,400MB/s in the P750 line, and 2,800MB/s in the P3x00 line.

If you’re looking at a test lab server, this might be a good disk for you.

Comments closed

Online HDFS Disk Balancer

Lei Xu demonstrates the intra-DataNode disk balancer in HDFS:

By default, the DataNode uses the round-robin-based policy to write new blocks. However, in a long-running cluster, it is still possible for the DataNode to have created significantly imbalanced volumes due to events like massive file deletion in HDFS or the addition of new DataNode disks via the disk hot-swap feature. Even if you use the available-space-based volume-choosing policy instead, volume imbalance can still lead to less efficient disk I/O: For example, every new write will go to the newly-added empty disk while the other disks are idle during the period, creating a bottleneck on the new disk.

Recently, the Apache Hadoop community developed server offline scripts (as discussed inHDFS-1312, the dev@ mailing list, and GitHub) to alleviate the data imbalance issue. However, due to being outside the HDFS codebase, these scripts require that the DataNode be offline before moving data between disks. As a result, HDFS-1312 also introduces an online disk balancer that is designed to re-balance the volumes on a running DataNode based on various metrics. Similar to the HDFS Balancer, the HDFS disk balancer runs as a thread in the DataNode to move the block files across volumes with the same storage types.

This is a good read and sounds like a very useful feature.

Comments closed

Sparklyr On EMR

Tom Zeng shows how to use sparklyr on Amazon ElasticMapReduce:

The recently released sparklyr package by RStudio has made processing big data in R a lot easier. sparklyr is an R interface to Spark that allows users to use Spark as the backend for dplyr, one of the most popular data manipulation packages. sparklyr provides interfaces to Spark packages and also allows users to query data in Spark using SQL and develop extensions for the full Spark API.

You can also install sparklyr locally and point to a Spark cluster.

Comments closed

Resetting Kafka Topics

I show two methods to clear out a Kafka topic:

The first method works fine for non-production scenarios where you can stop all of the producers and consumers, but let’s say that you want to flush the topic while leaving your producers and consumers up (but maybe you have a downtime window where you know the producers aren’t pushing anything).  In this case, we can change the retention period to something very short, let the queue flush, and bring it back to normal, all using the kafka-configs shell script.

Points deducted for slipping and writing “queue” there, but otherwise, I prefer the second method, as things are still online.  In less-extreme scenarios, you might drop the retention period to a few minutes, especially if your consumers are all caught up.

Comments closed

R Services Resource Utilization

Ginger Grant shows off some R Services reports to see how hard the developers are battering your poor servers with their R scripts:

R Services – Extended Events is also not a report but a list of all the extended events that are available for R Services. This is a handy bit of information, which can be a great reference tool for extended events monitoring. R Services – Packages lists the packages which are currently installed on SQL Server. When people write R, many lot of different packages are used within the script. Prior to running a package, check the information on this report to ensure the libraries used are installed on SQL Server. If the library is missing the code will not work. R Services – Resource Usage is a great way to see how R has been configured to run on the server. Notice I have created an External Pool for R. This is a configuration recommended by Microsoft to better monitor your R Services.

Click through for more information, and grab the reports from Microsoft’s Github repo.

Comments closed