Press "Enter" to skip to content

Author: Kevin Feasel

BCP For UTF-8

Sanjay Mishra notes that SQL Server 2016 and 2014 SP2 support UTF-8 for BCP and BULK INSERT:

This requirement has been addressed in SQL Server 2016 (and backported to SQL Server 2014 SP2). To test this, I obtained a UTF-8 dataset from http://www.columbia.edu/~fdc/utf8/. The dataset is translation of the sentence “I can eat glass and it doesn’t hurt me” in several languages. A few lines of sample data are shown here:

(As an aside, it is entirely possible to load Unicode text such as above into SQL Server even without this improvement, as long as the source text file uses a Unicode encoding other than UTF-8.)

I ran into this problem before, where developers wanted to bulk load UTF-8 but had to settle for an inferior solution.

Comments closed

Page Free Space On Heaps

Raul Gonzalez shows a case in which writing to a heap can be more efficient than writing to a clustered index:

Lots have been written, including myself, about the advantage and disadvantages of the different types of tables that exist in SQL Server, Clustered and Heaps.

Today I will again, because at the end of the day, a table is the most basic structure in a database and we need them to store the data, which is indeed the purpose of having a database, right?

Interesting reading.  Check it out.

Comments closed

New Samsung Drives

Joe Chang looks at the Samsung 960 PRO SSD:

All the previous PCI-E x4 gen3 NVMe SSDs were rated between 2,000-2,500MB/s in large block read. The 960 Pro is rated for 3,500MB/s read. This is pretty much the maximum possible bandwidth for PCI-E x4 gen3. Each PCI-E gen3 lane is 8Gbit/s, but the realizable bandwidth is less. In earlier generation products, an upper bound of 800MB/s realizable per 8Gbit/s nominal signaling rate was typical.

Presumably there was a reason why every PCI-E x4 was in the 2000-2500MB/s bandwidth. It could be that these were 8-channel controllers and the NAND interface was 333-400MB/s. Even though 8 x 400MB/s = 3,200MB/s, it is expected that excess bandwidth is necessary on the downstream side. The could be other reasons as well, perhaps the DRAM for caching NAND meta-data. Intel had an 18-channel controller, which produced 2,400MB/s in the P750 line, and 2,800MB/s in the P3x00 line.

If you’re looking at a test lab server, this might be a good disk for you.

Comments closed

Online HDFS Disk Balancer

Lei Xu demonstrates the intra-DataNode disk balancer in HDFS:

By default, the DataNode uses the round-robin-based policy to write new blocks. However, in a long-running cluster, it is still possible for the DataNode to have created significantly imbalanced volumes due to events like massive file deletion in HDFS or the addition of new DataNode disks via the disk hot-swap feature. Even if you use the available-space-based volume-choosing policy instead, volume imbalance can still lead to less efficient disk I/O: For example, every new write will go to the newly-added empty disk while the other disks are idle during the period, creating a bottleneck on the new disk.

Recently, the Apache Hadoop community developed server offline scripts (as discussed inHDFS-1312, the dev@ mailing list, and GitHub) to alleviate the data imbalance issue. However, due to being outside the HDFS codebase, these scripts require that the DataNode be offline before moving data between disks. As a result, HDFS-1312 also introduces an online disk balancer that is designed to re-balance the volumes on a running DataNode based on various metrics. Similar to the HDFS Balancer, the HDFS disk balancer runs as a thread in the DataNode to move the block files across volumes with the same storage types.

This is a good read and sounds like a very useful feature.

Comments closed

Sparklyr On EMR

Tom Zeng shows how to use sparklyr on Amazon ElasticMapReduce:

The recently released sparklyr package by RStudio has made processing big data in R a lot easier. sparklyr is an R interface to Spark that allows users to use Spark as the backend for dplyr, one of the most popular data manipulation packages. sparklyr provides interfaces to Spark packages and also allows users to query data in Spark using SQL and develop extensions for the full Spark API.

You can also install sparklyr locally and point to a Spark cluster.

Comments closed

Resetting Kafka Topics

I show two methods to clear out a Kafka topic:

The first method works fine for non-production scenarios where you can stop all of the producers and consumers, but let’s say that you want to flush the topic while leaving your producers and consumers up (but maybe you have a downtime window where you know the producers aren’t pushing anything).  In this case, we can change the retention period to something very short, let the queue flush, and bring it back to normal, all using the kafka-configs shell script.

Points deducted for slipping and writing “queue” there, but otherwise, I prefer the second method, as things are still online.  In less-extreme scenarios, you might drop the retention period to a few minutes, especially if your consumers are all caught up.

Comments closed

R Services Resource Utilization

Ginger Grant shows off some R Services reports to see how hard the developers are battering your poor servers with their R scripts:

R Services – Extended Events is also not a report but a list of all the extended events that are available for R Services. This is a handy bit of information, which can be a great reference tool for extended events monitoring. R Services – Packages lists the packages which are currently installed on SQL Server. When people write R, many lot of different packages are used within the script. Prior to running a package, check the information on this report to ensure the libraries used are installed on SQL Server. If the library is missing the code will not work. R Services – Resource Usage is a great way to see how R has been configured to run on the server. Notice I have created an External Pool for R. This is a configuration recommended by Microsoft to better monitor your R Services.

Click through for more information, and grab the reports from Microsoft’s Github repo.

Comments closed

Stretch Database Authentication Failures

Jack Li walks through a bug in Stretch database:

The message provided enough directions.  It says either you have a bad login or firewall setting on the Azure DB Server side is not configured correctly.     The very first thing is to ensure the Firewall was configured correctly.   We even tried 0.0.0.0. to 255.255.255.255. But it didn’t resolve the issue.

Next we created a brand new database on the same server and tried on that one.  It worked.  But customer just couldn’t get the old database to work even she made sure that she could use the login/password to log in using SSM on the same server to the Azure DB server.

On the same server, brand new database worked but the old database didn’t.   So that made me wonder what happens if I manually cause an failure and later retry.

Read on for the repo and solution.

Comments closed

Parallel PoshRSJob Template

Cody Konior walks through using PoshRSJob with a custom function:

Recently I migrated from my own runspace module to Boe Prox’s PoshRSJob which is pretty much perfect. But today I wanted to share how to integrate PoshRSJob cleanly into your functions through a default -Parallel parameter and using a template.

You can very easily modify this for your own purposes however it’s even more awesome as-is if you run parallelised tests for one major input (like a computer name) but where additional information might also be passed in through object properties on a pipeline (I’ll explain why you’d want to do that later in the post). Here’s what it looks like:

Read on for code and explanation.  Powershell parallelism is something that I’ve never been good at, so hopefully this makes it easier for me…

Comments closed

Linear Models

Andrea Spano, et al, are starting a new book:

This chapter is an introduction to the first section of the book, Linear Models, and contain some theoretical explanation and lots of examples. At the end of the chapter you will find two summary tables with Linear model formulae and functions in R and Common R functions for inference.

The book is just getting started, but you can get it from the Quantide website.  In the meantime, there are two other books on learning R and developing in R.  These books are licensed Creative Commons, so they’re free to read and share.

Comments closed