Month: March 2018

DATETIME2 In SQL Server

Published 2018-03-22 by Kevin Feasel

Randolph West continues his SQL Server date & time data types series:

SQL Server 2008 introduced new data types to handle dates and times in a more intelligent way than the DATETIME and SMALLDATETIME types that we looked at previously.

This week, we look at the DATETIME2 data type. I’m not the first person to think that this was probably not the best name for a data type, but here we are, a decade later.

DATETIME2 is, at its heart, a combination of the DATE and TIME data types we covered in previous weeks. DATE is 3 bytes long and TIME is between 3 and 5 bytes long depending on precision. This of course means that DATETIME2 can be anything from 6 to 8 bytes in length.
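
The arithmetic in that paragraph can be sketched in a few lines of Python. This is only an illustration: the function names are made up for the example, and the precision bands (0 to 2, 3 to 4, and 5 to 7 fractional-second digits) are the ones the SQL Server documentation lists for TIME and DATETIME2.

```python
DATE_BYTES = 3  # DATE always takes 3 bytes


def time_bytes(precision: int) -> int:
    """Storage size of TIME(precision) in bytes (hypothetical helper)."""
    if not 0 <= precision <= 7:
        raise ValueError("fractional-second precision must be between 0 and 7")
    if precision <= 2:
        return 3
    if precision <= 4:
        return 4
    return 5


def datetime2_bytes(precision: int = 7) -> int:
    """Storage size of DATETIME2(precision): the DATE part plus the TIME part."""
    return DATE_BYTES + time_bytes(precision)


if __name__ == "__main__":
    for p in range(8):
        print(f"DATETIME2({p}): {datetime2_bytes(p)} bytes")
```

The default of 7 mirrors the precision SQL Server uses when DATETIME2 is declared without one.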

Nowadays, if you want to store a date plus time, this should be your default, not DATETIME.

Comments closed

Function Calls Missing From dm_exec_query_stats

Published 2018-03-22 by Kevin Feasel

Kendra Little blogs about a data collection oddity with functions in SQL Server:

Some of my functions in the demo code were showing up just fine. I was really puzzled by that. I thought …

Maybe this is a bug with ‘CREATE OR ALTER’? A sign of some weird memory pressure? Something introduced in SQL Server 2017? A buggy side effect of implicit conversions in some of the functions? A problem with the queries I was using? A weird setting on the database? (Also: about 100 other things that didn’t turn out to be the case.)

I finally wrote up some simple demo code, tested it against a SQL Server 2008 R2 instance (omitting the Query Store components), compared it with SQL Server 2017, and found it to be consistent.

Click through to see which types of functions show up and which ones stay hidden.


Understanding Recovery Intervals

Published 2018-03-22 by Kevin Feasel

Paul Randal explains what the recovery interval option on SQL Server really means:

One of the concepts I find people misunderstand frequently is the recovery interval, either for the server as a whole or the per-database setting that was introduced in SQL Server 2012 for indirect checkpoints.

There are two misconceptions here:

The recovery interval equals how often a checkpoint will occur
SQL Server guarantees the recovery interval (i.e. crash recovery for the database will only take the amount of time specified in the recovery interval)

It’s good to keep this in mind.


Tabs And Spaces In SSMS

Published 2018-03-22 by Kevin Feasel

Wayne Sheffield continues his SQL Server Management Studio Tips & Tricks series:

Regardless of whether you like to use tabs or spaces, this is where you go to configure your settings. The first part of the screen controls the indenting options. If “None” is selected, then the next line will start at the beginning of the line. If you have selected “Block”, then it will align the next line with the previous line. And if you are using “Smart”, then the appropriate language will determine which indenting style to use.

The next section controls the tab size / indent size. This controls how many characters a tab takes. It also controls whether tabs are converted to spaces or kept as tabs.

You can read more about these options at this link: Manage Code Formatting.

I turn on the View Whitespace option that Wayne mentions because I’m a formatting pedant that way.


Kafka Partitioning Strategies

Published 2018-03-21 by Kevin Feasel

Amy Boyle shares some thoughts on Kafka partitioning strategy:

If you have enough load that you need more than a single instance of your application, you need to partition your data. The producer clients decide which topic partition data ends up in, but it’s what the consumer applications will do with that data that drives the decision logic. If possible, the best partitioning strategy to use is random.

However, you may need to partition on an attribute of the data if:

The consumers of the topic need to aggregate by some attribute of the data.
The consumers need some sort of ordering guarantee.
Another resource is a bottleneck and you need to shard data.
You want to concentrate data for the efficiency of storage and/or indexing.
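
To make the difference concrete, here is a minimal Python sketch of random versus attribute-based partitioning. It does not use a Kafka client: the partition count, the sample events, and both helper functions are hypothetical, and `zlib.crc32` is only a stand-in for whatever hash a real producer applies to the key.

```python
import random
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count for the topic


def random_partition(num_partitions: int, rng: random.Random) -> int:
    """Random strategy: spreads load, gives no per-key guarantees."""
    return rng.randrange(num_partitions)


def keyed_partition(key: str, num_partitions: int) -> int:
    """Keyed strategy: the same key always maps to the same partition.

    crc32 is a deterministic stand-in hash, not Kafka's own.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical (key, value) events in the order they were produced.
events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
    ("user-2", "logout"),
]

by_partition = defaultdict(list)
for key, value in events:
    by_partition[keyed_partition(key, NUM_PARTITIONS)].append((key, value))

if __name__ == "__main__":
    for partition, rows in sorted(by_partition.items()):
        print(partition, rows)
```

Because every event for a given key lands in one partition, a single consumer sees that key’s events in order and can aggregate them locally, which is what the first two bullets require.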

Good advice.


Single-Node Hadoop 3 Installation

Published 2018-03-21 by Kevin Feasel

Mark Litwintschik has a fairly simple guide for installing Hadoop 3 on a single node for testing:

This post is meant to help people explore Hadoop 3 without feeling they need 50+ machines to do so. I’ll be using a fresh installation of Ubuntu 16.04.2 LTS on a single computer. The machine has an Intel Core i5-7300HQ CPU clocked at 2.50GHz, 8 GB of RAM and a 200 GB mechanical disk drive. I intentionally picked a low-end machine to demonstrate that not much is needed to try out Hadoop as a learning exercise.

Please be mindful that these instructions are aimed at building a test environment that is cut off from the outside world. Beyond the fact that this is a single-machine installation for software which is meant to run on multiple machines, there would need to be significant content changes to turn these instructions into production installation notes.

It’s a useful guide if you’re not interested in going with one of the platform vendors like Hortonworks or Cloudera.


Reproducibility And ML Projects

Published 2018-03-21 by Kevin Feasel

Pete Warden explains some of the difficulties around reproducing ML models:

Why does this all matter? I’ve had several friends contact me about their struggles reproducing published models as baselines for their own papers. If they can’t get the same accuracy that the original authors did, how can they tell if their new approach is an improvement? It’s also clearly concerning to rely on models in production systems if you don’t have a way of rebuilding them to cope with changed requirements or platforms. At that point your model moves from being a high-interest credit card of technical debt to something more like what a loan-shark offers. It’s also stifling for research experimentation; since making changes to code or training data can be hard to roll back it’s a lot more risky to try different variations, just like coding without source control raises the cost of experimenting with changes.

It’s not all doom and gloom; there are some notable efforts around reproducibility happening in the community. One of my favorites is the TensorFlow Benchmarks project Toby Boyd’s leading. He’s made it his team’s mission not only to lay out exactly how to train some of the leading models from scratch with high training speed on a lot of different platforms, but also to ensure that the models train to the expected accuracy. I’ve seen him sweat blood trying to get models up to that precision, since variations in any of the steps I listed above can affect the results and there’s no easy way to debug what the underlying cause is, even with help from the authors. It’s also a never-ending job, since changes in TensorFlow, in GPU drivers, or even datasets, can all hurt accuracy in subtle ways. By doing this work, Toby’s team helps us spot and fix bugs caused by changes in TensorFlow in the models they cover, and chase down issues caused by external dependencies, but it’s hard to scale beyond a comparatively small set of platforms and models.

I see two separate problems: reproducing the process and reproducing the result. Reproducing the process is why you want to use something like notebooks: it’s proof that you (and others!) can generate the same type of model the same way multiple times. Reproducing the result is harder given the stochastic nature of ML, but if you’re following the same process, you’re at least more likely to end up close to the same result.
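
A toy Python sketch of that distinction, under loud assumptions: the “model” is a single weight fitted by noisy gradient steps, and the function name, learning rate, and noise level are all invented for the illustration. It has nothing to do with TensorFlow itself.

```python
import random


def train_toy_model(seed: int, steps: int = 200) -> float:
    """Fit one weight to y = 3x from noisy samples (hypothetical toy process)."""
    rng = random.Random(seed)           # every source of randomness comes from the seed
    weight = rng.uniform(-1.0, 1.0)     # random initialization
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)      # random training sample
        y = 3.0 * x + rng.gauss(0.0, 0.1)  # noisy label
        weight -= 0.1 * (weight * x - y) * x  # one gradient step on squared error
    return weight


if __name__ == "__main__":
    print(train_toy_model(seed=42))  # same process, same seed
    print(train_toy_model(seed=42))  # identical to the line above
    print(train_toy_model(seed=7))   # same process, different seed: near 3, not identical
```

Pinning the seed reproduces the result exactly in this toy. Real training has more sources of variation (library versions, drivers, data), which is why the same process there tends to land close to the same result without matching it exactly.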


Speeding Up Power BI Aggregations With Primary Keys

Published 2018-03-21 by Kevin Feasel

Chris Webb has an interesting use case for adding primary keys on lookup tables:

As you can see, the Property Type column from the #"Price Paid" query contains single letter codes describing the type of property sold in each transaction; the Property Type column from #"Property Types" contains a distinct list of the same codes and acts as a dimension table. Again there’s nothing interesting going on in this query.

The problems start when you try to join data from these two queries using a Merge and then, for each row in #"Property Types", show the sum of the Price Paid column from #"Price Paid".
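
For readers who don’t work in M, the shape of that operation can be sketched in Python. The rows, values, and function names below are hypothetical, and nothing here claims to show how the Power Query engine evaluates a Merge. It only contrasts rescanning the fact rows for every dimension row with a single pass that relies on the dimension codes being unique.

```python
# Hypothetical stand-ins for the two queries.
price_paid = [
    {"Property Type": "D", "Price Paid": 250000},
    {"Property Type": "S", "Price Paid": 180000},
    {"Property Type": "D", "Price Paid": 320000},
    {"Property Type": "F", "Price Paid": 95000},
]
property_types = ["D", "S", "T", "F"]  # distinct codes, i.e. a dimension table


def sum_by_nested_scan(types, rows):
    """For each dimension row, scan every fact row and add up the matches."""
    return {
        code: sum(r["Price Paid"] for r in rows if r["Property Type"] == code)
        for code in types
    }


def sum_by_keyed_lookup(types, rows):
    """One pass over the fact rows; valid because each code appears once in types."""
    totals = {code: 0 for code in types}
    for r in rows:
        code = r["Property Type"]
        if code in totals:
            totals[code] += r["Price Paid"]
    return totals


if __name__ == "__main__":
    print(sum_by_nested_scan(property_types, price_paid))
    print(sum_by_keyed_lookup(property_types, price_paid))
```

Both functions return the same totals; the second does less work once you know the lookup column is a key.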

Although baseline performance is bad, Chris shows a way of improving that performance significantly.


XML In Scala

Published 2018-03-21 by Kevin Feasel

Mahesh Chand Kandpal shows how to create XPath statements in Scala:

We called the \() on the XML element and asked it to look for all symbol elements. It returns an instance of scala.xml.NodeSeq, which represents a collection of XML nodes.

The \() method looks only for the elements that are direct descendants of the target element (i.e. symbol). If we want to search through all the elements in the hierarchy starting from the target element, the \\() method is used.
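
The direct-child versus whole-hierarchy distinction is not specific to Scala. As a rough analogue only, here it is with Python’s standard-library ElementTree; the sample document and its ticker values are made up, and these calls are Python’s, not scala.xml’s.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two symbol elements directly under the root,
# and one nested a level deeper inside a group element.
doc = ET.fromstring("""
<symbols>
  <symbol ticker="AAPL"><units>200</units></symbol>
  <symbol ticker="IBM"><units>215</units></symbol>
  <group>
    <symbol ticker="MSFT"><units>190</units></symbol>
  </group>
</symbols>
""")

# Direct children only, roughly what \ does in Scala.
direct = doc.findall("symbol")

# Every matching descendant at any depth, roughly what \\ does in Scala.
anywhere = doc.findall(".//symbol")

if __name__ == "__main__":
    print([s.get("ticker") for s in direct])
    print([s.get("ticker") for s in anywhere])
```

The nested MSFT element only shows up in the second search.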

Check it out, especially if you’re working with Spark, as you never know when a rogue XML file will head your way.


Remote DAC And Vulnerability Assessments

Published 2018-03-21 by Kevin Feasel

Max Vernon points out a SQL Server Management Studio Vulnerability Assessment check which seems somewhat incomplete:

Certainly, you’d want to ensure the port for the DAC is not available to the Internet, but hopefully if you’re reading this blog you already know how silly it would be to open SQL Server to the Internet.

Assuming you don’t have the port open to the Internet, it’s very likely the DAC will not be of any use at all if you disable Remote Admin Connections as advised in the Vulnerability Assessment. My advice is to ignore this warning completely and configure the DAC to allow remote connections. Microsoft Technet has documentation about using the DAC, and says to configure it for remote connections by logging onto the server locally first, then configuring SQL Server to allow remote DAC connections, which seems a bit like putting the cart before the horse.

Read the whole thing. I agree with Max’s assessment that if there are some basic controls around your instance (like not letting SQL Server be Internet-accessible, putting SQL Server instances in a protected subnet, etc.), remote DAC is definitely useful enough to keep running.

