As people, we’re used to being able to classify anything we see in the world around us, and we naturally expect machines to have the same ability. Most models are only trained to recognize a very limited set of objects though, such as the 1,000 categories of the original ImageNet competition. Crucially, the training process makes the assumption that every example the model sees is one of those objects, and the prediction must be within that set. There’s no option for the model to say “I don’t know”, and there’s no training data to help it learn that response. This is a simplification that makes sense within a research setting, but causes problems when we try to use the resulting models in the real world.
Back when I was at Jetpac, we had a lot of trouble convincing people that the ground-breaking AlexNet model was a big leap forward because every time we handed over a demo phone running the network, they would point it at their faces and it would predict something like “Oxygen mask” or “Seat belt”. This was because the ImageNet competition categories didn’t include any labels for people, but most of the photos with mask and seatbelt labels included faces along with the objects. Another embarrassing mistake came when they would point it at a plate and it would predict “Toilet seat”! This was because there were no plates in the original categories, and the closest white circular object in appearance was a toilet.
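To make the closed-set problem concrete, here is a minimal Python sketch. The label list, the logits, and the 0.8 threshold are all hypothetical and do not come from AlexNet or the original article. It shows that a softmax classifier always returns one of its known labels, and how a simple confidence cutoff can stand in for "I don't know". A cutoff like this is only a crude heuristic, not a substitute for training data that covers unknown objects.

```python
import math

# Hypothetical label set standing in for a closed set of trained categories.
LABELS = ["oxygen mask", "seat belt", "toilet seat"]


def softmax(logits):
    """Turn raw scores into probabilities that always sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def closed_set_predict(logits):
    """Always returns one of LABELS, however unsure the model is."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


def predict_with_reject(logits, threshold=0.8):
    """Assumed heuristic: answer 'unknown' when the top probability is low."""
    label, prob = closed_set_predict(logits)
    if prob >= threshold:
        return label, prob
    return "unknown", prob
```

With nearly flat scores such as `[1.0, 1.1, 1.2]`, `closed_set_predict` still picks a label, while `predict_with_reject` falls back to `"unknown"`.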
Read the whole thing.
Data Lifecycle Manager (DLM) delivers on the promise of location-agnostic, secure replication by encapsulating and copying data seamlessly across physical private storage and public cloud environments. This empowers businesses to deliver the right data in the right environment to power the right use cases.
DLM v1.1 provides a complete solution to replicate data, metadata and security policies between on-premises and cloud environments. It also supports data movement for data-at-rest and data-in-motion – whether the data is encrypted using a single key or multiple keys on both source and target clusters. DLM supports HDFS and Apache Hive dataset replication.
With DLM, infrastructure administrators can manage their data, metadata and security on-prem and in-cloud using a single pane of glass that is built on open source technology. Business users can consume their workload outputs in the cloud with data-source-abstraction. DLM also enables businesses to reduce their capital expenditures and enjoy the benefits of flexibility and elasticity that cloud provides.
Click through for a demo. May HDFS replication have as long a life and slightly less vitriol than SQL Server replication.
Other data sets of note include:
- A collection of 38M tweets related to the 2012 US election
- 3-D capture data from individuals performing a variety of hand gestures
- Infer.NET, a framework for running Bayesian inference in graphical models
- Images for 1 million celebrities, and associated tags
- MS MARCO, a new large-scale dataset for reading comprehension and question answering
Click through for more information, and then check out the data sets.
One of the areas I like to focus on is new features in SQL Server. Under both MVP and Microsoft Partner programs, I get to see a lot of builds of SQL Server that don’t make it to the public, and documentation for these builds is typically sparse. In order to get a head start on testing things out, I often need to explore on my own. And so I wrote some scripts for that, which I’ve talked about in previous blog posts:
- How I spot not-yet-documented features in SQL Server CTPs
- More ways to discover changes in new versions of SQL Server
When I install a new version of SQL Server (be it a cumulative update, the final service pack for a major version, or the first CTP of vNext), there are two steps:
- Create a linked server to the build that came before it
- Create local synonyms referencing the important catalog views in the linked server
It’s a good way to get a glimpse at which features devs are currently working on but haven’t enabled yet.
So we branch the code in source control, and start writing a helper class to manage the data for us closer to the application. We throw in a SqlDataAdapter, use the Fill() method to bring back all the rows from the query in one go, and then use a caching layer to keep it in memory in case we need it again. SQL Server’s part in this story has now faded into the background. This narrow table consumes a tiny 8 MB of RAM, and having two or more copies in memory isn’t the end of the world for our testing. So far, so good again.
We run the new code, first stepping through to make sure that it still does what it used to, massaging here and there so that in the end, a grid is populated on the application with the results of the query. Success! We then compile it in Release mode, and run it without any breakpoints to do some performance testing.
And then we find that it runs at exactly the same speed to produce exactly the same report, using our caching and SqlDataAdapter, and we’ve wasted another hour of our time waiting for the grid and report. Where did we go wrong?
As we get better at tuning, we start to make assumptions based on prior experience. That, on net, is a good thing, but as Randolph shows, those assumptions can still be wrong.
I’m going to create a table and insert exactly 1 million rows. This particular table will be a clustered index, and will contain 1 million GUIDs.
CREATE TABLE dbo.demoNOLOCK (someguid uniqueidentifier NOT NULL PRIMARY KEY);
INSERT dbo.demoNOLOCK (someguid) SELECT TOP (1000000) NEWID() FROM sys.all_columns t1, sys.all_columns t2;
Next I prove that there are a million rows.
SELECT COUNT(*) FROM dbo.demoNOLOCK;
Now without inserting or deleting any rows, I’m going to shuffle them.
UPDATE dbo.demoNOLOCK SET someguid = NEWID();
And if while this is happening, I count the rows in a different session, I have to wait for that query to finish.
Read on to see what happens when someone gets the idea of running the select query with NOLOCK.
I’ve been using this at work recently and it also relates to the presentation I gave at the ONSSUG June meeting around data compression. The beginnings of this script originated online as I dug into learning about the DMVs that related to objects and compression and then customized for what I needed.
If you run the below as is, it will provide basic information about all objects in your database, except those in the ‘sys’ schema, along with their current size and compression level.
Click through for the script or, if your version of dbatools is up to date, call
The solution is part of my calendar/date dimension code, and it is used to do relative positioning over date periods. For example, say you need to get data from the last 10 days. You can definitely use a simple between to filter the rows, and a bunch of date functions to group by year, month, etc., generally all of the “normal” groupings. But using a calendar table allows you to prebuild a set of date calculations that make the standard values easier to get, and non-standard groupings possible. The technique I will cover makes moving around in the groupings more easily accessible, for example when you want data from the last 3 complete months. The query to do this isn’t rocket science, but it isn’t exactly straightforward either.
For the example, I will use the calendar table that I have on my website here: http://drsql.org/code in the download SimpleDateDimensionCreateAndLoad, and will load it with data up until 2020. Here is that structure:
Read on for examples of usage. This is an example where thinking relationally differs from thinking procedurally—imagining date ranges as pre-calculated sets isn’t intuitive to procedural developers, but it can give a big performance boost.
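As a rough illustration of relative positioning over precomputed periods, here is a small Python sketch. It is not the author's code: the column name `relative_month_count` and both helper functions are hypothetical, and the linked article builds the calendar in SQL Server rather than in Python. The idea is that once every date row carries a running month number, "the last 3 complete months" becomes a plain range filter.

```python
from datetime import date, timedelta


def build_calendar(start, end):
    """Build one row per day with a precomputed running month number."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date_value": day,
            # Hypothetical column: consecutive months differ by exactly 1,
            # even across a year boundary.
            "relative_month_count": day.year * 12 + (day.month - 1),
        })
        day += timedelta(days=1)
    return rows


def last_complete_months(calendar, today, months=3):
    """Return the dates in the `months` full months before today's month."""
    current = today.year * 12 + (today.month - 1)
    return [
        row["date_value"]
        for row in calendar
        if current - months <= row["relative_month_count"] <= current - 1
    ]
```

For example, with a calendar covering 2017 and `today = date(2017, 8, 15)`, `last_complete_months` returns every day from May 1 through July 31. Because the month number keeps counting across years, the same filter works unchanged in January or February.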
Besides needing to choose between various grouping and aggregation strategies (preordered Stream Aggregate, Sort + Stream Aggregate, Hash Aggregate), SQL Server also needs to choose whether to go with a serial or a parallel plan. In fact, it can choose between multiple different parallelism strategies. SQL Server uses costing logic that results in optimization thresholds that under different conditions make one strategy preferred to the others. We’ve already discussed in depth the costing logic that SQL Server uses in serial plans in the previous parts of the series. In this section I’ll introduce a number of parallelism strategies that SQL Server can use for handling grouping and aggregation. Initially, I won’t get into the details of the costing logic, rather just describe the available options. Later in the article I’ll explain how the costing formulas work, and an important factor in those formulas called DOP for costing.
As you will later learn, SQL Server takes into account the number of logical CPUs in the machine in its costing formulas for parallel plans. In my examples, unless I say otherwise, I assume the target system has 8 logical CPUs. If you want to try out the examples that I’ll provide and get the same plans and costing values as I do, you need to run the code on a machine with 8 logical CPUs as well. If your machine happens to have a different number of CPUs, you can emulate a machine with 8 CPUs, for costing purposes, like so:
DBCC OPTIMIZER_WHATIF(CPUs, 8);
Even though this tool is not officially documented and supported, it’s quite convenient for research and learning purposes.
This is a fairly long article, but a great one.