Other data sets of note include:
- A collection of 38M tweets related to the 2012 US election
- 3-D capture data from individuals performing a variety of hand gestures
- Infer.NET, a framework for running Bayesian inference in graphical models
- Images for 1 million celebrities, and associated tags
- MS MARCO, a new large-scale dataset for reading comprehension and question answering
Click through for more information, and then check out the data sets.
One of the areas I like to focus on is new features in SQL Server. Under both MVP and Microsoft Partner programs, I get to see a lot of builds of SQL Server that don’t make it to the public, and documentation for these builds is typically sparse. In order to get a head start on testing things out, I often need to explore on my own. And so I wrote some scripts for that, which I’ve talked about in previous blog posts:
- How I spot not-yet-documented features in SQL Server CTPs
- More ways to discover changes in new versions of SQL Server
When I install a new version of SQL Server (be it a cumulative update, the final service pack for a major version, or the first CTP of vNext), there are two steps:
- Create a linked server to the build that came before it
- Create local synonyms referencing the important catalog views in the linked server
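The two steps above can be sketched roughly like this. All names here are placeholders, not the script from the linked posts: [SQL2017_CU8] stands in for the linked server pointing at the previous build, and the synonym names are invented for illustration.

```sql
-- Hypothetical linked server pointing at the previous build's instance
EXEC sys.sp_addlinkedserver
    @server     = N'SQL2017_CU8',
    @srvproduct = N'',
    @provider   = N'SQLNCLI',
    @datasrc    = N'OLDERBOX\SQL2017';

-- Local synonyms referencing the older build's catalog views
CREATE SYNONYM dbo.old_all_objects
    FOR [SQL2017_CU8].master.sys.all_objects;

-- Anything in the new build's sys.all_objects that the old build
-- doesn't have is a candidate not-yet-documented feature:
SELECT n.[name], n.[type_desc]
FROM sys.all_objects AS n
WHERE NOT EXISTS
    (SELECT 1 FROM dbo.old_all_objects AS o WHERE o.[name] = n.[name]);
```

With the synonyms in place, the comparison queries stay short and don't need four-part names scattered everywhere.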
It’s a good way to get a glimpse at which features devs are currently working on but haven’t enabled yet.
So we branch the code in source control, and start writing a helper class to manage the data for us closer to the application. We throw in a SqlDataAdapter, use the Fill() method to bring back all the rows from the query in one go, and then use a caching layer to keep it in memory in case we need it again. SQL Server's part in this story has now faded into the background. This narrow table consumes a tiny 8 MB of RAM, and having two or more copies in memory isn't the end of the world for our testing. So far, so good again.
We run the new code, first stepping through to make sure that it still does what it used to, massaging here and there so that in the end, a grid is populated on the application with the results of the query. Success! We then compile it in Release mode, and run it without any breakpoints to do some performance testing.
And then we find that it runs at exactly the same speed to produce exactly the same report, using our caching and SqlDataAdapter, and we've wasted another hour of our time waiting for the grid and report. Where did we go wrong?
As people get better at tuning, we start to make assumptions based on prior experience. That, on net, is a good thing, but as Randolph shows, those assumptions can still be wrong.
I'm going to create a table and insert exactly 1 million rows. This particular table will have a clustered index and will contain 1 million GUIDs.

CREATE TABLE dbo.demoNOLOCK (someguid uniqueidentifier NOT NULL PRIMARY KEY);

INSERT dbo.demoNOLOCK (someguid)
SELECT TOP (1000000) NEWID()
FROM sys.all_columns t1, sys.all_columns t2;
Next I prove that there are a million rows.

SELECT COUNT(*) FROM dbo.demoNOLOCK;
Now without inserting or deleting any rows, I’m going to shuffle them.
UPDATE dbo.demoNOLOCK SET someguid = NEWID();
And if, while this is happening, I count the rows in a different session, I have to wait for that query to finish.
Read on to see what happens when someone gets the idea of running the select query with NOLOCK.
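A minimal sketch of the experiment, using the table defined above. The blocking behavior of the plain COUNT(*) is as described in the excerpt; the NOLOCK outcome is what the linked post demonstrates, so treat the comment as a summary of the demo rather than a guaranteed result on your hardware.

```sql
-- Session 1: shuffle the GUIDs (a long-running update holding locks)
UPDATE dbo.demoNOLOCK SET someguid = NEWID();

-- Session 2: a plain COUNT(*) blocks until session 1 finishes...
SELECT COUNT(*) FROM dbo.demoNOLOCK;

-- ...while NOLOCK reads right through the in-flight update. Because
-- the new random GUIDs cause rows to move (page splits) mid-scan,
-- the count can come back higher or lower than the actual 1,000,000.
SELECT COUNT(*) FROM dbo.demoNOLOCK WITH (NOLOCK);
```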
I've been using this at work recently, and it also relates to the presentation I gave at the ONSSUG June meeting on data compression. The beginnings of this script originated online as I dug into the DMVs related to objects and compression, and I then customized it for what I needed.
If you run the script below as-is, it will provide basic information about all objects in your database except those in the 'sys' schema, along with their current size and compression level.
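This is not the script from the post, but a minimal sketch of the same idea using the relevant catalog views: sys.partitions exposes the compression level per partition, and sys.allocation_units supplies the page counts for sizing.

```sql
-- Current size and compression level per object/index,
-- skipping the sys schema.
SELECT s.[name]                AS schema_name,
       o.[name]                AS object_name,
       i.[name]                AS index_name,
       p.data_compression_desc AS compression_level,
       SUM(au.total_pages) * 8 AS size_kb      -- 8 KB pages
FROM sys.objects AS o
JOIN sys.schemas AS s ON s.schema_id = o.schema_id
JOIN sys.indexes AS i ON i.object_id = o.object_id
JOIN sys.partitions AS p
     ON p.object_id = i.object_id AND p.index_id = i.index_id
JOIN sys.allocation_units AS au
     ON au.container_id = p.partition_id
WHERE s.[name] <> N'sys'
GROUP BY s.[name], o.[name], i.[name], p.data_compression_desc
ORDER BY size_kb DESC;
```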
Click through for the script or, if your version of dbatools is up to date, call
The solution is part of my calendar/date dimension code, and it is used to do relative positioning over date periods. For example, say you need to get data from the last 10 days. You can definitely use a simple BETWEEN to filter the rows, and a bunch of date functions to group by year, month, etc., generally all of the "normal" groupings. But using a calendar table allows you to prebuild a set of date calculations that make the standard values easier to get, and non-standard groupings possible. The technique I will cover makes moving around in the groupings easier. For example, say you want data from the last 3 complete months. The query to do this isn't rocket science, but it isn't exactly straightforward either.
For the example, I will use the calendar table that I have on my website here: http://drsql.org/code in the download SimpleDateDimensionCreateAndLoad, and will load it with data up until 2020. Here is that structure:
Read on for examples of usage. This is an example where thinking relationally differs from thinking procedurally—imagining date ranges as pre-calculated sets isn’t intuitive to procedural developers, but it can give a big performance boost.
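To make the "last 3 complete months" idea concrete, here is a sketch of the technique under assumed names, not the article's actual schema: suppose dbo.Calendar carries a DateValue column plus a precalculated RelativeMonth column (0 = current month, -1 = last month, and so on), and dbo.Sales is a hypothetical fact table.

```sql
-- Relative positioning via a precalculated calendar column:
-- no DATEADD/DATEDIFF arithmetic in the query itself.
SELECT s.SaleDate, s.Amount
FROM dbo.Sales AS s
JOIN dbo.Calendar AS c
     ON c.DateValue = s.SaleDate
WHERE c.RelativeMonth BETWEEN -3 AND -1;  -- the last 3 complete months
```

The point of precalculating the relative columns is that "last 3 complete months" becomes a simple range predicate instead of a nest of date functions.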
Besides needing to choose between various grouping and aggregation strategies (preordered Stream Aggregate, Sort + Stream Aggregate, Hash Aggregate), SQL Server also needs to choose whether to go with a serial or a parallel plan. In fact, it can choose between multiple different parallelism strategies. SQL Server uses costing logic that results in optimization thresholds that under different conditions make one strategy preferred to the others. We’ve already discussed in depth the costing logic that SQL Server uses in serial plans in the previous parts of the series. In this section I’ll introduce a number of parallelism strategies that SQL Server can use for handling grouping and aggregation. Initially, I won’t get into the details of the costing logic, rather just describe the available options. Later in the article I’ll explain how the costing formulas work, and an important factor in those formulas called DOP for costing.
As you will later learn, SQL Server takes into account the number of logical CPUs in the machine in its costing formulas for parallel plans. In my examples, unless I say otherwise, I assume the target system has 8 logical CPUs. If you want to try out the examples that I'll provide, in order to get the same plans and costing values as I do, you need to run the code on a machine with 8 logical CPUs as well. If your machine happens to have a different number of CPUs, you can emulate a machine with 8 CPUs—for costing purposes—like so:

DBCC OPTIMIZER_WHATIF(CPUs, 8);
Even though this tool is not officially documented and supported, it’s quite convenient for research and learning purposes.
This is a fairly long article, but a great one.
WSL is primarily aimed at developers, and it allows you to run Linux environments directly on Windows in a native format and without the overhead of a virtual machine. Let's take another look at that statement: run Linux environments directly on Windows in a native format. Yes, native format: WSL is not a UNIX-like environment such as Cygwin, which wraps non-Windows functionality in Win32 system calls; instead, it runs Linux programs as special, isolated minimal processes (pico processes) attached to kernel-mode pico providers. If you want to read all the "gory" details about WSL, the Windows Subsystem for Linux Overview gives you an excellent introduction.
Surprisingly, it’s pretty easy—I would have expected some strange compatibility issues.
This plot uses an explicitly defined vector field. A vector field refers to an assignment of a vector to each point in a subset of space.
In this plot, we visualize a collection of arrows that simply model the wind speed and direction at various levels of the atmosphere.
3-D weather plots can be useful to research scientists to gain a better understanding of the atmospheric profile, such as during the prediction of severe weather events like tornadoes and hurricanes.
Sometimes a 3D plot is the best answer. When it is, this looks like a good solution. H/T R-bloggers
I had a situation last week where I needed to find the MIN() and MAX() values of some data. Normally this would be pretty straightforward, but the problem was I needed to identify the MIN() and MAX() values amongst multiple columns. So today I bring you a SQL tip using APPLY.
Looking at sys.procedures, we can see we have a create_date and a modify_date column for each record. What if we needed to return the MIN() or MAX() value from those 2 (or more) columns?

SELECT [Name], create_date, modify_date
FROM sys.procedures AS p
WHERE p.[name] = 'ChangePassword';
In this two-column example, it’s not too difficult. As you add more and more columns, the solution remains the same, though the urge to ask why all of these dates are unpivoted might increase…
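The unpivot-via-APPLY pattern looks roughly like this: a VALUES table constructor inside CROSS APPLY turns the columns into rows, after which MIN() and MAX() work as usual. Adding more date columns just means adding rows to the VALUES list.

```sql
SELECT p.[name],
       MIN(d.dt) AS earliest_date,
       MAX(d.dt) AS latest_date
FROM sys.procedures AS p
CROSS APPLY (VALUES (p.create_date),
                    (p.modify_date)) AS d(dt)   -- one row per column
WHERE p.[name] = N'ChangePassword'
GROUP BY p.[name];
```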