Regularization Prevents Overfitting

Hui Li has an explanation of what regularization is and how it works to reduce the likelihood of overfitting training data:

Assume that the red line is the regression model we learn from the training data set. It can be seen that the learned model fits the training data set perfectly, while it cannot generalize well to the data not included in the training set. There are several ways to avoid the problem of overfitting.

To remedy this problem, we could:

  • Get more training examples.
  • Use a simple predictor.
  • Select a subsample of features.

In this blog post, we focus on the second and third ways to avoid overfitting by introducing regularization on the parameters βi of the model.

Read the whole thing.

Tidygraph

Thomas Lin Pedersen announces tidygraph, a tidyverse library for dealing with graphs and trees in R:

One of the simplest concepts when computing graph based values is that of
centrality, i.e. how central is a node or edge in the graph. As this
definition is inherently vague, a lot of different centrality scores exists that
all treat the concept of central a bit different. One of the famous ones is
the pagerank algorithm that was powering Google Search in the beginning.
tidygraph currently has 11 different centrality measures and all of these are
prefixed with centrality_* for easy discoverability. All of them returns a
numeric vector matching the nodes (or edges in the case of
centrality_edge_betweenness()).

This is a big project and is definitely interesting if you’re looking at analyzing graph data.

R’s iGraph + SQL Server Graphs

Dennes Torres has a post which shows how to use R’s iGraph library to visualize graphs created in SQL Server 2017:

The possibility to use both technologies together is very interesting. Using graph objects we can store relationships between elements, for example, relationships between forum members. Using R scripts we can build a cluster graph from the stored graph information, illustrating the relationships in the graph.

The script below creates a database for our example with a subset of the objects used in my article and a few more relationship records between the forum members.

Click through for the script.

Drilldown Choropleth Power BI Custom Visual

Devin Knight has a new entry in his Power BI custom visuals series:

In this module you will learn how to use the Drilldown Choropleth Custom Visual.  The Drilldown Choropleth is a map visual that displays divided regions that are highlighted indicating the relative value in each location.

Click through for Devin’s video and example.

Dynamic Unpivoting For Change Detection

Shane O’Neill has a script that dynamically unpivots a pair of rows and compares values column by column, storing the changes in XML:

Overall, the script is longer at nearly double the lines but where it shines is when adding new columns.
To include new columns, just add them to the table; to exclude them, just add in a filter clause.

So, potentially, if every column in this table is to be tracked and we add columns all the way up to 1,024 columns, this code will not increase.
Old way: at least 6,144.
New way: at least 2,048.
Dynamic: no change

Read on for that script.  Even though his developer ended up not using his solution, Shane has made it available for the rest of the world so that some day, someone else can have the maintenance nightmare of trying to root out a bug in the process.

System Health XE

Kenneth Fisher describes what is in the system_health Extended Events session:

Per BOL you get the following information:

  • Errors with a severity of >= 20.

  • Memory related errors (Errors 17803, 701, 802, 8645, 8651, 8657 and 8902).

  • Non-yielding scheduler problems (Error 17883).

  • Deadlocks.

  • Sessions that have waited on locks for > 30 seconds.

  • Sessions waiting for a long time on preemptive waits (waits on external API calls).

Read on to learn more of the things this session contains as well as a couple ways you can access the data.

When You Need To Read Memory-Optimized Data From Disk

Ned Otter enumerates the scenarios in which SQL Server needs to read data from disk for memory-optimized tables:

Those who have studied In-Memory OLTP are aware that in the event of “database restart”, durable memory-optimized data must be streamed from disk to memory. But that’s not the only time data must be streamed, and the complete set of events that cause this is not intuitive. To be clear, if your database had to stream databack to memory, that means all your memory-optimized data was cleared from memory. The amount of time it takes to do this depends on:

  • the amount of data that must be streamed

  • the number of indexes that must be rebuilt

  • the number of containers in the memory-optimized database, and how many volumes they’re spread across

  • how many indexes must be recreated (SQL 2017 has a much faster index rebuild process, see below)

  • the number of LOB columns

  • BUCKET count being properly configured for HASH indexes

Read on for the list of scenarios that might cause a standalone SQL Server instance to need to stream data from disk into memory to re-hydrate memory-optimized tables and indexes.

TempDB System Table Contention

Alexander Arvidsson diagnoses an interesting problem:

I ran this several times to see if there was a pattern to the madness, and it turned out it was. All waits were concentrated in database ID 2 – TEMPDB. Many people perk up by now and jump to the conclusion that this is your garden variety SGAM/PFS contention – easily remedied with more TEMPDB files and a trace flag. But, alas- this was further inside the TEMPDB. The output from the query above gave me the exact page number, and plugging that into DBCC PAGE gives the metadata object ID.

His conclusion is to reduce temp table usage and/or use memory-optimized tables.  We solved this problem with replacing temp tables with memory-optimized TVPs in our most frequently-used procedures.

Avoid Ticks

Michael J. Swart shows you how to convert DATETIME2 values to Ticks:

A .Net tick is a duration of time lasting 0.1 microseconds. When you look at the Tick property of DateTime, you’ll see that it represents the number of ticks since January 1st 0001.
But why 0.1 microseconds? According to stackoverflow user CodesInChaos “ticks are simply the smallest power-of-ten that doesn’t cause an Int64 to overflow when representing the year 9999”.

Even though it’s an interesting idea, just use one of the datetime data types, that’s what they’re there for. I avoid ticks whenever I can.

I agree with Michael:  avoid using Ticks if you can.

Categories

July 2017
MTWTFSS
« Jun  
 12
3456789
10111213141516
17181920212223
24252627282930
31