2018-07-23 – Curated SQL

Real-Time Data Visualization With R And SQL Server

In the previous post, I showed how to visualize near real-time data using Python and the Dash module. Now it is time to see one of the many ways to do it in R. This time, I will not use any additional frameworks for visualization, like shiny, plotly, or any others, but will simply use base R functions and the RODBC package to extract data from SQL Server.

Extracting data from SQL Server while simulating inserts into a SQL Server table will simulate the near real-time data. If you have followed the previous post, you will notice that I am using the same T-SQL table and query to extract real-time data.
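The insert-then-poll loop described above can be sketched in a few lines. This is not Tomaz's R code: it is a minimal Python illustration that assumes an in-memory SQLite database as a stand-in for SQL Server, and the table name `live_stats`, its columns, and the window size are all hypothetical.

```python
import random
import sqlite3


def insert_reading(conn):
    """Simulate one incoming row (stand-in for the insert workload)."""
    conn.execute(
        "INSERT INTO live_stats (num) VALUES (?)", (random.randint(0, 100),)
    )
    conn.commit()


def latest_readings(conn, window=20):
    """Return the most recent `window` rows, oldest first, ready to plot."""
    rows = conn.execute(
        "SELECT id, num FROM live_stats ORDER BY id DESC LIMIT ?", (window,)
    ).fetchall()
    return rows[::-1]


# Assumption: SQLite in memory replaces the SQL Server table from the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE live_stats (id INTEGER PRIMARY KEY AUTOINCREMENT, num INTEGER)"
)

for tick in range(30):
    insert_reading(conn)
    points = latest_readings(conn)
    # A real loop would redraw the chart from `points` here and
    # sleep briefly before polling again.
```

Each pass re-queries only the trailing window, so the plot scrolls as new rows arrive instead of growing without bound.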

Tomaz is using the base plot library, but if you want something nicer, there are several good alternatives.

Using word2vec To Model User Behavior

Published 2018-07-23 by Kevin Feasel

Nishan Subedi walks us through an Etsy project to model user journeys via semantic embedding techniques:

We initially started training the embeddings as a Skip-gram model with negative sampling (NEG as outlined in the original word2vec paper) method. The Skip-gram model performs better than the Continuous Bag Of Words (CBOW) model for larger vocabularies. It models the context given a target token and attempts to maximize the average likelihood of seeing any of the context tokens given a target token. The negative sampling draws a negative token from the entire corpus with a frequency that is directly proportional to the frequency of the token appearing in the corpus.
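To make the negative-sampling step concrete, here is a small sketch of drawing negatives in proportion to corpus frequency. It is an illustration only, not Etsy's implementation; the function name, the toy corpus, and the `power` argument are assumptions. The default `power=1.0` matches the "directly proportional" description above, while the original word2vec paper raises counts to the 3/4 power.

```python
import random
from collections import Counter


def build_sampler(corpus, power=1.0, seed=0):
    """Build a function that samples negative tokens by corpus frequency."""
    counts = Counter(corpus)
    tokens = list(counts)
    # power=1.0 gives sampling directly proportional to frequency.
    weights = [counts[t] ** power for t in tokens]
    rng = random.Random(seed)

    def sample(k, exclude=None):
        """Draw k negatives, skipping the target token itself."""
        negatives = []
        while len(negatives) < k:
            token = rng.choices(tokens, weights=weights, k=1)[0]
            if token != exclude:
                negatives.append(token)
        return negatives

    return sample


# Hypothetical toy corpus of listing tokens.
corpus = ["listing_a"] * 6 + ["listing_b"] * 3 + ["listing_c"]
sample = build_sampler(corpus)
negatives = sample(5, exclude="listing_a")
```

Frequent tokens show up as negatives more often, which is exactly why purely random negatives say little about what a particular user chose to skip.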

Training a Skip-gram model on only randomly selected negatives, however, ignores implicit contextual signals that we have found to be indicative of user preference in other contexts. For example, if a user clicks on the second item for a search query, the user most likely saw, but did not like, the first item that showed up in the search results. We extend the Skip-gram loss function by appending these implicit negative signals to the Skip-gram loss directly.

Similarly, we consider the purchased item in a particular session to be a global contextual token that applies to the entire sequence of user interactions. The intuition behind this is that there are many touch points on the user’s journey that help them come to the final purchase decision, and so we want to share the purchase intent across all the different actions that they took. This is also referred to as the linear multi-touch attribution model.
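The two extensions described above, implicit negatives from skipped search results and the purchased item as a global context token, can be sketched as simple data-preparation helpers. This is an assumed reading of the excerpt rather than the authors' code; the function names, the window size, and the item tokens are hypothetical, and the sketch only builds training pairs without touching the loss function.

```python
def implicit_negatives(results, clicked):
    """Items ranked above the clicked one: likely seen, but skipped."""
    position = results.index(clicked)
    return results[:position]


def training_pairs(session, purchased, window=1):
    """Build (target, context) pairs for one session.

    Besides the usual Skip-gram window, the purchased item is added as a
    context for every other target, so purchase intent is shared across
    the whole session. A pair may repeat when the purchase also falls
    inside the local window.
    """
    pairs = []
    for i, target in enumerate(session):
        lo, hi = max(0, i - window), min(len(session), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, session[j]))
        if target != purchased:
            pairs.append((target, purchased))
    return pairs


# Hypothetical search results and session.
skipped = implicit_negatives(["item_1", "item_2", "item_3"], clicked="item_2")
pairs = training_pairs(["a", "b", "c", "d"], purchased="d")
```

Here `skipped` holds the first result, which was shown above the click, and every token in the session ends up paired with the purchased item `"d"`.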

This is a very interesting article, particularly their attempt at getting around the problem of unexpected explosive growth in demand.

When Wait Stats Aren’t Enough

Published 2018-07-23 by Kevin Feasel

Joe Obbish has an example of diagnosing performance problems when wait stats don’t indicate any problems:

In summary, page allocations and page free events rapidly occur, sometimes in an alternating pattern. SQL Server will often free a number of pages just to immediately request allocations for a similar number of pages. If all of the free page events result in returned memory to the OS then the reason for the scalability bottleneck becomes clear. When running the full workaround with 96 concurrent sessions, a total of 341965 page freed operations were performed. Those events freed about 71.3 million pages in total. That amounts to about 584 GB of memory returned to the OS in total, based on the previous assumptions.
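The totals in that excerpt are easy to sanity-check. The snippet below assumes the standard 8 KB SQL Server page size and decimal gigabytes; the excerpt refers to "previous assumptions" without restating them, so treat these as my guesses that happen to reproduce the quoted figure.

```python
# Assumption: each page is a standard 8 KB SQL Server page.
PAGE_SIZE_BYTES = 8 * 1024

free_events = 341_965   # page freed operations, from the excerpt
pages_freed = 71.3e6    # total pages freed, from the excerpt

bytes_returned = pages_freed * PAGE_SIZE_BYTES
gb_returned = bytes_returned / 1e9            # decimal GB
pages_per_event = pages_freed / free_events   # average pages per free event

print(f"{gb_returned:.0f} GB returned to the OS")
```

Under those assumptions the total comes out at roughly 584 GB, with each free event releasing a little over 200 pages on average.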

This is a great investigation into the depths of debugging in SQL Server. Joe wasn’t able to get a definitive solution to his problem, but he showed us a lot along the way.

Recommendations For Storage On Azure SQL DB Managed Instances

Published 2018-07-23 by Kevin Feasel

Dimitri Furman has some thoughts on database storage architecture for Azure SQL Database Managed Instances:

MI GP uses Azure Premium Storage to store database files for all databases, except for the tempdb database. From the perspective of the database engine, this storage type is remote, i.e. it is accessed over the network, using Azure network infrastructure. To use Azure Premium Storage, MI GP takes advantage of SQL Server native capability to use database files directly in Azure Blob Storage. This means that there is not a disk or a network share that hosts database files; instead, file path is an HTTPS URL, and each database file is a page blob in Azure Blob Storage.

Since Azure Premium Storage is used, its performance characteristics, limits, and scalability goals fully apply to MI GP. The High-performance Premium Storage and managed disks for VMs documentation article includes a section describing Premium Storage disk limits. While the topic is written in the context of VMs and Azure disks, which is the most common usage scenario for Azure Premium Storage, the documented limits are also applicable to blobs. As shown in the limits table in the documentation, the size of the blob determines the maximum IOPS and throughput that can be achieved against the blob. For MI GP, this means that the size of a database file determines the maximum IOPS and throughput that is achievable against the file.

The disk/blob size shown in the limits table is the maximum size for which the corresponding limit applies. For example, a blob that is > 64 GB and <= 128 GB (equivalent to a P10 disk) can achieve up to 500 IOPS and up to 100 MB/second throughput.
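The size-to-limit rule above amounts to a range lookup. The sketch below encodes only the single row quoted in the excerpt (the P10-equivalent range); the type and function names are hypothetical, and the remaining rows would need to be filled in from the Premium Storage limits table in the documentation.

```python
from typing import NamedTuple, Optional


class BlobLimit(NamedTuple):
    tier: str
    min_gb_exclusive: float
    max_gb_inclusive: float
    max_iops: int
    max_mb_per_sec: int


# Only the row quoted in the excerpt; other tiers are intentionally omitted.
LIMITS = [BlobLimit("P10", 64, 128, 500, 100)]


def limit_for_file(size_gb: float, limits=LIMITS) -> Optional[BlobLimit]:
    """Return the limit row whose size range contains the file, if known."""
    for limit in limits:
        if limit.min_gb_exclusive < size_gb <= limit.max_gb_inclusive:
            return limit
    return None  # size falls outside the rows listed here


hundred_gb = limit_for_file(100)
```

A 100 GB data file lands in the P10-equivalent range, so it is capped at 500 IOPS and 100 MB/second regardless of how the instance itself is sized.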

Read the whole thing if you’re looking at Managed Instances, but there are some tips for SQL Server in Azure IaaS as well.

Data Lakes eBook

Published 2018-07-23 by Kevin Feasel

Melissa Coates has a free eBook available:

I wrote the updated content from a practical point of view, totally hype-free. The table of contents:

Modern Data Architecture

Business Needs Driving Data Architectures to Evolve and Adapt

Principles of a Modern Data Architecture

Data Lake + Data Warehouse: Complementary Solutions

Tips for Designing a Data Lake

Azure Technologies for Implementing a Data Lake

Considerations for a Successful Data Lake in the Cloud

Getting Started with a Data Lake

To download the ebook, BlueGranite will ask for you to register your information. That’s common for premium content like this. We take a low-key approach to sales, so I can assure you that registration only means you’ll receive notifications of new content that you may find interesting.

It’s the length of a good-sized paper, so you won’t have to invest dozens of hours of time to get the story.

Don’t Set Max Size For Containers In In-Memory OLTP

Published 2018-07-23 by Kevin Feasel

Ned Otter recommends you not mess with the maximum container size when creating a memory-optimized filegroup:

I recently saw a thread on twitter, where the OP talked about setting the max size for an In-Memory OLTP container. I responded as I always do: it’s not possible to set a limit on anything having to do with storage for In-Memory OLTP.

Unfortunately, that’s not correct: through SSMS or TSQL, you can in fact set a max size for a container.

But you should not ever do that…..

Why?

Because if you do, and your checkpoint files exceed the max size of the container, your database can go into the In Recovery, Suspect, or OFFLINE state.

Read on for a repro that you should not try in production. Or anywhere, really.

Diagnosing Database Corruption

Published 2018-07-23 by Kevin Feasel

Jeff Mlakar has started a series on database corruption. His first post involves finding corruption:

The cause lies in layers below SQL Server. The most common are hardware faults; in particular, issues with the I/O Subsystem. Any component in the I/O Subsystem can fail and be the cause of database corruption: disks, controllers, CPU, memory, network switch, network cables, SAN, etc.

Database corruption cannot entirely be prevented. It is not a matter of if but rather when.

Disks go bad. So do NICs, cables, routers, and everything else physical below the SQL Server Instance. This is why it is important to know that we cannot entirely prevent corruption – only deal with and mitigate it.

Click through for a few ways to find potential corruption.

Power BI Log Files

Published 2018-07-23 by Kevin Feasel

Kellyn Pot’vin-Gorman shows us how to access Power BI log and trace files:

We’ve now identified the process, the amount of memory allocated to perform a task captured in the log, start time and the duration. The information in these log files can assist when diagnosing if Power BI desktop crashes, but the data collected is quite rudimentary.

If you shut down Power BI Desktop, the PBIDesktop* log file writes to the startup file, which was once empty and it then empties and saves off the timestamp of the exit of the program.

The Microsoft Mashup file has much of the same information, but includes deeper level processing work by Power BI, such as work done in the Query Editor or when we create a measure or new column/table.

In the three examples from the file below, you can see a compile, a save and then an evaluate task.

There’s some useful information here for debugging.

Day: July 23, 2018

Real-Time Data Visualization With R And SQL Server

Using word2vec To Model User Behavior

When Wait Stats Aren’t Enough

Recommendations For Storage On Azure SQL DB Managed Instances

Data Lakes eBook

Don’t Set Max Size For Containers In In-Memory OLTP

Diagnosing Database Corruption

Power BI Log Files