Press "Enter" to skip to content

Month: July 2018

Using word2vec To Model User Behavior

Nishan Subedi walks us through an Etsy project to model user journeys via semantic embedding techniques:

We initially started training the embeddings as a Skip-gram model with negative sampling (NEG as outlined in the original word2vec paper) method. The Skip-gram model performs better than the Continuous Bag Of Words (CBOW) model for larger vocabularies. It models the context given a target token and attempts to maximize the average likelihood of seeing any of the context tokens given a target token. The negative sampling draws a negative token from the entire corpus with a frequency that is directly proportional to the frequency of the token appearing in the corpus.

Training a Skip-gram model on only randomly selected negatives, however, ignores implicit contextual signals that we have found to be indicative of user preference in other contexts. For example, if a user clicks on the second item for a search query, the user most likely saw, but did not like, the first item that showed up in the search results. We extend the Skip-gram loss function by appending these implicit negative signals to the Skip-gram loss directly.

Similarly, we consider the purchased item in a particular session to be a global contextual token that applies to the entire sequence of user interactions. The intuition behind this is that there are many touch points on the user’s journey that help them come to the final purchase decision, and so we want to share the purchase intent across all the different actions that they took. This is also referred to as the linear multi-touch attribution model.

This is a very interesting article, particularly their attempt at getting around the problem of unexpected, explosive growth in demand.
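
To make the extended Skip-gram loss from the excerpt a bit more concrete, here is a minimal numpy sketch of negative sampling with implicit negatives (skipped impressions) appended to the randomly sampled ones. This is not Etsy's code; the embedding tables, the sampling, and the session handling are all simplified assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, dim = 1000, 32
in_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))   # "target" embeddings
out_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))  # "context" embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(target, context, sampled_negs, implicit_negs):
    """Skip-gram NEG loss for one (target, context) pair, extended with
    implicit negatives such as items the user saw but skipped."""
    v_t = in_vecs[target]
    pos = np.log(sigmoid(out_vecs[context] @ v_t))
    negs = np.concatenate([sampled_negs, implicit_negs])
    neg = np.log(sigmoid(-(out_vecs[negs] @ v_t))).sum()
    return -(pos + neg)   # minimize the negative log-likelihood

# Toy usage: token 42 with context token 7, three sampled negatives,
# plus one implicit negative (an item shown above the clicked one).
loss = neg_sampling_loss(42, 7,
                         sampled_negs=np.array([3, 15, 99]),
                         implicit_negs=np.array([8]))
print(loss)
```

The purchased-item idea from the excerpt would then amount to adding the purchase token as an extra positive context for every target token in the session.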

Comments closed

When Wait Stats Aren’t Enough

Joe Obbish has an example of diagnosing performance problems when wait stats don’t indicate any problems:

In summary, page allocations and page free events rapidly occur, sometimes in an alternating pattern. SQL Server will often free a number of pages just to immediately request allocations for a similar number of pages. If all of the free page events result in returned memory to the OS then the reason for the scalability bottleneck becomes clear. When running the full workload with 96 concurrent sessions, a total of 341,965 page free operations were performed. Those events freed about 71.3 million pages in total. That amounts to about 584 GB of memory returned to the OS in total, based on the previous assumptions.
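
The 584 GB figure falls straight out of SQL Server's 8 KB page size; a quick sanity check:

```python
pages_freed = 71_300_000        # ~71.3 million pages freed, from the quoted test
bytes_per_page = 8 * 1024       # SQL Server database pages are 8 KB

total_bytes = pages_freed * bytes_per_page
print(round(total_bytes / 1e9))      # ~584 (decimal GB, matching the quoted figure)
print(round(total_bytes / 1024**3))  # ~544 if you prefer binary GiB
```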

This is a great investigation into the depths of debugging in SQL Server.  Joe wasn’t able to get a definitive solution to his problem, but he showed us a lot along the way.

Comments closed

Recommendations For Storage On Azure SQL DB Managed Instances

Dimitri Furman has some thoughts on database storage architecture for Azure SQL Database Managed Instances:

MI GP uses Azure Premium Storage to store database files for all databases, except for the tempdb database. From the perspective of the database engine, this storage type is remote, i.e. it is accessed over the network, using Azure network infrastructure. To use Azure Premium Storage, MI GP takes advantage of SQL Server native capability to use database files directly in Azure Blob Storage. This means that there is not a disk or a network share that hosts database files; instead, file path is an HTTPS URL, and each database file is a page blob in Azure Blob Storage.

Since Azure Premium Storage is used, its performance characteristics, limits, and scalability goals fully apply to MI GP. The High-performance Premium Storage and managed disks for VMs documentation article includes a section describing Premium Storage disk limits. While the topic is written in the context of VMs and Azure disks, which is the most common usage scenario for Azure Premium Storage, the documented limits are also applicable to blobs. As shown in the limits table in the documentation, the size of the blob determines the maximum IOPS and throughput that can be achieved against the blob. For MI GP, this means that the size of a database file determines the maximum IOPS and throughput that is achievable against the file.

The disk/blob size shown in the limits table is the maximum size for which the corresponding limit applies. For example, a blob that is > 64 GB and <= 128 GB (equivalent to a P10 disk) can achieve up to 500 IOPS and up to 100 MB/second throughput.
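
A hedged illustration of what that lookup means for a Managed Instance database file: only the P10 row appears in the quote above, and the P20/P30 values are approximate 2018-era documentation figures, so confirm current limits against the linked article before relying on them.

```python
# (tier, max blob size in GB, max IOPS, max MB/s) -- approximate 2018-era limits
premium_tiers = [
    ("P10", 128, 500, 100),
    ("P20", 512, 2300, 150),
    ("P30", 1024, 5000, 200),
]

def file_limits(file_size_gb):
    """Return the smallest tier whose size covers this database file,
    which is the IOPS/throughput cap the file can achieve."""
    for tier, max_gb, iops, mbps in premium_tiers:
        if file_size_gb <= max_gb:
            return tier, iops, mbps
    raise ValueError("larger than the tiers listed in this sketch")

# A 100 GB data file on MI General Purpose maps to a P10-sized blob:
print(file_limits(100))   # ('P10', 500, 100)
```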

Read the whole thing if you’re looking at Managed Instances; there are also some tips for SQL Server in Azure IaaS.

Comments closed

Data Lakes eBook

Melissa Coates has a free eBook available:

I wrote the updated content from a practical point of view, totally hype-free. The table of contents:

  • Modern Data Architecture
  • Business Needs Driving Data Architectures to Evolve and Adapt
  • Principles of a Modern Data Architecture
  • Data Lake + Data Warehouse: Complementary Solutions
  • Tips for Designing a Data Lake
  • Azure Technologies for Implementing a Data Lake
  • Considerations for a Successful Data Lake in the Cloud
  • Getting Started with a Data Lake

To download the ebook, BlueGranite will ask you to register your information. That’s common for premium content like this. We take a low-key approach to sales, so I can assure you that registration only means you’ll receive notifications of new content that you may find interesting.

It’s the length of a good-sized paper, so you won’t have to invest dozens of hours of time to get the story.

Comments closed

Don’t Set Max Size For Containers In In-Memory OLTP

Ned Otter recommends you not mess with the maximum container size when creating a memory-optimized filegroup:

I recently saw a thread on twitter, where the OP talked about setting the max size for an In-Memory OLTP container. I responded as I always do: it’s not possible to set a limit on anything having to do with storage for In-Memory OLTP.

Unfortunately, that’s not correct: through SSMS or TSQL, you can in fact set a max size for a container.

But you should not ever do that…

Why?

Because if you do, and your checkpoint files exceed the max size of the container, your database can go into the In Recovery, Suspect, or OFFLINE state.

Read on for a repro that you should not try in production.  Or anywhere, really.
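
If you want to confirm that none of your existing containers already has a cap, memory-optimized filegroup containers show up as FILESTREAM entries in sys.database_files. Here is a small pyodbc sketch; the connection string and database name are placeholders.

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=YourInMemoryDb;Trusted_Connection=yes;"
)

# Containers for the memory-optimized filegroup appear as FILESTREAM files;
# max_size = -1 means unlimited, which is what you want to see.
rows = conn.execute("""
    SELECT name, physical_name, max_size
    FROM sys.database_files
    WHERE type_desc = 'FILESTREAM';
""").fetchall()

for name, physical_name, max_size in rows:
    status = "unlimited" if max_size == -1 else f"capped (max_size = {max_size})"
    print(name, physical_name, status)
```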

Comments closed

Diagnosing Database Corruption

Jeff Mlakar has started a series on database corruption.  His first post involves finding corruption:

The cause lies in layers below SQL Server. The most common are hardware faults; in particular, issues with the I/O Subsystem. Any component in the I/O Subsystem can fail and be the cause of database corruption: disks, controllers, CPU, memory, network switch, network cables, SAN, etc.

Database corruption cannot entirely be prevented. It is not a matter of if but rather when.

Disks go bad. So do NICs, cables, routers, and everything else physical below the SQL Server Instance. This is why it is important to know that we cannot entirely prevent corruption – only deal with and mitigate it.

Click through for a few ways to find potential corruption.
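
As a quick first pass, SQL Server records pages it has already flagged as bad in msdb.dbo.suspect_pages. Here is a minimal pyodbc sketch (the connection string is a placeholder); a full DBCC CHECKDB remains the authoritative check.

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=msdb;Trusted_Connection=yes;"
)

# Pages flagged after checksum failures, torn pages, or other 823/824 errors.
rows = conn.execute("""
    SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
    FROM msdb.dbo.suspect_pages;
""").fetchall()

for row in rows:
    print(row)

# The authoritative test is still a full integrity check, e.g.:
#   DBCC CHECKDB (YourDatabase) WITH NO_INFOMSGS, ALL_ERRORMSGS;
```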

Comments closed

Power BI Log Files

Kellyn Pot’vin-Gorman shows us how to access Power BI log and trace files:

We’ve now identified the process, the amount of memory allocated to perform a task captured in the log, the start time, and the duration.  The information in these log files can assist in diagnosing Power BI Desktop crashes, but the data collected is quite rudimentary.

If you shut down Power BI Desktop, the PBIDesktop* log file writes to the startup file, which was previously empty; it then empties and saves off the timestamp of the program’s exit.

The Microsoft Mashup file has much of the same information, but includes deeper level processing work by Power BI, such as work done in the Query Editor or when we create a measure or new column/table.

In the three examples from the file below, you can see a compile, a save and then an evaluate task.

There’s some useful information here for debugging.
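
If you want to find the files she is describing, the Diagnostics page in Power BI Desktop's options has a shortcut to the traces folder; the path below is the usual default and is an assumption on my part.

```python
import os
from pathlib import Path

# Usual default trace location for Power BI Desktop (an assumption; the Diagnostics
# page under File > Options will point you at the right folder if this differs).
trace_dir = Path(os.environ["LOCALAPPDATA"]) / "Microsoft" / "Power BI Desktop" / "Traces"

# Show the five most recently touched PBIDesktop / Microsoft Mashup trace files.
files = sorted(trace_dir.glob("*"), key=lambda p: p.stat().st_mtime, reverse=True)
for f in files[:5]:
    print(f"{f.name}  ({f.stat().st_size:,} bytes)")
```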

Comments closed

Plotting ML Results In R

Bernardo Lares shows off the plots he creates in R to compare ML models:

Split and compare quantiles

This parameter is the easiest to sell to the C-level guys. “Did you know that with this model, if we chop the worst 20% of leads we would have avoided 60% of the frauds and only lost 8% of our sales?” That’s what this plot will give you.

The math behind the plot might be a bit foggy for some readers so let me try and explain further: if you sort from the lowest to the highest score all your observations / people / leads, then you can literally, for instance, select the top 5 or bottom 15% or so. What we do now is split all those “ranked” rows into similar-sized-buckets to get the best bucket, the second best one, and so on. Then, if you split all the “Goods” and the “Bads” into two columns, keeping their buckets’ colours, we still have it sorted and separated, right? To conclude, if you’d say that the worst 20% cases (all from the same worst colour and bucket) were to take an action, then how many of each label would that represent on your test set? There you go!
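
This is not the lares package itself, but a small pandas sketch of the same idea: rank by score, cut into equal-sized buckets, and read off the cumulative share of bads avoided versus goods lost at each cutoff. The toy data and column names are my own.

```python
import numpy as np
import pandas as pd

# Toy scored test set: low scores are more likely to be "bad" (e.g. fraud).
rng = np.random.default_rng(1)
scores = rng.uniform(size=5000)
labels = (rng.uniform(size=5000) < 0.5 * (1 - scores)).astype(int)  # 1 = bad, 0 = good

df = pd.DataFrame({"score": scores, "label": labels}).sort_values("score")
df["bucket"] = pd.qcut(df["score"], q=10, labels=False)   # bucket 0 = worst decile

summary = df.groupby("bucket").agg(bads=("label", "sum"), total=("label", "size"))
summary["goods"] = summary["total"] - summary["bads"]

# If we cut everything up to and including this bucket, what do we avoid and lose?
summary["bads_avoided_pct"] = 100 * summary["bads"].cumsum() / summary["bads"].sum()
summary["goods_lost_pct"] = 100 * summary["goods"].cumsum() / summary["goods"].sum()
print(summary[["bads_avoided_pct", "goods_lost_pct"]].round(1))
```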

Read on to see what else he uses and how you can build it yourself.

Comments closed

SQL Operations Studio July Edition

Alan Yu announces a new version of SQL Operations Studio:

Highlights for this release include the following.

  • SQL Server Agent preview extension: job configuration support
  • SQL Server Profiler preview extension: improvements
  • Combine Scripts Extension
  • Wizard and Dialog Extensibility
  • Social content
  • Fix GitHub Issues

For complete updates, refer to the Release Notes.

Alan also has demos for each of these.  I still wish that they wouldn’t call their Extended Events viewer “Profiler” because that makes it harder for us to explain the difference between “good Profiler” and “bad Profiler.”

Comments closed

Custom Statistics Block Column Alteration DDL

Max Vernon demonstrates that custom statistics can prevent you from modifying a column:

Interestingly, if SQL Server has auto-created a stats object on a column, and you subsequently modify that column, you receive no such error. SQL Server silently drops the statistics object, and modifies the column. The auto-created stats object is *not* automatically recreated until a query is executed that needs the stats object. This difference in how auto-created stats and manually created stats are treated by the engine can make for some confusion.

Just one more thing to think about if you manually create statistics on tables.  But at least the error message is clear.
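
Here is a minimal repro sketch of the behavior, run through pyodbc against a scratch database; the connection string is a placeholder.

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=Scratch;Trusted_Connection=yes;",
    autocommit=True,
)

conn.execute("CREATE TABLE dbo.StatsDemo (SomeCol varchar(50) NULL);")
conn.execute("CREATE STATISTICS st_SomeCol ON dbo.StatsDemo (SomeCol);")

try:
    # Fails: the manually created statistics object depends on the column.
    conn.execute("ALTER TABLE dbo.StatsDemo ALTER COLUMN SomeCol varchar(100) NULL;")
except pyodbc.Error as err:
    print(err)   # the error names st_SomeCol as the blocking dependency

# Drop the statistics object (and recreate it afterward if you still need it),
# and the column alteration goes through.
conn.execute("DROP STATISTICS dbo.StatsDemo.st_SomeCol;")
conn.execute("ALTER TABLE dbo.StatsDemo ALTER COLUMN SomeCol varchar(100) NULL;")
conn.execute("DROP TABLE dbo.StatsDemo;")
```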

Comments closed