Press "Enter" to skip to content

Month: June 2018

Using LIME To Explain Keras Models

Shirin Glander shows us how to use the LIME package to explain image recognition models built from Keras:

The segmentation of an image into superpixels is an important step in generating explanations for image models. It is important both that the segmentation is correct and follows meaningful patterns in the picture, and that the size/number of superpixels is appropriate. If the important features in the image are chopped into too many segments, the permutations will probably damage the picture beyond recognition in almost all cases, leading to a poor or failing explanation model. As the size of the object of interest varies, it is impossible to set up hard rules for the number of superpixels to segment into – the larger the object is relative to the size of the image, the fewer superpixels should be generated. Using plot_superpixels, it is possible to evaluate the superpixel parameters before starting the time-consuming explanation function.

Fun stuff.  I’m glad that there’s a lot of work going into explaining neural networks rather than hand-waving them off as magic.
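The post works in R with the lime package's plot_superpixels(); purely as a rough analogue, here is a minimal Python sketch that previews SLIC superpixel segmentation with scikit-image before running an expensive explanation, which is the same sanity check the quote describes. The image path and segment counts are placeholders, not values from the post.

```python
# Preview superpixel segmentation at a few sizes before paying for the full
# explanation run; a rough stand-in for lime::plot_superpixels() in R.
# "photo.jpg" and the segment counts are placeholders.
import matplotlib.pyplot as plt
from skimage.io import imread
from skimage.segmentation import mark_boundaries, slic

image = imread("photo.jpg")

# Per the quoted advice: the larger the object relative to the image,
# the fewer superpixels you want.
for n_segments in (50, 100, 200):
    segments = slic(image, n_segments=n_segments, compactness=10)
    plt.figure()
    plt.title(f"n_segments = {n_segments}")
    plt.imshow(mark_boundaries(image, segments))
    plt.axis("off")

plt.show()
```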

Comments closed

Learning R Or Python?

David Smith tackles the age-old question:

If your interests lean more towards traditional statistical analysis and inference as used within industries like manufacturing, finance, and the life sciences, I’d lean towards R. If you’re more interested in machine learning and artificial intelligence applications, I’d lean towards Python. But even that’s not a hard-and-fast rule: R has excellent support for machine learning and deep learning frameworks, and Python is often used for traditional data science applications.

One thing I am quite sure of though: neither Python nor R is inherently better than the other, and arguments on that front are ultimately futile. (Trust me, I’ve been there.) Which is better for any given person depends on a wide variety of factors, and for some, it may even be worthwhile to learn both. Brian Ray recently posted a good overview of the factors that may lead you towards R or Python for data science: their history, the community, performance, third-party support, use cases, and even how to use them together. It’s great food for thought if you’re trying to decide which community to invest in.

Embrace the power of “and.”  The whole R versus Python bit is fun for purposes of arguing with people, but they’re both powerful languages and we’re seeing more and more overlap—for example, the Keras package David mentions runs Python’s TensorFlow under the covers.

Comments closed

Backing Up Azure Data Lake Store Data

Hugo Almeida has some hints for backing up Azure Data Lake Store data using Azure Data Factory:

Our Hadoop HDP IaaS cluster on Azure uses Azure Data Lake Store (ADLS) as its data repository and accesses it through an application user created in Azure Active Directory (AAD). Check this tutorial if you want to connect your own Hadoop to ADLS.

Our ADLS is getting bigger and we’re working on a backup strategy for it. ADLS provides locally-redundant storage (LRS); however, this does not prevent our application from corrupting data or accidentally deleting it. Since Microsoft hasn’t published a new version of ADLS with a clone feature, we had to find a way to back up all the data stored in our data lake.

We’re going to show you how to do a full ADLS backup with Azure Data Factory (ADF). ADF does not preserve permissions; however, our Hadoop client can only access the AzureDataLakeStoreFilesystem (adl) through Hive with a “hive” user, and we can generate these permissions before the backup.

Read the whole thing if you’re thinking of using Azure Data Lake Store.
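The post's approach uses Azure Data Factory, so click through for the ADF setup. Purely as a sketch of the underlying idea (bulk-copying ADLS Gen1 data out under a service principal), here is a hedged Python example with the azure-datalake-store SDK; every identifier in it (tenant, app, store, paths) is a placeholder.

```python
# Hedged sketch: recursively pull an ADLS Gen1 folder down with an AAD
# service principal.  The article itself uses Azure Data Factory; this only
# illustrates the copy-out idea.  All identifiers below are placeholders.
from azure.datalake.store import core, lib, multithread

token = lib.auth(
    tenant_id="YOUR_TENANT_ID",
    client_id="YOUR_AAD_APP_ID",
    client_secret="YOUR_AAD_APP_SECRET",
)

source = core.AzureDLFileSystem(token, store_name="sourcedatalake")

# Download a folder tree; the local copy can then be shipped to whatever
# backup target you use (blob storage, another ADLS account, and so on).
multithread.ADLDownloader(
    source,
    rpath="/data/important-folder",
    lpath="/backup/important-folder",
    nthreads=16,
    overwrite=True,
)
```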

Comments closed

Retrieving Statistic Use From Query Plan XML

Lonny Niederstadt shows us how to retrieve stats usage details from SQL Server query plans if trace flag 8666 is enabled:

Years ago someone said “Hey – why not drop auto-created stats, since the stats you need will just get created again and you’ll end up getting rid of those you no longer need.”  That *may* be a reasonable step on some systems.  If the risk of bad plans on first execution of a query needing stats that have been dropped is too high, it’s a bad deal.  If the potential concurrent cost of auto-creating dropped stats is too high, that’s a bad deal.  What about analyzing query plans over some period of time to see which stats are actually used in those plans?  Then auto-stats which aren’t used in that set of plans could be dropped.

That type of stats analysis could have other uses, too.  Prioritizing manual stats updates in regular maintenance comes to mind.  Or, determining what stats to create/update on an Always On Availability Group primary based on secondary activity.  And troubleshooting problem queries or identifying suspicious “watchlist” stats based on highly variable queries/plans they are involved with.

So I created this blog post almost 4 years ago.  And now I’ll plead with you to not use the query there… it’s awful.  If you want to query trace flag 8666 style stats from plan XML, please start from the query in this post instead – it’s much better behaved 🙂

Read on for the script.
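If you just want to poke at the idea before adopting Lonny's query, here is a hedged Python sketch that pulls cached plan XML and lists the statistics names the trace flag exposes. The wszStatName field name is the attribute commonly reported under the trace flag's extra plan nodes; verify it (and the connection details, which are placeholders) against your own environment.

```python
# Hedged sketch: list statistics names surfaced in cached plan XML when
# trace flag 8666 is enabled.  "wszStatName" is the commonly reported field
# name; confirm it against your own plans.  Connection details are placeholders.
import xml.etree.ElementTree as ET
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=master;Trusted_Connection=yes;"
)

sql = """
SELECT TOP (100) CONVERT(nvarchar(max), qp.query_plan) AS plan_xml
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
WHERE qp.query_plan IS NOT NULL;
"""

stats_seen = set()
for (plan_xml,) in conn.execute(sql):
    root = ET.fromstring(plan_xml)
    for elem in root.iter():
        if elem.attrib.get("FieldName") == "wszStatName":
            stats_seen.add(elem.attrib.get("FieldValue"))

for name in sorted(s for s in stats_seen if s):
    print(name)
```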

Comments closed

Alerting On Azure Data Lake Store Data Usage

Jose Lara shows off an interesting feature in Azure Data Lake Store:

The massive scale and capabilities of Azure Data Lake Store are regularly used by companies for big data storage. As the number of files, file types, and folders grow, things get harder to manage and staying compliant becomes a greater challenge for companies. Regulations such as GDPR (General Data Protection Regulation) have heightened requirements for control and supervision of files that contain sensitive data.

In this blog post, I’ll show you how to set up alerts in your Azure Data Lake Store to make managing your data easier. We will create a log analytics query and an alert that monitors a specific path and file type and sends a notification whenever the path or file is created, accessed, modified, or deleted.

Auditing access has historically been tricky, so it’s nice that they were able to get that in.

Comments closed

Updating SQL Agent Job Owners With dbatools

Stuart Moore gives us two methods of updating SQL Agent job owners, one using T-SQL and the other with dbatools:

Now we all know that having SQL Server Agent jobs owned by ‘Real’ users isn’t a good idea. But I don’t keep that close an eye on some of our test instances, so wasn’t surprised when I spotted this showing up in the monitoring:

The job failed. Unable to determine if the owner (OldDeveloper) of job important_server_job has server access (reason: Could not obtain information about Windows NT group/user 'OldDeveloper', error code 0x534. [SQLSTATE 42000] (Error 15404)).

Wanting to fix this as quickly and simply as possible I just wanted to bulk move them to our job owning account (let’s use the imaginative name of ‘JobOwner’).

Click through for both scripts.
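Purely as a hedged sketch of the T-SQL route (not the dbatools one) driven from Python, bulk-reassigning jobs might look like this; the server, old owner, and new owner names are placeholders.

```python
# Hedged sketch: reassign every SQL Agent job owned by a departed login to a
# dedicated job-owning account via msdb.dbo.sp_update_job.  Server and login
# names are placeholders; the post's own scripts are the reference.
import pyodbc

OLD_OWNER = "OldDeveloper"
NEW_OWNER = "JobOwner"

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=msdb;Trusted_Connection=yes;",
    autocommit=True,
)
cursor = conn.cursor()

# Find jobs whose owner_sid maps back to the old login.
cursor.execute(
    "SELECT name FROM msdb.dbo.sysjobs WHERE SUSER_SNAME(owner_sid) = ?",
    OLD_OWNER,
)
jobs = [row.name for row in cursor.fetchall()]

for job in jobs:
    print(f"Reassigning {job} to {NEW_OWNER}")
    cursor.execute(
        "EXEC msdb.dbo.sp_update_job @job_name = ?, @owner_login_name = ?",
        job,
        NEW_OWNER,
    )
```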

Comments closed

FCB_REPLICA_SYNC Spinlock Explanation

Paul Randal explains what the FCB_REPLICA_SYNC spinlock is and what it does:

In a nutshell, this spinlock is used to synchronize access to the list of pages that are present in a database snapshot, as follows:

  • If a page in a database with one or more database snapshots is being updated, check each snapshot’s list to see if the page is already in the snapshot. If yes, nothing to do. If no, copy the pre-change image of the page into the snapshot.
  • If a query is reading a page in the context of a database snapshot, check the list of pages to see whether to read from the snapshot or the source database.

This synchronization ensures that the correct copy of a page is read by a query using the snapshot, and that updated pages aren’t copied to the snapshot more than once.

The original question was because the person was seeing trillions of spins for the FCB_REPLICA_SYNC spinlock. That’s perfectly normal if there’s at least one database snapshot, a read workload on the snapshot, and a concurrent heavy update workload on the source database.

Great information.  And a good reminder that if you are using database snapshots in SQL Server, you generally don’t want to have more than one on the same database.
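If you want to see where your own instance stands, a minimal sketch (connection details are placeholders) is to check sys.dm_os_spinlock_stats; remember the counters are cumulative since startup, so sample twice and diff if you want a rate.

```python
# Minimal sketch: report cumulative spin counters for FCB_REPLICA_SYNC.
# Counters reset at instance restart; connection details are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=master;Trusted_Connection=yes;"
)

row = conn.execute(
    """
    SELECT name, collisions, spins, spins_per_collision, backoffs
    FROM sys.dm_os_spinlock_stats
    WHERE name = 'FCB_REPLICA_SYNC';
    """
).fetchone()

if row:
    print(f"{row.name}: {row.spins:,} spins, "
          f"{row.collisions:,} collisions, {row.backoffs:,} backoffs")
```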

Comments closed

When Rowstore Compression Beats Columnstore

Joe Obbish looks at scenarios where page-level compression on rowstore tables can beat columnstore compression in terms of resultant table size:

It’s certainly more difficult to come up with a demo that works without string columns, but consider how the page compression algorithm works. Data can be compressed on a page basis, which includes both multiple rows and multiple columns. That means that page compression can achieve a higher compression ratio when a row has identical values in different columns. Columnstore is only able to compress on an individual column basis and you won’t directly see better compression with repeated values in different columns for a single row (as far as I know).

Interestingly, Joe also comes up with a scenario where row-level compression can beat columnstore even without string values.  All this said, the normal case when dealing with non-string data is that columnstore tends to compress a lot better.
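If you want to test this against your own data shapes, a hedged harness along these lines makes the comparison easy: build the same data twice, compress one copy with PAGE and the other with a clustered columnstore index, then compare sp_spaceused. All object names are hypothetical, and whether page actually wins depends entirely on the data.

```python
# Hedged harness: same data as a page-compressed heap and as a clustered
# columnstore index, then compare sizes.  Object names are hypothetical and
# the winner depends on the data; this only provides the comparison.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=tempdb;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

cur.execute("""
DROP TABLE IF EXISTS dbo.page_demo, dbo.cci_demo;

-- One million rows where every column in a row holds the same value,
-- the kind of cross-column repetition the quote calls out.
SELECT TOP (1000000)
       n.rn AS col1, n.rn AS col2, n.rn AS col3, n.rn AS col4
INTO dbo.page_demo
FROM (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
      FROM sys.all_columns AS a CROSS JOIN sys.all_columns AS b) AS n;

SELECT * INTO dbo.cci_demo FROM dbo.page_demo;

ALTER TABLE dbo.page_demo REBUILD WITH (DATA_COMPRESSION = PAGE);
CREATE CLUSTERED COLUMNSTORE INDEX cci ON dbo.cci_demo;
""")

for table in ("dbo.page_demo", "dbo.cci_demo"):
    cur.execute("EXEC sp_spaceused ?", table)
    name, rows, reserved, data, index_size, unused = cur.fetchone()
    print(f"{name}: reserved {reserved}, data {data}")
```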

Comments closed

Performance Tuning Window Functions

Kathi Kellenberger gives us some hints on tuning queries using window functions:

The only way to overcome the performance impact of sorting is to create an index specifically for the OVER clause. In his book Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions, Itzik Ben-Gan recommends the POC index. POC stands for (P)ARTITION BY, (O)RDER BY, and (C)overing. He recommends adding any columns used for filtering before the PARTITION BY and ORDER BY columns in the key. Then add any additional columns needed to create a covering index as included columns. Just like anything else, you will need to test to see how such an index impacts your query and overall workload. Of course, you cannot add an index for every query that you write, but if the performance of a particular query that uses a window function is important, you can try out this advice.

There are some good insights here.
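Kathi's article has full examples; as a quick illustration of the POC pattern against a hypothetical Sales.Orders table (every object name here is made up), the key order is filter column(s), then PARTITION BY, then ORDER BY, with the remaining selected columns as INCLUDEs.

```python
# Hedged sketch of a POC (Partition, Order, Covering) index for a window
# query.  Sales.Orders and its columns are hypothetical; adapt to your schema
# and confirm the Sort operator disappears from the plan.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=SalesDB;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

# Key: filter column, then PARTITION BY column, then ORDER BY column;
# INCLUDE the rest of the selected columns to make it covering.
cur.execute("""
CREATE NONCLUSTERED INDEX IX_Orders_POC
ON Sales.Orders (SalesTerritoryID, CustomerID, OrderDate)
INCLUDE (OrderID, TotalDue);
""")

rows = cur.execute("""
SELECT CustomerID, OrderID, OrderDate, TotalDue,
       ROW_NUMBER() OVER (PARTITION BY CustomerID
                          ORDER BY OrderDate) AS OrderSeq
FROM Sales.Orders
WHERE SalesTerritoryID = 3;
""").fetchall()

print(f"{len(rows)} rows returned; compare plans with and without the index.")
```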

Comments closed

Enriching Syslog Data In A Kafka Pipeline

Robin Moffatt continues his syslog processing series with Kafka and KSQL:

In this article we’re going to conclude our fun with syslog data by looking at how we can enrich inbound streams of syslog data with reference information from elsewhere to produce a real-time enriched data stream. The syslog data in this example comes from various servers and network devices, and the additional information with which we’re going to enrich it is from MongoDB, which happens to be the datastore used by Ubiquiti network devices. With the enriched data we’re going to drive some real-time analytics through Elasticsearch and Kibana, as well as trigger push notifications based on activity of certain devices on the network.

I’ve enjoyed this series—it was a full, end-to-end look at a realistic business problem in Kafka Streams.  If you want to get started with Kafka Streams, I’d be hard-pressed to find a better example.
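The series does the enrichment declaratively in KSQL, with the MongoDB reference data flowing in through Kafka Connect. Purely as a rough imperative analogue (and not the article's code), a Python consumer/producer that joins syslog events against a lookup table might look like this sketch; all topic names, message fields, and the broker address are placeholders.

```python
# Rough imperative analogue of the KSQL stream-table join in the article:
# enrich syslog events with device metadata keyed on hostname.  Topic names,
# message fields, and the broker address are placeholders.
import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "syslog-enricher",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["syslog"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

# In the article the reference data comes from MongoDB via Kafka Connect;
# here it is just an in-memory lookup keyed on hostname.
devices = {"ubnt-ap-01": {"site": "office", "model": "UAP-AC-PRO"}}

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    event["device"] = devices.get(event.get("host"), {})
    producer.produce("syslog_enriched", json.dumps(event).encode("utf-8"))
    producer.poll(0)  # serve delivery callbacks
```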

Comments closed