June 3, 2020

Genetic Algorithms in Python

Abhinav Choudhary walks us through building a genetic algorithm library in Python:

Here are quick steps for how the genetic algorithm works:

1. Initial Population – Initialize the population randomly based on the data.
2. Fitness function – Find the fitness value of each of the chromosomes (a chromosome is a set of parameters which defines a proposed solution to the problem that the genetic algorithm is trying to solve).
3. Selection – Select the fittest chromosomes as parents to pass their genes on to the next generation and create a new population.
4. Cross-over – Create a new set of chromosomes by combining the parents and add them to the new population set.
5. Mutation – Perform mutation, which alters one or more gene values in chromosomes of the newly generated population. Mutation helps maintain diversity; the resulting population is used in the next generation.
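
The article builds this out into a full library; as a rough sketch of the five steps above, here is a minimal, self-contained Python version (the toy fitness function and the parameter choices are mine for illustration, not the article's):

import random

GENES = 20          # bits per chromosome
POP_SIZE = 50
GENERATIONS = 40
MUTATION_RATE = 0.01

def random_chromosome():
    # Step 1: the initial population is random
    return [random.randint(0, 1) for _ in range(GENES)]

def fitness(chromosome):
    # Step 2: score a candidate solution (here: count of 1-bits)
    return sum(chromosome)

def select(population):
    # Step 3: keep the better-scoring half as parents
    ranked = sorted(population, key=fitness, reverse=True)
    return ranked[: len(ranked) // 2]

def crossover(parent_a, parent_b):
    # Step 4: combine two parents at a random cut point
    cut = random.randint(1, GENES - 1)
    return parent_a[:cut] + parent_b[cut:]

def mutate(chromosome):
    # Step 5: occasionally flip a gene to keep the population diverse
    return [1 - gene if random.random() < MUTATION_RATE else gene
            for gene in chromosome]

population = [random_chromosome() for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    parents = select(population)
    population = [mutate(crossover(*random.sample(parents, 2)))
                  for _ in range(POP_SIZE)]

print(max(fitness(c) for c in population))  # approaches GENES over time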

I’m a sucker for genetic algorithms (and even more so its cousin, genetic programming). And there are still good use cases for genetic algorithms, especially in creating scoring functions for neural networks.

Avoiding Overfitting and Underfitting in Neural Networks

Manas Narkar provides some advice on optimizing neural network models:

Adding Dropout

Dropout is considered one of the most effective regularization methods. Dropout basically means randomly zeroing out (dropping) features from your layer during the training process, or introducing some noise into the samples. The key thing to note is that this is only applied at training time; at test time, no values are dropped out. Instead, they are scaled. The typical dropout rate is between 0.2 and 0.5.
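
As a quick illustration of where dropout slots in (a sketch using Keras, which may or may not be what the article uses; the layer sizes and rate here are arbitrary choices):

# Dropout(0.3) randomly zeroes 30% of the preceding layer's outputs on
# each training step; Keras handles the rescaling and disables dropout
# automatically at inference time.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(64, activation="relu", input_shape=(20,)),
    Dropout(0.3),   # rate within the typical 0.2-0.5 range
    Dense(64, activation="relu"),
    Dropout(0.3),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")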

Click through for a demo on dropout, as well as coverage of several other techniques.

Row Counts and Arrow Widths, Continued

Hugo Kornelis finishes a series on row counts and arrow widths with a look at Compute Scalar operators:

The Compute Scalar operator is probably the most common of all operators. I hardly ever see an execution plan that doesn’t have at least a few occurrences of this operator. The task of the Compute Scalar operator is a simple one: to use some of the data in its input and, based on that, produce new data that is then added as extra columns in its output.

Because of the simplicity of this task, the actual execution of that task is often done by one of the other operators in the execution plan, and the Compute Scalar operator itself doesn’t actually execute. A side effect is that it can’t track how many rows it processes, because it doesn’t process anything at all. The result is that, even in an execution plan with run-time statistics (aka “actual execution plan”), no run-time statistics will be reported by a Compute Scalar operator when all its computations are performed by other operators. (See also the note in this (retired) Books Online article).
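
To make that concrete, a derived expression in the SELECT list is a typical source of a Compute Scalar operator (a sketch assuming the AdventureWorks sample database):

-- The derived column TotalWithTax is new data computed from the
-- operator's input columns and added as an extra column in its output.
SELECT soh.SalesOrderID,
       soh.SubTotal,
       soh.SubTotal * 1.08 AS TotalWithTax
FROM Sales.SalesOrderHeader AS soh;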

But then Hugo head-fakes us and shows us the real conclusion:

I already described, in a previous post, how sometimes the optimizer can create an execution plan that uses a Filter operator to evaluate a specific predicate, but then a post-optimization rewrite finds a way to push that predicate down into another operator, as a Predicate property, and then removes the Filter operator. When this happens with a bitmap filter, the Estimated Number of Rows is not adjusted, which can be quite confusing.

For the issue in this post, the root cause was the same, but the error surfaces completely differently.

This has been a fun series to read, showing how an extremely useful signal can nonetheless exhibit problems in many edge cases.

Merge Replication: Subscriber and Publisher Versions

Steve Stedman gives us the proper order for upgrading SQL Server when you’re using merge replication:

Working on a recent SQL Server merge replication project, we needed to update some of the servers in a merge replication scenario without upgrading all of them. Consider a merge replication setup with a publisher, a distributor, and two or more subscribers, all on the same version of SQL Server, where you need to upgrade the SQL Server version on the subscriber to a newer version like SQL Server 2019.

Before doing any type of upgrade, I wanted to confirm whether things would or would not work. Checking some Microsoft documentation, it appears that replication from a publisher on SQL 2012, SQL 2014, SQL 2016, or any older version to a subscriber running on SQL Server 2019 is not supported. Or, more specifically, the subscriber needs to be on the same or an older version than the publisher.
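
Before planning the upgrade order, it can help to inventory versions across the topology; something like this simple sketch, run on the publisher, the distributor, and each subscriber:

-- Run on each server in the replication topology
SELECT SERVERPROPERTY('ProductVersion') AS ProductVersion,
       SERVERPROPERTY('ProductLevel')   AS ProductLevel,
       SERVERPROPERTY('Edition')        AS Edition;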

Read on for a demo, as well as an interesting caveat.

Working with ADLS Gen 2 in Power Query

Marco Russo takes us through some ways to optimize performance when working with Azure Data Lake Storage Gen 2 from Power Query:

With Power Query you can apply filters to the list obtained by the File System View option, thus restricting the access to only those files (or a single file) you are interested in. However, there is no query folding of this filter. What happens is that every time you refresh the data source, the list of all these files is read by Power Query; the filters in M Query to the folder path and the file name are then applied to this list only client-side. This is very expensive because the entire list is also downloaded when the expression is initially evaluated just to get the structure of the result of the transformation.

A better way to manage the process is to specify in the URL the complete folder path to traverse the hierarchy, getting only the files that are relevant to the transformation – or the exact path of the file, if you are able to do that. For example, the data lake I used had one file for each day, stored in a folder structure organized by yyyy\mm, so every folder holds up to 31 files (one month).
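
In M, that might look something like this sketch (the storage account, container, and folder layout are placeholders):

// Point directly at one month's folder instead of listing every file
// in the container and filtering client-side.
let
    Source = AzureStorage.DataLake("https://account.dfs.core.windows.net/container/2020/05")
in
    Source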

Read on for more advice in this vein.

Untrusted Shared Access Signature Certificates in SQL Server

William Assaf diagnoses an issue:

If you’ve tried doing Backup to URL with SQL Server using a Shared Access Signature (SAS) certificate and received this error:

Error: 18204, Severity: 16, State: 1.
BackupDiskFile::CreateMedia: Backup device 'https://account.blob.core.windows.net/container/folder/msdb_log_202005170801.trn' failed to create. Operating system error 50 (The request is not supported.).
Cannot open backup device 'https://account.blob.core.windows.net/container/folder/msdb_log_202005170801.trn'. Operating system error 50 (The request is not supported.). [SQLSTATE 42000] (Error 3201)
BACKUP LOG is terminating abnormally. [SQLSTATE 42000] (Error 3013)
NOTE: The step was retried the requested number of times (1) without succeeding. The step failed.

You may have received the same error I encountered.

This error popped up only after startup of SQL Server. To resolve the problem, we’d recreate the SAS key, using the same cert in the same script, and the backups would start working again. This affected all types of SQL database backups.
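
For reference, Backup to URL with SAS authentication looks roughly like this (a sketch; the storage account, container, token, and database name are placeholders):

-- The credential name must exactly match the container URL, and the
-- secret is the SAS token without the leading '?'.
CREATE CREDENTIAL [https://account.blob.core.windows.net/container]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'sv=2019-02-02&ss=b&srt=co&sp=rwdlacx&se=2021-01-01&sig=...';

BACKUP DATABASE [YourDatabase]
TO URL = 'https://account.blob.core.windows.net/container/YourDatabase.bak';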

William did some troubleshooting and explains the core problem. It’s a weird one.

Updated SQL Server Diagnostic Queries

Glenn Berry has an updated set of DMV queries for us:

These are my SQL Server Diagnostic Information Queries for June 2020, aka my DMV Diagnostic Queries. They allow you to get a very comprehensive view of the configuration and performance of your SQL Server instance in a short amount of time. There are separate versions of these T-SQL queries for SQL Server 2005 through SQL Server 2019. I also have separate versions for SQL Managed Instance and Azure SQL Database. My diagnostic queries have been used by many people around the world since 2009. I make regular improvements to these queries each month.
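
As a small taste of the kind of information queries like these surface (a sketch in the same spirit, not one of Glenn's actual queries):

-- Instance-level hardware and startup information from a single DMV
SELECT sqlserver_start_time,
       cpu_count,
       hyperthread_ratio,
       physical_memory_kb / 1024 AS physical_memory_mb
FROM sys.dm_os_sys_info;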

This is one of my favorite methods for learning about a new SQL Server instance.

The Rollup and Cube Operators

Greg Dodd digs into the ROLLUP and CUBE operators in a two-parter. First, ROLLUP:

As you can see, we now have these nulls popping up, but with totals. Row 5, for example, tells us that in 2017 there were 1,427,461 people living in Hawaii. Row 11 tells us that there were 2,438,188 people living in Rhode Island and Hawaii in 2017. Row 22 tells us that there were 2,429,070 people living in Rhode Island and Hawaii in 2018, and finally row 23 tells us that in total there have been 4,867,268 people in 2017 and 2018. This last row is a bit useless for this data, as the overlap of those people would be huge, but for something like sales data, this number could be useful.

Next, CUBE:

For those with a keen eye, you’ll see that I’ve started at row 28 in that screenshot. When we run the GROUP BY without ROLLUP or CUBE, we get just 16 rows. With ROLLUP that grows to 23, but with CUBE it explodes out to 57. Why?

I’ve used ROLLUP several times with proper hierarchical data (e.g., product category, product sub-category, product) and it does an excellent job of summarizing that sort of data. CUBE has always returned too many rows for my liking. But the operator I go to most frequently is GROUPING SETS, as then I get to control the levels.
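
For comparison, a GROUPING SETS version of this kind of query returns only the levels you ask for (the table and column names here are illustrative):

-- Detail rows, per-year totals, and a grand total; none of the extra
-- combinations CUBE would emit.
SELECT StateName, [Year], SUM(Population) AS Population
FROM dbo.StatePopulation
GROUP BY GROUPING SETS
(
    (StateName, [Year]),  -- detail rows
    ([Year]),             -- per-year totals
    ()                    -- grand total
);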
