Press "Enter" to skip to content

Month: July 2018

Constrained Optimization In Python: pyomo

Jeff Schecter introduces us to pyomo, a Python package for constrained optimization problems:

Constrained optimization is a tool for minimizing or maximizing some objective, subject to constraints. For example, we may want to build new warehouses that minimize the average cost of shipping to our clients, constrained by our budget for building and operating those warehouses. Or, we might want to purchase an assortment of merchandise that maximizes expected revenue, limited by a minimum number of different items to stock in each department and our manufacturers’ minimum order sizes.

Here’s the catch: all objectives and constraints must be linear or quadratic functions of the model’s fixed inputs (parameters, in the lingo) and free variables.

Constraints are limited to equalities and non-strict inequalities. (Re-writing strict inequalities in these terms can require some algebraic gymnastics.) Conventionally, all terms including free variables live on the lefthand side of the equality or inequality, leaving only constants and fixed parameters on the righthand side.

To build your model, you must first formalize your objective function and constraints. Once you’ve expressed these terms mathematically, it’s easy to turn the math into code and let pyomo find the optimal solution.

I haven’t touched it in a decade, but I did have some success with LINGO for solving the same type of problem.


Benefits To Federating The Hadoop NameNode

Hanisha Koneru and Arpit Agarwal show us a few benefits to NameNode federation:

The Apache Hadoop Distributed File System (HDFS) is highly scalable and can support petabyte-size clusters. However, the entire Namespace (file system metadata) is stored in memory. So even though the storage can be scaled horizontally, the namespace can only be scaled vertically. It is limited by how many files, blocks, and directories can be stored in the memory of a single NameNode process.

Federation was introduced in order to scale the name service horizontally by using multiple independent Namenodes/Namespaces. The Namenodes are independent of each other and there is no communication between them. The Namenodes can share the same Datanodes for storage.

KEY BENEFITS

  • Scalability: Federation adds support for horizontal scaling of the namespace.
  • Performance: Adding more Namenodes to a cluster increases the aggregate read/write throughput of the cluster.
  • Isolation: Users and applications can be divided between the Namenodes.

Read on for examples.


SQL Server Vulnerability Assessment PowerShell Cmdlets

Ronit Reger announces a new set of SQL Server vulnerability assessment PowerShell cmdlets:

SQL Vulnerability Assessment (VA) is a service that provides visibility into your security state and includes actionable steps to resolve security issues and enhance your database security. It can help you:

  • Meet compliance requirements that require database scan reports.
  • Meet data privacy standards.
  • Monitor a dynamic database environment where changes are difficult to track.

VA runs vulnerability scans on your database, flagging security vulnerabilities and highlighting deviations from best practices, such as misconfigurations, excessive permissions, and unprotected sensitive data. The rules are based on Microsoft’s best practices and focus on the security issues that present the biggest risks to your database and its valuable data. These rules also represent many of the requirements from various regulatory bodies to meet their compliance standards.

Results of the scan include actionable steps to resolve each issue and provide customized remediation scripts where applicable. An assessment report can be customized for your environment by setting an acceptable baseline for permission configurations, feature configurations, and database settings. This baseline is then used as a basis for comparison in subsequent scans, to detect deviations or drifts from your secure database state.

Read on for more, and if you’re interested, the cmdlets are available in the SqlServer PowerShell module.


Read-Scale Availability Groups

Ryan Adams explains how to create a Read-Scale Availability Group:

A Read-Scale Availability Group is a clusterless Availability Group.  Its sole purpose and design is to scale out a read workload.  More important is what it is not: it is NOT a High Availability or Disaster Recovery solution.  Since this design has no cluster under it, you lose things like automatic failover and database-level health detection.  For example, you have reports that run for customers in your DMZ, which is firewalled off from your internal network.  Opening up ports for Active Directory so that you can have a cluster means opening a ton of ephemeral ports and ports with high attack vectors.  Remember the Slammer worm?  This solution removes those dependencies.
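To give a feel for the general shape before you click through: a clusterless AG hinges on the CLUSTER_TYPE = NONE option introduced in SQL Server 2017. Below is a minimal sketch only; the server, database, and endpoint names are hypothetical, and Ryan’s post has the real, complete scripts.

CREATE AVAILABILITY GROUP ReadScaleAG
	WITH ( CLUSTER_TYPE = NONE )  -- clusterless: no automatic failover, no cluster health detection
	FOR DATABASE ReportingDB
	REPLICA ON
		N'SQLPROD01' WITH (
			ENDPOINT_URL = N'tcp://SQLPROD01:5022',
			AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
			FAILOVER_MODE = MANUAL,  -- MANUAL is required when there is no cluster
			SECONDARY_ROLE ( ALLOW_CONNECTIONS = ALL )
		),
		N'SQLREPORT01' WITH (
			ENDPOINT_URL = N'tcp://SQLREPORT01:5022',
			AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
			FAILOVER_MODE = MANUAL,
			SECONDARY_ROLE ( ALLOW_CONNECTIONS = ALL )
		);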

Click through for the setup scripts as well as a video of Ryan putting it all together.  As long as you recognize the trade-offs involved, this can be a nice solution to certain problems.


Resumable Online Index Creation In Azure SQL Database

Niko Neugebauer looks at a feature coming in SQL Server vNext:

It is about time to create our first Clustered Online Resumable Index:

CREATE CLUSTERED INDEX CI_SampleDataTable
	ON dbo.SampleDataTable (c1)
		WITH ( ONLINE = ON, RESUMABLE = ON ) ;

But all we shall get is an error message:

Msg 155, Level 15, State 1, Line 25
'RESUMABLE' is not a recognized CREATE CLUSTERED INDEX option.

I was shocked and disappointed, but I understood that it was my own mind’s fault. Nobody, I repeat – NOBODY told me that it would work for CLUSTERED indexes, but when I saw the announcement that indexes were supported, I believed that the traditional (no XML, no CLR, no LOBs) rowstore indexes would be fully supported. Oh yes, I know that it is crazy difficult. I know that this is a pretty forward-looking feature, but come on – my mind played a trick on me, telling me a story that does not exist, for now at least.

After realising my mind’s mistake, I took a deep breath and decided to try out Resumable Nonclustered Index creation with the following command:

CREATE NONCLUSTERED INDEX NCI_SampleDataTable
	ON dbo.SampleDataTable (c1)
		WITH ( ONLINE = ON, RESUMABLE = ON );
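Once a resumable build like that one is running, it can be paused and picked up again later. Here is a quick sketch of that part, reusing the table from Niko’s example and assuming the SQL Server 2017+ / Azure SQL Database syntax:

ALTER INDEX NCI_SampleDataTable ON dbo.SampleDataTable PAUSE;

-- Check progress of paused or in-flight resumable operations
SELECT name, percent_complete, state_desc
	FROM sys.index_resumable_operations;

ALTER INDEX NCI_SampleDataTable ON dbo.SampleDataTable RESUME;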

Hopefully we get a bit more support as SQL Server vNext is developed and eventually released.  In the meantime, Niko hits some limitations, but his timings for the feature look good.


Finding Scalar Functions In Execution Plans

Kendra Little points out that scalar user-defined functions can hide in the most unassuming of places:

After we find matches based on the customer id, we have more work “left over” — that’s the “residual” bit.

For every row that matches, SQL Server is plugging values into the Website.CalculateCustomerPrice() function and comparing the result to the Unit price column, just like we asked for in the where clause.

In other words, this is happening for every row in Sales.InvoiceLines that has a matching row in Sales.Invoices.

Which is every single invoice & invoice line, as it turns out.
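For reference, the query shape in question looks something like the sketch below. It assumes the WideWorldImporters sample database (where Website.CalculateCustomerPrice lives); the parameter list is illustrative, so check the function definition in your copy of the sample.

-- The UDF hides in the WHERE clause and runs once per matching row,
-- surfacing only as a residual predicate on the join operator.
SELECT COUNT(*)
	FROM Sales.Invoices AS i
	INNER JOIN Sales.InvoiceLines AS il
		ON i.InvoiceID = il.InvoiceID
	WHERE il.UnitPrice = Website.CalculateCustomerPrice(i.CustomerID, il.StockItemID, i.InvoiceDate);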

It’s a shame there’s no “this is why your query is slow” plan operator for scalar UDFs.


Faking Arrays In T-SQL With Custom Types

Jovan Popovic shows how to use custom types as pseudo-arrays in SQL Server:

One of the missing language features in T-SQL is array support. In some cases you can use custom types to work with arrays that are passed as parameters to your stored procedures.

Custom types in T-SQL enable you to create an alias for a table, .NET, or built-in type. Once you create a custom type, you can use it for local variables and parameters of functions.
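The table-type flavor of this pattern looks roughly like the sketch below; the type, table, and procedure names are made up for illustration.

-- Define a table type to stand in for an integer array.
CREATE TYPE dbo.IntArray AS TABLE ( Value int NOT NULL );
GO

-- Table-valued parameters must be declared READONLY.
CREATE OR ALTER PROCEDURE dbo.GetOrders
	@OrderIds dbo.IntArray READONLY
AS
BEGIN
	SELECT o.*
		FROM dbo.Orders AS o
		INNER JOIN @OrderIds AS ids
			ON ids.Value = o.OrderId;
END
GO

-- Usage: fill a variable of the type and pass it like an array.
DECLARE @ids dbo.IntArray;
INSERT INTO @ids (Value) VALUES (1), (2), (3);
EXEC dbo.GetOrders @OrderIds = @ids;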

I go back and forth on whether I’d like full array support in T-SQL: on the plus side, it simplifies interactions with external tools; on the other hand, it can promote bad habits like violating first normal form.


How Qubole Optimizes Apache Spark Clusters

Mikhail Stolpner gives us some tips on how to optimize Apache Spark clusters:

There are four major resources: memory, compute (CPU), disk, and network. Memory and compute are by far the most expensive. Understanding how much compute and memory your application requires is crucial for optimization.

You can configure how much memory and how many CPUs each executor gets. While the number of CPUs for each task is fixed, executor memory is shared between the tasks processed by a single executor.

A few key parameters provide the most impact on how Spark is executed in terms of resources: spark.executor.memory, spark.executor.cores, spark.task.cpus, spark.executor.instances, and spark.qubole.max.executors.

This article gives us some idea of the levers we have available as well as when to pull them.  Though the article itself is vendor-specific, a lot of the advice is general.


Area Under The ROC Is Not Accuracy

Stephen Chen debunks bad journalistic summaries of a Google research paper:

Journalists latched onto Google’s NN 0.95 score vs. the comparison 0.86 (see EWS Strawman below) as the accuracy of determining mortality. However, the actual metric the researchers used is AUROC (Area Under Receiver Operating Characteristic Curve), not a measure of predictive accuracy that indexes the difference between predicted and actual values like RMSE (Root Mean Squared Error) or MAPE (Mean Absolute Percentage Error). Some articles even erroneously try to explain the 0.95 as the odds ratio.

Just as the concept of significance has different meanings to statisticians and laypersons, AUROC as a measure of model accuracy does not mean the probability of Google’s NN predicting mortality accurately, as journalists/laypersons have taken it to mean. The ROC (see sample above) is a plot of a model’s False Positive Rate (i.e. predicting mortality where there is none) vs. the True Positive Rate (i.e. correctly predicting mortality). A larger area under the curve (AUROC) means the model produces fewer False Positives, not the certainty of mortality as journalists erroneously suggest.

The researchers themselves made no claim to soothsayer abilities; what they said in the paper was:

… (their) deep learning model would fire half the number of alerts of a traditional predictive model, resulting in many fewer false positives.

It’s an interesting article and a reminder of the importance of terminological precision (something I personally am not particularly good at).


How Perfmon Memory Counters Fit Together

Lonny Niederstadt takes us on a tour of how various Perfmon memory counters relate:

Wading through all of the SQL Server memory-related perfmon counters to understand how they relate to each other took me a really long time.  Time-series graphs that show the relationships help me tremendously, and when I started trying to account for SQL Server memory years ago I couldn’t find any.  So I started to blog some time-series graphs, under the theory that either my understanding was correct and my graphs would be helpful to someone… or they’d be wrong and someone would correct me.
Well… it’s been about 5 years and my graphs haven’t generated too much discussion, but they’ve really helped me 😀😀😀

Perfmon: SQL Server Database pages + Stolen pages + Free pages = Total pages
http://sql-sasquatch.blogspot.com/2013/09/perfmon-database-pages-stolen-pages.html

Working with SQL Server 2016 and some demanding ColumnStore batch mode workloads, I began to see suspicious numbers and graphs that didn’t make sense to me.  Today I got pretty close to figuring it out, so I wanted to share what I’ve learned.

The following graphs are from a 4×10 physical server running Windows and SQL Server.  Four sockets, 4 NUMA nodes.
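If you want to eyeball the same accounting identity without firing up perfmon, the counters are also exposed through a DMV. A quick sketch: the counter names below are from the Memory Manager object on SQL Server 2012 and later, where database cache + stolen + free should sum to total server memory (the Buffer Manager page counters in Lonny's older post were retired in favor of these).

-- Database Cache Memory (KB) + Stolen Server Memory (KB) + Free Memory (KB)
-- should equal Total Server Memory (KB).
SELECT counter_name, cntr_value AS kb
	FROM sys.dm_os_performance_counters
	WHERE object_name LIKE '%Memory Manager%'
		AND counter_name IN (
			'Database Cache Memory (KB)',
			'Stolen Server Memory (KB)',
			'Free Memory (KB)',
			'Total Server Memory (KB)' );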

For bonus points, Lonny tracks down a problem where expectations don’t meet reality.
