Improving Solr Performance

Michael Sun has some tips to improve performance of Solr operations, focusing on memory tuning but including a few other tips as well:

For time series applications, it’s very common to have queries in the following pattern

q=*:*&fq=[NOW-3DAYS TO NOW]

However, this is not a good practice from memory perspective. Under the hood, Solr converts ‘NOW’ to a specific timestamp, which is the time when the query hits Solr. Therefore, two consecutive queries with the same field query fq=[NOW-3DAYS TO NOW] are considered different queries once ‘NOW’ is replaced by the two different timestamp. As a result, both of these queries would hit disk and can’t take advantage of caches.

In most of use cases, missing data of last minute is acceptable. Therefore, try to query in the following way if your business logic allows.

q=*:*&fq=[NOW/MIN-3DAYS TO NOW/MIN]

If you’re using Solr for full text search, this is rather useful information.

JSON In Powershell

Adam Bertram shows an easy way of dealing with JSON data inside Powershell:

As the world continually becomes “eaten by software,” more and more services are being replaced by software. IT pros have most likely seen this in the form of software-defined everything. One of the premier components of this focus on software and with the continuing adoption of DevOps is application programming interfaces (APIs). All of these services needs to talk together and must provide a way for programs and users to interact with them. This is where APIs come in handy. But, what does this have to do with PowerShell and JSON, you ask?

APIs, more specifically REST APIs, return data when queried. This data is typically in the JSON format. JSON is a way of structuring data that makes it easy for software to consume. When working with PowerShell, Microsoft has provided some helpful tools to work with JSON called the ConvertTo-Json and ConvertFrom-Json commands. These commands allow you to quickly work with REST APIs or any other service that returns or accepts JSON as an input.

Read on for more details on how to use these commands.

Fraud Detection With Python

Kevin Jacobs has a walkthrough of how to use Pandas and scikit-learn to perform fraud detection against a sample set of credit card transactions:

Apparently, the data consists of 28 variables (V1, …, V28), an “Amount” field a “Class” field and the “Time” field. We do not know the exact meanings of the variables (due to privacy concerns). The Class field takes values 0 (when the transaction is not fraudulent) and value 1 (when a transaction is fraudulent). The data is unbalanced: the number of non-fraudulent transactions (where Class equals 0) is way more than the number of fraudulent transactions (where Class equals 1). Furthermore, there is a Time field. Further inspection shows that these are integers, starting from 0.

There is a small trick for getting more information than only the raw records. We can use the following code:

print(df.describe())

This code will give a statistically summary of all the columns. It shows for example that the Amount field ranges between 0.00 and 25691.16. Thus, there are no negative transactions in the data.

The Kaggle competition data set is available, so you can follow along.

Forcing 0 Intercept Inflates R-squared In R

John Mount has an informative post on how you can trick yourself when running linear regression models in R and forcing the y intercept to be 0:

So far so good. Let’s now remove the “intercept term” by adding the “0+” from the fitting command.

m2 <- lm(y~0+x, data=d)t(broom::glance(m2))
## [,1]
## r.squared 7.524811e-01
## adj.r.squared 7.474297e-01
## sigma 3.028515e-01
## statistic 1.489647e+02
## p.value 1.935559e-30
## df 2.000000e+00
## logLik -2.143244e+01
## AIC 4.886488e+01
## BIC 5.668039e+01
## deviance 8.988464e+00
## df.residual 9.800000e+01
d$pred2 <- predict(m2, newdata = d)

Uh oh. That appeared to vastly improve the reported R-squared and the significance (“p.value“)!

Read on to learn why this happens and how you can prevent this from tricking you in the future.

Scripting Azure Resources With ARM Templates

Melissa Coates has a detailed explanation of how to script out the creation and configuration of services in Azure using Azure Resource Manager (ARM) templates:

In many cases, you can easily provision resources in the web-based Azure portal. If you’re never going to repeat the deployment process, then by all means use the interface in the Azure portal. It doesn’t always make sense to invest the time in automated deployments. However, ARM templates are really helpful if you’re interested in achieving repeatability, improving accuracy, achieving consistency between environments, and reducing manual effort.

Use ARM templates if you intend to:

  • Include the configuration of Azure resources in source control (“Infrastructure as Code”), and/or

  • Repeat the deployment process numerous times, and/or

  • Automate deployments, and/or

  • Employ continuous integration techniques, and/or

  • Utilize DevOps principles and practices, and/or

  • Repeatedly utilize testing infrastructure then de-provision it when finished

Melissa walks through an example of deploying a website with backing database, along with various configuration changes.

Linked Servers And The Kerberos Double-Hop Problem

Jana Sattainathan shows how to set up Kerberos pass-through when dealing with linked servers:

Let us say you have SQLServer1 and you want to setup a linked server to SQLServer2 using “pass-through authentication”, a double-hop happens as explain in the article below. Basically, the first hop is when the user authenticates to SQLServer1 and the second hop when that gets passed on from SQLServer1 to SQLServer2.

The below article is a must-read before you proceed:

The three nodes involved in the double-hop as illustrated in the example are

  1. Client – The client PC from which the user is initiating connection to SQLServer1

  2. Middle server – SQLServer1

  3. Second server – SQLServer2

Dealing with the double-hop problem is far trickier than it should be; if you’ve had to deal with this, I recommend Jana’s guide.

Good Indexes Make Good Queries

Thomas Rushton has an example of a positive indexing experience:

100k runs of the query in a ten minute interval? yeowch. Yeah, this should be optimised if possible. The primary wait type was CPU – indicating that the data was all in RAM, but the CPU was having to schlep through the entire table to find what it needed. Or to find that it didn’t need anything. Or something.

The interesting story buried in here is that none of the tooling Thomas used indicated that things could be improved with a better index, and yet there was a tremendous opportunity here based on the graphics at the end.

Leave ANSI PADDING On

Kenneth Fisher explains what the ANSI PADDING setting does:

ON is the default and is what you would expect. Trailing spaces are saved in VARCHAR and in CHAR additional spaces added to fill the entire space. When ANSI_PADDING is off then additional spaces are not saved .. unless the column is CHAR AND NOT NULL.

So there’s the first reason to not turn ANSI_PADDING off. Most people expect the ON results and the OFF results can be .. let’s just say confusing.

Click through for more details, including how painful it is to change the setting on a column after the fact.

Not All Defaults Are Good

Nate Johnson rails against bad defaults in SQL Server:

Your servers have many-core CPUs, right?  And you want SQL to utilize those cores to the best of its ability, distributing the many users’ workloads fairly amongst them, yes?  Damn right, you paid $3k or more per core in freaking licensing costs!  “OK”, says SQL Server, “I’ll use all available CPUs for any query with a ‘cost’ over ‘5’“.  (To give context here, in case you’re not aware, ‘5’ is a LOW number; most OLTP workload queries are in the double to triple digits).  “But wait!”, you protest, “I have more than 1 user, obviously, and I don’t want their horrible queries bringing all CPUs to their knees and forcing the 50 other user queries to wait their turn!”

Nate has a few recommendations here, as well as a picture of kittens.

Environmental Factors And SQL Server

Jeff Mlakar has a set of tips and tricks around SQL Server performance:

Performance problems for a SQL Server based application are likely to be caused by environmental factors and not buggy code.

Whether it is a configuration you can change in SQL Server, Windows Server, VMware, or the network it is likely the first course of action is to perform a quick assessment of the environment. This is where understanding the various configurations and best practices are key. Knowing what to look for can save tons of time.

A mistake I often see is a performance issue is passed off to someone else (more senior) and that engineer assumes a lot of things without checking. People are going to relay the problem as they see it – not as it actually is. This leads to skipping over some elementary checks which can save time and frustration from tracking down imaginary bugs.

Start troubleshooting with a quick environmental check.

There are quite a few checks here.

Categories

June 2017
MTWTFSS
« May  
 1234
567891011
12131415161718
19202122232425
2627282930