Press "Enter" to skip to content

Day: November 20, 2017

Picking A Python IDE

Kevin Jacobs reviews a few Python IDEs from the perspective of a data scientist:

Ladies and gentlemen, this is one of the most perfect IDEs for editing your Python code! At least in my opinion. Jupyter Notebook is a web-based code editor and can quickly generate visualizations. You can mix code and text containing no, simple, or complex mathematics. One thing I am missing here is support for code completion, but there are tons of plugins available, so this should be no problem. It is also easy to turn your notebook into a presentation. For collaboration with non-technical teams, this is a great tool.

Conclusion: perfect Python IDE for data science! Less support for code inspection.

Click through for reviews of three IDEs.

Data Type Conversions In 4 Database Systems

Eleni Markou has samples for converting strings to dates, numerals, or currency in SQL Server, Postgres, Redshift, and BigQuery:

The TO_DATE function in PostgreSQL is used to convert strings into dates. Its syntax is TO_DATE(text, text) and the return type is a date.

In contrast with MS SQL Server, which has strictly specified date formats, Redshift can correctly interpret any format constructed using the patterns in the table found in the corresponding documentation.

When using TO_DATE(), one has to pay attention: even if an invalid date is passed, it will be converted into a nominally valid date without raising any error.
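A minimal sketch of that behavior (the exact handling of out-of-range dates varies by PostgreSQL version; recent releases validate more strictly and raise an error instead):

  -- Straightforward string-to-date conversion in PostgreSQL
  SELECT TO_DATE('20171120', 'YYYYMMDD');      -- 2017-11-20

  -- An out-of-range day such as February 30th may be silently rolled
  -- forward to a nominally valid date on older releases; newer versions
  -- report "date/time field value out of range" instead
  SELECT TO_DATE('2017-02-30', 'YYYY-MM-DD');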

There are a few other tricks in SQL Server for some of these (for example, on 2012 or newer, I’d use TRY_CONVERT rather than CONVERT).  That said, it’s a good overview of how to translate skills in one relational system to another.
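To illustrate the difference on SQL Server 2012 or newer, where TRY_CONVERT returns NULL for values that cannot be converted rather than raising an error:

  -- CONVERT fails outright on a bad date string
  SELECT CONVERT(date, '2017-02-30');      -- Msg 241: conversion failed

  -- TRY_CONVERT returns NULL instead, so bad values can be filtered or reported
  SELECT TRY_CONVERT(date, '2017-02-30');  -- NULL
  SELECT TRY_CONVERT(date, '2017-11-20');  -- 2017-11-20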

Handling Imbalanced Data

Tom Fawcett shows us how to handle a tricky classification problem:

The primary problem is that these classes are imbalanced: the red points are greatly outnumbered by the blue.

Research on imbalanced classes often considers imbalanced to mean a minority class of 10% to 20%. In reality, datasets can get far more imbalanced than this. Here are some examples:

  1. About 2% of credit card accounts are defrauded per year. (Most fraud detection domains are heavily imbalanced.)
  2. Medical screening for a condition is usually performed on a large population of people without the condition, to detect a small minority with it (e.g., HIV prevalence in the USA is ~0.4%).
  3. Disk drive failure rates are approximately 1% per year.
  4. The conversion rate of online ads has been estimated to lie between 10⁻³ and 10⁻⁶.
  5. Factory production defect rates typically run about 0.1%.

Many of these domains are imbalanced because they are what I call needle-in-a-haystack problems, where machine learning classifiers are used to sort through huge populations of negative (uninteresting) cases to find the small number of positive (interesting, alarm-worthy) cases.

Read on for some good advice on how to handle imbalanced data.

Using DMVs To Plan Out Your Indexes

Eric Blinn explains how to use two particular DMVs to see which index changes you might want to make:

Missing Indexes

This group of DMVs records every scan and large key lookup.  When the optimizer declares that there isn't an index to support a query request, it generally performs a scan.  When this happens, a row is created in the missing index DMV showing the table and columns that were scanned.  If that exact same index is requested a second time, by the same query or another similar query, then the counters are increased by 1.  That value will continue to grow if the workload continues to call for the index that doesn't exist.  It also records the cost of the query with the table scan and a suspected percentage improvement if only that missing index had existed.  The below query calculates those values together to determine a value number.

Click through for sample scripts for this and the index usage stats DMV.  The tricky part is to synthesize the results of these DMVs into the minimum number of viable indexes.  Unlike the optimizer—which is only concerned with making the particular query that ran faster—you have knowledge of all of the queries in play and can find commonalities.
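For reference, here is a rough sketch of the kind of query Eric describes, joining the three missing index DMVs and multiplying cost, expected improvement, and request count into a single ranking value (the weighting here is illustrative, not his exact script):

  SELECT d.statement AS table_name,
         d.equality_columns,
         d.inequality_columns,
         d.included_columns,
         s.user_seeks + s.user_scans AS times_requested,
         s.avg_total_user_cost,
         s.avg_user_impact,
         -- illustrative "value" score: query cost * expected % improvement * request count
         s.avg_total_user_cost * (s.avg_user_impact / 100.0)
           * (s.user_seeks + s.user_scans) AS index_value
  FROM sys.dm_db_missing_index_details AS d
  JOIN sys.dm_db_missing_index_groups AS g
      ON g.index_handle = d.index_handle
  JOIN sys.dm_db_missing_index_group_stats AS s
      ON s.group_handle = g.index_group_handle
  ORDER BY index_value DESC;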

What Update Locks Do

Guy Glantser explains the process around updating data in SQL Server, particularly the benefit of having update locks:

In order to update a row, SQL Server first needs to find that row, and only then can it perform the update. So every UPDATE operation is actually split into two phases – first read, and then write. During the read phase, the resource is locked for read, and then it is converted to a lock for write. This is better than just locking for write all the way from the beginning, because during the read phase, other sessions might also need to read the resource, and there is no reason to block them until we start the write phase. We already know that the SHARED lock is used for read operations (phase 1), and that the EXCLUSIVE lock is used for write operations (phase 2). So what is the UPDATE lock used for?

If we used a SHARED lock for the duration of the read phase, then we might run into a deadlock when multiple sessions run the same UPDATE statement concurrently.

Read on for more details.
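One place the same mechanism shows up explicitly is the UPDLOCK hint: when you read a row you intend to change, taking an update (U) lock during the read phase serializes concurrent writers instead of letting two sessions hold shared locks and deadlock on conversion. A small sketch (dbo.Accounts and its columns are illustrative names, not from Guy's post):

  BEGIN TRANSACTION;

  -- Read phase: take an update (U) lock rather than a shared (S) lock,
  -- because we intend to modify this row later in the transaction
  SELECT Balance
  FROM dbo.Accounts WITH (UPDLOCK)
  WHERE AccountId = 42;

  -- Write phase: the U lock is converted to an exclusive (X) lock
  UPDATE dbo.Accounts
  SET Balance = Balance - 100
  WHERE AccountId = 42;

  COMMIT TRANSACTION;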

More On Microsoft SQL Operations Studio

Dan Guzman shares some thoughts on Microsoft SQL Operations Studio:

Microsoft made the new cross-platform SQL Operations Studio (SOS) tool available on Github this week as a free open-source project. This SOS preview allows one to develop and manage SQL Server and Azure SQL Database from Windows, Linux, and macOS. The current preview can be downloaded from the SOS portal page, which also contains links to impressive quick start guides, how-tos, and tutorials. I encourage you to try out the preview and improve it by reporting issues and offering suggestions.

If you are a developer, consider contributing to this project on Github. SOS is built on the Electron framework, which leverages JavaScript, HTML, and Node.js technologies to build rich cross-platform desktop applications. This is the same stack that the popular VS Code IDE employs so it’s not surprising SOS has a similar look and feel.

Click through for Dan’s thoughts and also a link to try it yourself.

Using The Restore-DbaDatabase Pipeline

Stuart Moore describes the updated Restore-DbaDatabase cmdlet:

The biggest change is that Restore-DbaDatabase is now a wrapper around 5 public functions. The 5 functions are:

  • Get-DbaBackupInformation
  • Select-DbaBackupInformation
  • Format-DbaBackupInformation
  • Test-DbaBackupInformation
  • Invoke-DbaAdvancedRestore

These can be used individually for advanced restore scenarios; I'll go through some examples in a later post.

Stuart then provides additional information at the various steps, explaining at a high level how things work.
