Press "Enter" to skip to content

Day: July 27, 2017

Extracting Phone Numbers With Apache Tika

Unni Mana knows how to get your digits:

Last time, I had difficulties detecting phone numbers from different types of documents. The challenge was that I had to use different parsers to parse and extract the phone numbers. For example, to extract phone numbers from a Word document, I had to use a library that supports Word. Also, I cannot use the same library or logic to parse a PDF file. Ultimately, I need to maintain different libraries for different document types, which, as you can image, can lead to many issues.

It looks like this covers international phone numbers as well.  Seems pretty interesting.

Comments closed

ConvertTo-HTML Tips

Jeff Hicks shows off some of the niceties of Powershell’s ConvertTo-HTML cmdlet:

This is because Convertto-Html, like Export-CSV and Export-Clixml, take the entire object. This is not just the default result you see on the screen. Remember, everything will be treated as a string. In my example, if I want a similar HTML file, I will have to recreate the output with Select-Object. This might require piping the original result to Get-Member to discover the “real” property names.

It won’t output beautiful results, but with the appropriate CSS theming, you can generate good internal reports.

Comments closed

Binary Collation Case-Sensitivity

Solomon Rutzky explains that binary collations are not really case-sensitive:

Quite often people will use, or will recommend using, a binary Collation (one ending in “_BIN” or “_BIN2“) when wanting to do a case-sensitive operation. While in many cases it appears to behave as expected, it is best to not use a binary Collation for this purpose. The problem with using binary Collations to achieve case-sensitivity is that they have no concept of linguistic rules and cannot equate different versions of characters that should be considered equal. And the reason why using a binary Collation often appears to work correctly is simply the result of working with a set of characters that has no accents or other versions. One such character set (a common one, hence the confusion), is US English (i.e. “A” – “Z” and “a” – “z”; values 65 – 90 and 97 – 122, respectively). However, there are a few areas where binary collations don’t behave as many (most, perhaps?) people expect them to.

Solomon gives examples of false negatives (such as the same character represented by different code point combinations) and also explains how sort order can change.

Comments closed

Sharepoint And MAXDOP=1

Daniel Glenn has a public service announcement on Sharepoint maximum degree of parallelism:

For my first install test, I was using a service account that did have administrative rights on SQL as well. I looked at the setting in SQL and SharePoint 2016 did change the Maximum Degree Of Parallelism setting to 1. So, the story is, as of now anyway (we are dealing with Preview software), SharePoint Server 2016 requires MAXDOP=1.

Kidding about Sharepoint aside, if you do have Sharepoint in your environment, it’s worth knowing how to configure it correctly.

Comments closed

Understanding ACID Properties

Randolph West explains the basics of ACID properties and gives a high-level description of how relational databases typically ensure these properties:

Relational database management systems (RDBMS) such as SQL Server, Oracle, MySQL, and PostgreSQL use transactions to allow concurrent users to select, insert, update, and delete data without affecting everyone else.

An RDBMS is considered ACID-compliant if it can guarantee data integrity during transactions under the following conditions:

Read on for more.

Comments closed

Tracking DDL Events

Kenneth Fisher has a simple database trigger to track certain data definition language events:

A couple of notes before testing the code. The event groups I’m using will pull CREATE, ALTER and DELETE events for those objects. For a more complete list of events (you might want to add service broker events for example) go here. Also I’m using ORIGINAL_LOGIN because it will return who made the change even if they are impersonating someone else.

For my test, I created a user that only has db_DDLADMIN on the database. That means it can make DDL changes but can’t insert, update, delete or even run a select against any table in the database. That’s why I grant INSERT to public for the logging table.

It’s a good way of knowing when unexpected changes happen, too.

Comments closed