
Month: February 2018

Hockey Analytics With Power BI

Stacia Varga shows off some of Power BI’s filtering and data processing capabilities by looking at hockey stats:

Right now, the data is not ideal for analysis. Keeping in mind how I want to use the data, I need to perform some cleansing and transformation tasks. Any time I work with a new data source, I look to see if I need to do any of the following:

  • Remove unneeded rows or columns. Power BI stores all my data in memory when I have the PBIX file open. For optimal performance when it comes time to calculate something in a report and to minimize the overhead required for my reports, I need to get rid of anything I don’t need.

  • Expand lists or records. Whether I need to perform this step depends on my data source. I’ve noticed it more commonly in JSON data sources whenever there are multiple levels of nesting.

  • Rename columns. I prefer column names to be as short, sweet, and user friendly as possible. Short and sweet because the length of the name affects the width of the column in a report, and it drives me crazy when the name is ten miles long, but the value is an inch long—relatively speaking. User friendly is important because a report is pretty much useless if no one understands what a column value represents without consulting a data dictionary.

  • Rearrange columns. This step is mostly for me to look at things logically in the query editor. When the model is built, the fields in the model are listed alphabetically.

  • Set data types. The model uses data types to determine how to display data or how to use the data in calculations. Therefore, it’s important to get the data types set correctly in the Query Editor.

It’s a fun topic to use for learning about Power BI…says the guy wearing a Blue Jackets shirt right now…


Benefits Of Explicit Transactions

Kendra Little talks about explicit transactions and when they’re useful for single-statement operations:

If you do not enable implicit transactions, and you don’t start an explicit transaction, you are in the default “autocommit” mode.

This mode means that individual statements are automatically committed or rolled back as whole units. You can’t end up in a place where only half your statement is committed.

Our question is really about whether there are unseen problems with this default mode of autocommit for single-statement units of work.

By force of habit, I wrap data modification operations in an explicit transaction. That lets me test my changes before committing, and the moment you’re most likely to spot an error seems to be right after hitting F5.
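
As a minimal sketch of that habit, with table and column names made up purely for illustration:

    -- Wrap the modification in an explicit transaction so the result can be
    -- inspected before it becomes permanent.
    BEGIN TRANSACTION;

    UPDATE dbo.Orders
    SET Status = 'Cancelled'
    WHERE OrderDate < '2017-01-01';

    -- Check the rows affected (or run a quick SELECT), then decide:
    -- COMMIT TRANSACTION;    -- keep the change
    -- ROLLBACK TRANSACTION;  -- undo it if something looks wrong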


Read-Only Databases And Single-User Mode

David Fowler notes an old bug in SQL Server 2012 and 2014 which bit him recently:

Here’s a strange one that I’ve recently come across. I had a customer report that their log shipping restore jobs were chock-a-block with errors. Now, the logs seem to have been restoring just fine, but before every restore attempt, the job reports this error:

Error: Failed to update database “DATABASE NAME” because the database is read-only.

Unfortunately I haven’t got any direct access to the server, but their log shipping is set up to disconnect users before the restore and to leave the database in standby afterwards. After a bit of to-ing and fro-ing, I asked the customer to send me a trace file covering the period that the restore job ran.

Read on for the details and keep those servers patched.
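
For context, here is a rough sketch of the moving pieces in a log shipping restore that disconnects users and leaves the database in standby. The database name and file paths are hypothetical, and the actual restore job runs its own tooling rather than these exact statements:

    -- Disconnect users ahead of the restore.
    ALTER DATABASE [LogShippedDB] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

    -- Apply the next log backup, leaving the database readable (standby).
    RESTORE LOG [LogShippedDB]
    FROM DISK = N'\\backupserver\logs\LogShippedDB.trn'
    WITH STANDBY = N'D:\SQLData\LogShippedDB_undo.tuf';

    -- Let read-only users back in.
    ALTER DATABASE [LogShippedDB] SET MULTI_USER;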


Classifying Data In SSMS

Steve Jones gives SQL Server Management Studio 17.5 a spin and tries to classify some data:

There’s a getting started link, which takes me to the SQL Server Security Blog. I suspect that’s an incorrect link. I think it should go here: SQL Data Discovery and Classification.

Below this, I see a list of the recommendations. This has grabbed tables that appear to contain some data that might be sensitive and require classification. One of the tenets of the GDPR is that you know your data. You aren’t allowed to figure this out later; rather, you must proactively know what data you are collecting and processing.

It’s a good overview of the feature. As Steve mentions, I appreciate that this data is stored as extended properties: that way, third-party and custom-built tools can make use of it. You can also script the properties out for migration.
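
If you want to see what the feature stored, a query along these lines pulls the classification metadata back out of the extended properties. The property-name filter reflects the names the SSMS feature used at the time (sys_information_type* and sys_sensitivity_label*); treat them as an assumption and adjust if your version differs:

    SELECT
        SCHEMA_NAME(o.schema_id) AS SchemaName,
        o.name                   AS TableName,
        c.name                   AS ColumnName,
        ep.name                  AS PropertyName,
        ep.value                 AS PropertyValue
    FROM sys.extended_properties AS ep
        JOIN sys.objects AS o
            ON ep.major_id = o.object_id
        JOIN sys.columns AS c
            ON ep.major_id = c.object_id
           AND ep.minor_id = c.column_id
    WHERE ep.class = 1  -- object/column-level properties
      AND (ep.name LIKE 'sys_information_type%'
        OR ep.name LIKE 'sys_sensitivity_label%');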


Discovering Composite Keys

John Morehouse shares some good information on composite keys, including a few scripts:

As I started to work on this, my first thought was that it would be helpful to know how many tables had a composite primary key.  This would give me an idea of how many tables I was dealing with.  Thankfully, SQL Server exposes this information through system DMVs (dynamic management views) along with the COL_NAME function.

Note: the COL_NAME function will only work with SQL Server 2008 and newer.  

All of this time, I’d never known about COL_NAME.
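
This isn’t John’s exact script, but a query in the same spirit, using the catalog views plus COL_NAME, lists every primary key that has more than one key column:

    SELECT
        OBJECT_NAME(i.object_id)             AS TableName,
        i.name                               AS PrimaryKeyName,
        COL_NAME(ic.object_id, ic.column_id) AS KeyColumn,
        ic.key_ordinal                       AS KeyPosition
    FROM sys.indexes AS i
        JOIN sys.index_columns AS ic
            ON i.object_id = ic.object_id
           AND i.index_id = ic.index_id
    WHERE i.is_primary_key = 1
      AND (
              -- Keep only primary keys with two or more key columns.
              SELECT COUNT(*)
              FROM sys.index_columns AS k
              WHERE k.object_id = i.object_id
                AND k.index_id = i.index_id
          ) > 1
    ORDER BY TableName, ic.key_ordinal;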


Securing KSQL

Yeva Byzek shows the methods available to secure a Kafka Streams application:

To connect to a secured Kafka cluster, Kafka client applications need to provide their security credentials. In the same way, we configure KSQL such that the KSQL servers are authenticated and authorized, and data communication is encrypted when communicating with the Kafka cluster. We can configure KSQL for:

Read the whole thing if you’re thinking about using Kafka Streams.


Deploying Jupyter Notebooks

Teja Srivastasa has an example of deploying a Jupyter notebook for production use on AWS:

No one can deny how large the online support community for data science is. Today, it’s possible to teach yourself Python and other programming languages in a matter of weeks. And if you’re ever in doubt, there’s a StackOverflow thread or something similar waiting to give you the perfect piece of code to help you.

But when it came to pushing it to production, we found very little documentation online. Most data scientists seem to work on Python notebooks in a silo. They process large volumes of data and analyze it — but within the confines of Jupyter Notebooks. And most of the resources we’ve found while growing as data scientists revolve around Jupyter Notebooks.

Another option might be to use JupyterHub.


Reviewing The Team Data Science Process

I am starting a new series on launching a data science project, and my presentation quickly veers into a pessimistic place:

The concept of “clean” data is appealing to us—I have a talk on the topic and spend more time than I’m willing to admit trying to clean up data.  But the truth is that, in a real-world production scenario, we will never have truly clean data.  Whenever there is the possibility of human interaction, there is the chance of mistyping, misunderstanding, or misclicking, each of which can introduce invalid results.  Sometimes we can see these results—like if we allow free-form fields and let people type in whatever they desire—but other times, the error is a bit more pernicious, like an extra 0 at the end of a line or a 10-key operator striking 4 instead of 7.

Even with fully automated processes, we still run the risk of dirty data:  sensors have error ranges, packets can get dropped or sent out of order, and services fail for a variety of reasons.  Each of these can negatively impact your data, leaving you with invalid entries.

Read on for a few more adages which shape the way we work on projects, followed by an overview of the Microsoft Team Data Science Process.


Enabling Optimizer Fixes In SQL Server

Monica Rathbun explains that just upgrading a SQL Server database doesn’t enable optimizer changes:

When applying a new SQL Server cumulative update, hotfix, or upgrade, SQL Server doesn’t always apply all the fixes in the patch. When you upgrade the database engine in place, databases you already had stay at their pre-upgrade compatibility level, which means they run under the older set of optimizer rules. Additionally, many optimizer fixes are not turned on. The reason for this is that while they may improve overall query performance, they may have a negative impact on some queries. Microsoft actively avoids making breaking changes to its software.

To avoid any negative performance impacts, Microsoft has hidden optimizer fixes behind a trace flag, giving admins the option to enable or disable the updated fixes. To take advantage of optimizer fixes or improvements, you have to enable trace flag 4199 after applying each hotfix or update, or set it up as a startup parameter. Did you know this? This was something I learned while working with an existing system, years into my career. I honestly assumed SQL Server would just apply any applicable changes that were in the patch to my system. Trace flag 4199 was introduced in the SQL Server 2005 era. In SQL Server 2014, when Microsoft made changes to the cardinality estimator, they protected those changes with trace flags as well, giving you the option to run under compatibility level 120 without the cardinality estimator changes in effect.

Things changed starting with SQL Server 2016.

Click through to see how SQL Server 2016 made it a bit easier.
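
A quick sketch of the two approaches Monica describes (not copied from her post; try it on a test instance first):

    -- Pre-2016: turn on optimizer hotfixes instance-wide with trace flag 4199,
    -- or add -T4199 as a startup parameter so it survives restarts.
    DBCC TRACEON (4199, -1);

    -- SQL Server 2016 and later: optimizer hotfixes ride along with the
    -- database compatibility level, and can also be enabled per database
    -- without the trace flag.
    ALTER DATABASE SCOPED CONFIGURATION SET QUERY_OPTIMIZER_HOTFIXES = ON;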


Log Shipping Tests With dbachecks

Sander Stad has a bonus post in his log shipping series:

We want everyone to know about this module. Chrissy LeMaire reached out to me and asked if I could write some tests for the log shipping part, and I did.

Because I wrote the log shipping commands for dbatools, I was excited about creating a test that could be included in this module for everyone to use.

That test is also quite easy to use, as Sander demonstrates.
