Securing KSQL

Yeva Byzek shows the methods available to secure a Kafka Streams application:

To connect to a secured Kafka cluster, Kafka client applications need to provide their security credentials. In the same way, we configure KSQL such that the KSQL servers are authenticated and authorized, and data communication is encrypted when communicating with the Kafka cluster. We can configure KSQL for:

Read the whole thing if you’re thinking about using Kafka Streams.

Deploying Jupyter Notebooks

Teja Srivastasa has an example of deploying a Jupyter notebook for production use on AWS:

No one can deny how large the online support community for data science is. Today, it’s possible to teach yourself Python and other programming languages in a matter of weeks. And if you’re ever in doubt, there’s a StackOverflow thread or something similar waiting to give you the perfect piece of code to help you.

But when it came to pushing it to production, we found very little documentation online. Most data scientists seem to work on Python notebooks in a silo. They process large volumes of data and analyze it — but within the confines of Jupyter Notebooks. And most of the resources we’ve found while growing as data scientists revolve around Jupyter Notebooks.

Another option might be to use JupyterHub.

Reviewing The Team Data Science Process

I am starting a new series on launching a data science project, and my presentation quickly veers into a pessimistic place:

The concept of “clean” data is appealing to us—I have a talk on the topic and spend more time than I’m willing to admit trying to clean up data.  But the truth is that, in a real-world production scenario, we will never have truly clean data.  Whenever there is the possibility of human interaction, there is the chance of mistyping, misunderstanding, or misclicking, each of which can introduce invalid results.  Sometimes we can see these results—like if we allow free-form fields and let people type in whatever they desire—but other times, the error is a bit more pernicious, like an extra 0 at the end of a line or a 10-key operator striking 4 instead of 7.

Even with fully automated processes, we still run the risk of dirty data:  sensors have error ranges, packets can get dropped or sent out of order, and services fail for a variety of reasons.  Each of these can negatively impact your data, leaving you with invalid entries.

Read on for a few more adages which shape the way we work on projects, followed by an overview of the Microsoft Team Data Science Process.

Enabling Optimizer Fixes In SQL Server

Monica Rathbun explains that just upgrading a SQL Server database doesn’t enable optimizer changes:

When applying a new SQL Server cumulative update, hot fix, or upgrade SQL Server doesn’t always apply all the fixes in the patch. When you upgrade the database engine in-place, databases you had already stay at their pre-upgrade compatibility level, which means they run under the older set of optimizer rules. Additionally, many optimizer fixes are not turned on. The reason for this is that while they may improve overall query performance, they may have negative impact to some queries. Microsoft actively avoids making breaking changes to its software.

To avoid any negative performance impacts, Microsoft has hidden optimizer fixes behind a trace flag, giving admins the option to enable or disable the updated fixes. To take advantage of optimizer fixes or improvements you would have enable trace flag 4199 after applying each hot fix or update or set it up as a startup parameter. Did you know this? This was something I learned while working with an existing system, years into my career. I honestly assumed it would just apply any applicable changes that were in the patch to my system. Trace flag 4199 was introduced in the SQL Server 2005-era. In SQL Server 2014, when Microsoft made changes to the cardinality estimator they protected the changes with trace flags as well, giving you the option to run under compatibility level 120 and not have the cardinality estimator changes in effect.

Things changed starting with SQL Server 2016.

Click through to see how SQL Server 2016 made it a bit easier.

Log Shipping Tests With dbachecks

Sander Stad has a bonus post in his log shipping series:

We want everyone to know about this module. Chrissy LeMaire reached out to me and asked if I could write some tests for the log shipping part and I did.

Because I wrote the log shipping commands for dbatools I was excited about creating a test that could be implemented into this module for everyone to use.

That test is also quite easy to use, as Sander demonstrates.

Row Goals And Semi Joins

Paul White continues his row goals series:

The remaining physical join type is nested loops, which comes in two flavours: regular (uncorrelated) nested loops and apply nested loops (sometimes also referred to as a correlated or lateral join).

Regular nested loops join is similar to hash and merge join in that the join predicate is evaluated at the join. As before, this means there is no value in setting a row goal on either input. The left (upper) input will always be fully consumed eventually, and the inner input has no way to determine which row(s) should be prioritized, since we cannot know if a row will join or not until the predicate is tested at the join.

By contrast, an apply nested loops join has one or more outer references (correlated parameters) at the join, with the join predicate pushed down the inner (lower) side of the join. This creates an opportunity for the useful application of a row goal. Recall that a semi join only requires us to check for the existence of a row on join input B that matches the current row on join input A (thinking just about nested loops join strategies now).

In other words, on each iteration of an apply, we can stop looking at input B as soon as the first match is found, using the pushed-down join predicate. This is exactly the sort of thing a row goal is good for: generating part of a plan optimized to return the first n matching rows quickly (where n = 1 here).

This has the depth and quality that you naturally expect from Paul, making it an immediate read.

Digging Into dbachecks

Rob Sewell walks us through dbachecks configuration:

So I can import this configuration and run my checks with it any time I like. This means that I can create many different test configurations for my many different environment or estate configurations.

Yes, I know “good/best practice” says we should use the same configuration for all of our instances but we know that isn’t true. We have instances that were set up 15 years ago that are still in production. We have instances from the companies our organisation has bought over the years that were set up by system administrators. We have instances that were set up by shadow IT and now we have to support but cant change.

As well as those though, we also have different environments. Our development or test environment will have different requirements to our production environments.

This is a good, code-filled demonstration with a few videos as well.  If you want to get started with dbachecks, this is a good place to begin.

Loading Data From Excel With Biml

Ben Weissman loads an Excel file with Biml:

Did you know, that you could call GetDatabaseSchema on Excel files? You can!

Just define an ExcelConnection first:

01_Environment.biml
1
2
3
4
5
6
7
8
9
10
11
12
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Connections>
        <ExcelConnection Name="MyExcel" ConnectionString="Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Flatfiles\XLS\MyExcel.xlsx;Extended Properties=&quot;Excel 12.0 XML;HDR=YES&quot;;" />
        <OleDbConnection Name="Target" ConnectionString="Data Source=localhost;initial catalog=MySimpleBiml_Destination;provider=SQLNCLI11;integrated security=SSPI"></OleDbConnection>
    </Connections>
    <Databases>
        <Database Name="MySimpleBiml_Destination" ConnectionName="Target"></Database>
    </Databases>
    <Schemas>
        <Schema Name="dbo" DatabaseName="MySimpleBiml_Destination"></Schema>
    </Schemas>
</Biml>

You can call GetDatabaseSchema on that connection and loop through the tables just like any regular database.

Click through to see what to do with this connection.

Gotchas When Indexing Partitioned Tables

Andrew Pruski gives us a couple of considerations when creating indexes on partitioned tables in SQL Server:

Looking at that data page, we can see that SQL has added a UNIQUIFIER column. Now this is standard SQL behaviour, SQL does this to all non-unique clustered indexes whether they are on a partitioned table or not.

But also look at the CreatedDate column. It’s after the ID column on the page. If this was a non-partitioned table, we would see that after ColA & ColB (the order the columns are on the table). This has happened because SQL has implicitly added the partitioning key into the index definition, which has changed the physical order of the data on the page.

Read the whole thing.

Web Analytics With R

Maelle Salmon performs some analysis on the Locke Data blog:

Often, the URL of a blog post can be guessed based on its title, e.g. this one can be read here. But even if the transition from the Markdown file information to an URL is logical, it was best to get URLs from the in situ blog posts, and then join them to the blog post information collected previously, since some special characters got special treatment that I could not fully understand by looking at blogdown source code.

I first extracted all posts URLs from the website map.

Check it out.

Categories

February 2018
MTWTFSS
« Jan  
 1234
567891011
12131415161718
19202122232425
262728