Analyzing Web Server Logs With Spark

Fisseha Berhane uses web server log analysis to contrast three methods of using Spark:

This is the third tutorial in the Spark RDDs vs DataFrames vs SparkSQL blog post series. The first one is available here. In the first part, we saw how to retrieve, sort and filter data using Spark RDDs, DataFrames and SparkSQL. In the second part (here), we saw how to work with multiple tables in Spark the RDD way, the DataFrame way and with SparkSQL. In this third part of the blog post series, we will perform web server log analysis using real-world text-based production logs. Log data can be used for monitoring servers, improving business and customer intelligence, building recommendation systems, fraud detection, and much more. Server log analysis is a good use case for Spark. It’s a very large, common data source and contains a rich set of information.

This tutorial shows you three different ways to solve several problems, including file sizes, counts by response code, and top endpoints.
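To give a feel for the kinds of questions involved, here is a minimal sketch in plain Python rather than Spark. It assumes the logs follow the Apache Common Log Format, which may not match the tutorial's data exactly, and the sample lines, `parse_line`, and `summarize` are hypothetical names made up for illustration. The same three aggregations (response sizes, counts by response code, top endpoints) are what the tutorial computes with RDDs, DataFrames, and SparkSQL.

```python
import re
from collections import Counter

# Assumed layout: Apache Common Log Format. The tutorial's logs may differ.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<endpoint>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)

# Made-up sample lines, for illustration only.
SAMPLE_LINES = [
    'host1.example.com - - [01/Aug/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1000',
    'host2.example.com - - [01/Aug/1995:00:00:02 -0400] "GET /index.html HTTP/1.0" 200 3000',
    'host1.example.com - - [01/Aug/1995:00:00:03 -0400] "GET /missing.gif HTTP/1.0" 404 -',
    'not a log line',
]


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not parse."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no body was returned; treat it as zero bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def summarize(lines, top_n=3):
    """Compute size statistics, counts by response code, and top endpoints."""
    records = [r for r in map(parse_line, lines) if r is not None]
    if not records:
        return None
    sizes = [r["size"] for r in records]
    return {
        "size_min": min(sizes),
        "size_max": max(sizes),
        "size_avg": sum(sizes) / len(sizes),
        "status_counts": Counter(r["status"] for r in records),
        "top_endpoints": Counter(r["endpoint"] for r in records).most_common(top_n),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_LINES))
```

Unparseable lines are dropped rather than raising an error, which is a design choice for this sketch; with real logs you would likely want to count and inspect them.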

File Growth Trace Flags

Jason Brimhall investigates trace flags 1117 and 1118 and how they work in SQL Server 2016 versus older versions:

With the release of SQL Server 2016, these trace flags were rumored to be a thing of the past and hence completely unnecessary. That is partially true. The trace flag is unneeded and SQL 2016 does have some different behaviors, but does that mean you have to do nothing to get the benefits of these Trace Flags as implemented in 2016?

As it turns out, these trace flags no longer do what they did in previous editions. SQL Server now pretty much has it baked into the product. Buuuuut, do you have to do anything slightly different to make it work? This was something I came across while reading this post and wanted to double check everything. After all, I was also under the belief that it was automatically enabled. So let’s create a script that checks these things for me.

Click through for the script and a summary of his findings.

Choose Your Own Regression Adventure

Jim Frost explains when you might use different types of regression analysis:

Regression analysis mathematically describes the relationship between a set of independent variables and a dependent variable. There are numerous types of regression models that you can use. This choice often depends on the kind of data you have for the dependent variable and the type of model that provides the best fit. In this post, I cover the more common types of regression analyses and how to decide which one is right for your data.

I’ll provide an overview along with information to help you choose. I organize the types of regression by the different kinds of dependent variable. If you’re not sure which procedure to use, determine which type of dependent variable you have, and then focus on that section in this post. This process should help narrow the choices! I’ll cover regression models that are appropriate for dependent variables that measure continuous, categorical, and count data.

It’s a good overview of several techniques.

Failure To Connect With A SQL Login

Bert Wagner hits on the most common reason why you might fail to connect with a SQL authentication login:

I thought it would be best to start with a clean slate so I created a new SQL login and database user so that I could definitively figure out which permissions are needed.

Normally I use Windows Authentication for my logins, but this time I thought “since I’m getting crazy learning new things, let me try creating a SQL Login instead.”

After I created my login, I decided to test connecting to my server before digging into the permissions. Result?

After the fifth or sixth time it happens to you, you start making that the first thing you check.

Fun With Undocumented Trace Flags

Joe Obbish has a list of 45 undocumented trace flags:

Below is a list of trace flags which, as far as I can tell, have never been publicly documented. I did not fully investigate many of them and many of the descriptions are just guesses. I make no guarantees and none of these should be used in production. All tests were performed on SQL Server 2017 CU2 with trace flags enabled at the global level.

This is combining a bit of database archaeology and database anthropology.

Blue-Green Deployments

Michael J Swart has started a new series on online deployments and covers the blue-green deployment architecture:

When using the Blue-Green method, basically nothing gets changed. Instead everything gets replaced. We start by setting up a new environment – the green environment – and then cut over to it when we’re ready. Once we cut over to the new environment successfully, we’re free to remove the original blue environment. The technique is all about replacing components rather than altering components.

Check it out for a great explanation, not only of how true blue-green doesn’t jibe well with databases, but how to get it to work well enough.
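The replace-then-cut-over idea in the excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Environment`, `Router`, and the health check are hypothetical names invented here, not anything from Michael's post, and the sketch ignores the database state that makes true blue-green hard.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Environment:
    """Hypothetical stand-in for a complete deployed environment."""
    name: str
    version: str


class Router:
    """Hypothetical traffic switch: one live environment, one optional staged one."""

    def __init__(self, live: Environment) -> None:
        self.live = live
        self.staged: Optional[Environment] = None

    def stage(self, env: Environment) -> None:
        # Build the new (green) environment alongside the live (blue) one.
        self.staged = env

    def cut_over(self, is_healthy: Callable[[Environment], bool]) -> Optional[Environment]:
        """Switch traffic to the staged environment if it passes the health check.

        Returns the retired environment on success, or None if the cut-over
        was refused and the original environment is still live.
        """
        if self.staged is None:
            raise RuntimeError("no staged environment to cut over to")
        if not is_healthy(self.staged):
            return None
        # Nothing is altered in place: the live pointer is replaced.
        retired, self.live, self.staged = self.live, self.staged, None
        return retired


if __name__ == "__main__":
    router = Router(Environment("blue", "v1"))
    router.stage(Environment("green", "v2"))
    retired = router.cut_over(lambda env: True)
    print("live:", router.live, "retired:", retired)
```

Returning the retired environment instead of destroying it leaves the caller free to keep blue around until the new environment has proven itself.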

Goal Tracking With Power BI

Stacia Varga uses New Year’s resolutions to motivate a Power BI tutorial:

As I was thinking about this relationship between goals and feedback, I thought Microsoft Power BI would be a great tracking tool. It’s free, so use it! In years past, I used spreadsheets or checklists in journals or OneNote, any of which is a fine way to accumulate a comprehensive list of all that a person wants to do. However, I never measured progress, thereby denying myself feedback. Consequently, I’d let myself get sidetracked during the year.

This year I promised myself I’d try a different approach and thought I’d share the process with you through a series of blog posts. Although I’m going to discuss goal-tracking from a personal point of view, you can also use the same techniques for your business-oriented goals. Either way, I hope you learn something about Power BI along the way and are inspired to do some goal-setting of your own.

It’s a good use of Power BI.

Columnstore Functionality Per Edition

Niko Neugebauer looks at how columnstore indexes differ between SQL Server Standard Edition, Express Edition, and Enterprise Edition:

One rather small (relative to other features, I imagine) but incredibly useful improvement, described in Columnstore Indexes – part 109 (“Trivial Plans in SQL Server 2017”), is the ability to automatically produce fully optimised execution plans for a database whose compatibility level is set to 140.

Running on both instances (Standard & Express), the following script, while altering the compatibility level between 140 (SQL Server 2017) & 130 (SQL Server 2016), will produce different execution plans for the SELECT COUNT_BIG(*) operation – the fast one (with FULL optimisation in 140 compatibility level) and the slow one (with TRIVIAL optimisation in 130 compatibility level):

I am happy that this feature has no Edition dependence. This is a needed improvement that simply increases the value of the offer and can actually be achieved in a lot of different ways, even without parallelism kicking in.

Niko has also helpfully provided a table at the end of the post to summarize his findings.

2018 Data Professional Survey Results

Brent Ozar has posted data for the 2018 Data Professionals Survey:

A few things to know about it:

  • The data is public domain. The license tab makes it clear that you can use this data for any purpose, and you don’t have to credit or mention anyone.

  • The spreadsheet includes both 2017 & 2018 results. For the new questions this year, the 2017 answers are populated with Not Asked.

  • The postal code field was totally optional, and may be wildly unreliable. Folks asked to be able to put in small portions of their zip code, like the leading numbers.

Looks like I’m going to add one more thing to the to-do list for this week…
