This is the third tutorial on the Spark RDDs Vs DataFrames vs SparkSQL blog post series. The first one is available here. In the first part, we saw how to retrieve, sort and filter data using Spark RDDs, DataFrames and SparkSQL. In the second part (here), we saw how to work with multiple tables in Spark the RDD way, the DataFrame way and with SparkSQL. In this third part of the blog post series, we will perform web server log analysis using real-world text-based production logs. Log data can be used monitoring servers, improving business and customer intelligence, building recommendation systems, fraud detection, and much more. Server log analysis is a good use case for Spark. It’s a very large, common data source and contains a rich set of information.
This tutorial shows you three different ways to solve several problems, including file sizes, counts by response code, top endpoints, etc.
With the release of SQL Server 2016, these trace flags were rumored to be a thing of the past and hence completely unnecessary. That is partially true. The trace flag is unneeded and SQL 2016 does have some different behaviors, but does that mean you have to do nothing to get the benefits of these Trace Flags as implemented in 2016?
As it turns out, these trace flags no longer do what they did in previous editions. SQL Server now pretty much has it baked into the product. Buuuuut, do you have to do anything slightly different to make it work? This was something I came across while reading this post and wanted to double check everything. After all, I was also under the belief that it was automatically enabled. So let’s create a script that checks these things for me.
Click through for the script and a summary of his findings.
Regression analysis mathematically describes the relationship between a set of independent variables and a dependent variable. There are numerous types of regression models that you can use. This choice often depends on the kind of data you have for the dependent variable and the type of model that provides the best fit. In this post, I cover the more common types of regression analyses and how to decide which one is right for your data.
I’ll provide an overview along with information to help you choose. I organize the types of regression by the different kinds of dependent variable. If you’re not sure which procedure to use, determine which type of dependent variable you have, and then focus on that section in this post. This process should help narrow the choices! I’ll cover regression models that are appropriate for dependent variables that measure continuous, categorical, and count data.
It’s a good overview of several techniques.
I thought it would be best to start with a clean slate so I created a new SQL login and database user so that I could definitively figure out which permissions are needed.
Normally I use Windows Authentication for my logins, but this time I thought “since I’m getting crazy learning new things, let me try creating a SQL Login instead.”
After I created my login, I decided to test connecting to my server before digging into the permissions. Result?
After the fifth or sixth time it happens to you, you start making that the first thing you check.
Below is a list of trace flags which, as far as I can tell, have never been publicly documented. I did not fully investigate many of them and many of the descriptions are just guesses. I make no guarantees and none of these should be used in production. All tests were performed on SQL Server 2017 CU2 with trace flags enabled at the global level.
This is combining a bit of database archaeology and database anthropology.
When using the Blue-Green method, basically nothing gets changed. Instead everything gets replaced. We start by setting up a new environment – the green environment – and then cut over to it when we’re ready. Once we cut over to the new environment successfully, we’re free to remove the original blue environment. The technique is all about replacing components rather than altering components.
Check it out for a great explanation, not only of how true blue-green doesn’t jibe well with databases, but how to get it to work well enough.
As I was thinking about this relationship between goals and feedback, I thought Microsoft Power BI would be a great tracking tool. It’s free, so use it! In years past, I used spreadsheets or checklists in journals or OneNote, any of which is a fine way to accumulate a comprehensive list of all that a person wants to do. However, I never measured progress, thereby denying myself feedback. Consequently, I’d let myself get sidetracked during the year.
This year I promised myself I’d try a different approach and thought I’d share the process with you through a series of blog posts. Although I’m going to discuss goal-tracking from a personal point of view, you can also use the same techniques for your business-oriented goals. Either way, I hope you learn something about Power BI along the way and are inspired to do some goal-setting of your own.
It’s a good use of Power BI.
One rather small (relatively other features, as I imagine), but an incredibly useful improvement was described in Columnstore Indexes – part 109 (“Trivial Plans in SQL Server 2017”) – is the ability to automatically produce Fully Optimised execution plans for the Database, which compatibility level is set to 140.
Running on both instances (Standard & Express), the following script, while altering the compatibility level between 140 (SQL Server 2017) & 130 (SQL Server 2016), will produce different execution plan for the SELECT COUNT_BIG(*) operation – the fast one (with FULL optimisation in 140 compatibility level) and slow one (with TRIVIAL optimisation in 130 compatibility level):
I am happy that this feature has got no Edition dependence, this is a needed improvement that simply increases the value of the offer and can actually be achieved in a lot of different ways, event without parallelism kicking in.
Niko has also helpfully provided a table at the end of the post to summarize his findings.
A few things to know about it:
The data is public domain. The license tab makes it clear that you can use this data for any purpose, and you don’t have to credit or mention anyone.
The spreadsheet includes both 2017 & 2018 results. For the new questions this year, the 2017 answers are populated with Not Asked.
The postal code field was totally optional, and may be wildly unreliable. Folks asked to be able to put in small portions of their zip code, like the leading numbers.
Looks like I’m going to add one more thing to the to-do list for this week…