# Day: March 21, 2017

Probability is an important statistical and mathematical concept to understand. In simple terms, probability refers to the chance of a particular outcome occurring among multiple possible outcomes. Probability is expressed as a number between 0 and 1, with 0 meaning that the outcome has no chance of occurring and 1 meaning that the outcome is certain to happen. For equally likely outcomes, it is mathematically represented as P(event) = (# of outcomes in event / total # of outcomes). In addition to understanding this simple definition, we will also look at a basic example of conditional probability and independent events.

It’s a good intro to a critical topic in statistics. If I were to add one thing, it would be that probability is always conditional upon something. It’s fair to write P(Event) with the understanding that it’s a shortcut, but in reality it’s always P(Event | Conditions), where Conditions is the set of assumptions we made in collecting the sample.
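To make the counting definition and the "always conditional" point concrete, here is a minimal sketch in Python. The two-dice sample space and the helper names (`probability`, `total_is_eight`, `first_is_six`, `second_is_six`) are illustrative assumptions, not taken from the linked post; the sketch also assumes every outcome is equally likely.

```python
from fractions import Fraction
from itertools import product

# Assumed sample space: all ordered outcomes of rolling two fair six-sided dice.
sample_space = list(product(range(1, 7), repeat=2))


def probability(event, given=lambda outcome: True):
    """Return P(event | given) by counting equally likely outcomes.

    With the default `given`, the condition is just "the outcome is in the
    sample space", which is the shortcut usually written as P(event).
    """
    conditioned = [o for o in sample_space if given(o)]
    favorable = [o for o in conditioned if event(o)]
    return Fraction(len(favorable), len(conditioned))


def total_is_eight(outcome):
    return sum(outcome) == 8


def first_is_six(outcome):
    return outcome[0] == 6


def second_is_six(outcome):
    return outcome[1] == 6


# P(total is 8): count of matching outcomes over all 36 outcomes.
p_eight = probability(total_is_eight)

# P(total is 8 | first die is 6): the condition shrinks the sample space.
p_eight_given_first_six = probability(total_is_eight, given=first_is_six)

# Independent events: P(A and B) equals P(A) * P(B).
p_both_six = probability(lambda o: first_is_six(o) and second_is_six(o))
independent = p_both_six == probability(first_is_six) * probability(second_is_six)
```

Using `Fraction` keeps the results exact, so the independence check compares true ratios rather than rounded floats.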

The main takeaway is that we continue the deprecation of items that we changed during the preview phase and introduce a lot of new capabilities, including `PIVOT/UNPIVOT`, more catalog sharing, and much more!

There’s a pretty hefty list of updates to check out.

Next, you’ll practice interactively querying Athena from R for analytics and visualization. For this purpose, you’ll use GDELT, a publicly available dataset hosted on S3.

Create a table in Athena from R using the GDELT dataset. This step can also be performed from the AWS management console as illustrated in the blog post “Amazon Athena – Interactive SQL Queries for Data in Amazon S3.”

This is an interesting use case for Athena.

This post describes one way that you can read the top N rows from large text files with C#. This is very useful when you are working with giant files that are too big to open but need to view a portion of them to determine the schema, data types, etc.

I’ve used PowerShell many times to do this with large CSV files, but in this example we’re going to use C# and look at the Wikipedia XML dump of pages and articles. The 2017-03-01 dump is very large and comes in at 59.5 GB.

I’ve had to write something similar before on Windows machines where I didn’t have access to more/less.  It’s really helpful for perusing the first few lines of gigantic log files.
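The linked post does this in C#; purely as an illustration of the same idea, here is a minimal Python sketch that returns the first N lines of a file without loading the rest. The function name `head` and its parameters are hypothetical, and the UTF-8 default is an assumption about the file's encoding.

```python
from itertools import islice


def head(path, n=10, encoding="utf-8"):
    """Return the first n lines of a text file without reading the rest.

    The file object is iterated lazily, so only a small buffered portion of
    the file is read, no matter how large the file is on disk.
    """
    # errors="replace" keeps the preview going if some bytes are not valid
    # in the assumed encoding.
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]
```

Calling `head(some_path, 20)` on a multi-gigabyte dump should return almost immediately, because `islice` stops pulling lines from the file once it has 20 of them.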

There have been a lot of questions, posts, answers, guesses and such floating around the SQL blogs lately…most of which seem to suggest that the DBA is going away.

Hogwash.

The DBA position is not going away.  Ever.  Or at least not before I retire to Utah to spend my days mountain biking 😉

That said, Kevin does point out that you shouldn’t rest on your laurels.

One fun anecdote I have about database administration:  I recall some marketing for some NoSQL product about how, by adopting their software, you can get rid of those stodgy database administrators.  Within a couple of years, said product’s parent company was offering developer training on “advanced” techniques, which included taking backups, tuning queries, implementing disaster recovery, and creating good indexes to help with performance.  But hey, at least they don’t have DBAs!

Hidden schedulers are used to process requests that are internal to the engine itself. Visible schedulers are used to handle end-user requests. When you run a casual SELECT * query, it will utilize a visible scheduler to process the query. With this information, if I have a 64-core server and all is well, I should have 64 visible schedulers online to process requests.

However, I discovered that some of the schedulers were set to “VISIBLE OFFLINE”. This essentially means that those particular schedulers are unavailable to SQL Server for some reason. How many offline schedulers do I have? A quick query showed 24 schedulers currently offline. Assuming two logical cores per physical core, 24 logical cores means that 12 physical cores are offline.

But why would a scheduler be set to “VISIBLE OFFLINE”?

I was invited to deliver a session for the Belgium User Group on SQL Server and R integration. After the session, which we did online using web-based Citrix, I got an interesting question: “Is it possible to use RevoScaleR performance computational functions within Power BI?” My first answer was a sceptical yes. But I said that I haven’t used it in this manner yet and that there might be some limitations.

The idea of having the scalable environment and the parallel computational package with all the predictive analytical functions in Power BI is absolutely great. But something tells me that it will not be that straightforward.

Read on for the rest of the story.