Getting Started With Zeppelin

Sangeeta Gulia shows us how to get started building notebooks with Apache Zeppelin on top of Spark:

There are 3 interpreter modes available in Zeppelin.

1) Shared Mode

In Shared mode, a single SparkContext and a single Scala REPL are shared among all interpreters in the group, so every Note uses the same SparkContext and the same Scala REPL. In this mode, if NoteA defines variable ‘a’, then NoteB is able not only to read variable ‘a’ but also to override it.

2) Scoped Mode

In Scoped mode, each Note has its own Scala REPL, so a variable defined in one Note cannot be read or overridden in another Note. However, a single SparkContext still serves all the Interpreter Groups: all jobs are submitted to this SparkContext, and the fair scheduler schedules them. This can be useful when a user does not want to share a Scala session but wants to keep a single Spark application and leverage its fair scheduler.

3) Isolated Mode

In Isolated mode, each Note has its own SparkContext and Scala REPL.

The default mode of the %spark interpreter is ‘Globally Shared’.
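As a rough mental model of the three modes, here is a toy sketch in plain Python. It is not Zeppelin's implementation: `FakeSparkContext` and `InterpreterGroup` are hypothetical stand-ins, and a dict plays the role of each Scala REPL's variables. It only illustrates what is shared and what is not.

```python
from dataclasses import dataclass, field
from itertools import count

_context_ids = count(1)


@dataclass
class FakeSparkContext:
    """Hypothetical stand-in for a SparkContext; the id just makes sharing visible."""
    id: int = field(default_factory=lambda: next(_context_ids))


class InterpreterGroup:
    """Toy model of how each mode hands out a context and a REPL namespace per note."""

    def __init__(self, mode):
        if mode not in ("shared", "scoped", "isolated"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self._shared_context = FakeSparkContext()
        self._shared_repl = {}   # variables visible to every note in shared mode
        self._per_note = {}      # note name -> (context, repl namespace)

    def session(self, note):
        """Return (context, repl_namespace) for the given note name."""
        if self.mode == "shared":
            return self._shared_context, self._shared_repl
        if note not in self._per_note:
            # Scoped: own REPL, shared context. Isolated: own REPL and own context.
            context = self._shared_context if self.mode == "scoped" else FakeSparkContext()
            self._per_note[note] = (context, {})
        return self._per_note[note]


if __name__ == "__main__":
    for mode in ("shared", "scoped", "isolated"):
        group = InterpreterGroup(mode)
        ctx_a, repl_a = group.session("NoteA")
        repl_a["a"] = 1
        ctx_b, repl_b = group.session("NoteB")
        print(mode, "| NoteB sees 'a':", "a" in repl_b, "| same context:", ctx_a is ctx_b)
```

In this model, only shared mode lets NoteB see NoteA's variable, and only isolated mode gives each note a separate context.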

This is mostly a step-by-step guide to installing Zeppelin, but it does go into some detail on how Zeppelin works.

Everyone’s Data Is Dirty

Kevin Feasel

2017-11-16

Data

Chirag Shivalker hits the highlights on dirty data:

It might sound a bit abrupt, but clean data is a myth. If your data is dirty, so is everyone else’s. Enterprises are more than dependent on data these days, and it is going to stay the same in coming years. They need to collect data in order to analyze it, which necessarily will not be 100% clean, pristine, or perfect in nature.

Nearly all companies face the challenge of dirty data in the form of duplicates, incorrect fields, and missing values. This happens due to omnichannel data influx, followed by hundreds, if not thousands, of employees wrestling and torturing that data to derive professional outcomes and insights. Don’t forget that even the best data tends to decay within a few weeks.

The saying goes that any analytics project is about 80% data cleansing and feature extraction.  I’d say that number’s probably closer to 90-95%, and dirty data is a big part of that.
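To make the duplicates and missing values concrete, here is a minimal profiling sketch. The `profile_rows` helper, the field names, and the sample rows are all made up for illustration. Treating blank strings as missing is an assumption that will not suit every data set.

```python
from collections import Counter


def profile_rows(rows, required_fields):
    """Count exact duplicate rows and missing values in the required fields."""

    def is_missing(value):
        # Assumption: None and blank or whitespace-only strings count as missing.
        return value is None or (isinstance(value, str) and not value.strip())

    # Two rows are duplicates when every field matches exactly.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(n - 1 for n in seen.values() if n > 1)
    missing = {
        name: sum(1 for row in rows if is_missing(row.get(name)))
        for name in required_fields
    }
    return {"rows": len(rows), "duplicate_rows": duplicates, "missing": missing}


# Hypothetical sample data, not taken from the linked article.
SAMPLE_ROWS = [
    {"id": "1", "email": "a@example.com", "city": "Durham"},
    {"id": "1", "email": "a@example.com", "city": "Durham"},
    {"id": "2", "email": "", "city": "Raleigh"},
    {"id": "3", "email": None, "city": "  "},
]

if __name__ == "__main__":
    print(profile_rows(SAMPLE_ROWS, ["email", "city"]))
```

A count like this only finds the easy problems. Incorrect but plausible values need domain rules to detect.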

Query Store Plan Forcing: You Can’t Always Get What You Want

Kendra Little shows an example where trying to force a Query Store plan results in an oddity:

This is not considered a “failure”

When I check the Query Store DMVs, force_failure_count is 0. The last_force_failure_reason_desc is NONE.

Query Store didn’t fail to apply the narrow plan. Instead, it’s just deciding not to give it to me, now that I’ve forced that plan.

Seems kinda like an adolescent, doesn’t it?

The answer remains a bit of a mystery, but read on to see how Kendra troubleshoots this.

What To Do With A Database In Source Control

Ed Elliott with Database Source Control 102:

This post is for a specific type of person. If you:

  • Are new to source control
  • Are getting started on your path to the continuous delivery nirvana
  • Have been able to get your database into some sort of source control system
  • Have more than one person checking in code to the database source
  • Are unsure what to do next

Then this post is for you!

This is a nice post with some next-steps for when you have a database in source control but aren’t quite sure what to do next.

Go Download SQL Operations Studio

The SQL Server Team has announced SQL Operations Studio:

We are excited to announce that SQL Operations Studio is now available in preview. SQL Operations Studio is a free, light-weight tool for modern database development and operations for SQL Server on Windows, Linux and Docker, Azure SQL Database and Azure SQL Data Warehouse on Windows, Mac or Linux machines.

Download SQL Operations Studio to get started.

It’s not SSMS, but it is cross-platform.  And I think that over time, it will end up being better than SSMS.

Defining Result Sets With ML Services

Kevin Feasel

2017-11-16

Python, R

Dave Mason covers a pain point in SQL Server Machine Learning Services:

The example above is so simple, defining the RESULT SETS poses no problems. But what if the format of the output isn’t known at design time? R (or Python) might take the input data set and add, remove, or change columns conditionally. Further, the input data set might not even be known at design time. How would you define the RESULT SETS at run time?

WITH RESULT SETS needs a MAKE_A_GUESS or FIGURE_IT_OUT option. If there’s some other type of “easy button” for this, I haven’t found it.

It would be nice if the service had the ability to read the data frame columns and use those by default.
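One workaround people reach for is building the clause on the client and wrapping the call in dynamic SQL. The sketch below only generates the clause text from a list of column names and Python types. `result_sets_clause`, `quote_name`, and the type mapping are hypothetical choices for illustration, not part of ML Services, and the caller would still have to know the output columns before executing sp_execute_external_script.

```python
# Assumed mapping from Python types to SQL Server types; adjust to your data.
SQL_TYPES = {int: "INT", float: "FLOAT", str: "NVARCHAR(4000)", bool: "BIT"}


def quote_name(name):
    """Bracket-quote an identifier, doubling any closing bracket inside it."""
    return "[" + name.replace("]", "]]") + "]"


def result_sets_clause(columns):
    """Build a WITH RESULT SETS clause from (column_name, python_type) pairs."""
    if not columns:
        raise ValueError("at least one column is required")
    parts = []
    for name, py_type in columns:
        if py_type not in SQL_TYPES:
            raise TypeError(f"no SQL type mapping for {py_type!r}")
        parts.append(f"{quote_name(name)} {SQL_TYPES[py_type]}")
    return "WITH RESULT SETS ((" + ", ".join(parts) + "))"


if __name__ == "__main__":
    # Hypothetical output shape of an R or Python script.
    print(result_sets_clause([("CustomerID", int), ("Score", float)]))
```

This moves the guessing to the client rather than removing it, which is the gap the excerpt is pointing at.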

Using The GROUPING SETS Operator

Kevin Feasel

2017-11-16

T-SQL

Alfonso Hernandez goes into detail with what you can do with GROUPING SETS:

In T-SQL, you summarize data by using the GROUP BY clause within an aggregate query. This clause creates groupings which are defined by a set of expressions. One row per unique combination of the expressions in the GROUP BY clause is returned, and aggregate functions such as COUNT or SUM may be used on any columns in the query. However, if you want to group the data by multiple combinations of group by expressions, you may take one of two approaches. The first approach is to create one grouped query per combination of expressions and merge the results using the UNION ALL operator. The other approach is to use the GROUPING SETS operator along with the GROUP BY clause and define each grouping set within a single query.

In this article I’ll demonstrate how to achieve the same results using each method.

Mastering GROUPING SETS makes reporting queries in T-SQL so much more effective.
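Here is a small sketch of the two approaches described in the excerpt, using a made-up `sales` table. To keep it runnable anywhere, it executes only the UNION ALL form against an in-memory SQLite database, which does not support GROUPING SETS. The T-SQL GROUPING SETS form is included as a string for comparison and is not executed.

```python
import sqlite3

# Second approach, T-SQL only: shown for comparison, never run by this script.
GROUPING_SETS_TSQL = """
SELECT region, product, SUM(amount) AS total
FROM sales
GROUP BY GROUPING SETS ((region), (product));
"""

# First approach: one grouped query per grouping, merged with UNION ALL.
UNION_ALL_SQL = """
SELECT region, NULL AS product, SUM(amount) AS total FROM sales GROUP BY region
UNION ALL
SELECT NULL AS region, product, SUM(amount) AS total FROM sales GROUP BY product
"""

# Hypothetical sample rows: (region, product, amount).
SAMPLE_SALES = [
    ("East", "Widget", 10),
    ("East", "Gadget", 5),
    ("West", "Widget", 7),
]


def grouped_totals():
    """Return one total per region and one per product as (region, product, total)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", SAMPLE_SALES)
        return conn.execute(UNION_ALL_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in grouped_totals():
        print(row)
```

On SQL Server, the GROUPING SETS query should return the same rows from a single statement. Which form performs better is something to check in the execution plans.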

Thoughts On Dynamic Data Masking

Kellyn Pot’vin-Gorman talks about Dynamic Data Masking:

Anyone with the unmask privilege or DB_OWNER will be able to view the data.  As many development and testing environments grant higher privileges to the users and in SQL Server, it’s not rare for a developer to be the DB_OWNER, (I used to come across this all the time when recoveries were performed by the wrong OS user) this leaves this data still quite vulnerable.  I do like that if you were to take a backup and recover it with masking, the obfuscated data is what is recovered physically.  I’m more concerned about those odd environments where compliance hasn’t been put in place on owners of the database that would still view the originally masked data, but unmasked.

Performance isn’t impacted (i.e. no referential integrity concerns or execution plans), as the optimizer performs all steps against the real data, which leads me to wonder what happens with some of the newer monitoring tools that state they can display SQL and bind variable data without accessing the database directly.  Would they “sniff” the masked data or unmasked?  Would it matter who the OS user is or which roles they hold in the database?

The important thing here is that DDM isn’t really a security product.  It’s a something-or-another product that might be useful to stop shoulder surfing but pretty much nothing else.
