Press "Enter" to skip to content

Author: Kevin Feasel

Resilient Distributed Datasets

Spark is built around the concept of Resilient Distributed Datasets.  If you have not read Matei Zaharia et al.’s paper on the topic, I highly recommend it:

Spark exposes RDDs through a language-integrated API similar to DryadLINQ [31] and FlumeJava [8], where each dataset is represented as an object and transformations are invoked using methods on these objects.

Programmers start by defining one or more RDDs through transformations on data in stable storage (e.g., map and filter). They can then use these RDDs in actions, which are operations that return a value to the application or export data to a storage system. Examples of actions include count (which returns the number of elements in the dataset), collect (which returns the elements themselves), and save (which outputs the dataset to a storage system). Like DryadLINQ, Spark computes RDDs lazily the first time they are used in an action, so that it can pipeline transformations.

In addition, programmers can call a persist method to indicate which RDDs they want to reuse in future operations. Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough RAM. Users can also request other persistence strategies, such as storing the RDD only on disk or replicating it across machines, through flags to persist. Finally, users can set a persistence priority on each RDD to specify which in-memory data should spill to disk first.
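The paper’s examples are in Scala, but the same lazy-evaluation-and-caching model shows through in Spark SQL.  Here is a rough analog, sketched with an invented raw_logs table (my illustration, not the paper’s): defining a view is like a transformation, CACHE TABLE plays the role of persist, and the final query is the action that reads the results.

CREATE TEMPORARY VIEW error_events AS
SELECT * FROM raw_logs
WHERE level = 'ERROR';      -- like a transformation: nothing computed yet

-- Rough analog of persist(): materialize results in memory, spilling
-- to disk if RAM runs short.  (CACHE LAZY TABLE defers materialization
-- to the first query, which is closer to RDD persist semantics.)
CACHE TABLE error_events;

SELECT COUNT(*) FROM error_events;   -- subsequent reads hit the cache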

The link also has a video of their initial presentation at NSDI.  Check it out.


Giving Permissions Through Stored Procedures

Erland Sommarskog has a fantastic article on the right (and wrong!) ways of doing stored procedure security:

Before I go on to the main body of this text, I would like to make a short digression about security in general.

Security is often in conflict with other interests in the programming trade. You have users screaming for a solution, and they want it now. At this point, they don’t really care about security; they just want to get their business done. But if you give them a solution that has a hole, and that hole is later exploited, you are the one who will be hanged. So as a programmer, you always need to have security in mind and make sure that you play your part right.

One common mistake in security is to think “we have this firewall/encryption/whatever, so we are safe”. I like to think of security as something that consists of a number of defence lines. Anyone who has worked with computer systems knows that there are a lot of changes in them, both in their configuration and in the program code. Your initial design may be sound and safe, but as the system evolves, there might suddenly be a security hole and a serious vulnerability in your system.

By having multiple lines of defence you can reduce the risk of this happening. If a hole is opened, you can reduce the impact of what is possible to do through that hole. An integral part of this strategy is to never grant more permissions than is absolutely necessary. Exactly what this means in this context is something I shall return to.
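To make the idea concrete, here is a minimal sketch of one pattern the article explores: granting users EXECUTE on a procedure so they can perform a task without holding the underlying table permission.  The names (AppUser, dbo.Staging, dbo.ClearStaging) are invented for illustration; the article itself goes much deeper into ownership chaining, EXECUTE AS, and certificate signing.

-- TRUNCATE TABLE requires ALTER on the table, which ordinary users
-- should not have.  EXECUTE AS OWNER runs the procedure body under
-- the owner's permissions instead of the caller's.
CREATE PROCEDURE dbo.ClearStaging
WITH EXECUTE AS OWNER
AS
BEGIN
    TRUNCATE TABLE dbo.Staging;
END;
GO

-- The caller needs EXECUTE on the procedure and nothing on the table.
GRANT EXECUTE ON dbo.ClearStaging TO AppUser;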

This is a must-read for anyone interested in rights management in SQL Server.


Happy Memorial Day

Curated SQL is on semi-holiday for Memorial Day.  Instead of posting new links to items of interest in the wide world of data, I want to point out a couple of longer works that I normally would not be able to talk about, given the site’s motif.  If you’re in the office on a slow day, here are a few items that will get you through.


Recursive CTEs

Steve Jones gives a simple recursive CTE example and an important lesson on compounding:

Recursion is an interesting computer science technique that stumps lots of people. When I was learning programming, it seemed that recursion (in Pascal) and pointers (in C) were the weed-out topics.

However, they aren’t that bad, and with CTEs, we can write recursion in T-SQL. I won’t cover where this might be used in this post, though I will give you a simple CTE to view.

There are two parts you need: the anchor and the recursive member. These are connected with a UNION ALL. There can be multiple items, but we’ll keep things simple.
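A minimal example of that structure (my sketch, not Steve’s exact query) counts from 1 to 10:

WITH Numbers AS
(
    SELECT 1 AS n        -- anchor member: seeds the result set
    UNION ALL
    SELECT n + 1         -- recursive member: references the CTE itself
    FROM Numbers
    WHERE n < 10         -- termination condition
)
SELECT n
FROM Numbers;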

You can play query golf and find a way to remove the recursion, but it’s good to know how to create a recursive CTE.  It’s also good to know that you typically do not want recursion in a database process…


SystemThread

Ewald Cress looks at SystemThread:

SystemThread, a class within sqldk.dll, can be considered to be at the root of SQLOS’s scheduling capabilities. While it doesn’t expose much obviously exciting functionality, it encapsulates a lot of the state that is necessary to give a thread a sense of self in SQLOS, serving as the beacon for any code to find its way to an associated SQLOS scheduler etc. I won’t go into much of the SQLOS object hierarchy here, but suffice it to say that everything becomes derivable by knowing one’s SystemThread. As such, this class jumps the gap between a Windows thread and the object-oriented SQLOS.

The simplest way to conceptualise a thread in SQL Server is to think of a spid or connection busy executing a simple query, old skool sysprocesses style. It’s not hip, but it’s close enough to the truth to be useful. This conflates a few things that are separate entities, but it is a good starting point for teasing things apart as needed.

This is another dive into internals; prepare your thinking caps.


Spark SQL For ETL

Ben Snively discusses using Spark SQL as part of an ETL process:

Now interact with SparkSQL through a Zeppelin UI, but re-use the table definitions you created in the Hive metadata store.   You’ll create another table in SparkSQL later in this post to show how that would have been done there.

Connect to the Zeppelin UI and create a new notebook under the Notebook tab. Query to show the tables. You can see that the two tables you created in Hive are also available in SparkSQL.
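The flavor of those steps, with invented table and column names rather than the ones from the post: in a Zeppelin SQL paragraph, tables registered in the shared Hive metastore are directly visible, and you can create new tables in Spark SQL alongside them.

-- Tables defined in Hive appear here, too, via the shared metastore.
SHOW TABLES;

-- A table created from Spark SQL, stored as Parquet.
CREATE TABLE IF NOT EXISTS cleaned_events
USING PARQUET
AS
SELECT event_id, event_time, user_id
FROM raw_events
WHERE event_id IS NOT NULL;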

There are a bunch of tools in here, but for me, the moral of the story is that SQL is a great language for data processing.  Spark SQL still has gaps, but it has filled many of them over the past year or so, and I recommend giving it a shot.


Heron

Twitter is open-sourcing Heron, a competitor to Apache Storm:

Apache Storm was the original solution to Twitter’s problems. It was created by a marketing intelligence company called BackType, and Twitter bought the company in 2011 and eventually open-sourced Storm, providing it to the Apache Foundation.

There’s no question Storm has a lot of advantages. It’s scalable and fault-tolerant, with a decent ecosystem of “spouts,” or systems for receiving data from established sources. But it was reputedly also hard to work with and hard to get good results from, and despite a recent 1.0 renovation, it’s been challenged by other projects, including Apache Spark and its own revised streaming framework.

It’s good to see this competition in the streaming space.


Box And Whisker Plots

Slava Murygin shows how to create a box and whisker plot in SSMS using spatial data types:

If you have no idea what a Box-and-Whisker Plot is, please visit the following link: http://www.wellbeingatschool.org.nz/information-sheet/understanding-and-interpreting-box-plots

First, I will show how to do it based on the AdventureWorks database in SQL Server 2014.

We will analyze amounts of individual lines of sales orders within each month.

The first step is to create a data set to process.  That data set will contain a month, a single line amount, and the order number of that record within the month.
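A guess at the shape of that first query against AdventureWorks (a sketch, not Slava’s actual code):

SELECT
    DATEFROMPARTS(YEAR(soh.OrderDate), MONTH(soh.OrderDate), 1) AS OrderMonth,
    sod.LineTotal,                        -- the single-line amount
    ROW_NUMBER() OVER (
        PARTITION BY YEAR(soh.OrderDate), MONTH(soh.OrderDate)
        ORDER BY sod.LineTotal
    ) AS OrderNumberInMonth               -- position of the line within its month
FROM Sales.SalesOrderHeader AS soh
INNER JOIN Sales.SalesOrderDetail AS sod
    ON sod.SalesOrderID = soh.SalesOrderID;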

This is really cool…but I wonder if it wouldn’t be better to do this in R, where it’d take a lot less code.  If you can’t reach out to R, though, this is a good way of visualizing results.


Finding Unused Indexes

SQLWayne has a script to help find unused indexes:

Here’s some code that can show you which indexes are unused or empty.  An empty index just means that there’s no data in that table right now; it may yet be populated later, so I would not drop an empty index.  Besides, how much space would an empty index take?

As a personal preference, I order the output by table and then index name, and I put a u.* at the end of the SELECT statement so the more interesting usage-stat columns can be seen.
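The core of such a script (sketched here from the description, not SQLWayne’s actual code) joins sys.indexes to sys.dm_db_index_usage_stats and looks for indexes with no recorded reads since the last service restart:

SELECT
    OBJECT_NAME(i.object_id) AS TableName,
    i.name AS IndexName,
    u.*                       -- the more interesting usage-stat columns
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS u
    ON u.object_id = i.object_id
    AND u.index_id = i.index_id
    AND u.database_id = DB_ID()
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
    AND i.index_id > 0        -- skip heaps
    AND COALESCE(u.user_seeks + u.user_scans + u.user_lookups, 0) = 0
ORDER BY TableName, IndexName;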

If an index truly is unused, it’s a waste of resources.  The problem is, sometimes you’ll think an index is unused when it’s really a vital part of month-end reporting or the CEO’s favorite dashboard.


On The Edge

Kendra Little answers a question on version novelty:

Dear SQL DBA,

What are your thoughts on the early adoption of new SQL Server versions? Specifically, if the business is willing to assume the risk of early adoption just to get one new feature that they can probably live without, should DBAs be happy and willing to assume that risk too? Or, is it our responsibility to “just say no” until it has been tested? I would like to hear about any experience you have with this. Thanks.

Bleeding in Edgeville

Go read Kendra’s answer because it’s a good one.  My answer is, I want to be on the edge.  I’ve run into V1 bugs and had to spike projects before they made it to production, but if there’s a good benefit to moving, and if your business side is supportive, I’d lean heavily toward fast upgrades.
