Basic Spark Terminology

Kevin Feasel

2016-06-28

Spark

Denny Lee and Jules Damji explain some of the key terms and concepts around Apache Spark:

At the core of Apache Spark is the notion of data abstraction as distributed collection of objects. This data abstraction, called Resilient Distributed Dataset (RDD), allows you to write programs that transform these distributed datasets.

RDDs are immutable distributed collection of elements of your data that can be stored in memory or disk across a cluster of machines. The data is partitioned across machines in your cluster that can be operated in parallel with a low-level API that offers transformations and actions. RDDs are fault tolerant as they track data lineage information to rebuild lost data automatically on failure.

Some of these concepts are new to Spark 2.0, but all are worth learning.

Analyzing Real-Time Data

Manjeet Chayel connects Spark Streaming to Amazon Kinesis and shows how to analyze the data in real time:

To use this post to play around with streaming data, you need an AWS account and AWS CLI configured on your machine. The entire pattern can be implemented in few simple steps:

  1. Create an Amazon Kinesis stream.

  2. Spin up an EMR cluster with Hadoop, Spark, and Zeppelin applications from advanced options.

  3. Use a Simple Java producer to push random IoT events data into the Amazon Kinesis stream.

  4. Connect to the Zeppelin notebook.

  5. Import the Zeppelin notebook from GitHub.

  6. Analyze and visualize the streaming data.

This is a good way of getting started with streaming data.  I’ve grown quite fond of notebooks in the short time that I’ve used them, as they make it very easy for people who know what they’re doing to provide code and information to people who want to know what they’re doing.

Home Labs

Chrissy LeMaire shows off her home lab:

I like to test my scripts against a variety of versions/editions and I don’t like spinning VMs up and down all the time. As for the cost; some people spend their money on golf, Polish pottery and gaming rigs. I spend mine on servers, Belgian beer and travel 😉

As you can see, I also have an old Macbook Pro with 256 SSD, 4TB HDD and 8GB RAM in the mix. It’s for photos and videos, however. And someone gave me an old silver Shuttle from like 2002, but I haven’t had the time to set it up yet.

The “cloud versus local” lab is a tough call, as both sides have their advantages and disadvantages.

Where Polybase Stats Live

I dig into where the statistics against a Polybase table actually live:

Today, we learned that Polybase statistics are stored in the same way as other statistics; as far as SQL Server is concerned, they’re just more statistics built from a table (remembering that the way stats get created involves loading data into a temp table and building stats off of that temp table).  We can do most of what you’d expect with these stats, but beware calling sys.dm_db_stats_properties() on Polybase stats, as they may not show up.

Also, remember that you cannot maintain, auto-create, auto-update, or otherwise modify these stats.  The only way to modify Polybase stats is to drop and re-create them, and if you’re dealing with a large enough table, you might want to take a sample.

The result isn’t very surprising in retrospect, and it’s good that “stats are stats are stats” is the correct answer.

Sensible Auto-Growth Settings

Ajay Jagannathan notes that SQL Server 2016’s database auto-growth has changed to better default values:

model database: New default data and log file size is 8MB and default auto-growth is 64MB. This ensures that any new database created without explicitly specifying the SIZE/FILEGROWTH parameter will have 8MB initial size for all data and log files and 64MB for auto-growth for both data and log files.

For data files, having a 64MB autogrow, aligns with 1 PFS interval (which covers a range of 8088 pages = 64MB). For log files, having a 64MB autogrow helps with sizing the initial VLFs correctly so that they can be garbage claimed (wrapped-around) without which the log can keep growing.

This is much better than the prior default of 1 MB size and 10% auto-growth.  Percentage growth leads to eventual pain.

Watch Those Parentheses

Kenneth Fisher shows how to see open and close parenthetical locations:

You’ll notice that when I go over the parentheses the one I’ve selected and it’s pair turn yellow, unless there isn’t a pair of course. You can also use Ctrl-] to flip between the open and close parenthesis in a pair. This can be particularly useful to make sure that you remembered a close parenthesis at the end of a subquery. In this case that last close parenthesis doesn’t have a match. Now finding out that you are missing an open parenthesis doesn’t mean you know where it’s supposed to go. But you can track the different pairs, making sure that each time you open a parenthesis you close it in the correct place. In this case it belonged right at the beginning.

FYI yellow isn’t the default (it’s a light gray). I find the default hard to see (I’m getting old) so I changed it to yellow in the options under fonts and colors.

Read the whole thing.

SQL Server Event Handling

Dave Mason looks at different levels of event handling within SQL Server:

While event handling for .Net developers is implemented in a unified way, this is not the case for SQL Server. Event handling for SQL Server lacks the “one stop shopping” afforded to .Net developers. *If* we had access to the code base for SQL Server and wanted to handle a specific event, we could add our own code, recompile sqlservr.exe, and be on our way. But since we don’t have this ability, we use SQL Server’s run-time hooks. Consider the following:

  • DDL Triggers: handles Data Definition Language events (synchronously).

  • Event Notifications: handles a wide swath of SQL Server events via Service Broker (asynchronously).

  • SQL Alerts: handles the following events:

    1. Events with a specific error number or severity level that are written to the Windows Event Log.

    2. Events for a specific performance condition.

    3. WMI events.

  • sp_procoption: handles the startup event by specifying a stored procedure to run when the database engine service starts.

  • SQL Agent jobs: handles time-based events defined by user-specified job schedules (ie daily, hourly).

This sounds like the beginning of a new series.

PDF Search With Page Numbers

Kevin Feasel

2016-06-28

Search

Jon Morisi has a solution for how to get page numbers for results back from PDFs when using Full-Text Search:

In my last blog post, Setting up Full-Text Search for PDF files, I detailed how to get things setup.  If you tried this you may have noticed that although the searches worked, what you got back was a file name.  This isn’t so helpful if your document is an all encompassing 538 pages.  So, how do we get a page number back?  The best I’ve come up with so far is to split the 538 pages into 538 documents and load / search on those.

My first google search on how to split a pdf into pages came back with, http://www.splitpdf.com/, so I went ahead and used that.  I’m sure there is a way to do this through acrobat or even roll your own split functionality via the API.

It’s not a particularly pretty solution, but it does work, and that’s important.

Rprofile For Notifications

Kevin Feasel

2016-06-28

R

Steph Locke shows how to use .Rprofile to make your life easier:

First of all, you need a file called .Rprofile that’s stored in your working directory. Some useful resources about .Rprofiles can be found on .Rprofile CRAN docs and an .Rprofile intro.

Now inside that file, you can add a number of functions that are based on a number of events like loading or closing R. I need a .First function for on load and whatever I produce has to be able to print to the console with cat().

With that in mind, instead of showing details, I chose to show the number of breaches I’m in. You can get HIBPwned from CRAN and use it to query the awesome website HaveIBeenPwned.com.

I’ve seen people do things like this in .bash_profile, but didn’t know about .Rprofile before.

Spark Metrics

Swaroop Ramachandra looks at some key metrics for Spark administration:

Once you have identified and broken down the Spark and associated infrastructure and application components you want to monitor, you need to understand the metrics that you should really care about that affects the performance of your application as well as your infrastructure. Let’s dig deeper into some of the things you should care about monitoring.

  1. In Spark, it is well known that Memory related issues are typical if you haven’t paid attention to the memory usage when building your application. Make sure you track garbage collection and memory across the cluster on each component, specifically, the executors and the driver. Garbage collection stalls or abnormality in patterns can increase back pressure.

There are a few metrics of note here.  Check it out.

Categories

June 2016
MTWTFSS
« May Jul »
 12345
6789101112
13141516171819
20212223242526
27282930