
Author: Kevin Feasel

Spark Metrics

Swaroop Ramachandra looks at some key metrics for Spark administration:

Once you have identified and broken down the Spark and associated infrastructure and application components you want to monitor, you need to understand the metrics that you should really care about, the ones that affect the performance of your application as well as your infrastructure. Let’s dig deeper into some of the things you should care about monitoring.

  1. In Spark, it is well known that memory-related issues are typical if you haven’t paid attention to memory usage when building your application. Make sure you track garbage collection and memory across the cluster on each component, specifically the executors and the driver. Garbage collection stalls or abnormal patterns can increase back pressure.

There are a few metrics of note here.  Check it out.
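To make that concrete, here is a minimal sketch in Python (using the requests library) that polls the Spark UI’s REST API for per-executor memory and garbage collection figures. The driver host, port, and application ID are hypothetical, and the exact field names (memoryUsed, maxMemory, totalGCTime) assume a reasonably recent Spark release, so treat this as a starting point rather than a drop-in monitor.

    import requests

    # Hypothetical driver UI endpoint; the monitoring REST API lives under /api/v1 on the Spark UI port.
    DRIVER_UI = "http://spark-driver.example.com:4040"

    def executor_memory_and_gc(app_id):
        """Yield (executor id, memory used, max memory, total GC time) for each executor."""
        url = "{0}/api/v1/applications/{1}/executors".format(DRIVER_UI, app_id)
        for ex in requests.get(url, timeout=5).json():
            yield (ex["id"], ex["memoryUsed"], ex["maxMemory"], ex.get("totalGCTime", 0))

    if __name__ == "__main__":
        apps = requests.get(DRIVER_UI + "/api/v1/applications", timeout=5).json()
        for app in apps:
            for row in executor_memory_and_gc(app["id"]):
                print(row)

Scheduled from cron or a monitoring agent, a poll like this is enough to trend executor memory pressure and GC time over the life of an application.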

Comments closed

Preparing For Disaster Recovery

Kendra Little has a 30-minute video and explanation for how to prepare for a failover event:

The fact that you’re thinking about this is great!

You’re right, there are two major types of fail-overs that you have to think about:

  • Planned failover, when you can get to the original production system (at least for a short time)
  • Unplanned failover, when you cannot get to it

Even when you’re doing a planned failover, you don’t have time to go in and script out settings and jobs and logins and all that stuff.

Timing is of the essence, so you need minimal manual actions.

And you really should have documentation so that whoever is on call can perform the failover, even if they aren’t you.

The short answer is, test, test, test.  Test where it can’t hurt, and then test where it can.  But do read/watch the whole thing.
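One way to avoid scrambling for settings, jobs, and logins during a failover is to capture that server-level inventory on a schedule, long before you need it. Here is a minimal sketch in Python using pyodbc; the server name and output path are hypothetical, and in practice you would script out full login and job definitions (for example with dbatools or sp_help_revlogin) rather than just listing names.

    import pyodbc

    # Hypothetical production instance; integrated security keeps credentials out of the script.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=PRODSQL01;Trusted_Connection=yes")
    cur = conn.cursor()

    # Server-level logins (SQL logins, Windows logins, and Windows groups).
    cur.execute("SELECT name, type_desc FROM sys.server_principals WHERE type IN ('S','U','G')")
    logins = cur.fetchall()

    # SQL Agent jobs and whether they are enabled.
    cur.execute("SELECT name, enabled FROM msdb.dbo.sysjobs")
    jobs = cur.fetchall()

    # Write a dated inventory file the on-call person can compare against the secondary.
    with open("dr_inventory_PRODSQL01.txt", "w") as f:
        f.write("Logins:\n")
        for name, type_desc in logins:
            f.write("  {0} ({1})\n".format(name, type_desc))
        f.write("Agent jobs:\n")
        for name, enabled in jobs:
            f.write("  {0} (enabled={1})\n".format(name, enabled))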

Comments closed

Monitoring Apache Spark

Swaroop Ramachandra has started a series on monitoring Apache Spark:

Spark provides metrics for each of the above components through different endpoints. For example, if you want to look at the Spark driver details, you need to know the exact URL, which keeps changing over time–Spark keeps you guessing on the URL. The typical problem is when you start your driver in cluster mode. How do you detect on which worker node the driver was started? Once there, how do you identify the port on which the Spark driver exposes its UI? This seems to be a common annoying issue for most developers and DevOps professionals who are managing Spark clusters. In fact, most end up running their driver in client mode as a workaround, so they have a fixed URL endpoint to look at. However, this is being done at the cost of losing failover protection for the driver. Your monitoring solution should be automatically able to figure out where the driver for your application is running, find out the port for the application and automatically configure itself to start collecting metrics.

For a dynamic infrastructure like Spark, your cluster can get resized on the fly. You must ensure your newly spawned components (workers, executors) are automatically configured for monitoring. There is no room for manual intervention here. You shouldn’t miss out on monitoring new processes that show up on the cluster. On the other hand, you shouldn’t be generating false alerts when executors get moved around. A general monitoring solution will typically start alerting you if an executor gets killed and starts up on a new worker–this is because generic monitoring solutions just monitor your port to check if it’s up or down. With a real-time streaming system like Spark, the core idea is that things can move around all the time.

Spark does add a bit of complexity to monitoring, but there are solutions in place.  Read the whole thing.
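As a rough illustration of the discovery problem, here is a naive Python sketch that probes each known worker host over the default UI port range (Spark starts at spark.ui.port, 4040 by default, and increments when the port is taken) and asks the REST API which application is running there. The worker host names are hypothetical, and a real monitoring agent would do something more robust than port probing, but it shows the shape of the problem.

    import requests

    # Hypothetical worker hosts; in cluster mode, any of them might be hosting the driver.
    WORKER_HOSTS = ["spark-worker-1.example.com", "spark-worker-2.example.com"]
    UI_PORTS = range(4040, 4051)  # default spark.ui.port plus a few increments

    def find_drivers():
        """Yield (host, port, application name) for every responding Spark UI."""
        for host in WORKER_HOSTS:
            for port in UI_PORTS:
                url = "http://{0}:{1}/api/v1/applications".format(host, port)
                try:
                    apps = requests.get(url, timeout=2).json()
                except requests.RequestException:
                    continue
                for app in apps:
                    yield host, port, app["name"]

    for host, port, name in find_drivers():
        print("Driver UI for '{0}' found at {1}:{2}".format(name, host, port))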

Comments closed

Detecting Web Traffic Anomalies

Jan Kunigk combines a few Apache products to perform near-real-time analysis of web traffic data:

meinestadt.de web servers generate up to 20 million user sessions per day, which can easily result in up to several thousand HTTP GET requests per second during peak times (and expected to scale to much higher volumes in the future). Although there is a permanent fraction of bad requests, at times the number of bad requests jumps.

The meinestadt.de approach is to use a Spark Streaming application to feed an Impala table every n minutes with the current counts of HTTP status codes within the n minutes window. Analysts and engineers query the table via standard BI tools to detect bad requests.

What follows is a fairly detailed architectural walkthrough as well as configuration and implementation work.  It’s a fairly long read, but if you’re interested in delving into Hadoop, it’s a good place to start.
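If you want a feel for the core of that pipeline, here is a stripped-down PySpark Streaming sketch that counts HTTP status codes over a five-minute window. The socket source, host name, and log format (combined log format, with the status code in field nine) are assumptions; the article’s pipeline reads from a message queue and writes the counts into a table that Impala can query rather than printing them.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="StatusCodeCounts")
    ssc = StreamingContext(sc, batchDuration=60)  # one-minute micro-batches

    # Hypothetical text source; each line is one access-log entry.
    lines = ssc.socketTextStream("loghost.example.com", 9999)

    # Field 9 of a combined-log-format line is the HTTP status code.
    status_codes = lines.map(lambda line: (line.split(" ")[8], 1))

    # Count each status code over a five-minute window, emitted every five minutes.
    counts = status_codes.window(300, 300).reduceByKey(lambda a, b: a + b)

    counts.pprint()  # the real pipeline would persist these counts for Impala to query

    ssc.start()
    ssc.awaitTermination()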

Comments closed

New SSIS And SSRS Projects

Ginger Grant shows how to create a new SSIS or SSRS project for SQL Server 2016:

In this version of SQL Server Data Tools, Microsoft has finally addressed the common problem of needing to maintain multiple versions of SSIS packages for the different server versions. No longer do you need three different applications to maintain code for SQL Server 2012, 2014 and now 2016. All of these versions are supported with SSDT for Visual Studio 2015. SQL Server will detect which version the code was last saved in so that you don’t have to worry about accidentally migrating code. You also have the ability to create an SSIS package in 2012, 2014 or 2016. To select the version you want, right click on the project and select Properties. Under Configuration Properties->General as shown in the picture, the TargetServerVersion, which defaults to SQL Server 2016, has a dropdown box making it possible to create a new package in Visual Studio 2015 for whatever version you need to support. Supporting the ability to write for different versions is a great new feature and one which I am really happy is included in SSDT for Visual Studio 2015.

I’m also glad that Microsoft has made this move.  It is no fun having two or three different versions of Visual Studio installed because some component requires an older version.

Comments closed

Azure Cortana Intelligence Suite Walkthrough

Rolf Tesmer gives us a high-level walkthrough of the Azure Cortana Intelligence Suite, using management of a wind turbine farm as an example problem:

Event Hub

What is it

https://azure.microsoft.com/en-us/services/event-hubs/

Fully Managed Service (PaaS) for ingesting events/messages at a massive scale (think telemetry processing from websites, IoT etc).

What does it do in our wind farm

Provides a “front door” to our wind farm application to accept all of the streaming telemetry being generated from the turbines.  Event Hubs won’t process any of this data per se – it’s just ensuring that it’s being accepted and queued (short term) while other components can come in to consume it.

Before you dig deeply into particular services, it’s nice to see how they fit together at a higher level.
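For a sense of what that “front door” looks like from the turbine side, here is a small sketch using the azure-eventhub Python SDK (version 5 or later) to push a single telemetry reading. The connection string, hub name, and JSON payload are all hypothetical; in the real architecture the readings stream in continuously from the turbines rather than from a script.

    import json
    from azure.eventhub import EventHubProducerClient, EventData

    # Hypothetical connection string and Event Hub name for the wind farm's ingestion endpoint.
    CONN_STR = "Endpoint=sb://windfarm.servicebus.windows.net/;SharedAccessKeyName=send;SharedAccessKey=..."
    producer = EventHubProducerClient.from_connection_string(CONN_STR, eventhub_name="turbine-telemetry")

    # One fabricated telemetry reading from a turbine.
    reading = {"turbineId": "T-017", "rpm": 14.2, "powerKw": 2150, "timestamp": "2016-06-01T10:15:00Z"}

    batch = producer.create_batch()
    batch.add(EventData(json.dumps(reading)))
    producer.send_batch(batch)
    producer.close()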

Comments closed

Mobile-Friendly Reports In Power BI

Reza Rad shares some tips on building mobile-friendly reports in Power BI:

The report fit on my mobile screen; however, when I see it on a smartphone, even with a 5-inch screen, it is too small! Text is not readable at that size, and bar or column charts are too small to be selected with a touch screen. When you design for smartphone sizes, consider making things bigger. Also, don’t use too many charts on one page, because it will make things small. A few charts on each page will keep things readable, and the user will be able to highlight them and select items.

You can use formatting to make your font sizes bigger and your chart titles bigger. However, there are some charts and some elements that can’t be resized (for example, labels inside a treemap, or labels for the x-axis in the column chart below). Make sure to design big and clear, with only a few visualization elements on each page. Here is what I built, and it shows on a mobile phone nicely:

The upshot is that dashboards are about where we’d want mobile development to be—easy to use and “just works”—but reports have a ways to go yet.

Comments closed

Subqueries In Spark 2.0

Davies Liu and Herman van Hövell discuss SQL subqueries in Apache Spark 2.0:

In the upcoming Apache Spark 2.0 release, we have substantially expanded the SQL standard capabilities. In this brief blog post, we will introduce subqueries in Apache Spark 2.0, including their limitations, potential pitfalls and future expansions, and through a notebook, we will explore both the scalar and predicate type of subqueries, with short examples that you can try yourself.

A subquery is a query that is nested inside of another query. A subquery as a source (inside a SQL FROM clause) is technically also a subquery, but it is beyond the scope of this post. There are basically two kinds of subqueries: scalar and predicate subqueries. And within scalar and predicate queries, there are uncorrelated scalar and correlated scalar queries and nested predicate queries respectively.

They also link to a Notebook which you can use to follow along.  If you’re interested in window functions, here are notes from Spark 1.4.
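Here is a short, self-contained sketch of the two subquery flavors the post describes, written in PySpark (which simply forwards the SQL to the same engine): an uncorrelated scalar subquery and an IN-style predicate subquery. The tiny products and sales tables are made up for illustration, and you need Spark 2.0 or later for these to run.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("subquery-demo").getOrCreate()

    spark.createDataFrame(
        [(1, "widget", 9.50), (2, "gadget", 24.00), (3, "gizmo", 3.25)],
        ["id", "name", "price"]).createOrReplaceTempView("products")
    spark.createDataFrame(
        [(1, 100), (3, 40)],
        ["product_id", "qty"]).createOrReplaceTempView("sales")

    # Uncorrelated scalar subquery: products priced above the overall average.
    spark.sql("""
        SELECT name, price
        FROM products
        WHERE price > (SELECT AVG(price) FROM products)
    """).show()

    # Predicate subquery: products that actually appear in the sales table.
    spark.sql("""
        SELECT name
        FROM products
        WHERE id IN (SELECT product_id FROM sales)
    """).show()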

Comments closed

Alternate Credentials

Daniel Hutmacher shows us various techniques for starting Management Studio under different Windows credentials:

The easy way to solve this is to just log on directly to the remote server using Remote Desktop and use Management Studio on that session, but this is not really desirable for several reasons: not only will your Remote Desktop session consume quite a bit of memory and server resources, but you’ll also lose all the customizations and scripts that you may have handy in your local SSMS configuration.

Your mileage may vary with these solutions, and I don’t have the requisite skills to elaborate on the finer points with regards to when one solution will work over another, so just give them a try and see what works for you.

I prefer Daniel’s second option, using runas.exe.
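If you want to try that approach, the typical invocation looks something like the line below; the domain, account, and SSMS path are hypothetical, and you will be prompted for the password when it runs. The /netonly switch applies the alternate credentials only to network access (such as the connection to the remote SQL Server) while SSMS itself runs under your local account.

    runas /netonly /user:CONTOSO\dba_admin "C:\Program Files (x86)\Microsoft SQL Server\130\Tools\Binn\ManagementStudio\Ssms.exe"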

Comments closed