
Month: August 2016

Hadoop: DAS Or NAS?

Jagdish Mirani asks whether you should prefer Direct Attached Storage (DAS) or Network Attached Storage (NAS) for your Hadoop cluster:

If you want to spin up an Apache Hadoop cluster, you need to grapple with the question of how to attach your disks. Historically, this decision has favored direct attached storage (DAS). This approach is in keeping with the fundamental Hadoop principle of moving processing to where the data lives, thereby taking advantage of disk locality to optimize performance. Disk locality is so core to Hadoop that virtually any description of Hadoop starts with this.

The alternative is to use network attached storage (NAS). In contrast to DAS, NAS separates the compute and storage layers so that storage can be shared across a number of servers by shipping data over the network. Historically, this heavy dependence on the network made NAS an order of magnitude slower. Remember, the state of the art was 1GbE networks, and switches were slower and more expensive. I/O requirements for demanding Hadoop-based applications could only be met by DAS.

This is a very interesting discussion.  In my limited experience, I’ve had trouble selling operations teams on DAS, given the increased ops effort required to keep a bunch of attached disks going.  Hat tip Ari Amster.


VMware Configuration Reports

Allen McGuire has a few Reporting Services reports that he created against the vCenter database:

So you are a DBA and you are in a virtual environment – VMware in particular.  You are curious to know the health of the VMware hosts in terms of CPU and RAM, but you really don’t know how to get the data you need and you’re not certain if the information you are asking for is entirely accurate.  Well, chances are you have access to the VMware databases themselves – if that is the case, you can create these reports based on a blog post from Jonathan Kehayias: “Querying the VMware vCenter Database (VCDB) for Performance and Configuration Information.”

I have created five reports that are based on Jonathan’s queries and you can download the RDL for the SSRS reports below – enjoy!

Click through for the reports.
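If you want to poke at the underlying data before wiring up the reports, a query along these lines is a reasonable starting point.  This is only a sketch in the spirit of Jonathan’s queries; the view and column names below are illustrative assumptions and may differ across vCenter versions, so check the VPXV_* views in your own VCDB.

-- Hypothetical sketch: list vCenter hosts with basic CPU and memory capacity.
-- View and column names are assumptions; verify them against your VCDB schema.
SELECT
    h.NAME      AS HostName,
    h.CPU_COUNT AS CpuCoreCount,
    h.CPU_MHZ   AS CpuMhz,
    h.MEM_SIZE  AS MemoryBytes
FROM dbo.VPXV_HOSTS AS h
ORDER BY h.NAME;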


Calling Azure ML Web Services Using Data Factory

Ginger Grant shows how to call an Azure Machine Learning web service from within Azure Data Factory:

The Linked Service for ML is going to need some information from the Web Service: the URL and the API key. Chances are neither of these have been committed to memory; instead, open up Azure ML, go to Web Service and copy them. For the URL, look under the API Help Page grid, where there are two options, Request/Response and Batch Execution. Clicking on Batch Execution loads a new page, Batch Execution API Document. The URL can be found under Request URI. When copying the URL, you do not need to include any text after the word “jobs”. The rest of the URL, “?api-version=2.0”, is not needed; copying the entire URL will cause an error. Going back to the Web Services page, the API key appears on the dashboard section of Azure ML and there is a convenient button for copying it. Using these two pieces of information, it is now possible to create the Data Factory Linked Service to make the connection to the web service, which here I called AzureMLLinkedService.

Read the whole thing.


Bullet Charts

Devin Knight continues his custom visuals series:

The bullet chart is a variation of a bar graph but designed to address some of the problems that gauges have.

  • Allows you to split chart by categories

  • Visuals can be vertical or horizontal

Some of the visualizations in this series have been hit-or-miss for me.  I’m on the fence about bullet charts:  they seem potentially useful, but also rather dense.  I like my visuals to be self-explanatory, and I’d be concerned that if I showed this to management, I’d have to explain what’s going on in more detail than I’d like.


In Other News, Oracle Is Still Expensive

Joey D’Antoni compares licensing costs of SQL Server versus Oracle for one particular customer:

When I see those numbers in Microsoft marketing slides, I sometimes wonder if they can be real, but then I put these numbers together myself. Granted, you would get some discounts, but the fact that all of these features are built into SQL Server should convince you of the value SQL Server offers. Pricing discounts are generally similar between vendors, so that is not really a point of argument. If you are doing a really big Oracle deal you may see a larger upfront discount, but you will still be paying your 23% support fees on that very large list price. (Software Assurance from Microsoft will be around 20%, but from a much lower base.) Additionally, several of these features are available in SQL Server Standard Edition. None of these features are in Oracle’s Standard Edition.

In a follow-up post, Joey discusses Postgres:

Postgres is a really good database engine, with a rich ecosystem of developers writing code for it. SQL Server, on the other hand, is a mature product that has had a large push to support analytic performance and scale.

Additionally, this customer is leveraging the Azure ecosystem as part of their process, and that is only possible via SQL Server’s tight integration with the platform.

This isn’t a direct comparison to help determine in some absolute sense which product is better, but rather a look at a use case from one customer which takes advantage of many of the features in SQL Server.


Data Compression

Melissa Connors discusses compression options and gives examples of data which will compress and that which will not:

Page Compression is what I like to refer to as “compression for real this time,” as it goes well beyond the smart storage method of row compression and uses patterns/repeating values to condense the stored data.

First, to gain a better understanding of this method, check out a simple representation of a page of data. This is illustrated below in Figure 1. You’ll notice that there are some repeating values (e.g. SQLR) and some repeated strings of characters (e.g. SSSLL).

I really appreciate getting an idea of what kind of data does not compress well.  You’d think auto-incrementing numbers would be another scenario, but Melissa explains how that’s not necessarily the case.
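If you want to see what page compression would buy you on one of your own tables before committing to it, the standard estimate-then-rebuild approach is an easy test.  A minimal sketch; the table name below is hypothetical:

-- Estimate the space savings of PAGE compression for a hypothetical table.
EXEC sys.sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'SalesOrderDetail',
    @index_id         = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';

-- If the estimate looks worthwhile, rebuild the table and its indexes with PAGE compression.
ALTER TABLE dbo.SalesOrderDetail REBUILD WITH (DATA_COMPRESSION = PAGE);
ALTER INDEX ALL ON dbo.SalesOrderDetail REBUILD WITH (DATA_COMPRESSION = PAGE);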


Diagnosing Duplicate Records

Jesse Seymour walks through his process of finding and fixing unexpected duplicate key violations:

In this case, the error message is quite clear.  There is more than one row in the source (staging) that matches a single row in the target (data warehouse).  When we are warehousing data, we set up key fields that allow us to match up a record in staging to a record in the data warehouse.  In most systems, you can use the source system’s primary key to accomplish this.  After all, most systems use an RDBMS of some sort to store data.  However, in this case the source data is from a SharePoint list, and the only source key available is a list item ID.

So why are we not using that?  There is a very simple answer and that is because end users delete old data from the list, which can lead to a recycling of ID values from SharePoint.  If an ID gets recycled, then the data warehouse will improperly overwrite data in the fact table or discard the new row as a duplicate depending on how we configure the extract routine.

Figuring out the cause of the problem is a multi-step process, as Jesse shows.
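Before re-running the load, it helps to confirm exactly which source keys have more than one row in staging.  A minimal sketch, using hypothetical staging table and key column names:

-- Find source keys that appear more than once in the staging table (names are hypothetical).
SELECT
    s.ListItemID,
    COUNT(*) AS RowsPerKey
FROM staging.SharePointList AS s
GROUP BY s.ListItemID
HAVING COUNT(*) > 1
ORDER BY RowsPerKey DESC;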


Thought Processes Of Application Developers

If that’s not the academic version of a controversial headline, I don’t know what is…

Kendra Little finds a C# developer who wants to become a database administrator:

I’ve been a C# developer since year 2000. I want to move to be a DBA. I’ve started getting involved at user groups and SQL Saturdays but nobody wants to hire me as a DBA.

I have been trying to move to other companies but my resume is strongly inclined to show my C# and front-end experience. I know for a fact that I’m really good on SQL as I keep solving problems in every other project, but no one seems to really pay attention to the DB. I have noticed that when applying for positions I get called for my C# experience but not when applying only to SQL jobs.

Should I find a Junior DBA position and take a pay cut?

That transition can be difficult, but I think Kendra’s answer is a good one.

On the opposite side, Daniel Janik looks at developers who shouldn’t go down that track:

I recently helped out with a .NET MVC project running on SQL Server 2016 where I found some pretty interesting stored procedures. I’ve seen a lot of really creative SQL but these were completely puzzling.

The database included many-to-many tables for customers who have addresses and phone numbers. A “mapping” table was created for the tables so they could map to a customer.

Normally you’d think a simple JOIN would suffice to get a list of addresses or phone numbers for a customer. These were done in a way that I’ve never seen before.
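For reference, the “simple JOIN” Daniel alludes to would look something like this; the table and column names here are hypothetical:

-- Walk a many-to-many mapping table from customer to address (hypothetical schema).
SELECT
    c.CustomerID,
    a.AddressLine1,
    a.City
FROM dbo.Customer AS c
    INNER JOIN dbo.CustomerAddressMap AS cam
        ON cam.CustomerID = c.CustomerID
    INNER JOIN dbo.Address AS a
        ON a.AddressID = cam.AddressID;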



MariaDB Now Commercial

Simon Phipps reports that MariaDB is now a commercial product:

MariaDB Corp. has announced that release 2.0 of its MaxScale database proxy software is henceforth no longer open source. The organization has made it source-available under a proprietary license that promises each release will eventually become open source once it’s out of date.

MaxScale is at the pinnacle of MariaDB Corp.’s monetization strategy — it’s the key to deploying MariaDB databases at scale. The thinking seems to be that making it mandatory to pay for a license will extract top dollar from deep-pocketed corporations that might otherwise try to use it free of charge. This seems odd for a company built on MariaDB, which was originally created to liberate MySQL from the clutches of Oracle.

Interesting.


Image Processing In U-SQL

Rukmani Gopalan and Apostolos Lerios show how to perform image processing using U-SQL:

We have published C# libraries that supply UDOs and UDFs for processing images with U-SQL in our GitHub site. In this section, we introduce these UDOs and UDFs and, in the next section, we use them within a U-SQL walkthrough to operate on images.

The basic flow behind processing images in U-SQL has three stages:

  1. Use the custom UDO extractor ImageExtractor to read a (JPEG or non-JPEG) image file and return the image data as a byte[] column value which contains the same exact image as the file in an (always) JPEG representation. Please note that there is a current limitation in U-SQL that a row cannot exceed a size of 4 MB, so you will run into issues if your image size is greater than 4 MB.

  2. Use the image processing UDFs to manipulate this byte[] (the UDFs support JPEG and non-JPEG representations within this byte[] despite the previous step always producing a JPEG representation). For example, one UDF extracts metadata from an image to produce textual or numeric data. More interesting UDFs derive an output image from an input image; that output represents the visually transformed input (e.g. rotated or scaled/resized), also stored as a byte[] containing an (always) JPEG representation of the output.

  3. Use the custom UDO outputter ImageOutputter to write each byte[] to a JPEG image file so that we can view the output images of the aforementioned UDFs.

The major value proposition to me for U-SQL is “doing stuff SQL can’t do very well.”  This is one of those cases.
