Press "Enter" to skip to content

Curated SQL Posts

Pitfalls Of DIY Hadoop

Ben Davis discusses considerations when rolling your own Hadoop cluster:

5. Security hardening
I find it is easier to deploy Hadoop in a fairly low security configuration. This is because there are a range of ports that Hadoop talks on and having an incorrectly configured firewall can cause you problems. So after deployment, set aside time to identify how to customise your firewalls, user and group settings, Kerberos and ssl settings.

I think the article makes some good points.  DIY is great for a proof of concept or for playing around with a technology, but if you don’t already have a good amount of experience with a technology, you’ll probably make costly mistakes in development and administration.  This is not Hadoop-specific:  I’ve seen companies do terrible things to SQL Server because they didn’t know the correct way to do it but needed to get work done.  As part of a proof of concept, do all the terrible things you’d like; they’re how you’ll learn.  But if this is going to production, it’s a good idea to have people who know what they’re doing involved.

Comments closed

Credentials In Azure Automation

Mark Vaillancourt explains Azure automation credentials:

With that String parameters, it is easy to just type what I want in the textbox. But, for the PSCredential parameter at the bottom, I was unsure what to do. I figured, since this is just a textbox, I couldn’t just magically pass in a Credential Asset. So, I tried passing in an expression that used the Get-AzureRmAutomationCredential cmdlet that returns the Credential Asset specified. That didn’t work very well. I then started digging through documentation, figuring I would see references and examples of passing a Credential Asset in the Test Pane. It took me a bit, but I finally landed on an Azure.com blog post titled Azure Automation: Runbook Input, Output, and Nested Runbooks. While this post didn’t match all that closely with the search criteria around the Test Pane, it did contain my answer in a section on starting Runbooks:

The answer turns out to be pretty simple.

Comments closed

Quality Graphics With R

David Smith discusses building high-quality visuals with R:

Note the use of an attractive colour pallette, style-compatible fonts, and even the official Olympic icons for the sports. I just took a screenshot here, but if you click through to the actual site you’ll notice that these graphics are also scale-independent (you can zoom in on your browser and they’ll look better, not worse) and even interactive (pop-ups appear with country-specific data when you hover over a bar).

Duc-Quang has been generous enough to provide the R code behind these charts if you’d like to try your hand at something similar. The data themselves were scraped from the official Rio 2016 site. The bar charts were created using a standard geom_bar plot using ggplot2, with a custom theme to set the font to OpenSans Condensed. The interactive elements were added using the ggiraph package and the geom_bar_interactive function. The chart titles (including the icons) were created as HTML headers directly, which was then exported along with the interactive charts using the save_html function.

I’m impressed that this all comes from R.  There’s a good bit of work involved in getting this going, but you can get professional-grade graphics quality with R, and that’s pretty cool.

Comments closed

Issues With SSISDB In An Availability Group

Andrea Allred has some lessons learned from a troublesome service pack upgrade:

Here are a few of the fun errors that we saw.

“Script level upgrade for database ‘master’ failed because upgrade step ‘SSIS_hotfix_install.sql’ encountered error 942, state 4, severity 25. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the ‘master’ database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.”

There are some good lessons here.

Comments closed

Getting Started With Hadoop

Jon Morisi is looking at Hadoop:

From here I think I’ll start playing around on a sandbox.  Each of the distributions offers a way to spin up a VM or log into a cloud based environment.  There are also docker images out there. (search hadoop, cloudera, or hortonworks).  Most of these docker images look fairly new, so don’t cut yourself.

I’m looking at the HortonWorks distro, so I’ll probably setup a Hortonworks Sandbox.

Jon includes some resources he’s used to learn a bit about the topic.  I think he’s going down the right path with videos and mailing lists—the ecosystem changes too quickly for books to have much long-term value, and mailing lists & forums tend to be better for keeping up to date.  My biggest suggestion is to get case studies and play around.  Check out studies on e-mail ingest, real-time data analysis, and analyzing fantasy sports for starters.

Comments closed

Audit Select Statements

Jason Brimhall shows how to build an extended event session which audits all SELECT statements:

I have to be a little honest here. Prior to somebody asking how they could possibly achieve a statement audit via extended events, I had not considered it as a tool for the job. I would have relied on Audit (which is Extended Event related), or some home grown set of triggers. In this particular request, Audit was not fulfilling the want and custom triggers was not an option. Another option might have included the purchase of third party software but there are times when budget does not allow for nice expensive shiny software.

So, with a little prodding, I hopped into the metadata and poked around a bit to see what I could come up with to achieve this low-budget audit solution.

Read the whole thing.

Comments closed

Duplicate Statistics

Shaun J. Stuart discusses removing duplicate statistics:

I puzzled on this for a bit and got sidetracked by the strange way SSMS displays statistics columns on the Property page. Then it got to be the end of the day and I went home. The next day, I had a comment on my previous post from Aaron Bertrand who mentioned there is a related bug with the stats_column_id column of the sys.stats_columns view. It does not contain what the MSDN documentation says it contains. The Connect item for this, along with a workaround, can be found here.

The script I was using did not reference that column, but it did get me thinking that perhaps the script was not correctly identifying the first column in an index.

Shaun has an updated version of a duplicate statistics checker script that you may want to check out.

Comments closed

Azure SQL Data Warehouse Date Dimensions

Meagan Longoria shows how to create a date dimension in Azure SQL Data Warehouse:

Most data warehouses and data marts require a date dimension or calendar table. Those of us that have been building data warehouses in SQL Server for a while have collected our favorite scripts to build out a date dimension. For a standard date dimension, I am a fan of Aaron  Bertrand’s script posted on MSSQLTips.com. But the current version (as of Aug 8, 2016) of Azure SQL Data Warehouse doesn’t support computed columns, which are used in Aaron’s script.

Click through for the script.

Comments closed

Feed The CPUs

SQL Sasquatch is starting a new series on optimizing disk write to maximize CPU throughput:

When I work with SQL Server batch-controlled workflows, I use the theory “feed the CPUs”.  That’s the simplest positive adaptation I could come up with of Kevin Closson’s paradigm “Everything is a CPU problem” 🙂

What I mean by “Feed the CPUs” is that memory and disk response times are primary factors determining the maximum rate for the CPUs to process the data.  Nuts & bolts of such a model for SQL Server are slightly different than a similar model for Oracle.  SQL Server access to persistent data is always through database cache, while Oracle uses shared access to database cache in SGA and private access to persistent data through direct access in PGA.

Click through for more details.

Comments closed

Amazon Machine Learning

Ujjwal Ratan uses patient readmission data to demonstrate Amazon Machine Learning:

The Amazon ML endpoint created earlier can be invoked using an API call. This is very handy for building an application for end users who can interact with the ML model in real time.

Create a similar application and host it as a static website on Amazon S3. This feature of S3 allows you to host websites without any web servers and takes away the complexities of scaling hardware based on traffic routed to your application. The following is a screenshot from the application:

I think that Azure ML is still ahead of Amazon’s ML solution, but I’m happy to see the competition.

Comments closed