
Day: December 8, 2016

Querying Genomic Data With Athena

Aaron Friedman explains how to use Amazon Athena to query S3 files:

Recently, we launched Amazon Athena as an interactive query service to analyze data on Amazon S3. With Amazon Athena there are no clusters to manage and tune, no infrastructure to set up or manage, and customers pay only for the queries they run. Athena is able to query many file types straight from S3. This flexibility gives you the ability to interact easily with your datasets, whether they are in a raw text format (CSV/JSON) or specialized formats (e.g. Parquet). By being able to flexibly query different types of data sources, researchers can more rapidly progress through the data exploration phase for discovery. Additionally, researchers don’t have to know the nuances of managing and running a big data system. This makes Athena an excellent complement to data warehousing on Amazon Redshift and big data analytics on Amazon EMR.

In this post, I discuss how to prepare genomic data for analysis with Amazon Athena and demonstrate how Athena is well suited to common genomics query paradigms. I use the Thousand Genomes dataset, a seminal genomics study hosted on Amazon S3, to demonstrate these approaches. All code used as part of this post is available in our GitHub repository.

This feels a lot like a data lake PaaS process where they’re spinning up a Hadoop cluster in the background, but one which you won’t need to manage. Cf. Azure Data Lake Analytics.
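For a flavor of what this looks like in practice, here is a minimal sketch of the Athena (Presto/Hive) SQL involved; the database, table, bucket, and column names below are hypothetical stand-ins, not the post's actual Thousand Genomes schema:

    -- Hypothetical external table over Parquet files sitting in S3
    CREATE EXTERNAL TABLE demo.sample_variants (
        chromosome STRING,
        start_position BIGINT,
        end_position BIGINT,
        reference_allele STRING,
        alternate_allele STRING,
        sample_id STRING
    )
    STORED AS PARQUET
    LOCATION 's3://your-bucket/variants/';

    -- Ad hoc exploration straight against S3, no cluster required
    SELECT chromosome, COUNT(*) AS variant_count
    FROM demo.sample_variants
    GROUP BY chromosome
    ORDER BY variant_count DESC;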


Importing CSV Files In Power BI

Gil Raviv explains the new “combine binaries” feature of Power BI Desktop:

The Power BI team has recently released an enhanced “combine binaries” experience as part of November 2016 update to Power BI Desktop. (Jargon Alert:  “Combine Binaries” is a scary term.  Instead it should be named “Magically combine multiple files together into one table and make me SUPER happy.”)  The improved experience can drastically help you to import multiple Excel or other files from a folder and avoid writing advanced query functions. But today we will focus on a specific scenario, which is so common that it deserves this special post – Handling CSV files.

In fact, today’s blog post is actually the first post in “The CSV Series”. I hope you will enjoy it. To celebrate the November update of Power BI Desktop, we will review the improved experience, and will walk you through one of the most common scenarios that is now so easy to implement – Importing multiple CSV files from a folder, including parts of their filenames.

This looks very useful.
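For a sense of what the combine-binaries experience builds for you under the covers, here is a minimal hand-written Power Query (M) sketch; the folder path and CSV options are hypothetical, and it assumes all files share one schema:

    let
        // Hypothetical folder; point this at your own CSV drop
        Source = Folder.Files("C:\Sales\Monthly"),
        CsvOnly = Table.SelectRows(Source, each Text.Lower([Extension]) = ".csv"),
        // Parse each file's binary content as CSV, keeping the file name
        // so pieces of it (e.g. an embedded month) survive into the table
        Parsed = Table.AddColumn(CsvOnly, "Data",
            each Csv.Document([Content], [Delimiter = ",", Encoding = 65001])),
        Kept = Table.SelectColumns(Parsed, {"Name", "Data"}),
        Combined = Table.ExpandTableColumn(Kept, "Data",
            Table.ColumnNames(Kept{0}[Data]))
    in
        Combined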


Query Optimizer Hotfixes

SQL Scotsman covers the query optimizer hotfixes which you can turn on with trace flag 4199:

The query optimiser hotfixes contained under Trace Flag 4199 are intentionally not enabled by default. This means when upgrading from SQL Server 2008 R2 to SQL Server 2012, for example, new query optimiser logic is not enabled. The reason behind this, according to the article linked above, is to prevent plan changes that could cause query performance regressions. This makes sense for highly optimised environments where application-critical queries are tuned and rely on specific execution plans, and any change in query optimiser logic could potentially cause unexpected or unwanted query regressions.

Read the whole thing.
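As a quick reference, the flag can be turned on at several scopes; a short sketch (the table name is a placeholder):

    -- Instance-wide (requires sysadmin); affects all queries
    DBCC TRACEON (4199, -1);

    -- Per-query via a hint, handy for targeted regression testing
    SELECT COUNT(*)
    FROM dbo.Orders  -- hypothetical table
    OPTION (QUERYTRACEON 4199);

    -- SQL Server 2016 and later also offer a database-scoped switch
    ALTER DATABASE SCOPED CONFIGURATION SET QUERY_OPTIMIZER_HOTFIXES = ON;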


Switching In PowerShell

Chrissy LeMaire explains the switch statement in PowerShell:

Even less code and makes total sense. Awesome. There’s even more to switch — the evaluations can get full-on complex, so long as the evaluation ultimately equals $true. Take this example from sevecek. Well, his example with Klaas’ enhancement.

The refrain with switch is: make sure you cover every case and don’t let cases run when you don’t intend them to. Note that PowerShell’s switch doesn’t fall through in the C sense, but it does keep evaluating conditions after a match, so multiple blocks can run unless you break out explicitly.
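Here is a small sketch of the script-block style Chrissy describes: each case is an arbitrary expression, any case whose expression evaluates to $true runs, and break stops further evaluation when you only want the first match.

    $size = 42
    switch ($size) {
        { $_ -lt 10 }  { 'small';  break }   # break stops the remaining checks
        { $_ -lt 100 } { 'medium'; break }
        default        { 'large' }
    }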


Microsoft R Server 9.0

David Smith reports that Microsoft R Server 9.0 is now available:

Microsoft R Server 9.0, Microsoft’s R distribution with added big-data, in-database, and integration capabilities, was released today and is now available for download to MSDN subscribers. This latest release is built on Microsoft R Open 3.3.2, and adds new machine-learning capabilities, new ways to integrate R into applications, and additional big-data support for Spark 2.0.

There are also new versions of Microsoft R Client and Microsoft R Open.


Range-Based Dimensions

Jana Sattainathan has a couple of blog posts on range dimensions. The first covers durations:

The data is in increments of 300 seconds going from 0 to 31536000 seconds (1 year). So, this table can be used to analyze activities that take less than 1 year. The last row’s Dimension value should be used for everything that takes over one year (or you can generate more rows based on your need).

The second covers size ranges:

In the middle there, one of the bar charts is “Backup Count & Duration by Size”. As the title says, this chart helps me determine which backups are small/large and determine how many backups are in each of those “Size” buckets. The size buckets that I used in this case could easily have been changed from GB ranges to TB ranges. For example, I filtered the chart to check counts of backups that are over 1 TB. As one can see, I have a couple of databases that are in the 2.5 to 3 TB backup size range.

Oftentimes, ranges are enough for analysis, and the greater detail of a backup being 12.8 GB versus 12.81 GB obscures more useful information.
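As an illustration of how such a dimension might be generated (my own sketch, not Jana’s code), here is T-SQL that produces one row per 300-second bucket from zero up to one year:

    WITH n AS (
        SELECT TOP (31536000 / 300 + 1)
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS bucket
        FROM sys.all_objects AS a
        CROSS JOIN sys.all_objects AS b
    )
    SELECT bucket * 300       AS range_start_sec,
           bucket * 300 + 299 AS range_end_sec,
           CONCAT(bucket * 300, '-', bucket * 300 + 299, ' sec') AS duration_dimension
    FROM n;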


Using WinDocks

Andrew Pruski demonstrates WinDocks for people who don’t have Windows Server 2016 available:

So the first thing to do is get a new server with Windows Server 2012 R2 installed. Then once that’s up and running, you need to install SQL Server…

…wait, what??

The WinDocks software is different from the previous Docker software that we’ve worked with in that it needs an instance of SQL installed on the host in order to use its binaries to create SQL within the containers. The instance doesn’t need to be running; it just needs to be installed.

Check out WinDocks; it focuses on Dockerizing older versions of SQL Server.


Backup Encryption

Daniel Jones shows how to use backup encryption in SQL Server:

Backup encryption in SQL Server is needed for the following reasons:

  • Keeping database files secure: Encrypting SQL Server backup files protects this copy of your data, keeping transaction logs, tables, and other server data safe from anyone who would misuse them.

  • Accessible only by authorized people: An encrypted backup file cannot be restored without the certificate or asymmetric key used for encryption, so only authorized people who hold those credentials can restore the data with full access.

Encrypting backups (and storing the encryption key somewhere independent of the backups themselves) can help prevent a very bad day.
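For reference, a sketch of the moving parts in T-SQL; the database name, file paths, and passwords are all placeholders:

    USE master;
    -- One-time setup: a database master key and a certificate to encrypt with
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword1!>';
    CREATE CERTIFICATE BackupCert WITH SUBJECT = 'Backup encryption certificate';

    -- Take an encrypted (and compressed) backup of a hypothetical database
    BACKUP DATABASE SomeDatabase
    TO DISK = N'C:\Backups\SomeDatabase.bak'
    WITH COMPRESSION,
         ENCRYPTION (ALGORITHM = AES_256, SERVER CERTIFICATE = BackupCert);

    -- Back up the certificate and private key; store them apart from the backups
    BACKUP CERTIFICATE BackupCert
    TO FILE = N'C:\Keys\BackupCert.cer'
    WITH PRIVATE KEY (
        FILE = N'C:\Keys\BackupCert.pvk',
        ENCRYPTION BY PASSWORD = '<AnotherStrongPassword1!>');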
