Author: Kevin Feasel

Microsoft R Server On Spark

Max Kaznady, et al., discuss using Microsoft R Server on Spark to perform rapid prototyping against the NYC Taxi dataset:

Once the cluster is created, you can connect to the edge node, where MRS comes pre-installed, by SSHing to r-server.YOURCLUSTERNAME-ssh.azurehdinsight.net with the credentials you supplied during the cluster creation process. To do this in MobaXterm, go to Sessions, then New Sessions, and then SSH.

The default installation of an HDI Spark on Linux cluster does not come with RStudio Server on the edge node. RStudio Server is a popular open-source development environment for R that provides a browser-based IDE for remote clients. This tool lets you harness the full power of R, Spark, and the Microsoft HDInsight cluster through your browser. To install RStudio, you can follow the steps detailed in the guide, which reduce to running a script on the edge node.

If you’ve been meaning to get further into Spark & R, this is a great article to follow along with on your own.

Azure SQL Threat Detection

Ron Matchoro discusses use cases for Azure SQL Threat Detection:

Thanks to SQL Threat Detection, we were able to detect and fix code vulnerabilities to SQL injection attacks and prevent potential threats to our database. I was extremely impressed by how simple it was to enable the threat detection policy using the Azure portal, which required no modifications to our SQL client applications. A while after enabling SQL Threat Detection, we received an email notification about ‘An application error that may indicate a vulnerability to SQL injection attacks’. The notification provided details of the suspicious activity and recommended concrete actions to further investigate and remediate the threat. The alert helped me track down the source of my error and pointed me to the Microsoft documentation that thoroughly explained how to fix my code. As the head of IT for an information technology and services company, I now guide my team to turn on SQL Auditing and Threat Detection on all our projects, because it gives us another layer of protection and is like having a free security expert on our team.

Anything which helps kill SQL injection for good makes me happy.

Data Cleansing In SQLite

Allison Tharp wants to clear out kinda-sorta duplicates from a SQLite table:

However, now I have a lot of database entries that are unneeded.  I thought I would take the time to clean this up (even though I’ll no longer use the data and could easily just delete the tables).  For the BGG Hotness, I have the tables: hotgame, hotperson, and hotcompany.  I have 7,350 rows in each of those tables, since I collected data on 50 rankings every hour for just over 6 days.  However, since the BGG hotness rankings only update daily, I really only need 300 rows (50 rankings * 6 days = 300 rows).

I now think the rankings update between 3:00 and 4:00 AM, so I want to keep only the entries from 4:00 AM. I use the following SELECT statement to make sure I’m in the ballpark with the data I want to keep:
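Her actual statement isn’t reproduced in this excerpt, but a minimal sketch of the approach might look like the following. The timestamp column name (collect_ts) and its format are assumptions for illustration, not details from the post:

-- Hypothetical sketch, assuming each table stores its collection time as
-- ISO-8601 text ('YYYY-MM-DD HH:MM:SS') in a column named collect_ts.
-- First, eyeball the rows from the 4:00 AM hour:
SELECT *
FROM hotgame
WHERE strftime('%H', collect_ts) = '04';

-- Once that returns the right rows, invert the predicate to purge the rest:
DELETE FROM hotgame
WHERE strftime('%H', collect_ts) <> '04';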

There are several ways to solve this problem; this one is easy and works.  The syntax won’t work for all database platforms, but does the trick for SQLite.

Considerations For Azure SQL Database

Grant Fritchey discusses whether new database administrators might want to start with Azure SQL Database rather than on-premises SQL Server:

Since you are right at the start of your career, you may as well plan on maximizing the life of the knowledge and skills you’re building. By this, I mean spend your time learning the newest and most advanced software rather than the old approach. Is there still work for people who only know SQL Server 2000? Sure. However, if you’re looking at the future, I strongly advocate for going with online, cloud-based systems. This is because, more and more, you’re going to be working with online, connected applications. If the app is in the cloud, so should the data be. Azure and the technologies within it are absolutely the cutting edge today. Spending your limited learning time on this technology is an investment in your future.

This answer is a tougher call for me.  Looking at new database developers (or development DBAs or database engineers or whatever…), I think the case is pretty solid:  there’s so much skill overlap that it’s relatively easy to move from Azure SQL Database to on-prem.  With production DBAs, the story’s a little different:  as Grant mentions, this is a Platform as a Service technology, and so the management interface is going to be different.  There are quite a few commonalities (common DMVs, some common functionality), but Grant gives a good example of something which is quite different between the PaaS offering and the on-prem offering:  database backup and restoration.  I think the amount of skills transfer is lower, and so the question becomes whether the marginal value of learning PaaS before IaaS/on-prem is high enough.  Given my (likely biased) discussions of Azure SQL Database implementations at companies, I’d stick with learning on-prem first because you’re much more likely to find a company with an on-prem SQL Server installation than an Azure SQL Database.

Filtering Data

Slava Murygin shows various ways to filter data, in particular data from fn_dblog():

Say we need to extract the information associated with an “UPDATE” for LSNs starting at “0000004f:00000087:0001”. You can just specify starting and ending LSNs as “fn_dblog” parameters:

[…]

That portion of code would return ONLY the log records between LSNs “0000004f:00000087:0001” and “0000004f:00000088:0001”.
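As a rough sketch of the call (fn_dblog() is undocumented, so treat this as illustrative; the LSN values come straight from the post, and the 0x-prefixed hex string is how the parameters are typically passed):

-- Sketch: have fn_dblog scan only the LSN range of interest, then filter
-- down to the operation that represents the UPDATE itself.
SELECT [Current LSN], Operation, Context, [Transaction ID]
FROM fn_dblog('0x0000004f:00000087:0001', '0x0000004f:00000088:0001')
WHERE Operation = N'LOP_MODIFY_ROW';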

Slava’s post uses fn_dblog() as an example but the techniques are applicable across the board, and in practice sum up to “get the fewest number of rows and fewest number of columns you need to solve the problem at hand.”

Identifying Object Names Using fn_dblog()

Frank Gill digs into the transaction log to find object names used in CREATE and ALTER statements:

This is the object id of the view that was created.  So, Jes’s question was answered.  But this led me to one of my other favorite SQL Server topics: string manipulation.  The following script will identify all transactions for a particular Transaction Name and return the object name affected.  The comments provide additional information about the functionality.
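As a bare-bones starting point (this is not Frank’s script, just the general shape of querying the log for named transactions):

-- Rough sketch: list DDL transactions by name. [Transaction Name] is only
-- populated on the LOP_BEGIN_XACT record, so filter on that operation.
-- Exact transaction names vary by statement; inspect the distinct values first.
SELECT [Transaction ID], [Transaction Name], [Begin Time]
FROM fn_dblog(NULL, NULL)
WHERE Operation = N'LOP_BEGIN_XACT'
  AND ([Transaction Name] LIKE N'CREATE%' OR [Transaction Name] LIKE N'ALTER%');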

Click through to check out Frank’s script.

Joins Versus NOT IN Clause

Kevin Hill explains a potential performance difference between using NOT IN and using a left join:

Basic stuff, right?   Both will return 951 records (books) that I do not own.  And, very quickly…because the tables are tiny.   Sub-1 second is fast.

The issue here is HOW the rows are compared.

English version now, techy stuff later:

In the first query, this is equivalent to you standing at the bookstore and calling home to have someone check to see if the book in your hand is already in your collection.  EVERY time.  One by one.

In the second, you got really smart and brought a list with you, which you are comparing to the books on the shelf at the store.   You’ve got both “lists” in one place, so it is far more efficient.

Even in the case with a few hundred records, you can see why there’d be a performance difference.
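For reference, the two query shapes being compared look something like this (table and column names are invented for illustration):

-- Hypothetical tables: dbo.StoreBooks is the shelf; dbo.MyBooks is my collection.

-- Shape 1: NOT IN, checking my collection for each book, one at a time.
SELECT s.BookID, s.Title
FROM dbo.StoreBooks AS s
WHERE s.BookID NOT IN (SELECT m.BookID FROM dbo.MyBooks AS m);

-- Shape 2: LEFT JOIN, lining the two "lists" up side by side and keeping
-- only the books with no match in my collection.
SELECT s.BookID, s.Title
FROM dbo.StoreBooks AS s
    LEFT JOIN dbo.MyBooks AS m
        ON m.BookID = s.BookID
WHERE m.BookID IS NULL;

One extra caveat worth knowing: if MyBooks.BookID is nullable and contains a NULL, the NOT IN form returns no rows at all, which is one more reason to prefer the join (or NOT EXISTS) pattern.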

Power BI Embedded

Reza Rad looks at Power BI Embedded:

Power BI Embedded is the reporting and analysis solution for mobile and web applications. Power BI Embedded is an Azure service that integrates a Power BI solution into mobile and web applications. The report still has to be authored and created in Power BI Desktop. After creating the report, you publish it to a Power BI workspace in Azure; then, using the workspace’s API keys and embedding the Power BI report frame into the web or mobile application, the report becomes part of the application.

Reza walks through the process step-by-step.  The upshot is that you can take this report you created in Power BI and embed it into your own application, where you can apply your own in-app access controls.  There are limitations, which Reza spells out at the end of the post, so check it out.

Compression Delay

Rob Farley digs into Compression Delay as part of real-time operational analytics:

The thing with Operational Analytics is that the analytical data, reporting data, warehouse-style data, is essentially the same data as the transactional data. Now, it doesn’t look quite the same, because it hasn’t been turned into a star schema or been given slowly changing dimension considerations, but for the purposes of seeing what’s going on, it’s data that’s capable of handling aggregations over large amounts of data. It’s columnstore.

Now, columnstore data isn’t particularly suited to transactional data. Finding an individual row within columnstore data can be tricky, and it’s much more suited to rowstore. So when data is being manipulated quite a lot, it’s not necessarily that good to be using columnstore. Rowstore is simply better for this.

But with SQL 2016, we get updateable non-clustered columnstore indexes. Data which is a copy of the underlying table (non-clustered data is a copy – clustered data or heap data is the underlying table). This alone presents a useful opportunity, as we can be maintaining a columnstore copy of the data for analytics, while handling individual row updates in the rowstore.
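To make the 2016 piece concrete, here is a minimal sketch of such an index, including the compression delay setting Rob digs into (table and column names are illustrative):

-- Sketch: an updateable nonclustered columnstore index over a transactional
-- table. COMPRESSION_DELAY keeps freshly modified rows in the delta rowstore
-- for at least 60 minutes before they are compressed into columnstore segments.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Orders
    ON dbo.Orders (OrderID, CustomerID, OrderDate, Quantity, Price)
    WITH (COMPRESSION_DELAY = 60 MINUTES);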

Read the whole thing.

Pitfalls Of DIY Hadoop

Ben Davis discusses considerations when rolling your own Hadoop cluster:

5. Security hardening
I find it is easier to deploy Hadoop in a fairly low security configuration. This is because there is a range of ports that Hadoop talks on, and having an incorrectly configured firewall can cause you problems. So after deployment, set aside time to identify how to customise your firewalls, user and group settings, and Kerberos and SSL settings.

I think the article makes some good points.  DIY is great for a proof of concept or for playing around with a technology, but if you don’t already have a good amount of experience with a technology, you’ll probably make costly mistakes in development and administration.  This is not Hadoop-specific:  I’ve seen companies do terrible things to SQL Server because they didn’t know the correct way to do it but needed to get work done.  As part of a proof of concept, do all the terrible things you’d like; they’re how you’ll learn.  But if this is going to production, it’s a good idea to have people who know what they’re doing involved.
