Scraping And Importing Web Data

Jon Morisi shows how to scrape a website and load the result into a SQL Server table:

Next save this as a csv file.

Now jump into SQL Server Management Studio, drill down to your database (you may want to create a new, empty database for your snarfing), right-click and start the Import and Export wizard, via “Import Data”:

This is the one-off solution.  If you need to do it regularly, read up on creating scrapers and use Integration Services to load.

Beware Manually Edited SQLPS Files

Cody Konior makes mention of an issue if you’ve manually edited your SqlPSPostscript.ps1 file:

The SQLPS module has been slow to load for years now and has finally been fixed in the April 2016 release. But most of us couldn’t wait a few years and edited the SqlPSPostscript.ps1 file that sits in the module directory.

If you’ve done this on one of the SSMS previews (not SQL 2014 as it’s in a different directory) and then upgraded to the April preview though you’ll be missing something, two things actually, being aliases for the renamed Encode-SqlName and Decode-SqlName, which won’t get put into your modified SqlPSPostScript.ps1 file. And so if you use these your older scripts may break.

There’s an easy fix, though, so all’s well.

Cool Storage

James Serra talks about “cool storage” in Azure Blob Storage:

The access tiers available for blob storage accounts are “hot” and “cold”.  In general, hot data is classified as data that is accessed very frequently and needs to be highly durable and available.  On the other hand, cool data is data that is infrequently accessed and long-lived.  Cool data can tolerate a slightly lower availability, but still requires high durability and similar time to access and throughput characteristics as hot data.  For cool data, slightly lower availability SLA and higher access costs are acceptable tradeoffs for much lower storage costs.  Azure Blob storage now addresses this need for differentiated storage tiers for data with different access patterns and pricing model.  So you can now choose between Cool and Hot access tiers to store your less frequently accessed cool data at a lower storage cost, and store more frequently accessed hot data at a lower access cost.  The Access Tier attribute of hot or cold is set at an account level and applies to all objects in that account.  So if you want to have both a hot access tier and a cold access tier, you will need two accounts.  If there is a change in the usage pattern of your data, you can also switch between these access tiers at any time.

It looks like there shouldn’t be a performance difference between the two;  it’s more of a cost difference in which you might be able to save money by choosing your tier wisely.

Integrating Lambda With Relational Databases

Bob Strahan shows how to integrate AWS Lambda with relational databases running on EC2:

Here are a few reasons why you might find this capability useful:

  • Instrumentation: Use database triggers to call a Lambda function when important data is changed in the database. Your Lambda function can easily integrate with Amazon CloudWatch, allowing you to create custom metrics, dashboards and alarms based on changes to your data.

  • Outbound streaming: Again, use triggers to call Lambda when key data is modified. Your Lambda function can post messages to other AWS services such as Amazon SQS, Amazon SNS, Amazon SES, or Amazon Kinesis Firehose, to send notifications, trigger external workflows, or to push events and data to downstream systems, such as an Amazon Redshift data warehouse.

  • Access external data sources: Call Lambda functions from within your SQL code to retrieve data from external web services, read messages from Amazon Kinesis streams, query data from other databases, and more.

  • Incremental modernization: Improve agility, scalability, and reliability, and eliminate database vendor lock-in by evolving in steps from an existing monolithic database design to a well-architected, modern microservices approach. You can use a microservices architecture to migrate business logic embodied in database procedures into database-agnostic Lambda functions while preserving compatibility with remaining SQL packages.

His specific example is around Oracle/Postgres, but I’d imagine you could do the same on SQL Server with the CLR.

Comparing Paths With Powershell

Derik Hammer introduces us to Join-Path:

It is obvious that the Join-Path method is easier to remember and faster to type. I use this method every time that I compare paths, even if they are full paths with file names.

The output of the Join-Path cmdlet can seem odd when using full paths but the comparison still functions properly.

This is pretty neat. The normal use of Join-Path is to combine a path with a filename to create a file path, but I like this usage of the cmdlet.

Linked Lists

Ewald Cress digs into linked lists to explain (deep) SQLOS internals:

The memory layout of a linked list doesn’t imply specific usage semantics. If we consistently insert at the head and remove from the tail, we have a queue. If we both insert and remove items from the head, we have a stack. And it is possible to have variations of these as well.

Finally, it is clear that insert and remove operations are multi-step, and the list is in an inconsistent state – i.e. not safe to traverse or modify – in the middle of such an operation. For this reason, locking semantics must be implemented. This will typically take the form of a spinlock which must be aquired before trying to access the list for any purpose. The object which owns the list head will then normally have a spinlock as a data member associated with the list head, although it is possible to have one spinlock protect multiple items beyond just a single linked list; this could be a sign of sane design, but conversely it means a coarser locking grain, which can sometimes work against you.

Even at this “simple” level, we’re digging pretty deep here.

Deadlocks Aren’t Blocks

Dave Mason shows the difference between a blocked process and a deadlocked process:

Sometimes our end users may not know the difference between a deadlock and blocking. In fact, there are some that may use these two interchangeably, thinking they are the same thing. A quick demonstration is often helpful to show the difference. Here’s a simple example that uses two global temp tables.

NOTE: I want to stress that this is merely a simple example, and not a deep dive or exhaustive article. Books have been written on concurrency, including one by Kalen Delaney (b|t). It’s free, and highly recommended.

Ultimately, blocked processes can complete (once the blocker finishes its work); a set of deadlocked processes can never complete, which is why (at least) one of the deadlocked processes needs to be rolled back.  I like Dave’s “show me” approach here.

Business Logic

Ed Elliott hits a classic architectural argument—whether business logic should be in stored procedures;

Stackoverflow is a specific use case and they decided to use .Net so they have a specific set of problems to deal with in terms of performance. They deploy (as I understand it) 10 times a day so if they need to change a query then they can quickly and easily – how quickly can you modify code and get it to production to fix a problem causing downtime on your mission critical app written in powerbuilder 20 years ago? (I jest but you get the point)

I like Ed’s back-and-forth arguing, as there are legitimate cases for both sides and the best answer almost always is somewhere in between for line of business apps.   I have three points that I tend to mention whenever this discussion comes up.

First, a lot of “business logic” is actually data logic.  Check constraints, foreign key constraints, unique key constraints, and even primary key constraints (for non-surrogate primary keys) are business rules, but they’re business rules around how the data is shaped and it’s a lot better to use your database system to maintain those rules.

Second, validation rules should be everywhere.  The fancy Javascript library should do validation, the server-side business logic should do validation, and the database should do validation.  You don’t know what’s going to skip one or more of these layers, and your database is the final gatekeeper preventing bad data from sneaking into your system.

Third, at the margin, go where your maintenance developers are most comfortable.  If they’re really good with C# but not good with SQL, the marginal business logic (the stuff you could really go either way on) should stay in the app tier; if your maintainers have really strong SQL skills but are lagging on the .NET side, I’d stick the marginal logic in stored procedures.


May 2016
« Apr Jun »