Press "Enter" to skip to content

Day: February 26, 2020

Good Ideas for Designing Data Lakes

Prateek Shrivastava and Rangasayee Chandrasekaran share some advice on designing data lakes in the cloud:

Data generation and data collection across semi-structured and unstructured formats are both bursty and continuous. Inspecting, exploring and analyzing these datasets in their raw form is tedious, because the analytical engines scan the entire dataset across multiple files. We recommend five ways to reduce data scanned and reduce query overheads –

Click through for the details.
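
As a taste of the sort of technique they're describing, partitioning data on a commonly filtered column is one classic way to cut down on data scanned. Here's a minimal PySpark sketch of the idea — my own illustration, not from the post, with made-up paths and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-pruning-demo").getOrCreate()

# Hypothetical raw events sitting in a landing zone.
events = spark.read.json("s3://example-bucket/raw/events/")

# Writing the data partitioned by a commonly filtered column (here, event_date)
# lets the engine skip whole directories at query time.
(events
    .write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-bucket/curated/events/"))

# A query that filters on the partition column only reads the matching partitions,
# instead of scanning every file in the dataset.
one_day = (spark.read.parquet("s3://example-bucket/curated/events/")
                .where("event_date = '2020-02-26'"))
one_day.show()
```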


Using a Spark Listener

Bipin Patwardhan shares with us an event ingestion engine for Apache Spark:

In the last quarter of 2019, I developed a metadata-driven ingestion engine using Spark. The framework/library has multiple patterns to cater to multiple source and destination combinations. For example, two patterns are available for loading flat files to cloud storage (one to load data to AWS S3 and another to load data to Azure Blob).

As data loading philosophies have changed from Extract-Transform-Load (ETL) to Extract-Load-Transform (ELT), such a framework is very useful, as it reduces the time needed to set up ingestion jobs.

Is anyone else getting Integration Services or Informatica flashbacks? Because I sure am.
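
Bipin's framework is his own, but the general shape of a metadata-driven ingestion engine is easy to sketch: a registry maps (source, destination) combinations to loader functions, and each job is just a metadata record. The Python below is purely illustrative — the pattern names and job fields are invented, and it assumes nothing about his actual implementation:

```python
from typing import Callable, Dict, Tuple

# Registry of ingestion patterns: (source type, destination type) -> loader function.
PATTERNS: Dict[Tuple[str, str], Callable[[dict], None]] = {}

def pattern(source: str, destination: str):
    """Decorator that registers a loader for a source/destination combination."""
    def register(fn: Callable[[dict], None]) -> Callable[[dict], None]:
        PATTERNS[(source, destination)] = fn
        return fn
    return register

@pattern("flat_file", "s3")
def flat_file_to_s3(job: dict) -> None:
    # A real engine would copy and validate the file here.
    print(f"Loading {job['path']} to s3://{job['bucket']}/{job['key']}")

@pattern("flat_file", "azure_blob")
def flat_file_to_azure_blob(job: dict) -> None:
    print(f"Loading {job['path']} to Azure Blob container {job['container']}")

def run(job: dict) -> None:
    """Pick the loader from the job's metadata and execute it."""
    loader = PATTERNS[(job["source"], job["destination"])]
    loader(job)

# A job is just metadata; no per-job code is needed.
run({"source": "flat_file", "destination": "s3",
     "path": "/data/customers.csv", "bucket": "example-bucket", "key": "raw/customers.csv"})
```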


Searching for Functions in dbatools

Andy Levy wants to find a function in dbatools:

dbatools has a lot of functions. A lot. Over 550. There is a great command index on the website, and the documentation gets updated every time a new version is published. But sometimes, it feels like you can’t find what you need unless you already know the name of the thing you’re looking for. Other times, you might start writing your own wrapper around dbatools functions, or maybe start a new function from scratch, because it seems like the functionality you need isn’t there.

Read on to see how it works and a couple of examples.


Pivoting Data in R

Dave Mason shows how you can pivot SQL Server data using Machine Learning Services and R:

Pivoting data in SQL Server is something that comes up infrequently for me. Whenever the need arises, I have to pause and ask myself “What is it I’m trying to do again?”. Next I go to the documentation for the T-SQL PIVOT syntax (which I’ll never commit to memory) and realize for the umpteenth time the pivoted values have to be hard coded. Then I ponder using dynamic T-SQL because I won’t always know the values to pivot at query design time.

If T-SQL isn’t a good hammer to PIVOT’s nail, perhaps R is. There are different packages that make summarizing and transposing data frames somewhat easy. It can even dynamically pivot unknown values at runtime. But there is a catch, which I’ll get to in a bit.

This excerpt ends on a cliffhanger, so you’ll have to read Dave’s post to learn about the catch.
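
Dave's post works the example in R, but the contrast he draws with T-SQL's PIVOT — where the output columns must be hard-coded or generated via dynamic SQL — holds in any data frame library. As a rough analogy (Python/pandas here, with invented data, not Dave's R code):

```python
import pandas as pd

# Made-up sales data; the distinct quarters are not known ahead of time.
sales = pd.DataFrame({
    "salesperson": ["Ann", "Ann", "Bob", "Bob", "Bob"],
    "quarter":     ["Q1", "Q2", "Q1", "Q2", "Q3"],
    "amount":      [100,  150,  90,   120,  200],
})

# pivot_table discovers the column values ("Q1", "Q2", "Q3") at runtime,
# so nothing has to be hard-coded the way T-SQL's PIVOT requires.
pivoted = sales.pivot_table(index="salesperson",
                            columns="quarter",
                            values="amount",
                            aggfunc="sum",
                            fill_value=0)
print(pivoted)
```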


Displaying a Toast with Powershell

Jess Pomfret has some fun with dbatools and the BurntToast Powershell module:

I have just a quick tip for you today using the BurntToast module to notify us when a backup is complete. As DBAs there is always plenty to do, so we don’t want to have to sit and watch a long running script to catch the moment when it finishes.  Usually what happens to me is I kick off the script, move on to something else and then totally forget about it, perhaps until someone asks if it’s done yet. Oops. Well this tip will help avoid that.

Best used sparingly, but quite helpful when you’re working on a long-running task and need to get back to work (or Spider Solitaire).


Managing Performance on Azure SQL Managed Instances

Tim Radney has a few considerations for you if you want to start using Azure SQL Managed Instances:

Storage is a bit more difficult to plan and make considerations for, due to having to consider multiple factors. For storage you need to account for the overall storage requirement for both storage size, and I/O needs. How many GBs or TBs are needed for the SQL Server instance and how fast does the storage need to be? How many IOPS and how much throughput is the on-premises instance using? For that, you must baseline your current workload using perfmon to capture average and max MB/s and/or taking snapshots of sys.dm_io_virtual_file_stats to capture throughput utilization. This will give you an idea of what type of I/O and throughput you need in the new environment. Several customers I’ve worked with have missed this vital part of migration planning and have encountered performance issues due to selecting an instance level that didn’t support their workload.

Tim has a lot of good advice in here, so read the whole thing.
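
On the sys.dm_io_virtual_file_stats route Tim mentions, the arithmetic is simply the delta between two snapshots divided by the elapsed time. A quick sketch of that calculation — the snapshot numbers below are invented, and in practice you'd pull them from the DMV itself:

```python
# Two hypothetical snapshots of cumulative bytes read/written for a database file,
# as reported by sys.dm_io_virtual_file_stats, taken 300 seconds apart.
snapshot_1 = {"num_of_bytes_read": 52_000_000_000, "num_of_bytes_written": 9_000_000_000}
snapshot_2 = {"num_of_bytes_read": 61_500_000_000, "num_of_bytes_written": 12_600_000_000}
elapsed_seconds = 300

# Throughput over the interval is the change in total bytes divided by elapsed time.
bytes_per_second = (
    (snapshot_2["num_of_bytes_read"] - snapshot_1["num_of_bytes_read"])
    + (snapshot_2["num_of_bytes_written"] - snapshot_1["num_of_bytes_written"])
) / elapsed_seconds

print(f"Average throughput over the interval: {bytes_per_second / 1024 / 1024:.1f} MB/s")
```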


Finding Folder Sizes with Powershell

Jeffrey Hicks shows us a quick way to retrieve information on folder sizes:

It is simple enough to run Get-ChildItem and pipe the results to Measure-Object.

But this can often be time consuming. The other drawbacks, besides the typing, are that it takes extra work to format the results into something more user friendly, and that if I want to include hidden files, I have to remember to use -Force with Get-ChildItem.

Cmdlets are nice and convenient. And I always recommend to beginner or even intermediate scripters that if there is a cmdlet to use over the .NET Framework, use the cmdlet. But there are always exceptions, and as you grow in expertise, you’ll realize there are times when going directly to the .NET Framework is a better choice. Which is what I have done.

Click through for the solution.
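
Jeffrey's solution stays in PowerShell and reaches into the .NET Framework classes directly, so click through for that. Purely for comparison, the same "enumerate every file and sum the lengths" idea looks like this in Python — an analogy, not his code, and the path is a placeholder:

```python
import os

def folder_size(path: str) -> int:
    """Recursively sum file sizes under path; hidden files are included by default."""
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                total += entry.stat(follow_symlinks=False).st_size
            elif entry.is_dir(follow_symlinks=False):
                total += folder_size(entry.path)
    return total

# Hypothetical folder to measure.
size_bytes = folder_size(r"C:\Temp")
print(f"{size_bytes / 1024 / 1024:.2f} MB")
```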
