Press "Enter" to skip to content

Curated SQL Posts

Data Masking with Powershell

Jeff Hicks wants to stop a shoulder-surfing attack:

If I print this or send it to a file, I might not want the computername to be shown. Or maybe even my name. I want PowerShell to handle this for me. In short, I need to replace strings like ‘Prospero’ and ‘Jeff Hicks’ with some like XXX and ‘Joe Doe’.

PowerShell has a Replace operator. Or I can use regular expressions. Naturally, I wenbt with the latter.

Read on to see what you’d need to do.

Comments closed

Modifying Existing Objects with Powershell

Robert Cain dives a bit further into objects in Powershell:

So far we’ve covered a lot in this series on PowerShell Objects. We began by creating a class using the class keyword introduced in PowerShell 5. We then looked at using PSCustomObject to create new objects, then add methods to those objects. In the previous post, we used C# code to create our PowerShell objects.

In this installment, we’ll see how to add properties and methods to existing objects created by someone else. For this example we’ll use objects returned by the Get-ChildItem cmdlet, but we could use any objects.

Click through to see what you can do.

Comments closed

Making an R Box Plot from a Picture

Tomaz Kastrun builds a plot:

We create a raster image from a picture and calculating the ratio of the pixels on the scale of grayscale. The more the darker colour is represented in the pixels, the bigger the value. And this value is converted into the vector of values. And each vector is represneted as a violin boxplot.

Click through for an example.

Comments closed

Choosing between Duplicate Indexes

David Alcock wants to know what choice you make when all choices lead to the same conclusion:

As there is an index that has the LastName and FirstName columns the optimiser has opted for an index seek operator using the IX_Person_LastName_FirstName_MiddleName index, and if I look into the Plan XML I can see that it’s using a trivial plan: StatementOptmLevel=”TRIVIAL”.
This basically means there’s one obvious way to return the query results so the optimiser has avoided the cost of going through full optimisation and has elected to use this plan straightaway.
So what happens if I create an identical copy of that particular index, in fact let’s create five indexes that are exactly the same:

There’s a Mass Effect 3 joke in my intro line. But read on for the answer and also check out Barney Lawrence’s comment for a bit more elucidation.

Comments closed

get_json_object and json_tuple in Hive

The Hadoop in Real World team looks at a pair of Hive functions:

Both get_json_object and json_tuple functions in Hive are meant to work with JSON data in Hive. 

Let’s create a table with some JSON data to work with. We have created a table named hirw_courses and loaded the below JSON text into the table.

Click through for that script as well as seeing how you use each of get_json_object() and json_tuple() and which might be better-suited for your purposes.

Comments closed

Ranking Data in Spark

Landon Robinson continues the Spark Starter Guide:

Ranking is, fundamentally, ordering based on a condition. So, in essence, it’s like a combination of a where clause and order by clause—the exception being that data is not removed through ranking , it is, well, ranked, instead. While ordering allows you to sort data based on a column, ranking allows you to allocate a number (e.g. row number or rank) to each row (based on a column or condition) so that you can utilize it in logical decision making, like selecting a top result, or applying further transformations.

One very common ranking function is row_number(), which allows you to assign a unique value or “rank” to each row or rows within a grouping based on a specification. That specification, at least in Spark, is controlled by partitioning and ordering a dataset. The result allows you, for example, to achieve “top n” analysis in Spark.

One minor adjustment I’d make is not calling the output of ROW_NUMBER() “Rank” because then it’d make me think that’s the output of the RANK() window function. In the event of ties, those two outputs will differ.

Comments closed

Automating Power BI Data Model Metadata Extraction

Gerhard Brueckl avoids manual processes:

In the past I have been working on a lot of different Power BI projects and it has always been (and still is) a pain when it comes to the deployment of changes across multiple tiers(e.g. Dev/Test/Prod). The main problem here being that a file generated in Power BI desktop (.pbix) is basically a binary file and the metadata of the actual data model (BIM) cannot be easily extracted or derived. This causes a lot of problems upstream when you want to automate the deployment using CI/CD pipelines. Here are some common approaches to tackle these issues:

Click through to see several bad to palatable options and then check out Gerhard’s solution, which is significantly better. CI/CD is a huge pain point for Power BI developers but people like Gerhard are doing what they can to help.

Comments closed

Workers and Requests in Azure SQL Database

Kendra Little has a documentation change of note:

We now explicitly define ‘requests’ and ‘workers’ in the Azure SQL Database documentation, and we’ve cleaned up multiple places where we used to equate the two terms. In this post, I share the history of the two terms when it comes to Azure SQL Database, why the two were ever equated, and why things like this are tricky to change.

There’s a bit of behind-the-scenes around documentation work as well and the types of challenges you might run into when developing software for the public.

Comments closed