June 2020 – Page 9 – Curated SQL

LSTM could not process a single data point. it needs a sequence of data for processing and able to store historical information. LSTM is an appropriate algorithm to make prediction and process based-on time-series data. It’s better to work on the regression problem.
The stock market has enormously historical data that varies with trade date, which is time-series data, but the LSTM model predicts future price of stock within a short-time period with higher accuracy when the dataset has a huge amount of data.

Click through for the process and a demo.

Comments closed

Understanding the RESOURCE_GOVERNOR_IDLE Wait Type in Azure

Published 2020-06-16 by Kevin Feasel

Josh Darnell does some sleuthing:

With a big gap between CPU and elapsed time, it’s often worthwhile to check wait statistics. If the query was running, but not using CPU, it seems reasonable that it was waiting on something. Normally, with on-prem SQL Server, you’d have to check sys.dm_os_wait_stats, and take a diff of the cumulative values before and after.
However, thanks to (relatively) recent enhancements to execution plans (which keep getting better and better!), we can see a subset of what resources the query waited on right in the plan.
Looking at the plan from my Azure query, here’s what I see:
<Wait WaitType="SOS_SCHEDULER_YIELD" WaitTimeMs="5733" WaitCount="323" /> <Wait WaitType="RESOURCE_GOVERNOR_IDLE" WaitTimeMs="5545" WaitCount="430" />
Notice that there were 5.5 seconds of RESOURCE_GOVERNOR_IDLE waits during this query. That explains the 5 second gap in CPU and elapsed time. But what does it mean?

Click through to learn more about this in the context of Azure SQL Database.

Comments closed

Tips for Securing a SQL Server Instance

Published 2020-06-16 by Kevin Feasel

Eitan Blumin takes us through some ways of protecting an instance of SQL Server:

After reading up on a bunch of SQL Server penetration testing articles, I found that the steps of a common penetration test are as follows:
1. Discovery
2. Gaining Access
3. Elevating Permissions
4. Attacking (Loot / Destruction)
Logically, these steps mimic the steps taken by a common would-be hacker (except, of course, they try not to actually damage anything).
I’ll briefly describe each step from the point of view of a hacker or penetration tester, the common methodologies of each step, and offer recommendations that we can follow to protect our database systems at every level.

Click through for information on each step.

Comments closed

The Table Scan Operator

Published 2020-06-16 by Kevin Feasel

Hugo Kornelis dives into a common operator:

The Table Scan operator is used to read all or most data from a table that has no clustered index (also known as a heap table, or just as a heap). In combination with a Top operator, it can also be used to read just a few rows from a heap table when data order is irrelevant and there is no nonclustered index that covers all required columns.
The basic behavior of a Table Scan operator is very similar to that of the Index Scan operator when it chooses to do an IAM scan, but with a few very important differences. A heap table has no root, intermediate, and leaf level pages; it has data pages only. Each page read from the IAM is a data page and can be processed. But rows on a data page of a heap table can contain forwarding pointers, that cause out of order data access.

I’d say something like “I hope you don’t have too many table scans” because that means a lot of heaps, though given the use of temp tables without clustered indexes, even that statement failed the nuance test.

Comments closed

Installing Azure Data Studio on CentOS

Published 2020-06-16 by Kevin Feasel

Sreekanth Bandarla walks us through installation and usage of Azure Data Studio on CentOS:

Okay…now what? Where to locate the executable and how do I open Azure data studio in CentOS? GUI in CentOS is not as user friendly as you can see in some other Linux OSs (Mint for eg or few other Ubuntu flavors of Linux). In windows you can locate the program in start menu or even in few desktop experience Linux distributions it’s extremely easy to just search in application center, but that was not the case for me in CentOS 7.

Click through to see how to install and open ADS on a Red Hat-based system.

Comments closed

Converting XLS Files to XLSX with Power Automate

Published 2020-06-16 by Kevin Feasel

Erik Svensen shows how you can create a Power Automation flow to convert old-style Excel files (in .xls format) to newer-style Excel files (.xlsx) via a web service:

In the scenario I will use a trigger when an e-mail is received and use a rest API provided by https://cloudconvert.com/.
OBS – This is a paid service where you pay by the minute the conversion takes – price from $0.02 to $0.01 per minute.

Check out the comments for some additional information about the web service, including a free tier.

Comments closed

The Basics of A/B Testing with R

Published 2020-06-15 by Kevin Feasel

Holger von Jouanne-Diedrich walks us through a simple example of A/B testing and analysis using R:

The bad news is, that you have to understand a little bit about statistical hypothesis testing, the good news is that if you read the following post, you have everything you need (plus, as an added bonus R has all the tools you need already at hand!): From Coin Tosses to p-Hacking: Make Statistics Significant Again! (ok, reading it would make it over one minute…).

Check out that article and the example in the blog post as well. R makes it really easy to perform this sort of analysis.

Comments closed

Smoothing Out Write Behavior in Apache Flink

Published 2020-06-15 by Kevin Feasel

Dmitry Tolpeko solves an interesting problem:

It would be nice to smooth S3 write operations between two checkpoints. How to do that?
You may have already noticed there are 3 single PUT operations above made at 37:02, 37:06 and 37:09 before the checkpoint. The write size can give you a clue, it is a single part of multi-part upload to S3.
So some data sets were quite large so their data spilled before the checkpoint. Note that this is the internal spill in S3, data will not be visible until committed upon the successful Flink checkpoint.
So how can we force more writes to happen before the checkpoint so we can smooth IOPS and probably reduce the overall checkpoint latency?

Read on for the answer.

Comments closed

Simplified Slope Graphs

Published 2020-06-15 by Kevin Feasel

Andy Kirk spots a few interesting uses of slope graphs:

As somebody who tries to consume as much visualisation work as possible, I always get a little extra joy from seeing clusters of the same techniques emerging. One such recent trend has been the use of simplified slope graphs.
By ‘simplified’ I mean they are stripped right back to a simple function of just showing the direction of change between two points in time, there are no axes and no other chart apparatus, just the trends.

I’m kind of iffy on it. I do like the map showing behavior of states over time, but the first visual had too much going on and the third visual had too much whitespace for my taste.

Comments closed

Choosing an Algorithm for Table.Join in Power Query

Published 2020-06-15 by Kevin Feasel

Chris Webb continues a series on optimizing merge performance in Power Query:

The first thing to say is that if you don’t specify a join algorithm in the sixth parameter of Table.Join (it’s an optional parameter), Power Query will try to decide which algorithm to use based on some undocumented heuristics. The same thing also happens if you use JoinAlgorithm.Dynamic in the sixth parameter of Table.Join, or if you use the Table.NestedJoin function instead, which doesn’t allow you to explicitly specify an algorithm.
There are going to be some cases where you can get better performance by explicitly specifying a join algorithm instead of relying on JoinAlgorithm.Dynamic but you’ll have to do some thorough testing to prove it. From what I’ve seen there are lots of cases where explicitly setting the algorithm will result in worse performance, although there are enough cases where doing so results in better performance to make all that testing worthwhile.

That behavior is the same as if you decided to start writing INNER LOOP JOIN or INNER HASH JOIN for your queries. In the right spot, you may have knowledge the optimizer doesn’t have and can make performance faster, but a lot of the time the approach will be a bit too heavy-handed and end up a net degradation of performance.

Comments closed

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

Month: June 2020

Stock Price Predictions with LSTM Models

Understanding the RESOURCE_GOVERNOR_IDLE Wait Type in Azure

Tips for Securing a SQL Server Instance

The Table Scan Operator

Installing Azure Data Studio on CentOS

Converting XLS Files to XLSX with Power Automate

The Basics of A/B Testing with R

Smoothing Out Write Behavior in Apache Flink

Simplified Slope Graphs

Choosing an Algorithm for Table.Join in Power Query