2020-06-16 – Curated SQL

In this post, we explained how the redesigned Flink interpreter works in Zeppelin 0.9.0 and provided some examples for performing streaming ETL jobs with Flink and Zeppelin. In the next post, I will talk about how to do streaming data visualization via Flink on Zeppelin. Besides that, you can find an additional tutorial for batch processing with Flink on Zeppelin, as well as posts on using Flink on Zeppelin for more advanced operations like resource isolation, job concurrency and parallelism, multiple Hadoop and Hive environments, and more in our series of posts on Medium. And here’s a list of Flink on Zeppelin tutorial videos for your reference.

Click through for the demo, and stay tuned for part 2.

The Architecture of Apache Kafka

Published 2020-06-16 by Kevin Feasel

Michael Carter walks us through the key components and concepts behind Apache Kafka:

Despite its name’s suggestion of Kafkaesque complexity, Apache Kafka’s architecture actually delivers an easier-to-understand approach to application messaging than many of the alternatives. Kafka is essentially a commit log with a very simple data structure. It just happens to be an exceptionally fault-tolerant and horizontally scalable one.
The Kafka commit log provides a persistent ordered data structure. Records cannot be directly deleted or modified, only appended onto the log. The order of items in Kafka logs is guaranteed. The Kafka cluster creates and updates a partitioned commit log for each topic that exists. All messages sent to the same partition are stored in the order that they arrive. Because of this, the sequence of the records within this commit log structure is ordered and immutable. Kafka also assigns each record a unique sequential ID known as an “offset,” which is used to retrieve data.

Read on to learn more about the key ideas, such as producers, consumers, and partitions.
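
As a rough illustration of the commit-log idea in the excerpt, here is a minimal in-memory sketch. It is not Kafka's implementation or client API: the `Partition` and `Topic` classes and the byte-sum partitioner are hypothetical names invented for this example. It only shows that records are appended, never changed, and addressed by a sequential offset within a partition.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Append-only list of records; a record's offset is its list index."""
    records: list = field(default_factory=list)

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # offset of the record just written

    def read(self, offset, max_records=10):
        # Consumers read forward from an offset; nothing is modified or removed.
        return self.records[offset:offset + max_records]


class Topic:
    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def send(self, key, value):
        # Hypothetical partitioner (an assumption of this sketch): the same key
        # always maps to the same partition, so records for one key keep their order.
        index = sum(key.encode()) % len(self.partitions)
        return index, self.partitions[index].append(value)


topic = Topic(num_partitions=3)
first = topic.send("order-42", "created")
second = topic.send("order-42", "paid")

print(first, second)  # (partition index, offset) for each record
print(topic.partitions[first[0]].read(offset=0))
```

Both records land in the same partition with consecutive offsets, and reading from offset 0 returns them in arrival order.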

Stock Price Predictions with LSTM Models

Published 2020-06-16 by Kevin Feasel

Thenuja Shanthacumaran walks us through training a Long Short-Term Memory neural network model for predicting stock prices:

An LSTM cannot work from a single data point. It needs a sequence of data to process, and it is able to store historical information. That makes LSTM an appropriate algorithm for processing time-series data and making predictions from it, and it is better suited to regression problems.
The stock market has an enormous amount of historical data that varies with trade date, which makes it time-series data, and the LSTM model predicts the future price of a stock over a short time period with higher accuracy when the dataset is very large.

Click through for the process and a demo.
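
To make the "sequence of data" point concrete, here is a minimal sketch of the usual windowing step that turns a price series into (input sequence, next value) pairs before training. The function name `make_windows`, the window length, and the sample prices are assumptions made for illustration; they are not taken from the linked post, and no model is trained here.

```python
def make_windows(prices, window):
    """Split a price series into (input sequence, next value) training pairs."""
    if window < 1 or window >= len(prices):
        raise ValueError("window must be at least 1 and shorter than the series")
    sequences, targets = [], []
    for start in range(len(prices) - window):
        sequences.append(prices[start:start + window])  # the model's input
        targets.append(prices[start + window])          # the value to predict
    return sequences, targets


# Made-up closing prices, purely for illustration.
closing_prices = [10.0, 10.5, 10.2, 10.8, 11.1, 10.9]
X, y = make_windows(closing_prices, window=3)

for sequence, target in zip(X, y):
    print(sequence, "->", target)
```

Each training example is a short run of consecutive prices plus the price that followed it, which is why a longer history yields more examples to learn from.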

Understanding the RESOURCE_GOVERNOR_IDLE Wait Type in Azure

Published 2020-06-16 by Kevin Feasel

Josh Darnell does some sleuthing:

With a big gap between CPU and elapsed time, it’s often worthwhile to check wait statistics. If the query was running, but not using CPU, it seems reasonable that it was waiting on something. Normally, with on-prem SQL Server, you’d have to check sys.dm_os_wait_stats, and take a diff of the cumulative values before and after.
However, thanks to (relatively) recent enhancements to execution plans (which keep getting better and better!), we can see a subset of what resources the query waited on right in the plan.
Looking at the plan from my Azure query, here’s what I see:
<Wait WaitType="SOS_SCHEDULER_YIELD" WaitTimeMs="5733" WaitCount="323" />
<Wait WaitType="RESOURCE_GOVERNOR_IDLE" WaitTimeMs="5545" WaitCount="430" />
Notice that there were 5.5 seconds of RESOURCE_GOVERNOR_IDLE waits during this query. That explains the 5-second gap between CPU and elapsed time. But what does it mean?

Click through to learn more about this in the context of Azure SQL Database.
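
The two approaches mentioned in the excerpt can be sketched in a few lines. This is an illustration under assumptions, not a tool from the linked post: the `<Waits>` wrapper element, both function names, and the snapshot numbers are hypothetical, and a real execution plan may use XML namespaces that would require adjusting the tag lookup.

```python
import xml.etree.ElementTree as ET

# The two <Wait> elements quoted above, wrapped in a hypothetical <Waits> root
# so that the fragment parses as a single XML document.
PLAN_FRAGMENT = """
<Waits>
  <Wait WaitType="SOS_SCHEDULER_YIELD" WaitTimeMs="5733" WaitCount="323" />
  <Wait WaitType="RESOURCE_GOVERNOR_IDLE" WaitTimeMs="5545" WaitCount="430" />
</Waits>
"""


def waits_from_plan(xml_text):
    """Return {wait type: wait time in ms} from the <Wait> elements."""
    root = ET.fromstring(xml_text.strip())
    return {w.get("WaitType"): int(w.get("WaitTimeMs")) for w in root.iter("Wait")}


def diff_snapshots(before, after):
    """Per-wait-type increase between two cumulative snapshots (ms)."""
    deltas = {name: total - before.get(name, 0) for name, total in after.items()}
    return {name: delta for name, delta in deltas.items() if delta > 0}


plan_waits = waits_from_plan(PLAN_FRAGMENT)
for name, ms in sorted(plan_waits.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {ms / 1000:.3f} s")

# Hypothetical cumulative totals captured before and after running a query.
before = {"SOS_SCHEDULER_YIELD": 200, "RESOURCE_GOVERNOR_IDLE": 1000}
after = {"SOS_SCHEDULER_YIELD": 5933, "RESOURCE_GOVERNOR_IDLE": 6545}
snapshot_waits = diff_snapshots(before, after)
print(snapshot_waits)
```

The snapshot diff is what the cumulative server-wide counters force you to do by hand, while the plan already carries per-query numbers.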

Tips for Securing a SQL Server Instance

Published 2020-06-16 by Kevin Feasel

Eitan Blumin takes us through some ways of protecting an instance of SQL Server:

After reading up on a bunch of SQL Server penetration testing articles, I found that the steps of a common penetration test are as follows:
1. Discovery
2. Gaining Access
3. Elevating Permissions
4. Attacking (Loot / Destruction)
Logically, these steps mimic the steps taken by a common would-be hacker (except, of course, they try not to actually damage anything).
I’ll briefly describe each step from the point of view of a hacker or penetration tester, the common methodologies of each step, and offer recommendations that we can follow to protect our database systems at every level.

Click through for information on each step.

The Table Scan Operator

Published 2020-06-16 by Kevin Feasel

Hugo Kornelis dives into a common operator:

The Table Scan operator is used to read all or most data from a table that has no clustered index (also known as a heap table, or just as a heap). In combination with a Top operator, it can also be used to read just a few rows from a heap table when data order is irrelevant and there is no nonclustered index that covers all required columns.
The basic behavior of a Table Scan operator is very similar to that of the Index Scan operator when it chooses to do an IAM scan, but with a few very important differences. A heap table has no root, intermediate, and leaf level pages; it has data pages only. Each page read from the IAM is a data page and can be processed. But rows on a data page of a heap table can contain forwarding pointers that cause out-of-order data access.

I’d say something like “I hope you don’t have too many table scans” because that means a lot of heaps, though given the use of temp tables without clustered indexes, even that statement fails the nuance test.
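
A toy model can show why forwarding pointers lead to out-of-order access. The sketch assumes the scan follows a forwarding pointer as soon as it meets one and skips the forwarded row when it later reaches that row's own page; the excerpt does not spell this out, so treat it as an assumption. The slot kinds and the `scan_heap` function are illustrative names, not SQL Server structures.

```python
# Toy model: a heap is a list of pages; each page is a list of slots.
# Slot kinds (illustrative names only):
#   ("row", value)        an ordinary row
#   ("pointer", (p, s))   a forwarding pointer to page p, slot s
#   ("forwarded", value)  a row that was moved here; reached only via its pointer


def scan_heap(pages):
    """Yield (value, page visited) in the order this simplified scan returns rows."""
    for page_no, page in enumerate(pages):
        for kind, payload in page:
            if kind == "row":
                yield payload, page_no
            elif kind == "pointer":
                target_page, target_slot = payload
                _, value = pages[target_page][target_slot]
                yield value, target_page  # jump away from the current page
            # "forwarded" slots are skipped: their pointer already returned them


heap = [
    [("row", "A"), ("pointer", (2, 0)), ("row", "C")],
    [("row", "D")],
    [("forwarded", "B"), ("row", "E")],
]
result = list(scan_heap(heap))
print(result)
```

In this model every row is returned exactly once, but the pages are touched in the order 0, 2, 0, 1, 2 rather than strictly front to back.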

Installing Azure Data Studio on CentOS

Published 2020-06-16 by Kevin Feasel

Sreekanth Bandarla walks us through installation and usage of Azure Data Studio on CentOS:

Okay…now what? Where is the executable, and how do I open Azure Data Studio in CentOS? The GUI in CentOS is not as user-friendly as in some other Linux distributions (Mint, for example, or a few of the Ubuntu flavors). In Windows you can find the program in the Start menu, and in some desktop-oriented Linux distributions it’s extremely easy to just search in the application center, but that was not the case for me in CentOS 7.

Click through to see how to install and open ADS on a Red Hat-based system.

Converting XLS Files to XLSX with Power Automate

Published 2020-06-16 by Kevin Feasel

Erik Svensen shows how you can create a Power Automate flow to convert old-style Excel files (in .xls format) to newer-style Excel files (.xlsx) via a web service:

In this scenario, I will use a trigger that fires when an e-mail is received and a REST API provided by https://cloudconvert.com/.
Note: this is a paid service where you pay per minute of conversion time, with prices from $0.02 to $0.01 per minute.

Check out the comments for some additional information about the web service, including a free tier.

Day: June 16, 2020

Using Apache Flink in Zeppelin Notebooks
