Kevin Feasel – Page 781

The Main Components of Apache Spark

Published 2020-10-19 by Kevin Feasel

Manoj Pandey walks us through the key components in Apache Spark:

1. Spark Driver:
– The Driver program can run various operations in parallel on a Spark cluster.
– It is responsible for communicating with the Cluster Manager to allocate resources for launching Spark Executors.
– In parallel, it instantiates a SparkSession for the Spark Application.
– The Driver program splits the Spark Application into one or more Spark Jobs, and each Job is transformed into a DAG (Directed Acyclic Graph, aka Spark execution plan). Each DAG internally has various Stages based upon different operations to perform, and finally each Stage gets divided into multiple Tasks such that each Task maps to a single partition of data.
– Once the Cluster Manager allocates resources, the Driver program works directly with the Executors by assigning them Tasks.
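The Job, Stage, and Task hierarchy described above can be sketched as a toy model in plain Python. This is not Spark's API: the names `Task`, `Stage`, `plan_job`, and `assign_tasks` are hypothetical, and creating one Stage per operation is a simplifying assumption (real Spark splits Stages at shuffle boundaries). The sketch only illustrates that each Task maps to a single partition and that the Driver hands Tasks to Executors.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    stage_id: int
    partition: int  # each Task maps to exactly one partition of data


@dataclass
class Stage:
    stage_id: int
    operation: str
    tasks: List[Task] = field(default_factory=list)


def plan_job(operations: List[str], num_partitions: int) -> List[Stage]:
    """Toy planner (assumption): one Stage per operation, one Task per partition."""
    stages = []
    for stage_id, operation in enumerate(operations):
        tasks = [Task(stage_id, p) for p in range(num_partitions)]
        stages.append(Stage(stage_id, operation, tasks))
    return stages


def assign_tasks(stages: List[Stage], executors: List[str]) -> Dict[str, List[Task]]:
    """Round-robin Tasks across Executors, standing in for the Driver's scheduling."""
    assignment: Dict[str, List[Task]] = {name: [] for name in executors}
    counter = 0
    for stage in stages:
        for task in stage.tasks:
            assignment[executors[counter % len(executors)]].append(task)
            counter += 1
    return assignment


if __name__ == "__main__":
    stages = plan_job(["read", "aggregate"], num_partitions=4)
    for executor, tasks in assign_tasks(stages, ["executor-1", "executor-2"]).items():
        print(executor, [(t.stage_id, t.partition) for t in tasks])
```

Round-robin assignment is used here only to keep the sketch short; it says nothing about how Spark actually places Tasks.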

Click through for additional elements and how they fit together.

Optical Character Recognition with Tesseract and Databricks

Published 2020-10-19 by Kevin Feasel

Alex Aleksandrov takes a look at optical character recognition with the Tesseract library:

The topic of Optical Character Recognition (OCR) is not an unexplored field to the Adatis audience. Some Adati like Kalina Ivanova (link1, link2) and Francesco Sbrescia (link3) have already explored this topic from the perspective of Azure Cognitive Services and Azure Data Lake. In my first blog, I would like to explore this topic from a different perspective: using Tesseract and Databricks.

Click through for instructions.

Changing Power BI Slicer Appearance

Published 2020-10-19 by Kevin Feasel

Prathy Kamasani has a video:

In my recent open data project, I created a single page report model with a sparse slicer. It’s a good trick for anyone who wants to make their slicer look a bit sleeker. Like any other visual in Power BI, slicers have many properties. Below is how a slicer looks in Power BI by default, but in a few steps I made a few changes to make it look like the one on the left.

Click through for the video.

Parsing Parameter Default Values in PowerShell

Published 2020-10-19 by Kevin Feasel

Aaron Bertrand continues a series:

In part 1 and part 2 of this series, I introduced ParamParser: a PowerShell module that helps parse parameter information – including default values – from stored procedures and user-defined functions, because SQL Server isn’t going to do it for us.
In the first few iterations of the code, I simply had a .ps1 file that allowed you to paste one or more module bodies into a hard-coded $procedure variable.

Read on to see what’s new in the ParamParser repo.

Swart’s Ten Percent Rule: User Connections

Published 2020-10-19 by Kevin Feasel

Michael J. Swart applies Swart’s 10% Rule to maximum simultaneous user connections:

The maximum number of user connections that SQL Server can support is 32,767. That’s it. That’s the end of the line. You can buy faster I.O. or a server with more CPUs but you can’t buy more connections.
I actually mentioned this limit in the post where I introduced Swart’s 10% rule: “If you’re using over 10% of what SQL Server restricts you to, you’re doing it wrong.” In that post, I was guarded about that statement as it applied to the user connection limit. But I’d like to upgrade that to elevated.
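As a quick arithmetic check of what the 10% rule means for this particular limit, here is a minimal sketch. The 32,767 figure comes from the quote above; the function name `exceeds_ten_percent` is hypothetical, and how you obtain the current connection count is left out entirely.

```python
MAX_USER_CONNECTIONS = 32_767  # limit quoted in the post


def exceeds_ten_percent(current_connections: int) -> bool:
    """Return True when usage is over 10% of the connection limit.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    return current_connections * 10 > MAX_USER_CONNECTIONS


if __name__ == "__main__":
    for count in (500, 3_276, 3_277, 10_000):
        print(count, exceeds_ten_percent(count))
```

Ten percent of 32,767 is 3,276.7, so under this reading 3,277 simultaneous connections is the first count that crosses the line.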

This is Threat Level Vermillion, people!

Validating Data Model Results

Published 2020-10-19 by Kevin Feasel

Paul Turley continues a discussion on Power BI data model validation:

We often have users of a business intelligence solution tell us that they have found a discrepancy between the numbers in a Power BI report and a report produced by their line-of-business (LOB) system, which they believe to be the correct information.
Using the LOB reports as a data source for Power BI is usually not ideal because at best, we would only reproduce the same results in a different report. We typically connect to raw data sources and transform that detail data, along with other data sources with historical information to analyze trends, comparisons and ratios to produce more insightful reports.
However, if the LOB reports are really the north star for data validation, these can provide an effective means to certify that a BI semantic model and analytic reports are correct and reliable.
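The idea of certifying model results against LOB report totals can be sketched as a simple comparison. This is an illustration only, not the approach from the linked post: the function name `find_discrepancies`, the period keys, and the half-cent tolerance are all assumptions, and the figures in the usage section are made-up sample values.

```python
from math import isclose
from typing import Dict, Optional, Tuple


def find_discrepancies(
    lob_totals: Dict[str, float],
    model_totals: Dict[str, float],
    abs_tol: float = 0.005,  # assumed tolerance: half a cent
) -> Dict[str, Tuple[Optional[float], Optional[float]]]:
    """Return keys where the LOB total and model total differ or one side is missing."""
    discrepancies = {}
    for key in sorted(set(lob_totals) | set(model_totals)):
        lob = lob_totals.get(key)
        model = model_totals.get(key)
        missing = lob is None or model is None
        if missing or not isclose(lob, model, rel_tol=0.0, abs_tol=abs_tol):
            discrepancies[key] = (lob, model)
    return discrepancies


if __name__ == "__main__":
    # Made-up sample values for illustration.
    lob = {"2020-09": 1000.00, "2020-10": 1250.50}
    model = {"2020-09": 1000.00, "2020-10": 1250.75, "2020-11": 10.00}
    for period, (lob_value, model_value) in find_discrepancies(lob, model).items():
        print(period, "LOB:", lob_value, "model:", model_value)
```

Reporting missing periods alongside mismatched totals is a design choice here: a period that exists on only one side is often the cause of a discrepancy further up in an aggregate.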

Click through for more details.

The Big Red Button for Query Store

Published 2020-10-19 by Kevin Feasel

Erin Stellato shows us the emergency off switch for Query Store:

Have you ever tried to turn off Query Store when there was an issue, and you thought the problem might be related to Query Store, and the ALTER DATABASE statement was blocked? And then you couldn’t do anything but wait? Me too. Imagine my excitement when I discovered that the SQL Server team snuck a helpful back door into ALL versions for which Query Store is supported.

Read on for more, including which SP / CU levels support it.

Power BI September 2020 Update

Published 2020-10-19 by Kevin Feasel

Joseph Yeates looks at some of the new additions to Power BI:

It seems like I cover updated functionality to the Q&A feature every month now! This month, it is the ability to do arithmetic in the visual!
Below, I added together two recognized terms to have the sum displayed. I didn’t need to create a DAX measure to achieve this; it is all done within the visual.

Click through for more.

Indexing S3 Data with NiFi and CDP Data Hubs

Published 2020-10-16 by Kevin Feasel

Eva Nahari, et al., walk us through text indexing of S3 data with Solr, NiFi, and Cloudera Data Platform:

Data Discovery and Exploration (DDE) was recently released in tech preview in Cloudera Data Platform in public cloud. In this blog we will go through the process of indexing data from S3 into Solr in DDE with the help of NiFi in Data Flow. The scenario is the same as it was in the previous blog but the ingest pipeline differs. Spark as the ingest pipeline tool for Search (i.e. Solr) is most commonly used for batch indexing data residing in cloud storage, or if you want to do heavy transformations of the data as a pre-step before sending it to indexing for easy exploration. NiFi (as depicted in this blog) is used for real-time and often voluminous incoming event streams that need to be explorable (e.g. logs, Twitter feeds, file appends, etc.).
Our ambition is not to use any terminal or a single shell command to achieve this. We have a UI tool for every step we need to take.

Click through to see how well they do at that.

Kafka and ZooKeeper: a Breakup in the Making

Published 2020-10-16 by Kevin Feasel

Gautam Goswami walks us through the situation with Apache Kafka and Apache ZooKeeper:

ZooKeeper is a completely separate system with its own configuration file syntax, management tools, and deployment patterns. Managing and deploying two separate distributed systems to get a Kafka cluster up and running requires in-depth skill and experience. Whoever manages both systems needs enough troubleshooting knowledge to diagnose issues in either one.
A mistake in ZooKeeper’s configuration files could bring down the Kafka cluster. Expertise in Kafka administration alone, without ZooKeeper knowledge, will not be enough to recover from such a crisis, especially in a production environment where ZooKeeper runs in a completely isolated environment (cloud). Even to set up and configure a single-node Kafka cluster for learning or R&D, we can’t proceed without ZooKeeper.

Read on for the rest of the answer, as well as how Kafka is dis-integrating ZooKeeper.

Author: Kevin Feasel