Press "Enter" to skip to content


The Main Components of Apache Spark

Manoj Pandey walks us through the key components in Apache Spark:

1. Spark Driver:

– The Driver program can run various operations in parallel on a Spark cluster.

– It is responsible for communicating with the Cluster Manager to allocate resources for launching Spark Executors.

– In parallel, it instantiates the SparkSession for the Spark Application.

– The Driver program splits the Spark Application into one or more Spark Jobs, and each Job is transformed into a DAG (Directed Acyclic Graph, aka Spark execution plan). Each DAG internally has various Stages based upon different operations to perform, and finally each Stage gets divided into multiple Tasks such that each Task maps to a single partition of data.

– Once the Cluster Manager allocates resources, the Driver program works directly with the Executors by assigning them Tasks.
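
For reference, the bullets above map onto just a few lines of PySpark. Here is a minimal, hedged sketch (the application name, master URL, and computation are illustrative placeholders, not anything from Manoj's post):

from pyspark.sql import SparkSession

# The driver program instantiates the SparkSession and negotiates with the
# cluster manager for executors (here, just local threads).
spark = (SparkSession.builder
         .appName("DriverDemo")
         .master("local[4]")
         .getOrCreate())

# Transformations are lazy: they only build up the logical plan / lineage.
doubled = spark.range(1_000_000).selectExpr("id * 2 AS doubled")

# An action triggers a Spark Job. The driver turns it into a DAG, splits the
# DAG into Stages at shuffle boundaries, and schedules one Task per partition
# on the executors it was allocated.
print(doubled.where("doubled % 10 = 0").count())

spark.stop()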
 

Click through for additional elements and how they fit together.


Optical Character Recognition with Tesseract and Databricks

Alex Aleksandrov takes a look at optical character recognition with the Tesseract library:

The topic of Optical Character Recognition (OCR) is not an unexplored field to the Adatis audience. Some Adati like Kalina Ivanova (link 1, link 2) and Francesco Sbrescia (link 3) have already explored this topic from the perspective of Azure Cognitive Services and Azure Data Lake. In my first blog, I would like to explore this topic from a different perspective: using Tesseract and Databricks.
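
If you want a feel for the Tesseract side before diving in, here is a hedged sketch using the pytesseract wrapper (Alex's post may well use a different binding; the image path is a placeholder, and the Tesseract binary itself has to be installed on the cluster, e.g. via an init script):

import pytesseract
from PIL import Image

# Run OCR over a single scanned image and get back plain text.
text = pytesseract.image_to_string(Image.open("/dbfs/tmp/scanned_page.png"), lang="eng")
print(text)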

Click through for instructions.


Indexing S3 Data with NiFi and CDP Data Hubs

Eva Nahari, et al., walk us through text indexing of S3 data with Solr, NiFi, and Cloudera Data Platform:

Data Discovery and Exploration (DDE) was recently released in tech preview in Cloudera Data Platform in public cloud. In this blog we will go through the process of indexing data from S3 into Solr in DDE with the help of NiFi in Data Flow. The scenario is the same as it was in the previous blog but the ingest pipeline differs. Spark as the ingest pipeline tool for Search (i.e. Solr) is most commonly used for batch indexing data residing in cloud storage, or if you want to do heavy transformations of the data as a pre-step before sending it to indexing for easy exploration. NiFi (as depicted in this blog) is used for real-time and often voluminous incoming event streams that need to be explorable (e.g. logs, Twitter feeds, file appends, etc.).

Our ambition is not to use any terminal or a single shell command to achieve this. We have a UI tool for every step we need to take. 
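
The blog itself sticks to UI tooling, but as a point of reference, here is a hedged pysolr sketch of what "indexing a document into Solr" boils down to (the collection name, URL, and field names are placeholders, not anything from the post):

import pysolr

solr = pysolr.Solr("http://solr-host:8983/solr/logs-collection", timeout=10)

# Index a document; commit=True makes it searchable right away.
solr.add([{"id": "s3://bucket/logs/2020-11-01.log",
           "message_txt": "example log line"}], commit=True)

# Query it back.
for doc in solr.search("message_txt:example"):
    print(doc["id"])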

Click through to see how well they do at that.


Kafka and Zookeeper: a Breakup in the Making

Gautam Goswami walks us through the situation with Apache Kafka and Apache Zookeeper:

Zookeeper is a completely separate system with its own configuration file syntax, management tools, and deployment patterns. In-depth skill and experience are necessary to manage and deploy two individual distributed systems and eventually get a Kafka cluster up and running. The person who manages both systems together needs enough troubleshooting information to find issues in both of them.

There is also the possibility that a mistake in Zookeeper’s configuration files brings down the Kafka cluster, and expertise in Kafka administration without Zookeeper won’t help you out of that crisis, especially in production environments where Zookeeper runs in a completely isolated environment (the cloud). Even to set up and configure a single-node Kafka cluster for learning and R&D, we can’t proceed without Zookeeper.

Read on for the rest of the answer, as well as how Kafka is dis-integrating Zookeeper.


Querying Multiple Data Sources in Azure Synapse Analytics

James Serra walks us through querying Data Lake Storage Gen2, Cosmos DB, and a table created in an Azure Synapse serverless Apache Spark pool:

As I was finishing up a demo script for my presentation at the SQL PASS Virtual Summit on 11/13 (details on my session here), I wanted to blog about part of the demo that shows a feature in the public preview of Synapse that is, frankly, very cool. It is the ability to query data as it sits in ADLS Gen2, a Spark table, and Cosmos DB and join the data together with one T-SQL statement using SQL on-demand (also called SQL serverless), hence making it a federated query (also known as data virtualization). The beauty of this is you don’t have to first write ETL to collect all the data into a relational database in order to be able to query it all together, and don’t have to provision a SQL pool, saving costs. Further, you are using T-SQL to query all of those data sources so you are able to use a reporting tool like Power BI to see the results.
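
To make the idea concrete, here is a hedged sketch of the kind of federated query involved, submitted from Python over pyodbc to a serverless SQL endpoint. Every name (workspace, storage account, Spark database, table, columns) is a placeholder, the OPENROWSET options depend on your data, and it assumes the Microsoft ODBC driver with Azure AD interactive authentication is available:

import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
)

federated_query = """
SELECT TOP 10 f.ProductId, f.Quantity, d.ProductName
FROM OPENROWSET(
        BULK 'https://mystorage.dfs.core.windows.net/data/sales/*.parquet',
        FORMAT = 'PARQUET'
     ) AS f                          -- files sitting in ADLS Gen2
JOIN sparkdb.dbo.DimProduct AS d     -- a table created by a Spark pool
  ON f.ProductId = d.ProductId;
-- Cosmos DB could join in as a third source via OPENROWSET('CosmosDB', ...).
"""

for row in conn.cursor().execute(federated_query):
    print(row)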

Click through to see how.


Spark Infer Schema vs ADF Get Metadata

Paul Andrew compares two techniques for retrieving metadata:

For file types that don’t contain their own metadata (CSV, text, etc.) we typically have to go and figure out their structure, including attributes and data types, before doing any actual transformation work. Often I’ve used the Data Factory Metadata Activity to do this with its structure option. However, while playing around with Azure Synapse Analytics, specifically creating Notebooks in C# to run against the Apache Spark compute pools, I’ve discovered that in most cases the Data Frame infer schema option basically does a better job here.

Now, I’m sure some Spark people will probably read the above and think, well der, obviously Paul! Spark is better than Data Factory. And sure, I accept that for this specific situation it certainly is. I’m simply calling that out as it might not be obvious to everyone.
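
For anyone who hasn’t seen the Spark side of it, here is a hedged PySpark equivalent of the infer-schema approach (Paul’s post uses C# notebooks against a Synapse Spark pool; the file path here is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("InferSchemaDemo").getOrCreate()

# With header and inferSchema enabled, Spark samples the file and derives
# column names and data types instead of treating every column as a string.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("abfss://data@mystorage.dfs.core.windows.net/raw/customers.csv"))

# The inferred structure, roughly comparable to Get Metadata's 'structure' output.
df.printSchema()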

Read on for a comparison of the two techniques.


MLOps with Azure Databricks and MLflow

Oliver Koernig walks us through some of the basics of MLOps using MLflow and Azure Databricks:

Most organizations today have a defined process to promote code (e.g. Java or Python) from development to QA/Test and production. Many are using Continuous Integration and/or Continuous Delivery (CI/CD) processes and oftentimes are using tools such as Azure DevOps or Jenkins to help with that process. Databricks has provided many resources to detail how the Databricks Unified Analytics Platform can be integrated with these tools (see Azure DevOps Integration, Jenkins Integration). In addition, there is a Databricks Labs project – CI/CD Templates – as well as a related blog post that provides automated templates for GitHub Actions and Azure DevOps, which makes the integration much easier and faster.

When it comes to machine learning, though, most organizations do not have the same kind of disciplined process in place.
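
As a taste of the tracking piece, here is a minimal, hedged MLflow sketch: not the post’s pipeline, just the core pattern of logging parameters, metrics, and a model per training run so CI/CD tooling has something versioned to promote.

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 5}
    model = RandomForestRegressor(**params).fit(X, y)

    mlflow.log_params(params)                          # hyperparameters
    mlflow.log_metric("train_r2", model.score(X, y))   # a simple metric
    mlflow.sklearn.log_model(model, "model")           # the artifact to promote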

Read on for a demonstration of the process.


Measuring Advertising Effectiveness

Layla Yang and Hector Leano walk us through measuring how effective an advertising campaign was:

At a high level we are connecting a time series of regional sales to regional offline and online ad impressions over the trailing thirty days. By using ML to compare the different kinds of measurements (TV impressions or GRPs versus digital banner clicks versus social likes) across all regions, we then correlate the type of engagement to incremental regional sales in order to build attribution and forecasting models. The challenge comes in merging advertising KPIs such as impressions, clicks, and page views from different data sources with different schemas (e.g., one source might use day parts to measure impressions while another uses exact time and date; location might be by zip code in one source and by metropolitan area in another).

As an example, we are using a SafeGraph rich dataset for foot traffic data to restaurants from the same chain. While we are using mocked offline store visits for this example, you can just as easily plug in offline and online sales data provided you have region and date included in your sales data. We will read in different locations’ in-store visit data, explore the data in PySpark and Spark SQL, and make the data clean, reliable, and analytics-ready for the ML task. For this example, the marketing team wants to find out which of the online media channels is the most effective channel to drive in-store visits.
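
The core of that preparation is a region-and-date join between engagement measures and visits. Here is a hedged PySpark sketch of that shape (the table and column names are placeholders, not SafeGraph’s schema):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("AdAttributionDemo").getOrCreate()

impressions = spark.table("ad_impressions")   # region, date, channel, impressions
visits = spark.table("store_visits")          # region, date, visits

# Pivot channels into columns, then join to visits so each row pairs a
# region/day with its engagement measures, the shape an attribution model expects.
features = (impressions
            .groupBy("region", "date")
            .pivot("channel")
            .agg(F.sum("impressions"))
            .join(visits, on=["region", "date"], how="inner"))

features.show(5)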

Click through for the article as well as notebooks.


Building a CRUD Application with Cloudera Operational DB and Flask

Shlomi Tubul puts together a proof of concept app:

In this blog, I will demonstrate how COD can easily be used as a backend system to store data and images for a simple web application. To build this application, we will be using Phoenix, one of the underlying components of COD, along with Flask. For storing images, we will be using an HBase (Apache Phoenix backend storage) capability called MOB (medium objects). MOB allows us to quickly read/write values from 100 KB to 10 MB.

*For ease of development, you can also use the Phoenix Query Server instead of COD. The query server is a small build of Phoenix that is meant for development purposes only, and data is deleted in each build.
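
For a sense of what the Flask-plus-Phoenix plumbing looks like, here is a hedged sketch using the python phoenixdb driver pointed at a Phoenix Query Server (as the development note above suggests). The URL, table, and routes are placeholders, not the code from the post or its repo:

import phoenixdb
from flask import Flask, jsonify, request

app = Flask(__name__)
conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)

@app.route("/items", methods=["POST"])
def create_item():
    # Phoenix uses UPSERT rather than INSERT.
    cur = conn.cursor()
    cur.execute("UPSERT INTO items (id, name) VALUES (?, ?)",
                (request.json["id"], request.json["name"]))
    return jsonify(status="created"), 201

@app.route("/items/<int:item_id>", methods=["GET"])
def read_item(item_id):
    cur = conn.cursor()
    cur.execute("SELECT id, name FROM items WHERE id = ?", (item_id,))
    row = cur.fetchone()
    return (jsonify(id=row[0], name=row[1]), 200) if row else ("not found", 404)

if __name__ == "__main__":
    app.run(debug=True)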

Click through for the demo and for a link to the GitHub repo.


Persisting an RDD in Spark

Sarfaraz Hussain takes us through caching / persisting RDDs in Apache Spark:

Spark RDD persistence is an optimization technique which saves the result of RDD evaluation in cache memory. Using this we save the intermediate result so that we can use it further if required. It reduces the computation overhead.

When we persist an RDD, each node stores the partitions of it that it computes in memory and reuses them in other actions on that RDD (or RDDs derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.
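
A minimal, hedged PySpark illustration of the point (the computation is just a stand-in for something expensive):

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="PersistDemo")

# An RDD with a lineage we expect to reuse.
squares = sc.parallelize(range(1_000_000)).map(lambda x: x * x)

# Mark it for persistence; partitions are materialized by the first action and
# kept according to the chosen storage level.
squares.persist(StorageLevel.MEMORY_AND_DISK)

print(squares.count())   # first action: computes and caches the partitions
print(squares.sum())     # second action: reuses the cached partitions

squares.unpersist()
sc.stop()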

Read on to see how you can do this and some of the options available to you when caching. This is extremely useful when working with external data sources, as then you don’t risk hitting the external source multiple times.
