Press "Enter" to skip to content

Month: January 2018

Leveraging Hive In PySpark

Fisseha Berhane shows how to use Spark to connect Python to Hive:

If we are using earlier Spark versions, we have to use HiveContext, which is a variant of Spark SQL that integrates with data stored in Hive. Even when we do not have an existing Hive deployment, we can still enable Hive support.
In this tutorial, I am using standalone Spark. When not configured by hive-site.xml, the context automatically creates metastore_db in the current directory.

As shown below, initially we do not have metastore_db, but after we instantiate SparkSession with Hive support, we see that metastore_db has been created. Further, when we execute the create database command, spark-warehouse is created.
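
For reference, a minimal sketch of that setup might look like the following (assuming PySpark 2.x or later; the app and database names are illustrative):

```python
from pyspark.sql import SparkSession

# With no hive-site.xml configured, Spark creates metastore_db
# in the current working directory.
spark = (SparkSession.builder
         .appName("hive-example")
         .enableHiveSupport()
         .getOrCreate())

# Creating a database also creates the spark-warehouse directory.
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
spark.sql("SHOW DATABASES").show()
```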

Click through for a bunch of examples.

Stream Reactor Update

Andrew Stevenson announces Stream Reactor 1.0.0 for Kafka Connect 1.0:

Stream Reactor is an Apache License, Version 2.0 open source collection of components built on top of Kafka that provides Kafka Connect compatible connectors to move data between Kafka and popular data stores. Stream Reactor provides source connectors to publish data into Kafka and sink connectors to bring data from Kafka into other systems. The connectors support KCQL (Kafka Connect Query Language), an open source component of the Lenses SQL Engine that provides an elegant and simple SQL-like syntax for selecting fields and routing from sources or topics to Kafka or the target system (topic to target entity mapping, field selection, auto creation, auto evolution, error policies).

We hope you find Stream Reactor useful and want to give it a try! Stream Reactor has over 25 connectors available, tested and documented, supporting both Kafka 0.11 and Kafka 1.0. You can give it a go by downloading the Lenses Development Environment, find the jars on GitHub, or even build the code locally and help us improve and add even more connectors.

Read on for more details, as well as a link to the GitHub repo.

Filtering Class Type In Server Audits

Solomon Rutzky shows how to filter the class_type field in a SQL Server audit to filter out scalar valued functions:

According to the documentation for CREATE SERVER AUDIT, I should be able to add a WHERE clause (starting in SQL Server 2012) to do simple filtering. The documentation states that the list of fields that can be filtered on is found in the documentation for the sys.fn_get_audit_file system function. Selecting from that function showed that the class_type field contains the “object type” (“FN” = Function, “P” = Stored Procedure, etc.) value. That’s just what I needed. So I’m just about done, right? Not so fast!

Nothing’s ever that easy, it seems.  Read on for the full story.

SQL Server And Polygons

Bert Wagner shows that SQL Server interprets polygon definitions a bit differently from GeoJSON:

Discerning eyes might notice that SQL Server didn’t shade in the area inside of the polygon — it instead shaded in everything in the world EXCEPT for the interior of our polygon.

If this is the first time you’ve encountered this behavior, you’re probably confused. I know I was.

Read on to learn more about left-hand and right-hand polygon specifications and how to translate from one to the other.
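
If you want to experiment with the underlying idea outside of SQL Server, ring orientation comes down to the sign of the area given by the shoelace formula. Here is a small illustrative Python sketch (not SQL Server's implementation; the geography type has its own ReorientObject() method for the actual fix):

```python
def signed_area(ring):
    """Shoelace formula: a positive result means counterclockwise vertex order."""
    pairs = zip(ring, ring[1:] + ring[:1])
    return sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs) / 2.0

# Reversing the vertex order flips which side counts as the interior.
square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(signed_area(square))        # 100.0  -> counterclockwise
print(signed_area(square[::-1]))  # -100.0 -> clockwise
```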

Defining A Data Lake

Derik Hammer gives us a definition of the data lake:

Data lake, a term originally coined by James Dixon, the founder and CTO of Pentaho, is used to describe a data store which can scale to extremely large sizes, in an affordable manner. A data lake is also designed to store the raw data, in its original format, so it can be used immediately, rather than waiting weeks for the IT department to massage it into a format that the data warehouse can accept and/or use effectively.

The data lake concept always includes the capability to scale to an enormous size. However, you do not need petabytes of data to find use in a data lake. It can be used as cheap storage for long-term archival data. It can be used to transform data before attempting to ingest into a data warehouse with the convenience of retaining the original and transformed versions of the data. It also can be used as the centralized staging location for ingestion into the data warehouse, simplifying the loading processes.

I would like to take this opportunity to remind readers that the Aristotelian opposite of the Data Lake is the Data Swamp.  Derik uses this term as well and it makes me feel warm and fuzzy inside to see broad adoption of this term.

The SQLUndercover Inspector

Adrian Buckman announces the SQLUndercover Inspector:

In a Nutshell:

The SQLUndercover Inspector is a configurable daily report written in SQL that will send you an email (or log the report to a SQL table) showing you information about specific parts of SQL Server in HTML format, including highlighted warnings and advisory conditions. The report has configurable thresholds and settings, leaving you in control.

Click through to see what they track.

DRY With Biml

Cathrine Wilhelmsen shows how to use tiered Biml files to prevent repetition:

Many Biml solutions start very simple, with just a single Biml file that generates a few SSIS packages. Most developers quickly see the need for a more complex solution for multiple sources. One way to reuse code and apply the Don’t Repeat Yourself software engineering principle in Biml is to use Tiered Biml Files.

In addition to using Tiered Biml Files, there are four other main ways you can avoid repeating your Biml code (covered in the original post).

In this post we will look at how to use Tiered Biml Files.

Tiering is a helpful mechanism for doing work in one location and using the subsequent results of that work within your Biml solution.  Read the whole thing.

CYA Parameters In PowerShell

Rob Sewell describes a class of PowerShell parameters designed to keep you from accidentally doing something awful:

If you wanted to see what would happen, you could edit the script and add the WhatIf parameter to every changing command, but that’s not really a viable solution. What you can do is set a default value for WhatIf across all commands at once (PowerShell’s $PSDefaultParameterValues automatic variable handles this; for example, $PSDefaultParameterValues = @{ '*:WhatIf' = $true }).

This will set all commands that accept WhatIf to use the WhatIf parameter. This means that if you are using functions that you have written internally, you must ensure that you write your functions to use the common parameters.

Once you have set the default value for WhatIf as above, you can simply call your script and see the WhatIf output.

WhatIf is a great parameter, and when developing cmdlets, you should add support for it.

Unit Testing Spark Streaming DStreams

Anuj Saxena shows how to create unit tests for DStreams in Spark Streaming:

The method ‘testOperation’ takes the output of the operation performed on the ‘inputPair’ and checks whether it is equal to the ‘outputPair’, and just like this, we can test our business logic.

This short snippet lets you test your business logic without forcing you to create even a Spark session. You can mock the whole streaming environment and test your business logic easily.

This was a simple example of unary operations on DStreams. Similarly, we can test binary operations and window operations on DStreams.
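
Whatever the language, the core idea is to factor the per-batch business logic into a pure function and test it against plain collections, with no streaming context required. A hypothetical Python sketch (the function and test data are invented for illustration):

```python
import unittest

def double_amounts(batch):
    """The 'business logic' applied to each micro-batch of (user, amount) pairs."""
    return [(user, amount * 2) for user, amount in batch]

class BusinessLogicTest(unittest.TestCase):
    def test_double_amounts(self):
        input_pair = [("alice", 1), ("bob", 2)]
        output_pair = [("alice", 2), ("bob", 4)]
        # No Spark session needed: the logic is a pure function.
        self.assertEqual(double_amounts(input_pair), output_pair)

if __name__ == "__main__":
    unittest.main()
```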

Click through for an example with code.

Markov Chains In Python

Sandipan Dey shows off various uses of Markov chains as well as how to create one in Python:

Perspective. In the 1948 landmark paper A Mathematical Theory of Communication, Claude Shannon founded the field of information theory and revolutionized the telecommunications industry, laying the groundwork for today’s Information Age. In this paper, Shannon proposed using a Markov chain to create a statistical model of the sequences of letters in a piece of English text. Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering. They also have many scientific computing applications including the GeneMark algorithm for gene prediction, the Metropolis algorithm for measuring thermodynamical properties, and Google’s PageRank algorithm for Web search. For this assignment, we consider a whimsical variant: generating stylized pseudo-random text.

Markov chains are a venerable statistical technique and formed the basis of a lot of text processing (especially text generation) due to the algorithm’s relatively low computational requirements.
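
To make that concrete, here is a small illustrative Python sketch of Markov-chain text generation over word pairs (the sample text and all names are invented):

```python
import random
from collections import defaultdict

def build_chain(words, order=2):
    """Map each 'order'-word state to the list of words that follow it."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain

def generate(chain, n_words=30):
    """Random-walk the chain to produce pseudo-random text."""
    state = random.choice(list(chain))
    out = list(state)
    for _ in range(n_words - len(out)):
        followers = chain.get(state)
        if not followers:  # dead end: restart from a random state
            state = random.choice(list(chain))
            followers = chain[state]
        word = random.choice(followers)
        out.append(word)
        state = state[1:] + (word,)
    return " ".join(out)

text = "the quick brown fox jumps over the lazy dog the quick red fox".split()
print(generate(build_chain(text)))
```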
