Press "Enter" to skip to content

Day: July 6, 2018

Ingesting Google Analytics Data Into Kafka With Python

Bill Ward shows us an example of using Python to read Google Analytics data into a Kafka topic:

Google Analytics is a very powerful platform for monitoring your web site’s metrics, including top pages, visitors, bounce rate, etc. As more and more businesses adopt Big Data processes, compiling as much data as possible becomes ever more advantageous. The more data you have available, the more options you have for analyzing it and producing some very interesting results that can help you shape your business.

This article assumes that you already have a running Kafka cluster. If you don’t, then please follow my article Kafka Tutorial for Fast Data Architecture to get a cluster up and running. You will also need a topic already created for publishing the Google Analytics metrics to; the aforementioned article covers this procedure as well. I created a topic called admintome-ga-pages since we will be collecting Google Analytics (ga) metrics about my blog’s pages.

In this article, I will walk you through how to pull metrics data from Google Analytics for your site then take that data and push it to your Kafka Cluster. In a later article, we will cover how to take that data that is consumed in Kafka and analyze it to give us some meaningful business data.
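If you want a feel for the finished pipeline, here is a minimal sketch, assuming the Google Analytics Reporting API v4 client and kafka-python (the view ID, credentials file, and broker address are placeholders):

import json

from googleapiclient.discovery import build
from google.oauth2 import service_account
from kafka import KafkaProducer

# Authenticate against the Reporting API with a service account key.
SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
credentials = service_account.Credentials.from_service_account_file(
    'client_secrets.json', scopes=SCOPES)
analytics = build('analyticsreporting', 'v4', credentials=credentials)

# Pull pageviews per page path for the last week.
response = analytics.reports().batchGet(body={
    'reportRequests': [{
        'viewId': 'VIEW_ID',
        'dateRanges': [{'startDate': '7daysAgo', 'endDate': 'today'}],
        'metrics': [{'expression': 'ga:pageviews'}],
        'dimensions': [{'name': 'ga:pagePath'}]
    }]
}).execute()

# Publish each report row to the admintome-ga-pages topic as JSON.
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'))

for report in response.get('reports', []):
    for row in report.get('data', {}).get('rows', []):
        producer.send('admintome-ga-pages', {
            'page': row['dimensions'][0],
            'pageviews': row['metrics'][0]['values'][0]
        })
producer.flush()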

This is a good tutorial and I’m looking forward to part two.

Machine Learning With F#

Diogo Souza gives us an introduction to using Accord.NET in F#:

F# is a scripting language as well as a REPL language. REPL stands for Read-Eval-Print Loop, which means the language processes single steps one at a time: reading the user’s input (usually an expression), evaluating its value and, in the end, returning the result to the same user. All that happens in a loop until the loop ends. Visual Studio provides a great F# Interactive view that runs scripts in REPL mode and shows the results. Take the following Hello World example:
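The snippet itself is an image in the source post; going by the description, it is a single binding along these lines (the exact string is a guess):

let message = "Hello World from F#"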

This code just creates a single variable (the let keyword) and assigns a string value to it. When you run this code (select all of the code text and press Alt + Enter), you’ll see the result in the F# Interactive window.

You can also use C# with Accord.NET, but there’s a strong bias toward F# among people in the .NET space who work with ML, for the same reason that there’s a bias toward Scala over Java for Spark developers:  the functional programming paradigm works extremely well with mathematical concepts.  Also, in addition to Accord.NET, you might also want to check out Math.NET.  My experience has been that this package tends to be a bit faster than Accord.

Constrained Optimization In Python: pyomo

Jeff Schecter introduces us to pyomo, a Python package for constrained optimization problems:

Constrained optimization is a tool for minimizing or maximizing some objective, subject to constraints. For example, we may want to build new warehouses that minimize the average cost of shipping to our clients, constrained by our budget for building and operating those warehouses. Or, we might want to purchase an assortment of merchandise that maximizes expected revenue, limited by a minimum number of different items to stock in each department and our manufacturers’ minimum order sizes.

Here’s the catch: all objectives and constraints must be linear or quadratic functions of the model’s fixed inputs (parameters, in the lingo) and free variables.

Constraints are limited to equalities and non-strict inequalities. (Re-writing strict inequalities in these terms can require some algebraic gymnastics.) Conventionally, all terms including free variables live on the lefthand side of the equality or inequality, leaving only constants and fixed parameters on the righthand side.

To build your model, you must first formalize your objective function and constraints. Once you’ve expressed these terms mathematically, it’s easy to turn the math into code and let pyomo find the optimal solution.
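To make that concrete, here is a minimal, hypothetical pyomo model: a two-variable linear program with made-up shipping costs, solved here with GLPK (any installed LP solver would do):

from pyomo.environ import (ConcreteModel, Var, Objective, Constraint,
                           NonNegativeReals, SolverFactory, minimize)

# Hypothetical example: minimize shipping costs from two warehouses,
# subject to meeting total demand and a capacity cap on warehouse 1.
model = ConcreteModel()
model.x1 = Var(within=NonNegativeReals)  # units shipped from warehouse 1
model.x2 = Var(within=NonNegativeReals)  # units shipped from warehouse 2

# Linear objective in the free variables.
model.cost = Objective(expr=2 * model.x1 + 3 * model.x2, sense=minimize)

# Non-strict inequalities, free variables on the lefthand side.
model.demand = Constraint(expr=model.x1 + model.x2 >= 100)
model.capacity = Constraint(expr=model.x1 <= 60)

SolverFactory('glpk').solve(model)
print(model.x1(), model.x2())  # expect 60.0 and 40.0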

I haven’t touched it in a decade, but I did have some success with LINGO for solving the same type of problem.

Benefits To Federating The Hadoop NameNode

Hanisha Koneru and Arpit Agarwal show us a few benefits to NameNode federation:

The Apache Hadoop Distributed File System (HDFS) is highly scalable and can support petabyte-size clusters.  However, the entire Namespace (file system metadata) is stored in memory. So even though the storage can be scaled horizontally, the namespace can only be scaled vertically. It is limited by how many files, blocks, and directories can be stored in the memory of a single NameNode process.

Federation was introduced in order to scale the name service horizontally by using multiple independent Namenodes/Namespaces. The Namenodes are independent of each other, and there is no communication between them. The Namenodes can share the same Datanodes for storage.

KEY BENEFITS

Scalability: Federation adds support for horizontal scaling of the namespace

Performance: Adding more Namenodes to a cluster increases the aggregate read/write throughput of the cluster

Isolation: Users and applications can be divided between the Namenodes
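To give a flavor of the configuration involved: a federated namespace is declared in hdfs-site.xml along these lines (the nameservice IDs and hostnames here are hypothetical):

<!-- Two independent NameNodes, each owning its own namespace,
     sharing the same pool of DataNodes. -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>namenode2.example.com:8020</value>
</property>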

Read on for examples.

SQL Server Vulnerability Assessment Powershell Cmdlets

Ronit Reger announces a new set of SQL Server vulnerability assessment Powershell cmdlets:

SQL Vulnerability Assessment (VA) is a service that provides visibility into your security state and includes actionable steps to resolve security issues and enhance your database security. It can help you:

  • Meet compliance requirements that require database scan reports.
  • Meet data privacy standards.
  • Monitor a dynamic database environment where changes are difficult to track.

VA runs vulnerability scans on your database, flagging security vulnerabilities and highlighting deviations from best practices, such as misconfigurations, excessive permissions, and unprotected sensitive data. The rules are based on Microsoft’s best practices and focus on the security issues that present the biggest risks to your database and its valuable data. These rules also represent many of the requirements from various regulatory bodies to meet their compliance standards.

Results of the scan include actionable steps to resolve each issue and provide customized remediation scripts where applicable. An assessment report can be customized for your environment by setting an acceptable baseline for permission configurations, feature configurations, and database settings. This baseline is then used as a basis for comparison in subsequent scans, to detect deviations or drifts from your secure database state.
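As a rough sketch of the workflow, assuming the cmdlet names from the announcement (the parameter names are my guesses; check the cmdlet help in the SqlServer module for the exact parameter sets):

# Scan a database and export the report; parameter names are assumptions.
Import-Module SqlServer
Get-SqlDatabase -ServerInstance "MyServer" -Name "MyDatabase" |
    Invoke-SqlVulnerabilityAssessmentScan |
    Export-SqlVulnerabilityAssessmentScan -FolderPath "C:\VaScans"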

Read on for more, and if you’re interested, the cmdlets are available in the SqlServer Powershell module.

Read-Scale Availability Groups

Ryan Adams explains how to create a Read-Scale Availability Group:

A Read-Scale Availability Group is a clusterless Availability Group.  Its sole purpose and design is to scale out a read workload.  More important is what it is not: it is NOT a High Availability or Disaster Recovery solution.  Since this design has no cluster under it, you lose things like automatic failover and database-level health detection.  For example, you have reports that run for customers in your DMZ, which is fire-walled off from your internal network.  Opening up ports for Active Directory so that you can have a cluster means opening a ton of ephemeral ports and ports with high attack vectors.  Remember the Slammer worm?  This solution removes those dependencies.
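The clusterless part shows up right in the DDL.  A read-scale group is created with CLUSTER_TYPE = NONE, along the lines of this hypothetical two-replica sketch (server, endpoint, and database names are placeholders):

-- No WSFC underneath: manual failover only, asynchronous commit.
CREATE AVAILABILITY GROUP [ReadScaleAG]
WITH (CLUSTER_TYPE = NONE)
FOR DATABASE [SalesDB]
REPLICA ON
    N'SQL01' WITH (
        ENDPOINT_URL = N'tcp://sql01.example.com:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE = MANUAL,
        SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL)),
    N'SQL02' WITH (
        ENDPOINT_URL = N'tcp://sql02.example.com:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE = MANUAL,
        SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL));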

Click through for the setup scripts as well as a video Ryan created of him putting it all together.  As long as you recognize the trade-offs involved, this can be a nice solution to certain problems.

Resumable Online Index Creation In Azure SQL Database

Niko Neugebauer looks at a feature coming in SQL Server vNext:

It is about time to create our first Clustered Online Resumable Index:

CREATE CLUSTERED INDEX CI_SampleDataTable
    ON dbo.SampleDataTable (c1)
    WITH (ONLINE = ON, RESUMABLE = ON);

But all we shall get is an error message:

Msg 155, Level 15, State 1, Line 25
'RESUMABLE' is not a recognized CREATE CLUSTERED INDEX option.

I was shocked and I was disappointed, but I understood that it was my own mind’s fault. Nobody, I repeat – NOBODY told me that it would work for CLUSTERED indexes, but when I saw the announcement that indexes were supported, I totally believed that traditional (no XML, no CLR, no LOBs) rowstore indexes would be fully supported. Oh yes, I know that it is crazy difficult. I know that this is a pretty forward-facing feature, but come on – my mind played a trick on me, telling me a story that does not exist, for now at least.

After realising my mind’s mistake, I took a deep breath and decided to try out Resumable Nonclustered Index creation with the following command:

CREATE NONCLUSTERED INDEX NCI_SampleDataTable
    ON dbo.SampleDataTable (c1)
    WITH (ONLINE = ON, RESUMABLE = ON);
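Assuming resumable creation behaves like the resumable rebuilds that shipped in SQL Server 2017, you can pause the operation, check its progress, and pick it back up:

-- Pause the running operation, inspect it, then resume it.
ALTER INDEX NCI_SampleDataTable ON dbo.SampleDataTable PAUSE;

SELECT name, percent_complete, state_desc
FROM sys.index_resumable_operations;

ALTER INDEX NCI_SampleDataTable ON dbo.SampleDataTable RESUME;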

Hopefully we get a bit more support as SQL Server vNext is developed and eventually released.  In the meantime, Niko hits some limitations but his timings for the feature look good.

Finding Scalar Functions In Execution Plans

Kendra Little points out that scalar user-defined functions can hide in the most unassuming of places:

After we find matches based on the customer id, we have more work “left over” — that’s the “residual” bit.

For every row that matches, SQL Server is plugging values into the Website.CalculateCustomerPrice() function and comparing the result to the Unit price column, just like we asked for in the where clause.

In other words, this is happening for every row in Sales.InvoiceLines that has a matching row in Sales.Invoices.

Which is every single invoice & invoice line, as it turns out.
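For context, the query shape in question pairs an ordinary join with a scalar UDF call in the WHERE clause, roughly like this approximation of the WideWorldImporters example (the exact predicate in the post may differ):

-- The join can seek on InvoiceID, but the UDF comparison is evaluated
-- as a residual predicate against every matching row.
SELECT il.InvoiceLineID
FROM Sales.Invoices AS i
    JOIN Sales.InvoiceLines AS il
        ON i.InvoiceID = il.InvoiceID
WHERE il.UnitPrice <> Website.CalculateCustomerPrice
    (i.CustomerID, il.StockItemID, i.InvoiceDate);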

It’s a shame there’s no “this is why your query is slow” plan operator for scalar UDFs.

Faking Arrays In T-SQL With Custom Types

Jovan Popovic shows how to use custom types as pseudo-arrays in SQL Server:

One of the missing language features in the T-SQL language is array support. In some cases, you can use custom types to work with arrays that are passed as parameters to your stored procedures.

Custom types in T-SQL enable you to create an alias for a table, .NET, or built-in type. Once you create a custom type, you can use it for local variables and parameters of functions.
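The pattern looks like this minimal sketch (dbo.Orders is a hypothetical table):

-- A single-column table type serving as a pseudo-array.
CREATE TYPE dbo.IntArray AS TABLE (value int NOT NULL);
GO
CREATE OR ALTER PROCEDURE dbo.GetOrders
    @OrderIds dbo.IntArray READONLY
AS
BEGIN
    SELECT o.OrderID, o.OrderDate
    FROM dbo.Orders AS o
        JOIN @OrderIds AS ids
            ON ids.value = o.OrderID;
END
GO
-- The caller fills the "array" and passes it in.
DECLARE @ids dbo.IntArray;
INSERT INTO @ids (value) VALUES (1), (2), (3);
EXEC dbo.GetOrders @OrderIds = @ids;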

I go back and forth on whether I’d like full array support in T-SQL, as on the plus side, it simplifies interactions with external tools.  On the other hand, it can promote bad habits like violating first normal form.
