Real-Time Analytics with Divolte, Kafka, Druid, and Superset

Fokko Driesprong gives us a proof of concept architecture for real-time analytics in the Hadoop ecosystem:

Divolte Collector is a scalable and performant application for collecting clickstream data and publishing it to a sink, such as Kafka, HDFS or S3. Divolte has been developed by GoDataDriven and made available to the public under the Apache 2.0 open source license.

Divolte can be used as the foundation to build anything from basic web analytics dashboarding to real-time recommender engines or banner optimization systems. By using a JavaScript tag in the browser of the customers, it gathers data about their behaviour on the website or application. You’re in full control of what you do and don’t want to capture.

Click through for the example.

Combining Neo4j with Kafka

David Allen shows how you can use Neo4j to visualize graph data living in Kafka:

We’re enabling the plugin to work as both a source and a sink. In the NEO4J_streams_sink_topic_cypher_friends item, we’re writing a Cypher query. In this query, we’re MERGE-ing two Person nodes. The plugin gives us a variable named event, which we can use to pull out the properties we need. When we MERGE nodes, it creates them only if they do not already exist. Finally, it creates a relationship between the two nodes (p1) and (p2).

This sink configuration is how we’ll turn a stream of records from Kafka into an ever-growing and changing graph. The rest of the configuration handles our connection to a Confluent Cloud instance, where all of our event streaming will be managed for us. If you’re trying this out for yourself, make sure to replace KAFKA_BOOTSTRAP_SERVERS, API_SECRET, and API_KEY with the values that Confluent Cloud gives you when you generate an API access key.
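To make the MERGE behaviour described above concrete, here is a minimal in-memory sketch in Python. It is not the Neo4j driver, the streams plugin, or David's Cypher query: the `TinyGraph` class, the `from` and `to` event fields, and the `FRIENDS` relationship type are all hypothetical names chosen for illustration, and treating the relationship as idempotent is an assumption of this sketch.

```python
class TinyGraph:
    """In-memory stand-in for a graph database (illustration only)."""

    def __init__(self):
        self.nodes = {}             # (label, name) -> properties dict
        self.relationships = set()  # (start name, type, end name)

    def merge_node(self, label, name):
        # Like MERGE: create the node only if it does not already exist.
        return self.nodes.setdefault((label, name), {"name": name})

    def merge_relationship(self, start, rel_type, end):
        # Adding to a set is idempotent (an assumption of this sketch).
        self.relationships.add((start, rel_type, end))


def apply_event(graph, event):
    # 'from' and 'to' are hypothetical field names on the incoming record.
    p1 = graph.merge_node("Person", event["from"])
    p2 = graph.merge_node("Person", event["to"])
    graph.merge_relationship(p1["name"], "FRIENDS", p2["name"])


graph = TinyGraph()
events = [
    {"from": "Ann", "to": "Bob"},
    {"from": "Ann", "to": "Bob"},   # a replayed record changes nothing
    {"from": "Bob", "to": "Cara"},
]
for event in events:
    apply_event(graph, event)

print(len(graph.nodes), len(graph.relationships))  # prints: 3 2
```

Replaying the same record leaves the graph unchanged, which is the property that lets a stream of events build up a graph without duplicating nodes.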

Click through for the example.

SQL Server 2019 RC 1.1

Amit Banerjee announces a minor numeric change and a big update to SQL Server 2019 RC1:

In continuation with our announcement of SQL Server 2019 release candidate last week, we’re announcing that the release candidate refresh for SQL Server 2019 is now available to download. The release candidate now includes bits for Big Data Clusters in SQL Server 2019 in this refresh.

Back in July, we announced the preview of Big Data Clusters in SQL Server 2019 and since then we’ve seen our customers actively bringing their big data analytical workloads to SQL Server 2019 to operationalize their AI and machine learning projects.

Read on for more.

SQL Injection without Dynamic SQL

Erik Darling has a card trick for us:

I always try to impart on people that SQL injection isn’t necessarily about vandalizing or trashing data in some way.

Often it’s about getting data. One great way to figure out how difficult it might be to get that data is to figure out who you’re logged in as.

There’s a somewhat easy way to figure out if you’re logged in as sa.

Wanna see it?

Of course you do.

Fixing Windows Power Settings

Jeff Iannucci takes us through power settings within T-SQL:

Well, not exactly, but it’s definitely like that. The default Power Setting is “Balanced” which means during periods of lower activity the clock speeds of your CPUs are reduced to conserve power and save your battery.

Apparently all Windows installations think they are on laptops. SPOILER ALERT: your database servers are probably not laptops.

Jeff has a T-SQL script to fix this. Unfortunately, it won’t fix the other power-based performance killer: power settings in BIOS.

Understanding Power BI Data Virtualization Queries

Gerhard Brueckl walks us through a few examples of queries Power BI makes when virtualizing data:

Even though this query only touches two different data sources, it is a good way to analyze the queries sent to the data sources. To track these queries I used the built-in Performance Analyzer of Power BI desktop which can be enabled on the “View”-tab. It gives you detailed information about the performance of the report including the actual SQL queries (under “Direct query”) which were executed on the data sources. The plain text queries can also be copied using the “Copy queries” link at the bottom.

Read on for the queries and for Gerhard’s analysis.

AzCopy, Batch Files, and the Task Scheduler

Randolph West shares the results of persistent, relentless experimentation:

This coincidentally has caused an issue if we are using Windows Task Scheduler to schedule the synchronization process, especially if we use a SAS (Shared Access Signature) token which can be quite long. What then happens is we have a command that is longer than Windows Task Scheduler allows, and the task will fail with a very unhelpful error message:

Task Scheduler failed to execute task "\AzureBlobStorageSync". Attempting to restart. Additional Data: Error Value: 2147942487.
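The error value is easier to read once it is split into its HRESULT fields. A small Python sketch (the `decode_hresult` helper is a hypothetical name, not part of any Windows tooling):

```python
def decode_hresult(value):
    """Split a 32-bit HRESULT into its severity, facility, and code fields."""
    return {
        "hex": f"0x{value:08X}",
        "failure": bool(value >> 31),        # top bit set means failure
        "facility": (value >> 16) & 0x7FF,   # 7 is FACILITY_WIN32
        "code": value & 0xFFFF,              # 87 is ERROR_INVALID_PARAMETER
    }


result = decode_hresult(2147942487)
print(result)
# {'hex': '0x80070057', 'failure': True, 'facility': 7, 'code': 87}
```

Win32 error 87 is "The parameter is incorrect", which is consistent with a rejected command line but says nothing about its length.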

Click through to see how Randolph fixed this problem, which created a new problem that Randolph then had to solve as well.

Estimates outside the Histogram Range

Josh Darnell shows us how SQL Server calculates estimates for input values outside of the range of your relevant statistic’s histogram:

I have the impression that CSelCalcColumnInInterval “fails” if the predicate doesn’t fall within any of the histogram intervals. The estimation logic then chooses to try the CSelCalcAscendingKeyFilter calculator (a reference to the “ascending key problem”) if the predicate is specifically higher than the last histogram interval.

Josh includes a couple of demos as well, so check them out.

Checking Spark Config on Windows

Ed Elliott has a PowerShell script to tell you if your Spark configuration on Windows is incorrect:

There are some pretty common mistakes people make (myself included!), most common I have seen recently have been having a semi-colon in JAVA_HOME/SPARK_HOME/HADOOP_HOME or having HADOOP_HOME not point to a directory with a bin folder which contains winutils.

To help, I have written a small PowerShell script that a) validates that the setup is correct and then b) runs one of the spark examples to prove that everything is set up correctly.
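The two mistakes mentioned above can be checked in a few lines. This is a rough Python sketch of the validation step only, not Ed's PowerShell script: the `check_spark_env` function is a hypothetical helper, it assumes the file to look for is `winutils.exe`, and it does not run a Spark example.

```python
import os


def check_spark_env(env, exists=os.path.exists):
    """Return a list of problems found in the given environment mapping."""
    problems = []
    for name in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif ";" in value:
            problems.append(f"{name} contains a semi-colon")

    # HADOOP_HOME must point to a directory whose bin folder holds winutils.
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home and ";" not in hadoop_home:
        winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
        if not exists(winutils):
            problems.append(f"winutils.exe not found at {winutils}")
    return problems


for problem in check_spark_env(os.environ) or ["No problems found"]:
    print(problem)
```

The `exists` parameter is there only so the check can be exercised without touching the real file system.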

Click through for the script.

Eliminating Tail Calls in Python

Kevin Feasel

2019-08-29

Python

John Mount shows how you can eliminate tail calls in Python:

I was working through Kyle Miller‘s excellent note: “Tail call recursion in Python”, and decided to experiment with variations of the techniques.

The idea is: one may want to eliminate use of the Python language call-stack in the case of “tail calls” (a function call where the result is not used by the calling function, but instead immediately returned). Tail call elimination can both speed up programs, and cut down on the overhead of maintaining intermediate stack frames and environments that will never be used again.
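One common way to get this effect in Python is a trampoline: the function returns a description of the next call instead of making it, and a loop runs those calls one at a time. The sketch below is a generic illustration, not John's or Kyle's code; `TailCall`, `trampoline`, and `sum_to` are names made up for this example.

```python
class TailCall:
    """Marker object describing a call to make later instead of right now."""

    def __init__(self, fn, *args):
        self.fn = fn
        self.args = args


def trampoline(fn, *args):
    """Run fn, looping for as long as it keeps handing back TailCall markers."""
    result = fn(*args)
    while isinstance(result, TailCall):
        result = result.fn(*result.args)
    return result


def sum_to(n, acc=0):
    """Sum the integers 1..n using an accumulator."""
    if n == 0:
        return acc
    # A tail call, deferred: no new Python stack frame is pushed here.
    return TailCall(sum_to, n - 1, acc + n)


print(trampoline(sum_to, 100000))  # prints: 5000050000
```

Written as ordinary recursion, `sum_to(100000)` would exceed Python's default recursion limit; with the trampoline the call stack stays at a constant depth.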

Click through for John’s riff on the topic.
