
Day: August 30, 2019

Real-Time Analytics with Divolte, Kafka, Druid, and Superset

Fokko Driesprong gives us a proof-of-concept architecture for real-time analytics in the Hadoop ecosystem:

Divolte Collector is a scalable and performant application for collecting clickstream data and publishing it to a sink, such as Kafka, HDFS or S3. Divolte has been developed by GoDataDriven and made available to the public under the Apache 2.0 open source license.

Divolte can be used as the foundation to build anything from basic web analytics dashboarding to real-time recommender engines or banner optimization systems. By using a JavaScript tag in the browser of the customers, it gathers data about their behaviour on the website or application. You’re in full control of what you do, and don’t, want to capture.

Click through for the example.


Combining Neo4j with Kafka

David Allen shows how you can use Neo4j to visualize graph data living in Kafka:

We’re enabling the plugin to work as both a source and a sink. In the NEO4J_streams_sink_topic_cypher_friends item, we’re writing a Cypher query. In this query, we’re MERGE-ing two Person nodes. The plugin gives us a variable named event, which we can use to pull out the properties we need. When we MERGE nodes, it creates them only if they do not already exist. Finally, it creates a relationship between the two nodes (p1) and (p2).

This sink configuration is how we’ll turn a stream of records from Kafka into an ever-growing and changing graph. The rest of the configuration handles our connection to a Confluent Cloud instance, where all of our event streaming will be managed for us. If you’re trying this out for yourself, make sure to replace KAFKA_BOOTSTRAP_SERVERS, API_SECRET, and API_KEY with the values that Confluent Cloud gives you when you generate an API access key.
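
As a rough illustration of the MERGE-then-relate behavior described in the excerpt, here is a minimal Python sketch. The Cypher string is hypothetical: the property names (initiated, accepted) and the FRIENDS relationship type are assumptions for illustration, not taken from David's post. The in-memory graph only mimics the semantics and does not talk to Neo4j or Kafka.

```python
# Hypothetical Cypher in the spirit of the excerpt. The property names
# (event.initiated, event.accepted) and the FRIENDS type are assumptions.
FRIENDS_CYPHER = (
    "MERGE (p1:Person {name: event.initiated}) "
    "MERGE (p2:Person {name: event.accepted}) "
    "CREATE (p1)-[:FRIENDS]->(p2)"
)


def apply_friend_event(graph, event):
    """Mimic the query above against an in-memory (nodes, edges) pair."""
    nodes, edges = graph
    for name in (event["initiated"], event["accepted"]):
        # MERGE semantics: adding to a set is a no-op if the node exists.
        nodes.add(name)
    # The relationship is created for every event, as in the excerpt.
    edges.append((event["initiated"], event["accepted"]))
    return graph


# Two made-up records, standing in for messages consumed from a topic.
graph = (set(), [])
events = [
    {"initiated": "Ann", "accepted": "Bob"},
    {"initiated": "Ann", "accepted": "Cy"},
]
for ev in events:
    apply_friend_event(graph, ev)
```

Ann appears in both events but ends up as a single node, which is the point of using MERGE rather than CREATE for the Person nodes.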

Click through for the example.


SQL Server 2019 RC 1.1

Amit Banerjee announces a minor numeric change and a big update to SQL Server 2019 RC1:

In continuation with our announcement of SQL Server 2019 release candidate last week, we’re announcing that the release candidate refresh for SQL Server 2019 is now available to download. The release candidate now includes bits for Big Data Clusters in SQL Server 2019 in this refresh.

Back in July, we announced the preview of Big Data Clusters in SQL Server 2019 and since then we’ve seen our customers actively bringing their big data analytical workloads to SQL Server 2019 to operationalize their AI and machine learning projects.

Read on for more.


SQL Injection without Dynamic SQL

Erik Darling has a card trick for us:

I always try to impart on people that SQL injection isn’t necessarily about vandalizing or trashing data in some way.

Often it’s about getting data. One great way to figure out how difficult it might be to get that data is to figure out who you’re logged in as.

There’s a somewhat easy way to figure out if you’re logged in as sa.

Wanna see it?

Of course you do.


Fixing Windows Power Settings

Jeff Iannucci takes us through power settings within T-SQL:

Well, not exactly, but it’s definitely like that. The default Power Setting is “Balanced” which means during periods of lower activity the clock speeds of your CPUs are reduced to conserve power and save your battery.

Apparently all Windows installations think they are on laptops. SPOILER ALERT: your database servers are probably not laptops.

Jeff has a T-SQL script to fix this. Unfortunately, it won’t fix the other power-based performance killer: power settings in BIOS.
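
For a quick check outside of T-SQL, the following is a small sketch (not Jeff's script) that parses the kind of line `powercfg /getactivescheme` prints on an English-language Windows install. The sample line and the helper name `parse_active_scheme` are assumptions for illustration; on a real server you would feed it the command's actual output.

```python
import re

# Well-known GUID of the built-in "High performance" power plan.
HIGH_PERFORMANCE_GUID = "8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c"


def parse_active_scheme(output):
    """Return (guid, name) from a 'Power Scheme GUID: ... (Name)' line."""
    match = re.search(r"GUID:\s*([0-9a-fA-F-]{36})\s*(?:\((.*?)\))?", output)
    if match is None:
        raise ValueError("no power scheme GUID found in output")
    return match.group(1).lower(), (match.group(2) or "").strip()


# Assumed sample output; the GUID shown is the built-in "Balanced" plan.
sample = "Power Scheme GUID: 381b4222-f694-41f0-9685-ff5bb260df2e  (Balanced)"
guid, name = parse_active_scheme(sample)
is_high_performance = guid == HIGH_PERFORMANCE_GUID
```

Comparing on the GUID rather than the display name avoids depending on localized plan names.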


Understanding Power BI Data Virtualization Queries

Gerhard Brueckl walks us through a few examples of queries Power BI makes when virtualizing data:

Even though this query only touches two different data sources, it is a good way to analyze the queries sent to the data sources. To track these queries I used the built-in Performance Analyzer of Power BI desktop which can be enabled on the “View”-tab. It gives you detailed information about the performance of the report including the actual SQL queries (under “Direct query”) which were executed on the data sources. The plain text queries can also be copied using the “Copy queries” link at the bottom.

Read on for the queries and for Gerhard’s analysis.


AzCopy, Batch Files, and the Task Scheduler

Randolph West shares the results of persistent, relentless experimentation:

This coincidentally has caused an issue if we are using Windows Task Scheduler to schedule the synchronization process, especially if we use a SAS (Shared Access Signature) token which can be quite long. What then happens is we have a command that is longer than Windows Task Scheduler allows, and the task will fail with a very unhelpful error message:

Task Scheduler failed to execute task "\AzureBlobStorageSync". Attempting to restart. Additional Data: Error Value: 2147942487.
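
The error value becomes a little less opaque once it is read as a Windows HRESULT. The sketch below (the helper name `decode_hresult` is made up for illustration) splits the decimal value into its standard fields. It says nothing about the root cause beyond what the code itself encodes.

```python
def decode_hresult(value):
    """Split a 32-bit HRESULT into its standard bit fields."""
    return {
        "hex": f"0x{value:08X}",
        "failure": bool((value >> 31) & 1),  # severity bit
        "facility": (value >> 16) & 0x7FF,   # 11-bit facility code
        "code": value & 0xFFFF,              # low 16 bits
    }


decoded = decode_hresult(2147942487)
# hex is 0x80070057: facility 7 is FACILITY_WIN32, and Win32 error 87
# is ERROR_INVALID_PARAMETER ("The parameter is incorrect").
```

An invalid-parameter error is at least consistent with the scheduler rejecting an over-long command, though the message itself gives no hint of that.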

Click through to see how Randolph fixed this problem, which in turn created a new problem that Randolph also had to solve.


Estimates outside the Histogram Range

Josh Darnell shows us how SQL Server calculates estimates for input values outside of the range of your relevant statistic’s histogram:

I have the impression that CSelCalcColumnInInterval “fails” if the predicate doesn’t fall within any of the histogram intervals. The estimation logic then chooses to try the CSelCalcAscendingKeyFilter calculator (a reference to the “ascending key problem”) if the predicate is specifically higher than the last histogram interval.

Josh includes a couple of demos as well, so check them out.
