Divolte Collector is a scalable and performant application for collecting clickstream data and publishing it to a sink, such as Kafka, HDFS or S3. Divolte has been developed by GoDataDriven and made available to the public under the Apache 2.0 open source license.
Click through for the example.
We’re enabling the plugin to work as both a source and a sink. In the NEO4J_streams_sink_topic_cypher_friends item, we’re writing a Cypher query. In this query, we’re creating Person nodes. The plugin gives us a variable named event, which we can use to pull out the properties we need. When we MERGE nodes, it creates them only if they do not already exist. Finally, it creates a relationship between the two nodes.
This sink configuration is how we’ll turn a stream of records from Kafka into an ever-growing and changing graph. The rest of the configuration handles our connection to a Confluent Cloud instance, where all of our event streaming will be managed for us. If you’re trying this out for yourself, make sure to replace
API_KEY with the values that Confluent Cloud gives you when you generate an API access key.
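As a hedged sketch of what such a sink entry might look like (the topic name comes from the blurb; the labels, property names, and relationship type here are assumptions, not the post’s actual query):

```
NEO4J_streams_sink_topic_cypher_friends: "
  MERGE (p1:Person {name: event.person1})
  MERGE (p2:Person {name: event.person2})
  MERGE (p1)-[:FRIENDS_WITH]->(p2)"
```

Because MERGE is idempotent, replaying the same Kafka record does not duplicate nodes or relationships.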
Click through for the example.
There are some pretty common mistakes people make (myself included!). The most common ones I have seen recently are having a semicolon in JAVA_HOME, SPARK_HOME, or HADOOP_HOME, or having HADOOP_HOME not point to a directory with a bin folder which contains winutils.
To help, I have written a small PowerShell script that (a) validates that the setup is correct and then (b) runs one of the Spark examples to prove that everything is set up correctly.
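The linked script is PowerShell, but the gist of its validation step can be sketched in Python (the variable names come from the blurb; the exact checks in the real script may differ):

```python
import os
from pathlib import Path

def validate_spark_env(env=None):
    """Return a list of problems with a Windows Spark/Hadoop setup."""
    env = dict(os.environ) if env is None else env
    problems = []
    # The two mistakes called out above: stray semicolons and a bad HADOOP_HOME.
    for var in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
        value = env.get(var)
        if not value:
            problems.append(f"{var} is not set")
        elif ";" in value:
            problems.append(f"{var} contains a semicolon: {value!r}")
    hadoop = env.get("HADOOP_HOME", "")
    if hadoop and ";" not in hadoop:
        winutils = Path(hadoop) / "bin" / "winutils.exe"
        if not winutils.is_file():
            problems.append(f"winutils.exe not found at {winutils}")
    return problems

# Example: a deliberately broken environment.
print(validate_spark_env({"JAVA_HOME": r"C:\jdk;", "SPARK_HOME": r"C:\spark"}))
```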
Click through for the script.
The Apache Flink project’s goal is to develop a stream processing system to unify and power many forms of real-time and offline data processing applications as well as event-driven applications. In this release, we have made a huge step forward in that effort, by integrating Flink’s stream and batch processing capabilities under a single, unified runtime.
Significant features on this path are batch-style recovery for batch jobs and a preview of the new Blink-based query engine for Table API and SQL queries. We are also excited to announce the availability of the State Processor API, which is one of the most frequently requested features and enables users to read and write savepoints with Flink DataSet jobs. Finally, Flink 1.9 includes a reworked WebUI and previews of Flink’s new Python Table API and its integration with the Apache Hive ecosystem.
Click through for the major changes.
Simply put, an External Table is a table built directly on top of a folder within a data source. This means that the data is not hidden away in some proprietary SQL format. Instead, the data is completely accessible to outside systems in its native format. The main reason for this is that it gives us the ability to create “live” queries on top of text data sources. Every time a query is executed against the table, the query is run against the live data in the folder. This means that we don’t have to run ETL jobs to load data into the table. Instead, all we need to do is put the structured files in the folder and the queries will automatically surface the new data.
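As an illustrative sketch of the idea (the table name, columns, and path are hypothetical, not from the post), a Hive-style external table over a folder of delimited text files might be declared like this:

```sql
-- Hypothetical external table over CSV files living in /data/sales/
CREATE EXTERNAL TABLE sales (
  order_id INT,
  amount   DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/sales/';
```

Dropping new files into that folder makes them visible to the very next query, with no ETL step in between.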
Each language has its own way of doing things, but they all use the Hive metastore under the covers.
We need compute to run our notebooks, and this is achieved by creating a cluster. A cluster is merely a set of virtual machines behind the scenes that together form this compute resource. The benefit of Azure Databricks is that compute is only chargeable while it is running.
Let’s now click the Clusters icon and set up a simple cluster. Once you have loaded the page you can use the “Create Cluster” button.
Click through for an explanation of what each of the settings means.
Airflow is a platform to programmatically author, schedule, and monitor workflows or data pipelines. These workflows are expressed as Directed Acyclic Graphs (DAGs) of tasks. It is open source and still in the Apache incubator stage. It was started in 2014 under the umbrella of Airbnb, and since then it has built an excellent reputation, with approximately 800 contributors and 13,000 stars on GitHub. The main functions of Apache Airflow are to author, schedule, and monitor workflows.
It’s another interesting product in the Hadoop ecosystem and has additional appeal outside of that space.
We are loading the YCSB dataset with 1,000,000,000 records, each record 1KB in size, creating a total of 1TB of data. After loading, we wait for all compaction operations to finish before starting the workload test. Each workload was run 3 times for 15 minutes each and the throughput measured; the average of the 3 runs produces the final number.
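The sizing arithmetic checks out, assuming decimal units (1KB = 1,000 bytes; with binary units the total would be about 0.93 TiB):

```python
records = 1_000_000_000       # one billion YCSB records
record_size_bytes = 1_000     # 1 KB per record (decimal KB assumed)
total_bytes = records * record_size_bytes
print(total_bytes == 10**12)  # → True: 10^12 bytes = 1 TB
```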
The post argues that there’s an improvement, but when a majority of cases end up worse (even if just a little bit), I’m not sure it’s much of an improvement.
When a user creates a Delta Lake table, that table’s transaction log is automatically created in the _delta_log subdirectory. As he or she makes changes to that table, those changes are recorded as ordered, atomic commits in the transaction log. Each commit is written out as a JSON file, starting with 000000.json. Additional changes to the table generate subsequent JSON files in ascending numerical order, so that the next commit is written out as 000001.json, the following as 000002.json, and so on.
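The naming scheme can be illustrated with a toy sketch (this is only an illustration of the ordered-commit idea described above; the real Delta Lake log stores richer action records and handles atomicity and concurrency, which this does not):

```python
import json
import os
import tempfile

def commit(table_dir, actions, version):
    """Toy sketch: write one commit as a numbered JSON file in _delta_log."""
    log_dir = os.path.join(table_dir, "_delta_log")
    os.makedirs(log_dir, exist_ok=True)
    name = f"{version:06d}.json"  # 0 -> 000000.json, 1 -> 000001.json, ...
    with open(os.path.join(log_dir, name), "w") as f:
        json.dump(actions, f)
    return name

table = tempfile.mkdtemp()
print(commit(table, {"add": "part-0001.parquet"}, 0))  # → 000000.json
print(commit(table, {"add": "part-0002.parquet"}, 1))  # → 000001.json
```

Replaying the files in ascending numerical order reconstructs the table’s state at any version.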
It’s interesting that they chose JSON instead of a binary transaction log like relational databases use.
The .NET driver is made up of two parts. The first part is a Java JAR file, which is loaded by Spark and then runs the .NET application. The second part runs in the .NET process and acts as a proxy between the .NET code and the Java classes (from the JAR file), which translate the requests into Java calls in the JVM that hosts Spark.
The .NET driver is added to a .NET program using NuGet and ships both the .NET library and two Java JARs: one for Spark 2.3 and one for Spark 2.4, and you do need to use the one that matches your installed version of Spark.
As much as I’ve enjoyed his series, getting it in a single-post format is great.