One of the key benefits of using Kafka Streams over other streaming engines is that the stream processing apps / microservices don’t need a cluster. Rather, each microservice can be run as a standalone app (e.g., a JVM container). You can then spin up multiple instances of each to scale up the microservice. Kafka will treat this as a single consumer group with multiple instances. Kafka Streams takes care of consumer partition reassignments for scalability.
You can see how to start these three microservices here.
If you’re trying to wrap your head around streaming apps, give this a try. George has all of the code in his GitHub repo.
The major problem with the Lambda architecture is that we have to build two separate pipelines, which can be very complex and ultimately makes it difficult to combine the processing of batch and real-time data. However, it is now possible to overcome this limitation if we are able to change our approach.
Databricks Delta delivers a powerful transactional storage layer by harnessing the power of Apache Spark and Databricks File System (DBFS). It is a single data management tool that combines the scale of a data lake, the reliability and performance of a data warehouse, and the low latency of streaming. The core abstraction of Databricks Delta is an optimized Spark table that stores data as Parquet files in DBFS and maintains a transaction log that tracks changes to the table.
It’s an interesting contrast and I recommend reading the whole thing.
Leona Zhang has a series going on Apache Kafka. Part one covers some of the concepts around messaging systems:
There is a difference between batch processing applications and stream processing applications. The most significant difference is whether the data has a visible boundary. If a boundary exists, the processing is called batch processing. For example, a client collects the data once every hour, sends this data to the server for statistics, and then saves the statistical results in the statistical database.
If the boundary doesn’t exist, the processing is called stream processing. Here is an example: logs and orders are generated continuously on a large website, just like a data flow. If each log and order is processed within several hundred milliseconds to several seconds of its generation, the application is called a stream application. If the logs and orders are instead collected once every hour and then transmitted together, the original stream data has been converted into batch data.
Occasionally, stream processing becomes imperative. For example, Jack Ma wanted to display the orders and sales on Tmall for November 11 on a large screen. If the data center works in a T+1 mode and can obtain data for November 11 on November 12, Jack Ma would not be happy.
Kafka uses the group concept to integrate the producer/consumer and publisher/subscriber models.
One topic may have multiple groups, and one group may include multiple consumers. Within a group, each message is consumed by only one consumer (the producer/consumer model). Across groups, consumers follow the publisher/subscriber model: every group receives each message.
Note: Within a group, each partition is allocated to only one consumer. If there are three partitions and four consumers in a group, one consumer is redundant and receives no data.
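The group rules quoted above can be sketched in a few lines of Python. This is a toy simulation rather than Kafka client code: the round-robin assignment is an assumption made for illustration (Kafka's real assignors are configurable and behave differently in detail), and the group and consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in a group (round-robin, assumed)."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def deliver(partition, groups):
    """For a message on `partition`, return the one receiving consumer per group."""
    receivers = {}
    for group_name, assignment in groups.items():
        for consumer, owned in assignment.items():
            if partition in owned:
                receivers[group_name] = consumer
    return receivers


partitions = [0, 1, 2]
groups = {
    # Four consumers share three partitions, so one consumer gets nothing.
    "group-a": assign_partitions(partitions, ["a1", "a2", "a3", "a4"]),
    # A second group with a single consumer that owns every partition.
    "group-b": assign_partitions(partitions, ["b1"]),
}

idle = [c for c, owned in groups["group-a"].items() if not owned]
receivers = deliver(1, groups)
print("idle consumers in group-a:", idle)
print("receivers of a message on partition 1:", receivers)
```

Every group receives the message (publisher/subscriber across groups), but only one consumer inside each group handles it, and the fourth consumer in the first group stays idle.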
This looks to be the start of a good series.
When creating a new PowerApp using the Power BI integration, you get an additional data source – PowerBIIntegration that serves as the connection to the Power BI report. Whenever a filtering action occurs in the Power BI report, this information is available in this property.
While creating the PowerApp, I chose to add a new form, which in the next step needs a connection to the Article table (which holds the additional article details).
Check out the entire series too.
Since it is Friday and time for some more PowerShell fun, and I’ve been sharing some of my prompt functions, I thought I’d re-share my kitchen sink prompt. This PowerShell prompt function does *a lot* of things and gives you a snapshot view of your system every time you press enter. It will work cross-platform, but because the function uses Get-CimInstance to retrieve system information it needs Windows. The prompt function will not only customize the onscreen prompt but also the title bar.
Click through for a link to the prompt as well as seeing it in action.
Cloud Shell is a lightweight way to run scripts using either Bash or PowerShell. You can run scripts in a browser using the Azure portal or shell.azure.com, with the Azure mobile app, or using the VS Code Azure Account extension. If you have seen the “Try it now” links in Azure documentation pages, those links direct you to Cloud Shell.
The rest of this post focuses on using PowerShell with Cloud Shell.
Click through for the demo.
As we’ve discussed many times, the performance of the storage layer has an outsized impact on the total cost of ownership (TCO) for your complete analytics pipeline. This is because every percentage point improvement in storage performance results in that same percentage reduction in the requirement for the very expensive compute layer. Given that the disaggregated storage model allows us to scale compute and storage independently, that percentage reduction in compute requirement results in almost the same reduction in TCO (compute typically equates to 90 percent of the TCO).
So, when I say that ADLS Gen2 provides performance improvements of 10-50 percent over existing storage solutions, depending on the nature of the workload, this equates to VERY significant reductions in the monthly analytics spend. It also has the added benefit of providing your insights sooner!
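The arithmetic behind that claim can be worked through quickly. The sketch below takes the excerpt's two stated premises at face value (compute need falls by the same fraction that storage performance improves, and compute is about 90 percent of TCO); the function name is hypothetical and the result is a back-of-the-envelope estimate, not a pricing model.

```python
def tco_reduction(storage_improvement, compute_share=0.90):
    """Estimate the fractional TCO reduction from a storage speed-up.

    Assumes, as the excerpt does, that the compute requirement drops by
    the same fraction as storage performance improves, and that storage
    cost itself is unchanged.
    """
    compute_reduction = storage_improvement
    return compute_share * compute_reduction


for improvement in (0.10, 0.50):
    saving = tco_reduction(improvement)
    print(f"{improvement:.0%} faster storage -> about {saving:.0%} lower TCO")
```

Under these assumptions, the quoted 10 to 50 percent performance range corresponds to roughly a 9 to 45 percent reduction in TCO.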
Check out all of the changes.
If memory consumption is below the Low limit, everything is fine and data is free to stay in memory. Once consumption passes the Low limit, a cleaner thread wakes up and tries to clean up memory. At this point the price of memory is no longer zero. It starts from 2 at the Low limit and goes as high as 1000 when memory consumption reaches the Total limit. The higher the memory pressure, the more aggressive the cleaner gets. Once memory consumption reaches the Hard limit, all connections/sessions are closed and queries are cancelled with an out of memory error.
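To make the three limits concrete, here is a small Python model of the behavior described. The excerpt only gives the endpoints (a price of 2 at the Low limit and 1000 at the Total limit), so the linear ramp between them is purely an assumption, as are the function name, the state labels, and the example limit values.

```python
def memory_price(used, low, total, hard):
    """Return (price, state) for a memory consumption level.

    Endpoints come from the excerpt: price 0 below Low, 2 at Low,
    1000 at Total. The linear ramp in between is an assumption.
    """
    if used < low:
        return 0, "ok"            # nothing needs to be evicted
    if used >= hard:
        return 1000, "cancel"     # sessions closed, queries cancelled
    if used >= total:
        return 1000, "cleaning"   # price assumed to stay at its maximum
    fraction = (used - low) / (total - low)
    return 2 + fraction * (1000 - 2), "cleaning"


# Hypothetical limits, expressed as percent of available memory.
LOW, TOTAL, HARD = 65, 80, 90
for used in (50, 65, 72.5, 80, 95):
    print(used, memory_price(used, LOW, TOTAL, HARD))
```

The point of the model is the shape, not the exact numbers: the price is zero until the Low limit, climbs as pressure builds, and the Hard limit is a cutoff rather than another step on the ramp.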
This is a thorough explanation with some good demos and terrible queries. Give it a read.
In Java, there are also helper components (a topic for future posts), but the integration is not as tight, so when we want to pass data into and out of Java we need to code somewhat more explicitly to make data passing possible.
In our Java code, we need to represent the data passed in and out as class member column arrays. In your classes, you define one array per column passed in and one array per column returned. These column arrays are some of the “magic” members I mentioned above, and they are the equivalent to
The components that are part of the Java extension need to know about these members as the components either populate them when pushing data into Java or read from them when returning data from Java. The way that the components know about the members is based on a naming standard.
It’s definitely easier to pass data in and get data back from R and Python, but I suppose part of that is Java being a static, compiled language.
Principal component analysis (PCA) is a dimension-reduction method that can be used to reduce a large set of (often correlated) variables into a smaller set of (uncorrelated) variables, called principal components, which still contain most of the information.
PCA is a concept that is traditionally hard to grasp, so instead of giving you the n’th mathematical derivation I will provide you with some intuition.
Basically, PCA is nothing but a projection of some higher-dimensional object into a lower dimension. What sounds complicated is really something we encounter every day: when we watch TV we see a 2D-projection of 3D-objects!
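The projection idea can be tried on the smallest possible case: reducing 2D points to 1D. The Python sketch below is not taken from the linked post; it uses the closed-form principal axis of a 2x2 covariance matrix so that it needs only the standard library, and the sample points are made up for illustration.

```python
import math


def pca_2d_to_1d(points):
    """Project 2D points onto their first principal component."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix.
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)

    # Direction of largest variance (leading eigenvector) in closed form.
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    axis = (math.cos(theta), math.sin(theta))

    # 1D coordinates: each centered point projected onto that axis.
    scores = [x * axis[0] + y * axis[1] for x, y in centered]

    # Share of the total variance that survives the projection.
    explained = (sum(s * s for s in scores) / (n - 1)) / (cxx + cyy)
    return axis, scores, explained


# Made-up, strongly correlated sample data (roughly y = 2x).
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
axis, scores, explained = pca_2d_to_1d(points)
print("principal axis:", axis)
print("share of variance kept:", explained)
```

Because the two variables are so strongly correlated, a single coordinate along the principal axis keeps nearly all of the variance, which is the sense in which the smaller set of components "still contains most of the information".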
Click through for the rest of the story.