When customers were asking for an S3 connector, there were already several Kafka-to-S3 solutions out there at the time, so we had to decide whether to adopt an existing S3 connector, modify the Kafka Connect HDFS connector (as some developers attempted to do) or write a new connector from scratch.
We knew that our users needed three things from the connector:
1. Integration with the Kafka Connect API: Connect’s scaling and fault tolerance capabilities were important to have, and users didn’t want yet another system that they’d need to learn how to use, deploy and monitor.
2. Exactly once: Users didn’t want to waste expensive compute cycles on deduplicating their data. And no one likes missing events.
3. No extra dependencies: Especially dependencies on additional datastores. Kafka clients and the S3 SDK libraries should be all you need to get events from Kafka to S3. Simplicity rules, especially in a distributed systems world where simple is often the key to being reliable.
When we considered the existing connectors, we noticed that none of them delivered the reliability and exactly once capabilities we wanted. They treat S3 like it’s another file system—though it isn’t really. For example, S3 lacks file appends, it is eventually consistent, and listing a bucket is often a very slow operation.
Click through for a dive into what Confluent did and how it works.
Amazon EMR provides high-level information on how it sets the default values for Spark parameters in the release guide. These values are automatically set in the spark-defaults settings based on the core and task instance types in the cluster.
To use all the resources available in a cluster, set the
maximizeResourceAllocationparameter to true. This EMR-specific option calculates the maximum compute and memory resources available for an executor on an instance in the core instance group. It then sets these parameters in the
spark-defaultssettings. Even with this setting, generally the default numbers are low and the application doesn’t use the full strength of the cluster. For example, the default for
spark.default.parallelismis only 2 x the number of virtual cores available, though parallelism can be higher for a large cluster.
Spark on YARN can dynamically scale the number of executors used for a Spark application based on the workloads. Using Amazon EMR release version 4.4.0 and later, dynamic allocation is enabled by default (as described in the Spark documentation).
There’s a lot in here, much of which applies to Spark in general and not just EMR.
HDFS offers a caching mechanism that takes advantage of the Data nodes memory. Blocks are loaded in memory and pinned there so that when a client requests those blocks they can be served directly from memory which is much faster than disk. There are some 3rd party products out there that does the same, but this option comes with Hadoop out of the box.
Hadoop has a special set of commands for managing this cache – the cacheadmin commands.
You must explicitly cache a directory or a file, and in case you cache a directory the caching is not recursive and sub directories will not be cached automatically. The full documentation can be found here. I was curious to see if Cloudera has integrated cache commands into their Cloudera manager, but was surprised to see that their documentation about it is basically a copy of the Apache hadoop guide and you still have to use the command line cacheadmin.
Click through to see how it performed in Guy’s scenario.
Now that we know, some basic definitions. It’s time we ask the main question! Why do I care?
There are many reasons you might wanna use Docker. I will give my perspective on why I started to learn about Docker.
I had to test my Kafka producers and consumers locally instead of deploying my code in DEV/QA even before I was sure things are working fine but also be sure that the same code, when deployed in other environments, should behave the same.
There are a few really good reasons for containers and testing is one of them.
In the Froid white paper, Microsoft talked about how they were working on fixing the function problem. When I read that white paper, my mind absolutely reeled – it’s awesome work, and I love working with a database doing that kind of cool stuff. Now that 2019 is getting closer to release, I’ve been spending more time with it. Most of the time, it works phenomenally well, and it’s the kind of improvement that will drive adoption to SQL Server 2019. Here, though, I’ve specifically picked a query that runs worse only to show you that not every query will be better.
To activate Froid, just switch the compatibility level to 150 (2019), and the query runs in 1 minute, 45 seconds, or almost twice as slow. Dig into the actual plan to understand why?
Read on to learn why.
Previously you could use Power BI Helper to connect to a model in Power BI Desktop and analyze that model, getting the list of all tables, columns, measures, alongside with measure dependency and modeling advise, and also documenting everything at the end. The good news is that now with having XMLA endpointavailable, you can connect directly to Power BI datasets in the service, and get all those functionalities with that.
Read on for the full change set.
The other day someone checked in some code and every now and then the build would fail with the error
Msg 15151, Level 16, State 19, Line 51
Cannot drop the event session ‘ProcsExecutions’, because it does not exist or you do not have permission.
I decided to take a look at the code and saw what the problem was. I will recreate the code here and then show you what needs to be changed. This post will not go into what Extended Events are, you can look that up in the SQL Server Extended Events documentation
I like these
IF NOT EXISTS checks on release scripts as that makes them re-runnable. Even if you don’t use continuous integration for release scripts, you may sometimes hit F5 one too many times.
You have batch mode adaptive joins, but no row mode adaptive joins. Why?
Adaptive joins are more appropriate for scenarios where the join-input row count fluctuates significantly. Batch mode assumes a higher row flow vs. an OLTP low-row typical pattern. Row mode adaptive joins would likely be too prone to regressions. Batch mode on rowstore opens up adaptive joins for scenarios where we estimate higher row counts for join-inputs.
There are some good questions and answers in this set.
Public endpoint, ability to connect to Azure SQL Database Managed Instance from Internet, without VPN has reached global availability today. The release of this feature will help support many new integration scenarios.
The public endpoint for Managed Instance can today be enabled/disabled via PowerShell script. The support for Azure portal will be coming within the next week or so as soon as all updates are rolled out.
Click through to learn how to enable it with Powershell.