The Anomalia Machina is going to require (at least!) one more thing as stated in the intro, loading with lots of data! Kafka is a log aggregation system and operates on a publish-subscribe mechanism. The Kafka cluster in Anomalia Machina will be accumulating a lot of events which are to be processed to discover anomalies. The exact sequence of processing is still being prototyped at this point in time, but there is a solid requirement of a tool/mechanism to load the Kafka cluster with lots of data in a hurry.
The requirements pointed me in direction of looking for ‘Kafka Load Testing’. Firstly thinking of load testing, one tool comes to mind which is used very widely for load testing of Java based systems: ‘Jmeter’. Jmeter has rich toolset to perform various types of testing. It also comes with many advantages viz. Open source, easy to use, platform independent, distributed testing etc. I can use Jmeter and test its ability to perform cluster loading.
Read on for the demonstration.
We leverage the power of HDP 3.0 from efficient storage (erasure coding), GPU pooling to containerized TensorFlow and Zeppelin to enable this use case. We will the save the details for a different blog (please see the video)- to summarize, as we trained the car on a track, we collected about 30K images with corresponding steering angle data. The training data was stored in a HDP 3.0 cluster and the TensorFlow model was trained using 6 GPU cards and then the model was deployed back on the car. The deep learning use case highlights the combined power of HDP 3.0.
Click through for more additions and demos.
Did you know that you regularly read type set in size 8, or even smaller? In printed materials, captions and less important information (think: photograph credits, newsletter headline subtext, magazine staff listings) are usually reduced to something between 7.5 to 9 points. We generally read that size type without much issue, like glasses. The reason why we can comfortably read those small sizes is because the designers chose an effective font that keeps its clarity and legibility when shrunk.
Designers don’t make the font that tiny to give you a headache. They do it to establish a font hierarchy. Our brains interpret the biggest size as the most important and the littlest size as the least important. So we can create a hierarchy of font sizes to structure our work and communicate even more clearly.
The font hierarchy is important, but so is picking a font which is clear at the sizes you want to use.
The primary use case we had in mind when developing this function was troubleshooting page-related waits. Some of the key performance scenarios that require page details to diagnose are tempdb contention, last page insert contention (also see this blog articlefor code samples) and page-level blocking. All of these scenarios present as either PAGELATCH or lock wait types in sys.dm_exec_query_stats with a page identifier (db_id:file_id:page_id) in the wait_resource column. In order to gain any insight into the problem, you have to know more details about what that page is. Is it a data page, an index page or something else? If it’s a data or index page, which object does it belong to? Prior to SQL Server 2019 CTP 2.0, the only way to gain any of this insight was to use the lightlydocumented command DBCC PAGE. Unfortunately, if you’re using some sort of monitoring tool or script to gather data about waits and blocking, there’s no good way to include this DBCC PAGE call in those scripts. Not to mention it requires sysadmin privileges. Enter sys.dm_db_page_info.
The sys.dm_db_page_info function takes 4 parameters: database ID, file ID, page ID and mode (‘LIMITED’ or ‘DETAILED’). It returns a table with a single row that contains the header information from the page, including things like object_id, index_id and partition_id. As a standalone function, this provides a documented and fully supported way to retrieve this information without the need for a DBCC command. The real benefit however is the ability to JOIN this function with other DMVs such as sys.dm_exec_requests or sys.sysprocesses to gather this information automatically when there is any sort of page-related contention. If you’re familiar with these views you know that the wait_resource column can contain all sorts of different resources, so it’s not in the right format to facilitate a direct join to sys.dm_db_page_info. To support this use case, we had to add another column to these views called page_resource. Whenever the wait_resource column contains a page resource, the page_resource column will be populated with an 8-byte hexadecimal value. The page ID is the first four bytes, followed by 2 bytes for the file ID and then 2 bytes for the database ID.
There are also a couple of supporting functions, so check those out.
When messing about (developing) things that are based on the query store I came across an unusual phenomenon. We were missing query store intervals. It’s not that data had been deleted but rather than it was never created in the first place. Take a look at this;
I’d set the query store to 5 minute intervals for testing purposes and left it. Notice that there is no data between 15:20 and 15:25. It’s not that the data has been deleted, we know that because the runtime_stats_interval_id is still creating increments correctly.
Click through for the answer.
This month we had 15 post submissions about this daunting topic. Two of the posts were from people who had never posted before. To them I say welcome and I hope you enjoyed the experience.
I think, in general, you were all brave to write about this sensitive topic. I know a lot of you are consultants (which is a great way to expose yourself to a death march project) and must be careful about telling stories that could be misconstrued by clients. Nonetheless, with enough obfuscation and redaction you have brought forth some truly horrifying posts! I am going to group the submissions by the most terrible themes that many posts shared.
It is best to read this post in Vincent Price’s voice.
I recently had a scenario in a two-node Availability Group where multiple large-batch modification queries were executed and created a large redo queue on the replica. The storage on the replica is slower than that on the primary (not a desired scenario, but it is what it is) and the secondary has fallen behind before, but this time it was to the point where it made more sense remove the database from the replica and re-initialize, rather than wait several hours for it to catch up. What I’m about detail is not an ideal solution. In fact, your solution should be architected to avoid this scenario entirely (storage of equal capability for all involved nodes is essential). But, stuff happens (e.g., a secondary database unexpectedly pausing), and the goal was to get the replica synchronized again with no downtime.
Click through for the demo.