In particular, we’ll look at an example scenario that addresses Data Drift – where new information is added mid-stream and when that occurs the new table structure and new column values are created in Snowflake automatically.
To illustrate, let’s take HTTP web server logs generated by Apache web server (for example) as our main source of data. Here’s what a typical log line looks like:
188.8.131.52 - - [14/Jun/2014:10:30:19 -0400] "GET /department/outdoors/category/kids'%20golf%20clubs/product/Polar%20Loop%20Activity%20Tracker HTTP/1.1" 200 1026 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
Click through for the demonstration.
Note that the DB does quite a lot for you: it enforces serializability, locks, your logical constraints, etc. If the DB is distributed (Vitesse, Cockroach, Spanner, Yugabyte), it does even more.
If you were to go Kafka-first… well, it isn’t impossible. But all those responsibilities now belong to you as a developer. And if you are thinking there may be multiple webservers handling user requests and passing them to Kafka, you have to solve fairly challenging problems.
Read the whole thing.
Many instance-level items that you have been used to configuring on full installations are off limits. Some of these items include:
– Setting min and max server memory
– Enabling optimize for ad hoc workloads
– Changing cost threshold for parallelism
– Changing instance-level max degree of parallelism
– Optimizing tempdb with multiple data files
– Trace flags
Tim does point out workarounds for some of these and gives us the list of things which are possible, so check that out.
As can be seen, this schema is slightly
denormalisedas the number of lines per transaction areprecalculated in the
column. This will turn out to be quite useful for this calculation, but it isn’t strictly necessary.
Now, in the previously linked Stack Overflow question, a report was desired that would calculate:
– An aggregation of sums as provided by the line items
– An aggregation of averages as provided by the transactions
As Lukas points out, doing this in two queries is easy, but doing it in one is sublime.
So a few things here. I’m using a global temp table instead of a local one because it makes it easier to reference. Local temp tables aren’t listed in tempdb under their name while global ones are.
The first part of this message (the bit in black) is a warning basically telling us that if there is a temp table (a # at the front of the name) it’s going to ignore the multi part reference. In other words, you’re going to get this message any time you try to use a multi part name and a temp table. Linked server or not. The second part of the message (the bit in red) just tells us that there isn’t a temp table named ##DBList.
Kenneth finds a way, but I can’t think of a scenario where accessing a temp table on a different instance turned out to be a good idea.
I’ve supplied these two python scripts in my
githubrepo at the following link. Firstwe need to open the install the relevant python libraries so you’ll need to issue the below pip command in whatever command tool you use, bash or cmd Prompt
pip install azure-eventhub
Check it out if you need pub-sub in Azure.
When I run this script I see the output in the screenshot below, a MOF file has been created in my output folder. Managed Object Format (MOF) files are used to describe Common Information Model (CIM) classes, these are industry standards which gives us flexibility in working with DSC. In DSC this is important as the MOF file is the artefact that will actually be used to configure our nodes. This MOF will be delivered to our target node and enacted by the Local Configuration Manager (LCM).
The LCM will be covered in more detail in a later post, but for now know that it can be configured to be in either ‘Push’ mode or ‘Pull’ mode. Pull mode is more complicated to set up but perhaps more appropriate for managing a large number of servers. For now, we will look at the ‘Push’ mode where we will deliver the MOF manually to the target node for the LCM to enact.
There are a lot of TLAs to watch out for within Desired State Configuration.
I’m terrible at naming things. I recently wrote some quick code to reproduce a design problem and demonstrate several options for solutions, and later realized that I’d named my objects dbo.Foo, dbo.FooFoo, and dbo.Wat.
But I feel strongly about a few important principles about referring to objects that are already named in a relational database, specifically SQL Server and Azure SQL Database.
Most of the time, you should use a two part-name for objects in the current database. It’s important to know your database context. When referencing an object in the current database, you should not specify the database name in the reference.
Read the comments as well. I’m not as hard-set against three-part naming for cross-database queries but can understand the sentiment.