Press "Enter" to skip to content

Category: Architecture

Supersized Tables

Deborah Melkin tells a story of a design battle she lost:

The programmers came to me and said we need to add a large number of columns to this table for one piece of functionality. It would more than double the total number of columns on the table. Oh, and all of the new columns would be NULL since we would only need to populate them if they were using that functionality and even then, not all of them would require data. The final result would be that 65-75% of the table would end up having nullable fields with the majority of those having NULL for the value.

I said what I think any sane DBA would say to this request: No.

Click through for the rest of the tale.
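We don't get the counter-proposal in the excerpt, but the textbook alternative to doubling a table's width is a one-to-one extension table that only gets a row when the feature is actually in use. A minimal sketch of that shape, with made-up table and column names:

-- Hypothetical names for illustration only.
CREATE TABLE dbo.Orders
(
    OrderID int NOT NULL CONSTRAINT PK_Orders PRIMARY KEY,
    CustomerID int NOT NULL,
    OrderDate datetime2(0) NOT NULL
    -- ...the existing columns stay here...
);

-- Feature-specific columns live in a 1:1 extension table, so the base
-- table never carries dozens of mostly-NULL columns.
CREATE TABLE dbo.OrderFeatureDetails
(
    OrderID int NOT NULL
        CONSTRAINT PK_OrderFeatureDetails PRIMARY KEY
        CONSTRAINT FK_OrderFeatureDetails_Orders REFERENCES dbo.Orders (OrderID),
    FeatureAttribute1 varchar(50) NULL,
    FeatureAttribute2 decimal(18, 2) NULL
    -- ...and so on for the rest of the feature's columns...
);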


Serverless Azure

Christos Matskas has an article on Azure Functions, Service Fabric, and Batch:

This service is the hidden gem of HPC (high performance computing) within the Azure Compute service family. As the name implies, Azure Batch is designed to run large-scale and high-performance computing applications efficiently in the cloud. When you’re faced with large workloads, all you have to do is to use Azure Batch to define compute resources to execute your applications in parallel and at the desired scale. A good use-case for Azure Batch would be to perform financial risk modelling, climate data analysis or stress testing. What makes Batch so useful is the fact that you don’t need to manually manage the node cluster, virtual networks or scheduling because all this is handled by the service. You need to define a job, any associated data and the number of nodes you want to utilise. It makes no difference if you need to run on one, a hundred or even thousands of nodes. The service is designed to scale according to the workload needs.

The cheapest server may very well be no server, and we’re at the point where relatively simple services could just run as Azure Functions or AWS Lambda functions.


That 53rd Week

Jens Vestergaard notes that you can sometimes have a 53rd week in the year:

There are a lot of great examples out there on how to build your own custom Time Intelligence into Analysis Services (MD). Just have a look at this, this, this, this and this. All good sources for solid Time Intelligence in SSAS.
One thing they have in common, though, is that they all make the assumption that there are and always will be 52 weeks in a year. The data set I am currently working with is built on the ISO 8601 standard. In short, this means that there is a (re-)occurrence of a 53rd full week, as opposed to only 52 in the Gregorian version, which is defined by: 1 Gregorian calendar year = 52 weeks + 1 day (2 days in a leap year).

The 53rd week occurs approximately every five to six years, though this is not always the case. The last several times we saw 53 weeks in a year were in 1995, 2000, 2006, 2012 and 2015. The next time will be in 2020. This gives you enough time to either forget about the hacks and hard-coded fixes in place to mitigate the issue OR bring your code into a good state, ready for the next time.

Dates and currency are hard problems.
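If you want to see the 53rd week for yourself, SQL Server's ISO_WEEK datepart (available since 2008) makes for a quick sanity check:

-- 2015 is one of the years Jens lists with an ISO week 53.
SELECT
    DATEPART(ISO_WEEK, '20151231') AS ISOWeek_Dec31_2015,       -- 53
    DATEPART(ISO_WEEK, '20160101') AS ISOWeek_Jan01_2016,       -- also 53: that week still belongs to 2015
    DATEPART(WEEK,     '20160101') AS DefaultWeek_Jan01_2016;   -- 1: the non-ISO WEEK datepart resets on January 1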


Finding Candidate Keys

Daniel Hutmacher explains ways to find candidate keys:

Let’s assume we have a temp table heap called #table, with 9 columns and no indexes at all. Some columns are integers, one is a datetime, and a few are numeric. As I’m writing this post, my test setup has about 14.4 million rows.

In the real world, when you’re investigating a table for primary key candidates, there are a few things you’ll be looking for that are beyond the scope of this post. For instance, it’s a fair assumption that a numeric or float column is not going to be part of a primary key, varchar columns are less probable candidates than integer columns, and so on. Other factors you would take into consideration are naming conventions; column names ending with “ID” and/or columns that you can tell are foreign keys would also probably be good candidates.

It’s useful to think of all the candidate keys, as getting to Boyce-Codd Normal Form or 4th/5th NF involves dealing with all potential primary keys, not just the one you selected.  Daniel’s post gives you several different methods of searching existing data; combine that with domain knowledge and a bit of logic and you have a pretty decent start at finding candidate keys.
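The basic building block (a simplification, not necessarily Daniel's exact approach) is testing whether a column combination's distinct values cover every row; the column names below are placeholders:

-- If this returns no rows, the combination (Col1, Col2) is unique across
-- the table, making it a candidate key as long as neither column allows NULLs.
SELECT Col1, Col2, COUNT(*) AS RowsPerCombination
FROM #table
GROUP BY Col1, Col2
HAVING COUNT(*) > 1;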


Leading Wildcard Seek Triggers

Aaron Bertrand demonstrates the triggers you could use if you wanted to build leading wildcard seek tables:

In my last post, “One way to get an index seek for a leading wildcard,” I mentioned that you would need triggers to deal with maintaining the fragments I recommended. A couple of people have contacted me to ask if I could demonstrate those triggers.

To simplify from the previous post, let’s assume we have the following tables – a set of Companies, and then a CompanyNameFragments table that allows for pseudo-wildcard searching against any substring of the company name:

Read on for the triggers.
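To give a sense of the shape of the solution (this is a simplified sketch, not Aaron's actual code, and the column names are my assumptions), an insert trigger writes every suffix of each new name into the fragments table, so a search for '%corp%' can be rewritten as a seekable Fragment LIKE 'corp%':

-- Sketch only: assumes Name is NOT NULL; see Aaron's post for the real
-- definitions and for handling UPDATE and DELETE as well.
CREATE TRIGGER dbo.Companies_AfterInsert
ON dbo.Companies
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Store every suffix of each newly inserted company name.
    INSERT dbo.CompanyNameFragments (CompanyID, Fragment)
    SELECT i.CompanyID,
           SUBSTRING(i.Name, n.Position, LEN(i.Name) - n.Position + 1)
    FROM inserted AS i
    CROSS APPLY
    (
        SELECT TOP (LEN(i.Name))
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS Position
        FROM sys.all_columns
    ) AS n;
END;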


IoT Versus Event Hub

James Serra clarifies the differences between Azure’s IoT Hub and its Event Hub:

The majority of the time, if the data is coming directly from the devices, either directly or via a field-based gateway, IoT Hub will be the more appropriate choice.  Event Hub will generally be the more appropriate choice if either the data will not be coming to Azure directly from the devices, but rather either cloud-to-cloud through another provider, intra-cloud, or if the data is already landing on-premise and needs to be streamed to the cloud from a small number of endpoints internally.  There are exceptions to both conditions, of course.

Both solutions offer very high throughput data ingestion and can handle tremendous streaming data volumes.  In fact, today, IoT Hub is primarily a set of additional services that wrap an underlying Event Hub.

Read on for more scenarios and limitations in each.  They definitely serve different use cases.


Thinking About Availability Group Outages

Brent Ozar reminds us to think about graceful degradation of applications:

There’s a gray bar across the top that says, “This site is currently in read-only mode; we’ll return with full functionality soon.”

That’s not a hidden feature of Always On Availability Groups. Rather, it’s a hidden feature of really dedicated developers whose application:

This is where a bit of foresight and hard work can really pay off.  Read the whole thing.
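For what it's worth, one way an application could notice it has landed on a read-only database (just a sketch of the general idea; I don't know what this particular team did) is to check updateability when it connects:

-- Returns READ_ONLY on a readable secondary (or any read-only database)
-- and READ_WRITE on the primary; the app can use this to decide whether
-- to show its "read-only mode" banner.
SELECT DATABASEPROPERTYEX(DB_NAME(), N'Updateability') AS Updateability;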


Thinking Post-DRAM

Joe Chang argues that we may benefit more from a hardware architecture which uses lower-latency, lower-capacity RAM:

There are different types of SRAM. High-performance SRAM has 6 transistors, 6T. Intel may use 8T (Intel Labs at ISSCC 2012), or even 10T for low power (see Real World Tech on NTV)? It would seem that SRAM should be about six times less dense than DRAM, depending on the number of transistors in SRAM and the size of the capacitor in DRAM.

There is a Micron slide in Micro 48 Keynote III that says SRAM does not scale on manufacturing process as well as DRAM. Instead of 6:1, or 0.67Gbit SRAM at the same die size as 4Gbit DRAM, it might be 40:1, implying 100Mbit in equal area? Another source says 100:1 might be appropriate.

Eye-balling the Intel Broadwell 10-core (LCC) die, the L3 cache is 50mm2, listed as 25MB. It includes tags and ECC on both data and tags? There could be 240Mb or more in the 25MB L3? Then 1G could fit in a 250mm2 die, plus area for the signals going off-die.

There is a lot of depth in this blog post.


The Importance Of Auditing

Louis Davidson has a parable about database design and systems auditing:

This brings me to my data question. If an order is processed in a store, but the expected data is not created, did that order ever occur?

Very often, the staff of a business are very focused on pleasing the customer, making sure they get their product, but due to software limitations, may end up not capturing information about every sale in a satisfactory manner. Most of the blame I have seen lies in software that doesn’t meet the requirements of a customer, making capturing desired details tedious to achieve when the process is out of the norm. Very often the excuse programmers give is that too much of the work to build a system would need to be done for the atypical cases, but requirements are requirements, and it is generally essential that every action that occurs in a business is captured as data.

Read on for more.  My conjoined twin case is this: how much information do we have about why users give up?  For example, if you have a three-part form, how many users get through part one, part two, and part three?  There’s some natural level of attrition, but if you see an abnormally low follow-through rate, that might indicate a bug or major issue.  Auditing is hard work, as you have to hit both sides of the problem at the same time.
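To make that concrete (the table and column names here are hypothetical), a simple funnel query over a form event log would look something like this:

-- One row per user per completed step in a hypothetical FormEvents table.
SELECT
    StepNumber,
    COUNT(DISTINCT UserID) AS UsersCompletingStep
FROM dbo.FormEvents
WHERE StepNumber IN (1, 2, 3)
GROUP BY StepNumber
ORDER BY StepNumber;
-- A sharp drop-off between steps points to a bug or a UX problem
-- rather than normal attrition.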


Where Azure Analysis Services Fits

Melissa Coates explains where Azure Analysis Services fits in common BI architectures:

(2) Data Sources

  • From a single source such as a data warehouse. This is the most traditional path for BI development, and still has a very valid place in many BI/analytics deployments. This scenario puts the work of data integration on the ETL process into the data warehouse, which is the most appropriate place.

  • Directly from various systems.  This can be done, but works well only in specific cases – it definitely won’t work well if there are a lot of highly normalized tables, or if there’s not a straightforward way to relate the disparate data together. Trying to go directly to the source systems & skip an intermediary data warehouse puts the “integration” burden on the data source view in Analysis Services, so plan for plenty of time testing if you’re going to try this route (i.e., it can be much harder, not easier). Note that this option only makes sense if the data is stored in Analysis Services because it needs to be related together somehow (i.e., DirectQuery mode, discussed next in #3, with > 1 data source won’t work if a user tries to combine data sources because the data is not inherently related).

If you’re thinking about Azure Analysis Services, this post is a good one.
