Press "Enter" to skip to content

Author: Kevin Feasel

TOP And UNIONs

Andy Levy points out that the TOP operator doesn’t always apply to each element in a UNION:

This system uses TOP every now and then trying to limit the number of records it gets back (and the developers always seem to use the arbitrary 301 – I’m guessing there’s some degree of cargo cult programming going on). I’d prefer a well-constructed WHERE clause to limit the result set, but beggars can’t be choosers when working with legacy vendor code.

What I found odd was that the Ledger1 table didn’t get a lot of traffic – with the WHERE clause in use (omitted here for brevity), you’d only get a handful of records, maybe a dozen at most.

Click through to see the rest of the problem, as well as Andy’s solution.
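To make the behavior concrete, here is a minimal sketch (Ledger2 and the column names are made up for illustration). In T-SQL, a TOP in one branch of a UNION limits only that branch; to cap the combined result, the TOP has to sit outside the union:

-- This TOP limits only the first branch; Ledger2 rows come back unrestricted
SELECT TOP (301) LedgerId, Amount
FROM dbo.Ledger1
UNION ALL
SELECT LedgerId, Amount
FROM dbo.Ledger2;

-- To cap the combined result at 301 rows, wrap the union
SELECT TOP (301) LedgerId, Amount
FROM (
    SELECT LedgerId, Amount FROM dbo.Ledger1
    UNION ALL
    SELECT LedgerId, Amount FROM dbo.Ledger2
) AS Combined;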


Thoughts On Data Sizing

Greg Low has some thoughts around data types and sizes:

I was recently at a site where they were changing all their bigint columns to uniqueidentifier columns (i.e., GUID columns) because they were worried about running out of bigint values. In a word, that’s ridiculous. While it’s easy to say “64 bit integer”, I can assure you that understanding the size of one is out of our abilities. In 1992, I saw an article that said if you cleared the register of a 64 bit computer and put it in a loop just incrementing it (adding one), on the fastest machine available that day, you’d hit the top value in 350 years. Machines are much faster now than back then, but that’s still a crazy big number.

If you’re at risk of running out of bigints, you have a lot of data: that’s more than 18 quintillion rows you can hold, and if my quick math is up to snuff, roughly 131 thousand petabytes of data just to store the IDs.
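For the record, the quick math, using the full 2^64 range at 8 bytes per bigint:

2^64 values × 8 bytes = 2^67 bytes = 2^17 pebibytes ≈ 131,000 petabytes

And that’s just the IDs, before you store anything useful next to them.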


When AT TIME ZONE Is Too Slow

Robert Davis troubleshoots a performance problem relating to time zones:

Time Zones were definitely being a drag today. I got an email from one of the developers at work asking about the performance difference between 2 queries. The only difference between the 2 queries is that one of them uses the AT TIME ZONE clause that was added in SQL Server 2016. I have not played around with this particular clause, but we do store quite a bit of data in the datetimeoffset data type. In the table in the developer’s queries, the data is all stored in the Eastern time zone, but they are considering storing it in additional time zones and will want to be able to display it in the Eastern time zone even if not stored that way. Thus, AT TIME ZONE.

When the developer was testing the conversion function, he noticed that the query slowed waaaayyyyy down when he added AT TIME ZONE. Before adding AT TIME ZONE, STATISTICS TIME for the query was: CPU time: 145549 ms, elapsed time: 21693 ms. It returned 8,996 rows, but if I removed the DISTINCT, it returned over 72M rows. That’s a lot of clams … er, data.

Read on for the rest of the story, including Robert’s solution.  Also check out his Connect item related to this.
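For reference, the clause itself looks like this; a minimal sketch with hypothetical table and column names, not Robert’s actual schema:

-- Convert a datetimeoffset column for display in Eastern time
SELECT EventId,
       EventTime AT TIME ZONE 'Eastern Standard Time' AS EventTimeEastern
FROM dbo.Events;

Innocuous as it looks, that conversion runs for every row the operator sees, which is how a 72-million-row scan turns painful.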


Roll Your Own SSAS Performance Monitoring Tool

Shabnam Watson has a great post on building an Analysis Services performance monitoring tool from scratch using Power BI:

In many cases, SSAS works efficiently with default settings right out of the box. However, when you have large databases, a substantial number of concurrent users, insufficient resources on your server, or when best practices are not followed during SSAS database design, you can run into performance bottlenecks and problems. In these scenarios, you need to know what to measure and how to measure it, what’s normal for your environment (a benchmark), and you need to have some amount of historical measurements to be able to see the events that lead to a certain bad performance/failure point. Once you have this data, you can improve your server’s performance by addressing the problem(s).

This is a tour de force of an article, absolutely worth reading if you plan on dealing with Analysis Services at some point.  Even if you don’t build your own tool, you’ll learn a lot about what drives SSAS performance and what indicates that there might be a problem.
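As a small taste of the “what to measure” side: SSAS exposes dynamic management views you can query directly. A hypothetical example, run against the SSAS instance (e.g., from SSMS):

-- Who is connected, and what did each session last run?
SELECT SESSION_ID, SESSION_USER_NAME, SESSION_LAST_COMMAND
FROM $System.DISCOVER_SESSIONS;

Shabnam’s post goes much deeper, pairing sources like these with Power BI for collection and visualization.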


Fun With Dynamic SQL: Implicit Casting Can Allow SQL Injection

Remus Rusanu shows an example where implicit casting from NVARCHAR to VARCHAR can introduce a SQL injection vulnerability that you otherwise wouldn’t expect:

In both examples above, the SQL executed apparently should have been safe from SQL injection, but it isn’t. Neither REPLACE nor QUOTENAME was able to properly protect, and the injected division by zero was executed. The problem is the Unicode MODIFIER LETTER APOSTROPHE (NCHAR(0x02bc)) character that I used in constructing the NVARCHAR value, which is then implicitly cast to VARCHAR. This cast converts the special ‘modifier letter apostrophe’ character to a plain single quote. This introduces new single quotes into the value, after the supposedly safe escaping occurred. The result is plain old SQL injection.

Click through for the script.  The upside of this is that it’s entirely under your control and you should be able to get it right by using NVARCHAR consistently.
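This is not Remus’s exact script, but a minimal sketch of the mechanics, assuming a VARCHAR target whose code page (e.g., Windows-1252) best-fit-maps U+02BC to a plain apostrophe:

DECLARE @input NVARCHAR(50) = N'O' + NCHAR(0x02BC) + N'; SELECT 1/0 --';
-- QUOTENAME sees no single quote, so there is nothing to escape
DECLARE @quoted NVARCHAR(258) = QUOTENAME(@input, '''');
-- The implicit cast to VARCHAR turns U+02BC into a real single quote,
-- after the escaping already happened
DECLARE @sql VARCHAR(300) = 'SELECT ' + @quoted;
PRINT @sql;   -- the string literal is now broken open
EXEC (@sql);  -- the injected division by zero fires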


HDFS Federation

Sangeeta Gulia explains what HDFS Federation is and how it differs from classic HDFS:

HDFS Federation improves the existing HDFS architecture through a clear separation of namespace and storage, enabling a generic block storage layer. It enables support for multiple namespaces in the cluster to improve scalability and isolation. Federation also opens up the architecture, expanding the applicability of an HDFS cluster to new implementations and use cases.

Namenodes are federated, that is, all these NameNodes work independently and don’t require any coordination with each other.

It’s one way to reduce the number of potential single points of failure in a Hadoop environment.
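A rough sketch of what federation looks like in hdfs-site.xml (the nameservice IDs and hostnames are invented for illustration); each NameNode owns its own namespace, while the DataNodes serve blocks for all of them:

<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>namenode2.example.com:8020</value>
</property>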


Pandas Basics

Kevin Jacobs has a tutorial on Python’s Pandas library:

There are a few things worth mentioning. Often, Pandas is abbreviated as pd (like NumPy, which is often abbreviated as np). If you look at other code, you will see that DataFrames are often abbreviated as df. Here, the DataFrame is constructed using data from a list of lists. The columns argument specifies the keys of the data.

This is a high-level intro, but helps you get your feet wet if you’ve not played with the library.
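A minimal, hypothetical example of the conventions the excerpt describes:

import pandas as pd  # the conventional abbreviation

# Build a DataFrame from a list of lists; `columns` supplies the keys
data = [['Alice', 30], ['Bob', 25]]
df = pd.DataFrame(data, columns=['name', 'age'])
print(df)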


A Look At Azure SQL Database Price By Tier

Arun Sirpal puts together a comparison of Azure SQL Database prices (in GBP) by service tier:

Hopefully this paints a picture for you. I will have my say, though. The Basic tier database is something that you should NOT be using for production workloads; it’s quite obvious with the 2GB limit, but worth reinforcing the point. Standard tier is more for your common workloads, and premium is designed for high transactional volume where I/O performance is much more important to you – my hunch is that maybe they are utilizing SSDs? I am not sure, but the premium costs are much higher.

There is another service tier called Premium RS (in preview). My understanding is that performance is similar to that of Premium; HOWEVER, it is only useful for workloads that can tolerate up to 5 minutes of data loss due to service failures. I will probably not use this for production, but then again it seems to be nearly half the cost of premium. Choices, choices, choices.

Two notes with this one.  First, these are prices as of when Arun put together the notes; they will probably fluctuate over time.  Second, there might be differences in prices by data center.  At the very least, though, this gives you an idea of the relative price spread.


Getting Distinct Dimension Count Based On A Filtered Measure

Gogula Aryalingam shows us a neat trick in Power BI:

Enthusiastic as we were, one of the hardest nuts to crack, though it seemed so simple during requirements gathering, was to perform a distinct count of a dimension based on a filtered measure on a couple of the reports. To sketch it up with some context: you have products, several more dimensions, and a whole lot of measures, including one called Fulfillment (which was a calculation based on a couple of measures from two separate tables). The requirement was to get a count of all those products (that were of course filtered by other slicers on the Power BI report) wherever Fulfillment was less than 100%, i.e. the number of products that had not reached their targets.

Simple as the requirements seemed, the hardest part of getting it done was our limited knowledge of DAX, specifically, knowing which function to use. We first tried building the data model itself, but our choice of DAX formulae and the number of records we had (50 million+) soon saw us running out of memory in seconds on a 28GB box; not too good, given the rest of the model didn’t even utilize more than half the memory.

Click through for the answer.
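For a rough sense of the shape such a measure can take in DAX (not necessarily Gogula’s solution; the table, column, and measure names here are hypothetical):

Products Below Target :=
CALCULATE (
    DISTINCTCOUNT ( Products[ProductKey] ),
    FILTER ( VALUES ( Products[ProductKey] ), [Fulfillment] < 1 )
)

The FILTER over VALUES evaluates the Fulfillment measure per product under the report’s existing filters, keeping only those under 100%.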


Improving Code Quality With SonarQube

Samir Behara has a quick look at SonarQube, an open source static analysis engine:

In my project, we have also integrated SonarQube with our TFS CI/CD build and have configured the Quality Gates.

For example – if I try to inject a security threat or a known coding issue, the TFS build will fail, the check-in will get rejected, the quality gate fails, and SonarQube points me to the exact issue – which I can rectify and do another check-in. So it will basically stop you from checking in code with potential issues.

Currently, the only way to catch such issues is during manual code reviews. SonarQube will help in automating that process. You can write your own rules to look for known issues in the code and stop it before the code gets checked in to source control.
So overall, you can ensure good-quality code going to Production and fewer regression defects coming up later.

Read on for an example where a SonarQube rule can find a SQL injection vulnerability and thereby fail the build.
