The policy shown below establishes four Sentry roles based on the
Administrators are entitled to all actions.
Operators are granted update and query privileges.
Users are granted query privileges.
Tech users are granted update privileges.
These are pretty straightforward role-based access controls. The authors also look at accessing the data via Flume and a couple other technologies.
The gaining popularity of self-service analytical tools such as Tableau increases the necessity of having curated data in your database. These tools aim to allow the end users to intuitively query data “at the speed of thought” from the data warehouse and visualize the results quickly. That type of capability allows users to go through several different iterations of the data to really explore the data and generate unique insights. These tools do not work well when the underlying database tables have not been curated properly.
This is a difficult and lengthy process, but it’s vital; data minus context is a lot less relevant than you’d hope.
S3 is a massively scalable key-based object store that is well-suited for storing and retrieving large datasets. Due to its underlying infrastructure, S3 is excellent for retrieving objects with known keys. S3 maintains an index of object keys in each region and partitions the index based on the key name. For best performance, keys that are often read together should not have sequential prefixes. Keys should be distributed across many partitions rather than on the same partition.
For large datasets like genomics, population-level analyses of these data can require many concurrent S3 reads by many Spark executors. To maximize performance of high-concurrency operations on S3, we need to introduce randomness into each of the Parquet object keys to increase the likelihood that the keys are distributed across many partitions.
Reading the title, I wanted it to be an article on knobs to turn in S3 to maximize read performance. It’s still an article well worth reading, but focuses from the other side: how to write to S3 without stepping on your own toes.
There are three important things here: first, our Zookeeper port is 2181. Zookeeper is great for centralized configuration and coordination; if you want to learn more, check out this Sean Mackrory post.
The second bit of important information is how long our retention period is. Right now, it’s set to 7 days, and that’s our default. Remember that messages in a Kafka topic don’t go away simply because some consumer somewhere accessed them; they stay in the log until we say they can go.
Finally, we have a set of listeners. For the sandbox, the only listener is on port 6667. We connect to listeners from our outside applications, so knowing those addresses and ports is vital.
This is still quick-start level stuff, but I’m building up to custom development, honest!
Whilst I won’t be showing code in this, there is a companion sample database project. This is on GitHub and each key stage is shown by a branch of work. This means you can jump in at most stages and work from there. If you need some git advice, check out my quick git in 5 presentation.
The core DB is a super hero themed database.
Database unit testing involves a lot of compromises (because most of what we care about as database professionals tends to be integration tests rather than unit tests), but having imperfect tests generally beats having no tests, so check this out.
- “A minimum of 16 core licenses is required for each server.”
- “A minimum of 8 core licenses is required for each physical processor.”
When most sysadmins see that, they’ll think, “Okay, so I shouldn’t bother buying a server smaller than 2 8-core processors.”
There are plenty of scenarios in which this doesn’t hurt (much): mainly when you need a hefty server with more than 16 cores, or when you are running in a virtualized environment and can split that hardware across a number of logical servers.
As for me, this is one reason why I’m looking forward to SQL Server on Linux.
You are going to cause all sorts of problems if you write code like this. First off, if you really do have these three queries within the same stored procedure, how hard will it be to confuse which is table ‘a’ in each of the queries when you go back to edit them? Pretty easy.
It gets worse though. I know that none of us will ever write a query that exceeds 26 tables in a JOIN… well, except that one time… and that other time. In fact, it happens. Oh, it’s not always a good thing, but it’s a thing. How do we respond to that? I’ve seen this:
Some of this stuff is egregious. Some of it is debatable. But either way, well-formatted code helps prevent bugs and aid understanding of queries.
You’ll need a linked server from your secondary replica to the primary replica. This linked server should feature the absolute bare-minimum of permissions, preferably with only the “be made using the login’s current security context” selected.
The account running the procedure (or the mapped login in the linked server) will need SELECT access to the following DMVs on the remote (primary) server:
The code itself is a download from Daniel’s website; go check it out.
Before you think this is to perf tuning what boxed wine is to pest extermination; it’s not. It’s another tool that has pros and cons. The plan cache is cool too, but cached plans don’t have all the information that actual plans do. You can run Traces or Profiler or Extended Events, but they all sort of have their own caveats, gotchas, and overhead. If you don’t have a monitoring tool, though, what are you left with?
Let’s take a look at what you can do with STATISTICS PROFILE, and then the (rather obvious) limitations. Here’s the setup and a simple query.
I’ll admit that outside of learning what they are, I’ve never used text execution plans. I’ll read the XML, view the graphical results, pipe them out to SentryOne Plan Explorer (formerly SQL Sentry), etc. But the text plans never held much allure for me.
The range locks are usually acquired only in SERIALIZABLE isolation level; however, there is another, pretty much undocumented case, when SQL Server can use those locks. It happens even in READ UNCOMMITTED and READ COMMITTED SNAPSHOT modes when you havenonclustered indexes that have IGNORE_DUP_KEY=ON option. In that case rows with the duplicated index keys would not raise an error but rather being ignored. SQL Server would not insert then into the table.
This behavior leads to very hard to explain cases of blocking and even deadlocks in the system. Let’s look at the example and create the table with a few rows as shown below. As you see, nonclustered index on the table has IGNORE_DUP_KEY option enabled.
This is an interesting risk when using IGNORE_DUP_KEY.