In Qubole’s 2018 Data Activation Report, we did a deep-dive analysis of how companies are adopting and using different big data engines. As part of this research, we found some fascinating details about Hadoop that we will detail in the rest of this blog.
A common misconception in the market is that Hadoop is dying. However, when you hear people say this, they often mean that “MapReduce” as a standalone resource manager and “HDFS” as the primary storage component are what is dying. Beyond this, Hadoop as a framework is a core base for the entire big data ecosystem (Apache Airflow, Apache Oozie, Apache HBase, Apache Spark, Apache Storm, Apache Flink, Apache Pig, Apache Hive, Apache NiFi, Apache Kafka, Apache Sqoop…the list goes on).
I clipped this portion rather than the direct analysis because I think it’s an important point: the Hadoop ecosystem is thriving as the focus shifts from what was important a decade ago (batch processing of large amounts of data on servers with direct-attached storage) to what is important today (a combination of batch and streaming processing of large amounts of data on virtualized, often cloud-based servers with network-attached flash storage).
Now, it’s been pointed out, including by Adam Machanic (@adammachanic) in a tweet referencing Aaron’s post about GROUP BY v DISTINCT, that the two queries are essentially different: one is actually asking for the set of distinct combinations of the results of the sub-query, rather than running the sub-query across the distinct values that are passed in. That’s what we see in the plan, and it’s the reason the performance is so different.
The thing is that we would all assume that the results are going to be identical.
But that’s an assumption, and it isn’t a good one.
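To make the semantic difference concrete, here is a minimal sketch with hypothetical tables (dbo.Orders and dbo.Payments are mine, not from either post):

```sql
-- Hypothetical schema: dbo.Orders(OrderID, CustomerID) and
-- dbo.Payments(PaymentID, CustomerID), with many orders per customer.

-- DISTINCT form: logically, the scalar sub-query runs for every row of Orders,
-- and only then are duplicate (CustomerID, PaymentCount) combinations removed.
SELECT DISTINCT
    o.CustomerID,
    (SELECT COUNT(*)
       FROM dbo.Payments AS p
      WHERE p.CustomerID = o.CustomerID) AS PaymentCount
FROM dbo.Orders AS o;

-- GROUP BY form: grouping happens before the SELECT list is evaluated,
-- so the sub-query runs once per distinct CustomerID rather than once per row.
SELECT
    o.CustomerID,
    (SELECT COUNT(*)
       FROM dbo.Payments AS p
      WHERE p.CustomerID = o.CustomerID) AS PaymentCount
FROM dbo.Orders AS o
GROUP BY o.CustomerID;
```

Same rows either way if the underlying data holds still, but the plan shapes differ, and under weaker isolation levels the data may not hold still, which is where the “identical results” assumption falls apart.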
Rob starts out with READ UNCOMMITTED but then gets into the “normal” READ COMMITTED transaction isolation level that most places use.
When you create a “logical” Azure SQL Server (I say logical because we are not really physically creating anything), there is a setting ticked ON by default called “Allow Azure services to access server”.
The question is, what does it mean? (See the highlighted section below)
Read on to see what this does and why Arun doesn’t like the default.
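For reference, you can inspect and remove what that checkbox creates from T-SQL. A minimal sketch, run against the logical server’s master database; AllowAllWindowsAzureIps is, as far as I can tell, the rule name Azure creates for this setting:

```sql
-- Run in the master database of the logical server.
-- A rule spanning 0.0.0.0 to 0.0.0.0 means "Azure services", not "all IPs".
SELECT name, start_ip_address, end_ip_address
FROM sys.firewall_rules;

-- Turn the setting off without the portal (rule name is an assumption
-- worth verifying against your own sys.firewall_rules output).
EXECUTE sp_delete_firewall_rule @name = N'AllowAllWindowsAzureIps';
```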
I’ve been playing around with SQL Server running in Kubernetes in Azure Container Services (AKS) for a while now, and I think that the technology is really cool.
You can get a highly available instance of SQL Server up and running with a few lines of code! OK, there’s a bit of setup to do, but once you get that out of the way, you’re good to go.
One thing that has been missing, though, is persistent storage. Any changes made to the SQL instance would be lost if the pod it was running in failed and was brought back up.
Click through to learn how. It’s certainly not trivial, but Andrew does a good job showing us the step-by-step.
In this module you will learn how to use the TreeViz Custom Visual. The TreeViz is a breakdown tree that allows you to expand or collapse levels of hierarchical data.
Click through for the video, which shows more. For limited, hierarchical, categorical data, this could work pretty well.
According to MSDN, the DATE_CORRELATION_OPTIMIZATION database SET option improves the performance of queries that perform an equi-join between two tables whose date or datetime columns are correlated and which specify a date restriction in the query predicate.
How many of you read what MSDN says and think “wuuuuuttt, English please”? I do.
Read on for the English translation.
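As a quick, hypothetical illustration (the Orders/Shipments schema below is mine, not from the post):

```sql
-- Enable the option for the current database. Note: enabling it requires
-- that no other connections be open to the database.
ALTER DATABASE CURRENT SET DATE_CORRELATION_OPTIMIZATION ON;

-- Hypothetical schema: dbo.Orders(OrderID, OrderDate) and
-- dbo.Shipments(OrderID, ShipDate), where ShipDate closely tracks OrderDate.
-- With correlation statistics maintained, the restriction on OrderDate lets
-- the optimizer infer an implied date range on ShipDate for this equi-join,
-- rather than scanning all of Shipments.
SELECT o.OrderID, o.OrderDate, s.ShipDate
FROM dbo.Orders AS o
    INNER JOIN dbo.Shipments AS s
        ON s.OrderID = o.OrderID
WHERE o.OrderDate >= '20180101'
  AND o.OrderDate <  '20180201';
```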
A self-service culture allows analysts to generate their own reports, lists, and dashboards without depending on the schedule and availability of IT staff. In these cases, reports are generated by combining different sources of information, many of which may not have been used in the company before, and this in turn means that in a large number of cases you do not know which source was used to implement a particular report.
Azure Data Catalog is an option for breaking that cycle of discovery, which is usually done manually. After the first cycle, in which the business analyst discovers the best data sources for generating certain reports, he or she can register them and add information (metadata) to make those sources easier to discover for future analysts who need the same data to implement similar reports. Discovering the sources and adding metadata do not have to happen at the same time: Data Catalog treats analysts’ annotations as continuous work, with more information added to the repository over time.
Click through for a demo.
Then, as far as SQL Server is concerned, you took a backup, but the backup file is never actually created.
So why would you ever want to do that? SQL thinks you took a backup, but you have nothing to recover from. Sounds a bit, well, stupid, doesn’t it? Well, there are a few reasons.
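The trick in question is presumably a backup to the NUL device; assuming so, a minimal sketch (database name is hypothetical):

```sql
-- Back up to the NUL device: SQL Server records a successful backup in msdb,
-- but nothing is written anywhere, so there is nothing to restore from.
BACKUP DATABASE [YourDatabase] TO DISK = 'NUL';

-- The same trick for the log: the log clears as if a real log backup had
-- been taken, but the log chain now has a hole in it.
BACKUP LOG [YourDatabase] TO DISK = 'NUL';
```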
Bonus points for Andrew Notarian’s comment.
Still … there is a trick here, and I don’t see a lot of people using it. Maybe it’s just me, maybe I’m lazier than the average dev, but I often find myself using the Script Table As menu and choosing SELECT To and Clipboard. This creates a nice SELECT statement with all my fields wrapped in hard brackets. I can then copy this into an INSERT query I might be working on to save myself some typing: I quickly copy the field list that ‘Script Table As’ gives me into the top of my INSERT query, then copy the entire SELECT query into the bottom of it, and Bob’s yer uncle, I’ve got a simple INSERT query ready to go.

Note: this is most useful when I’m trying to create a new table based on an existing table with only minor changes to field names. I use this frequently when I’m establishing a reporting database based on staging tables.
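For illustration, here is roughly what that workflow produces, with hypothetical staging and reporting tables standing in for real ones:

```sql
-- Roughly what "Script Table As > SELECT To > Clipboard" hands you:
SELECT [CustomerID]
      ,[FirstName]
      ,[LastName]
      ,[CreatedDate]
  FROM [staging].[Customer];

-- Reused as the skeleton of an INSERT: the scripted column list on top,
-- the scripted SELECT pasted underneath.
INSERT INTO [reporting].[Customer]
      ([CustomerID]
      ,[FirstName]
      ,[LastName]
      ,[CreatedDate])
SELECT [CustomerID]
      ,[FirstName]
      ,[LastName]
      ,[CreatedDate]
  FROM [staging].[Customer];
```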
That’s three ways to do it in Management Studio; the next step in the process is using SMO to script with a .NET language (C#, F#, PowerShell).
Well, at PASS Summit, both in our booth and during my presentation on security in Azure DB, another idea came up: exporting data from production to development without releasing any sensitive data. This is a very common scenario: many DBAs have to export sensitive data from prod to dev, and frequently it is done in an insecure fashion.
Doing this requires a little bit of trickery, as dynamic data masking does not work for administrative users. You will need a second user.
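As a minimal sketch of the idea (table, column, and user names are all hypothetical):

```sql
-- Mask a sensitive column with the built-in email masking function.
ALTER TABLE dbo.Customer
    ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

-- Administrative users (and anyone granted UNMASK) always see the real data,
-- so the export has to run as a low-privilege second user.
CREATE USER ExportUser WITHOUT LOGIN;
GRANT SELECT ON dbo.Customer TO ExportUser;

-- The export sees only masked values, e.g. aXXX@XXXX.com.
EXECUTE AS USER = 'ExportUser';
SELECT CustomerID, Email FROM dbo.Customer;
REVERT;
```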
Read on for details.