Hadoop 3, as it currently stands (which is subject to change), won’t look significantly different from Hadoop 2, Ajisaka said. Made generally available in the fall of 2013, Hadoop 2 was a very big deal for the open source big data platform, as it introduced the YARN scheduler, which effectively decoupled the MapReduce processing framework from HDFS, and paved the way for other processing frameworks, such as Apache Spark, to process data on Hadoop simultaneously. That has been hugely successful for the entire Hadoop ecosystem.
It appears the list of new features in Hadoop 3 is slightly less ambitious than the Hadoop 2 undertaking. According to Ajisaka’s presentation, in addition to support for erasure coding and bug fixes, Hadoop 3 currently calls for new features like:
- shell script rewrite;
- task-level native optimization;
- the capability to derive heap size or MapReduce memory automatically;
- eliminating of old features;
- and support for more than two NameNodes.
The big benefit to erasure coding is that you can potentially cut data usage requirements in half, so that can help in very large environments. Alex also notes that the first non-beta version of Hadoop 3 is expected to release by the end of the year.
They didn’t give you parameter 26837 – I’m just giving you that so you can see an execution plan.
You don’t have to talk me through the query itself, or what you’d want to do to fix it. In fact, I want you to avoid that altogether.
Instead, tell me what things you need to know before you start tuning, and explain how you’re going to get them.
I think, based on the noise in the comments section, that this is a good question. Good interview questions are separating in equilibrium (as opposed to pooling). The question itself is straightforward, but people have such a tendency to jump the gun that they try to answer a question which isn’t being asked. Then, when reading the question, the set of steps and processes people have is interesting because of how much they differ.
Bonus question: take your interview answer (“I would do X and Y and Z and then A and B and C and maybe D.”) and apply it to the last time you had this scenario come up. How many of [A-DX-Z] did you actually do?
For those not familiar with the SQL Server Connector, it enables SQL Server to use Azure Key Vault as an Extensible Key Management (EKM) Provider for its SQL encryption keys. This means that you can use your own encryption keys and protect them in Azure Key Vault, a cloud-based external key management system which offers central key management, leverages hardware security modules (HSMs), and allows separation of management of keys and data, for additional security. This is available for the SQL encryption keys used in Transparent Data Encryption (TDE), Column Level Encryption (CLE), and Backup encryption.
When using these SQL encryption technologies, your data is encrypted with a symmetric key (called the database encryption key) stored in the database. Traditionally (without Azure Key Vault), a certificate that SQL Server manages would protect this data encryption key (DEK). With Azure Key Vault integration for SQL Server through the SQL Server Connector, you can protect the DEK with an asymmetric key that is stored in Azure Key Vault. This way, you can assume control over the key management, and have it be in a separate key management service outside of SQL Server.
Check it out, as it might be a solution to some key management issues.
The R data frame is a high level data structure which is equivalent to a table in database systems. It is highly useful to work with machine learning algorithms, and it’s very flexible and easy to use.
The standard definition of data frames are a “tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R‘s modeling software.”
Data frames are a powerful abstraction and make R a lot easier for database professionals than application developers who are used to thinking iteratively and with one object at a time.
The big reason that dimensional modeling increases clarity is that the dimensional model seeks to flatten data as much as possible. Let’s compare two examples. Both of these examples are for a fictional health clinic.
The first example is that we want a report on how many male patients were treated with electric shock therapy by provider, grouped monthly and spanning year to date range.
Those big Kimball-style warehouses do a great job of making it easier for people who are not database specialists to query data and get meaningful, consistent results to known business questions. The trick to understanding data platforms is that they tend to be complements rather than substitutes: introducing Spark-R in your environment does not replace your Kimball-style warehouse; it complements it by letting analysts find trends more easily. Similarly, a Hadoop cluster potentially lets you complement an existing data warehouse in a few ways: acting as a data aggregator (which allows you to push some ETL work off onto the cluster), a data collector (especially for information which is useful but doesn’t really fit in your conformed warehouse), and a data processor (particularly for those gigantic queries which are not time-sensitive).
Convert on-premises physical machine to Hyper-V VHD, upload to Azure Blob storage, and then deploy as new VM using uploaded VHD. Use when bringing your own SQL Server license, when migrating a database that you will run on an older version of SQL Server, or when migrating system and user databases together as part of the migration of database dependent on other user databases and/or system databases. Use on SQL Server 2005 or greater to SQL Server 2005 or greater
Ship hard drive using Windows Import/Export Service. Use when manual copy method is too slow, such as with very large databases. Use on SQL Server 2005 or greater to SQL Server 2005 or greater
If you’re looking for notes on where to get started, this is a good link.
An EMR 4.6 cluster running Spark 1.6.1 will still use Python 2.7 as the default interpreter. If you want to change this, you will need to set the environment variable: PYSPARK_PYTHON=python34. You can do this when you launch a cluster by using the configurations API and supplying the configuration shown in the snippet below:
I’m more of a SQL and Scala guy, but if you like Python and are on the Python 3 side of the divide, here’s a solution for you.
You see the CHECKSUM on the backup along with the RESTORE VERIFYONLY. The code was generated by right clicking on the database, selecting Tasks, then Backup, plug in the parameters, and select Script. I put it in a new query window as I may back up several databases in the same job. Sometimes I’ll just do a find/replace for the other databases since my backup. The Restore Verifyonly gives you some confidence that your backup is recoverable: NEVER assume that just because your backup ran that the database is restorable! The ONLY way to know is to actually restore it to another file! You don’t want to accidentally clobber your production that probably has newer data in it.
Corruption is a serious event when your entire job revolves around protecting data. Be prepared.
But sometimes you want to run a series of statements or procedures where you only want the execution plan for some of the statements. Here’s how:
The actual execution plan is enabled by turning on SET STATISTICS XML., not unlike enabling STATISTICS IO or TIME. And just like SET NOCOUNT, the SET statements apply to the current context, which could be a stored procedure, a session, etc. When this context ends, the setting reverts to that of the parent context.
I see code snippets with STATISTICS IO and TIME fairly regularly, but almost never see STATISTICS XML; instead, I see people (including myself) hit Ctrl-M or select the “Include Actual Execution Plan” button when generating execution plans is desirable.
This sounds paranoid, but as an example, here’s a corruption case I had recently: shortly after detecting corruption, the team realized they’d have to revert to a backup of the database from a few days ago. Rather than telling users about that possibility, they let the users keep adding data into the already-corrupt database while the DBAs did troubleshooting. Several days later, as the corruption got worse, even Microsoft couldn’t repair the corruption – and the affected tables went completely offline, permanently. If the users would have been alerted earlier, they could have avoided even more data loss.
Good advice. If you have Pluralsight, I recommend Paul Randal’s course on database corruption. Watch that ideally before you have corruption…