I’m pleased to announce tidyr 0.5.0. tidyr makes it easy to “tidy” your data, storing it in a consistent form so that it’s easy to manipulate, visualise and model. Tidy data has a simple convention: put variables in the columns and observations in the rows. You can learn more about it in the tidy data vignette.
Check out the latest version of tidyr; it’s one of the most useful data manipulation packages on the R platform.
We are excited to announce the General Availability (GA) of Databricks Community Edition(DCE). As a free version of the Databricks service, DCE enables everyone to learn and exploreApache Spark, by providing access to a simple, integrated development environment for data analysts, data scientists and engineers with high quality training materials and sample application notebooks.
Less than four months ago, at Spark Summit New York, we introduced Databricks Community Edition (DCE) beta. Its introduction generated tremendous interest with thousands of people requesting accounts. Today, we are delighted to report that more than 8,000 users have signed on DCE, many of them using the service heavily. The top 10% active users are averaging over 6 hours per week, and are executing over 10,000 commands on average.
They also just started an EdX course on an introduction to Spark yesterday. If you’re interested in Spark but haven’t had the time to learn, this might be a good course to take.
The Databricks just-in-time data platform takes a holistic approach to solving the enterprise security challenge by building all the facets of security — encryption, identity management, role-based access control, data governance, and compliance standards — natively into the data platform with DBES.
- Encryption: Provides strong encryption at rest and inflight with best-in-class standards such as SSL and keys stored in AWS Key Management System (KMS).
- Integrated Identity Management: Facilitates seamless integration with enterprise identity providers via SAML 2.0 and Active Directory.
- Role-Based Access Control: Enables fine-grain management access to every component of the enterprise data infrastructure, including files, clusters, code, application deployments, dashboards, and reports.
- Data Governance: Guarantees the ability to monitor and audit all actions taken in every aspect of the enterprise data infrastructure.
- Compliance Standards: Achieves security compliance standards that exceed the high standards of FedRAMP as part of Databricks’ ongoing DBES strategy.
In short, DBES will provide holistic security in every aspect of the entire big d
As enterprises come to depend on technologies like Spark and Hadoop, they need to have techniques and technologies to ensure that data remains secure. This is a sign of a maturing platform.
This problem is easy to solve, right? You can write scripts that run jobs in sequence, and use the output of one program as the input to another—no problem. But what if your workflow is complex and requires specific triggers, such as specific data volumes or resource constraints, or must meet strict SLAs? What if parts of your workflow don’t depend on each other and can be run in parallel?
Building your own infrastructure around this problem can seem like an attractive idea, but doing so can quickly become laborious. If, or rather when, those requirements change, modifying such a tool isn’t easy . And what if you need monitoring around these jobs? Monitoring requires another set of tools and headaches.
This is a pretty detailed look at the basics of Oozie.
Unit testing can sometimes throw up unexpected issues that are entirely concerned with the unit tests and not the application under test. Recently I’ve been having issues with DateTime formats affecting the results of the unit tests: the format of the DateTime was different on the build server than it was on my local machine. The stored procedure could not be altered (because reasons*). The datetime formats on both SQL and Windows were identical to my machine, and running the test itself on the build box caused the test to go green, so clearly there was something funky going on in the build process. SO I needed to find a way of logging the values during the process to verify where the issue was. And, as luck would have it, MSTest has such a method to help you. First create a private instance of the TestContext object in the TestClass, then you create a public property.
This turns out to be really easy to do.
Now you’ve migrated the logins with their passwords, SIDs, and a few default properties. But you don’t have the logins’ server roles, server permission sets, database roles or database permission sets. So now you gotta find and use someone’s modified version of sp_help_revlogin, but you’re still left with manually executing the procedure against your source and destination servers.
Oh, and don’t forget different versions of SQL Server use different hashing algorithms, so you’ll need to use one of the many different versions of sp_help_revlogin if you have a mixed environment.
Let’s hope you only have one or two SQL Servers to migrate and not hundreds.
Chrissy has a couple of great Powershell cmdlets to help get rid of this procedure, as well as a nice explanation of each and Youtube videos should you be so inclined. Definitely check it out, as well as her dbatools Powershell suite.
If you are running a PCI compliant system on SQL Server 2005 you are going to fail your next audit. One of the audit requirements is that the vendors must support the version of your software which you are running on. As Microsoft no longer offers support for SQL Server 2005 that’s going to cause you to fail your next PCI audit.
Microsoft’s serious about sunsetting old versions of SQL Server, and at this point, there have been five versions of SQL Server released since 2005.
If you look at the code from the interactive window, you will notice that the error occurred with trying to run rxSummary. In both cases I didn’t get the error when I changed the compute context to SQL Server from local, but when I tried to run a function which runs on the server. In both cases the R tools where installed prior to installing SQL Server 2016. The Open Source R tools install to C:\Program Files\R\R-3.3.0 (your version number may be higher). The Microsoft R Open installs to C:\Program Files\Microsoft\MRO\R-3.2.5. To use the libraries needed for the RevoScaleR libraries included in R Server, the version of Microsoft R required is Microsoft RRE, which is installed here C:\Program Files\Microsoft\MRO-for-RRE\8.0. Unfortunately, SQL Server 2016 shipped with version 8.0.3 not 8.0.0. If you are getting data and using a local compute context, you will have no problems. However, when you want to change your compute context to run on SQL Server, you will get an error.
While I received a different error on the server than my laptop, the reason for both messages was the same. Neither computer was running version 184.108.40.206 of the R client tools. On the server I was able to fix the error without downloading a thing. After installing a stand-alone version of R Server from the SQL Server Installation Center, the error went away and I got results when trying to run rxSummary. Unfortunately, it was not possible for me to run R Server on my laptop, as R Server is disabled from within the Installation Center. I believe that is because I have SQL Server 2016 developer edition on a laptop, not on a server. I needed to do something else to make it work.
Click the link for the full story.
In SQL Server 2012 and 2014, creating an Availability Group could take a significant amount of work. One of the more tedious tasks is setting up your replica databases. This is because that you need to restore your database to your replica node in a state close enough to the primary to allow synchronization to happen. It can take several log backup restores to hit that magic window where you can join the database on the secondary node. Then, you get to do it again on the next replica!
Enter direct seeding in 2016. With this feature you no longer have to do any of the restores. You simply create your replicas with direct seeding enabled, then when you add a database to the AG, SQL Server will directly seed the database to your replica nodes. It’s surprisingly simple.
This sounds pretty interesting.
COMPRESS()function takes an input of string or binary data, and applies the gzip algorithm to it. It returns a value of type
varbinary(max). In essence, instead of storing string or binary data, you can gzip it up and store it in a varbinary(max) column. There’s also a
DECOMPRESS()function for when you are reading the data and need to unzip it.
This costs some CPU, but gzip can save quite a bit of space. How much space, and whether it’s worth the CPU cost will vary depending on your data and workload. In this blog post, we’ll take a look at one table. We’ll look at the space savings we get out of using
COMPRESS(), and we’ll look at the effort necessary to implement it.
Read on for Andy’s test and thoughts.