Press "Enter" to skip to content

Day: July 3, 2018

Enabling LDAP Authentication On Cassandra

Kurt Greaves shows off a new LDAP authenticator for Apache Cassandra:

The LDAPAuthenticator is implemented using JNDI, and authentication requests will be made by Cassandra to the LDAP server using the username and password provided by the client. At this time only plain text authentication is supported.

If you configure a service LDAP user in the ldap.properties file, on startup Cassandra will authenticate the service user and create a corresponding role in the system_auth.roles table. This service user will then be used for future authentication requests received from clients. Alternatively (not recommended), if you have anonymous access enabled for your LDAP server, the authenticator allows authentication without a service user configured. The service user will be configured as a superuser role in Cassandra, and you will need to log in as the service user to define permissions for other users once they have authenticated.

The authenticator itself is hosted on GitHub, so you can check out its repo too.

Comments closed

RStudio Integration With Databricks

Brian Dirking, et al, announce support between RStudio and the Databricks platform:

With Databricks RStudio Integration, both popular R packages for interacting with Apache Spark, SparkR or sparklyr can be used the inside the RStudio IDE on Databricks. When multiple users use a cluster, each creates a separate SparkR Context or sparklyr connection, but they are all talking to a single Databricks managed Spark application allowing unique opportunities for collaboration between users. Together, RStudio can take advantage of Databricks’ cluster management and Apache Spark to perform such as a massive model selection as noted in the figure below.

I like seeing this level of integration, especially from a language like R, which has historically been limited to operating on a single machine’s memory.

Comments closed

Auto-Scaling Amazon ElasticMapReduce

Brandon Schller gives us some tips on sizing and scaling ElasticMapReduce:

EMR scaling is more complex than simply adding or removing nodes from the cluster. One common misconception is that scaling in Amazon EMR works exactly like Amazon EC2 scaling. With EC2 scaling, you can add/remove nodes almost instantly and without worry, but EMR has more complexity to it, especially when scaling a cluster down. This is because important data or jobs could be running on your nodes.

To prevent data loss, Amazon EMR scaling ensures that your node has no running Apache Hadoop tasks or unique data that could be lost before removing your node. It is worth considering this decommissioning delay when resizing your EMR cluster. By understanding and accounting for how this process works, you can avoid issues that have plagued others, such as slow cluster resizes and inefficient automatic scaling policies.

If you’re using EMR today or think you might use it in the future, you should read this.

Comments closed

Using DAX With Reporting Services

David Stelfox gives us an example of using DAX when connecting SQL Server Reporting Services to a SQL Server Analysis Services Tabular model:

Tools like Power BI have changed reporting allowing power users to leverage tabular cubes to present information quicker and without the (perceived) need for developers. However, experience tells us many users still want data in tables with a myriad of formatting and display rules. Power BI is not quite there yet in terms of providing all this functionality in the same way that SSRS is. For me, SSRS’s great value and, at the same time its curse, is the sheer amount of customisation a developer can do. I have found that almost anything a business user demands in terms of formatting and display is possible.

But you have invested your time and money in a tabular SSAS model which plays nicely with Power BI but your users want SSRS reports so how to get to your data – using DAX, of course. Using EVALUATE, SUMMARIZECOLUMNS and SELECTCOLUMNS you can return data from a tabular model in a tabular format ready to be read as a dataset in SSRS.

It’s a good post and a good example.  The only quibble I have is in the motivating paragraph; Power BI and SQL Server Reporting Services have different end goals—Power BI isn’t (and I think never will be) a pixel-perfect report building product; it’s meant to be a dashboarding technology.  That quibble aside, the example is well worth checking out.

Comments closed

Where Columnar Databases Struggle

Teo Lachev makes a good point regarding columnar databases:

A large company uses the SAP HANA ERP system. Users requires real-time access to transactional data. To avoid performance degradation, SLT replication (trigger-based change data capture) replicates data to another SAP HANA system that is used solely for reporting. The problem is that the more detailed the report gets and the more columns it has, the slower it gets and SAP HANA throws out of memory exceptions.

SAP HANA is an in-memory columnar database like Tabular. So, it stores data in columns, not rows. Columnar databases are primarily designed for analytical reports which typically have a few columns (sales by customer, product, date), but can potentially aggregate large datasets. As the reporting grain lowers and more columns are added (order number, order line item, customer name, phone number, etc.), a columnar database has to cross-join more and more columns. This is not efficient and performance quickly degrades irrespective that storage is fast. SSAS Tabular and Power BI are no different. SAP HANA complicates the issue further by preventing direct access to tables and requiring “analytical” views that join tables and potentially nest other views.

Read the whole thing.  Teo has a great point:  there are trade-offs between different data platform technologies, and choosing the right one is important.

Comments closed

Azure Data Lake Store: Now On Blob Storage

James Serra announces Azure Data Lake Store Gen2:

Big news!  The next generation of Azure Data Lake Store (ADLS) has arrived.  See the official announcement.

In short, ADLS Gen2 is the combination of the current ADLS (now called Gen1) and Blob storage.  Gen2 is built on Blob storage.  By GA, ADLS Gen2 will have all the features of both, which means it will have features such as limitless storage capacity, support all Blob tiers (Hot, Cool, and Archive), the new lifecycle management feature, Azure Active Directory integration, hierarchical file system, and read-access geo-redundant storage.

A Gen2 capability is what is called “multi-modal” which means customers can use either Blob object store APIs or the new Gen2 file system APIs.  The key here is that both blob and file system semantics are now supported over the same data.

One very interesting thing to me is that Gen2 pricing is half of Gen1.

Comments closed

Reporting Services And Availability Groups

Thomas Rushton has a good rant worked up:

Yes, you may have an availability group – well done – and you may have installed SSRS on both servers. But you’ve only set up the reporting application to point to one of those? And you’ve given the link https://<<Listener_Name>>/reports out to the users? Head/desk. I told you at the time that SSRS doesn’t play nicely with AGs. [Nearly misposted as SSRS doesn’t play nicely with SSRS, which, while valid, isn’t the point here…]

Here’s what you need to do to fix this / make sure it doesn’t happen:

Click through to learn what you need to do to make sure there are no problems.

Comments closed

Testing Network Speed Between Azure Regions

Dave Bermingham has some quick inter-region tests for Azure network performance:

This is the question I asked myself today and of course I couldn’t find this documented anywhere. I’m assuming there is no guarantee and it probably depends on current utilization, etc. If I’m wrong, someone please point me to the documentation that states the available speed. I primarily looked here and here.

So I set up two Windows 2016 D4s v3 instances, one in Central US and one in East US 2, which are paired regions.

If you don’t know what peering is, it essentially lets you to easily connect two different Azure virtual networks. Peering is very easy to setup, just make sure you configure it from both Virtual Networks, I made that mistake at first. Once it is configured properly it will look something like this.

Read on for Dave’s results.

Comments closed

Running SSMS As A Different Windows Account

Jason Brimhall shows a couple ways of running SQL Server Management Studio with different Active Directory credentials:

One of the tasks I find myself doing on a fairly regular basis is running SSMS as a different Windows User. The two biggest use cases for this are: a) to test an account to prove that it is working (or not) and has the appropriate level of access, and b) to use SSMS to connect to a Domain SQL Server from a computer in a different domain (or not on the domain).

In addition to needing to do these tasks for myself, I find that I need to show somebody else how to do the same thing on a fairly consistent basis. Considering the finite keystrokes we all have (which I referenced here), it is time for me to “document” how to do this task.

The “/netonly” command line parameter is one I’ve occasionally forgotten to my inevitable chagrin.

Comments closed