Autocompleter For Hue

Kevin Feasel

2016-09-06

Hadoop

The Hue team shows off their new SQL editor’s autocomplete capabilities:

We’ve fine-tuned the live autocompletion for a better experience and we’ve introduced some options under the editor settings where you can turn off live autocompletion or disable the autocompleter altogether (if you’re adventurous). To access these settings open the editor and focus on the code area, press CTRL + , (or on Mac CMD + ,) and the settings will appear.

The autocompleter talks to the backend to get data for tables and databases etc. by default it will timeout after 5 seconds but once it has been fetched it’s cached for the next time around. The timeout can be adjusted in the Hue server configuration.

I haven’t used Hue in a while, but that’s a nice feature.  Just don’t use ANSI-89 syntax like in that first example…

Deploying SSDT Scripts

Richie Lee has concerns with database deployments:

At any rate, the script is generated and maybe reviewed….. so then what? In SSDT there is no way to create and deploy script in one step; they are two distinct steps. And even if they were one step, this would still not resolve the issue that troubles me. So what is this issue?

The issue is that by creating a script, and then running the deploy, you cannot be sure that the database is in the exact same state that it was when the initial script was generated. If you don’t already know, SSDT runs a deploy entirely in memory, so as mentioned there is no script created. You have to explicitly create the script as part of the process. Or, if you have already created one, you have to re-create the script.

I’m on the fence here.  In simpler environments, I think Richie has a good point.  But in a complex environment, I wouldn’t even use auto-generated deployment scripts; when you’re changing hundreds of database objects (including adding and modifying columns, backfilling data, modifying indexes, etc.), that automated deployment script is almost guaranteed to fail.  And if it does fail, it could leave you in a state of irreparable harm.

Regular Expressions Against Large Data Sets

Liz Bennett explains types of regular expressions which do not scale:

With recursive backtracking based regex engines, it is possible to craft regular expressions that match in exponential time with respect to the length of the input, whereas the Thompson NFA algorithm will always match in linear time. As the name would imply, the slower performance of the recursive backtracking algorithm is caused by the backtracking involved in processing input. This backtracking has serious consequences when working with regexes at a high scale because an inefficient regex can take orders of magnitude longer to match than an efficient regex. The standard regex engines in most modern languages, such as Java, Python, Perl, PHP, and JavaScript, use this recursive backtracking algorithm, so almost any modern solution involving regexes will be vulnerable to poorly performing regexes. Fortunately, though, in almost all cases, an inefficient regex can be optimized to be an efficient regex, potentially resulting in enormous savings in terms of CPU cycles.

There’s a significant performance difference, so if you work frequently with regular expressions, check this out.

HBase Performance Tips

Ashish Thapliyal has nine tips for optimizing HBase performance:

Does your RowKey’s looks like 1,2,3…….. or 00000001, 00000002, 00000003, or do you have Row Key that starts with date-time (starting with the year)? If you answered yes, bad news is that HBase will not scale for you, you have so many options to improve the HBase performance but there is nothing that will compensate for the bad rowkey design.

When rowkey is in sorted order, all the writes go to the same region and other regions will sit ideal doing nothing. you will see one of your node is very stressed trying to cope up with all the writes where as other nodes are thanking you for not giving them enough work. So, always salt your keys by adding random numbers or characters to the row key prefix.

If you are using Phoenix on top of HBase, Phoenix provides a way to transparently salt the row key with a salting byte for a particular table. You need to specify this in table creation time by specifying a table property “SALT_BUCKETS” typical practice is to set the value of SALT_BUCKET =number of region server

I think the biggest one is to design your data structures correctly.  This is particularly important if you’re coming at it from a relational background and are thinking in terms of what makes relational databases fast.

Distributed File System Replication And Backups

James Anderson discusses an interesting setting within Distributed File System Replication:

Ideas of a cmd job step (after the backup step) that renamed the .bak files to .BTFU started to form, but a quick search showed that there is a default filter on DFSR folders.

  • Files starting with ~ (temporary files created by programs like Word)

  • Files with .tmp extension

  • Files with a .bak extension.

Read on to learn what you can do to remove extension filters within DFSR.

Pearson’s Correlation Coefficient

Kevin Feasel

2016-09-06

R, T-SQL

Mala Mahadevan explains correlation coefficients:

The statistical definition of Pearson’s R Coefficient, as it is called, can be found in detail here for those interested. A value of 1 indicates that there is a strong positive correlation(the two variables in question increase together), 0 indicates no correlation between them, and -1 indicates a strong negative correlation (the two variables decrease together). But you rarely get a perfect -1, 0 or 1. Most values are fractional and interpreted as follows:
High correlation: .5 to 1.0 or -0.5 to 1.0.
Medium correlation: .3 to .5 or -0.3 to .5.
Low correlation: .1 to .3 or -0.1 to -0.3.

Mala includes R and T-SQL code so you can follow along.

SSPI

Kevin Hill diagnoses an SSPI error:

Apparently, the account was either locked out from our failed logon attempts, or had been disabled in Active Directory due to its age.  They do that sometimes.   Most likely the issue was locked.

We restarted the SQL Server (O/S restart) and that resolved it once the AD group unlocked it.

My assumption is that the lockout either blocked Kerberos authentication due to SPN no longer being valid, or the SPN itself got corrupted.  It was still there, just not working.   Verified its existence through running SetSPN -L with the account name.

This is on my top five list of least helpful error messages.  Even if it is literally true, it does not help you diagnose and correct the issue.  There are a number of potential causes and it’s up to you to troubleshoot each one (assuming you even know that it could be an issue) until it just works again.

Throwing Hardware At The Problem

Erik Darling says get more RAM:

I’m not saying you need a 1:1 relationship between data and memory all the time, but if you’re not caching the stuff users are, you know, using, in an efficient way, you may wanna think about your strategy here.

  • Option 1: Buy some more RAM
  • Option 2: Buy an all flash array

You’ll still need to blow some development time on tuning queries and indexes, but hardware can usually bridge the gap if things are already critical.

Looking at hardware is a reasonable approach.  The best bet is to satisfy the most pressing need at the margin.  Sometimes that means more (or better) hardware, sometimes it means tuning queries, and sometimes it means application-level changes to retrieve data differently.

Group Workspaces

Ginger Grant shows how to use Group Workspaces within Power BI to allow a team to work together on the same files without stepping on each other’s feet:

Many of the functionality people associate with source control programs live inside the group one drive which is created for Power BI. Looking at the picture of the group screen, which was created when a Power BI Workspace was created, you will see that this group contains 7 members and four files. The members of this groups are the only ones who have access to the files. The file AcmeThree.pbix is selected, and cClicking on the elipise (…) brings up a menu for the file. Notice one of them is Check Out. If I check out a file, the icon next to the name changes, providing a visual queue to all who wish to edit the file that it is being working on. The menu option for me would change to Check In, providing the ability to check the file in to the directory, allowing others to check out the file and work on it. Notice Version History also exists. This feature allows previous versions of the file to be loaded, which means that changes made to a file can be rolled back.

It’s good that this is available, and I’d make use of it.  For Power BI Desktop, it seems prudent to continue using source control.

Categories

September 2016
MTWTFSS
« Aug Oct »
 1234
567891011
12131415161718
19202122232425
2627282930