Press "Enter" to skip to content

Author: Kevin Feasel

HDP 2.5 Sandbox

The Hortonworks Data Platform 2.5 sandbox is now available:

GET STARTED IN 4 STEPS

1) Download the HDP Sandbox as a VM image (VMware, VirtualBox, or Docker).
2) Set up and start the VM image.
3) Try a Sandbox tutorial: check out the list of free tutorials, or jump directly into a Hello to HDP hands-on tutorial.
4) Need more help? Visit the Hortonworks Community Connection (HCC) and interact directly with the community and our development team.

It looks like they’ve bumped up the RAM requirements to 8 GB and have added new tutorials.

Comments closed

Specifying Columns In Entity Framework

Richie Rump shows how to limit the number of columns returned in an Entity Framework query:

This one’s a bit more tricky, but let’s walk through it. We’re getting data from the Posts table where the Tags column equals “<sql-server>” and selecting every column from both the Posts and PostTags tables. We can tell because there are no specified properties in the Select. Even though this statement looks more complex, it’s only three lines and looks somewhat like a SQL statement. But it’s really a LINQ (Language Integrated Query) statement, specifically a LINQ to Entities statement. This LINQ statement will be translated into this SQL statement:

Read the whole thing.
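
To make the difference concrete, here is a hedged T-SQL illustration of the two shapes of generated SQL. This is my own sketch against a Stack Overflow-style Posts/PostTags schema, not the exact SQL from Richie's post, and the table and column names are hypothetical:

-- Every column from both tables comes back when nothing is projected:
SELECT p.*, pt.*
FROM dbo.Posts AS p
INNER JOIN dbo.PostTags AS pt ON pt.PostId = p.Id
WHERE p.Tags = '<sql-server>';

-- Projecting only the properties you need trims the column list:
SELECT p.Id, p.Title, pt.TagName
FROM dbo.Posts AS p
INNER JOIN dbo.PostTags AS pt ON pt.PostId = p.Id
WHERE p.Tags = '<sql-server>';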

Comments closed

SSAS Timezone Conversions

Meagan Longoria notes that the last processed date is stored in UTC, but the Properties page shows local time:

The datetime returned by this query is in UTC. My query returns 9/19/2016 7:43:03 PM.

If I go into the properties of my SSAS database, I can see this same info, but the timezone conversion has already been done for me (this server is in Central time zone).

I think that, on net, that’s the best way to do it: store everything in UTC and use the presentation layer to convert to local time.
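
As a quick illustration of that conversion step, here is a minimal T-SQL sketch, assuming SQL Server 2016 or later for AT TIME ZONE and reusing the UTC value from Meagan's example:

DECLARE @LastProcessedUtc datetime2 = '2016-09-19 19:43:03';

SELECT @LastProcessedUtc
       AT TIME ZONE 'UTC'
       AT TIME ZONE 'Central Standard Time' AS LastProcessedLocal;

-- The first AT TIME ZONE stamps the value as UTC; the second converts it to Central time.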

Comments closed

SQL Server 2014 SMO And TruncateData

Max Trinidad finds a version inconsistency in SMO:

Hum! I just found out something in SQL Server 2014 (SP2 installed) while migrating from SQL Server 2005: one of my PowerShell scripts (which I’ve been using for a long time) uses SMO to truncate tables, but when running it against a SQL Server 2014 database, I’m getting an error:

“..this property is not available on SQL Server 2014.”

To my surprise, I ran the same PowerShell script against SQL Server 2016 and it works fine.

That seems rather odd.  If this affects you, vote up his UserVoice item.

Comments closed

Changing Identity Start Value

Kenneth Fisher has a good post on what happens when you change the seed value of an identity column:

Well Paul told me this wasn’t the case. Now when Paul tells me something I believe him, but I also like to run tests. So I decided to use sys.fn_PhysLocCracker(%%physloc%%). %%physloc%% returns a varbinary that gives you the location of the row. When passed to sys.fn_PhysLocCracker(%%physloc%%) it returns the database file, page in the file, and slot number where the row can be found. So to start with I create an identity(1,1) and I run 20 inserts, one at a time, checking row locations each time. This is to confirm I’m right about this part.
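
If you want to try the row-location check yourself, here is a minimal sketch of the technique, assuming a hypothetical table dbo.IdentityTest with an identity column:

SELECT plc.file_id, plc.page_id, plc.slot_id, t.*
FROM dbo.IdentityTest AS t
CROSS APPLY sys.fn_PhysLocCracker(%%physloc%%) AS plc
ORDER BY plc.page_id, plc.slot_id;

-- %%physloc%% returns the physical location of each row; fn_PhysLocCracker splits it into file, page, and slot.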

Clicking through is worth it for the hypnotizing animated GIFs.

Comments closed

SSMS Connection Colors

Andrew Pruski shows how to change window bar colors within SQL Server Management Studio:

A simple but effective setting in SQL Server Management Studio is using custom colours to identify which server you are about to execute a query on. It’s simple to set up, but not everyone who uses SSMS is aware of it, so I thought I’d quickly run through the steps here.

This is a nice visual way of figuring out you’re in production before you run that truncate table script.

Comments closed

Azure SQL Data Warehouse Setup

Arun Sirpal configures a new instance of Azure SQL Data Warehouse:

The information shown here is the DSQL (Distributed SQL) plan. When you send a SQL query to SQL Data Warehouse, the Control node processes the query and converts the code to DSQL, then sends the commands to run on each of the compute nodes.

The returned query plan depicts sequential SQL statements; when the query runs it may involve parallelized operations, so some of the sequential statements shown may run at the same time. More information can be found at the following URL https://msdn.microsoft.com/en-us/library/mt631615.aspx.
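
If you want to see the DSQL plan for a query yourself, you can prefix the statement with EXPLAIN, which returns the distributed plan as XML instead of executing the query. A minimal sketch, assuming a hypothetical dbo.FactSales table:

EXPLAIN
SELECT ProductKey, SUM(SalesAmount) AS TotalSales
FROM dbo.FactSales
GROUP BY ProductKey;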

Arun also looks at running a simple Power BI report off of Azure SQL Data Warehouse; click through for that.

Comments closed

Reporting On Unstructured Data

Jen Underwood discusses a tool which helps provide structure to text documents:

Although there are a variety of ways to extract unstructured data from files, one tried-and-true, fast and simple approach is to use Datawatch Monarch. Years ago I used this tool when building Department of Defense digital contract reporting projects. At that time, the process to define data regions and extract unstructured data required a bit of field mapping experimentation. With the latest version of Monarch Auto Define, that process is intelligently automated today.

Looks like an interesting tool.

Comments closed

One-Sample T Tests

Mala Mahadevan shows how to perform one-sample T Tests:

For this post I decided to go with a simple example of how many steps I walked per day for the month of August. My goal is 10,000 steps per day – that has been my average over the year, but is this true of the data I gathered in August? I have a simple table with two columns – day and steps. Each record has how many steps I took per day in August, for 30 days. So – SELECT AVG(steps) FROM [dbo].[mala-steps] gives me 8262 as my average number of steps per day in August. I want to know if I am consistently underperforming my goal, or if this is a result of my being less active in August alone. Let me state my problem first – or state what is called the ‘null hypothesis’:

I walk 10,000 steps per day, on average, over the year.
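
For a sense of the arithmetic behind the test, here is a minimal T-SQL sketch of the one-sample t statistic. The table and column names mirror Mala's example, but the query itself is my own sketch rather than her code:

DECLARE @mu0 decimal(10,2) = 10000;  -- hypothesized mean: the 10,000-step goal

SELECT AVG(1.0 * steps) AS sample_mean,
       STDEV(steps) AS sample_stdev,
       COUNT(*) AS n,
       (AVG(1.0 * steps) - @mu0) / (STDEV(steps) / SQRT(COUNT(*))) AS t_statistic
FROM dbo.[mala-steps];

-- With 30 daily observations there are 29 degrees of freedom; compare t_statistic against the t distribution to get a p-value.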

Read on for T test operations in T-SQL (although not all operations are available) and R.

Comments closed

Clickstream Anomaly Detection

Chris Marshall shows how to perform anomaly detection using AWS Kinesis Analytics:

The RANDOM_CUT_FOREST function greatly simplifies the programming required for anomaly detection.  However, understanding your data domain is paramount when performing data analytics.  The RANDOM_CUT_FOREST function is a tool for data scientists, not a replacement for them.  Knowing whether your data is logarithmic, circadian rhythmic, linear, etc. will provide the insights necessary to select the right parameters for RANDOM_CUT_FOREST.  For more information about parameters, see the RANDOM_CUT_FOREST Function.

Fortunately, the default values work in a wide variety of cases. In this case, use the default values for all but the subSampleSize parameter.  Typically, you would use a larger sample size to increase the pool of random samples used to calculate the anomaly score; for this post, use 12 samples so as to start evaluating the anomaly scores sooner.

Your SQL query outputs one record every ten seconds from the tumbling window so you’ll have enough evaluation values after two minutes to start calculating the anomaly score.  You are also using a cutoff value where records are only output to “DESTINATION_SQL_STREAM” if the anomaly score from the function is greater than 2 using the WHERE clause. To help visualize the cutoff point, here are the data points from a few runs through the pipeline using the sample Python script:
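
For a sense of what that looks like in Kinesis Analytics SQL, here is a hedged sketch. The stream and column names are hypothetical, and the parameter order for RANDOM_CUT_FOREST should be checked against the AWS documentation rather than taken from this example:

CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (event_count INTEGER, anomaly_score DOUBLE);

CREATE OR REPLACE PUMP "OUTPUT_PUMP" AS
INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM event_count, anomaly_score
FROM TABLE(
    RANDOM_CUT_FOREST(
        CURSOR(SELECT STREAM event_count FROM "AGGREGATE_STREAM"),
        100,     -- number of trees (default)
        12,      -- subSampleSize, lowered as in the post so scoring starts sooner
        100000,  -- time decay (default)
        1        -- shingle size (default)
    )
)
WHERE anomaly_score > 2;  -- the cutoff described in the post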

This kind of scenario is pretty cool—you could also do things like detecting service outages in streams (fewer than X events in a window, where X is some very small number relative to your overall data) or changes in advertising campaigns.

Comments closed