Press "Enter" to skip to content

Day: February 16, 2017

Twitter Sentiment Analysis Using doc2vec

Sergey Bryl uses word2vec and doc2vec to perform Twitter sentiment analysis in R:

But doc2vec is a deep learning algorithm that draws context from phrases. It’s currently one of the best ways of sentiment classification for movie reviews. You can use the following method to analyze feedbacks, reviews, comments, and so on. And you can expect better results comparing to tweets analysis because they usually include lots of misspelling.

We’ll use tweets for this example because it’s pretty easy to get them via Twitter API. We only need to create an app on https://dev.twitter.com (My apps menu) and find an API Key, API secret, Access Token and Access Token Secret on Keys and Access Tokens menu tab.

Click through for more details, including code samples.

Comments closed

Plan Comparison

Max Vernon looks at plan comparison in SQL Server Management Studio:

The table has 10,000,000 rows. I’ve create a non-clustered columnstore index on the table, which I’ll talk about in a future post. I’ve included it here because it provides a succinct difference in the two plans.

To compare the plans visually, side-by-side, you need to save the first plan by right-clicking on the plan window, clicking “Save Execution Plan As…”, and specifying a filename. Next, right-click on the plan window, and choose “Compare Showplan”:

I’ve only used this once or twice, but it is an interesting feature.

Comments closed

Hadoop In The Cloud

Peter Coates talks about pros and cons to Hadoop in the cloud:

Hadoop was developed for deployment over Linux running on bare metal. Cloud deployment implies virtual machines, and for Hadoop it’s a huge difference.

As detailed in other articles (for instance, Your Cluster Is an Appliance or Understanding Hadoop Hardware Requirements), bare-metal deployments have an inherent advantage over virtual machine deployments. The biggest of these is that they can use direct attached storage, i.e., local disks.

Not every Hadoop workload is storage I/O bound, but most are, and even when Hadoop seems to be CPU bound, much of the CPU activity is often either directly in service of I/O, i.e., marshaling, unmarshaling, compression, etc., or in service of avoiding I/O, i.e., building in-memory tables for map-side joins.

Read the whole thing.

Comments closed

String Splitting And Concatenation

Aaron Bertrand and Steve Hughes talk about string splitting, and Aaron also discusses string concatenation.  First Aaron:

That may not look like a massive simplification, but don’t forget about all the logic buried behind the table-valued function in the first example. And if you’re like several shops I know, if you look across your codebase and see all the messy uses you have for either of these methods, the benefits should be even more clear – and testing should bear that the performance savings compared to traditional, expensive methods are the sweetest part of the deal.

And Steve:

The STRING_SPLIT function will return a single column result set. The column name is “value”. The datatype will be NVARCHAR for strings that are NCHAR or NVARCHAR. VARCHAR is used for strings that are CHAR or VARCHAR types.

These two functions are small, but come in handy quite frequently.

Comments closed

Galaxy Classification With SQL Server

David Smith points out a nice Microsoft demo for classifying galaxies using SQL Server:

The SQL Server Blog has since published a step-by-step tutorial on implementing the galaxy classifier in SQL Server (and the code is also available on GitHub). This updated version of the demo uses the new MicrosoftML package in Microsoft R Server 9, and specifically the rxNeuralNet function for deep neural networks. The tutorial recommends using the Azure NC class of virtual machines, to take advantage of the GPU-accelerated capabilities of the function, and provides details on using the SQL Server interfaces to train the neural netowrk and run predictions (classifications) on the image database. For the details, follow the link below.

If you’re going to get into SQL Server R Services at any level of seriousness, I highly recommend R Tools for Visual Studio, as it will make building those external stored procedure calls much easier.

Comments closed

AT TIME ZONE And Reports

Rob Farley shows how to use AT TIME ZONE without sacrificing performance:

Because how am I supposed to know whether a particular date was before daylight saving started or after? I might know that an incident occurred at 6:30am in UTC, but is that 4:30pm in Melbourne or 5:30pm? Obviously I can consider which month it’s in, because I know that Melbourne observes daylight saving time from the first Sunday in October to the first Sunday in April, but then if there are customers in Brisbane, and Auckland, and Los Angeles, and Phoenix, and various places within Indiana, things get a lot more complicated.

To get around this, there were very few time zones in which SLAs could be defined for that company. It was just considered too hard to cater for more than that. A report could then be customised to say “Consider that on a particular date the time zone changed from X to Y”. It felt messy, but it worked. There was no need for anything to look up the Windows registry, and it basically just worked.

But these days, I would’ve done it differently.

Now, I would’ve used AT TIME ZONE.

Read on for the scenario.

Comments closed

Instant File Initialization In DMVs

Rodney Landrum shows off a couple new columns in SQL Server 2016 SP1 DMVs:

Microsoft announced many new features in SQL Server 2016 SP1 and the fanfare was mostly centered around the Enterprise features now available in SQL Server 2016 Standard Edition.  Many may have missed some hidden gems in the announcement.  Two of these are columns added to the existing DMVs, sys.dm_server_services and sys.dm_os_sys_info. The columns provide information for two specific features that previously had to be gathered by opening gpedit.msc and/or scrolling through SQL error logs. I am referring to Lock Pages in Memory and Instant File Initialization (enabled via Perform Volume Maintenance Tasks privilege).

It is now possible to simply query the DMVs to determine if these are being used for the running SQL Server instance.

Click through for the details.

Comments closed

Session Context

Ewald Cress looks at SESSION_CONTEXT() as a replacement for CONTEXT_INFO():

SESSION_CONTEXT() brings two major innovations. Firstly, it replaces a 128-byte scalar payload with a key-value structure that can accommodate 256kB of data. You can really go to town filling this thing up.

The second change is less glamorous, but possibly more significant: it is possible to set an entry to read-only, meaning that it can safely be used for the kind of contextual payload you don’t want tampered with. This makes me happy, not because I currently have a great need for it, but because it neatly ties in with things I have been thinking about a lot lately.

Read on for more.

Comments closed