Overview of Spark Streaming.
Fault-tolerance Semantics & Performance Tuning.
Spark Streaming Integration with Kafka.
Click through for the slide deck. Combine that with the AWS blog post on the same topic and you get a pretty good intro.
Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more. Highlights include:
Interactively manipulate Spark data using both dplyr and SQL (via DBI).
Filter and aggregate Spark datasets then bring them into R for analysis and visualization.
Create extensions that call the full Spark API and provide interfaces to Spark packages.
Integrated support for establishing Spark connections and browsing Spark DataFrames within the RStudio IDE.
So what’s the difference between sparklyr and SparkR?
@zedoring sparkR is “inspired by dplyr” and distributed with Spark, sparklyr is a proper dplyr back-end which will be on CRAN.
— Jeff Allen (@TrestleJeff) June 28, 2016
This might be the package I’ve been awaiting.
This has been something I've wanted to investigate for a while now. I've known you could use Profiler and set up server-side traces to capture long-running events, but I was curious how to do the same with Extended Events. I then came across this post from Pinal Dave ( b | t ) that pointed me in the right direction. I followed his guidelines but had trouble finding the "Duration" filter. It turns out I had too much selected in my filtering options (or perhaps the Wizard was giving me fits); once I selected just the Batch Completed or RPC Completed events, I could see and set the Duration filter. The one change I'd make to Dave's script is to set the duration to 5,000,000, because Duration in SQL Server 2012 is measured in microseconds, not milliseconds, and I want to start with queries running longer than 5 seconds.
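The milliseconds-to-microseconds change is exactly the kind of thing that produces a filter off by a factor of 1,000. A trivial sanity check on the arithmetic (plain Python, just to illustrate the conversion; the function name is ours, not anything from the Extended Events tooling):

```python
def duration_filter_microseconds(seconds):
    """Convert a threshold in seconds to the microsecond value that
    the Duration field uses in SQL Server 2012+ Extended Events."""
    return int(seconds * 1_000_000)

# Five seconds becomes 5,000,000 microseconds -- not 5,000, which is
# what you would pass if Duration were still measured in milliseconds.
print(duration_filter_microseconds(5))    # 5000000
print(duration_filter_microseconds(0.5))  # 500000
```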
Click through for the script.
Stream processing walkthrough
The entire pattern can be implemented in a few simple steps:
Set up Kafka on AWS.
Spin up an EMR 5.0 cluster with Hadoop, Hive, and Spark.
Create a Kafka topic.
Run the Spark Streaming app to process clickstream events.
Use the Kafka producer app to publish clickstream events into the Kafka topic.
Explore clickstream event data with Spark SQL.
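The heart of the "process clickstream events" step is a map over each Kafka record. A minimal sketch of that per-event parsing in Python; the field names (user_id, url, timestamp) are assumptions for illustration, not the walkthrough's actual schema:

```python
import json

def parse_clickstream_event(raw):
    """Parse one raw Kafka record (a JSON string) into a flat dict.
    Returns None for malformed records so the stream can filter them out."""
    try:
        event = json.loads(raw)
        return {
            "user_id": event["user_id"],
            "url": event["url"],
            "timestamp": event["timestamp"],
        }
    except (ValueError, KeyError):
        return None

# In the Spark Streaming job, this would be mapped over each record, e.g.:
#   events = kafka_stream.map(lambda kv: kv[1]) \
#                        .map(parse_clickstream_event) \
#                        .filter(lambda e: e is not None)
```

Dropping malformed records instead of raising keeps one bad message from killing the whole streaming batch.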
This is a pretty easy-to-follow walkthrough with some good tips at the end.
Why Model Outside Azure ML?
Sometimes you run into limitations around speed or data size, or perhaps you just iterate better on your own workstation. I find myself significantly faster running experiments on my workstation, or in a Jupyter notebook that lives on a big ol' server. Modelling outside Azure ML allows me to use the full capabilities of whatever infrastructure and framework I want for training.
So Why Operationalize with Azure ML?
Azure ML has several benefits: auto-scaling, token generation, high-speed Python execution modules, API versioning, sharing, and tight PaaS integration with services like Stream Analytics, among many other things. This really does make life easier for me. Sure, I could deploy a Flask app via Docker somewhere, but then I'd need to worry about load balancing and security, and I really just don't want to do that. I want to build a model, deploy it, and move on to the next one. My value is AI, not web management, so the more time I spend delivering that value, the more impactful I can be.
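The train-anywhere, operationalize-in-Azure-ML split hinges on keeping the deployed piece tiny: a load step that runs once and a scoring step that runs per request. A stdlib-only sketch of that pattern; the init()/run() names mirror Azure ML's scoring-script convention, but the toy pickled "model" and the path are assumptions, not any specific SDK:

```python
import json
import pickle

MODEL_PATH = "model.pkl"  # assumed path; in practice produced by local training
_model = None

def train_and_save():
    """Stand-in for training done locally, outside Azure ML:
    persist a toy linear model (weights + bias) with pickle."""
    model = {"weights": [0.5, 0.25], "bias": 1.0}
    with open(MODEL_PATH, "wb") as f:
        pickle.dump(model, f)

def init():
    """Called once when the web service starts: load the model."""
    global _model
    with open(MODEL_PATH, "rb") as f:
        _model = pickle.load(f)

def run(raw_request):
    """Called per request: score a JSON payload of features."""
    features = json.loads(raw_request)["features"]
    score = _model["bias"] + sum(w * x for w, x in zip(_model["weights"], features))
    return json.dumps({"score": score})

train_and_save()
init()
print(run('{"features": [2.0, 4.0]}'))  # {"score": 3.0}
```

The point is the split: training happens wherever iteration is fastest, and only the lightweight load/score pair gets deployed and scaled.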
Read the whole thing.
I’m sure many people are experimenting with VMs and SQL Server. If you’re like me, many of you just default to installing Windows 7/10 or Windows Server xx Standard for your testing. Those systems work fine, but I’ve been trying to build slimmer systems, which means looking at Server Core. Installing Server Core is much the same as other versions, though you end up with only a command line. If you’re like me, using VMWare, you also might end up with a server name like “WIN-LKR3R4FfL5T”.
I want to change that. It's a fine name if I'm working locally, but it's not much fun to connect to across a network. This post looks at how to rename that machine.
This is probably a good idea to do before installing any major software. Renaming a server under SQL Server is possible, but there are a few extra steps to the process.
After installing, we need to customize the settings by creating connection(s) to our SQL Server. We do this by opening VS Code's "User Preferences" and, under "Default Settings.json", searching for the "vscode-mssql" settings, which we then copy over to our working folder's "settings.json" file.
I played with this very early on and would like to see it continue to be developed, but it’s no replacement for Management Studio.
Git is a version control system (VCS), which is just what it sounds like: a system to help keep track of different versions of software. Git isn’t the only VCS out there (others include CVS, SVN, and Fossil), but it is one of the more popular systems, particularly for open source projects. You’ve certainly used software that was developed using Git (Firefox and Chrome are two big ones!).
Version control is really helpful when you are working with other people. Without version control, if I send you a file I’m working on and you make changes to it, we would suddenly have two versions. If I integrate your changes into my file, then we’d only have one file but no history! Even when working alone, version control is really helpful for us to keep track of how the project is moving along.
Understanding at least one source control platform is vital for software development. Git can be like pulling teeth (and then there are the times when it gets really painful), but if you are developing software (even personal scripts!) and don’t have source control in place, you’re walking a tightrope without a net.
This is a quick post to share a script that allows spinlock statistics to be captured for a defined period of time (as I need to reference it in my next post). Enjoy!
Click through if you don’t already know the correct DMV to use.
Correlated Datetime Columns works. Clearly it's not something you're going to enable on all your databases. Most of your databases probably don't have clustered indexes on datetime columns, let alone enough tables with correlation between the data stored in them. However, when you do have that type of data correlation, enabling Correlated Datetime Columns and ensuring you have a clustered index on the datetime column is a viable tuning mechanism. Further, this mechanism has been around since 2005. Just so you know, I did all my testing in SQL Server 2016, so this is something anyone in the right situation can take advantage of. Just remember that TANSTAAFL always applies. The statistics needed for Correlated Datetime Columns are maintained through materialized views that are created automatically by the optimization process. You can see the views in SSMS and in any queries against the objects, and you'll need to take them into account during your statistics maintenance. However, if Correlated Datetime Columns is something you need, it's really going to help with this fairly narrow aspect of query tuning.
I don’t know that I’ll ever do this, but it’s worth filing away just in case.