Fortunately, Structured Streaming makes it easy to convert these periodic batch jobs to a real-time data pipeline. Streaming jobs are expressed using the same APIs as batch data. Additionally, the engine provides the same fault-tolerance and data consistency guarantees as periodic batch jobs, while providing much lower end-to-end latency.
In the rest of post, we dive into the details of how we transform AWS CloudTrail audit logs into an efficient, partitioned, parquet data warehouse. AWS CloudTrail allows us to track all actions performed in a variety of AWS accounts, by delivering gzipped JSON logs files to a S3 bucket. These files enable a variety of business and mission critical intelligence, such as cost attribution and security monitoring. However, in their original form, they are very costly to query, even with the capabilities of Apache Spark. To enable rapid insight, we run a Continuous Application that transforms the raw JSON logs files into an optimized Parquet table. Let’s dive in and look at how to write this pipeline. If you want to see the full code, here are the Scala and Python notebooks. Import them into Databricks and run them yourselves.
This introductory post discusses some of the architecture and setup, and they promise additional posts getting into finer details.
If I’m looking in SQL Server’s Execution Plan Cache, I like to use the sys.dm_exec_text_query_plan dynamic management view. This stores those XML query plans as text.
I learned about using this DMV from Grant Fritchey in his post, “Querying the Plan Cache, Simplified.” Grant points out that while doing wildcard searches in the text version of a query plan isn’t fast, querying it as XML is often even slower.
This is definitely worth a read.
There’s a gray bar across the top that says, “This site is currently in read-only mode; we’ll return with full functionality soon.”
That’s not a hidden feature of Always On Availability Groups. Rather, it’s a hidden feature of really dedicated developers whose application:
Tries to connect to the SQL Server, and if that fails, shows a total-site-down page
If it’s not writeable, shows a polite banner across the top of the site, and still gracefully runs readable queries only
This is where a bit of foresight and hard work can really pay off. Read the whole thing.
Today early adopters of Amazon Athena are using it for big data analytics pipeline projects along with Kinesis streaming data and other Amazon data sources.
Athena is serverless parallel query pay-per-use service. There is no infrastructure to set up or manage. It scales automatically and can handle large datasets or complex distributed queries.
The easy way of thinking about Athena is that it’s ElasticMapReduce (a pay-as-you-go Hadoop cluster) without the ceremony of administering or spinning up the cluster.
The package that we are going to use to develop our custom visualization is ggplot2. The ggplot2 package is arguably the most popular data visualization package in R. It is based on the “grammar of graphics” concept that was created by the statistician, Leland Wilkinson. The ggplot2 package allows you to approach creating charts and graphs in the same manner that Bob Ross approached painting trees in the forest. With ggplot2 you are able to start with a blank canvas and add layers upon layers via short code snippets that builds on each other until you end up with the desired visualization.
The pbix file that is being used in this blog can be found here: http://bit.ly/2jwoCyP. The GentleIntroToR_ChartExample.pbix file contains an example of using R to create a box plot chart that shows the distribution of player scores for the L.A. Lakers. Chiclet slicers were added that allows you to filter by division and/or opponent. The R visualization was created in four steps.
Check out the PBIX file.
In addition, to seeing where the high and low distribution of values are you will also find the Box and Whisker will provide:
A count of values
Range of values
A simple box and whisker chart goes a long way.
What I’m going to do is setup two instances of SQL Server running on linux and log ship one database from one to another. So the first thing I did was get two VMs running Ubuntu 16.04.1 LTS which can be download from here.
Once both servers were setup (remember to enable ssh) I then went about getting SQL setup, I’m not going to go through the install in this post as the process is documented fully here. Don’t forget to also install the SQL Tools, full guide is here.
Read on for the guide, but also be sure to read his disclaimer.
Lets look at some examples, I’ll be using tables from the new sample database for SQL Server 2016 WorldWideImporters which you can download from this link
I’m of two minds with JSON support: I think it’s very useful for building output sets for service calls and might be fine for inputs when you can’t use a table-valued parameter for some reason, but if you’re doing a lot of JSON splitting of data in a table, that’s a violation of first normal form.
I recently wrote a blog about how to tell why your SQL login isn’t working. There were a lot of good comments and several of them suggested that I create a connect entry to make the error easier to understand. There was also a question of how to create a connect entry, and it wasn’t the first time I’d heard that, this week. So I’m going to give a quick demo on how to create a connect entry. I’m not going to create it using that particular login error because, as another person pointed out, this error is actually by design. We don’t want to make it easier for someone trying to hack in right?