Click through for an example.
DIM AND GLIMPSE
Next, we will run the dim function which displays the dimensions of the table. The output takes the form of row, column.
And then we run the glimpse function from the dplyr package. This will display a vertical preview of the dataset. It allows us to easily preview data type and sample data.
Spending some quality time doing EDA can save you in the long run, as it can help you get a feel for things like data quality, the distributions of variables, and completeness of data.
We recently implemented a Spark streaming application, which consumes data from from multiple Kafka topics. The data consumed from Kafka comprises different types of telemetry events generated by mobile devices. We decided to host the Spark cluster using the Amazon EMR service, which manages a fleet of EC2 instances to run our data-processing pipelines.
As part of preparing the cluster and application for deployment to production, we needed to implement monitoring so we could track the streaming application and the Spark infrastructure itself. At a high level, we wanted ensure that we could monitor the different components of the application, understand performance parameters, and get alerted when things go wrong.
In this post, we’ll walk through how we aggregated relevant metrics in Datadog from our Spark streaming application running on a YARN cluster in EMR.
Check it out. If this is interesting, Priya’s blog has the full series.
Pivot was first introduced in Apache Spark 1.6 as a new DataFrame feature that allows users to rotate a table-valued expression by turning the unique values from one column into individual columns.
The upcoming Apache Spark 2.4 release extends this powerful functionality of pivoting data to our SQL users as well. In this blog, using temperatures recordings in Seattle, we’ll show how we can use this common SQL Pivot feature to achieve complex data transformations.
The syntax is quite similar to the
PIVOT syntax that SQL Server uses.
The Forecast measure in the demo model is quite an advanced piece of DAX code that would require a full article by itself. The curious reader will find more information on how to reallocate budget at different granularities in the video Budgeting with Power BI. In this article, we use the Forecast measure without detailed explanations; our goal is to explain how to compute the next measure: Remaining Forecast.
The Remaining Forecast measure must analyze the Sales table, finding the last day for which there are sales, and only then computing the forecasts.
Read the whole thing.
Since we are going to try algorithms like Logistic Regression, we will have to convert the categorical variables in the dataset into numeric variables. There are 2 ways we can do this.
- Category Indexing
- One-Hot Encoding
Click through for the code and explanation.
Generating test data to fill the development database tables can also be performed easily and without wasting time for writing scripts for each data type or using third party tools. You can find various tools in the market that can be used to generate testing data. One of these wonderful tools is the dbForge Data Generator for SQL Server . It is a powerful GUI tool for a fast generation of meaningful test data for the development databases. dbForge data generation tool includes 200+ predefined data generators with sensible configuration options that allow you to emulate column-intelligent random data. The tool also allows generating demo data for SQL Server databases already filled with data and creating your own custom test data generators. dbForge Data Generator for SQL Server can save your time and effort spent on demo data generation by populating SQL Server tables with millions of rows of sample data that look just like real data. dbForge Data Generator for SQL Server helps to populate tables with most frequently used data types such as Basic, Business, Health, IT, Location, Payment and Person data types.
I have a love-hate relationship with test data generation tools, as they tend not to create reasonable data, where reasonable is a combination of domain (hi, birth date in the early 1800s!) and distribution.
The command is simple. You can get or set the property using the value for given property path. As usual – when you get the value, you leave the value blank. The path – well – is the path to the element in the package or the project. You use backslashes to separate elements in the package tree, and at the end, you use
.Properties[PropertyName]to read the property. If you use the elements collection – like connection managers – you can pick a single element using square brackets and the name of this element.
Read on for more, including limitations and useful testing scenarios.
Let’s start with the positive.
Snowflake is a really scalable database. Storage is virtually limitless, since the data is stored on blob storage (S3 on AWS and Blob Storage on Azure). The compute layer (called warehouses) is completely separated from the storage layer and you can scale it independently from storage.
It is really easy to use. This is one of Snowflake’s core goals: make it easy to use for everyone. Most of the technical aspects (clustering, storage etc) are hidden from the user. If you thought SQL Server is easy with it’s “next-next-finish” installation, you’ll be blown away by Snowflake. I really like this aspect, since you have really powerful data warehousing at your finger tips, and the only thing you have to worry about is how to get your data into it. With Azure SQL DW for example, you have to about distribution of the data, how you are going to set things up etc. Not here.
It’s not all positive, but Koen seems quite happy to work with the product.
SMO connects to SQL Server using the ADO.NET SQLClient library which has 13+ years of features which help mask the passwords you pass in for SQL Authentication. SMO bypasses some of those features to often leak the passwords in clear-text.
- Even where it would normally be hidden.
- Even where you use
Persist Security Infointroduced in 2005.
- Even where you use
System.Security.SecureStringintroduced in 2012.
- Though thankfully not where you use
System.Data.SqlClient.SqlCredentialalso introduced in 2012. However… there’s some caveats here too.
We’ll prove it through repeatable tests that can be used to track if Microsoft fix the problem or not.
Read the whole thing.