In the case of telco customer churn, we collected a combination of the call detail record data and customer profile data from a mobile carrier, and then followed the data science process — data exploration and visualization, data pre-processing and feature engineering, model training, scoring and evaluation — in order to achieve the churn prediction. With a churn indicator in the dataset taking value 1 when the customer is churned and taking value 0 when the customer is non-churned, we addressed the problem as a binary classification problem and tried varioustree-based models along with methods like bagging, random forests and boosting. Because the number of churned customers is much less than that of non-churned customers (making the data set quite unbalanced), SMOTE (Synthetic Minority Oversampling Technique) was applied to adjust the proportion of majority class over minority class in the training data set, thus further improving model performance, especially precision and recall.
All the above data science procedures could be implemented with base R. Rather than moving the data out from the database to an external machine running R, we instead run R scripts directly on SQL Server data by leveraging the in-database analytics capability provided by SQL Server R Services, taking advantage of the rich and powerful CRAN R packages plus the parallel external memory algorithms in the RevoScaleR library. In what follows, we will describe the specific R packages and algorithms that we used to implement the data science solution for predicting telco customer churn.
They have provided the relevant materials in GitHub as well.
A colleague and I were hoping to review SSIS perfmon counters on a VM. We use a logman command with a counters file to log perfmon to csv.
Opened up the csv that was captured on the VM… there were all of my typical SQL Server counters… but the following SSIS counters were missing.
\SQLServer:SSIS Service\SSIS Package Instances
\SQLServer:SSIS Pipeline\Buffer memory
\SQLServer:SSIS Pipeline\Buffers in use
\SQLServer:SSIS Pipeline\Buffers spooled
\SQLServer:SSIS Pipeline\Flat buffer memory
\SQLServer:SSIS Pipeline\Flat buffers in use
\SQLServer:SSIS Pipeline\Private buffer memory
\SQLServer:SSIS Pipeline\Private buffers in use
\SQLServer:SSIS Pipeline\Rows read
\SQLServer:SSIS Pipeline\Rows written
The more you know.
First things first the DMA is a replacement of the Upgrade Adviser. In fact it’s an upgrade of the Upgrade Adviser. It has some amazing new features.
You can install this on a workstation. It doesn’t have to be installed on the server itself.
You can have multiple projects saved with data from multiple server/instances.
There is the option to get compatibility issues and/or new feature recommendations.
You can check issues/feature recommendations for upgrades to 2012, 2014 or 2016.
Looks like it’s a step up from the old Upgrade Advisor.
There is a total of “N” linked servers. The query selects from each one and combines the results via UNION ALL. Here’s a sample of the output on my CMS:What we’ve seen so far could be used as a template for other queries, such as a backup report, a database files report, etc. The code could be put in a stored procedure and run as needed for automated tasks. I’ve used similar code to retrieve data for an SSRS report.
Let’s examine a problem you’re likely to encounter by including @@SERVERNAME within the query
Give it a read.
I’ve talked about this in my live sessions, but this is an extreme case that really happened – a team took over a week to fix a bug in a stored procedure, and the delay was caused solely by poor naming standards. What happened was that the application was calling
dbo.Customer_Update, but the team was hunting for the bug in a different procedure,
dbo.Update_Customer. While there was no formal convention in place, the real problem was inconsistency – a consultant charged with writing a different application didn’t check for an existing procedure, she just looked for
dbo.Update_Customerin the list; when she didn’t find it, she wrote her own. The bug itself wasn’t crucial, but that lost time can never be recovered.
I’ll repeat again that the convention you choose is largely irrelevant, as long as it makes sense to you and your team, and you all agree on it – and abide by it. But I am asked frequently for advice on naming conventions, and for things like tables, I’m not going to get into religious arguments about plural vs. singular, the dreaded
tblprefix, or going to great lengths to avoid vowels. But I think I have a pretty sensible standard for stored procedures, and I am always happy to share my biases even though I know not everyone will agree with them. Again, I touched on this in my earlier post, but sometimes these things bear repeating and a little elaboration.
I agree with this: pick a standard and stick to it.
When I wanted to research memory problem on a server and started to dig deeper into “DBCC MEMORYSTATUS” command.Very useful links to understand that command were from Microsoft:
During the research I’ve faced two problems:
1. I had to wait several seconds to get the full result set.
2. I had to scroll down and shuffle 115 different data sets to find the counter I want.
To eliminate both these problem all these different counters have to be in the same table/data set.
That will make research easier and data can be stored or compared with the base line.
Read on for the T-SQL script.
As displayed in above screen shots, the single Azure Standard Storage VHD gives you (as promised) about 500 IOPS. Striping eight (8) of those, will roughly give you eight (8) times the IOPS, but not same magnitude of [MB/s] apparently. Still, the setup is better off, after, rather than before!
Do mind, that there are three levels of storage performance; P10, P20 and P30. For more information, read this.
I did this recently and can confirm that there’s a huge difference between using one virtual disk versus even three or four, and Windows Storage Spaces makes it easy to expose them as one combined mount point.
Reduce time to troubleshoot problems. Apache Hadoop components create a lot of log data. Accessing that log data to understand what the component is telling you, especially when issues arise, is critical. Apache Ambari includes a new Log Search service that provides agents for log collection and a delivers a custom UI for searching those logs. This is essential to providing a streamlined approach to searching for stack traces and exceptions across all nodes in the cluster.
I have enjoyed watching Ambari mature as a product.
I panicked. In this situation I would normally pull the database out of the AG and then re-add it. I didn’t have that option because it is a HUGE database and didn’t have that much time and space to move it around. I knew a large transaction had kicked off (thank you alert email that I created to warn me about such things) but thought that surely the rollback would have cleared quickly. That lead me to looking for rolling back transactions.
Fortunately, the issue was on a secondary not under heavy use, so she was able to recover in time.
Before Flink, users of stream processing frameworks had to make hard choices and trade off either latency, throughput, or result accuracy. Flink was the first open source framework (and still the only one), that has been demonstrated to deliver (1) throughput in the order oftens of millions of events per second in moderate clusters, (2) sub-second latency that can be as low as few 10s of milliseconds, (3) guaranteed exactly once semantics for application state, as well as exactly once end-to-end delivery with supported sources and sinks (e.g., pipelines from Kafka to Flink to HDFS or Cassandra), and (4) accurate results in the presence of out of order data arrival through its support for event time. Flink is based on a cluster architecture with master and worker nodes. Flink clusters are highly available, and can be deployed standalone or with resource managers such as YARN and Mesos. This architecture is what allows Flink to use a lightweight checkpointing mechanism to guarantee exactly-once results in the case of failures, as well allow easy and correct re-processing via savepoints without sacrificing latency or throughput. Finally, Flink is also a full-fledged batch processing framework, and, in addition to its DataStream and DataSet APIs (for stream and batch processing respectively), offers a variety of higher-level APIs and libraries, such as CEP (for Complex Event Processing), SQL and Table (for structured streams and tables), FlinkML (for Machine Learning), and Gelly (for graph processing). Flink has been proven to run very robustly in production at very large scale by several companies, powering applications that are used every day by end customers.
The upshot is that the two products don’t do exactly the same thing, and there might be room in your organization for the two of them.