Lambda architecture – developed by Nathan Marz – provides a clear set of architecture principles that allow both batch and real-time or stream data processing to work together while building immutability and recomputation into the system. Batch processing handles high volumes of data where a group of transactions is collected over a period of time. Data is collected, entered, and processed, and then batch results are produced. Batch processing requires separate programs for input, process and output. Payroll and billing systems are examples. In contrast, real-time data processing involves a continual input, process and output of data. Data must be processed in a small time period (or near real-time). Customer services and bank ATMs are examples.
Lambda architecture has three (3) layers:
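As a rough illustration only, assuming the usual batch, speed, and serving layers from Marz's description, the toy sketch below counts page views. Every name in it (PageViewCounts, run_batch, and so on) is hypothetical, and a real system would use separate storage and processing engines rather than one in-memory class.

```python
from collections import Counter


class PageViewCounts:
    """Toy Lambda-style store; all names here are hypothetical."""

    def __init__(self):
        self.master_log = []         # immutable, append-only master dataset
        self.batch_view = Counter()  # precomputed by the batch layer
        self.speed_view = Counter()  # incremental view of events since the last batch run

    def append(self, page):
        # New data is appended to the log and folded into the speed view right away.
        self.master_log.append(page)
        self.speed_view[page] += 1

    def run_batch(self):
        # Batch layer: recompute the whole view from the raw log.
        self.batch_view = Counter(self.master_log)
        # Those events are now covered by the batch view, so the speed view resets.
        self.speed_view = Counter()

    def query(self, page):
        # Serving layer: merge the batch and real-time results at query time.
        return self.batch_view[page] + self.speed_view[page]
```

Because the batch view is always rebuilt from the untouched log, a bug in the counting logic can be fixed and the view recomputed, which is the recomputation property mentioned above.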
I haven’t heard much about the Lambda and Kappa architectures lately, so when I saw this, I figured it was time for a refresher.
At the EARL conference in San Francisco this week, JS Tan from Microsoft gave an update (PDF slides here) on the doAzureParallel package. As we’ve noted here before, this package allows you to easily distribute parallel R computations to an Azure cluster. The package was recently updated to support using automatically-scaling Azure Batch clusters with low-priority nodes, which can be used at a discount of up to 80% compared to the price of regular high-availability VMs.
— David Smith (@revodavid) June 7, 2017
That lowers the barrier to usage significantly, so it’s a very welcome update.
In my head I’m thinking of all the things that I can do to try and troubleshoot this problem.
- Extended Events my session,
- Ask my Senior DBA,
Then I realize that I’m jumping the gun again and I slow down, and check the first error message again. This time without the developers shouting in my ear about permissions.
This is a great example of why it’s important to troubleshoot using a methodical, logical process. If you get it stuck in your head that the answer is quite obviously something, you lose a bunch of time if it turns out that it isn’t quite as obvious.
We’ve learned a lot by working with customers using SparkML, both internal and external to Microsoft. Customers have found Spark to be a powerful platform for building scalable ML models. However, they struggle with low-level APIs, for example to index strings, assemble feature vectors and coerce data into a layout expected by machine learning algorithms. Microsoft Machine Learning for Apache Spark (MMLSpark) simplifies many of these common tasks for building models in PySpark, making you more productive and letting you focus on the data science.
The library provides simplified consistent APIs for handling different types of data such as text or categoricals. Consider, for example, a DataFrame that contains strings and numeric values from the Adult Census Income dataset, where “income” is the prediction target.
It’s an open source project as well, so that barrier to entry is lowered significantly.
With a variance score of 0.43, linear regression did not do a good job overall. When the x values are close to 0, linear regression gives a good estimate of y, but as we near the end of the x values, the predicted y is far away from the actual values and hence becomes completely meaningless.
Here is where Quantile Regression comes to the rescue. I have used the Python package statsmodels 0.8.0 for Quantile Regression.
Let us begin with finding the regression coefficients for the conditioned median, 0.5 quantile.
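The statsmodels code is not reproduced in this excerpt. As a standalone sketch of the idea behind it (not the article's code), quantile regression minimises the "pinball" or check loss instead of squared error. The helper names below are hypothetical, and the example fits only a constant rather than a regression line, which is enough to show that minimising the loss at q = 0.5 recovers the median.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean check (pinball) loss for a quantile q between 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-predictions are weighted by q, over-predictions by (1 - q).
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


def best_constant(y, q):
    """Return the observed value that minimises the pinball loss as a constant prediction."""
    candidates = sorted(set(y))
    return min(candidates, key=lambda c: pinball_loss(y, [c] * len(y), q))
```

With the data [1, 2, 3, 4, 100], the constant that minimises the loss at q = 0.5 is the median, 3, even though the outlier pulls the mean up to 22. That robustness is part of why quantile regression helps when ordinary least squares fits poorly.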
The article doesn’t render the code very well at all, but Gopi does have the example code on Github, so you can follow along that way.
In the call to the produce method, both the key and value parameters need to be either a byte-like object (in Python 2.x this includes strings), a Unicode object, or None. In Python 3.x, strings are Unicode and will be converted to a sequence of bytes using the UTF-8 encoding. In Python 2.x, objects of type unicode will be encoded using the default encoding. Often, you will want to serialize objects of a particular type before writing them to Kafka. A common pattern for doing this is to subclass Producer and override the produce method with one that performs the required serialization.
The produce method returns immediately without waiting for confirmation that the message has been successfully produced to Kafka (or otherwise). The flush method blocks until all outstanding produce commands have completed, or the optional timeout (specified as a number of seconds) has been exceeded. You can test to see whether all produce commands have completed by checking the value returned by the flush method: if it is greater than zero, there are still produce commands that have yet to complete. Note that you should typically call flush only at application teardown, not during normal flow of execution, as it will prevent requests from being streamlined in a performant manner.
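To make the serialization pattern concrete, here is a minimal sketch. It wraps a producer instead of subclassing Producer, purely so that it can run without the Kafka client or a broker. JsonProducer and FakeProducer are hypothetical names, and the wrapped object is assumed to offer produce(topic, value=..., key=...) and flush(timeout) as described above.

```python
import json


class JsonProducer:
    """Serialises values to UTF-8 JSON before handing them to a producer.

    `inner` is assumed to expose produce(topic, value=..., key=...) and
    flush(timeout), matching the behaviour described in the text.
    """

    def __init__(self, inner):
        self._inner = inner

    def produce(self, topic, value=None, key=None):
        # Convert the value to bytes; None is passed through unchanged.
        payload = None if value is None else json.dumps(value).encode("utf-8")
        # Encode string keys explicitly so the wire format does not depend on defaults.
        key_bytes = key.encode("utf-8") if isinstance(key, str) else key
        self._inner.produce(topic, value=payload, key=key_bytes)

    def close(self, timeout=10.0):
        # Flush once at teardown; a non-zero result means messages are still pending.
        return self._inner.flush(timeout)


class FakeProducer:
    """Hypothetical in-memory stand-in so the sketch runs without a broker."""

    def __init__(self):
        self.sent = []

    def produce(self, topic, value=None, key=None):
        self.sent.append((topic, key, value))

    def flush(self, timeout=None):
        return 0  # nothing left outstanding
```

In a real application the FakeProducer would be replaced by the actual client's producer, and close() would be called once during shutdown rather than after every message.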
This is a fairly gentle introduction to the topic if you already know Python and have some familiarity with message broker systems.
Talking about Power Query, the DataMashup file is all you need. It includes everything from the structure of queries, tables, parameters, and lists to the actual M scripts behind the scenes. You can fetch all of this information from this single file. Let’s look at the structure of this file. If you open this file with a text editor, you will see some binary things first (which are related to the zipped nature of this file), and also some XML information. Yes, this is a zipped file. Let’s start with unzipping it into a folder. I’ve done that with the 7-Zip application.
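For readers who would rather script the first step than use 7-Zip, here is a hedged sketch. It assumes the PBIX file is itself a zip archive containing an entry named DataMashup, and it only lists entries and pulls out that entry's raw bytes. It does not attempt to unpack the DataMashup payload itself. The function names are hypothetical.

```python
import zipfile


def list_entries(archive):
    """Return the entry names inside a zip archive (a path or a file-like object)."""
    with zipfile.ZipFile(archive) as zf:
        return zf.namelist()


def read_entry(archive, name="DataMashup"):
    """Return the raw bytes of one entry, or None when the entry is absent."""
    with zipfile.ZipFile(archive) as zf:
        if name not in zf.namelist():
            return None
        return zf.read(name)
```

Calling list_entries on a path to a PBIX file should show the parts of the package, and read_entry returns the DataMashup bytes, which can then be saved to disk and explored as the article describes.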
This is an interesting peek under the covers of a PBIX file.
SQL Server has more than one way to pull pages in from disk for your queries. SQL Server can do a physical read of an 8KB page, or an extent of 8 of those 8KB pages.
SQL Server can also use the “read-ahead” mechanism to pull even larger chunks of data in from disk when you have a query that wants to read a lot of data, because just plucking one 8KB page or even 64KB of pages from disk isn’t super fast when you need lotsa pages.
But these terms get a little confusing when you’re changing between different diagnostic tools in SQL Server, because some of these tools include read-ahead reads in physical reads, and some don’t!
There is some good information here, so read the whole thing.
I’ve been lucky with database corruption during my career. I could probably count on one hand the number of times I’ve had to deal with it. A couple times, it was in a customer’s environment–they managed it themselves, but called me in to help. The other incidents were ones I inherited from a backup I had to restore into a production environment. The first time it happened to me, I didn’t realize it until days later when DBCC CHECKDB ran during a weekend maintenance window. After that, I added a new “rule” to my list: always run DBCC CHECKDB after restoring a database from someone else. That rule paid dividends today.
Here’s the output I saw:
Msg 8914, Level 16, State 1, Line 50
Incorrect PFS free space information for page (1:2564368) in object ID 457768688, index ID 1, partition ID 72057619124060160, alloc unit ID 72057594116767744 (type LOB data). Expected value 0_PCT_FULL, actual value 100_PCT_FULL.
CHECKDB found 0 allocation errors and 1 consistency errors in table 'tbl_Redacted' (object ID 457768688).
CHECKDB found 0 allocation errors and 1 consistency errors in database 'db_redacted'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (db_redacted).
Read on to see how Dave solved this issue.
There is an awesome set of PowerShell cmdlets out there written by MVP Chrissy LeMaire. This method is my personal choice. It works great and is easy to automate. You can run it with SQLAgent or you can just use Scheduled Tasks in the OS. The scheduled tasks method is a little cleaner, but you don’t get to see it in SQL Server. Also if you are on a cluster and running Windows 2012 you can cluster the task scheduler as an added benefit.
Chrissy wrote this with the intent of making migrations easier, and she succeeded. In fact, I made it a point to thank her at MVP Summit last year because it made my life insanely easier. The advantage here is that you can automate a lot more than just logins. In fact, you can migrate and automate pretty much anything at the server level. Here is the link that I guarantee you are going to bookmark, followed by a video demo where I show how to install and automate the syncing of logins using both the SQLAgent method and the Scheduled Tasks method.
DBATools would be my preference in this situation as well, but click through to see four other methods, as well as code.