Sometimes, with big data, matrices are too big to handle, and it is possible to use tricks to numerically still do the map. Map-Reduce is one of those. With several cores, it is possible to split the problem, to map on each machine, and then to aggregate it back at the end.
Arthur gives us an interesting example in R to boot.
Connect has been an integral part of Apache Kafka since version 0.9, released late 2015. It has proved to be an effective framework for streaming data in and out of Kafka from nearby systems like relational databases, Amazon S3, HDFS clusters, and even nonstandard legacy systems that typically show themselves in the enterprise. Connect is an API on which the connectors themselves are built, plus a run-time framework that runs them in a scalable, fault-tolerant way. The intent was for the community to provide its own connectors to plug into this framework and do the work of data integration while saving everyone a bunch of unrewarding coding that was near-boilerplate and didn’t add a lot of differentiated value to the business.
So where would those connectors live? Well, GitHub, for starters. At the time of this writing, there were 660 repositories matching the search phrase “Kafka Connect” on the popular hosting service, all in various stages of repair and levels of maintenance. Beyond those, Confluent’s popular Connectors page has proven to be one of the best ways to find connectors, some of which are supported by Confluent, and others of which have robust community support behind them. The Connectors page lists for each entry the type of connector, the developer, a few tags, and how you could obtain the code—but that’s really all it did. You still had to go find the released JARs for the connector, download them, and know how to install them properly. And if there were no released JARs available, you had to clone the repository, figure out how to run the build, and then install the JARs into your own Kafka Connect installation. Maybe not rocket science, but we all know it’s never as simple as it sounds. And besides, this was just connectors—no transformations or converters were available on this page.
We knew there was a better way. We wanted something that was easier to use, would avoid you having to building a connector from source every time you wanted an update (and learning a new build tool every now and then), and would be built on top of a meaningful and functional discovery mechanism. And most importantly, we wanted to avoid the pitfalls of manually moving JARs around and having to debug why Connect didn’t find them.
This looks like a good addition to the Kafka ecosystem.
1. yarn-client mode: In
clientmode, the driver runs in the client process, and the application master is only used for requesting resources from YARN. To manage the memory first make sure that you have your yarn-site.xml in spark,
- spark.yarn.am.memory: To increase the memory you should set spark.yarn.am.memory property in spark-defaults.conf but make sure that you do not allocate more memory than capacity of node manager which is defined in yarn-site.xml as yarn.nodemanager.resource.memory-mb or you can also give it when you are running spark submit with –conf parameter
For example $SPARK_HOME/bin/spark-submit –conf spark.yarn.am.memory=1024m
Check it out for a few other configuration settings you can tweak.
In Hadoop version 1.0 which is also referred to as MRV1(MapReduce Version 1), MapReduce performed both processing and resource management functions. It consisted of a Job Tracker which was the single master. The Job Tracker allocated the resources, performed scheduling and monitored the processing jobs. It assigned map and reduce tasks on a number of subordinate processes called the Task Trackers. The Task Trackers periodically reported their progress to the Job Tracker.
This design resulted in scalability bottleneck due to a single Job Tracker. IBM mentioned in its article that according to Yahoo!, the practical limits of such a design are reached with a cluster of 5000 nodes and 40,000 tasks running concurrently. Apart from this limitation, the utilization of computational resources is inefficient in MRV1. Also, the Hadoop framework became limited only to MapReduce processing paradigm.
To overcome all these issues, YARN was introduced in Hadoop version 2.0 in the year 2012 by Yahoo and Hortonworks. The basic idea behind YARN is to relieve MapReduce by taking over the responsibility of Resource Management and Job Scheduling. YARN started to give Hadoop the ability to run non-MapReduce jobs within the Hadoop framework.
There’s a lot of depth to YARN.
The current state of the Clustered Columnstore Index ONLINE rebuild points to be an unfinished version, which will definitely get vastly improved before being released & supported in SQL Server. I have seen a couple of deadlocks and canceled transactions and so I decided that this blog post will get updated as soon as there will be an official announcement of this feature.
If you are still looking to start working on this feature, then I would suggest trying it on smaller tables. Like really, really small ones.
Oh, and for online rebuild operation focus on using partition rebuild – you are using the partitioning, right ? 🙂
Niko gave this a try in Azure SQL Database, as there is no publicly available version of SQL Server which supports this. I’ve been waiting for this feature for 3 years now, so I’ll be happy to see it in production.
You can use Distributed transactions which specifies the start of a SQL distributed transaction managed by Microsoft Distributed Transaction Coordinator (MS DTC). This is only applicable to SQL Server and no Azure. Transaction-level snapshot isolation does not support distributed transactions.
Is a user-defined transaction name used to track the distributed transaction within MS DTC utilities. transaction_name must conform to the rules for identifiers and must be <= 32 characters.
Is the name of a user-defined variable containing a transaction name used to track the distributed transaction within MS DTC utilities. The variable must be declared with a char, varchar, nchar, or nvarchar data type.
Jeanne also includes the important comment that nested transactions don’t really work the way you’d expect them to. I probably could have ended that last sentence early with, “don’t really work.”
Oops, something did go wrong, as it turns out that if you try to grant permissions on extended stored procedures, which SPEES is, you need to do it from the
masterdatabase. Cool, let us switch to master and do it there. Well, if you try to do that – then you get another error: the user does not exist in
At this stage you have a couple of options:
Add the login for the user to theNo do not do that, I am only kidding! Do.Not.Do.That!
sysadminrole, or the user to the
db_ownerrole in the actual database.
Create the user in
masterand grant the permission. That would work.
Grant the permission to
Check it out, as there are two parts to the process.
Access to the table columns can be controlled based on the user’s execution context or their group membership with the standard GRANT T-SQL statement. To secure your data, you simply define a security policy via the GRANT statement to your table columns. For example, if you would like to limit access to PII data in your customers table, you can simply GRANT SELECT permissions on specific columns to the ContractEmp role:GRANT SELECT ON dbo.Customers (CustomerId, FirstName, LastName) TO ContractEmp;
This capability is available now in all Azure regions with no additional charge.
This has been in regular SQL Server for a long time, so it’s good to see it make its way into Azure SQL Data Warehouse, and in a manner which doesn’t involve creating user-defined functions for predicates like Row-Level Security.
With this option, if no rows have changed, the statistic will not be updated. If one or more rows have changed, the statistic will be updated.
Since the release of SP1 for 2012, this has been my only challenge with Ola’s scripts. In SQL Server 2008R2 SP2 and SQL Server 2012 SP1 they introduced the sys.dm_db_stats_properties DMV, which tracks modifications for each statistic. I have written custom scripts to use this information to determine if stats should be updated, which I’ve talked about here. Jonathan has also modified Ola’s script for a few of our customers to look at sys.dm_db_stats_properties to determine if enough data had changed to update stats, and a long time ago we had emailed Ola to ask if he could include an option to set a threshold. Good news, that option now exists!
Keeping statistics up to date is a nice way of quietly improving performance in a system.
In a previous consulting life, a customer contacted us and asked for an evaluation of their architecture. They were attempting to scale the business and encountering… obstacles. A team looked over their software and database designs and recommended a rewrite of their custom code. We supplied a proposal (including an expensive estimate) to deliver the re-architecture, redesign, and rewriting of their code.
They felt the price was too high.
We understood, so we countered with a proposal to coach their team to deliver the same result. This would cost less, the outcome would be the same, and their team would grok the solution (because their team would build the solution). The catch? It would take longer.
They felt this solution would take too long.
Andy has some great thoughts on the subject. One area where I’d push further is to say that the best way to take advantage of cloud services is not the best way to take advantage of on-prem services, so even if you have a well-architected on-prem solution, it might not be ideal for running in AWS or Azure.