the write-ahead log
This chapter also helped me understand what’s going on with write-ahead logs better! Write-ahead logs are different from log-structured storage; both kinds of storage engines can use write-ahead logs.
Recently at work, the team that maintains Splunk wrote a post called “Splunk is not a write-ahead log”. I thought this was interesting because I had never heard the term “write-ahead log” before!
There are a few different topics in here, all of which are important for understanding how databases work.
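To make the write-ahead idea concrete, here is a toy sketch of the rule (hypothetical file names, not any real engine’s code): make the change durable in the log before touching the main data file, so a crash in between can be repaired by replaying the log.

$change = 'set greeting=hello'
Add-Content -Path '.\wal.log' -Value $change    # 1. append the change to the log first
                                                #    (a real engine would fsync here)
Set-Content -Path '.\data.txt' -Value 'hello'   # 2. only then apply it to the data file
# Recovery after a crash: replay any wal.log entries not yet reflected in data.txt.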
Why does help exist?
When you think about it, why is there even a function called help? As far as I’m aware, it’s basically the same as Get-Help, except it automatically pipes the output to | more so we get pages rather than a wall of text.
Is there more that we can do with Get-Help, though? Is there a way that we can return the examples only? Syntax only? Parameters only?
Is there not a way that we can do such things?!
Read on to find out if there is.
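In the meantime, here are a few of the switches Get-Help ships with (worth confirming against your own PowerShell version):

Get-Help Get-ChildItem -Examples          # show only the examples
Get-Help Get-ChildItem -Parameter Path    # help for a single parameter
(Get-Help Get-ChildItem).Syntax           # just the syntax block
Get-Help Get-ChildItem -Full              # the complete help, unpaged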
Although most analytics applications today still leverage older data warehouse and OLAP technologies on-premises, the pace of the cloud shift is significantly increasing. Infrastructure is getting better and is almost invisible in mature markets. Cloud fears are subsiding as more organizations witness the triumphs of early adopters. Instant, easy cloud solutions continue to win the hearts and minds of non-technical users. Cloud also accelerates time to market allowing for innovation at faster speeds than ever before. As data and analytics professionals, be sure to make time to learn a variety of cloud and hybrid analytics tools.
Exploring novel technologies across various ecosystems in the cloud world is usually as simple as spinning up a cloud image or service to get started. There are literally zillions of free and low-cost resources for learning. As you dive into a new world of data, you will find common analytics architectures, design patterns, and types of technologies (hybrid connectivity, storage, compute, microservices, IoT, streaming, orchestration, database, big data, visualization, artificial intelligence, etc.) being used to solve problems.
It’s worth reading the whole thing.
Raise your standards as high as you can live with, avoid wasting your time on routine problems, and always try to work as closely as possible at the boundary of your abilities. Do this because it is the only way of discovering how that boundary should be moved forward.
Readers of this blog post are just as likely as anyone to fall into the trap behind the classic maxim, “When all you have is a hammer, everything is a nail.” I remember a job interview where my interrogator appeared uninterested in talking further after I wasn’t able to solve a certain optimization using Lagrange multipliers. The mindset isn’t uncommon: “I have my toolbox. It’s worked in the past, so everything else must be irrelevant.”
There’s some good advice in here.
It’s still early days for machine learning. The bounds and guidelines about what is possible or likely are still unknown in a lot of places, and bigger projects that test more of those limitations are more likely to fail. As a fledgling data engineer, especially in the industry, it’s almost certainly the more prudent course to go for the “low-hanging fruit” — easy-to-find optimizations that have real world impact for your organization. This is the way to build trust among skeptical colleagues and also the way to figure out where those boundaries are, both for the field and for yourself.
As a personal example, I was once on a project where we worked with failure data from large machines with many components. The obvious and difficult problem was to use regression analysis to predict the time to failure for a given part. I had some success with this, but nothing that ever made it to production. However, a simple clustering analysis that grouped machines by the frequency of replacement for all parts had some lasting impact; this enabled the organization to “red flag” machines that fell into the “high replacement” group, where users may have been misusing the machines, and to bring those users in for training.
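As a rough illustration of that kind of quick win (not the author’s actual analysis; the file layout is made up, and a simple count-based flag stands in for the clustering):

# Hypothetical replacement log with MachineId and PartId columns.
$log = Import-Csv '.\replacements.csv'

# Count part replacements per machine.
$counts = $log | Group-Object -Property MachineId |
    Select-Object @{n='MachineId';e={$_.Name}}, Count

# Flag machines replacing parts far more often than average,
# a crude stand-in for the "high replacement" cluster.
$avg = ($counts | Measure-Object -Property Count -Average).Average
$counts | Where-Object { $_.Count -gt 2 * $avg } | Sort-Object Count -Descending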
Also read the linked Dijkstra note; even in bullet point form, he was a brilliant guy.
R for Data Science: Import, Tidy, Transform, Visualize, and Model Data is written by Hadley Wickham and Garrett Grolemund. You can buy it, and you can also access it online.
If you’re interested in getting started doing data science as a practitioner, this book is a very accessible introduction to programming.
This book starts gently and doesn’t teach you much about R from a general programming perspective. It takes a very task-oriented approach and teaches you R as you go along.
This book doesn’t cover the breadth and depth of data science in R, but it gives you a strong foundation in the coding skills you need and a sense of the process you’ll go through.
It’s a good starting set of links.
even more links
a paper someone said was good (by Efron): Bootstrap Methods: Another Look at the Jackknife (a quick sketch of the bootstrap idea follows after these links)
OpenIntro has some free statistics books
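Here is the core bootstrap idea from that Efron paper in miniature (made-up data; not code from the paper):

# Resample the data with replacement many times and watch how a
# statistic (here, the mean) varies across the resamples.
$data = 23, 41, 17, 52, 38, 29, 44, 31, 26, 48

$means = foreach ($i in 1..1000) {
    $resample = foreach ($j in 1..$data.Count) {
        $data[(Get-Random -Maximum $data.Count)]   # random index = sampling with replacement
    }
    ($resample | Measure-Object -Average).Average
}

# The spread of the resampled means estimates the mean's standard error.
# (On PowerShell 6 and later, Measure-Object also supports -StandardDeviation.)
$means | Measure-Object -Average -Minimum -Maximum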
There are a lot of good links in Julia’s post. I should also mention that Andrew Gelman and Deborah Nolan have a new book coming out in July. Gelman’s Bayesian approach suits me well, so I’m pre-ordering the book.
Today I am very excited to announce that I have (finally!) launched my email course covering the basics of SQL Server Security.
It took a lot of work to get a new system in place that makes the learning experience a little different. It is like a normal email course, but at the same time it isn’t.
I have been waiting for this for months, ever since I first heard Chris talk about it.
What this has meant is that innovation – in particular in the Azure Public Cloud, ISVs, new data services and products, and new data-related infrastructure – has accelerated dramatically and changed the very definition of what was previously accepted as comprising the “Data Platform”.
Nowadays when I talk to customers about the “Data Platform”, it encompasses a range of services across a mix of IaaS, PaaS, and SaaS. The decision of which data service to deploy now comes down to matching the business case’s technical requirements with the capabilities of a purpose-built cloud service – as opposed to (in the past) trying to fit an obvious NoSQL use case into a traditional RDBMS platform.
I now see the “New Data Platform” as much broader than ever before, encompassing many other “non-traditional” data services…
The developer in me thinks this is nuts. Run the same few lines of code twice, with no changes in between, and get different outputs? Madness!
Here’s another example. Nothing too complex here: I connect to an instance of SQL Server, run SELECT CURRENT_TIMESTAMP, and show the returned value in the output window. (There’s a fixable issue here that I would go on to discover later. But hold that thought for now.)
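A quick way to see the same thing from PowerShell, if you’d like to follow along (not necessarily the author’s setup; this assumes the SqlServer module and a reachable local instance):

Import-Module SqlServer
# Two identical statements, run back to back, return different results:
Invoke-Sqlcmd -ServerInstance 'localhost' -Query 'SELECT CURRENT_TIMESTAMP AS Now;'
Start-Sleep -Seconds 1
Invoke-Sqlcmd -ServerInstance 'localhost' -Query 'SELECT CURRENT_TIMESTAMP AS Now;'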
Even when you’re conceptually familiar with a language, getting into the particular foibles of that language can expose all sorts of behavior which is strange to newcomers.