What does your pipeline look like, and what steps are involved?
Some of the file formats were optimized to work in certain situations. For example, Sequence files were designed to easily share data between Map Reduce (MR) jobs, so if your pipeline involves MR jobs then Sequence files make an excellent option. In the same vein, columnar data formats such as Parquet and ORC were designed to optimize query times; if the final stage of your pipeline needs to be optimized, using a columnar file format will increase speed while querying data.
At first, I’d suggest just using delimited files, as it’s easiest that way. Once you have developed a bit of Hadoop maturity, then it makes sense to think about whether rowstore formats (like Parquet and Avro) or columnstore formats (like ORC) make sense for a particular data set.
The key thing to remember with SQL Server is to convert to a non-integer value by using a “decimal” as shown in the above example with “10.”. This is the same as saying “10.0”. Without the “.”, it will result in uneven splits from rounding errors of integers. It is not the result that you intend to have it you want accurate results.
To show you the difference, I have included the SQL and results of a query that uses “.” and the other that does not, with “.” being the only difference:
It’s a good article, and definitely an important thing to think about when you have large tables.
The Problem I am Trying To Solve
“The fooled-by-data effect is accelerating. There is a nasty phenomenon called ‘Big Data’ in which researchers have brought cherry-picking to an industrial level. Modernity provides too many variables (but too little data per variable), and the spurious relationships grow much, much faster than real information, as noise is convex and information is concave.” – Nassim Nicholas Taleb, Antifragile, p. 416
According to Taleb, there’s a bias for error embedded in big data; more is not better, it’s worse. I’ve experienced this with business intelligence solutions and spoken about data quality in data warehouse solutions, saying:
“The ratio of good:bad data in a useless / inaccurate data warehouse is surprisingly high; almost always north of 95% and often higher than 99%.”
Taleb states more data includes a disproportionate amount of bad data, and that bigger data results in more spurious correlations. In other words, more is not better – it’s worse.
It’s an idea worth grappling with. The other side of the argument is that for some problems, you won’t know what you need until you need it.
There is no value in data.
If you’re still here, then I am assuming you either a) believe I have a valid point, or b) just want to see how crazy I am for opening my new data blog with a post spouting the lack of value in data. We’ll see which option is right by the end of the post because right now, I am not so sure which one is right and which one is wrong. After all, if there is no value in data, why should companies hire data professionals and give them a pay check?
My long-form response is too long for this format, so the short response is that data requires context. I agree that action is important, but the purpose of a data visualization professional is to provide information with the relevant context to assist decision-making. It’s not that there’s no value in data or that action is everything; it’s a multi-faceted process, and the specific relevant data will depend upon the industry. In professional sports, front offices certainly use accurate(-ish) metrics which show the worst performing players on the team because sports leagues are zero-sum games. Finding out Fred in Accounts Receivable spends the most time at the coffeemaker each day (17 minutes instead of 12 minutes!) matters a lot less, so unless you’re doing a Taylor-style factory study—and if you are, I’ll have other words with you that also aren’t apropos here—it doesn’t rate high enough in the relative priority list.
Many times in the past, I’ve had arguments with members of the development teams who, when we are discussing fuzzy searches, draw themselves up to their full height, look dignified, and say that a relational database is no place to be doing fuzzy searches or spell-checking. It should, they say, be done within the application layer. This is nonsense, and we can prove it with a stopwatch.
We are dealing with data. Relational databases do this well, but it just has to be done right. This implies searching on well-indexed fields such as the primary key, and not being ashamed of having quite large working tables. It means dealing with the majority of cases as rapidly as possible. It implies learning from failures to find a match. It means, most of all, a re-think from a procedural strategy.
This is a very interesting article, as Phil’s tend to be. I enjoy these types of solutions where it requires almost an inversion of mindset: instead of writing code which understands the data you intended, writing simpler code which looks at intention-laden data.
It’s clear that in this specific case – with an address column of
nvarchar(60)and a max length of 26 characters – breaking up each address into fragments can bring some relief to otherwise expensive “leading wildcard” searches. The better payoff seems to happen when the search pattern is larger and, as a result, more unique. I’ve also demonstrated why EXISTS is better in scenarios where multiple matches are possible – with a JOIN, you will get redundant output unless you add some “greatest n per group” logic.
Yes, it is 132 rows of magic! Now that we have this super cool table, how do we use it? Let’s pretend that my data is time-stamped in US Mountain Standard Time, but I want to display it in Western Australia Standard Time.
Read on for table details.
Many problems/opportunities can ONLY be addressed by making thousands of small behavior changes, on an ongoing basis, out in the “trenches.” In other words, there are some which CANNOT be addressed by a small number of smart decisions made at the top of the org. Improved pricing behavior by the sales force for instance. Improved quality of service by field technicians, for another.
How do you make that happen? How do you influence a large number of people to behave differently on an ongoing basis, especially when they operate at a distance (both organizationally and physically) from you?
Fancy management consulting firms charge six-seven figures to do this, but with Power BI and/or Power Pivot, it’s no longer necessary to make that outlay. We’ve seen organizations crack this nut with in-house resources, modest budgets, and short timelines – even though the impact is absolutely enterprise-wide. It takes some thought and iteration, for sure, but primarily, you “just” need to build a good scorecard.
This is a non-technical post, but I like it a lot because it helps motivate decisions and gives you a good reason to investigate whether your tools are doing the work you need them to do.
One of the samples Microsoft provided with release of new SQL Server 2016 was using simple logic of Benford’s law. This law works great with naturally occurring numbers and can be applied across any kind of problem. By naturally occurring, it is meant a number that is not generated generically such as a page number in a book, incremented number in your SQL Table, sequence number of any kind, but numbers that are occurring irrespective from each other, in nature (length or width of trees, mountains, rivers), length of the roads in the cities, addresses in your home town, city/country populations, etc. The law calculates the log distribution of numbers from 1 to 9 and stipulates that number one will occur 30% of times, number two will occur 17% of time, number three will occur 12% of the time and so on. Randomly generated numbers will most certainly generate distribution for each number from 1 to 9 with probability of 1/9. It might also not work with restrictions; for example height expressed in inches will surely not produce Benford function. My height is 188 which is 74 inches or 6ft2. All three numbers will not generate correct distribution, even though height is natural phenomena.
Tomaz includes SQL Server R Services code, so check it out.
(OH RIGHT, WE WROTE A BOOK ABOUT THIS!!!)
My friend Laine and I are writing a book for people on the data side, called “Database Reliability Engineering“, which is aimed at generalist engineers who want to learn how to deal with data responsibly and effectively.
(Actually that’s a good point, I am supposed to be pitching this book! — which is really mostly Laine with a smidgen of me but it’s going to be super awesome. Consider this your sales pitch.)
So first, as an accidental DBA, you should obviously buy this book :). Second: stateful services require a different mindset[*]. It’s cool that you are running your own databases! But reading post mortems like this where the conclusion is “MongoDB sucks” makes me fucking grind my teeth.
The theme of the story is a Mongo upgrade gone south, but this is a post about principles. And rainbows.