
Data science: Jumping through hypes

A formal definition is tricky for any young multidisciplinary field

Fake it till you make it

Any new science must go through growing pains before it becomes widely accepted, well defined and standardized. Just remember how astronomy and astrology were a single entity up until a couple of centuries ago, how medicine used leeches, dildos and lobotomies to treat diseases, and how Isaac Newton, widely regarded as one of the most influential scientists of all time, practised alchemy, passionately trying to convert lead into gold or literally achieve immortality.

Data science is no exception. We still don’t even know what it actually is. The best we can do for now is to define it as a set of activities from multiple disciplines aimed at extracting knowledge from structured or unstructured data. This is far from precise. The activities can be anything, as long as there is some kind of algorithm, a repeatable process, behind them. The disciplines it combines are perhaps more concrete: mathematics, statistics, and computer and information science are the usual suspects.

Emerging roles in data science

Data scientist, data analyst, data engineer, data visualization, data entry: that's a lot of data

Is someone who writes “SELECT * FROM Users;” a data scientist? Is your grandmother a data scientist when she reads a newspaper? After all, she is extracting knowledge from structured data. We instinctively feel what is and what isn’t data science, which is not very scientific. It is very cool to be a data scientist; it was even declared “The Sexiest Job of the 21st Century” by some media. Everyone wants to jump on the bandwagon!

Slowly, several roles are emerging. Data scientists are, naturally, “the sexiest”, almost godlike. They are mining and wrangling data, teaching machines to think, playing detectives, predicting the future, and directing the strategies of multi-billion-dollar companies.

[Figure: Column chart flag of Ireland]

Data analysts come second; they are often regarded as “junior data scientists”. They don’t have as much mathematical or technical knowledge, or, simply put, they don’t see the future the way scientists do. They tend to use tools, while data scientists produce their own custom solutions. Data engineers are seen as technicians, platform providers for scientists, basically data maintenance guys.

Data visualization is even less valued. A few approaches are common. People coming from the visual end of the spectrum tend to see it as a graphical, artisanal, highly custom type of work. Each visualization is different, special, tweaked for a specific audience and medium. The methods they practice are usually closer to drawing than to any mathematics or science. Their results tend to lean towards infographics. Some of them do learn to code and use a library with predefined visualization templates. Those results are slightly more technical, sometimes interactive, with some basic mouse-overs or zoom-ins. Data scientists are usually not that visual; you can’t master all the skills. They tend to use tools which can generate charts. Those results are perhaps less communicative, but more useful to the trained eye. Pie charts are a big no-no in the designer world, for example, because they are considered dull, but for a scientist, they get the job done.
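
As a minimal sketch of the tool-driven approach (assuming matplotlib is installed; the labels and numbers are made up purely for illustration), the same data can be rendered as a pie or a bar chart in a few lines:

```python
# Minimal sketch: the same made-up data as a pie chart and a bar chart.
# Assumes matplotlib is installed; labels and values are illustrative only.
import matplotlib.pyplot as plt

labels = ["A", "B", "C"]
values = [45, 30, 25]

fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))
left.pie(values, labels=labels)   # dull to a designer, fine for a quick look
right.bar(labels, values)         # easier to compare values precisely
plt.show()
```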

[Figure: Pie chart flag of Japan]

At the bottom of the pyramid, nameless data entry slaves toil, mostly outsourced from third-world countries through Upwork and similar platforms. Their only tool is usually just the keyboard. Ever wondered who provides the thousands of images of traffic signs for a machine learning system that is supposed to learn to recognize them? Yes, some poor people had to draw rectangles around them. Speech transcription and film subtitles are another good example; not all digitization and data cleanup work can be fully automated, yet.

Getting useful information is the only goal of all this data manipulation. Fancy math, huge clusters of servers and jungle-like flow diagrams are pointless if your data is full of garbage, or if you can’t communicate your findings. Let’s respect all aspects of the data science process equally.

[Figure: Bar chart flag of Germany]

Trends in IT for the next season

AI, machine learning, big data, NoSQL

Artificial intelligence is another source of mystification. First of all, is it intelligence at all, or a glorified bunch of if-else statements, as sceptics would say? Most of the time, we don’t know what is actually going on inside machine learning systems. A model doesn’t really understand the problem it’s solving, and there is no human-like logic behind its reasoning. Monitoring and debugging such systems is often not easy. AI doesn’t have any feelings, good or bad; it’s not going to either destroy or save us any time soon.
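
To make the sceptic’s point concrete, here is a minimal sketch (assuming scikit-learn is installed; the iris dataset is used only for illustration) that trains a small decision tree and prints the learned model, which really does read like nested if-else rules:

```python
# Minimal sketch: a trained decision tree printed as plain if/else-style rules.
# Assumes scikit-learn is installed; the iris dataset is just an example.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# The learned "intelligence" is a handful of threshold comparisons.
print(export_text(tree, feature_names=list(iris.feature_names)))
```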

How big does data have to be to deserve being called big? Should you make fun of small big data? Of course not, but we see it all the time. My data is bigger than yours; hoarders are all around us, and they are even proud of it. Yes, it’s easy to produce and collect huge amounts of data, but that shouldn’t be your main goal. The more you have, the less it’s worth, in a way. That brings us to data science quacks: statistics is hard, joining a database with a key-value storage is not trivial, and real-time cluster anomaly detection requires many skills. It’s much easier to poke around, spin up a bunch of virtual machine images and claim you are a scientist. "Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it" still stands. Be humble, and don’t promise miracles to your boss or a customer. Startups are irritatingly cocky. Go read a book or two before you claim you're an expert.
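
For instance, even the simplest way of joining rows from a relational database with values from a key-value store happens in application code, one lookup per row. The sketch below (table and field names are hypothetical; Python's built-in sqlite3 and a plain dict stand in for the real systems) shows the naive version; batching, consistency and failure handling are what make it non-trivial at scale:

```python
# Minimal sketch of joining relational rows with key-value data in application code.
# Names are hypothetical; a dict stands in for a real key-value store (e.g. Redis).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ana"), (2, "Boris")])

last_login = {1: "2019-05-01", 2: "2019-05-03"}  # key-value stand-in

# The "join" is done row by row, outside the database.
for user_id, name in conn.execute("SELECT id, name FROM users"):
    print(name, last_login.get(user_id, "never"))
```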

NoSQL was the best thing since sliced bread for a while. The hype is slowly subsiding; we are learning what it’s good for, and when you shouldn’t use it. When you are determined to use a particular technology, every problem looks like a nail. The same goes for programming languages and tools in data science. Python and R seem to be the norm, but why? If you are good at something, stick with it and expand your knowledge, don’t throw it away. Every technology was produced to solve some problem, so don’t favor a language just because it’s popular.

Learning by doing

Don't panic

Datoris is attempting to bring down the hype, to reduce the complexity of business intelligence and demystify it. Experimenting should be easy, and users shouldn’t be afraid of making mistakes. Where possible, predefined defaults are suggested, so that you can focus on getting results. It should help you along your data science and analytics journey.

Tags: Business Intelligence, Data science, NoSQL, Big data, Statistics
