The first V is Volume: the problems associated with the large amounts of data collected in modern big data infrastructures. The second V on his list is Variety, that is, the problems arising when very different kinds of information are combined in large data sets, especially in those coming from observations. The third V stands for Velocity, that is, the amount of data per time unit that needs to be processed, stored, analyzed, and visualized in a big data project. So far, this list resembles the original list of Big Data’s 3 V’s.
However, the next V on the list is Veracity. This stands for the prerequisite that a dataset must contain enough knowledge (ground truth, number of samples, etc.) about the effects to be mined in order to extract reliable and statistically sound results from it. Furthermore, a V for Validity is added to the set, summarizing the quality of the data, the metadata, and the acquisition process.
One of the additional V’s is Value. This covers all aspects of deriving business insights from data and is very challenging. Many data science projects reach success rates of more than 95% measured during training, yet remain useless in practice. This often happens when the root cause of an effect is not correctly identified. For example, when controlling the quality of an electric vehicle, one can count the number of times the battery has been replaced for a specific model. Intuitively, an increase in the battery replacement rate suggests a decrease in battery quality. However, it may also be that batteries were replaced because a battery problem was suspected while the actual fault lay in the charging subsystem. So while a data scientist might correctly detect an anomalous number of battery replacements, this might have nothing to do with the battery itself.
The next V on the list is Variability. This covers aspects of current research, including non-stationary effects in spatiotemporal time series arising from seasonality and autocorrelation. Often there is no useful summary statistic, such as the mean value of a Gaussian distribution; the analysis therefore cannot compress the data, and much of the large volume has to be passed on to the analytics layer.
A more practical V on the list stands for Venue. Much data is generated in distributed, heterogeneous systems. This is true from a technical perspective, e.g., vendor lock-in and similar effects, but also for a global system with different, possibly contradicting rules for managing, exchanging, and communicating data across borders.
Possibly the most important V for this article is then presented as Vocabulary: the problem that many data scientists use different vocabularies. Furthermore, structured data comes with data schemata which are usually tailored to one application but hinder another. For example, temporal databases are often organized by rows identifying events and columns identifying properties or information associated with events. When running an analytics algorithm on such a property, however, the data needs to be reorganized into a column-oriented access pattern (a minimal sketch of this reorganization is given below). Additionally, vocabulary problems start with the definitions of very simple terms like “big” by different data scientists. While for a statistician a megabyte of values might be big (depending on the algorithm to be used), a computer scientist tends to think that “big” starts when the main memory of a single computer no longer suffices.
However, from a business perspective, “big” might start when the cloud hosting cost grows faster than the return on investment, completely independent of the amount of work actually being done. Finally, the list of 10 V’s of Big Data is completed by Vagueness, which stands for uncertainty about the meaning of specific data. Usually, this comes from badly structured or even missing metadata. For example, a dataset with location readings can be very accurate (in the case of laser scanning or precise point positioning from GNSS) or only very rough (e.g., when created from cellular network information). However, it is seldom the case that the actual characteristics of the recordings are collected together with the data. Often, only very rough information like “smartphone GPS” is available, if any.
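To make the schema example from the Vocabulary paragraph concrete, here is a minimal Python sketch that reorganizes row-oriented event records into column-oriented arrays. The record fields and values are hypothetical; the point is only the change of access pattern.

    # Row-oriented event records, as a temporal database might deliver them
    # (field names and values are made up for illustration).
    rows = [
        {"timestamp": 1, "vehicle_id": "A", "battery_voltage": 398.2},
        {"timestamp": 2, "vehicle_id": "B", "battery_voltage": 401.7},
        {"timestamp": 3, "vehicle_id": "A", "battery_voltage": 396.5},
    ]

    # Column-oriented reorganization: one array per property, so that an
    # analytics algorithm can scan a single property without touching the
    # remaining fields of every record.
    columns = {key: [row[key] for row in rows] for key in rows[0]}

    print(columns["battery_voltage"])  # [398.2, 401.7, 396.5]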
These 10 V’s of Big Data go far beyond the 3 V’s, incorporating many additional aspects. The interesting thing to note about the seven additional V’s is that they are less technical and focus on organizational and communication issues.
In fact, Veracity, Validity, Value, Variability, Venue, Vocabulary, and Vagueness cannot be answered by a smart algorithm or infrastructure. Instead, they relate to the risk of wasting time analyzing useless data or of creating misinterpretations, including wrong conclusions. Hence, we spot the central topic of this article in these 10 V’s: successful Data Science is a communication challenge. Seven out of ten V’s cannot have a technical solution, and for the classical 3 V’s Volume, Variety, and Velocity, a single technical solution does not exist either.
Before diving into these aspects in more detail, the following section focuses on the field of Data Science and explains how it differs from other approaches to data management.
52.3 Data Science Communication Model
Fig. 52.1 depicts the three disciplines from which Data Science is often defined as some sort of intersection. Mathematics and statistics are definitely a fundamental building block of Data Science, as statistical models need to be created and analyzed. The field of mathematics classically contributes two aspects: correctness of results and scalability. For example, recent advances in calculating singular value decompositions of large matrices were clearly created by mathematicians as contributions to their own field, but find wide application in Data Science due to the power of SVD-based dimensionality reduction (a minimal sketch is given below). However, mathematics and statistics alone do not represent Data Science. One reason is that mathematicians are usually fully satisfied once something is shown to be possible or to exist. Additionally, they are not typically trained to develop large-scale systems beyond demonstration capabilities.
Hence, the insights from mathematics and statistics must be transformed into high-quality software and tools. This is why the field of computer science plays a vital role in Data Science. Programming, software engineering, test-driven development, operational aspects, and distributed systems are some key ingredients of successful data science projects, and they are usually contributed by the field of computer science.
Still, all collected data originates from a world with rich semantics, with meaning behind each and every piece of information. This meaning, however, is still largely unavailable to computer systems. Therefore, there is a third aspect in every successful data science project, which is based on domain knowledge. Asking the right questions of a dataset, quickly understanding unexpected issues with algorithms, and, in general, having a well-working intuition in the domain are desperately needed for guiding the power and tools of Data Science towards valuable insights.
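The following is a minimal NumPy sketch of SVD-based dimensionality reduction as mentioned above; the data is synthetic, and the choice of k = 5 components is arbitrary.

    import numpy as np

    # Synthetic data: 1000 observations of 50 features.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 50))

    # Center the data, then compute a thin singular value decomposition.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Keep only the k strongest directions: a rank-k projection that
    # reduces the 50 original features to k derived ones.
    k = 5
    X_reduced = Xc @ Vt[:k].T
    print(X_reduced.shape)  # (1000, 5)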
Fig. 52.1 Data Science (DS) as an intersection of three disciplines
There might be a need for a fourth, a fifth, or even more additional disciplines to create a successful data science team. One such aspect could be business development knowledge, or intimate knowledge of law. However, the three depicted domains are vital for Data Science. This becomes clear when we discuss what happens if one of the fields is not well represented in a data science project or team (see Fig. 52.2).
Fig. 52.2 Underrepresented fields lead to machine learning, traditional research, and potentially risky projects
If we leave out mathematics and statistics, we are left with a team consisting of software developers and domain experts. This can lead to very successful projects, but due to the missing awareness of mathematical limitations, it may just as well diverge into a set of software tools and a set of claims that are simply wrong. This is the reason for writing the word “risky” next to the connection of these two fields. A team of that shape is able to create something good; however, it does not know whether the result actually is good, and it cannot prove basic assumptions about it.
If we leave out computer science, we are left with mathematics, statistics, and domain knowledge. This creates well-working results along the lines of classical research: a dataset is selected, extensive preprocessing and modeling is done, and a heavy dose of self-criticism protects from over-claiming or over-interpreting results. However, the results do not get transformed into agile tools; they will usually be unavailable in real time or on demand, and, most importantly, the amount of data that can be handled by these two disciplines is limited. There will always be a point where computer science skills are needed to scale a successful statistical analysis out to the big data space.
If we leave out domain knowledge, we are left with computer scientists, mathematicians, and statisticians. This is an extremely powerful combination with respect to problem-solving ability. However, as these groups might not understand the actual problem, they are likely to solve something less useful than a data science group working under the guidance of domain experts. To put it in other words: the team will generate results, but it may not be able to interpret or utilize them.
This discussion makes clear that Data Science is an aggregate discipline that reaches higher levels of maturity through the synthesis and composition of individual skills.
There might be the temptation to look for individual data scientists who can cover all of these aspects. As such data scientists are rare, this complete intersection is often called “the unicorn of Data Science”. It is much more important to form diverse teams and to set specific emphasis on the various aspects of the data science triangle.
52.4 The Ten V’s of Data Science
When reconsidering the ten V’s of Big Data in the context of Data Science, we arrive at a clear mapping of the challenges onto the communication diagram, as depicted in Fig. 52.3. The 3 V’s Volume, Velocity, and Variety are clearly a challenge for an interdisciplinary discussion between mathematics, statistics, and computer science. Approximation algorithms, smart index structures, randomization, and modern hashing architectures are computer science contributions to these three challenges. Local models, local analytics, treatment of missing data, and smart aggregation approaches with error bounds are contributions from mathematics and statistics (the sketch below combines several of these ideas).
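As one concrete illustration of how hashing, randomization, and error bounds interact, here is a minimal Python sketch of a count-min structure, a standard technique and not something taken from this article. Its estimates never undercount and, with high probability, overcount only by a bounded amount; the event names are made up.

    import hashlib

    class CountMinSketch:
        """Approximate event counting in fixed memory. Estimates never
        undercount; they overcount by a bounded amount with high
        probability (the standard count-min guarantee)."""

        def __init__(self, width=1000, depth=5):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _bucket(self, item, row):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            return int(digest, 16) % self.width

        def add(self, item):
            for row in range(self.depth):
                self.table[row][self._bucket(item, row)] += 1

        def estimate(self, item):
            return min(self.table[row][self._bucket(item, row)]
                       for row in range(self.depth))

    cms = CountMinSketch()
    for event in ["battery_swap"] * 42 + ["tire_change"] * 7:
        cms.add(event)
    print(cms.estimate("battery_swap"))  # 42, or slightly more on collisions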
Fig. 52.3 Mapping of the 10 V’s to the data science communication diagram
Veracity and Validity are to be discussed between domain knowledge on the one side and mathematics and statistics on the other. While mathematics and statistics might find surprising insights in the dataset, these insights can actually be random effects or sampling errors. Conversely, some assumptions about the data (such as error distributions, independence assumptions, and the like) implicitly made by applying specific statistical tools can turn out to be false. The point of discussion at this intersection is the gap between expectation and results: are the data science results realistic? Do they resemble knowledge? How well do they generalize? (The sketch below shows how easily “insights” arise from pure noise.)
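To see how easily “surprising insights” turn out to be sampling artifacts, the following hedged sketch correlates a purely random target with a thousand purely random features; with this many tests, some correlations inevitably look strong by chance.

    import numpy as np

    rng = np.random.default_rng(1)
    n_samples, n_features = 50, 1000

    # Pure noise: neither the features nor the target carry any signal.
    X = rng.normal(size=(n_samples, n_features))
    y = rng.normal(size=n_samples)

    # Correlate every feature with the target and count those that would
    # look "strong" to an uncritical analyst.
    corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])
    print((np.abs(corrs) > 0.3).sum(), "features correlate strongly by pure chance")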
Variability and Venue are to be discussed between domain knowledge and computer science. Highly variable datasets pose challenges for distributed computing and data organization. Knowing the exact patterns of data access can help a great deal here, and the venue has to be chosen according to the domain expert’s needs.
The central V of Value is clearly a domain knowledge question: given a statistically approved, well-scaling data science result with an error probability of 1%, how can the whole system be monetized? How can the errors be treated, and how much harm do they cause from a business perspective? How much value can be generated by reducing this expected error even further? A back-of-the-envelope sketch of this question follows below.
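A hedged back-of-the-envelope sketch of the Value question; every number below is invented purely for illustration.

    # Hypothetical figures: 1% error rate, 100,000 automated decisions per
    # year, 500 EUR average cost per wrong decision.
    error_rate = 0.01
    decisions_per_year = 100_000
    cost_per_error = 500.0

    expected_loss = error_rate * decisions_per_year * cost_per_error
    print(f"Expected yearly loss: {expected_loss:,.0f} EUR")  # 500,000 EUR

    # The business value of halving the error rate:
    print(f"Value of halving the error: {expected_loss / 2:,.0f} EUR")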
The remaining two V’s, Vocabulary and Vagueness, are to be discussed in the center of a data science group. It is difficult to find a working language across the three disciplines that ensures everyone is talking about the same things.
For example, “Big Data” means something completely different in each of the three domains. For computer science, Big Data is usually linked to needing a distributed cloud infrastructure instead of a single cluster of computing devices, together with all the methodologies needed for running such distributed architectures. For statistics, Big Data starts when the usual algorithms and calculations take too long and less fail-safe approaches have to be put in place; that is, Big Data begins where the road that is under the firm control of modern statistics has to be left. For mathematics, Big Data starts when lower bounds come into play: the data is compressed in a lossy way, and we need to infer results (at least probabilistically) about the outcome of an infeasible computation that has been replaced by some approximation algorithm. When assembling a data science team from experts holding these three perspectives on Big Data, it is very difficult for a domain expert to talk to them about the real world and its real problems. The computer scientists will always tend to tell the expert that more data could be useful in the future, the statisticians will try to help the expert select useful things out of the huge dataset available, and the mathematicians will try to find the most elegant solution to the problem.
The expected result of this imaginary situation can often be observed in reality: first of all, all data is collected into a data lake, just to have it. Then, beautiful reports are generated using small fragments of the data, showing results of microscopic scope. Additionally, visionary projects for specific problems are started, but they often do not come to a successful completion, as the surroundings change faster than the project goals can be reached.
52.5 Four Top Skills of Data Science Groups
In order to cope with the situation just described, we propose that data science groups organize around four main skills: three specific domain challenges and one overall challenge, see Fig. 52.4. These skills should be read in two directions. First of all, the person or subgroup representing a specific aspect of Data Science must aim for excellence in this area. However, being able to communicate a working knowledge of this area to the rest of the team is at least as important as being excellent in it. When we reach a situation in which individual people or groups stand for the three aspects of the communication diagram, and the group as a whole is able to build a common working knowledge of each of these expert areas, we have made great progress towards a successful and powerful data science group ready to deliver business value from data.
Fig. 52.4 Four main skills for successful data science groups
We align these roles with the following four skills to show that a successful data science group must cover all of them.
Skill 1: Handling Big Data (as a Computer Science Challenge)
The amount of data is so large that neither statisticians nor domain experts can extract knowledge directly with their most common tools. High performance computing, distributed systems, cloud computing, and GPU computing are needed.
Skill 2: Detect Limitations (as a Mathematics/Statistics Challenge)
The fact that a computer system makes perfect predictions on a given dataset does not mean that it generalizes. In fact, the most important challenge of artificial intelligence is the avoidance of overfitting and the question of how to get a system to learn the right concepts instead of the random sampling error. The sketch below makes this failure mode visible in a few lines.
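A minimal NumPy sketch of this failure mode: a higher-degree polynomial fits its noisy training points much more closely while typically predicting held-out points worse than a simple line. All numbers are synthetic.

    import numpy as np

    rng = np.random.default_rng(2)

    # Noisy samples of a simple underlying relationship y = x + noise.
    x_train, x_test = rng.uniform(-1, 1, 20), rng.uniform(-1, 1, 20)
    y_train = x_train + rng.normal(scale=0.2, size=20)
    y_test = x_test + rng.normal(scale=0.2, size=20)

    for degree in (1, 9):
        coeffs = np.polyfit(x_train, y_train, degree)
        def mse(x, y):
            return np.mean((np.polyval(coeffs, x) - y) ** 2)
        print(f"degree {degree}: train MSE {mse(x_train, y_train):.3f}, "
              f"test MSE {mse(x_test, y_test):.3f}")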
Skill 3: Awareness & Management (as a Domain Expert Challenge)
Assuming we possess skills one and two, we are able to cope with the amount of data and to extract useful knowledge. How do we get this into a company? How can we communicate with the decision makers? How can we integrate domain knowledge? There is a need for a sustainable, well-incorporated data science strategy that is actively supported by the management.
Skill 4: Usefulness & Understanding (as an Overall Challenge)
After transforming the data into insights via Data Science, and after transforming the insights into business value through the domain experts, can we build a general, uniform understanding such that these indirections are no longer needed and data science results are directly recognized by decision makers?
52.6 Conclusion
This article discussed the 10 V’s of Big Data that extend the well-known 3 V model consisting of Volume, Variety, and Velocity. Furthermore, the data science communication model has been reviewed, together with the effects of omitting one of its domains. The 10 V’s of Big Data have then been mapped onto the data science communication diagram, revealing that the original 3 V’s of Big Data map naturally onto computer science, mathematics, and statistics, while the other seven V’s align between domain knowledge and computer science or mathematics/statistics, respectively. Based on this distribution, we have identified four top skills that a successful data science group needs to have.
References
1. M. Brown and K. Mishra, “Personalization at Spotify using Cassandra,” Spotify Labs, 2015. https://labs.spotify.com/2015/01/09/personalization-at-spotify-using-cassandra/
2. D. Whiting, “Data Processing with Apache Crunch at Spotify,” Spotify Labs, 2014. https://labs.spotify.com/2014/11/27/crunch/
3. C. Smith and J. Magnusson, “Watching Pigs Fly with the Netflix Hadoop Toolkit,” talk at Hadoop Summit, June 27, 2013.
4. D. Laney, “3D Data Management: Controlling Data Volume, Velocity, and Variety,” META Group, 2001. http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
5. K. D. Borne, “Top 10 Big Data Challenges – A Serious Look at 10 Big Data V’s,” 2014. https://www.mapr.com/blog/top-10-big-data-challenges-serious-look-10-big-data-vs