It might not come as a surprise that, like it or not, we are constantly surrounded by data. The technological advances of recent decades have brought the Internet to portable devices that many people use every day. In April 2020, about 4.57 billion people (59% of the world’s population) had access to the Internet, and it is estimated that each person generates about 1.7 MB of data every second [1]. Compare this figure with the storage capacity of our everyday electronic devices and it is not surprising that some people believe we are living in the era of Big Data. Everywhere we go and everything we buy can be recorded. In this sense, data have shed light on the desires of consumers, which is the main reason many companies consider data science to be a key factor in their strategic business plans.
Faced with the challenge of dealing with large empirical datasets, the field of statistics plays an important role. As a first, preliminary step in the data analysis procedure, data mining algorithms provide techniques to handle such large datasets properly. Although the aforementioned numbers are truly astonishing, it is important to note that it is not possible to explore the entire “data universe”: data that have not yet been collected are lost and incoming data are simply unknown. Therefore, statistical analysis is always restricted to a particular dataset, or sample, encompassing a limited number of events. The next step in the data analysis procedure consists of describing this sample to extract the Key Performance Indicators (KPIs) that are crucial for a business. But how reliably does the information obtained from this sample characterize the whole “data universe”?
One of the most important steps in statistical analysis is inferring the properties of the whole (the population, in statistical terms) from the information available in a sample. In contrast to logical statements in mathematics, where “A implies B,” the generic statements formulated in statistical inference are characterized by a probability of occurrence: “with a probability of 60%, A implies B.” The lack of information, together with the inherent complexity of the system one is dealing with, makes it impossible to assert anything with complete certainty. In this context, probability distributions are essential mathematical tools that assign probabilities of occurrence to different events.
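To make the idea of a probabilistic statement about the population more concrete, here is a minimal sketch of a percentile bootstrap, one common way to attach an uncertainty range to a quantity estimated from a sample. The waiting times below are hypothetical values invented for illustration, and the function name and its parameters are our own assumptions, not part of any particular library.

```python
import random
import statistics


def bootstrap_mean_interval(sample, confidence=0.95, n_resamples=2000, seed=0):
    """Percentile-bootstrap interval for the population mean.

    The sample is resampled with replacement many times; the spread of the
    resampled means approximates the uncertainty of the sample mean.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lower_idx = int((1 - confidence) / 2 * n_resamples)
    upper_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lower_idx], means[upper_idx]


# Hypothetical waiting times in minutes (illustrative only, not real data).
waiting_times = [4.0, 6.5, 5.0, 7.2, 3.8, 9.1, 5.5, 6.0, 4.7, 8.3, 5.9, 6.8]

low, high = bootstrap_mean_interval(waiting_times)
print(f"sample mean: {statistics.fmean(waiting_times):.2f} min")
print(f"approx. 95% interval for the population mean: [{low:.2f}, {high:.2f}] min")
```

The result reads as a statement of the kind described above: with roughly 95% confidence, the population mean lies inside the interval, rather than being known exactly.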
The main body of a probability distribution is associated with the most common events, which have a relatively high probability of occurrence. However, one might be interested in those extreme events whose appearance could have dramatic consequences for the performance of a business. For instance, in the mobility field, it is very unlikely that a user will wait more than three hours to be picked up by a bus. However, if it happens, the user’s experience will be extremely negative. Information on extreme events is contained in the tail of the probability distribution and, although their probability might be very small in comparison with that of common events, it is not zero. Extreme value theory is the branch of probability theory that studies extreme events and their probabilities of occurrence [2]. Within the framework of this theory, it is quite common to distinguish between light tails, where the probability of an extreme event falls off quickly (exponentially or faster) as the event becomes more extreme, and heavy tails, where it falls off more slowly, so that extreme events remain comparatively likely.
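The difference between light and heavy tails can be illustrated with a short sketch that compares two textbook distributions sharing the same mean: an exponential distribution (light tail) and a Pareto distribution (heavy tail). The 10-minute mean waiting time and the Pareto shape parameter are assumptions chosen purely for illustration; they do not describe any real service.

```python
import math


def exponential_tail(t, mean):
    """P(X > t) for an exponential distribution with the given mean (light tail)."""
    if t <= 0:
        return 1.0
    return math.exp(-t / mean)


def pareto_tail(t, mean, alpha):
    """P(X > t) for a Pareto (Type I) distribution with shape alpha > 1.

    The scale x_min is chosen so that the distribution has the requested mean,
    using mean = alpha * x_min / (alpha - 1).
    """
    if alpha <= 1:
        raise ValueError("alpha must be greater than 1 for the mean to exist")
    x_min = mean * (alpha - 1) / alpha
    if t <= x_min:
        return 1.0
    return (x_min / t) ** alpha


MEAN_WAIT_MIN = 10.0  # assumed average waiting time, in minutes
ALPHA = 2.5           # assumed Pareto shape parameter

for threshold in (30, 60, 180):
    light = exponential_tail(threshold, MEAN_WAIT_MIN)
    heavy = pareto_tail(threshold, MEAN_WAIT_MIN, ALPHA)
    print(f"P(wait > {threshold:>3} min): light tail {light:.2e}, heavy tail {heavy:.2e}")
```

Both distributions have the same average, yet under these assumptions the heavy-tailed one assigns a probability to a three-hour wait that is several orders of magnitude larger.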
Shotl constantly applies rigorous data science techniques to improve the user experience. Our routing algorithms are designed so that the distributions of our KPIs do not exhibit heavy tails, which keeps the probability of an extreme event as low as possible. Different strategies can be adopted to avoid these dangerous heavy tails. For example, the algorithm tries to choose an optimal route by minimizing those KPIs considered negative for the user experience. In addition, based on the results of simulations, some parameters of the algorithm’s configuration can be tuned to avoid the appearance of extreme events as much as possible.
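As a rough illustration of tuning a configuration with simulations, the sketch below compares two hypothetical configurations by a high quantile of their simulated waiting times instead of by the average. This is not Shotl's actual routing algorithm; the configuration names, the simulated values and both helper functions are assumptions made for the example.

```python
import math
import statistics


def empirical_quantile(values, q):
    """Nearest-rank empirical quantile of a list of numbers, for 0 < q <= 1."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def pick_configuration(simulated_waits, q=0.95):
    """Return the configuration whose q-quantile of waiting time is smallest."""
    return min(
        simulated_waits,
        key=lambda name: empirical_quantile(simulated_waits[name], q),
    )


# Hypothetical simulation output: waiting times in minutes per configuration.
simulated_waits = {
    # Usually very fast, but occasionally leaves a user waiting a long time.
    "config_a": [3] * 18 + [45, 45],
    # Slightly slower on average, but without extreme waits.
    "config_b": [7, 7] + [8] * 6 + [9] * 8 + [10] * 4,
}

for name, waits in simulated_waits.items():
    print(
        f"{name}: mean {statistics.fmean(waits):.2f} min, "
        f"95th percentile {empirical_quantile(waits, 0.95)} min"
    )
print("chosen by tail criterion:", pick_configuration(simulated_waits))
```

In this made-up data the first configuration has the lower average, but the tail criterion prefers the second one because it avoids the rare, very long waits.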
[1] Domo, Data Never Sleeps 8.0.
[2] De Haan, L. and Ferreira, A., Extreme Value Theory: An Introduction, Springer Science & Business Media (2007).