The recent AI wave, driven by a tremendous ability to utilize data, helps us gain more knowledge from data, make valuable predictions, create novel applications (e.g., driverless cars) and improve nearly all sectors of business as well as our everyday life. This great stride was brought about by a confluence of progress in IT and other technologies that enable the generation and collection of data, its storage, handling and computation, together with concomitant efforts to improve sophisticated approaches to processing data. In this post we explore the status of our ability to tackle the popular adage “garbage-in garbage-out (GIGO)” in order to maximize the opportunities and enable progress. This is the first of a two-part post on adopting AI at scale. Centered on the theme of GIGO, we will also explore early successes of (true and pseudo) AI-based solutions, their implications for subsequent progress, how businesses can prepare to reap value from these solutions, and how to set appropriate strategies for digital and AI transformation. We take this opportunity to highlight advantages of the approach that the outcomes intelligence company ReSurfX is taking with excellent success.
Over the last decade we have made tremendous progress in digital technologies that facilitate far more advanced and newer uses of data, to the point that nearly every data-based application is now referred to as Artificial Intelligence (AI). The popular adage “garbage-in garbage-out (GIGO)” needs no elaboration for practitioners or business leaders. When we talk about AI or digital advances, both in enterprises and in society, a major factor in play is our ability to tackle GIGO. Although we have now reached this tipping point of AI, many of the tools we use today rest on more or less the same conceptual principles that were tried and dismissed as unsuitable for those goals as little as a decade or two ago. In this post we will explore early successes of (true and pseudo) AI-based solutions, their implications for subsequent progress, and how businesses can prepare to reap value from digital and AI solutions. Consistent with the theme indicated in the title, we will focus on where we are in handling GIGO and some reasons why things are as they are, and scan the emerging landscape through my lens as the leader of the outcomes intelligence company ReSurfX (with due thanks to our other team members).
To give a perspective, ReSurfX is an outcomes intelligence company that improves innovation and ROI for enterprises from their data-intensive initiatives by leveraging a novel machine learning (ML) approach, ‘Adaptive Hypersurface Technology’ (AHT). Because a novel approach is our mothership, we do many times more validation than most in the industry, evaluating AHT-based solutions both by themselves and against alternative ML and analytical approaches available in the market. AHT-based solutions incorporated into our SaaS product and tested at scale yield both complementary insights and enhanced outcomes that are robust and valuable.
In using data at scale (Big Data), our thematic GIGO statement plays a role both in the data input and in the solutions that use it and feed processed data from each stage of an application to the next. The former relates to input data quality, and the latter to the quality of the processing pipelines that handle problems of data quality and other properties often associated with Big Data. The effect of errors propagates along the pipeline with compounding effect, and the damage grows with the size of the workflow, whether measured (i) by the number of operations the data is subject to or (ii) by the length of the product or application development cycle. Drug development in Pharma is an example of the latter: downstream activities are often far more expensive than early stages, and errors compound over years.
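As a rough illustration of the compounding effect described above, the sketch below assumes, purely for illustration, that every stage of a pipeline independently corrupts the same fraction of the records that are still correct. The function name and the 5% rate are hypothetical, and real pipelines rarely behave this uniformly.

```python
def fraction_still_correct(per_stage_error: float, stages: int) -> float:
    """Share of records untouched by error after `stages` steps.

    Assumption (hypothetical): each stage independently corrupts the same
    fraction of the records that are still correct.
    """
    if not 0.0 <= per_stage_error <= 1.0:
        raise ValueError("per_stage_error must be between 0 and 1")
    if stages < 0:
        raise ValueError("stages must be non-negative")
    return (1.0 - per_stage_error) ** stages


if __name__ == "__main__":
    # Print how the surviving share shrinks as the workflow gets longer.
    for n in (1, 5, 10, 20):
        share = fraction_still_correct(0.05, n)
        print(f"{n:>2} stages: {share:.1%} of records still correct")
```

Under these assumptions, a modest 5% error per stage leaves only about 60% of records correct after ten stages, which is why long workflows and long development cycles are hit hardest.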
When referring to Big Data applications here, we mean both those that utilize enormous volumes of data from one or a few sources and those that work with smaller data volumes by combining a variety of sources, including near real-time data from the edge (such as smartphones, wearables, IoT devices, etc.).
Data quality also involves other factors, such as societal bias embedded in the data, that at this point need other kinds of conscious effort involving people from multiple specialties, besides data-driven or computation-based approaches. The remedy for these may not be amenable to automation for some time to come, or its reliability will be sporadic and dangerous. Some facets of this class of problem are discussed in the blog posts “How does AI, Waymo self-driving taxi launch and Google indicate room for another search engine?” and “How Bad Data Is Undermining Big Data Analytics”.
Market definition of AI, ‘riding the wave’ and classes of early successes
Contrary to what is often stated, AI is not synonymous with machine learning; likewise, machine learning is not a synonym for neural networks or deep learning. AI is a combination of data, analytics and machine learning with other technologies and tools, and at least at this stage it is highly dependent on specialized knowledge to improve what we get out of automated systems. Though there are excellent, genuine and novel cases (e.g., autonomous vehicles), a significant proportion of solutions at this time are pseudo-AI. These are the classical cases of ‘riding the wave’ and rely on the ‘Tinkerbell effect’: either (i) they are aimed at small niches, ride on the buzzword AI and will fall apart in a short period, or (ii) they are useful solutions that might have struggled to gain market acceptance in the past but are now accepted because they are aligned with the term AI.
We know that sophisticated and well-understood solutions that were considered not practically useful just over a decade ago give valuable new insights when applied to large volumes of data (Big Data), and they are being utilized by many organizations in every industry, spanning logistics, autonomy, finance, medicine, sentiment and behavior prediction, etc. These successes shed light on the immense advantages we can derive, but they also imply that our current advantages often depend on the value of far fewer correct insights, obtained despite a significantly larger number of incorrect ones. This highlights the amount of resources and effort we are wasting to get those insights.
Nevertheless, in toto these solutions, technologies and applications have proven to confer extremely valuable business advances and to enable societal progress. We can attribute a significant proportion of early successes to two different factors: (i) Sheer scale: in many cases our current ability to apply those solutions (dismissed a decade or so ago) at enormous scale in terms of data (Big Data) increases the predictive power and consequently the chances of uncovering insights not otherwise possible and of getting a blockbuster. (ii) Prior deductions needing additional support: this class of success comes from the many intuitions or deductions based on practical knowledge that were waiting in the wings without sufficient proof to justify expenditure; we pick those insights from among the enormous number of wrong ones.
Even a Glass Far Less Than Half Full Provides Immense Business Value, However Leaders Beware
The value of the early successes referred to above, even from the small proportion of correct insights, demonstrates enormous promise and consequently provides the impetus and resources to advance the development and application of AI. Thus the glass is still far from half full.
Novel applications and continued innovation in data and digital technologies are expected to evolve and mature for at least a decade or two. However, the immense potential for still-uncovered value also demonstrates that we can achieve enormously improved outcomes and better utilization of resources by paying attention to these problems; conversely, it shows the enormous amount of resources being expended unproductively at this time. If business leaders, and technical leaders from other disciplines, do not keep themselves adequately informed of these problems, the result will be wrong capabilities, infrastructure and commitment to solutions that will not be useful beyond the short term. Hence there are significant efforts to educate business leaders on data, associated technology and AI to help with their digital transformation strategies.
Still, the value of the progress is enormous. It indicates that, with the sophistication we operate with at any point, and despite being way off from what we want or can do and despite other significant shortcomings, we are able to reap huge rewards and make incredible societal progress. Yes, glass-half-full is a great positive attitude that has historically gotten us to remarkable progress.
My favorite example to highlight ‘immense value despite huge assumptions or unknowns’ and ‘progress even with significant known limitations’ from the applications of innovation referred to above is aircraft and spaceships. In aircraft design and development, nearly all the theories that guide practical aspects of design make approximations such as ‘air is an incompressible fluid’; despite that, we have had incredible successes. If you have been to a museum and seen pieces of the spacecraft with which we sent men to the moon or keep sending to outer orbits, you can see that the parts are not necessarily sophisticated.
However, the example above should not be used as motivation to be content with the status quo (doing the best we can), as in this case it will backfire in the near future. The returns from the current level of practical usability will soon become far from effective. In addition, the capabilities built will be obsolete or far from suited to emerging needs, and with rapid and iterative progress the information from them will become confusing. Refocusing will then take enormous effort, in addition to the time lost and the need for many resources afresh.
The GIGO Monster is a Formidable Challenger In Extracting Insights from Data Every Step of the Way
GIGO is a major factor limiting the extent of value extraction and the reliability of knowledge extracted from data. GIGO effects arise at every major step of data utilization, including the quality of data generation, collection, cleaning, assembly and exchange, the quality of input for processing and of intermediate outputs from processing steps, and the evaluation approaches. Several failed attempts by Google in the past to apply its immense prowess with data to healthcare-related applications, and IBM's famous Watson, now a failed effort in healthcare, can both be significantly attributed to data quality (GIGO effects). Input data quality also involves deeper factors, such as societal bias embedded in the data, that are difficult to detect or to handle through automation.
Despite the enormous efforts and the incredible sophistication in data and digital solutions that led to this ‘AI wave’, they all surprisingly suffer from classical (i.e., well-known) problems that significantly reduce their knowledge extraction capability and the reliability of their predictions. These include: (i) even specialized solutions often do not perform uniformly across datasets of the same kind from different origins or different times; (ii) accuracy metrics are difficult to use and often misused, as with the known misuse of probability or p-values, which has even spilled over into mass media (see, e.g., the YouTube video in which John Oliver exposes how the media turns scientific studies into “morning show gossip”); and (iii) solutions are often highly specialized for their input data, and buyers are not yet sophisticated enough to evaluate them and recognize the success scenarios and limitations applicable to their use (e.g., effectively testing for the classical case of overfitting). These classes of problem can be considered a specialized manifestation of GIGO, in which the error stems not from errors in the data but from data properties at scale that do not conform to the assumptions used in the processing and insight extraction approaches, from the need for evaluation practices suited to these innovations, and so on. An aspect of this is discussed in the ReSurfX blog post “Overcoming the Curse of Dimensionality with Combinatorics”.
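For readers who want to see what testing for the classical case of overfitting can look like, here is a minimal sketch using only the Python standard library. Everything in it is hypothetical: the labels are pure coin flips, and `MemorizingModel` is a deliberately bad stand-in for a solution that has been tuned too closely to its input data. It is not a description of any vendor's product or of the methods used at ReSurfX.

```python
import random


def make_records(rng, start_id, count):
    """Return (record_id, label) pairs whose labels are pure coin flips."""
    return [(start_id + i, rng.randint(0, 1)) for i in range(count)]


class MemorizingModel:
    """Deliberately overfit 'model': it memorizes every training record."""

    def fit(self, records):
        self.lookup = dict(records)
        ones = sum(label for _, label in records)
        # Fallback for unseen records: the majority training label.
        self.default = 1 if 2 * ones >= len(records) else 0
        return self

    def predict(self, record_id):
        return self.lookup.get(record_id, self.default)


def accuracy(model, records):
    """Fraction of records whose label the model predicts correctly."""
    hits = sum(model.predict(rid) == label for rid, label in records)
    return hits / len(records)


rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_records(rng, 0, 1000)
holdout = make_records(rng, 1000, 1000)  # ids never seen in training

model = MemorizingModel().fit(train)
train_accuracy = accuracy(model, train)
holdout_accuracy = accuracy(model, holdout)
print(f"training accuracy: {train_accuracy:.2f}")
print(f"held-out accuracy: {holdout_accuracy:.2f}")
```

The model scores perfectly on the data it was built from and no better than chance on records it has not seen. A buyer who only sees the first number has no way to recognize the limitation; asking for results on data held back from development is the simplest check.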
Besides the quality of the input data, our thematic GIGO also plays a role in the solutions that utilize that data and feed processed data from each stage of an application to the next. The former relates to input data quality, and the latter to the quality of the processing pipelines that handle problems of data quality in general and other properties that are often specific to or exacerbated in Big Data.
Emphasizing and reinforcing the above, when validating the solutions we build at ReSurfX we find that, even for technologies and applications in which we have invested enormous amounts of brain power and resources, there is often over 30% error in the information derived from data (a form of processed prior knowledge used as truth) that is currently used to train models and develop AI solutions.
Slaying the GIGO Monster for a Better Tomorrow
In the rush to reap the low-hanging fruit, as in our capitalistic pursuits and the proverbial rat race even in the research community, GIGO, the monster of a problem, is often overlooked as an area for innovation, or for extending innovations to tackle the fundamental problems that significantly reduce the value from the AI wave.
One major cause of data quality and other performance-related GIGO problems that limit the ability of digital sophistication often boils down to a simple fact: error properties are unknown in Big Data (i.e., they vary even across a dataset, and in ways that are not predictable). This limits our ability to model errors and, in turn, the scalability of solutions for reliable value extraction. In recent times several efforts have been under way to tackle this challenge. The effects of this data property and of the aforementioned classical problems manifest as the classical GIGO adage. However, problems like inherent bias are far less amenable to such data-driven or computation-based automated approaches, at least for some time to come, and will also require painstaking efforts involving people from multiple specialties.
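To make the point about non-uniform error properties concrete, the sketch below simulates a hypothetical dataset in which the same quantity is measured by two sources with very different noise levels, and compares the per-source error with a single pooled estimate. The noise levels, sample sizes and names are assumptions chosen for illustration; this is not the AHT approach, only a toy picture of why a one-size-fits-all error model can mislead.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical dataset: the same true value measured by two sources whose
# noise levels differ by a factor of 20.
TRUE_VALUE = 10.0
quiet = [TRUE_VALUE + rng.gauss(0.0, 0.1) for _ in range(5000)]
noisy = [TRUE_VALUE + rng.gauss(0.0, 2.0) for _ in range(5000)]


def rms_error(values):
    """Root-mean-square deviation of the measurements from the true value."""
    return math.sqrt(sum((v - TRUE_VALUE) ** 2 for v in values) / len(values))


quiet_rms = rms_error(quiet)
noisy_rms = rms_error(noisy)
# A uniform error model sees only the pooled figure for the whole dataset.
pooled_rms = rms_error(quiet + noisy)

print(f"quiet source RMS error: {quiet_rms:.3f}")
print(f"noisy source RMS error: {noisy_rms:.3f}")
print(f"pooled RMS error:       {pooled_rms:.3f}")
```

The pooled figure overstates the error of the quiet source by more than an order of magnitude and understates that of the noisy one. In this toy case the two sources are labeled; in real Big Data the boundaries are usually unknown, which is what makes the errors hard to model.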
We at ReSurfX posited that “dramatic improvements in accuracy and novel insights can only happen through innovation outside the mainstream framework – given that error properties are often non-uniform in big data, and most analytic shortcomings result from model assumptions not robust enough to handle that” [CIO Review, 2017].
The outcomes intelligence technology company ReSurfX improves innovation and ROI of enterprises through accurate and robust novel insights and advance prediction of the direction of outcomes from their data-intensive activities. ReSurfX does this by leveraging a novel, data-source-agnostic machine learning approach that we developed, the ‘Adaptive Hypersurface Technology’ (AHT), which significantly overcomes GIGO among other problems that affect most ML and AI solutions. We provide functionalities based on AHT through an enterprise SaaS platform, ReSurfX::vysen. The remarkable predictive power of the solution System Response based Triggers and Outcomes Predictor (SyRTOP) in ReSurfX::vysen, which leverages AHT, is evident in terms of accuracy, robustness, novelty of insights and ability to predict outcomes far in advance. We are developing the last of these into an Advance Outcome Alert System. For example, we have shown that SyRTOP can predict, from the system response using a single reporter variable, adverse drug interactions that were identified, and on which recommendations were effected, by the FDA (US Food and Drug Administration) and AHA (American Heart Association) based on post-market surveillance of drugs (i.e., continued monitoring in the large population of users after approval). In addition, ReSurfX::vysen can provide accurate knowledge repositories of proprietary customer data, which can improve other predictors and workflows customers use by supplying highly accurate ReSurfX::vysen-processed data as input. More details on these functionalities and features of the ReSurfX::vysen delivery platform, and on solutions being developed by leveraging AHT, are in “ReSurfX in 2021 – Best-in-class Outcome Predictors, Innovation Catalysts and ROI Multipliers”, in the extended version of this article and in other blog posts on the ReSurfX website.
In summary, we highlighted the remarkable strides in data and digital solutions over the last couple of decades, made by “the Big Data wave that transformed into practically applicable solutions as the AI wave”, which is likely to continue adding value to nearly all facets of society for a long time to come. We also noted that these advances are hampered to a significant and surprising degree by the classical problem of GIGO, indicated how much more effectively these advances could serve commercial and social needs if that problem were tackled, and discussed how businesses can strategize their digital transformation by being educated on it. We outlined some causes of this problem and approaches to overcome it (including the premise of, and a novel approach taken by, ReSurfX) to maximize the value of our data and digital assets through the enterprise SaaS product we continue to develop and expand at ReSurfX.
This post was written by Suresh Gopalan, Ph.D; CEO & Cofounder; ReSurfX.