Big and rich data as fuel for the AI engine

Philips
Philips Technology Blog
6 min read · Dec 13, 2023


Author: Tim Hulsen, Philips Senior Data and AI Scientist

High-quality data to feed the AI engine (Source: Imagine AI)

Introduction

Artificial Intelligence (AI) is currently very influential in many industries, such as (social) media and gaming, finance, and marketing, and has recently been expanding into medicine and healthcare as well [1, 2]. One of the more recent developments is the rise of Generative AI (GenAI), a type of AI that can create a wide variety of data, such as images, videos, audio, text, and 3D models. It does this by learning patterns from existing data and then using this knowledge to generate new and unique outputs [3]. GenAI is very useful in healthcare as well: it can generate radiology and pathology reports, answer patients’ questions, create synthetic data when ‘real’ data has privacy restrictions, and more. However, AI, and GenAI in particular, needs a lot of data to feed its algorithms. Gartner predicted ‘data-centric AI’ to be one of the top five trends for 2024 [4]. With data becoming more abundant, organizations should shift their focus from ‘big’ data to ‘rich’ data to make sure that their AI systems have the quality they desire. At Philips, we work on obtaining rich data to help us improve healthcare, in collaboration with hospitals, universities, and other partners.

Data-driven vs. hypothesis-driven research

With the rise of big data and AI, is old-fashioned hypothesis-driven research still needed? Some people might argue that it is not, but they forget that data is only useful when it is collected in the right way to answer a specific research question. Just collecting data for the sake of collecting data does not make any sense and will only flood the AI algorithms with a tsunami of useless data. Furthermore, data-driven research can generate hypotheses, which, in turn, need to be tested using traditional methods. So, hypothesis-generating approaches are not only synergistic with traditional methods but depend upon them [5]. The challenge lies in convincing both sides of this discussion that data-driven and hypothesis-driven research are both needed to make progress in healthcare research.

Data quality

Many datasets suffer from quality issues: missing values, absent annotations, no metadata, no protocol describing how the data was collected, and so on. This causes the old ‘Garbage In, Garbage Out’ (GIGO) effect [6]: AI algorithms do not give any useful results when the input data suffers from these kinds of quality issues. Now that computer algorithms can handle much more data than before, data quality becomes even more of an issue: with higher data quantities, it becomes more difficult to check everything. Luckily, many data scientists recognize this issue and spend considerable time on data cleansing, data annotation, and similar tasks. However, the best way to resolve this issue is by ensuring data quality at the earliest phase, when capturing and collecting the data. These measures should be written down in a Data Management Plan (DMP), so that everyone participating in the project knows how the data was collected, processed, and so on.
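The earlier such checks run in the pipeline, the cheaper the fixes. As a minimal sketch of what an automated first-pass quality check could look like (the field names and records below are illustrative, not a Philips data model):

```python
from collections import Counter

# Hypothetical patient records as they might arrive from a study site;
# field names and values are illustrative only.
records = [
    {"patient_id": "P001", "age": 64, "diagnosis": "NSCLC"},
    {"patient_id": "P002", "age": None, "diagnosis": "NSCLC"},
    {"patient_id": "P003", "age": 58, "diagnosis": None},
    {"patient_id": "P003", "age": 58, "diagnosis": None},  # duplicate row
]

def quality_report(records, required_fields):
    """Count missing values per required field and fully duplicated records."""
    missing = Counter()
    for rec in records:
        for field in required_fields:
            if rec.get(field) is None:
                missing[field] += 1
    seen, duplicates = set(), 0
    for rec in records:
        key = tuple(sorted(rec.items(), key=lambda kv: kv[0]))
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {"missing": dict(missing), "duplicates": duplicates}

report = quality_report(records, ["patient_id", "age", "diagnosis"])
print(report)  # {'missing': {'age': 1, 'diagnosis': 2}, 'duplicates': 1}
```

A report like this, produced at ingestion time and logged against the DMP, makes quality problems visible before they reach the AI algorithms.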

Data management and stewardship

Data management and data stewardship allow researchers to store and analyze their ‘big data’ in a meaningful way and enable application in the clinic. I have previously written about the ten commandments [7] of translational research informatics, based on long experience with translational research projects in oncology. These commandments are useful not only for data managers, but for everyone involved in a research project. One of the commandments deals with the FAIR Guiding Principles for scientific data management and stewardship, published in 2016 [8]. FAIR stands for ‘Findability, Accessibility, Interoperability, and Reusability’: four foundational principles that guide data producers and publishers as they navigate the obstacles of data management and stewardship. There are also commandments about the need for upfront data model definition, agreeing on de-identification and anonymization, the reuse of software, etc. Perhaps the most important point is the proposed recognition of data management and data stewardship as part of the scientific process, including the allocation of funding specifically for these purposes.
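One way to make such principles operational is to check every dataset’s metadata against FAIR-relevant fields before it is accepted into a project. The sketch below illustrates the idea; the field names are assumptions for illustration, not an official FAIR checklist:

```python
# Illustrative FAIR-inspired metadata check; the field names are
# assumptions chosen for this sketch, not a formal FAIR validator.
REQUIRED_METADATA = {
    "identifier",   # Findability: a persistent, unique ID (e.g. a DOI)
    "description",  # Findability: a rich, searchable description
    "access_url",   # Accessibility: where and how the data can be retrieved
    "format",       # Interoperability: a standard, open format
    "license",      # Reusability: clear terms of reuse
    "provenance",   # Reusability: how the data was collected and processed
}

def fair_gaps(metadata: dict) -> set:
    """Return the FAIR-relevant fields that are missing or empty."""
    return {f for f in REQUIRED_METADATA if not metadata.get(f)}

dataset = {
    "identifier": "doi:10.1234/example",
    "description": "De-identified oncology imaging cohort",
    "format": "DICOM",
}
print(sorted(fair_gaps(dataset)))  # ['access_url', 'license', 'provenance']
```

Wiring a check like this into the data-intake step turns the DMP from a document into something the project enforces automatically.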

Data sharing

Big healthcare data come not only from professional health systems (such as MR or CT scanners), but also from wearable devices (such as smartwatches). All these data put together can be used to optimize treatments for each individual patient (‘precision medicine’) [5]. For this to be possible, hospitals, academia, and industry must work together to bridge the ‘valley of death’ in which many inventions fail to reach the clinic. However, hospitals and academia are often reluctant to share their data with other parties because of ownership and intellectual property issues, patient privacy concerns, and publication embargoes. Sometimes data is only shared after study (and publication) completion, which can mean a delay of months or even years before other researchers can analyze the data. One solution is to incentivize hospitals to share their data with (other) academic institutes and industry [9]. If patient privacy is the main issue, a federated approach might be a solution: the data itself is not shared with other parties, but the AI algorithm ‘travels’ to the data and sends only the results back to a central server. An example of a federated data system is the Personal Health Train (PHT) [10], an initiative that is rapidly gaining traction.
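The core of the federated idea can be shown in a few lines. The sketch below, using toy values and made-up site names (it is not the actual PHT implementation), computes a mean across hospitals while only per-site aggregates, never the raw records, reach the central server:

```python
# Minimal sketch of federated analysis: each site computes a local
# summary, and only those summaries (never the raw records) leave the
# hospital. Site names and values are illustrative.
local_data = {
    "hospital_a": [120, 135, 128],           # e.g. systolic blood pressure
    "hospital_b": [110, 150, 142, 138],
    "hospital_c": [125, 131],
}

def local_update(values):
    """Runs at the data site; only the aggregate leaves the hospital."""
    return sum(values), len(values)

def federated_mean(sites):
    """Runs at the central server; combines the per-site aggregates."""
    total, count = 0, 0
    for values in sites.values():
        s, n = local_update(values)  # in practice this call crosses the network
        total += s
        count += n
    return total / count

print(federated_mean(local_data))  # 131.0, same as the mean over pooled data
```

Real federated systems add authentication, encryption, and full model training on top, but the privacy property is the same: the algorithm travels, the data stays put.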

From big data to rich data?

With computing power still growing according to Moore’s law [11], it might be tempting to keep collecting all data that might be useful at some point. However, data growth currently exceeds the growth of computing power [12], which will cause problems in the future: how can we guarantee high-quality data at ever larger scales? Furthermore, running AI algorithms on these big data raises sustainability issues, as supercomputers consume a lot of energy. Popular chatbots such as ChatGPT, which run on Large Language Models (LLMs), are especially computation- and data-intensive. In fact, ChatGPT may use between 0.0017 and 0.0026 kWh of electricity to answer a single query [13], while training the underlying model can consume up to 10 gigawatt-hours (GWh) of energy [14]. Instead of big data, we should look at well-annotated and FAIR ‘rich’ data as the fuel for AI. Data quality, not data quantity, should be considered most important.
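A quick back-of-the-envelope calculation with the figures above shows how inference energy accumulates relative to training; the daily query volume is an assumption chosen purely for illustration:

```python
# Back-of-the-envelope check of the energy figures cited above.
kwh_per_query = (0.0017, 0.0026)   # per-query range from [13]
training_kwh = 10e6                # 10 GWh from [14], expressed in kWh
queries_per_day = 10_000_000       # assumed daily query volume (illustrative)

daily_kwh = [round(rate * queries_per_day) for rate in kwh_per_query]
print(daily_kwh)  # [17000, 26000] kWh per day

# Days of inference at the higher per-query rate to match training energy:
print(round(training_kwh / daily_kwh[1]))  # 385
```

Under these assumptions, roughly a year of inference matches the one-off training cost, which is why both training and serving such models weigh on sustainability.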

Conclusion

With the current AI hype in full swing, we need to be just as passionate about the data that fuels the AI engine. At Philips, we consider data to be just as important as AI, which can be seen in the naming of our communities, departments, and job titles: we have a Data & AI Community of Practice, a Data Science & AI Engineering department, and Data & AI Scientists. We have defined data steward and data custodian roles to properly manage datasets and to make sure that data access, data sharing, and data quality are taken into account. Together with a large community of data scientists, we should be able to overcome the obstacles described above and create solutions that improve healthcare.

Curious about working in tech at Philips? Find out more here

References

1. Hulsen T. Literature analysis of artificial intelligence in biomedicine. Ann Transl Med. 2022;10(23):1284.

2. Hulsen T, Friedecky D, Renz H, Melis E, Vermeersch P, Fernandez-Calle P. From big data to better patient outcomes. Clin Chem Lab Med. 2023;61(4):580–6.

3. Generative AI. All Things Generative AI. 2023.

4. Gartner. Gartner Identifies Top Trends Shaping the Future of Data Science and Machine Learning. 2023.

5. Hulsen T, Jamuar SS, Moody AR, Karnes JH, Varga O, Hedensted S, et al. From Big Data to Precision Medicine. Front Med (Lausanne). 2019;6:34.

6. Kilkenny MF, Robinson KM. Data quality: “Garbage in, garbage out”. Health Information Management Journal. 2018;47(3):103–5.

7. Hulsen T. The ten commandments of translational research informatics. Data Science. 2019;2:341–52.

8. Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018.

9. Hulsen T. Sharing Is Caring: Data Sharing Initiatives in Healthcare. Int J Environ Res Public Health. 2020;17(9).

10. Beyan O, Choudhury A, van Soest J, Kohlbacher O, Zimmermann L, Stenzhorn H, et al. Distributed Analytics on Sensitive Medical Data: The Personal Health Train. Data Intelligence. 2020;2(1–2):96–107.

11. Moore GE. Cramming More Components Onto Integrated Circuits. Proceedings of the IEEE. 1998;86(1):82–5.

12. Stewart M. The Future of Computation for Machine Learning and Data Science. Towards Data Science. 2019.

13. Ludvigsen KGA. ChatGPT’s energy use per query. Towards Data Science. 2023.

14. McQuate S. Q&A: UW researcher discusses just how much energy ChatGPT uses. UW News. 2023.

