In the age of big data, and particularly in fields such as artificial intelligence, biology, and medicine, researchers often generate large and complex datasets that are challenging to analyse. This is particularly true for multi-view data, otherwise known as multimodal data, which encompass multiple perspectives on a single entity or phenomenon. In single-cell genomics, for instance, researchers can measure a huge range of characteristics of an individual cell, such as RNA expression levels or protein levels. While multi-view datasets provide vast amounts of information, they are difficult to analyse because each type of data within them provides only a small part of the overall picture. A new computational approach called Covering Point Set-merge analysis, or CPS-merge analysis for short, has been developed by Lixiang Zhang of Pennsylvania State University and colleagues. It aims to help researchers merge the different types of data present in multi-view datasets into one coherent and meaningful set of results, without misrepresenting the individual contributions of each type of data.
Imagine that you’re a doctor who is trying to understand a patient’s illness by performing a battery of medical tests, such as blood tests, medical imaging, and genetic tests. The results from each test will provide different pieces of information on the potential cause of the patient’s illness, but no one test alone is likely to provide the full picture. Instead, the doctor must devise a hypothesis about the patient’s illness based on all the test results taken together. In this example, the entire collection of the test results is a multi-view dataset, with each test result representing one ‘view’ of the patient’s illness.
In a similar fashion, researchers studying a single cell can capture different types of data to provide multiple views of that cell’s biological properties or behaviour. For instance, researchers can obtain data on RNA levels, which tell us which genes are being actively expressed in a cell. Data on the levels of various proteins can provide information on a cell’s function, and are complementary to, but distinct from, RNA data, providing an additional layer of information on gene expression.
Analysing just one ‘view’ in such multi-view datasets can miss important details and context that can be crucial to the overall story. For instance, a gene might be highly expressed in the form of RNA, but this RNA molecule could be routinely destroyed before it is translated into protein form. However, a researcher might never know this, unless they also look at protein levels in the same cell.
One way to analyse multi-view datasets involves data clustering, which essentially means grouping data points into clusters based on their levels of similarity. For instance, if you wanted to group a set of people into clusters, you could choose a variable to dictate those clusters, such as heights or ages.
In a similar fashion, scientists studying cell biology can cluster cells that have similar properties. This may help them to identify cells of a certain type or cells that are likely to demonstrate a certain behaviour. However, in multi-view data, clustering can be more difficult. For instance, two cells could share near-identical levels of a certain RNA transcript, but their levels of a certain protein could differ wildly. How, then, does a researcher cluster the data to answer their research questions?
This is where the CPS-merge technique developed by Zhang and colleagues can play a role. Traditionally, multi-view data have been clustered in one of two ways. Early integration pools all the available data into one enormous dataset and then performs clustering on it. Late integration, by contrast, clusters each individual dataset in isolation and then aggregates the results afterwards.
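The difference between the two strategies can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the four toy cells, their RNA and protein values, and the `two_means` helper, a deliberately simple two-cluster k-means that stands in for whatever clustering method a real analysis would use.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a list of feature vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def two_means(points, n_iter=10):
    """Toy 2-cluster k-means (hypothetical helper, not part of CPS-merge).

    Starts deterministically from the first point and the point farthest from it.
    """
    centroids = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        labels = [min((0, 1), key=lambda k: sq_dist(p, centroids[k])) for p in points]
        # Move each centroid to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = mean(members)
    return labels

# Made-up measurements for four cells: two RNA features and one protein feature.
rna = [[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]]
protein = [[0.2], [3.0], [0.1], [3.1]]

# Early integration: concatenate the views, then cluster once.
early = two_means([r + p for r, p in zip(rna, protein)])

# Late integration: cluster each view separately, then combine the results.
late = [two_means(rna), two_means(protein)]

print("early:", early)
print("late: ", late)
```

In this toy case the pooled clustering follows the RNA grouping and the protein grouping disappears from the result, whereas the late approach keeps both groupings but still leaves the question of how to reconcile them.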
Existing methods in these categories have their drawbacks, including overlooking unique characteristics of specific views. For example, they might fail to identify complementary effects, where just one data view shows something unique that is not shown by any others. They can also potentially miss instances where all the possible data views indicate that a certain clustering approach is warranted – this is known as the consensus effect.
The CPS-merge method, however, aims to account for both the consensus and complementary information within the multi-view data without combining them into a single dataset. The first step in this approach is to create Cartesian product clusters. In essence, this means that CPS-merge considers every possible combination of clusters across the different ‘views’ of the data. As an example, if clustering a cell dataset by RNA levels produced three clusters and clustering by protein levels produced four, then the Cartesian product (3 multiplied by 4) would give 12 possible combined clusters.
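As a rough sketch of this first step, the combined clusters can be formed by pairing each cell's label from one view with its label from the other. The label lists below are invented for illustration, and `product_clusters` is a hypothetical helper name rather than anything taken from the CPS-merge software.

```python
from collections import Counter

# Hypothetical cluster labels for six cells, one list per view.
rna_labels = [0, 0, 1, 1, 2, 2]       # three RNA clusters
protein_labels = [0, 1, 1, 1, 3, 2]   # four protein clusters

def product_clusters(*views):
    """Pair up each cell's labels across views; each distinct tuple is one product cluster."""
    return list(zip(*views))

combined = product_clusters(rna_labels, protein_labels)
sizes = Counter(combined)

# Upper bound on the number of product clusters: 3 * 4 = 12.
n_possible = len(set(rna_labels)) * len(set(protein_labels))

print("possible combinations:", n_possible)
print("combinations that actually contain cells:", len(sizes))
for cluster, n_cells in sorted(sizes.items()):
    print(cluster, n_cells)
```

Note that the product gives only an upper bound: in this toy data 12 combinations are possible, but only five of them contain any cells, and most of those hold a single cell.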
The CPS-merge method then assesses these combined clusters to determine which are biologically meaningful. This involves using mathematical techniques to assess each cluster for “stability” and determine its similarity to other clusters for potential merging, allowing the researchers to weed out less reliable clusters, and helping them to decide which clusters should make it into the final batch for analysis.
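The article does not spell out how stability or similarity is computed, so the following is only a loose illustration of the general idea of merging, not the authors' algorithm. It assumes a crude proxy: a combined cluster with fewer than `min_size` cells is treated as unreliable and folded into the larger cluster that shares the most per-view labels with it. The function name, the threshold, and the data are all hypothetical.

```python
from collections import Counter

def merge_small_clusters(combined, min_size=2):
    """Greedy stand-in for a merge step (an assumption, not the CPS-merge procedure).

    Each product cluster with fewer than min_size cells is folded into the
    large cluster sharing the most per-view labels; ties go to the bigger cluster.
    """
    sizes = Counter(combined)
    large = [c for c, n in sizes.items() if n >= min_size]
    if not large:
        return list(combined)  # nothing to merge into

    def nearest(cluster):
        return max(
            large,
            key=lambda g: (sum(a == b for a, b in zip(cluster, g)), sizes[g]),
        )

    mapping = {c: (c if sizes[c] >= min_size else nearest(c)) for c in sizes}
    return [mapping[c] for c in combined]

# Hypothetical (RNA label, protein label) pairs for seven cells.
combined = [(0, 0), (0, 0), (0, 2), (1, 1), (1, 1), (1, 1), (2, 1)]
merged = merge_small_clusters(combined)
print(merged)
```

Here the two single-cell clusters are absorbed by the neighbours they partly agree with, leaving two final clusters. A real analysis would replace cluster size with a proper stability measure.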
Zhang and colleagues put the CPS-merge technique through its paces using single-cell RNA and protein datasets generated through real-world research efforts. For instance, the researchers assessed the CITE-seq Human Bone Marrow Cells dataset, which includes both RNA and protein level data generated from 30,672 human bone marrow cells. Data clustering is essential for identifying specific cell types, such as immune cells, among this huge group of cells. Excitingly, the CPS-merge technique outperformed other established data analysis techniques, and helped to identify CD14 monocytes, a specific type of white blood cell that could not be adequately identified by clustering RNA data alone. These types of results indicate the need for sophisticated data analysis tools such as CPS-merge that can handle large and complex datasets.
CPS-merge can allow researchers to understand important nuances in their multi-view datasets by assessing the contribution of each data view to the final clustered results. For instance, finding answers to questions such as “Did RNA or protein data play the bigger role in helping us to identify a certain type of cell?” could assist researchers in designing more informative and precise experiments in their future research.
As science advances, and researchers continue to create increasingly complex datasets, sophisticated methods such as CPS-merge will allow them to chart a path through their data and make sense of it. Integrating multiple data views, while transparently accounting for the unique weight and contribution of each to the overall experimental result, could help researchers to unlock new discoveries and fully exploit the rich seam of multi-view data that modern research methods can readily generate.