| dc.description.abstract |
Malignant Pleural Mesothelioma (MPM) is a rare and aggressive cancer that is strongly associated
with asbestos exposure. Its severity has led to growing research interest in finding effective solutions.
In recent years, computational methods and machine learning approaches have been increasingly applied
in oncology to classify tumor and normal samples using transcriptomic data. However, such models typically
require large and balanced datasets to achieve robust performance, which are not available for rare
cancers like MPM due to the very limited number of patients and under-representation of normal samples.
This data scarcity poses a significant challenge in building predictive models that are reliable and generalizable.
To address this limitation, we employ computational analysis with data augmentation as a strategy
to increase the effective sample size. Specifically, we evaluate two deep generative models, Generative Adversarial
Networks (GANs) and Variational Autoencoders (VAEs), to generate synthetic tumor and normal
samples. Importantly, synthetic samples were used strictly in the training process, while test sets contained
only real data, ensuring no data leakage during evaluation. To validate the augmentation strategy, a comparative
evaluation framework was introduced using both the naturally imbalanced MPM dataset and an
originally balanced breast cancer dataset, which is further manipulated to simulate imbalance, resulting in
four experimental conditions: original balanced data, artificially imbalanced data, GAN-augmented data,
and VAE-augmented data. Classification is performed using Support Vector Machines (SVM) and Random
Forests (RF), and model performance is assessed through accuracy, F1 score, precision, recall, and ROC
AUC. In addition, Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding
(t-SNE) are applied to visually examine the quality and separability of synthetic data. The results show that
GAN-based augmentation consistently improves classification performance more than VAE-based augmentation,
particularly under imbalanced conditions. For instance, in the imbalanced breast cancer setting, GAN
improved SVM accuracy by 5.6% and recall by 7.1% compared to the baseline without augmentation. In
MPM, performance gains were smaller due to high baseline separability, indicating a ceiling effect. Overall,
GAN achieved a mean performance score of 0.9247, compared to 0.9081 for VAE. This study presents a reproducible
computational pipeline for benchmarking generative models in transcriptomics and demonstrates
that augmentation can effectively mitigate class imbalance in cancer prediction, while highlighting the importance
of dataset-specific characteristics. The findings also motivate further research into hybrid generative
architectures and biologically grounded validation strategies in precision oncology. |
en_US |