This project is a work in progress, but results so far are promising. For the math behind the model and preliminary results, please see slides from a presentation I gave at UCLA.

Overview

I created ULTRA as a way to automatically discover interpretable patterns in single-cell RNA sequencing (scRNA-seq) data that explain an external phenotype. Roughly, the problem setup is that there are a set of samples of cells, where each sample has some label, which describes some external phenotype. For example, we may have a group of patients with and without a disease and we extract cells from each patient. The label in this case may be whether or not the patient has the disease. We then measure the RNA expression of each cell in each sample. The question is then: what patterns in the RNA expression data are associated with the label?

This is, in some sense, a regression problem. However, predicting the label accurately is not our terminal goal; our terminal goal is to learn about mechanisms that can actually be used to intervene on the external phenotype (e.g. cure disease). In other words, it is important that we can interpret the regression model.

I discovered that a very natural way to solve this problem is to use a form of cell attention to pick out cell types for which the expression of some gene(s) is maximally covariant with the response variable. One can then examine which cells are being picked out, and which genes are differentially expressed in those cells.

In principle, ULTRA automatically answers a very typical type of question that researchers ask in scRNA-seq, which is something like “which transcriptional programs in which subsets of cells are responsible for this disease our patient outcome?”

Tensor methods

ULTRA is (kind of) a tensor decomposition method. Tensor methods have been quite successful for extracting interpretable and explanatory patterns in high-dimensional data in other domains. However, most tensor methods cannot be applied to scRNA-seq because scRNA-seq cannot naturally be arranged into a tabular tensor form. Each cell in each sample is different, and its identity is unknown, and so you cannot arrange cells into common axes across conditions.

While existing tensor decomposition methods allow the reduced-rank regression of regular (i.e. aligned) tensors (e.g. CP-PLSR) and there exist unsupervised tensor decomposition methods for unaligned tensors (e.g. PARAFAC2, which our lab also applied to scRNA-seq), there was previously no way to perform reduced-rank regression on unaligned tensors, which is the structure of scRNA-seq data. ULTRA essentially aims to fill this gap.