Algorithm

Objective

The MAIC algorithm seeks to combine the information in a heterogeneous group of data sources, in the form of lists of genes implicated in similar processes. It creates a data-driven information weighting for each source to prioritise relevant information, and allows the systematic integration of both ranked and unranked gene lists.

Method

In a superset $A$ of $m$ input sets ${ L_{1},L_{2},L_{3},…L_{m}}$, such as experimental data sources, each input set contains $n$ named entities ${ e_{1},e_{2},e_{3},…e_{n}}$, such as genes. Each input set belongs to a particular type of data source, which may have its own hidden biases. For example, siRNA affects some genes more than others, and some proteins have a tendency to be highly-connected in protein-protein interaction networks. Hence each input set is assigned to one of $k$ categories, ${ C_{1},C_{2},C_{3},…C_{k}}$. The algorithm begins with the assumption that a set of true positives, $T$, exists, and that, for any entity $e$, membership of several data sets $L$ belonging to independent categories $C$ increases the probability that $e$ is a member of $T$. Each one of the data sets $L_{j},j = 1,…,M$, has 3 attributes:

set of $n_{j}$ entities $L_{j} = { e_{j1},e_{j2},e_{j3},…e_{jn_{j}}}$
a category, $c_{j} \in { C_{k}}$
a structure, $r_{j} \in { R,F}$ where $R$ is ranked and $F$ is flat, or not ranked.

A score value for each one of the genes in $A$ which is based on the “popularity” of the genes in the input datasets. A gene will get a higher score for being represented in many different categories, as compared to being represented in many different datasets at the same category.

Each input set $L$ is assigned a weighting score $w$ to quantify the evidence in $e$ that derives from membership of $L$. The weighting score $w$ is itself defined as the sum of the scores assigned to each entity $e$ within $L$. The starting value of $s$ for any $e$ is arbitrary - any numerical value can be chosen, without altering the final scores. For simplicity, the initial $s$ for each $e$ is set to

In order to prevent any single category ($C$) of data from biasing the results, each entity draws only one score (the highest score) from each category. If there is no score for this $e$ in a particular $C$, the score assigned will be zero. In each iteration $i$, the score of an entity in a given category is:

\[\begin{matrix} s_{\text{ik}} = max\{ S_{j}^{L}│g_{i} \in L_{j} \land c_{j} = C_{k}\}\\ i = 1,...,n;k = 1,...,K \\ \end{matrix}\]

The score of an entity for this iteration is the sum of the scores in each one of the categories.

\[\begin{matrix} s_{i} = \sum_{k = 1}^{K}s_{\text{ik}},i = 1,...,n \\ \end{matrix}\]

The weighting score given to a dataset is the square root of the average score of the genes belongs to this dataset.

\[\begin{matrix} w_{j}^{L} = \sqrt{\frac{\sum g_{i} \in L_{j}s_{i}}{n_{j}}} \\ \end{matrix}\]

These equations are iterated until the values for $w$ for all input sets $L$ are no longer changing (ie. each value for $w_{j}^{L}$ is within 0.01 of the previous value.)

Ranked lists

Some input data sources provide gene lists that are ranked according to the strength or statistical significance of experimental results. With descending rank, the probability that a given gene is a true positive result is expected to decrease. This decline in information content is modelled by fitting a k-nearest neighbours (kNN) regression model to the measured information content at each position in a ranked list. The information content is inferred from the MAIC algorithm by truncating the list at every position and calculating a weighting ($w_{j}^{L}$) for the list up to that point, as if it were unranked. A specific weighting for each position in a ranked list is then calculated from the kNN regression function specific to this list.

Although MAIC is designed to require minimal paramaterisation in order to provide generalizability across different data structures and domains, it is necessary to choose some optimal values and procedures. In a series of optimization experiment using synthetic ranked data with different patters of decay in information content, a value of k=3 with the inclusion of an inverse distance-based kernel in the kNN regression, and a final scaling factor of 2 for MAIC output scores were found to provide the best recapitulation of a known input set.

(Description of the algorithm written by K Baillie and Bo Wang, 2021)