Example 1:

The Cell Cycle Data Set

We apply Martini to the dataset of Whitfield et al., 2002, which contains 600 genes shown to be strongly differentially expressed at different points in the human cell cycle. For the analysis, the cell cycle was divided into 100 points, following the approach of Jensen et al., 2006; each gene was mapped to one peak point, i.e. to the point in the cell cycle at which its expression level is highest.
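As an illustration only, the peak-point mapping can be sketched in a few lines of Python. This is not the procedure of Jensen et al., 2006. It assumes that each gene has an expression profile sampled at known fractions of the cell cycle (values in the range 0 to 1), and the function name and arguments are hypothetical.

```python
def peak_point(phases, levels, n_points=100):
    """Return the cell cycle point (0 .. n_points-1) of maximum expression.

    phases   : sampling positions as fractions of the cell cycle, in [0, 1)
               (assumed input format, not taken from the original analysis)
    levels   : expression level measured at each sampling position
    n_points : number of points the cell cycle is divided into
    """
    if not phases or len(phases) != len(levels):
        raise ValueError("phases and levels must be non-empty and equally long")
    # Index of the sample with the highest expression (first one on ties).
    best = max(range(len(levels)), key=lambda i: levels[i])
    # Convert the fraction of the cycle into a discrete point index.
    return int(phases[best] * n_points) % n_points


# Hypothetical profile: expression peaks a quarter of the way through the cycle.
point = peak_point([0.0, 0.25, 0.5, 0.75], [1.0, 3.0, 2.0, 0.5])
```

With 100 points, each point covers 1% of the cell cycle, so a gene peaking a quarter of the way through the cycle is mapped to point 25.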

Keyword Enhancement of the Cell Cycle:

The genes were divided into Sets A and B by sliding a window of 10 cell cycle points in steps of 1. For example, at the first step, Set A consists of the genes mapped to cell cycle points 0 to 9, and the remaining genes (mapped to cell cycle points 10 to 99) are assigned to Set B. The Keyword Enhancement Analysis takes about 1.5 hours per window (20508 abstracts).
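The window partition described above can be sketched as follows. This is a minimal illustration, not Martini's own code: the function name, the gene identifiers, and the input dictionary are hypothetical, and whether the window wraps around the end of the cycle is an assumption exposed through the wrap flag.

```python
def sliding_window_sets(peak_by_gene, n_points=100, width=10, step=1, wrap=False):
    """Yield (start, set_a, set_b) for each position of the sliding window.

    peak_by_gene : dict mapping a gene identifier to its peak point (0 .. n_points-1)
    wrap         : if True, windows continue past the last point and wrap to 0
                   (an assumption; the text does not state how the end is handled)
    """
    genes = set(peak_by_gene)
    # Without wrapping, the last window must still fit inside 0 .. n_points-1.
    last_start = n_points if wrap else n_points - width + 1
    for start in range(0, last_start, step):
        window = {(start + k) % n_points for k in range(width)}
        set_a = {g for g, p in peak_by_gene.items() if p in window}
        # Set B is every gene whose peak point lies outside the window.
        yield start, set_a, genes - set_a


# Hypothetical genes and peak points, for illustration only.
peaks = {"geneA": 0, "geneB": 9, "geneC": 10, "geneD": 99}
windows = list(sliding_window_sets(peaks))
first_start, first_a, first_b = windows[0]
```

At the first step the window covers points 0 to 9, so Set A holds the genes peaking at points 0 and 9 and Set B holds the rest. Each (Set A, Set B) pair would then be passed to the keyword enhancement analysis.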


The results (Figure 1, below) show a strikingly accurate and precise match between the keywords found and the entities and events known to occur at different stages of the cell cycle.

The keywords are clustered into three distinct groups:

(1) a pre-replication phase, defined by keywords that describe the initiation of DNA replication and the checkpoints that can prevent initiation from taking place;

(2) S-phase, defined by keywords that describe the proteins, complexes, and processes associated with the replication machinery;

(3) M-phase, which has no keywords for proteins or complexes but has keywords that describe the cell division processes.

In the G0 and G1 phases, no enhanced keywords are seen, even though large numbers of genes occur at several points in these phases. This matches the generally accepted belief that, during these 'resting' phases, relatively few large-scale, coordinated processes occur.

Of the 70+ total keywords found by Martini, perhaps 5 could be considered uninformative, e.g. 'extractable' and '874 amino acids'. Such cases can arise, for example, because several genes link to the same abstract that mentions such a keyword. None of these uninformative keywords suggests a misleading or wrong biological process, and we expect most users of Martini would easily recognize and ignore them; thus we do not consider them to be false positives.

In addition, we observe that Martini can identify literature trends related to a topic. For example, the results show that for the M-phase the research (and thus the literature) is more focused on the process itself than on the functions and/or interactions of the genes and proteins involved. For the S-phase, the opposite is true.


We repeated the corresponding analysis with the same dataset using CoPub and ProfCom.

With CoPub we find a large number of keywords that are clearly false positives, e.g. the keyword 'M-phase' is found for genes in the S-phase. Sample results for the Cell Cycle data using CoPub:
Window centered around the Cell Cycle time-point 15 (no terms found)
Window centered around the Cell Cycle time-point 45
Window centered around the Cell Cycle time-point 55
Window centered around the Cell Cycle time-point 57
Window centered around the Cell Cycle time-point 60
Window centered around the Cell Cycle time-point 75
Window centered around the Cell Cycle time-point 85
Window centered around the Cell Cycle time-point 99

With ProfCom, we find zero GO terms occurring in the M-phase, the pre-replication phase, or the S-phase.

Figure 1: Enhanced keywords found by Martini mapped on the cell cycle.
Using 600 human cell-cycle genes assigned to specific time points within the cell cycle, a sliding window spanning 10% of the cell cycle was moved in steps of 1%, comparing genes in the window to those outside. The figure shows the position and duration of all enhanced keywords, with 12 o'clock representing the moment of cell division. The left image shows keywords that describe biological processes and functions; these keywords cluster into three distinct phases: M-phase, S-phase, and a pre-replication phase. The right image shows a close-up of the pre-replication and S-phase regions; all keywords that specify genes, proteins, or complexes occur only in this region.