Example 2:

The Arabidopsis thaliana Data Set

The Arabidopsis data set used for the Keyword Enhancement Analysis example in "Martini: using literature keywords to compare gene sets" can be found here:
- Set A: 269 Arabidopsis genes known to be associated with disease resistance mechanisms and
- Set B: 514 genes with no clear link to disease, randomly selected from the Arabidopsis genome.
A snapshot of the results of this analysis is displayed in Figure 1 of the manuscript.

Evaluation Data Set 1:

Set A is a manually created set of 90 interesting abstracts (selected abstracts that refer to pathogen resistance of Arabidopsis thaliana, abbreviated AT). Set B is a set of 90 uninteresting abstracts (randomly selected abstracts that refer to AT but are not related to such mechanisms). Keyword Enhancement Analysis was performed on these two sets; for Evaluation Data Set 1 it finishes in about five seconds.
We observe that the two sets of abstracts are indeed distinct: each set has keywords that are significantly enhanced compared to the other.

Moreover, the results are descriptive of the content of the selected abstracts:

- Set A: the most important significant terms belong to this set and are very specific, like the manually chosen abstracts of Set A, e.g. salicylic acid (a phytohormone that plants use to trigger the defense signaling pathway), resistance, disease, pathogen, defense, etc.

- Set B: the significant terms of this set are less important and belong to a generic plant context, like the randomly selected abstracts of Set B, e.g. development, flower, etc.
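
The idea of a keyword being "significantly enhanced" in one set of abstracts relative to the other can be illustrated with a small sketch. The statistic Martini itself uses is not stated here, so the one-sided hypergeometric test below is an assumption chosen for illustration, and the function names, the `alpha` threshold, and the toy abstracts are all hypothetical. No multiple-testing correction is applied.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K carry the keyword."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

def enhanced_keywords(set_a, set_b, alpha=0.05):
    """set_a, set_b: lists of keyword sets, one per abstract.
    Returns {keyword: (set_label, p_value)} for keywords with p <= alpha.
    The test and threshold are illustrative assumptions, not Martini's method."""
    n_a, n_b = len(set_a), len(set_b)
    total = n_a + n_b
    vocabulary = set().union(*set_a, *set_b)
    results = {}
    for kw in vocabulary:
        in_a = sum(kw in doc for doc in set_a)  # abstracts of Set A with kw
        in_b = sum(kw in doc for doc in set_b)  # abstracts of Set B with kw
        p_a = hypergeom_sf(in_a, total, in_a + in_b, n_a)
        p_b = hypergeom_sf(in_b, total, in_a + in_b, n_b)
        if p_a <= alpha:
            results[kw] = ("A", p_a)
        elif p_b <= alpha:
            results[kw] = ("B", p_b)
    return results

# Toy abstracts (invented): six per set, each reduced to its keywords.
toy_a = [{"plant", "pathogen"}] * 5 + [{"plant"}]
toy_b = [{"plant", "flower"}] * 5 + [{"plant"}]
results = enhanced_keywords(toy_a, toy_b)
```

In this toy input, "pathogen" is reported for Set A and "flower" for Set B, while "plant", which occurs in every abstract, is not reported for either set.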

Evaluation Data Set 2:

We retrieved from AKS2 the genes mentioned in the selected abstracts of Evaluation Data Set 1. We mapped them to their Arabidopsis thaliana Entrez Gene IDs (gene Set A and Set B, corresponding to the abstracts of Set A and Set B respectively) and then used them with Martini.
Again, in the Keyword Enhancement Analysis of the Arabidopsis thaliana genes derived from Evaluation Data Set 1, all keywords found for Set A accurately match the known biological functions of this set, i.e., disease resistance mechanisms; e.g., Pseudomonas syringae is a common plant pathogen.
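
The mapping step described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the helper `map_to_entrez`, the lookup table, and all gene names and IDs are hypothetical placeholders, and how AKS2 exports gene mentions is not covered here.

```python
def map_to_entrez(gene_names, symbol_to_id):
    """Map gene names to Entrez Gene IDs.
    Returns (sorted unique IDs, names that could not be mapped)."""
    ids, unmapped = set(), []
    for name in gene_names:
        # Normalise whitespace and case before the lookup.
        gene_id = symbol_to_id.get(name.strip().upper())
        if gene_id is None:
            unmapped.append(name)
        else:
            ids.add(gene_id)
    return sorted(ids), unmapped

# Placeholder lookup table: symbols and IDs are invented for illustration.
symbol_to_id = {"GENE_R1": 1001, "GENE_R2": 1002, "GENE_D1": 2001}

# Placeholder gene mentions extracted from the abstracts of each set.
genes_from_set_a_abstracts = ["gene_r1", "GENE_R2 ", "GENE_R1", "unknown_gene"]
genes_from_set_b_abstracts = ["GENE_D1"]

gene_set_a, unmapped_a = map_to_entrez(genes_from_set_a_abstracts, symbol_to_id)
gene_set_b, unmapped_b = map_to_entrez(genes_from_set_b_abstracts, symbol_to_id)
```

Keeping the unmapped names separate makes it easy to check how many gene mentions were lost before the two ID lists are submitted as Set A and Set B.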

Comparison of Keyword Enhancement:

We used Evaluation Data Set 2 with CoPub and ProfCom. ProfCom provides Gene Ontology annotation when only Set A is used, but not when Set A is used with Set B as the defined background. CoPub, however, accepts Entrez Gene IDs as input but not IDs from Arabidopsis thaliana.