EMBL
MARTINI

Frequently Asked Questions:

INTRODUCTION

1. What does Martini do?
2. What can I do with Martini? For which cases is Martini appropriate?
3. What is important with Martini?
4. Can Martini solve my problem?

INPUT

5. How do I find a Set A and Set B? What is interesting and what not?
6. What do I need in order to use Martini? (*) Updated
7. How can I collect abstracts for the Keyword Enhancement Analysis?!
8. What is the input format of the lists?
9. I do not have any Set B; what should I do?
10. How many abstracts can I give? What limitations are there? What should I keep in mind?
11. What are the term types used for?
12. Which term types should I use?
13. What do the term types mean?
14. Are there other term types?
15. How many e-mail addresses can I enter?

METHODS

16. How does Martini work? What happens in the background? (*) Updated
17. Does Martini take into account all the abstracts? (*) Updated
18. How does the term enhancement work? (*) Updated

RESULTS

19. What do I see in the raw keyword enhancement file? (*) Updated
20. What do I see at the keyword enhancement analysis table?
21. I want to work with the results of Martini; how can I access them?
22. I want to work with the results of Martini; how can I process them?
23. I cannot find all my PubMed Ids in the results! What has happened?

OTHER

24. Can one enter both Entrez and Ensembl ids in a single set?! What happens then?! (*) Updated
25. How can I automatically submit Martini and/or Caipirini jobs?! Is there an API?! (*) Updated
26. I submitted an older job of mine but this time I receive slightly different results!? What is wrong?! (*) Updated

1. What does Martini do?

In general, Martini retrieves the abstracts that correspond to the ids entered by the user and works on their text in order to learn the difference between the two input sets. To do this, Martini identifies the words that are over-represented (and considered significant) in each set compared with the other one. The results are sent by e-mail.

Specifically, there are three ways of entering data in Martini:

(a) One can enter lists of PubMed Ids directly, or
(b) One can enter lists of Entrez Gene Ids, or
(c) One can query PubMed directly.

In all three cases Martini retrieves the corresponding abstracts and processes them.

Martini provides a Keyword Enhancement Analysis:

By default, Martini processes the abstracts retrieved from the input and identifies which terms are over-represented in each set (compared to the other one). It then e-mails, as the result, the terms that differ significantly between Set A and Set B.

2. What can I do with Martini? For which cases is Martini appropriate?

Martini can be used in several different cases.

The Keyword Enhancement Analysis can very quickly show you how different two sets of biomedical literature or of genes are. This is particularly useful when somebody, for example:

(a) Has big sets of genes resulting from (high throughput) experiments. One can find what the important features of each set are. Alternatively, one can find what differentiates one set from the other. Furthermore, Martini can be used in order to isolate subgroups of genes and identify their biological significance.

(b) Wants to know what characterizes one set of biomedical literature against another one, or what kind of information is mentioned in the text of these abstracts, or what the important features that make the two sets different are, if they are indeed significantly different.

3. What is important with Martini?

Martini carries several innovations on different levels.
(a) It is the first system that directly characterizes genes based on abstracts, in a simple manner.
(b) It extracts information from the literature directly, without performing complex functions on the sets of abstracts.
(c) It can be more informative and specific than GO-related annotations.
(d) In addition, it can be applied to genes from more organisms, as well as to combinations of gene and literature data sets.
(e) Furthermore, the system is applicable to a wide range of cases related to the life sciences. This is because of the dictionaries used and the control that the user can have over them.
(f) The analysis can, thus, be 'personalized' to each research field and user separately. By changing the context (selecting different combinations of term types to be taken into account) the user can define different questions.

4. Can Martini solve my problem?

It depends; Martini most probably can.
The Keyword Enhancement Analysis is relatively fast and can quickly guide you in understanding and exploring different sets of biomedical literature or genes. This does not mean that it will always find significantly enhanced terms.
Advantages of Martini are that it is generic, fast and more descriptive than other systems. Nevertheless, Martini does not claim to find significant terms for every combination of sets, and neither would other systems.

5. How do I find a Set A and Set B? What is interesting and what not?

This depends entirely on the type of problem you would like to solve.
Sets A and B are, in general, two sets of genes or abstracts that one wants to compare with each other. One might want to find, for example, the difference between the literature on PolyQ (e.g. Set A) and non-PolyQ (e.g. Set B) trinucleotide repeat disorders. By using Martini, one can easily identify which diseases, genes or drugs are most important for each case. One could then search for another set of abstracts and discover more literature related to either Set A or Set B.

6. What do I need in order to use Martini?

Martini needs:
1. Two lists of PubMed Ids or of Entrez and/or Ensembl Gene Ids, or two queries for PubMed.
2. A valid e-mail address to which the results will be sent.
3. Optional: A description of the job to be submitted. The description is used as the subject of the e-mail sent with the results. This is useful when a user plans to submit several jobs.
4. Optional: The user can select different term types to be taken into account. The default is that all terms will be used.

7. How can I collect abstracts for the Keyword Enhancement Analysis?!

There are many web sites, tools, systems and databases that someone can use in order to directly collect literature.

8. What is the input format of the lists?

There should be one PubMed Id per line, or one Entrez Gene Id per line. If the input is a PubMed query, it should be in exactly the same format as in the NCBI PubMed interface.

9. I do not have any Set B; what should I do?

In the current version, Martini will not proceed unless the user declares both sets.
One could, however, enter a single id as the second set. This will most probably make the Term Enhancement Analysis find no significant terms. However, if one is interested in the terms of a single set, one could isolate them all from the result files and process them further in another analysis.
In any case, a single random id as Set B will describe Set A poorly. A bigger set of randomly selected ids as Set B would, however, work for this purpose.

10. How many abstracts can I give? What limitations are there? What should I keep in mind?

Keep in mind that more ids increase the processing time. A job that includes a few hundred up to a few thousand abstracts will be handled in a reasonable time frame, i.e. some minutes or hours. Bigger jobs will take longer. You can paste as many PubMed IDs as you want into the input lists, as long as there are fewer than 25000. Query results will all be processed as long as they also comprise fewer than 25000 abstracts. (The methodology can be applied to any amount of data; we restrict the size simply for performance reasons.)
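
A quick check before submission can catch oversize lists early. The following is only a sketch: the helper name `check_id_list` and the constant `MAX_IDS` are illustrative and not part of Martini, and it assumes that the limit quoted above applies to each list separately.

```python
MAX_IDS = 25000  # assumption: the "fewer than 25000" limit quoted in this FAQ


def check_id_list(text):
    """Split pasted input into ids (one per line) and check the size limit."""
    # Blank lines and surrounding whitespace are ignored.
    ids = [line.strip() for line in text.splitlines() if line.strip()]
    if len(ids) >= MAX_IDS:
        raise ValueError(
            f"{len(ids)} ids given; the list must hold fewer than {MAX_IDS}"
        )
    return ids
```

The returned list keeps repeated ids, since the current version of Martini counts repeated entries.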

11. What are the term types used for?

The term types are important because their selection defines a different, user-defined view of the genes/abstracts to be tested: only the selected types of terms are taken into account in the Keyword Enhancement Analysis.

12. Which term types should I use?

This depends on the problem that you would like Martini to solve. By default, Martini takes into account all the term types mentioned in the abstracts that correspond to the given input.

13. What do the term types mean?

There are five different term types that users can select or deselect:
Organisms: these are terms that refer to an organism.
Genes/Proteins: these are terms that refer to genes and/or their proteins.
Small Molecules: these are terms that refer to chemicals or drugs.
Diseases: these are terms that refer to diseases.
Symptoms: these are terms that refer to phenotypes of or reactions to diseases and other stimuli (e.g. pain).

14. Are there other term types?

Yes, there are. If one deselects all the term types available in the interface, Martini will still use terms in order to understand what the difference between the two given lists is. These are terms that refer to biological actions, such as "enhances" and "regulates", and other biomedical terms that do not belong to any other category. In addition, they comprise the vast majority of the terms mentioned in an abstract.

15. How many e-mail addresses can I enter?

Please enter only one e-mail address. Make sure that it is correct and valid; otherwise the results will never arrive (to you at least!).

16. How does Martini work? What happens in the background?

Martini uses, in the background, text-mining information retrieved from the AKS2 database, an industrial system produced by Bioalma. AKS2 manages information extracted directly from the scientific literature. The system is updated daily and has indexed more than 8,000,000 of the latest PubMed abstracts. Martini uses instances of the AKS2 database that are updated less frequently.
When the user enters PubMed Ids directly, the abstracts are registered for processing immediately. When the user enters lists of Entrez and/or Ensembl Gene Ids, the corresponding abstracts are collected first and then registered to a Set; Martini retrieves all the abstracts linked to the corresponding Gene records. When the user enters queries, Martini uses the NCBI Entrez Programming Utilities to query PubMed and retrieve the abstracts.
After the abstracts have been collected, Martini handles duplicate records entered by the user and abstracts that belong to more than one input set. Martini first removes the duplicates. In order to reduce noise, Martini excludes from the Keyword Enhancement Analysis the abstracts that belong to more than one set. The only exception is when both sets are Entrez Gene Ids: in this case common genes are removed, but common abstracts that are related to different genes of both sets are taken into account.
When the class assignment procedure has finished, only unique abstracts that belong to exactly one set remain. These abstracts are then mapped to AKS2; only the indexed abstracts are used by the system. If, after these steps, no abstracts remain in any of the sets, Martini sends an e-mail notifying the user that "no indexed abstracts can be found". Then, the terms mentioned in each set of abstracts are extracted for further use. The Keyword Enhancement Analysis takes into account the terms of the types selected by the user. If no terms are found, an e-mail is sent to the user as well.

(*) Updates applied: The content of the removed sentences no longer applies in the new version of Martini.

17. Does Martini take into account all the abstracts?

Martini filters the abstracts in 3 stages:

(a) It checks whether the input indeed corresponds to a number, in order to remove any "noise" input. If no abstracts are retrieved, the procedure stops there.
(b) It removes duplicates. If an abstract belongs to both Sets A and B, Martini does not take it into account at all.
(c) From the remaining abstracts, the ones that have been indexed are used for the Keyword Enhancement Analysis.

(*) Updates applied: Filters (a) and (b) have been deactivated at the request of users.
Filter (a) has been removed because Martini can now also handle Ensembl identifiers. Filter (b) has been removed because some users consider repetition important for the modelling of their data. In any case, if no valid entries are identified for a given Set, the procedure stops and an e-mail is sent to notify the user.

18. How does the term enhancement work?

Having found the set of terms associated with each of the input sets, Martini loops over each term and analyzes how much it is enhanced in Set A versus Set B. If the user specified either a list of PubMed IDs or a query, Martini counts the number of abstracts in which the term occurs at least once. If the user specified genes as input, Martini counts the number of genes for which the term occurs in at least one of the associated abstracts. Martini then compares these counts for Sets A and B using a two-tailed Fisher test, both ways, i.e. we look for over- and under-represented terms in Set A; then we do the same for Set B. To account for the total number of genes or abstracts tested, the Fisher p-values are then adjusted using the False Discovery Rate method. For this adjustment, the p-values are ranked from smallest to largest, and we find the largest rank i that satisfies p <= a * i / m, where a = 5% is the fraction of false positives considered acceptable and m is the number of abstracts (or genes) taken into account for the corresponding set. Once this i value is identified, it is used to calculate the adjusted p-value.
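
The two statistical steps described above can be sketched in plain Python. This is a textbook illustration and not Martini's own code: `fisher_two_tailed` and `bh_adjust` are hypothetical names, the adjustment is the standard Benjamini-Hochberg procedure, and m is taken here as the number of p-values being adjusted, which may differ from the m that Martini uses.

```python
from math import comb


def fisher_two_tailed(a_with, a_total, b_with, b_total):
    """Two-tailed Fisher exact p-value for a term found in a_with of a_total
    entries of Set A and in b_with of b_total entries of Set B."""
    k_total = a_with + b_with            # entries containing the term, both sets
    denom = comb(a_total + b_total, k_total)

    def prob(x):
        # Hypergeometric probability that x of the k_total entries fall in Set A.
        return comb(a_total, x) * comb(b_total, k_total - x) / denom

    observed = prob(a_with)
    lo = max(0, k_total - b_total)
    hi = min(a_total, k_total)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= observed * (1 + 1e-9))
    return min(1.0, total)


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

With this form of the adjustment, a term whose adjusted p-value is at most a = 0.05 is one whose rank does not exceed the largest i satisfying p <= a * i / m.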

(*) Updates applied: Each valid entry is now considered as many times as it was inserted in the original input. This means that if an input consists of three abstracts but one of them is inserted twice, the number of abstracts taken into account is no longer three but four. If a term occurs only in that duplicated abstract, its number of occurrences will no longer be one but two, since the abstract was inserted twice. The same applies to gene entries.
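
The counting rule in this update can be illustrated as follows. The function `term_counts` and its inputs are hypothetical; the mapping from entries to terms stands in for whatever the AKS2 lookup returns.

```python
from collections import Counter


def term_counts(entries, terms_by_entry):
    """Count, per term, the entries that contain it.

    Repeated entries are deliberately kept, so an abstract (or gene)
    inserted twice contributes twice to the counts and to the total.
    """
    counts = Counter()
    for entry in entries:
        # set() ensures a term counts once per entry, however often it occurs.
        for term in set(terms_by_entry.get(entry, ())):
            counts[term] += 1
    return counts, len(entries)
```

For the example in the text, three abstracts with one of them pasted twice give a total of four, and a term found only in the repeated abstract gets a count of two.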

19. What do I see in the raw keyword enhancement file?

All keywords found in Set A and B are shown. This allows the user to work on the results.
The columns are tab (\t) separated and each one displays, respectively:
- The set(s) in which the term was found. If the term was found in both sets the value is 0. If the term was found only in set A or B then the value is set to 1 or to -1, respectively.
- The term name.
- The references from Set A, i.e. the abstracts it was mentioned in, or the gene ids (and the abstracts related to each gene) that the term was mentioned in.
- The references from Set B, same as above.
- Number of abstracts (or genes) the term was mentioned in from Set A. Also the relative representation within the set is displayed, as the proportion to the total number of (genes or) abstracts of the set. In the header line the total number of (genes or) abstracts of the set is mentioned.
- Same as above, for Set B.
- Enhancement Factor: the ratio between the relative representations of the term in each set.
- p: the Fisher p-value.
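
A downloaded raw file could be read along these lines. This is a sketch under assumptions: the column names in `COLUMNS` are invented for illustration, the first line is assumed to be the header line, and the count columns are left as text because their exact layout (count plus relative representation) is not specified here.

```python
import csv

# Assumed names for the eight tab-separated columns described above.
COLUMNS = ["set_flag", "term", "refs_a", "refs_b",
           "count_a", "count_b", "enhancement", "p"]


def read_raw_results(handle):
    """Yield one dict per term row of a tab-separated raw results file."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)                  # assumption: first line is the header
    for row in reader:
        if len(row) < len(COLUMNS):
            continue                    # skip blank or incomplete lines
        record = dict(zip(COLUMNS, row))
        # 0: found in both sets, 1: only in Set A, -1: only in Set B
        record["set_flag"] = int(record["set_flag"])
        yield record
```

For instance, `[r["term"] for r in read_raw_results(f) if r["set_flag"] == 1]` would list the terms found only in Set A, where `f` is the opened results file.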

(*) Updates applied: The references from each set consist of the abstracts (or the gene entries and their corresponding abstracts) that contain the given term. Since entries common to both Sets, or repeated within a Set, are currently not removed, terms (and thus their occurrences) are counted once for every valid entry in which they are found. For this reason, if a term is associated with a number of abstracts (or genes) of a given Set, its references for that Set will list these abstracts (or genes) as many times as they were present in the original input. This lets users keep track of the effect their input has had on the analysis.

20. What do I see at the keyword enhancement analysis table?

All significantly enhanced keywords are shown first as a 'term cloud' where the size of each keyword is directly proportional to its significance. This allows the user to immediately identify the most significant terms. The keywords assigned to Sets A or B are colored blue or black, respectively (terms that are equally represented in both sets are red). Martini also presents the significant keywords in table form, showing the following data: the number of times each term occurs in each set; the enhancement factor, which is the ratio of the previous numbers; and the adjusted p-value, which is an estimate of the likelihood that the given level of keyword enhancement occurred by chance.

21. I want to work with the results of Martini; how can I access them?

The raw keyword enhancement results file provides the essence of the Martini results. It is presented in this form so that you can download it and then work on it.

22. I want to work with the results of Martini; how can I process them?

One could do many things with Martini and its results (text mining, statistical analysis, experiments, etc.). There is no specific direction; this is entirely up to the user. However, Example 1 can be a good approach for identifying subgroups of genes.

23. I cannot find all my PubMed Ids in the results! What has happened?

They have probably been removed during the filtering procedures.

24. Can one enter both Entrez and Ensembl ids in a single set?! What happens then?!

Yes, one can enter in the same Set (A or B) a list of gene ids from different organisms that consists of Entrez ids, Ensembl ids, or both. In this case, Martini will remove duplicate records in the sense that
(a) if an id is pasted more than once, it will be taken into account only once for the analysis, and
(b) if an Ensembl id and an Entrez id correspond to the same gene, only one of the two will be taken into account for the analysis.

Internally, Martini first maps the Ensembl ids to their corresponding Entrez Gene entries; their respective abstracts are then used for the analysis.

(*) Updates applied: The content of the removed sentences no longer applies. Please see the answer to question 26 for more information.

25. How can I automatically submit Martini and/or Caipirini jobs?! Is there an API?!

Yes, you can! There are two ways:

1. One can call sequentially, from another program, the script that can be downloaded here, passing the proper parameters in order to launch Martini Keyword Enhancement Analysis jobs and Caipirini Classification and Ranking of abstracts jobs. The script constructs the Martini and/or Caipirini job urls and submits the data for the analyses to be launched. When the jobs have started, the script prints out the response message from the server and finishes. As with the main interface, the results are sent to the user via e-mail.
If the output you receive mentions 'Access denied due to security policy violation', please check your firewall and your internet or intranet settings.

2. Alternatively, one can construct the Martini and/or Caipirini job urls oneself, as follows.

To repeat, the input for both Martini and Caipirini consists of:
- A list of PubMed ids or Entrez/Ensembl Gene ids or a PubMed query to define Set A.
- A list of PubMed ids or Entrez/Ensembl Gene ids or a PubMed query to define Set B.
- A set of term types to be used for analysis.
- If a Caipirini job is to be started a third Set C, that is a PubMed query, must be defined.

To launch Martini or Caipirini jobs, one must define the url prefix (i.e. http://martini.embl.de/startGO? and http://caipirini.org/startGO? respectively) and the following parameters:
* positive_list : The input for Set A (one id per line or a PubMed query; all types should be URL-encoded for proper submission to take place).
* inputA : The input type for Set A. Please enter:
      pubmedids : if the input consists of a list of PubMed ids, formatted as described above.
      geneids : if the input consists of a list of Entrez and/or Ensembl Gene ids, formatted as described above.
      qpubmed : if the input consists of a PubMed query, formatted as described above.
* negative_list : The input for Set B (as for 'positive_list').
* inputB : The input type for Set B (as for 'inputA').
* descriptionData : A description of the input data or of the job to be submitted (URL-encoded; optional).
* email : The e-mail address to which the results of the launched Martini or Caipirini job are sent. Please enter one valid e-mail address (URL-encoded).
* queryPubmed : For a Caipirini job to be launched, Set C must be defined as a PubMed query (URL-encoded); otherwise set the value to do_enhancement_case to launch Martini.
* termTypes1 : Set to '4' to take terms of type 'Organisms' into account for the analysis.
* termTypes2 : Set to '2' to take terms of type 'Genes/Proteins' into account for the analysis.
* termTypes3 : Set to '1__8' to take terms of type 'Small Molecules' into account for the analysis.
* termTypes4 : Set to '3' to take terms of type 'Diseases' into account for the analysis.
* termTypes5 : Set to '7' to take terms of type 'Symptoms' into account for the analysis.
If a term type definition is omitted or assigned the wrong value, it will not be taken into account. If any of the essential parameters (i.e. positive_list, inputA, negative_list, inputB, queryPubmed and email) is omitted or not assigned a value, there will be no response from the server and no job will be launched.
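
A Martini job url could be assembled from these parameters roughly as shown here. This is a sketch, not the downloadable script: the function name `martini_url` is hypothetical, joining ids with a newline is an assumption about what "one id per line" means once encoded, and whether the server accepts `+` for spaces (as produced by `urlencode`) has not been verified. The code only builds the url and does not contact the server.

```python
from urllib.parse import urlencode

MARTINI_PREFIX = "http://martini.embl.de/startGO?"   # prefix quoted above

TERM_TYPE_VALUES = {        # parameter name -> value listed above
    "termTypes1": "4",      # Organisms
    "termTypes2": "2",      # Genes/Proteins
    "termTypes3": "1__8",   # Small Molecules
    "termTypes4": "3",      # Diseases
    "termTypes5": "7",      # Symptoms
}


def martini_url(set_a, input_a, set_b, input_b, email,
                description="", term_types=tuple(TERM_TYPE_VALUES)):
    """Build a Martini job url; lists of ids are joined one per line."""
    def as_text(value):
        # A PubMed query is passed as a string, an id list as a sequence.
        if isinstance(value, str):
            return value
        return "\n".join(str(v) for v in value)

    params = {
        "positive_list": as_text(set_a),
        "inputA": input_a,                    # pubmedids, geneids or qpubmed
        "negative_list": as_text(set_b),
        "inputB": input_b,
        "email": email,
        "queryPubmed": "do_enhancement_case", # launches Martini, not Caipirini
    }
    if description:
        params["descriptionData"] = description
    for name in term_types:
        params[name] = TERM_TYPE_VALUES[name]
    return MARTINI_PREFIX + urlencode(params)  # urlencode does the URL-encoding
```

For example, `martini_url(["111", "222"], "pubmedids", "cancer therapy", "qpubmed", "me@example.org", term_types=["termTypes2", "termTypes3"])` would describe a job with Set A as PubMed ids, Set B as a PubMed query, and only the 'Genes/Proteins' and 'Small Molecules' term types.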

Examples (To see the constructed url, please, copy the underlying link and paste it):
- Martini: Evaluation Data Set 1 (Set A and B as PubMed Ids; all term types used)
- Martini: Evaluation Data Set 1 (Set A as PubMed Ids; Set B as PubMed query; only 'Gene/Proteins' and 'Small Molecules' term types used)
- Caipirini: Set A as Entrez Gene Ids; Set B as PubMed ids; 'Organisms' and 'Small Molecules' term types not used; Set C a query of two abstracts.

(*) Updates applied: Martini and Caipirini no longer use the same configuration. We plan to update the APIs too. In the meantime, please be cautious and use the provided API only for Martini.

26. I submitted an older job of mine but this time I receive slightly different results!? The same applies for the examples that you provide! What is wrong?!

There is nothing wrong. Most probably the background data have been updated and this has influenced the results. The background data consist mainly of the Gene-to-PubMed id mappings and the AKS2 repository information. The results of the analysis may have been affected primarily by the fact that:
(a) there may be a different, new set of abstracts associated with a gene entry (abstracts may have been added and/or removed);
(b) there may be more abstracts indexed by AKS2;
(c) there may be more (new) terms indexed by AKS2 for a given abstract.

(*) Updates applied: As a matter of fact, Martini background data have been updated already.

(*) Updates applied: In addition to the background information data-update, Martini now also processes the input in a different manner.

Users have asked, and it has been accepted as the proper strategy, that repeated entries within a Set, as well as entries common to both Sets A and B, should no longer be removed. The reason is that some users consider repeated entries within a Set, or entries common to both Sets, to be valuable information for the modelling of their data. Originally, the developers assumed that users would not want to deal with such computational issues and that repeated entries were probably a 'copy-paste' oversight in cleaning up the input data. Thus, Martini itself used to filter and clean the input Sets.

The requested change in the input processing strategy has now been made and can give different results than before. For this reason, a 'Warning' message accompanies the results, drawing the attention of users who may later want to submit new lists with unique entries.

(*) Updates applied: Furthermore, Martini can now accept as input not only Entrez Gene ids (from any organism) but also Ensembl Gene ids.

A single Gene Set may comprise a mix of gene ids from Entrez, from Ensembl, or from both, and from any organism covered by these sources. Ensembl Genes are not linked to the literature directly but via their corresponding Entrez Gene records. Where there is a one-to-many Ensembl-to-Entrez Gene mapping, the literature of the Ensembl record is taken to be the union of the abstracts linked to the corresponding Entrez Gene records.
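
The union rule for one-to-many mappings can be sketched as follows. The function name and the two dictionaries are hypothetical stand-ins for Martini's background mappings (Ensembl-to-Entrez and Gene-to-PubMed).

```python
def abstracts_for_gene(gene_id, ensembl_to_entrez, entrez_to_pmids):
    """Return the set of PubMed ids linked to an Entrez or Ensembl gene id."""
    # Ensembl ids are resolved through their Entrez records;
    # any id without an Ensembl mapping is treated as an Entrez id.
    entrez_ids = ensembl_to_entrez.get(gene_id, [gene_id])
    pmids = set()
    for entrez_id in entrez_ids:
        pmids |= set(entrez_to_pmids.get(entrez_id, ()))
    return pmids
```

An Ensembl id that maps to two Entrez records therefore receives every abstract linked to either record, with shared abstracts listed once.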

(*) Updates applied: These updates may lead to longer processing time as the number of entries to be taken into account becomes larger.
