Difference between revisions of "GGBN Data Portal Explanations"

From GGBN Wiki

Revision as of 10:36, 11 November 2015

This page is currently under construction (Nov 2015)!


In general you can use three options to search our data portal:

1. Browse Taxonomy
2. Search by Fields
3. Search by Citation

Browse Taxonomy [1]

Navi-Browse Taxonomy.JPG

The GGBN/DNA Bank Network's Data Portal currently uses the Catalogue of Life (CoL) as its main taxonomic backbone. We match the family and genus of the original records provided by our partners against the CoL annual checklist (version 2009). Selecting [XXX "Browse Taxonomy"] leads you to our Taxonomic Backbone site (figure below).

This site gives you an overview of how many samples and taxa are available online for certain higher taxa and genera (taxon and sample counts are shown in brackets). Clicking on "Show DNA samples" redirects you to a query that returns all records for the selected taxon.

Furthermore you can click on the little plus icon to see the next level of taxa.


CoL-Start1.JPG

Search by Fields [2]

Navi-Data Portal.JPG

The main query form for the data portal can be found at "Search & Preorder". Here you can choose from many criteria to filter your results. The upper part contains parameters related to the underlying voucher and collection event, and the lower part contains facts about the DNA samples. If you select, for example, a certain DNA bank, you can browse its whole DNA and tissue collection.

The NCBI Taxonomy ID is often used by the microbiology community. Right now the search does not include synonyms, but it will in the future. Many records are provided with multiple determinations (their determination history). We index all determinations, and you can search for any of them. Suggestion lists will help you when entering a Family name or Species name.

Furthermore, we extract the collection year as well as seas and oceans from the raw data. Seas/oceans and countries are matched against an existing list, so that, for instance, "England" is recognized as United Kingdom.
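The list matching described above can be pictured as a simple alias lookup. The following is only a minimal sketch: the table contents and the names COUNTRY_ALIASES and normalize_country are assumptions made for illustration, not the portal's actual list or code.

```python
# Hypothetical alias table; the portal's real list is not documented here.
COUNTRY_ALIASES = {
    "england": "United Kingdom",
    "scotland": "United Kingdom",
    "wales": "United Kingdom",
    "united kingdom": "United Kingdom",
    "deutschland": "Germany",
    "germany": "Germany",
}


def normalize_country(raw):
    """Return the canonical country name, or None if the value is unknown."""
    # Collapse whitespace and ignore case before looking the value up.
    key = " ".join(raw.strip().lower().split())
    return COUNTRY_ALIASES.get(key)
```

Values that are not in the table return None, so they can be flagged for review instead of being guessed.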

We also index related Sequence Accession Numbers (BOLD and GenBank/EMBL/DDBJ) to be available for search.

Dependent parameters

Some parameters are related to each other, e.g. if you select a Continent or an Ocean, the list of Countries or Seas respectively will be reduced to those belonging to the selected Continent/Ocean. The same happens for Family name and Species name/Taxonomy ID. If you select a Family name, the list of suggested Species names and Taxonomy IDs is reduced to the relevant ones.
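The dependency between parameters can be sketched as a lookup that narrows one suggestion list based on the value chosen in another. The mapping and the function name below are invented for illustration; how the portal implements this is not described on this page.

```python
# Hypothetical sample data; the portal's real lists are much longer.
COUNTRIES_BY_CONTINENT = {
    "Europe": ["Germany", "United Kingdom"],
    "Africa": ["Kenya", "Namibia"],
}


def country_suggestions(continent=None):
    """Return all countries, or only those of the selected continent."""
    if continent is None:
        # Nothing selected yet: offer every known country.
        return sorted(
            country
            for countries in COUNTRIES_BY_CONTINENT.values()
            for country in countries
        )
    return sorted(COUNTRIES_BY_CONTINENT.get(continent, []))
```

The same pattern would apply to Ocean and Seas, or to Family name and Species name/Taxonomy ID.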

Queryform.JPG

Getting results

After clicking on "Search" you will receive a hitlist with 50 records per page. The column headings contain small arrows. Clicking on such an arrow sorts the results by that column. The green arrow marks the current order (in the figure, ordered by Species name from A to Z). The hitlist contains the species name, the country where the specimen/sample was collected, the DNA number, and the specimen/voucher number. Clicking on the small magnifier or the species name will give you the record details.

Some samples do not have a DNA number. This means that DNA has not been extracted yet but can be ordered on demand.

Hitlist-portal.JPG

Retrieving details

The single record details page is a synoptic virtual dataset combined from at least two different sources (DNA sample and voucher). The upper part contains information on the physical DNA sample as well as links to published sequences and papers. The lower part contains the underlying specimen information; sometimes associated vouchers are also listed (e.g. a duplicate in another collection, or cultivated/captive individuals). GGBN encourages its partners to provide digital images of the vouchers as well. For several reasons (e.g. the size of the organism) this cannot always be realized easily.

RecordDetail.jpg

Pre-order samples

Login is required for pre-ordering samples. When you are logged in, you can select the required samples (checkboxes on the right) and click on "Add selected DNA/Tissue samples to shopping cart" at the top right. Some samples are blocked for the ordering process and are marked with a big red X.

Hitlist-portal-shop.JPG

Shopping Cart

After you have put samples into the shopping cart, the message "Your shopping cart contains xy samples" appears at the top of the website. Clicking on "Show details" will open your shopping cart and guide you through the pre-ordering process.

ShoppingCart1.JPG

The shopping cart gives you an overview of your selected DNA/tissue samples. Here you can delete samples from your cart or, by clicking on the taxon name, see the record details again. Click on "Continue Pre-Order (->Step 2/3)" to continue.

ShoppingCart2.JPG

The next step groups the samples by institution/DNA bank. In the example below you see that the selected samples are deposited at BGBM and DSMZ. You can check your invoice and delivery address and add some notes if you want. When you click on "Finish Pre-Order (->Step 3/3)", your pre-order will be forwarded to the DNA bank(s) responsible for the requested samples. Every DNA bank receives only the order information relevant to it. Subsequently, a confirmation email will be sent to you by the DNA bank(s) in question. An offer including binding prices will then be made in a separate email.

ShoppingCart3.jpg
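Splitting one pre-order into per-institution orders amounts to grouping the cart by DNA bank. A minimal sketch, assuming each cart entry carries a "dna_bank" and a "dna_number" field (both field names and the function name are hypothetical):

```python
from collections import defaultdict


def split_by_dna_bank(cart):
    """Group the DNA numbers in a cart by the DNA bank holding each sample."""
    orders = defaultdict(list)
    for sample in cart:
        # Each bank only ever sees the samples it is responsible for.
        orders[sample["dna_bank"]].append(sample["dna_number"])
    return dict(orders)
```

With a cart holding two BGBM samples and one DSMZ sample, this yields two separate orders, one per bank.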

Search by Citation [3]

The prototype of the GGBN Document Library contains more than 4400 papers citing either DNA and tissue samples or vouchers that are provided via the GGBN/DNA Bank Network Data Portal. Therefore, many non-molecular papers can be found here as well. The publications are provided in citation format with every voucher and DNA record. You can search for a DNA or Specimen Number, a certain DNA Bank, publication years, or free text within citations.

Library1.JPG

The resulting hitlist contains 50 records per page and is ordered by publication year (from oldest to newest). By clicking on a DNA number you will be directed to the record details.

Library2.JPG


Technical information

First step: harvesting and indexing

Data is harvested and indexed using B-HIT (http://wiki.bgbm.org/bhit/). B-HIT is a harvesting and indexing toolkit, based on the GBIF-HIT software.
Bhit.png

The records (or units) can be harvested from providers running either a BioCASe or an IPT installation. For BioCASe providers, the schemas ABCD 2.06, ABCD 2.1, ABCDDNA, ABCDGGBN and ABCDEFG are supported (single records or ABCD Archives). For IPT providers, DarwinCore Archives are supported. The indexed elements are listed at http://wiki.bgbm.org/bhit/index.php/Indexed_fields.


Second step: data quality

Data quality tests are done using B-HIT. Country names are translated into English, ISO codes are compared to the country names, and coordinates are validated and checked against both the ISO code and the country name. If the data is incomplete, the tool looks into the named areas and localities and tries to extract information about the country or the water body.
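The kind of consistency check described above might look roughly like the following. This is an illustrative sketch only and not B-HIT's code: the lookup tables, the rough bounding boxes, and the function name check_record are all assumptions.

```python
# Hypothetical lookup tables with two sample entries.
ISO_TO_COUNTRY = {"DE": "Germany", "GB": "United Kingdom"}

# Rough, padded bounding boxes: (min_lat, max_lat, min_lon, max_lon).
BOUNDING_BOXES = {
    "DE": (47.0, 55.5, 5.5, 15.5),
    "GB": (49.5, 61.0, -8.7, 2.0),
}


def check_record(iso_code, country_name, lat, lon):
    """Return a list of data quality issues found for one record."""
    issues = []

    # 1. ISO code and country name must agree.
    expected = ISO_TO_COUNTRY.get(iso_code)
    if expected is None:
        issues.append("unknown ISO code")
    elif expected != country_name:
        issues.append("ISO code does not match country name")

    # 2. Coordinates must be valid decimal degrees.
    in_range = -90 <= lat <= 90 and -180 <= lon <= 180
    if not in_range:
        issues.append("coordinates out of range")

    # 3. Valid coordinates should fall inside the country's bounding box.
    box = BOUNDING_BOXES.get(iso_code)
    if in_range and box is not None:
        min_lat, max_lat, min_lon, max_lon = box
        if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            issues.append("coordinates outside country bounding box")

    return issues
```

A bounding box is a coarse test; a production check would more likely compare against country polygons.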

Scientific names are parsed using the GBIF Name Parser (http://www.gbif.org/developer/species#parser) and custom regular expressions.
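To give an idea of what a custom regular expression for names can do, here is a deliberately simple sketch that splits a name into genus, species epithet, and authorship. The pattern and the function name parse_name are invented for illustration; they are not the expressions used by B-HIT and do not cover infraspecific ranks, hybrids, or other special cases handled by the GBIF Name Parser.

```python
import re

# Hypothetical pattern: capitalized genus, optional lower-case epithet,
# optional trailing authorship string.
NAME_PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)"
    r"(?:\s+(?P<epithet>[a-z][a-z-]+))?"
    r"(?:\s+(?P<authorship>.+))?$"
)


def parse_name(name):
    """Split a scientific name into its parts, or return None if it does not fit."""
    # Normalize runs of whitespace before matching.
    match = NAME_PATTERN.match(" ".join(name.split()))
    if match is None:
        return None
    # Drop the parts that were not present in the input.
    return {key: value for key, value in match.groupdict().items() if value is not None}
```

For "Quercus robur L." this returns the genus "Quercus", the epithet "robur", and the authorship "L.".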

Third step: backbone

Every single name is compared to the GBIF Backbone, the Catalogue of Life Backbone, and the NCBI Backbone. Higher taxa, synonyms and accepted taxa are retrieved using the GBIF Checklistbank webservice (http://api.gbif.org/v1/species).
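As an illustration of how a name can be looked up through this webservice, the sketch below builds a request for the name-matching endpoint of the GBIF species API (/species/match with a name parameter). Whether the portal uses exactly this endpoint is an assumption, and the function names are made up for the example.

```python
import json
import urllib.parse
import urllib.request

GBIF_SPECIES_API = "http://api.gbif.org/v1/species"


def build_match_url(name):
    """Build the URL that asks the GBIF Backbone for the best match of a name."""
    query = urllib.parse.urlencode({"name": name})
    return GBIF_SPECIES_API + "/match?" + query


def match_name(name):
    """Call the webservice and return the decoded JSON response (needs network access)."""
    with urllib.request.urlopen(build_match_url(name), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

The JSON response typically includes the matched name, its taxonomic status, and the higher classification, which can then be stored alongside the original record.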

The "Prokaryotic Nomenclature Up-to-Date API" (http://bacdive.dsmz.de/api/pnu/) is used for bacterial names.