Difference between revisions of "GGBN Data Portal Explanations"

Revision as of 18:23, 16 December 2015

Overview

We use the Berlin Harvesting and Indexing Toolkit (B-HIT, http://wiki.bgbm.org/bhit/) to harvest GGBN provider data. The records (or units) can be harvested from providers that run either a BioCASe or an IPT installation. For BioCASe providers, the schemas ABCD 2.06, ABCD 2.1, ABCDDNA, ABCDGGBN and ABCDEFG are supported (single records or ABCD Archives). For IPT providers, DarwinCore Archives are supported, including the GGBN extensions. The indexed elements are listed at http://wiki.bgbm.org/bhit/index.php/Indexed_fields.

GGBN data portal menu search.jpg

The search features can be found under the magnifying glass icon. There are three ways to search within the GGBN Data Portal:

  • Search by fields
  • Browse the tree of life
  • Browse collections

Data Quality and Data Cleaning

During harvesting, GGBN provider data are checked and, if necessary, cleaned. We keep the original provider data in addition to the cleaned versions. Data quality tests are done using B-HIT. Country names are translated into English, ISO codes are compared with the country names, and coordinates are validated and checked against both the ISO code and the country name. If the data are incomplete, the tool looks into the namedareas and localities and tries to extract information about the country or the water body.
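The checks described above can be illustrated with a minimal Python sketch. It is not the B-HIT implementation: the lookup tables, the rough bounding boxes, and the names `check_record` and `CheckResult` are assumptions made up for this example. A real harvester would rely on complete ISO 3166-1 data and proper country geometries.

```python
from dataclasses import dataclass
from typing import Optional

# Tiny illustrative lookup tables (assumptions, not B-HIT data).
COUNTRY_TO_ISO = {"germany": "DE", "deutschland": "DE", "brazil": "BR", "brasil": "BR"}
ISO_TO_ENGLISH = {"DE": "Germany", "BR": "Brazil"}
# Rough bounding boxes per ISO code: (min_lat, max_lat, min_lon, max_lon).
ISO_BBOX = {"DE": (47.2, 55.1, 5.8, 15.1), "BR": (-33.8, 5.3, -74.0, -34.7)}


@dataclass
class CheckResult:
    country_english: Optional[str]          # English country name, if recognised
    iso_matches_country: Optional[bool]     # None when one of the two is missing
    coordinates_valid: bool                 # within the legal lat/lon ranges
    coordinates_in_country: Optional[bool]  # None when it cannot be checked


def check_record(country, iso_code, lat, lon):
    """Run the country, ISO code and coordinate checks on one record."""
    iso_from_name = COUNTRY_TO_ISO.get(country.strip().lower()) if country else None
    english = ISO_TO_ENGLISH.get(iso_from_name) if iso_from_name else None
    iso = iso_code.strip().upper() if iso_code else None

    # Compare the provided ISO code with the one derived from the country name.
    iso_matches = (iso == iso_from_name) if (iso and iso_from_name) else None

    # Validate the coordinate ranges.
    valid = (lat is not None and lon is not None
             and -90 <= lat <= 90 and -180 <= lon <= 180)

    # Check the point against the bounding box of the (provided or derived) country.
    in_country = None
    bbox = ISO_BBOX.get(iso or iso_from_name)
    if valid and bbox:
        min_lat, max_lat, min_lon, max_lon = bbox
        in_country = min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

    return CheckResult(english, iso_matches, valid, in_country)
```

For example, `check_record("Deutschland", "de", 52.45, 13.3)` reports the English name "Germany", a matching ISO code, and coordinates inside the German bounding box. The original values stay untouched, which mirrors the practice of keeping provider data alongside the cleaned version.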

Scientific names are parsed using the GBIF Name Parser (http://www.gbif.org/developer/species#parser) and customized regular expressions.
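To give an idea of what a customized regular expression for name parsing might look like, here is a simplified Python sketch. It is neither the GBIF Name Parser nor one of the expressions used by GGBN; the pattern, the group names and the function `parse_name` are assumptions for illustration, and the pattern only covers simple uninomials, binomials and a few infraspecific ranks.

```python
import re

# Simplified pattern: genus, optional epithet, optional infraspecific part,
# and whatever remains is treated as authorship.
NAME_PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z-]+)"
    r"(?:\s+(?P<epithet>[a-z-]+))?"
    r"(?:\s+(?P<rank>subsp\.|var\.|f\.)\s+(?P<infra>[a-z-]+))?"
    r"(?:\s+(?P<authorship>.+))?$"
)


def parse_name(raw):
    """Split a scientific name string into its parts, or return None if it does not fit."""
    cleaned = " ".join(raw.split())  # collapse repeated whitespace
    match = NAME_PATTERN.match(cleaned)
    if match is None:
        return None
    # Keep only the parts that were actually present in the string.
    return {key: value for key, value in match.groupdict().items() if value is not None}
```

Strings that the pattern cannot handle return `None`; in a pipeline like the one described, such cases would presumably be left to a full parser.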

Taxonomic Backbone

After harvesting, the scientific names are matched against selected checklists in the GBIF ChecklistBank. Higher taxa, synonyms and accepted taxa are retrieved using the GBIF ChecklistBank web service (http://api.gbif.org/v1/species). The checklists include the Catalogue of Life, NCBI and the GBIF backbone itself. In addition, we match the names against the Prokaryotic Nomenclature Up-to-date (PNU) web service, provided by the DSMZ (http://bacdive.dsmz.de/api/pnu/).
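As a rough illustration of such a lookup, the following Python sketch queries the `match` endpoint under the GBIF species API mentioned above and extracts higher taxa and synonym information. It is not the code used by GGBN. The function names are made up for this example, and the response fields (`scientificName`, `synonym`, `usageKey`, `acceptedUsageKey` and the rank names) reflect the GBIF API as we understand it, so they should be verified against the GBIF documentation. Note that `match` only covers the GBIF backbone; matching against other checklists is not shown.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GBIF_SPECIES_API = "http://api.gbif.org/v1/species"  # base URL given in the text
HIGHER_RANKS = ("kingdom", "phylum", "class", "order", "family", "genus")


def build_match_url(name):
    """Build the request URL for matching one name against the GBIF backbone."""
    return f"{GBIF_SPECIES_API}/match?{urlencode({'name': name})}"


def summarize_match(response):
    """Reduce a match response (a dict) to the parts relevant for a backbone."""
    return {
        "matched_name": response.get("scientificName"),
        "is_synonym": response.get("synonym", False),
        # For synonyms, prefer the key of the accepted taxon when it is present.
        "accepted_key": response.get("acceptedUsageKey", response.get("usageKey")),
        "higher_taxa": {rank: response[rank] for rank in HIGHER_RANKS if rank in response},
    }


def match_name(name, timeout=10):
    """Call the web service (requires network access) and summarize the result."""
    with urlopen(build_match_url(name), timeout=timeout) as resp:
        return summarize_match(json.load(resp))
```

Keeping the URL construction and the response handling separate from the network call makes both easy to test without contacting the service.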

Search by fields (http://data.ggbn.org/ggbn_portal/search/index)

GGBN data portal search form.jpg

Here you can choose from many criteria to filter your results. The upper part contains parameters often used by researchers and curators. You can add further parameters by clicking on "add search field". We distinguish between GGBN repositories (DNA and tissue banks) and voucher collections. The latter can also be non-GGBN institutions.

Select additional search parameters.
The field (here, Ocean) appears immediately. It can be deleted by clicking on the red cross.

Most of the fields are drop-down lists or include suggestion lists to help you. For example, when you type a name, the portal searches for all synonyms and accepted names matching your search term and provides a suggestion list with detailed information about each name found in the GGBN backbone.

GGBN data portal suggestion list.jpg
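The suggestion behaviour described above can be sketched in Python as a prefix search over backbone names that returns both accepted names and synonyms. This is an assumption about how such a list could be built, not the portal's actual implementation; the `BackboneName` class, the `suggest` function and the sample names are hypothetical and not taken from the GGBN backbone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackboneName:
    name: str           # the name as it appears in the backbone
    status: str         # "accepted" or "synonym"
    accepted_name: str  # the accepted name this entry points to


# Hypothetical sample entries, made up for this example.
BACKBONE = [
    BackboneName("Examplea candida", "synonym", "Examplea alba"),
    BackboneName("Examplea alba", "accepted", "Examplea alba"),
    BackboneName("Otherus minor", "accepted", "Otherus minor"),
]


def suggest(term, backbone, limit=10):
    """Return backbone names starting with the search term, accepted names first."""
    needle = term.strip().lower()
    if not needle:
        return []
    hits = [entry for entry in backbone if entry.name.lower().startswith(needle)]
    # Sort accepted names before synonyms, then alphabetically.
    hits.sort(key=lambda entry: (entry.status != "accepted", entry.name))
    return hits[:limit]
```

Each returned entry carries its status and accepted name, so a suggestion list can show whether a hit is a synonym and which accepted name it belongs to.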