Difference between revisions of "GGBN Data Portal Explanations"

From GGBN Wiki
(Technical informations)
(Search by fields)
 
(37 intermediate revisions by 3 users not shown)
Line 1: Line 1:
<div id="wikinote"> '''This page is currently under construction (Nov 2015)!''' </div>
+
=Background=
 +
[[File:GGBN data portal menu search.jpg|thumb|right|200px|three search options to choose]]
 +
We use the [http://wiki.bgbm.org/bhit/ Berlin Harvesting and Indexing Toolkit (B-HIT)] to harvest GGBN provider data. The records (or units) can be harvested from providers having either a BioCASe or an IPT installation. For BioCASe providers, the schemata ABCD 2.06, ABCD 2.1, ABCDDNA, ABCDGGBN and ABCDEFG are supported (single records or ABCD Archives). For IPT providers, DarwinCore Archives are supported, including the GGBN extensions. Elements that are indexed are listed at http://wiki.bgbm.org/bhit/index.php/Indexed_fields.
  
 +
[[File:GGBN data portal menu stats.jpg|thumb|200px|statistic feature can be found under the freezer icon]]
 +
The search features can be found under the magnifying glass icon. You can use three options to search within the GGBN Data Portal:
 +
*Search by fields
 +
*Browse the tree of life
 +
*Browse collections
  
In general you can use three options to search our data portal:
+
Furthermore you can check out statistics of GGBN online collections.
  
1. Browse Taxonomy
+
==Data Quality and Data Cleaning==
2. Search by Fields
+
During harvesting, GGBN provider data are checked and cleaned if necessary. We keep the original provider data in addition to the cleaned versions. Data quality tests are done using B-HIT. Country names are translated into English, ISO codes are compared to the country names, and coordinates are validated and checked against both ISO code and country name. In case of incomplete data, the tool looks into the named areas and localities and tries to extract information about the country or the water body.
3. Search by Citation
 
  
=Browse Taxonomy [http://data.ggbn.org/CoL.php]=
+
Scientific names are parsed using the GBIF Name Parser (http://www.gbif.org/developer/species#parser) and customized regular expressions.
[[File:Navi-Browse Taxonomy.JPG|thumb|100px]]
 
The GGBN/DNA Bank Network's Data Portal currently uses the Catalogue of Life (CoL) as its main taxonomic backbone. We match the family and genus of original records provided by our partners with the CoL annual checklist (version 2009). Selecting [XXX "Browse Taxonomy"] leads you to our Taxonomic Backbone site (figure below).
 
  
This site gives you an overview on how many samples and taxa are online available for certain higher taxa and genera (taxa and sample counts bracketed). By clicking on "Show DNA samples" you will be redirected to a query that gives you all records for selected taxon.
+
==Taxonomic Backbone==
 +
After harvesting, the scientific names are matched against certain checklists of the GBIF checklist bank. Higher taxa, synonyms and accepted taxa are retrieved, also using the GBIF checklist bank webservice (http://api.gbif.org/v1/species). These checklists include Catalogue of Life, NCBI and the GBIF backbone itself. In addition we match the names against the Prokaryotic Nomenclature Up-to-Date (PNU) web service, provided by the [http://bacdive.dsmz.de/api/pnu/ DSMZ].
  
Furthermore you can click on the little plus icon to see the next level of taxa.
+
=Search=
 +
<div id="wikinote" align="center">http://data.ggbn.org/ggbn_portal/search/index
  
 +
The Data Portal is based on SOLR, which provides powerful full text search. We have implemented this feature in all search fields, apart from select lists, checkboxes and radio buttons. The search is case insensitive, so e.g. “black sea”, “Black Sea” or “Black sea” will all work.
  
[[File:CoL-Start1.JPG|500px]]
+
'''Note: Boolean operators must be in capital letters! You can combine as many terms and operators as you want. Wildcards (*) can be placed anywhere.'''</div>
  
=Search by Fields [http://data.ggbn.org/Query.php]=
+
==Use of boolean operators like NOT, OR, AND==
[[File:Navi-Data Portal.JPG|thumb|100px]]
+
{|class="wikitable"
The main query form for the data portal can be found at "Search & Preorder". Here you can choose many criteria to filter your results. The upper part contains parameters related to the underlying voucher and collection event, and the lower part contains facts about the DNA samples. If you select, for example, a certain DNA bank, you can browse its whole DNA and tissue collection.
+
!Example !! operator !! Remarks
 +
|-
 +
|corvus AND alba
  
The NCBI Taxonomy ID is often used by the microorganismic community. Right now search does not include synonyms, but will do in the future. Many records are provided with multiple determinations (their determination history). We index all determinations and you can search for all of them. Suggestion lists will help you when entering Family name or Species name.
+
alternatively:
 +
corvus && alba
 +
|AND
 +
|Both, "Corvus" and "alba" must appear somewhere
 +
|-
 +
|"Corvus alba"
 +
|""
 +
|The whole phrase will be searched
 +
|-
 +
|Corvus OR Coloeus
  
Furthermore we filter collection year as well as Seas and Oceans out of the raw data. Seas/Oceans and Countries are matched with an existing list, so that also countries like "England" will be recognized as United Kingdom for instance.
+
alternatively:
 +
Corvus || Coloeus
 +
|OR
 +
|Searches for records containing "Corvus" or "Coloeus" or both
 +
|-
 +
|Corvus alba
 +
|
 +
|The default operator performed here is AND
 +
|-
 +
|Corvus NOT corone
  
We also index related Sequence Accession Numbers (BOLD and GenBank/EMBL/DDBJ) to be available for search.
+
alternatively:
 +
Corvus -corone
 +
|NOT
 +
|Searches for all "Corvus" records, but excludes all containing "corone"
 +
|-
 +
|Cor*
 +
|
 +
|Searches for records beginning with Cor
 +
|-
 +
|c?rvus
 +
|
 +
|Will find Cervus and Corvus
 +
|-
 +
|C*vus
 +
|
 +
|Will find Cervus, Corvus and e.g. Campylobacter curvus
 +
|-
 +
|*amrock
 +
|
 +
|Will find Gallus gallus domesticus (Gmelin, 1789) 'Amrock'
 +
|-
 +
|(Ontario OR Ohio OR pennsylvania) AND NOT “Lake Erie”
 +
|
 +
|Combine as many operators and terms as you want
 +
|}
  
==Dependent parameters==
+
==Search by fields==  
Some parameters are related to each other, e.g. if you select a Continent or an Ocean, the list of Countries or Seas respectively will be reduced to those belonging to the selected Continent/Ocean.
+
[[File:GGBN data portal search form.jpg|center|700px]]
The same happens for Family name and Species name/Taxonomy ID. If you select a Family name, the list of suggested Species names and Taxonomy IDs is reduced to the relevant ones.
 
  
[[File:Queryform.JPG|600px]]
+
Here you can choose different parameters to filter your results. The upper part contains parameters often used by researchers and curators. In addition you can add further parameters (click on "add search field"). We distinguish between GGBN repositories (DNA and tissue banks) and voucher collections. The latter can also be non-GGBN institutions.
 +
{|
 +
|[[File:GGBN data portal add search fields.jpg|thumb|200px|select additional search parameters]]
 +
|[[File:GGBN data portal add search fields step2.jpg|thumb|400px|the field will appear (here Ocean) immediately. It can be deleted by clicking on the red cross]]
 +
|}
  
==Getting results==
+
Most of the fields are drop down lists or include suggestion lists to help you. E.g. when typing a name the portal searches for all synonyms and accepted names matching your search term and provides a suggestion list with detailed information about the name found in the GGBN backbone.
After clicking on "Search" you will receive a hitlist with 50 records per page. The column headings contain small arrows. Clicking on such an arrow sorts the results by that column. The green arrow marks the current order (in the figure, ordered by species name from A to Z). The hitlist contains the species name, the country where the specimen/sample was collected, the DNA number and the specimen/voucher number. Clicking on the small magnifier or the species name will give you the record details.
+
[[File:GGBN data portal suggestion list.jpg|center|500px]]
  
<div id="wikinote">Some samples doesn't have a DNA number. This means that DNA was not extracted yet but can be ordered on-demand.</div>
+
<div id="wikinote">You can search for any scientific name using "Scientific Name", including higher taxa.</div>
  
[[File:Hitlist-portal.JPG|600px]]
+
==Edit your search==
 +
Your results are displayed in a hitlist. You can change the filters at any time. Just select further parameters from "add search field" or delete some using the red cross. To see the new results click on "Refine search". You can also change the order of the columns by clicking on the little arrows. To see the details of a record click on the blue scientific name.
 +
[[File:GGBN data portal hitlist.jpg|center|500px]]
  
==Retrieving details==
+
==Record detail==
The single record details page is a synoptic virtual dataset combined from at least 2 different sources (DNA sample and voucher). The upper part contains information on the physical DNA sample as well as links to published sequences and papers. The lower part contains the underlying specimen information, sometimes associated vouchers are also listed (e.g. duplum in another collection, cultivated/captivated individuals). GGBN encourages its partners to provide digital images of the vouchers as well. Due to several reasons (e.g. size of the organism) this can not always be realized easily.
+
The record details page aggregates data from multiple sources. Here you see an example with a DNA sample, a tissue sample and a specimen. These data come from up to three different data sources, depending on where the samples and data are deposited. At the top you find information about loan availability and conditions. The page also shows whether the taxon is listed on CITES. To the left of the map you find collecting information and determination details. In the lower part you see blue tabs with information about the physical samples and where to find them.
  
[[File:RecordDetail.jpg|600px]]
+
At the top right, information about the taxon is retrieved live from external sources such as GBIF, NCBI, BOLD and EOL. In addition you see how many further samples for this taxon can be found at GGBN.
  
==Pre-order samples==
+
On the left you find information on samples at GGBN that are from the same population or the same individual as this one.
Login is required for pre-ordering samples. When you are logged in you can select the required samples (checkboxes on the right) and click on "Add selected DNA/Tissue samples to shopping cart" on top right.
 
Some samples are blocked for the ordering process and marked with a big red X.
 
[[File:Hitlist-portal-shop.JPG|600px]]
 
===Shopping Cart===
 
After you put samples into the shopping cart, the message "Your shopping cart contains xy samples" appears at the top of the website. Clicking on "Show details" opens your shopping cart and guides you through the pre-ordering process.
 
[[File:ShoppingCart1.JPG|600px]]
 
  
The Shopping cart gives you an overview of your selected DNA/tissue samples. Here you can delete samples from your cart or by clicking on the taxon name see the record details again. Click on "Continue Pre-Order (->Step 2/3)" to continue.
+
In case sequences, publications or multimedia items are provided, further tabs will appear.
[[File:ShoppingCart2.JPG|600px]]
+
[[File:GGBN data portal add record detail.jpg|center|700px]]
  
The next step groups the samples by institution/DNA bank. In the example below you see that the selected samples are deposited at BGBM and DSMZ. You can check your invoice and delivery address and add some notes if you want. When you click on "Finish Pre-Order (->Step 3/3)", your pre-order will be forwarded to the DNA bank(s) responsible for the requested samples. Every DNA bank only receives its relevant order information. Subsequently a confirmation email will be sent to you by the DNA bank(s) in question. An offer including binding prices will then be made in a separate email.
+
==Preorder samples/Login feature==
 +
[[File:GGBN data portal login.jpg|thumb|200px|login feature]]
 +
To preorder samples or subscribe to searches you must register as a user. To do so, click on "log in" or the little human icon in the menu. We appreciate it if you fill out the complete contact information, since these data can then be forwarded to the sample-holding institution, but this is not mandatory.
  
[[File:ShoppingCart3.jpg|600px]]
+
<div id="wikinote">Your orders will be forwarded to the respective institution holding the requested samples. Please check our [[Data_Privacy | Data Privacy Statement]] for more information about storage of user data. If you don't want to register as a user you can also send us an email at info@ggbn.org.</div>
 +
[[File:GGBN data portal menu login.jpg|thumb|right|100px]]
 +
After login a menu will appear under the human icon.
  
=Search by Citation [http://library.ggbn.org]=
+
'''Profile''' Change your personal information here.
The prototype of the GGBN Document Library contains more than 4400 papers citing either DNA and Tissue samples or vouchers that are provided via the GGBN/DNA Bank Network Data Portal. Therefore a lot of non-molecular papers can be found here. The publications are provided in citation format with every voucher and DNA record. You can search for a DNA or Specimen Number, a certain DNA Bank, publication years or free text within citations.
 
[[File:Library1.JPG|600px]]
 
  
The resulting hitlist contains 50 records per page and is ordered by publication year (ascending). By clicking on a DNA number you will be directed to the record details.
+
'''Settings''' Personal settings for the hitlist can be defined here.
[[File:Library2.JPG|600px]]
 
  
 +
'''Subscription, Save Searches''' When logged in, the hitlist shows an additional column for adding samples to the cart. If a sample is not available for loan for some reason, an 'x' is shown instead. At the top right, buttons appear to subscribe to this search (and be informed via email when new records are available), to save this search, or to add selected samples to the cart. You can also add a sample to the cart via the details page.
 +
[[File:GGBN data portal hitlist logged in.jpg|center|700px]]
  
=Technical informations=
+
'''Shopping Cart''' If you have added samples to your cart you can go to "View cart" or "Shopping Cart" via the menu or the buttons on the right. In step 1 you see an overview of the requested samples and, for CITES taxa, a repeated note. Please make sure you belong to an institution registered with CITES; otherwise you cannot borrow such samples. Go to "Checkout" to proceed.
==1: harvesting and indexing==
+
[[File:GGBN data portal shopping cart step1.jpg|center|700px]]
Data is harvested and indexed using B-HIT (http://wiki.bgbm.org/bhit/). B-HIT is a harvesting and indexing toolkit, based on the GBIF-HIT software.<BR>
 
[[File:bhit.png|500px]]
 
  
The records(or units) can be harvested from providers having either a BioCASe or an IPT installation. For BioCASe providers, the schema ABCD 2.06, ABCD 2.1, ABCDDNA, ABCDGGBN and ABCDEFG are supported (single records or ABCD Archives). For IPT providers, DarwinCore Archives are supported. Elements that are indexed are listed under http://wiki.bgbm.org/bhit/index.php/Indexed_fields.
+
In step 2 the samples are grouped by holding institution. In this example we preorder at two different institutions. You can add a comment to them if you want. When clicking "Order now" your preorder is placed. GGBN forwards your request to the holding institutions. We do not forward your complete order, but only sample information relevant for the sample holding collection.
 +
[[File:GGBN data portal shopping cart step2.jpg|center|700px]]
 +
<div id="wikinote">'''Note: You can only preorder samples. It might be that the samples cannot be loaned to you for some reasons. The curator will contact you and provide details about further procedure. Every GGBN partner is responsible for its samples and procedures. Some partners may require a service charge. In any case you have to sign a Material Transfer Agreement before samples can be loaned. The curator will provide you more details about it.</div>
  
==2: data quality==
+
=Browse the tree of life=
Data quality tests are done using B-HIT. Country names are translated into English, ISO codes are compared to the country names, and coordinates are validated and checked against both ISO code and country name. In case of incomplete data, the tool looks into the named areas and localities and tries to extract information about the country or the water body.
+
Coming soon.
  
Scientific names are parsed using the GBIF Name Parser (http://www.gbif.org/developer/species#parser) and custom regular expressions.
+
=Browse online collections=
 +
Coming soon.
  
==3: backbone==
+
=Statistics of GGBN online collections=
Every single name is compared to the GBIF Backbone, the Catalog of Life Backbone, and the NCBI Backbone. Higher taxa, synonyms and accepted taxa are retrieved, using the GBIF Checklistbank webservice (http://api.gbif.org/v1/species).
+
Coming soon.
 
 
The "Prokaryotic Nomenclature Up-to-Date API" (http://bacdive.dsmz.de/api/pnu/) is used for bacterial names.
 
 
 
==4: indexation with SOLR==
 

Latest revision as of 09:12, 19 October 2016

Background

three search options to choose

We use the Berlin Harvesting and Indexing Toolkit (B-HIT) to harvest GGBN provider data. The records (or units) can be harvested from providers having either a BioCASe or an IPT installation. For BioCASe providers, the schemata ABCD 2.06, ABCD 2.1, ABCDDNA, ABCDGGBN and ABCDEFG are supported (single records or ABCD Archives). For IPT providers, DarwinCore Archives are supported, including the GGBN extensions. Elements that are indexed are listed at http://wiki.bgbm.org/bhit/index.php/Indexed_fields.

statistic feature can be found under the freezer icon

The search features can be found under the magnifying glass icon. You can use three options to search within the GGBN Data Portal:

  • Search by fields
  • Browse the tree of life
  • Browse collections

Furthermore you can check out statistics of GGBN online collections.

Data Quality and Data Cleaning

During harvesting, GGBN provider data are checked and cleaned if necessary. We keep the original provider data in addition to the cleaned versions. Data quality tests are done using B-HIT. Country names are translated into English, ISO codes are compared to the country names, and coordinates are validated and checked against both ISO code and country name. In case of incomplete data, the tool looks into the named areas and localities and tries to extract information about the country or the water body.
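
The checks described above can be pictured with a minimal Python sketch. It is not B-HIT code: the function name, the two lookup tables and the rough bounding boxes are hypothetical and only illustrate the idea of comparing ISO code, country name and coordinates.

```python
# Hypothetical lookup tables; a real check would rely on a complete gazetteer.
COUNTRY_BY_ISO = {"DE": "Germany", "GB": "United Kingdom"}
# Very rough bounding boxes (min_lat, max_lat, min_lon, max_lon), illustrative only.
BBOX_BY_ISO = {"DE": (47.0, 55.5, 5.5, 15.5), "GB": (49.5, 61.0, -8.7, 2.0)}


def check_record(iso_code, country_name, lat, lon):
    """Return a list of data quality issues found for one record."""
    issues = []

    # 1. Coordinates must be valid latitude/longitude values.
    in_range = -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0
    if not in_range:
        issues.append("coordinates out of range")

    # 2. The ISO code must agree with the (English) country name.
    expected = COUNTRY_BY_ISO.get(iso_code)
    if expected is None:
        issues.append("unknown ISO code")
    elif expected.lower() != country_name.strip().lower():
        issues.append("ISO code does not match country name")

    # 3. Valid coordinates should fall inside the country given by the ISO code.
    bbox = BBOX_BY_ISO.get(iso_code)
    if in_range and bbox is not None:
        min_lat, max_lat, min_lon, max_lon = bbox
        if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            issues.append("coordinates outside country bounding box")

    return issues
```

A record with a matching ISO code, country name and position yields an empty list; each mismatch adds one entry, so the original provider values can be kept next to the flagged issues.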

Scientific names are parsed using the GBIF Name Parser (http://www.gbif.org/developer/species#parser) and customized regular expressions.
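
To give an idea of what a customized regular expression for name parsing might look like, here is a small Python sketch. The pattern and the function name are assumptions made for illustration; they are not the expressions used in B-HIT and they only cover simple binomials with optional authorship (no trinomials, hybrids or cultivars).

```python
import re

# Illustrative pattern: capitalised genus, lowercase epithet, optional authorship.
BINOMIAL = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)\s+"
    r"(?P<epithet>[a-z]+(?:-[a-z]+)?)"
    r"(?:\s+(?P<authorship>.+))?$"
)


def parse_name(raw):
    """Split a simple binomial into its parts; return None if it does not match."""
    match = BINOMIAL.match(raw.strip())
    if match is None:
        return None
    # Drop groups that did not take part in the match (e.g. missing authorship).
    return {key: value for key, value in match.groupdict().items() if value is not None}
```

Names that the pattern cannot handle return None, which in a real pipeline would be the point where a more complete parser such as the GBIF Name Parser takes over.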

Taxonomic Backbone

After harvesting, the scientific names are matched against certain checklists of the GBIF checklist bank. Higher taxa, synonyms and accepted taxa are retrieved, also using the GBIF checklist bank webservice (http://api.gbif.org/v1/species). These checklists include Catalogue of Life, NCBI and the GBIF backbone itself. In addition we match the names against the Prokaryotic Nomenclature Up-to-Date (PNU) web service, provided by the DSMZ.
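
As a rough illustration of a lookup against the webservice mentioned above, the following Python sketch only builds a request URL. The base URL is the one quoted in the text; the use of the `name` and `datasetKey` query parameters is an assumption about how a single checklist could be addressed, and the function name is hypothetical. No request is sent.

```python
from urllib.parse import urlencode

# Base URL as quoted in the text above.
GBIF_SPECIES_URL = "http://api.gbif.org/v1/species"


def species_lookup_url(name, dataset_key=None):
    """Build a lookup URL for one scientific name (assumed query parameters)."""
    params = {"name": name}
    if dataset_key is not None:
        # Restrict the lookup to one checklist, identified by its dataset key.
        params["datasetKey"] = dataset_key
    return GBIF_SPECIES_URL + "?" + urlencode(params)
```

Calling the function once per checklist would give one URL for each of the backbones listed above; the dataset keys themselves have to be looked up at GBIF and are not given here.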

Search

http://data.ggbn.org/ggbn_portal/search/index

The Data Portal is based on SOLR, which provides powerful full text search. We have implemented this feature in all search fields, apart from select lists, checkboxes and radio buttons. The search is case insensitive, so e.g. “black sea”, “Black Sea” or “Black sea” will all work.

Note: Boolean operators must be in capital letters! You can combine as many terms and operators as you want. Wildcards (*) can be placed anywhere.

Use of boolean operators like NOT, OR, AND

{|class="wikitable"
!Example !! Operator !! Remarks
|-
|corvus AND alba
alternatively: corvus && alba
|AND
|Both "Corvus" and "alba" must appear somewhere
|-
|"Corvus alba"
|""
|The whole phrase will be searched
|-
|Corvus OR Coloeus
alternatively: Corvus || Coloeus
|OR
|Searches for records containing "Corvus" or "Coloeus" or both
|-
|Corvus alba
|
|The default operator performed here is AND
|-
|Corvus NOT corone
alternatively: Corvus -corone
|NOT
|Searches for all "Corvus" records, but excludes all containing "corone"
|-
|Cor*
|
|Searches for records beginning with Cor
|-
|c?rvus
|
|Will find Cervus and Corvus
|-
|C*vus
|
|Will find Cervus, Corvus and e.g. Campylobacter curvus
|-
|*amrock
|
|Will find Gallus gallus domesticus (Gmelin, 1789) 'Amrock'
|-
|(Ontario OR Ohio OR pennsylvania) AND NOT "Lake Erie"
|
|Combine as many operators and terms as you want
|}
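
If you assemble such search strings in a script, for example to paste them into a search field, a few small helpers keep the operators in capital letters and the phrases quoted. The following Python sketch is only an illustration; the function names are hypothetical and not part of the portal.

```python
def quote_phrase(term):
    """Wrap multi-word terms in double quotes so they are searched as a phrase."""
    return '"' + term + '"' if " " in term else term


def combine(operator, terms):
    """Join terms with AND or OR; the operator is always written in capitals."""
    op = operator.upper()
    if op not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    terms = list(terms)
    joined = (" " + op + " ").join(quote_phrase(term) for term in terms)
    # Parentheses keep the group together when it is combined further.
    return "(" + joined + ")" if len(terms) > 1 else joined


def exclude(query, term):
    """Append an exclusion to an existing query."""
    return query + " AND NOT " + quote_phrase(term)
```

For example, `exclude(combine("or", ["Ontario", "Ohio", "pennsylvania"]), "Lake Erie")` produces the last query of the table above.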

Search by fields

GGBN data portal search form.jpg

Here you can choose different parameters to filter your results. The upper part contains parameters often used by researchers and curators. In addition you can add further parameters (click on "add search field"). We distinguish between GGBN repositories (DNA and tissue banks) and voucher collections. The latter can also be non-GGBN institutions.

select additional search parameters
the field will appear (here Ocean) immediately. It can be deleted by clicking on the red cross

Most of the fields are drop down lists or include suggestion lists to help you. E.g. when typing a name the portal searches for all synonyms and accepted names matching your search term and provides a suggestion list with detailed information about the name found in the GGBN backbone.

GGBN data portal suggestion list.jpg
You can search for any scientific name using "Scientific Name", including higher taxa.

Edit your search

Your results are displayed in a hitlist. You can change the filters at any time. Just select further parameters from "add search field" or delete some using the red cross. To see the new results click on "Refine search". You can also change the order of the columns by clicking on the little arrows. To see the details of a record click on the blue scientific name.

GGBN data portal hitlist.jpg

Record detail

The record details page aggregates data from multiple sources. Here you see an example with a DNA sample, a tissue sample and a specimen. These data come from up to three different data sources, depending on where the samples and data are deposited. At the top you find information about loan availability and conditions. The page also shows whether the taxon is listed on CITES. To the left of the map you find collecting information and determination details. In the lower part you see blue tabs with information about the physical samples and where to find them.

At the top right, information about the taxon is retrieved live from external sources such as GBIF, NCBI, BOLD and EOL. In addition you see how many further samples for this taxon can be found at GGBN.

On the left you find information on samples at GGBN that are from the same population or the same individual as this one.

In case sequences, publications or multimedia items are provided, further tabs will appear.

GGBN data portal add record detail.jpg

Preorder samples/Login feature

login feature

To preorder samples or subscribe to searches you must register as a user. To do so, click on "log in" or the little human icon in the menu. We appreciate it if you fill out the complete contact information, since these data can then be forwarded to the sample-holding institution, but this is not mandatory.

Your orders will be forwarded to the respective institution holding the requested samples. Please check our Data Privacy Statement for more information about storage of user data. If you don't want to register as a user you can also send us an email at info@ggbn.org.
GGBN data portal menu login.jpg

After login a menu will appear under the human icon.

Profile Change your personal information here.

Settings Personal settings for the hitlist can be defined here.

Subscription, Save Searches When logged in, the hitlist shows an additional column for adding samples to the cart. If a sample is not available for loan for some reason, an 'x' is shown instead. At the top right, buttons appear to subscribe to this search (and be informed via email when new records are available), to save this search, or to add selected samples to the cart. You can also add a sample to the cart via the details page.

GGBN data portal hitlist logged in.jpg

Shopping Cart If you have added samples to your cart you can go to "View cart" or "Shopping Cart" via the menu or the buttons on the right. In step 1 you see an overview of the requested samples and, for CITES taxa, a repeated note. Please make sure you belong to an institution registered with CITES; otherwise you cannot borrow such samples. Go to "Checkout" to proceed.

GGBN data portal shopping cart step1.jpg

In step 2 the samples are grouped by holding institution. In this example we preorder at two different institutions. You can add a comment to them if you want. When clicking "Order now" your preorder is placed. GGBN forwards your request to the holding institutions. We do not forward your complete order, but only sample information relevant for the sample holding collection.
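
The grouping in step 2 can be pictured as follows. This Python sketch is not the portal's implementation; the field names and sample identifiers are hypothetical and only show how one cart can be split so that each holding institution sees only its own samples.

```python
from collections import defaultdict


def split_order(samples):
    """Group the sample IDs of one cart by holding institution."""
    per_institution = defaultdict(list)
    for sample in samples:
        # Each institution only receives the IDs of the samples it holds.
        per_institution[sample["institution"]].append(sample["sample_id"])
    return dict(per_institution)


# Hypothetical cart with samples held by two different institutions.
cart = [
    {"sample_id": "DNA-1", "institution": "BGBM"},
    {"sample_id": "DNA-2", "institution": "DSMZ"},
    {"sample_id": "TIS-3", "institution": "BGBM"},
]
orders = split_order(cart)
```

Here `orders` contains one entry per institution, which corresponds to one forwarded request each.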

GGBN data portal shopping cart step2.jpg
Note: You can only preorder samples. It may be that the samples cannot be loaned to you for some reason. The curator will contact you and provide details about the further procedure. Every GGBN partner is responsible for its own samples and procedures. Some partners may require a service charge. In any case you have to sign a Material Transfer Agreement before samples can be loaned. The curator will provide you with more details about it.

Browse the tree of life

Coming soon.

Browse online collections

Coming soon.

Statistics of GGBN online collections

Coming soon.