Difference between revisions of "Data Portal Architecture"

From GGBN Wiki
m (WikiSysop moved page Data flow to Data architecture)
Line 1: Line 1:
=Level 1=
+
=Overview=
<div id="wikinote">Within GGBN, specimen data are retrieved through the same data pipelines that [http://www.gbif.org/ GBIF] uses.</div>
+
The data architecture of the GGBN is based on the [http://www.gbif.org/ GBIF] infrastructure. The basic principle of GBIF, as well as of the GGBN, is to record each data set only once. Stored in only one place, the data can be used as a linked reference by different applications. The GGBN Data Portal bridges the gap between sequence portals and GBIF (see Figure). More information about the GGBN Data Portal can be found in [http://nar.oxfordjournals.org/content/42/D1/D607 Droege et al. 2014 Nucl Acids Res.]
  
The data architecture of the GGBN is based on the [http://www.gbif.org/ GBIF] infrastructure. The basic principle of GBIF, as well as of the GGBN, is to record each data set only once. Stored in only one place, the data can be used as a linked reference by different applications.
+
[[File:GGBN Portal.jpg|center|700px]]
  
Because the many institutions that have joined GBIF each use a different database structure, installing wrappers has become the standard way to combine different sources and integrate data easily into networks. Three main wrapper software packages are available: [http://www.biocase.org/ BioCASE], [http://digir.sourceforge.net/ DiGIR] and [http://www.tdwg.org/activities/tapir/ TAPIR]. All of them use an XML schema for data transfer: [http://www.biocase.org/ BioCASE] uses [http://www.bgbm.org/tdwg/codata/schema/ ABCD], while [http://digir.sourceforge.net/ DiGIR] and [http://www.tdwg.org/activities/tapir/ TAPIR] use [http://wiki.tdwg.org/twiki/bin/view/DarwinCore/WebHome DarwinCore (DwC)].
+
=Data flow=
 
+
<div id="wikinote">Within GGBN, specimen data are retrieved through the same data pipelines that [http://www.gbif.org/ GBIF] uses.</div>
=Level 2=
 
<div id="wikinote">For DNA data management, the open source software "DNA Module" was developed at the BGBM. It is also possible to use your own database system.</div>
 
 
 
The [[DNA_Module | DNA Module]] is one of the key components of the network's database system. To find the specimen data related to a DNA sample, the module sends a query to the respective specimen database via BioCASe or DiGIR. A copy containing a few specimen attributes is also stored in the DNA cache to speed up queries. Because the module follows the BioCASe and DiGIR protocols, any GBIF-compliant specimen database worldwide can be connected.
 
 
 
The DNA Module is currently used by three of the four project partners, each in combination with its own specimen database. The DSMZ in Braunschweig uses its own system for DNA data input.
 
  
=Level 3=
+
Because the many institutions that have joined GBIF each use a different database structure, installing wrappers has become the standard way to combine different sources and integrate data easily into networks. Two main wrapper software packages are available: [http://www.biocase.org/ BioCASE] and [http://www.gbif.org/ipt IPT]. GGBN has developed the [http://terms.tdwg.org/wiki/GGBN_Data_Standard GGBN Data Standard] to share DNA and tissue data via GGBN. This standard is meant to be used together with ABCD or DarwinCore.
<div id="wikinote">A DNA extension for ABCD was developed to transfer DNA data into the web portal of the DNA Bank Network. The BioCASE Provider Software is therefore required.</div>
 
  
A separate BioCASE wrapper using the new [[ABCDDNA | DNA extension for ABCD]] has been installed on each of the three DNA Modules and on the database in Braunschweig, so that all DNA samples and their related specimen data are offered on the [http://www.dnabank-network.org central webportal].
+
[[File:GGBN Portal Architecture.jpg|center|700px]]
 +
'''General data architecture of the GGBN Data Portal.''' Specimen and DNA sample databases (on top left) are operated by the Network partners. Their data content is structured and provided using BioCASe or IPT together with the GGBN Data Standard extensions. The Berlin Harvesting and Indexing Toolkit is used to harvest GGBN data and store them in a MySQL database. In addition, a SOLR instance is used to speed up queries. After harvesting, the data are cleaned and enriched, e.g. by matching against certain datasets of the GBIF checklist bank. Finally, the data from multiple sources are aggregated in the portal for display.
  
The source code of the DNA Bank Network's Webportal is available under Mozilla Public License Version 1.1 at
+
=DNA Bank databases=
http://ww2.biocase.org/svn/dnabank/DNA_Bank_Network/webportal/
+
You can use any database system to manage your DNA bank. IPT and BioCASe can handle most of them. Please check [[Mandatory_and_recommended_fields_for_sharing_data_with_GGBN | which parameters]] are required to share data via GGBN.
  
[[File:Dataflow-Grafik.jpg|650px|thumb]]
+
The [[DNA_Module | DNA Module]] has been developed as an open source solution for administering a DNA and tissue bank. A new version is currently planned. Some of our partners are already using it, but you can use any suitable software.
'''General data architecture of the DNA Bank Network.''' Specimen and DNA sample databases (on top and middle) are operated by the Network partners. Their data content is structured and transferred to the shared web portal (black and green arrows) by wrappers (BioCASe, DiGIR, grey boxes). Publications and online accessible DNA sequence data (blue arrows) can be linked to the related DNA sample. The [http://www.catalogueoflife.org Catalogue of Life] checklist is used as the search backbone in the web portal (red arrow).
 

Revision as of 15:13, 16 December 2015

Overview

The data architecture of the GGBN is based on the GBIF infrastructure. The basic principle of GBIF, as well as of the GGBN, is to record each data set only once. Stored in only one place, the data can be used as a linked reference by different applications. The GGBN Data Portal bridges the gap between sequence portals and GBIF (see Figure). More information about the GGBN Data Portal can be found in Droege et al. 2014 Nucl Acids Res.

GGBN Portal.jpg

Data flow

Within GGBN, specimen data are retrieved through the same data pipelines that GBIF uses.

Because the many institutions that have joined GBIF each use a different database structure, installing wrappers has become the standard way to combine different sources and integrate data easily into networks. Two main wrapper software packages are available: BioCASE and IPT. GGBN has developed the GGBN Data Standard to share DNA and tissue data via GGBN. This standard is meant to be used together with ABCD or DarwinCore.
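The wrapper idea can be illustrated with a minimal Python sketch. It is not the BioCASe or IPT implementation: the provider names, local column names and sample rows are hypothetical, and only the target names (catalogNumber, scientificName, institutionCode) are actual DarwinCore terms. Each provider keeps its own database structure and a mapping translates its rows into the shared vocabulary.

```python
# Hypothetical column-to-term mappings for two providers. The local column
# names are invented for illustration; the targets are DarwinCore terms.
FIELD_MAPS = {
    "provider_a": {
        "acc_no": "catalogNumber",
        "taxon": "scientificName",
        "inst": "institutionCode",
    },
    "provider_b": {
        "SpecimenID": "catalogNumber",
        "FullName": "scientificName",
        "Holder": "institutionCode",
    },
}


def wrap_record(provider, row):
    """Translate one local database row into the shared vocabulary."""
    mapping = FIELD_MAPS[provider]
    return {shared: row[local] for local, shared in mapping.items() if local in row}


# Two made-up rows with different local structures.
row_a = {"acc_no": "A-001", "taxon": "Bellis perennis", "inst": "INST-A"}
row_b = {"SpecimenID": "B-042", "FullName": "Quercus robur", "Holder": "INST-B"}

shared = [wrap_record("provider_a", row_a), wrap_record("provider_b", row_b)]
for record in shared:
    print(record)
```

After wrapping, both records expose the same keys, so a network portal can query them without knowing the underlying database layout.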

GGBN Portal Architecture.jpg

General data architecture of the GGBN Data Portal. Specimen and DNA sample databases (on top left) are operated by the Network partners. Their data content is structured and provided using BioCASe or IPT together with the GGBN Data Standard extensions. The Berlin Harvesting and Indexing Toolkit is used to harvest GGBN data and store them in a MySQL database. In addition, a SOLR instance is used to speed up queries. After harvesting, the data are cleaned and enriched, e.g. by matching against certain datasets of the GBIF checklist bank. Finally, the data from multiple sources are aggregated in the portal for display.
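The harvest, clean, enrich and aggregate steps described in the caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the portal's code: the harvested records, the field names other than scientificName, and the small CHECKLIST dictionary are hypothetical in-memory stand-ins for the Berlin Harvesting and Indexing Toolkit output, the MySQL store and the GBIF checklist bank.

```python
from collections import defaultdict

# Hypothetical stand-in for a taxonomic checklist:
# lower-cased name -> accepted name.
CHECKLIST = {"bellis perennis": "Bellis perennis"}


def clean(record):
    """Strip whitespace from string values and drop empty fields."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        if value not in ("", None):
            cleaned[key] = value
    return cleaned


def enrich(record, checklist):
    """Add the accepted name from the checklist, if the name matches."""
    name = record.get("scientificName", "")
    match = checklist.get(name.lower())
    return {**record, "acceptedName": match, "nameMatched": match is not None}


def aggregate(records):
    """Group records from all sources by accepted (or original) name."""
    by_name = defaultdict(list)
    for record in records:
        key = record["acceptedName"] or record.get("scientificName")
        by_name[key].append(record)
    return dict(by_name)


# Made-up harvested records from two hypothetical providers.
harvested = [
    {"source": "provider_a", "scientificName": " Bellis perennis ", "sampleType": "DNA"},
    {"source": "provider_b", "scientificName": "bellis perennis", "sampleType": "tissue", "notes": ""},
    {"source": "provider_b", "scientificName": "Unknownus examplus", "sampleType": "DNA"},
]

portal_view = aggregate(enrich(clean(r), CHECKLIST) for r in harvested)
for name, records in portal_view.items():
    print(name, len(records))
```

Records whose names match the checklist are grouped under the accepted name regardless of provider or spelling variant, while unmatched names stay visible under their original name and are flagged with nameMatched set to False.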

DNA Bank databases

You can use any database system to manage your DNA bank. IPT and BioCASe can handle most of them. Please check which parameters are required to share data via GGBN (see "Mandatory and recommended fields for sharing data with GGBN").

The DNA Module has been developed as an open source solution for administering a DNA and tissue bank. A new version is currently planned. Some of our partners are already using it, but you can use any suitable software.