What kind of data are available?
The data comprise records of the occurrence of a species or higher taxon (sometimes animals and plants can only be identified to genus or family) at a place on a particular date. Most records include latitude and longitude coordinates. An increasing number of records have additional information, such as who collected and identified the species, its depth or altitude, and other associated information.
OBIS has data on about half of the 240,000 named marine species, and GBIF probably has records of over half of all named species on Earth. There are about 50 million records in OBIS and 1 billion in GBIF. Some records go back centuries, but most data are from recent decades, with a time lag in data entry of about five years.
The records come from specimens and field observations, and cover all kinds of animals, plants and microbes.
For marine species, should I use OBIS or GBIF?
If searching for a marine taxon, use both. Although many datasets are published in both, some are only in one, and many largely terrestrial datasets may have some useful marine records (e.g. herbarium or museum collections).
Because GBIF and OBIS use the same data standards and formats, you can merge the downloaded files. First delete duplicate datasets by searching on dataset name or ID and removing older versions. Then sort the records, check whether any appear to be duplicates, and delete those. Removing duplicates reduces the size of your dataset, which enables faster analyses.
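The merge and de-duplication steps above could be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a prescribed workflow: it assumes both downloads have already been read into lists of dictionaries keyed by the Darwin Core column names scientificName, decimalLatitude, decimalLongitude and eventDate. The "source" column and the rounding of coordinates to four decimals are illustrative choices, not part of either database.

```python
def merge_records(obis_rows, gbif_rows):
    """Concatenate two lists of record dicts, tagging each with its source."""
    merged = []
    for source, rows in (("OBIS", obis_rows), ("GBIF", gbif_rows)):
        for row in rows:
            tagged = dict(row)
            tagged["source"] = source  # hypothetical helper column, not Darwin Core
            merged.append(tagged)
    return merged


def record_key(row, decimals=4):
    """Key on species, rounded coordinates and date to flag likely duplicates."""
    return (
        row["scientificName"].strip().lower(),
        round(float(row["decimalLatitude"]), decimals),
        round(float(row["decimalLongitude"]), decimals),
        row["eventDate"][:10],  # keep only the YYYY-MM-DD part
    )


def drop_duplicates(rows):
    """Keep the first record seen for each key."""
    seen = set()
    unique = []
    for row in rows:
        key = record_key(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up example records; a real download has many more columns.
obis = [
    {"scientificName": "Gadus morhua", "decimalLatitude": "60.1",
     "decimalLongitude": "5.2", "eventDate": "2010-06-01"},
]
gbif = [
    {"scientificName": "Gadus morhua", "decimalLatitude": "60.10000",
     "decimalLongitude": "5.2", "eventDate": "2010-06-01T12:00:00"},
    {"scientificName": "Gadus morhua", "decimalLatitude": "61.0",
     "decimalLongitude": "4.9", "eventDate": "2011-07-15"},
]

merged = merge_records(obis, gbif)
cleaned = drop_duplicates(merged)
print(len(merged), "records merged;", len(cleaned), "after removing duplicates")
```

Matching on these four fields only flags likely duplicates, so it is worth inspecting what was removed before relying on the result.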
What format is the data in?
You can download the data as a standard tabular data sheet. It is based on the international standard “Darwin Core” format used by GBIF, OBIS and other biodiversity databases, which makes it easy to integrate datasets.
You should become familiar with this data format if you wish to publish your own data at some point.
What are datasets?
OBIS and GBIF are compilations of thousands of datasets. Datasets can be anything from annual fishery research trawl data, plankton net surveys, benthic samples, bird counts and satellite-tracked paths of whales and turtles, to citizen science records and museum specimen collections. Think of GBIF and OBIS as journals containing many ‘papers’. Thus you need to cite each dataset used, as you would papers in a journal.
Why must I cite the datasets?
It is a condition of use of almost all datasets that you cite them if you use them. Citing OBIS or GBIF is not sufficient (it is like citing a journal and not a paper published in it). In fact, you are breaking the conditions of use if you do not cite the datasets used; to put it more bluntly, it is illegal to use the data if you do not cite the datasets.
Note which datasets you are using by copying the citations and DOI (a unique code) of datasets. If you used only a few datasets then cite them in the references with other publications; if tens or hundreds then cite them in an Appendix.
You will notice that many datasets do not provide a sensible citation. Some even say “cite this dataset”. Some cite a source print publication. However, an increasing number now provide a conventional author-year-title-source citation, to which you add the date accessed. You must add the date accessed because datasets may be amended over time and so may exist in multiple versions.
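One way to keep track of what must be cited is to tally the downloaded records by dataset and attach the access date to each citation. The sketch below assumes each record carries a datasetName column (a Darwin Core term) and that the citation text has been copied by hand from each dataset page. The dataset names, citations and date shown are placeholders.

```python
from collections import Counter
from datetime import date


def datasets_used(rows):
    """Count how many records came from each dataset."""
    return Counter(row["datasetName"] for row in rows)


def format_citation(citation, accessed):
    """Append the access date to a citation copied from the dataset page."""
    return f"{citation} Accessed {accessed.isoformat()}."


# Made-up example records.
rows = [
    {"datasetName": "Example trawl survey", "scientificName": "Gadus morhua"},
    {"datasetName": "Example trawl survey", "scientificName": "Pollachius virens"},
    {"datasetName": "Example museum collection", "scientificName": "Gadus morhua"},
]

# Placeholder citations: replace with the text given on each dataset page.
citations = {
    "Example trawl survey": "Author A (2015). Example trawl survey. Publisher X.",
    "Example museum collection": "Museum B (2018). Example museum collection.",
}

accessed = date(2020, 1, 31)  # the day the data were downloaded
for name, n_records in sorted(datasets_used(rows).items()):
    print(n_records, "records:", format_citation(citations[name], accessed))
```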
What should I do next to learn more?
Download some data. Select your taxon or geographic area of interest, for example a species or higher taxonomic level, or a country or other geographic region. Look at the information on the web page: how many records and datasets are there, and does the coverage on the map look sufficient for your purpose? If it looks potentially useful, then download the data file.
Should I make available the data I used?
Probably yes, because you have selected particular data for your purpose. This compilation of data is unique to your research. To enable your work to be reproducible you should make the data you used available on an open access archive, e.g., Figshare. Do this when or before you publish the results of your analyses.
Do not just publish your data as a PDF; use the same standard (comma separated values file, “.csv”) format you received it in so other people can more easily re-use your data for their purpose. Then they are likely to cite your publication.
If you added data to those you obtained from GBIF or OBIS, such as from the literature, your own field records, or other unpublished sources, then include these in the dataset you publish. Each row of the datasheet is a ‘record’ and should note its origin. If these data are not already in GBIF and OBIS then send them to one of their nodes to publish on your behalf.
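As an illustration of keeping the origin of every record when combining downloaded data with your own, the following sketch writes both to a single comma separated values file using Python's standard csv module. The "origin" column name and the example records are assumptions made for this example; they are not Darwin Core terms or real data.

```python
import csv
import io

FIELDS = ["scientificName", "decimalLatitude", "decimalLongitude",
          "eventDate", "origin"]  # "origin" is a hypothetical extra column


def write_records(rows, handle):
    """Write records as comma separated values, one record per row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


downloaded = [
    {"scientificName": "Gadus morhua", "decimalLatitude": 60.1,
     "decimalLongitude": 5.2, "eventDate": "2010-06-01", "origin": "OBIS"},
]
own_field_records = [
    {"scientificName": "Gadus morhua", "decimalLatitude": 60.3,
     "decimalLongitude": 5.0, "eventDate": "2019-08-12",
     "origin": "field record, this study"},
]

# An in-memory buffer keeps the example self-contained; for a real file use
# open("my_records.csv", "w", newline="") instead.
buffer = io.StringIO()
write_records(downloaded + own_field_records, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```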
What quality assurance checks should I do?
GBIF and OBIS increasingly provide indicators of completeness and other quality assurance checks on their data. However, you need to do your own because only you know what data are suitable for your purpose. The following checks are recommended:
- Remove duplicate datasets. Both old and new versions of a dataset may occur in GBIF.
- Remove duplicate records. The same records can get published through more than one dataset. If a record has the same species, latitude, longitude and collection date as another, it is likely a duplicate.
- Check taxonomic nomenclature. For marine species you can use the ‘taxon match’ tool on WoRMS or LifeWatch to check which names are synonyms and to organise the names in a standard taxonomic classification. Names that are not ‘matched’ may be misspelt or mistaken names, or not marine taxa. For non-marine taxa, the best source for checking names is the Catalogue of Life.
- Check temporal resolution. Do you want to use all records over all time, or only recent ones?
- Check spatial resolution. There are fields in the data sheet for the geographic precision of each record. You have to decide whether to accept only records with a particular geographic accuracy (e.g., within 10 km), and what to do with records that have no data on precision.
- Map the data points. Do some look like outliers? Check their metadata and source. Does the place name match the latitude and longitude coordinates? If the point seems questionable you may decide to omit it from your analysis.
- Do points for marine species appear on land, or points for terrestrial species appear in the ocean? This could be because the location is associated with an island or country and the exact point is unknown. You need to decide whether such points are useful or not for your purpose.
You can create a table of the number of records and species downloaded, and show the reduction in both with each step in this data filtering (sometimes called cleaning).
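A table of this kind can be produced by applying the checks one at a time and counting records and species after each. The sketch below is only one possible arrangement. It assumes records are dictionaries using the Darwin Core column names scientificName, decimalLatitude, decimalLongitude, eventDate and coordinateUncertaintyInMeters. The thresholds (1990, 10 km) and the decision to drop records with no stated precision are illustrative choices that depend on your purpose.

```python
def summarise(rows):
    """Return (number of records, number of species)."""
    return len(rows), len({row["scientificName"] for row in rows})


def run_steps(rows, steps):
    """Apply each named filter in turn and log the counts after it."""
    log = [("downloaded",) + summarise(rows)]
    for name, keep in steps:
        rows = [row for row in rows if keep(row)]
        log.append((name,) + summarise(rows))
    return rows, log


def has_coordinates(row):
    """Keep records that have both a latitude and a longitude."""
    return all(row.get(field) not in (None, "")
               for field in ("decimalLatitude", "decimalLongitude"))


def recent_enough(row, first_year=1990):
    """Keep records collected in or after first_year."""
    return int(row["eventDate"][:4]) >= first_year


def precise_enough(row, max_metres=10_000):
    """Keep records with a stated uncertainty of at most max_metres."""
    value = row.get("coordinateUncertaintyInMeters")
    # Records with no stated precision are dropped here; that is a choice.
    return value not in (None, "") and float(value) <= max_metres


def rec(name, lat, lon, when, uncertainty):
    """Build one made-up example record."""
    return {"scientificName": name, "decimalLatitude": lat,
            "decimalLongitude": lon, "eventDate": when,
            "coordinateUncertaintyInMeters": uncertainty}


records = [
    rec("Gadus morhua", "60.1", "5.2", "2010-06-01", "500"),
    rec("Gadus morhua", "", "", "2012-04-03", ""),
    rec("Pollachius virens", "61.0", "4.9", "1975-05-20", "1000"),
    rec("Mytilus edulis", "59.9", "10.7", "2015-03-02", "50000"),
    rec("Gadus morhua", "62.0", "6.0", "2018-09-09", ""),
]

steps = [
    ("with coordinates", has_coordinates),
    ("1990 onwards", recent_enough),
    ("within 10 km", precise_enough),
]

cleaned, log = run_steps(records, steps)
print(f"{'step':<20}{'records':>8}{'species':>8}")
for name, n_records, n_species in log:
    print(f"{name:<20}{n_records:>8}{n_species:>8}")
```

Each filter is a separate function so that steps can be reordered, dropped or replaced without touching the counting logic.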
Where can I find definitions of variables in the datasheet?
The terms are known as ‘Darwin Core Terms’, and the definitions of the variables can be read here: http://tdwg.github.io/dwc/terms/index.htm
Can I reduce the size of the dataset and still do the analysis?
If doing a regional or global analysis, you can aggregate records to larger spatial cells, such as the 5° latitude and longitude cells commonly used for global analyses. Then perhaps all you want to know is which species are present in each 5° cell. Thus you can reduce the dataset size from many records of the same species to one record of the species per cell. You may also wish to know the total number of records per cell to have an indicator of sampling effort.
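The aggregation described above can be sketched as follows. This is a minimal example, assuming records are dictionaries with the Darwin Core columns scientificName, decimalLatitude and decimalLongitude. Identifying a cell by its south-west corner is just one convenient convention, and the records are made up.

```python
import math
from collections import Counter


def cell_of(lat, lon, size=5.0):
    """Return the south-west corner of the cell containing a point."""
    return (math.floor(lat / size) * size, math.floor(lon / size) * size)


def aggregate(rows, size=5.0):
    """Reduce records to species presence per cell, plus record counts."""
    presence = set()    # one (cell, species) entry per species per cell
    effort = Counter()  # total number of records per cell
    for row in rows:
        cell = cell_of(float(row["decimalLatitude"]),
                       float(row["decimalLongitude"]), size)
        presence.add((cell, row["scientificName"]))
        effort[cell] += 1
    return presence, effort


# Made-up example records.
rows = [
    {"scientificName": "Gadus morhua", "decimalLatitude": "60.1", "decimalLongitude": "5.2"},
    {"scientificName": "Gadus morhua", "decimalLatitude": "61.0", "decimalLongitude": "4.9"},
    {"scientificName": "Gadus morhua", "decimalLatitude": "62.3", "decimalLongitude": "6.0"},
    {"scientificName": "Pollachius virens", "decimalLatitude": "63.0", "decimalLongitude": "7.5"},
    {"scientificName": "Mytilus edulis", "decimalLatitude": "-33.9", "decimalLongitude": "18.4"},
]

presence, effort = aggregate(rows)
for cell, species in sorted(presence):
    print(cell, species, "records in cell:", effort[cell])
```

Using floor rather than truncation keeps cells consistent across the equator and the prime meridian, where coordinates become negative.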
What to do about sampling bias?
All sampling is biased. This bias includes:
- Sampling methods that target particular taxa (e.g., spotting whales, sediment cores, plankton nets). Even a single method will vary in its efficacy in detecting different species and life-stages. Work is underway to extend GBIF and OBIS metadata so that datasets that used similar methods can be selected. At present you can do this by checking dataset metadata, or by using common species as indicators to find datasets that used comparable methods.
- Sample sizes vary. Even when using similar methods, survey area or time may vary, nets may have different mesh size and towing speed, transects and quadrats may have different areas sampled, traps may be deployed for different time periods, etc.
- Sampling is spatially biased. Some places are sampled more than others, for many reasons.
- Sampling is temporally biased.
Sampling bias is not a problem as long as you use the data with its bias in mind and interpret the results accordingly. All sampling is in practice “stratified” in some way, such as to a particular environment, habitat, taxonomic group (guild), or size group. So be up front about the scope of your work: what has and has not been studied.
Can I get and use species abundance data?
Species abundance data are increasingly available in OBIS and GBIF. However, abundance data are highly dependent on sampling method, effort and size. Alternatively, one can use the number of locations of a species in an area (e.g., a 5° cell) at a particular time (e.g., a year) to indicate its spatial abundance. This is less sensitive to sampling bias. It is also called species occurrence, incidence or presence.
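Counting the number of locations of a species per cell and year might look like the sketch below. It assumes records are dictionaries with the Darwin Core columns scientificName, decimalLatitude, decimalLongitude and eventDate, and treats two records as the same location only when their coordinates are identical, which is a simplification. The records are made up.

```python
import math
from collections import defaultdict


def spatial_abundance(rows, size=5.0):
    """Count distinct sampling locations per (cell, year, species)."""
    locations = defaultdict(set)
    for row in rows:
        lat = float(row["decimalLatitude"])
        lon = float(row["decimalLongitude"])
        # Cell identified by its south-west corner.
        cell = (math.floor(lat / size) * size, math.floor(lon / size) * size)
        year = int(row["eventDate"][:4])
        locations[(cell, year, row["scientificName"])].add((lat, lon))
    return {key: len(points) for key, points in locations.items()}


# Made-up example records: the first two share a location and year.
rows = [
    {"scientificName": "Gadus morhua", "decimalLatitude": "60.1",
     "decimalLongitude": "5.2", "eventDate": "2010-06-01"},
    {"scientificName": "Gadus morhua", "decimalLatitude": "60.1",
     "decimalLongitude": "5.2", "eventDate": "2010-06-15"},
    {"scientificName": "Gadus morhua", "decimalLatitude": "62.3",
     "decimalLongitude": "6.0", "eventDate": "2010-08-01"},
    {"scientificName": "Gadus morhua", "decimalLatitude": "62.3",
     "decimalLongitude": "6.0", "eventDate": "2011-08-01"},
]

for (cell, year, species), n in sorted(spatial_abundance(rows).items()):
    print(cell, year, species, "locations:", n)
```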
Why do most people use species “presence-only” data?
Species presence is far less sensitive to sampling bias, and is the most commonly used metric of diversity in biogeographic studies. From it, one can calculate a variety of measures of species richness, checklists of species, and change in species composition over time and space. Change in species composition is also called species turnover or beta diversity.
Presence-only data are often incorrectly called presence-absence data. In fact, whether a species is truly absent is often unknown. However, it can sometimes be reasonable to consider it absent if it is known that it would have been sampled if present.
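As a small worked example of the richness and turnover measures mentioned above, the sketch below compares the species lists of one cell in two periods. Jaccard dissimilarity is used here as one common turnover index chosen for illustration; other indices exist, and the species lists are made up.

```python
def richness(species):
    """Species richness: the number of distinct species present."""
    return len(set(species))


def jaccard_dissimilarity(a, b):
    """1 minus shared species over all species: 0 = identical, 1 = none shared."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty lists are treated as identical
    return 1 - len(a & b) / len(union)


# Made-up checklists for the same cell in two periods.
cell_1990s = {"Gadus morhua", "Pollachius virens", "Mytilus edulis"}
cell_2010s = {"Gadus morhua", "Mytilus edulis", "Magallana gigas"}

print("richness 1990s:", richness(cell_1990s))
print("richness 2010s:", richness(cell_2010s))
print("turnover:", jaccard_dissimilarity(cell_1990s, cell_2010s))
```

Here richness is unchanged at three species while half of the combined species list is not shared between periods, which shows why turnover is worth reporting alongside richness.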
Which species are most interesting to study?
Many species have important attributes, such as being listed as threatened, extinct, endangered, introduced, and/or invasive. Others are important as food, or may be pests or vectors of disease. Many species provide habitat for others (e.g. corals, trees), and others are top predators that may control the abundance of other species in a food web.
Most species are rare, and often their role in an ecosystem is unknown. However, geographic rarity and endemicity place species at greater risk of extinction.
Where do I get more details on how to analyse data?
How to use OBIS https://classroom.oceanteacher.org/course/view.php?id=349 and http://iobis.org/manual/
See also the obistools R package with the XYlookup service (returns shore distance, bathymetry, sea surface salinity and sea surface temperature for a given coordinate).
Temporal Diversity Indices package in R
https://cran.r-project.org/web/packages/codyn/vignettes/Temporal_Diversity_Indices.html