Difference between revisions of "Hackathon 2013/Citations"

From TaxonWorks Wiki
#* Possibly http://rubygems.org/gems/bibtex-ruby
# Writing code to resolve microcitations. A [http://iphylo.blogspot.com/2011/03/nomenclator-zoologicus-meets.html working example] of this has been created by Rod Page that links generic names from Nomenclator Zoologicus with references from the Biodiversity Heritage Library.
#* Microcitations should be differentiated from verbatim references. A microcitation can be just an author and date (as seen in a full taxon name) or any shortened version of a full citation such as <Cas. ces. Spol. ent. 2:27-32> and others found in [http://uio.mbl.edu/NomenclatorZoologicus/ Nomenclator Zoologicus]. A verbatim reference is the full reference as it appears in documentation that hasn't been broken into normalized pieces yet. Both versions need to be tracked and be searchable. We also need a way to get a listing out of TW of both versions, so they can be normalized into full sources.
#* Lists of journals.
#* [http://biodivlib.wikispaces.com/Developer+Tools+and+API BHL API] to convert BHL URLs into references
# Finding out more about a particular citation
#* Going from an identifier to a BibTeX citation is partially covered by [https://github.com/gaurav/biburi biburi]
#* Getting abstract, keywords from PubMed
#* Pulling in citation from:

* Should be able to store abstracts, nomenclature acts & entire classification (available from ZooRecord - most probably returned as text strings).
* Should be able to round trip data (e.g. import a BibTeX file, then output a BibTeX file and have them be the same).

Source format variations:
* authority string - <author family name> year
* short string - <author short name (as little of the author names needed to differentiate from other authors within current project)> <editor indicator> <year> <any containing reference - e.g. In Book> <Short publication name> <Series> <Volume> <Issue> <Pages>
* long string - <full author names> <editor indicator> <year> <title> <containing reference> <Full publication name> <Series> <Volume> <Issue> <Pages>
* no publication long string -

== Coding Notes ==

*** SerialRelationship is modeled as type & 2 serial IDs
*** SerialRelationshipType will be based on MARC (http://www.oclc.org/bibformats/en/7xx.html - see 780 and 785)

----
Species File conventions to remember:

* Two references are considered a match even if the access code, editor, OSF copy, or citation flags are different.
* Two references are considered different if they have different verbatim reference fields (including different capitalization), even if everything else matches!
* A reference is considered different if author, pub or containing ref aren't identical.
* A reference is considered similar if years, title, volume or pages are either the same or missing.
* A similar reference may be added to the db by user request.
* The values of verbatim data are ignored when checking if references are similar.

SF Reference formats:
* Authority string - author & year only
* Short string - reference string with author last names only, unless more is needed to differentiate authors from other authors in the SF, and the long name of the journal or the short name of any other pubtype. Doesn't contain title.
* Long string without publication - same as the long reference string below, just missing the pub string.
* Long reference string (with or without the cite string) - Long or short author string (based on user preferences), year, title, full publication name, series, volume, issue, pages
* Note, URLs, and publication notes are displayed by an independent call
  
 
== Open Issues ==
* How do we support Tom in Tom, Dick & Harry (e.g. Tom is the authority but the actual journal article is by Tom, Dick & Harry)? Are these separate sources?
** Dmitry says no - not supporting ref-in-ref the same way that SF does. In this case, the taxonomic authority string (which is just a text string) would just not match the author string in the original description source.
** Ed is wondering, do we divorce the authors of taxa from authors in references? Otherwise, how would we associate the authors for a species correctly?
** Beth - Similarly, how would we test for correctness between authority string and the source/reference?
* Would like a way to store linkages between sources and OTUs that indicate the type of information within the source (e.g. key, images). SF has tblCiteInfoFlags.FlagValue which specifies this kind of information.  It is also summarized across all citations to the same reference in tblReferences.CiteDataStatus.
* Can we embed COinS within our pages/structure so that someone else can access our db/pages the way that Gaurav's gem accesses the BHL?
 
== Deliverables ==

# Write a Rails system for storing citations, and integrate it with bibtex-ruby.
#* Beth
# Given an identifier, look for information about it online (**done!**: https://github.com/gaurav/biburi)
#* Gaurav
#* uses ''COinS'' - a way of embedding citations within a webpage.
#* works with Mendeley, DOI, BHL
#* doesn't work with Zotero (Zotero doesn't use COinS). Would need to add an additional module to use the Zotero API.
#* could refine the BHL interface to improve linkages & pull out other identifiers.
# Parsing citations in Ruby.

== URLs and identifiers of taxonomic significance ==
It should be noted that there will be multiple identifiers associated with a single source.
* ''[https://github.com/gaurav/biburi biburi] currently supports DOIs and any website containing [[wikipedia:COinS|COinS]]''
* ISBN/ISSN
* BHL URLs

* [http://text2bib.economics.utoronto.ca/index.php/index text2bib] converts a plain text list of references in a wide range of styles to BibTeX.
* [http://ligercat.ubio.org/ LigerCat] Literature and Genomics Resource Catalogue
* [http://manas.tungare.name/software/isbn-to-bibtex/ ISBN to BibTeX converter] - uses Amazon.

== APIs available ==
 
* [https://docs.google.com/file/d/0Bx0f4rUOr4cidnBEMTg0VVVYV1U/edit?usp=sharing Taeneonema example in BibTeX]
* [https://docs.google.com/file/d/0Bx0f4rUOr4ciQkFhdlpFR2x2LTg/edit?usp=sharing Taeneonema example as pdf bibliography]
* [https://docs.google.com/spreadsheet/ccc?key=0Ah0f4rUOr4cidEdhdjZGY3RxSEZlaWFMckMzYzFiSlE&usp=sharing PubTypesComparison] TaxonWorks Vue Model mappings of publication types with BibTeX, Common Data Model and more. (spreadsheet)
* [https://docs.google.com/spreadsheet/ccc?key=0Ah0f4rUOr4cidEVEa2UzV0sxVTZjN2UzVkRCWVNJWmc&usp=sharing BibTeX fields and definitions] (spreadsheet)
* [https://docs.google.com/spreadsheet/ccc?key=0Ah0f4rUOr4cidEFCZVFZMmJHOGlpRmNYNVdLWk5DVFE&usp=sharing BibTeX field specifications] listed for different Pub Types (spreadsheet)
* [https://docs.google.com/spreadsheet/ccc?key=0Ah0f4rUOr4cidEF6M2Q2SkI0XzAxSjRndnhQNTJUcEE&usp=sharing SFS to BibTeX] - crude start of mapping of references (spreadsheet)
* [https://docs.google.com/spreadsheet/ccc?key=0Ah0f4rUOr4cidHBVb0poVGdJa2JORzd2cVFMQ3dyaHc&usp=sharing DwC-A literature] fields used in DwC-A and GBIF-GNA literature extension files.

Useful BibTeX reference links
* [http://rubygems.org/gems/bibtex-ruby bibtex-ruby at rubygems.org], [http://inukshuk.github.io/bibtex-ruby/ bibtex-ruby home page], [http://rubydoc.info/gems/bibtex-ruby/2.3.4/frames bibtex-ruby documentation]
* [http://www.andy-roberts.net/writing/latex/bibliographies BibTeX bibliography tutorial] - This contains an example decoding of a BibTeX entry.
* [https://github.com/inukshuk/bibtex-ruby/wiki/The-BibTeX-Format bibtex-ruby wiki] - contains information about where bibtex-ruby varies from the official BibTeX interpretation.
* [http://artis.imag.fr/~Xavier.Decoret/resources/xdkbibtex/bibtex_summary.html BibTeX summary] a breakdown of the BibTeX format, useful for creating valid BibTeX files. (Written by someone who was providing help for the Python tool)
* [http://www.fb10.uni-bremen.de/anglistik/langpro/bibliographies/jacobsen-bibtex.html BibTeX Format] - has definitions of many of the BibTeX fields.
* [http://bst.maururu.net/ LaTeX Bibliography Styles Database]
* [http://www.tug.org/pracjourn/2006-4/fenn/fenn.pdf Managing your citation with BibTeX] and/or http://www.tug.org/pracjourn/2006-4/fenn - BibTeX manuals.
* [http://code.google.com/p/bibtex-js/ BibTeX-js] can parse a BibTeX-file and render it as part of an HTML file.
* [http://www.bibtex.org Official BibTeX site]
* [http://nwalsh.com/tex/texhelp/bibtx-23.html Help on BibTeX Names] - more good help files available in directory
* [http://bay.uchicago.edu/tex-archive/biblio/bibtex/contrib/doc/btxFAQ.pdf BibTeX Tips and FAQ]

Latest revision as of 10:59, 21 October 2013

This pitch covers making sure that citations can flow easily in and out of TaxonWorks. It will do this in six ways:

  1. Finding an existing data model that can hold citation information, with a standard format for representing it.
  2. Writing code to read and write this format in Ruby, possibly just using a standard library.
  3. Writing code to resolve microcitations. A working example of this has been created by Rod Page that links generic names from Nomenclator Zoologicus with references from the Biodiversity Heritage Library.
    • Microcitations should be differentiated from verbatim references. A microcitation can be just an author and date (as seen in a full taxon name) or any shortened version of a full citation such as <Cas. ces. Spol. ent. 2:27-32> and others found in Nomenclator Zoologicus. A verbatim reference is the full reference as it appears in documentation that hasn't been broken into normalized pieces yet. Both versions need to be tracked and be searchable. We also need a way to get a listing out of TW of both versions, so they can be normalized into full sources.
    • Lists of journals.
    • BHL API to convert BHL URLs into references
  4. Finding out more about a particular citation
    • Going from an identifier to a BibTeX citation is partially covered by biburi
    • Getting abstract, keywords from PubMed
    • Pulling in citation from:
      • BHL
      • ZooRecord
      • Biosys
    • ImpactStory information
  5. Automatically parsing citations into authors, title, etc.
  6. Designing a user interface to make it easy to resolve microcitations
    • Autocompletion
    • Journal name identification
    • Searching on Google Scholar/Wikipedia for books/authors
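
As a rough illustration of the microcitation point in item 3, the sketch below splits a shortened citation such as the Nomenclator Zoologicus example into an abbreviated title, a volume and a page range. The `parse_microcitation` helper and its regular expression are assumptions made for illustration only; real microcitations vary far more than this one pattern allows, so anything that does not parse is simply left for later normalization.

```ruby
# Hypothetical pattern: "<abbreviated title> <volume>:<first page>[-<last page>]".
MICROCITATION = /\A(?<title>.+?)\s+(?<volume>\d+)\s*:\s*(?<first>\d+)(?:-(?<last>\d+))?\z/

# Returns a hash of parsed pieces, or nil when the text does not fit the pattern.
def parse_microcitation(text)
  m = MICROCITATION.match(text.strip)
  return nil unless m # keep the string as verbatim text and review it later

  {
    title:      m[:title],
    volume:     m[:volume].to_i,
    first_page: m[:first].to_i,
    last_page:  (m[:last] || m[:first]).to_i
  }
end

p parse_microcitation('Cas. ces. Spol. ent. 2:27-32')
```

An author-and-date microcitation (for example "Tom 1999") does not match this pattern and returns nil, which fits the requirement that both microcitations and verbatim references remain stored and searchable until someone normalizes them.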


Members

Requirements (potential test cases/use cases)

  • Must support letters after the year for multiple publications of the same author in the same year.
  • No single field will be required for any given record - this is different than the requirements for valid BibTeX.
    • The scientist should be able to enter partial or incomplete data (that may be all that is conveniently available at the moment), then return later to complete this information. TW must support a workflow that is convenient for the scientist, not the program. (See the note below on adding a "review needed" tag on import. This type of tag should be available on manual entry as well.)
    • The scientist should be able to copy a verbatim reference from another document and pass it into TW for normalization later.
    • The scientist should be able to similarly pass just a URL/URN or other identifier in as a reference for completion later (this will be a source of type "miscellaneous"). Ideally this could then simply resolve identifiers such as a DOI and generate readily available reference information.
  • When importing SF reference data into TW sources, they will need to be tagged "For review" because BibTeX differentiates publication types more finely than SF does.
  • Should be able to store abstracts, nomenclature acts & entire classification (available from ZooRecord - most probably returned as text strings).
  • Should be able to round trip data (e.g. import a BibTeX file, then output a BibTeX file and have them be the same).
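
The first requirement (letters after the year) could be handled along these lines. This is only a sketch: `split_year` is a hypothetical helper, and it assumes a four-digit year followed by at most one lowercase letter.

```ruby
# Hypothetical helper: separate a four-digit year from an optional letter suffix,
# e.g. "1999a" for the first of several publications by one author in 1999.
def split_year(value)
  m = /\A(\d{4})([a-z]?)\z/.match(value.to_s.strip)
  return nil unless m # not a recognizable year; leave the field for review

  { year: m[1].to_i, suffix: m[2].empty? ? nil : m[2] }
end

p split_year('1999a')
p split_year('2013')
```

Keeping the suffix separate from the numeric year lets sources still be sorted and compared by year, while the letter can be written back out unchanged when the record is exported.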


Source format variations:

  • authority string - <author family name> year
  • short string - <author short name (as little of the author names needed to differentiate from other authors within current project)> <editor indicator> <year> <any containing reference - e.g. In Book> <Short publication name> <Series> <Volume> <Issue> <Pages>
  • long string - <full author names> <editor indicator> <year> <title> <containing reference> <Full publication name> <Series> <Volume> <Issue> <Pages>
  • no publication long string -
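
The authority and long string variations above might be generated roughly as follows. Everything here is an assumption for illustration: `SourceRecord`, its field names and the helper methods are not the TaxonWorks schema, and the editor indicator, containing reference and series are left out to keep the sketch short. Missing fields are skipped, since no single field is required.

```ruby
# Hypothetical record; authors is an array of { family:, given: } hashes.
SourceRecord = Struct.new(:authors, :year, :title, :publication,
                          :volume, :issue, :pages, keyword_init: true)

# Join the non-empty parts with single spaces.
def join_parts(parts)
  parts.map(&:to_s).reject(&:empty?).join(' ')
end

# authority string - <author family name> year
def authority_string(source)
  families = Array(source.authors).map { |a| a[:family] }
  names = families.size > 1 ? "#{families[0..-2].join(', ')} & #{families.last}" : families.first
  join_parts([names, source.year])
end

# long string - <full author names> <year> <title> <publication> <volume> <issue> <pages>
def long_string(source)
  full_names = Array(source.authors).map { |a| join_parts([a[:given], a[:family]]) }.join(', ')
  join_parts([full_names, source.year, source.title, source.publication,
              source.volume, source.issue, source.pages])
end

example = SourceRecord.new(
  authors: [{ family: 'Tom', given: 'A.' }, { family: 'Dick' }, { family: 'Harry' }],
  year: '1999a', title: 'An example title', publication: 'Example Journal',
  volume: '2', pages: '27-32'
)
puts authority_string(example)
puts long_string(example)
```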

Coding Notes

The following are notes retrieved from Matt's VUE (Media:Source.JPG) file relating to Sources.

  • Relationships to other objects within TaxonWorks:
    • Sources may have a SourceAuthor and/or SourceEditor (Role of a person)
    • Sources may have a HumanSource (Requires a person with a role of SourceSource)
    • Sources will support the following BibTeX types:
      • Book
      • Article - an article is published in a Serial (Journal)
      • Conference
      • Booklet
      • InBook
      • InCollection
      • MastersThesis
      • InProceedings
      • Misc - this will be used when the only available current source is a URL.
      • PhdThesis
      • Techreport
      • Unpublished - TaxonWorks revisions and other works-in-progress, may also be used for LepIndex catalogue cards.
      • Manual
    • Sources may be published in a Serial, which will support relationships of Preceding and Succeeding.
      • Serials have:
        • Title
        • Series_year_start
        • Series_year_end
        • editors (text list - different from people with roles)
        • publisher
        • place_published
        • primary_language
      • SerialRelationship is modeled as type & 2 serial IDs
      • SerialRelationshipType will be based on MARC (http://www.oclc.org/bibformats/en/7xx.html - see 780 and 785)
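
A minimal plain-Ruby sketch of "type & 2 serial IDs" is shown below. The struct and method names are hypothetical stand-ins, not the eventual Rails models; the only outside fact used is that MARC field 780 is the Preceding Entry and 785 the Succeeding Entry.

```ruby
# Hypothetical stand-ins for the Serial and SerialRelationship models.
Serial = Struct.new(:id, :title)
SerialRelationship = Struct.new(:type, :subject_serial_id, :object_serial_id)

# Relationship types keyed to the MARC linking-entry fields they are based on.
RELATIONSHIP_TYPES = { preceding: '780', succeeding: '785' }.freeze

# Serials that succeed the given serial, looked up through the relationships.
def succeeding_serials(serial, relationships, serials_by_id)
  relationships
    .select { |r| r.type == :succeeding && r.subject_serial_id == serial.id }
    .map { |r| serials_by_id.fetch(r.object_serial_id) }
end

old_serial = Serial.new(1, 'Old Example Journal')
new_serial = Serial.new(2, 'New Example Journal')
links = [SerialRelationship.new(:succeeding, old_serial.id, new_serial.id)]
p succeeding_serials(old_serial, links, { 1 => old_serial, 2 => new_serial })
```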

Species File conventions to remember:

  • Two references are considered a match even if the access code, editor, OSF copy, or citation flags are different.
  • Two references are considered different if they have different verbatim reference fields (including different capitalization), even if everything else matches!
  • A reference is considered different if author, pub or containing ref aren't identical.
  • A reference is considered similar if years, title, volume or pages are either the same or missing.
  • A similar reference may be added to the db by user request.
  • The values of verbatim data are ignored when checking if references are similar.
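
One possible reading of these conventions is sketched below. It is an interpretation, not Species File's actual logic: references are plain hashes with hypothetical keys, "different verbatim fields" is taken to mean "not an exact match", and the similarity check ignores verbatim data as the last convention says. Access code, editor, OSF copy and citation flags are never compared.

```ruby
IDENTITY_FIELDS = %i[author publication containing_reference].freeze
SOFT_FIELDS     = %i[year title volume pages].freeze

def same_or_missing?(a, b)
  a.nil? || b.nil? || a == b
end

# Returns :match, :similar or :different for two reference hashes.
def compare_references(a, b)
  # Author, pub and containing ref must be identical.
  return :different unless IDENTITY_FIELDS.all? { |f| a[f] == b[f] }
  # Year, title, volume and pages must each be the same or missing.
  return :different unless SOFT_FIELDS.all? { |f| same_or_missing?(a[f], b[f]) }

  # An exact match also needs identical verbatim text (capitalization included).
  exact = SOFT_FIELDS.all? { |f| a[f] == b[f] } && a[:verbatim] == b[:verbatim]
  exact ? :match : :similar
end

ref = { author: 'Tom', publication: 'Example Journal', year: 1999,
        title: 'An example title', volume: '2', pages: '27-32',
        verbatim: 'Tom 1999', access_code: 1 }
p compare_references(ref, ref.merge(access_code: 2))
p compare_references(ref, ref.merge(pages: nil))
p compare_references(ref, ref.merge(author: 'Dick'))
```

A :similar result would then be offered to the user, who can ask for the reference to be added anyway.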

SF Reference formats:

  • Authority string - author & year only
  • Short string - reference string with author last names only, unless more is needed to differentiate authors from other authors in the SF, and the long name of the journal or the short name of any other pubtype. Doesn't contain title.
  • Long string without publication - same as the long reference string below, just missing the pub string.
  • Long reference string (with or without the cite string) - Long or short author string (based on user preferences), year, title, full publication name, series, volume, issue, pages
  • Note, URLs, and publication notes are displayed by an independent call

Open Issues

  • How do we support Tom in Tom, Dick & Harry (e.g. Tom is the authority but the actual journal article is by Tom, Dick & Harry)? Are these separate sources?
    • Dmitry says no - not supporting ref-in-ref the same way that SF does. In this case, the taxonomic authority string (which is just a text string) would just not match the author string in the original description source.
    • Ed is wondering, do we divorce the authors of taxa from authors in references? Otherwise, how would we associate the authors for a species correctly?
    • Beth - Similarly, how would we test for correctness between authority string and the source/reference?
  • Would like a way to store linkages between sources and OTUs that indicate the type of information within the source (e.g. key, images). SF has tblCiteInfoFlags.FlagValue which specifies this kind of information. It is also summarized across all citations to the same reference in tblReferences.CiteDataStatus.
  • Can we embed COinS within our pages/structure so that someone else can access our db/pages the way that Gaurav's gem accesses the BHL?
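
On the last open issue: a COinS entry is an empty span with class Z3988 whose title attribute carries an OpenURL ContextObject as URL-encoded key/value pairs. The sketch below shows roughly what emitting one could look like. The `coins_span` helper is hypothetical, it assumes a journal-article source, and the caller is assumed to pass valid OpenURL journal keys such as atitle, jtitle and date.

```ruby
require 'uri'

# Hypothetical helper: build a COinS span for a journal article.
# fields maps OpenURL journal keys (without the "rft." prefix) to values.
def coins_span(fields)
  pairs = [
    ['ctx_ver', 'Z39.88-2004'],
    ['rft_val_fmt', 'info:ofi/fmt:kev:mtx:journal']
  ] + fields.map { |key, value| ["rft.#{key}", value.to_s] }

  context_object = URI.encode_www_form(pairs)
  # After URL encoding, "&" is the only character that still needs HTML escaping.
  %(<span class="Z3988" title="#{context_object.gsub('&', '&amp;')}"></span>)
end

puts coins_span(atitle: 'An example title', jtitle: 'Example Journal', date: '1999')
```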

Deliverables

  1. Use bibtex-ruby to read, write and round-trip bibliographic information.
  2. Write a Rails system for storing citations, and integrate it with bibtex-ruby.
    • Beth
  3. Given an identifier, look for information about it online (**done!**: https://github.com/gaurav/biburi)
    • Gaurav
    • uses COinS - a way of embedding citations within a webpage.
    • works with Mendeley, DOI, BHL
    • doesn't work with Zotero (Zotero doesn't use COinS). Would need to add an additional module to use the Zotero API.
    • could refine the BHL interface to improve linkages & pull out other identifiers.
  4. Parsing citations in Ruby.

Terms

  • Citation: An individual, unnormalized use of a source.
    • Each citation will have a unique identifier that the rest of the system can reference.
    • It must be possible to have a citation which consists ONLY of a single identifier. We could treat this as a verbatim reference.
  • Source: A person or institution credited as providing data.
    • TW needs sources to be private or public. Why?
  • Global source: a common pool of sources. These should be published and non-private.

Datasets we can play with

How do these fit here? - These are online resources that we can pull datasets from (Beth)

  • ITIS
  • GNUB
  • UCD

Input/output formats

URLs and identifiers of taxonomic significance

It should be noted that there will be multiple identifiers associated with a single source.

  • biburi currently supports DOIs and any website containing COinS
  • ISBN/ISSN
  • BHL URLs
  • PubMed ID/URLs
  • DOI ID/URLs
  • Handle ID/URLs?
  • Mendeley/Zotero/EndNote ID/URLs
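
Because a source can arrive as nothing more than one of these identifiers, a first step might be to guess what kind of identifier a string is before trying to resolve it. The sketch below is an assumption-laden illustration: `identifier_type` is a hypothetical helper, the patterns are deliberately loose, and only a few of the identifier kinds listed above are covered.

```ruby
# Hypothetical helper: guess the kind of identifier held in a string.
def identifier_type(text)
  value = text.strip
  case value
  when %r{\A(?:https?://(?:dx\.)?doi\.org/|doi:)?10\.\d{4,9}/\S+\z}i then :doi
  when %r{\Ahttps?://(?:www\.)?biodiversitylibrary\.org/}i then :bhl
  when /\A\d{4}-\d{3}[\dXx]\z/ then :issn
  when %r{\Ahttps?://}i then :url
  else :unknown
  end
end

# The DOI and URLs below are made-up examples.
p identifier_type('doi:10.1234/example')
p identifier_type('http://www.biodiversitylibrary.org/page/1')
p identifier_type('1234-567X')
```

Anything classified as :url or :unknown would stay a source of type Misc until it can be completed by hand.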

Relevant Code and projects

Projects and gems that may be of interest

APIs available

Links

Useful BibTeX reference links