The EXCITE project, jointly run by WeST (Institute for Web Science and Technologies, University of Koblenz-Landau) in Koblenz and GESIS (Leibniz Institute for Social Sciences) in Cologne, is funded by the Deutsche Forschungsgemeinschaft (DFG) with the aim of extracting citations from social science publications and making more citation data available to researchers. To this end, a set of algorithms for information extraction and matching has been developed, focusing on social science publications in the German language. EXCITE provides several online services to extract and segment citations, as well as online tools for creating more gold standard data.
September 2016 - July 2019
The shortage of citation data for the international and especially the German social sciences is well known to researchers in the field and has itself often been the subject of academic studies. Citation data is the basis of effective information retrieval, recommendation systems, and knowledge discovery processes. The accessibility of information in the social sciences lags behind other fields (e.g. the natural sciences) where more citation data is available. The EXCITE project aims to close this gap by developing a toolchain of software components for reference extraction, which is applied to existing scientific databases (especially full texts in the social sciences). The tools are made available to other researchers. The project develops a number of algorithms for extracting references and citations from PDF full texts and improves the matching of reference strings to bibliographic databases. The extraction of citations is implemented as a five-step process:
This is done with the help of machine learning methods which control the quality of the extracted data of the individual components. The extracted citation data is integrated into the services maintained by the proposers (sowiport) and published as linked open data under permissive licenses to enable reuse. The resulting software of this project is published under open source licenses and made accessible via a web service API.
EXCITE provides several services to extract and parse citations. All tools are licensed under Creative Commons Attribution-NonCommercial (CC BY-NC) and their code is available on GitHub.
EXParser: A Python tool that extracts and segments references from PDF files by adopting a feedback mechanism.
EXMatcher: This algorithm is implemented for finding corresponding items in a bibliography corpus (such as Sowiport.org or related-work.net) for reference strings.
EXPublisher: This code converts EXCITE data to a JSON file that follows the OCC ontology.
RefExt: A Java tool that extracts references from PDF files using Conditional Random Fields (CRF).
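To make the matching task described for EXMatcher more concrete, the following is a minimal sketch of linking a reference string to a record in a bibliographic corpus. It is not EXMatcher's actual algorithm: the function names (`normalize`, `best_match`), the token-containment score, the threshold of 0.5, and the record identifiers are all assumptions chosen for illustration.

```python
import re


def normalize(text):
    """Lowercase the text and split it into alphanumeric ASCII tokens."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()


def best_match(reference, corpus, threshold=0.5):
    """Return (record_id, score) for the best-matching title, or None.

    `corpus` maps hypothetical record ids to titles. The score is the
    fraction of distinct title tokens that also occur in the reference
    string (an illustrative measure, not the one used by EXMatcher).
    """
    ref_tokens = set(normalize(reference))
    best_id, best_score = None, 0.0
    for record_id, title in corpus.items():
        title_tokens = set(normalize(title))
        if not title_tokens:
            continue  # skip records without a usable title
        score = len(title_tokens & ref_tokens) / len(title_tokens)
        if score > best_score:
            best_id, best_score = record_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None


# Hypothetical two-record corpus for demonstration.
corpus = {
    "rec-1": "Evaluating Reference String Extraction Using "
             "Line-Based Conditional Random Fields",
    "rec-2": "An End-to-end Approach for Extracting and Segmenting "
             "High-Variance References from PDF Documents",
}
reference = ("Körner M., Ghavimi B. (2017) Evaluating reference string "
             "extraction using line-based conditional random fields. ADBIS 2017.")
print(best_match(reference, corpus))
```

A real matcher would typically also compare authors, year, and venue, and would query an index rather than scan every record; the sketch only shows the basic idea of scoring candidates and rejecting weak matches.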
When: 30.03.2017 - 31.03.2017
Where: GESIS-Leibniz-Institut für Sozialwissenschaften, Unter Sachsenhausen 6-8, 50667 Cologne, Germany
Our first community meeting is planned as a “noon to noon” event and aims to bring together experts in reference extraction, text mining, and machine learning to explore the possibilities in the project. We plan to have scientific presentations with invited speakers on the first day and hands-on sessions on the second day. For the second day we will release a test corpus (PDF files of scientific papers and manually annotated data) for developers.
EXCITE is pleased to announce its collaboration with the Open Citation Corpus (OCC), which started in 2010 as a one-year project funded by the Joint Information Systems Committee (JISC). In addition to its tools and services, OCC publishes accurate bibliographic and citation data in an open repository made available under a Creative Commons public domain dedication. The collaboration with OCC serves our vision of transparent access to bibliographic metadata as well as citation data for facilitating research in social science in particular, and in all sciences and the humanities in general.
EXCITE, Open Citation Corpus, Europe PMC and the University of Bologna are organising a workshop on “Open Citations”, which will take place at the University of Bologna on September 3rd-5th. The workshop addresses experts and scholars working on open bibliographic metadata and citations and on approaches to extracting them. It also offers the opportunity to attend presentations by our invited speakers, who are experienced in scholarly publishing. At the hack day, new services and data will be presented.
Körner M., Ghavimi B., Mayr P., Hartmann H., Staab S. (2017) Evaluating Reference String Extraction Using Line-Based Conditional Random Fields: A Case Study with German Language Publications. In: Kirikova M. et al. (eds) New Trends in Databases and Information Systems. ADBIS 2017. Communications in Computer and Information Science, vol 767. Springer, Cham.
Boukhers Z., Ambhore S., Staab S. (2019) An End-to-end Approach for Extracting and Segmenting High-Variance References from PDF Documents. In Proceedings of the 19th ACM/IEEE on Joint Conference on Digital Libraries, JCDL 2019. ACM.
Hosseini A., Ghavimi B., Kern D., Mayr P. (2019) EXCITE - A toolchain to extract, match and publish open literature references. In Proceedings of the 19th ACM/IEEE on Joint Conference on Digital Libraries, JCDL 2019. ACM.