epijson a unified data format for epidemiology CORD-Papers-2022-06-02 (Version 1)

Title: EpiJSON: A unified data-format for epidemiology
Abstract: Epidemiology relies on data but the divergent ways data are recorded and transferred, both within and between outbreaks, and the expanding range of data-types are creating an increasingly complex problem for the discipline. There is a need for a consistent, interpretable and precise way to transfer data while maintaining its fidelity. We introduce 'EpiJSON', a new, flexible, and standards-compliant format for the interchange of epidemiological data using JavaScript Object Notation. This format is designed to enable the widest range of epidemiological data to be unambiguously held and transferred between people, software and institutions. In this paper, we provide a full description of the format and a discussion of the design decisions made. We introduce a schema enabling automatic checks of the validity of data stored as EpiJSON, which can serve as a basis for the development of additional tools. In addition, we also present the R package 'repijson' which provides conversion tools between this format, line-list data and pre-existing analysis tools. An example is given to illustrate how EpiJSON can be used to store line list data. EpiJSON, designed around modern standards for interchange of information on the internet, is simple to implement, read and check. As such, it provides an ideal new standard for epidemiological, and other, data transfer to the fast-growing open-source platform for the analysis of disease outbreaks.
Published: 2015-12-29
Journal: Epidemics
DOI: 10.1016/j.epidem.2015.12.002
DOI_URL: http://doi.org/10.1016/j.epidem.2015.12.002
Author Name: Finnie Thomas J R
Author link: https://covid19-data.nist.gov/pid/rest/local/author/finnie_thomas_j_r
Author Name: South Andy
Author link: https://covid19-data.nist.gov/pid/rest/local/author/south_andy
Author Name: Bento Ana
Author link: https://covid19-data.nist.gov/pid/rest/local/author/bento_ana
Author Name: Sherrard Smith Ellie
Author link: https://covid19-data.nist.gov/pid/rest/local/author/sherrard_smith_ellie
Author Name: Jombart Thibaut
Author link: https://covid19-data.nist.gov/pid/rest/local/author/jombart_thibaut
sha: 141bb78c68caf0302c6f31eac91d1ff5197d5771
license: no-cc
license_url: [no creative commons license associated]
source_x: Elsevier; Medline; PMC
source_x_url: https://www.elsevier.com/ https://www.medline.com/ https://www.ncbi.nlm.nih.gov/pubmed/
pubmed_id: 27266846
pubmed_id_url: https://www.ncbi.nlm.nih.gov/pubmed/27266846
pmcid: PMC7104924
pmcid_url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7104924
url: https://doi.org/10.1016/j.epidem.2015.12.002 https://api.elsevier.com/content/article/pii/S1755436515000973 https://www.sciencedirect.com/science/article/pii/S1755436515000973 https://www.ncbi.nlm.nih.gov/pubmed/27266846/
has_full_text: TRUE
Keywords Extracted from Text Content: line-list EpiJSON line people JavaScript Object spread-sheets UDUNITS2 unit database RFC4648 hh Outbreak-Tools https://github.com/Hackout2/EpiJSON humans JSON object. id base64 CF patient EpiJSON UDUNITS2 Fig. 2 , Table 1 −2147,483,647 EpiJSON's line 1996-12-19T16:39:57 https://raw.githubusercontent.com/Hackout2/EpiJSON/ master/schema/epijson.json. JSON object Epi-JSON NetCDF https://cran.r-project.org/web/packages/ ISO RFC3339 JSON UUID work-flows Solid lines human C's long type (i.e. in Commas RFC 4122 unit-of-record Fig. 2 GeoJSON Non-standard parts EpiJSON
Extracted Text Content in Record: First 5000 Characters: Epidemiology relies on data but the divergent ways data are recorded and transferred, both within and between outbreaks, and the expanding range of data-types are creating an increasingly complex problem for the discipline. There is a need for a consistent, interpretable and precise way to transfer data while maintaining its fidelity. We introduce 'EpiJSON', a new, flexible, and standards-compliant format for the interchange of epidemiological data using JavaScript Object Notation. This format is designed to enable the widest range of epidemiological data to be unambiguously held and transferred between people, software and institutions. In this paper, we provide a full description of the format and a discussion of the design decisions made. We introduce a schema enabling automatic checks of the validity of data stored as EpiJSON, which can serve as a basis for the development of additional tools. In addition, we also present the R package 'repijson' which provides conversion tools between this format, line-list data and pre-existing analysis tools. An example is given to illustrate how EpiJSON can be used to store line list data. EpiJSON, designed around modern standards for interchange of information on the internet, is simple to implement, read and check. As such, it provides an ideal new standard for epidemiological, and other, data transfer to the fast-growing open-source platform for the analysis of disease outbreaks. Infectious disease epidemiology relies on integrating increasingly diverse and complex data. This complexity comes not only from the types of data now collected (for example genetic sequence, image and digital sensor data are routinely generated during the course of a disease outbreak, together with more traditional epidemiological data) but also through multiple partners investigating different facets, from different specialities or covering different geographical areas.
This has been seen in recent major epidemics including the 2009 influenza pandemic (Fraser et al., 2009), Middle-East Respiratory Syndrome outbreaks (Cauchemez et al., 2014) or the West-African Ebola epidemic (WHO Ebola Response Team, 2014, 2015). In this context, the safe storage and swift exchange of epidemiological data between collaborators and institutions is key to the successful assessment of, and response to, infectious disease epidemics. Consequently, a great deal of effort has been recently devoted to standardising platforms for the analysis of epidemiological data with software tools being constructed to permit interoperability between separate methodological approaches (Jombart et al., 2014). Similar efforts have also been made in the fields of epidemiological data-gathering and recording (Aanensen et al., 2009; ECDC, 2015). Overall however, there is a scarcity of systematised approaches for the transfer of data. The production of such a capability would vastly improve our ability to transfer information between systems and in doing so aid the interpretation of disease dynamics and ultimately protect a greater number of individuals. Yet epidemics data are still, usually, held as a potentially confusing mass of spread-sheets, databases, text and binary files. A universal format enabling the coherent storage and transfer of these data is lacking. As a consequence, misinterpretation of the data may happen during transfer and result in errors being introduced into subsequent analyses and reports. Unfortunately, the inherent complexity of epidemiological data magnifies the risks of such errors. Fig. 1 illustrates the major systems within an epidemiology work-flow where a standard for digital epidemiology data would be of assistance. A major difficulty associated with transferring epidemiological data lies in the degree of complexity that a dataset may display. The information that is recorded may vary markedly not only between outbreaks but also within a single outbreak.
In addition, the epidemic context itself makes data collection a daunting task, leading to some inevitable disparities in the data recorded. Despite these challenges, we can identify a common structure to epidemiological datasets that can make the task of storing them easier. At the top level of this common structure is information relating to the dataset as a whole, such as the name of the infection that is causing the epidemic or the particular geographic setting of the study. This information is meta-data. At a second level, most datasets are divided into subunits (units-of-record) that hold other information. These subunits could be individuals, regions, countries or time periods. In a conventional spreadsheet these units-of-record are usually stored as rows. The information relating to these units-of-record makes up the third level and is usually stored as columns in a conventional spreadsheet. This information can either relate directly to the unit-of-record itself (such as gender for an individual) or can relate to an event happening to or at that unit-o
Keywords Extracted from PMC Text: UDUNITS2 unit database EpiJSON RFC 4122 NetCDF unit-of-record id" " OutbreakTools −2147,483,647 C's long type (i.e. in https://github.com/Hackout2/EpiJSON RFC3339 work-flows []" Fig. 2 https://cran.r-project.org/web/packages/repijson/. line EpiJSON's ISO JSON patient humans spatial-but CF UUID GeoJSON spread-sheets Cauchemez
Extracted PMC Text Content in Record: First 5000 Characters: Infectious disease epidemiology relies on integrating increasingly diverse and complex data. This complexity comes not only from the types of data now collected (for example genetic sequence, image and digital sensor data are routinely generated during the course of a disease outbreak, together with more traditional epidemiological data) but also through multiple partners investigating different facets, from different specialities or covering different geographical areas. This has been seen in recent major epidemics including the 2009 influenza pandemic (Fraser et al., 2009), Middle-East Respiratory Syndrome outbreaks (Cauchemez et al., 2014) or the West-African Ebola epidemic (WHO Ebola Response Team, 2014; WHO Ebola Response Team, 2015). In this context, the safe storage and swift exchange of epidemiological data between collaborators and institutions is key to the successful assessment of, and response to, infectious disease epidemics. Consequently, a great deal of effort has been recently devoted to standardising platforms for the analysis of epidemiological data with software tools being constructed to permit interoperability between separate methodological approaches (Jombart et al., 2014). Similar efforts have also been made in the fields of epidemiological data-gathering and recording (Aanensen et al., 2009; ECDC, 2015). Overall however, there is a scarcity of systematised approaches for the transfer of data. The production of such a capability would vastly improve our ability to transfer information between systems and in doing so aid the interpretation of disease dynamics and ultimately protect a greater number of individuals. Yet epidemics data are still, usually, held as a potentially confusing mass of spread-sheets, databases, text and binary files. A universal format enabling the coherent storage and transfer of these data is lacking.
As a consequence, misinterpretation of the data may happen during transfer and result in errors being introduced into subsequent analyses and reports. Unfortunately, the inherent complexity of epidemiological data magnifies the risks of such errors. Fig. 1 illustrates the major systems within an epidemiology work-flow where a standard for digital epidemiology data would be of assistance. A major difficulty associated with transferring epidemiological data lies in the degree of complexity that a dataset may display. The information that is recorded may vary markedly not only between outbreaks but also within a single outbreak. In addition, the epidemic context itself makes data collection a daunting task, leading to some inevitable disparities in the data recorded. Despite these challenges, we can identify a common structure to epidemiological datasets that can make the task of storing them easier. At the top level of this common structure is information relating to the dataset as a whole, such as the name of the infection that is causing the epidemic or the particular geographic setting of the study. This information is meta-data. At a second level, most datasets are divided into subunits (units-of-record) that hold other information. These subunits could be individuals, regions, countries or time periods. In a conventional spreadsheet these units-of-record are usually stored as rows. The information relating to these units-of-record makes up the third level and is usually stored as columns in a conventional spreadsheet. This information can either relate directly to the unit-of-record itself (such as gender for an individual) or can relate to an event happening to or at that unit-of-record (such as the onset of symptoms for an individual). Any format for the conveyance of epidemiological data has two competing goals: consistency and flexibility. 
With this and the common morphology of a dataset in mind, we propose a standard for the storage and transmission of data for infectious disease epidemiology: EpiJSON (Epidemiological JavaScript Object Notation). This format is intended to be language and software agnostic, to be simple to implement, and to leverage modern data standards whilst maintaining the flexibility to represent most epidemiological data. While initially developed for problems within the infectious disease domain, EpiJSON is applicable to any dataset where "events" happen to "units of record". We believe that it is sufficiently flexible to accommodate other datasets such as those found in non-communicable disease and chemical hazard areas. It has been designed to draw together all relevant epidemiological data into a single place so that, for example, genetic sequences may be stored alongside image data, a patient's standard demographic information and the disease trajectory data in an unambiguous manner. The EpiJSON format capitalises on the common structure of most epidemiological datasets outlined above. Fundamentally, the structure of an EpiJSON file consists of three levels that we term "metadata", "records" and "events" (Fig. 2). Within each of these
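The three-level layout named in the extracted text ("metadata", "records", "events") can be illustrated with a minimal Python sketch that nests spreadsheet-style rows into that shape. Only those three level names come from the text. The function name line_list_to_nested, the keys "id", "attributes", "name", "value" and "date", and the sample rows are hypothetical choices made for illustration; they are not taken from the published EpiJSON schema or from the repijson package.

```python
import json


def line_list_to_nested(rows, id_col, event_cols, dataset_info):
    """Nest flat line-list rows into a metadata / records / events layout.

    All key names other than "metadata", "records" and "events" are
    assumptions for this sketch, not the official EpiJSON field names.
    """
    records = []
    for row in rows:
        # Columns describing the unit-of-record itself (e.g. gender).
        attributes = [
            {"name": key, "value": value}
            for key, value in row.items()
            if key != id_col and key not in event_cols and value is not None
        ]
        # Columns describing something that happened to the unit-of-record
        # (e.g. onset of symptoms); missing dates are skipped.
        events = [
            {"name": col, "date": row[col]}
            for col in event_cols
            if row.get(col) is not None
        ]
        records.append(
            {"id": row[id_col], "attributes": attributes, "events": events}
        )
    # Dataset-wide information sits at the top level.
    metadata = [{"name": k, "value": v} for k, v in dataset_info.items()]
    return {"metadata": metadata, "records": records}


# Made-up example rows: one unit-of-record per row, as in a spreadsheet.
rows = [
    {"id": "A1", "gender": "F", "onset": "2014-03-02"},
    {"id": "A2", "gender": "M", "onset": None},
]
doc = line_list_to_nested(rows, "id", ["onset"], {"disease": "influenza"})
text = json.dumps(doc, indent=2)
print(text)
```

Keeping events separate from attributes mirrors the distinction drawn in the text between information about a unit-of-record and things that happen to it. A real converter would additionally validate the output against the schema the authors describe.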
PDF JSON Files: document_parses/pdf_json/141bb78c68caf0302c6f31eac91d1ff5197d5771.json
PMC JSON Files: document_parses/pmc_json/PMC7104924.xml.json
G_ID: epijson_a_unified_data_format_for_epidemiology