RECOMMENDATIONS FOR
Text Collections
1. Encode texts in standard formats, preferring XML/TEI or other appropriate schemas for literary texts, to ensure interoperability and facilitate re-use.
PRODUCE
Standard formats
Based on a collection’s scientific objectives and technical requirements, texts can be prepared in different formats. A common approach, especially for Natural Language Processing (NLP), is plain text with accompanying metadata and annotations in structured formats like CSV and JSON. In these cases, Unicode character encoding is essential to ensure full accessibility, reusability, and interoperability.
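As a minimal sketch of the plain-text approach, the Python snippet below normalises a text to Unicode NFC, encodes it as UTF-8, and serialises a small JSON metadata record to accompany it. The field names (`id`, `title`, `language`, `encoding`, `normalization`) and the sample values are assumptions made for illustration, not a prescribed schema.

```python
import json
import unicodedata


def prepare_text(raw: str) -> bytes:
    """Normalise a string to Unicode NFC and encode it as UTF-8."""
    normalised = unicodedata.normalize("NFC", raw)
    return normalised.encode("utf-8")


def make_metadata(text_id: str, title: str, language: str) -> str:
    """Serialise a hypothetical metadata record as JSON."""
    record = {
        "id": text_id,
        "title": title,
        "language": language,
        "encoding": "UTF-8",
        "normalization": "NFC",
    }
    # ensure_ascii=False keeps non-ASCII characters readable in the output
    return json.dumps(record, ensure_ascii=False, indent=2)


# "e" followed by a combining acute accent (U+0301) becomes one code point in NFC
sample = "Cite\u0301"
text_bytes = prepare_text(sample)
metadata_json = make_metadata("text-0001", "Sample text", "fr")
```

Normalising before encoding means that visually identical strings compare as equal, which matters for search and for alignment with stand-off annotations.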
XML/TEI is another widely used format, as it provides a comprehensive framework for encoding and describing textual data. Its guidelines include a module dedicated to linguistic corpora. For more information about TEI, its tools, and other specialised encoding schemas based on or inspired by TEI, please refer to the digital scholarly editions recommendations.
- When detailed encoding is impractical due to time or resource constraints, consider implementing a “light” encoding first (e.g., ALIM, Biblioteca Italiana) and adding further levels of encoding later, progressively covering more aspects and phenomena.
- If custom encoding schemas were used, publish and describe them in the documentation.
2. Always cite the sources used for text preparation by providing complete bibliographic references and links to descriptive web resources, if available.
PRODUCE
A comprehensive set of metadata describing the sources used for text preparation enables users to verify both the editorial work and text quality.
- Use a metadata standard such as Dublin Core or XML/TEI (<sourceDesc> element in the header).
- Prefer links to digital libraries, online catalogues and other similar resources that provide persistent identifiers for their objects.
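The following Python sketch shows one possible way to record a source in a TEI header, using only the standard library. It builds a minimal `teiHeader` whose `sourceDesc` holds a `bibl` citation and a `ref` link. The title, citation, and URL are hypothetical placeholders, and a real header would normally carry far more detail.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialise TEI elements in the default namespace (no prefix)
ET.register_namespace("", TEI_NS)


def tei(tag: str) -> str:
    """Return the namespace-qualified name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def build_header(title: str, source_citation: str, source_url: str) -> ET.Element:
    """Build a minimal teiHeader that documents the source of a text."""
    header = ET.Element(tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))

    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title

    publication_stmt = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(publication_stmt, tei("p")).text = "Unpublished draft."

    # The bibliographic reference and a link to a descriptive web resource
    source_desc = ET.SubElement(file_desc, tei("sourceDesc"))
    bibl = ET.SubElement(source_desc, tei("bibl"))
    bibl.text = source_citation + " "
    ref = ET.SubElement(bibl, tei("ref"), {"target": source_url})
    ref.text = source_url
    return header


# Placeholder values, for illustration only
header = build_header(
    title="Sample text: a digital transcription",
    source_citation="Hypothetical Author, Sample Title (Place: Publisher, 1900).",
    source_url="https://example.org/catalogue/record/123",
)
xml_string = ET.tostring(header, encoding="unicode")
```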
Metadata
For collection metadata, several standards and formats are available alongside TEI:
- CIDOC CRM, an ISO standard and reference ontology that provides definitions and formal structure for describing cultural heritage documentation concepts and relationships;
- Dublin Core, a simple, generic metadata element set for various digital object types, adopted worldwide;
- MARC 21, a well-established standard for exchanging bibliographic records, developed and maintained by the library community;
- Metadata Encoding and Transmission Standard (METS), an XML schema for encoding structural metadata about complex digital objects;
- MODS (Metadata Object Description Schema), an XML schema for descriptive metadata compatible with MARC 21 bibliographic format.
For more information about metadata standards, refer to the arts and humanities standards listed in the Metadata Standards Catalog.
3. Link authors’ and works’ records to corresponding authority records if available, e.g., VIAF, Wikidata.
PLAN PRODUCE
By linking the collected works and their authors to authority records, researchers can search across multiple datasets using standardised identifiers, particularly when the collection offers API access.
Examples in Spadini, Elena, and José Luis Losada Palenzuela. “Re-Using Data from Editions.” Digital Editing and Publishing in the Twenty-First Century, edited by James O’Sullivan et al., 1st ed., Scottish Universities Press, 2025, https://doi.org/10.62637/sup.GHST9020.8.
Example: The Perseus Catalog.
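As a rough illustration of why shared identifiers help, the Python sketch below stores authority identifiers on an author record and treats two records from different datasets as the same entity when they share an authority URI. The `AuthorRecord` class and its field names are hypothetical. The Wikidata identifier used for Dante is given for illustration and should be checked against the authority file before reuse, and the VIAF field is left empty here.

```python
from dataclasses import dataclass
from typing import Dict, Optional

WIKIDATA_ENTITY = "http://www.wikidata.org/entity/"
VIAF_BASE = "https://viaf.org/viaf/"


@dataclass
class AuthorRecord:
    """A hypothetical author record carrying optional authority identifiers."""

    name: str
    wikidata_id: Optional[str] = None
    viaf_id: Optional[str] = None

    def authority_uris(self) -> Dict[str, str]:
        """Return the authority URIs available for this record."""
        uris = {}
        if self.wikidata_id:
            uris["wikidata"] = WIKIDATA_ENTITY + self.wikidata_id
        if self.viaf_id:
            uris["viaf"] = VIAF_BASE + self.viaf_id
        return uris


def same_entity(a: AuthorRecord, b: AuthorRecord) -> bool:
    """Two records match if they share at least one authority URI."""
    shared = set(a.authority_uris().items()) & set(b.authority_uris().items())
    return bool(shared)


# The same author under two name forms, as might occur in two datasets
record_a = AuthorRecord("Dante Alighieri", wikidata_id="Q1067")
record_b = AuthorRecord("Alighieri, Dante", wikidata_id="Q1067")
record_c = AuthorRecord("Unidentified author")
```

Matching on identifiers instead of name strings avoids false negatives caused by spelling variants, inversions, or transliteration.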
4. In the documentation, specify collection criteria and editorial criteria, stating the philological methodologies applied and the edition types.
PRODUCE
This information helps users understand the relationship between the collected text and its sources, as well as the editorial preparation process. For example, it clarifies whether the text is a transcription of an audio file or performance, a document processed through OCR software, or a critical edition.
- To specify the edition type, you can reference established definitions from scholarly literature or online resources like the Parvum Lexicon Stemmatologicum and the Lexicon of Scholarly Editing.
- For presenting original documents as digital facsimiles, IIIF is recommended, particularly for institutions digitising their textual heritage. See the Guide for IIIF implementers.
5. Assign each text a persistent identifier.
DEPOSIT
This enables users to easily cite and reuse individual texts from the collection.
- When texts are published as individual units in repositories like ILC4CLARIN and Zenodo, PIDs are automatically assigned.
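For collections that manage identifiers themselves, a simple check can confirm that every text has a PID before release. The Python sketch below assumes the PIDs are DOIs and that the collection keeps a manifest of records with `id` and `pid` keys. The manifest structure and the DOI values are hypothetical placeholders, the regular expression is only a loose syntactic check, and other PID systems such as handles would need their own test.

```python
import re
from typing import Dict, List

# Loose syntactic check for a DOI: "10.", a registrant code, "/", a suffix
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def is_doi(value: str) -> bool:
    """Return True if the value looks like a DOI."""
    return bool(DOI_PATTERN.match(value.strip()))


def doi_to_url(doi: str) -> str:
    """Turn a DOI into a resolvable URL via the doi.org resolver."""
    if not is_doi(doi):
        raise ValueError(f"Not a DOI: {doi!r}")
    return "https://doi.org/" + doi.strip()


def texts_without_pid(manifest: List[Dict[str, str]]) -> List[str]:
    """List the ids of texts whose 'pid' is missing or malformed."""
    return [entry["id"] for entry in manifest if not is_doi(entry.get("pid") or "")]


# Hypothetical manifest; the DOI is a placeholder, not a registered identifier
manifest = [
    {"id": "text-0001", "pid": "10.5555/example.text.0001"},
    {"id": "text-0002", "pid": ""},
    {"id": "text-0003"},
]
missing = texts_without_pid(manifest)
```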
6. Document changes and current status thoroughly, indicating the number of available texts, the collection’s completeness relative to its scientific objectives and, if applicable, a roadmap for the evolution of the text collection.
DEPOSIT DISSEMINATE
Specifying the number of texts in a collection enables users to assess both the corpus’s completeness for their research goals and its overall representativeness.
For collections containing large volumes of texts, editorial work typically relies on time-limited funding, which leads to periodic additions to and revisions of the collection. Such cases require detailed documentation of the work’s status, including clear descriptions of previous work and future plans. This documentation gives users a clear understanding of the stability and reliability of the collection’s texts.
- You can format this part of the documentation as a changelog, following the guiding principles of the “Keep a Changelog” project, in particular marking changes to the texts in the collection as added, removed, changed, or fixed.
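A changelog entry of this kind can be generated from structured data. The Python sketch below renders one release in a Markdown layout modelled on Keep a Changelog, limited to the four change types named above. The function name, version number, date, and entries are assumptions made for illustration.

```python
from datetime import date
from typing import Dict, List

# The four change types mentioned in the recommendation, in display order
CHANGE_TYPES = ("Added", "Changed", "Fixed", "Removed")


def render_release(version: str, released: date, changes: Dict[str, List[str]]) -> str:
    """Render one release of the collection as a Markdown changelog entry."""
    unknown = set(changes) - set(CHANGE_TYPES)
    if unknown:
        raise ValueError(f"Unknown change types: {sorted(unknown)}")

    lines = [f"## [{version}] - {released.isoformat()}"]
    for change_type in CHANGE_TYPES:
        entries = changes.get(change_type, [])
        if not entries:
            continue  # skip empty sections
        lines.append(f"### {change_type}")
        lines.extend(f"- {entry}" for entry in entries)
    return "\n".join(lines)


# Hypothetical release of a text collection
entry = render_release(
    "1.1.0",
    date(2025, 1, 15),
    {
        "Added": ["12 newly transcribed texts (total: 240)"],
        "Fixed": ["OCR errors in text-0042"],
    },
)
```

Recording the running total of texts in the "Added" entries also documents the size of the collection at each release.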
7. Facilitate text exploration through search functionalities, indexes and sub-collections.
DISSEMINATE
A well-designed search functionality can help users easily find meaningful content in the text collection, while indexes provide quick access to individual texts or sub-collections.
Sub-collections can showcase the information contained in the collection, by organising texts by theme, topic, author, genre, etc. To help users engage with the collection, sub-collections should model how users can approach the search functionality with a question or theme to produce meaningful results.
Examples in Chapman, Alison, et al. “Browse, Search and Serendipity: Building Approachable Digital Editions.” Digital Editing and Publishing in the Twenty-First Century, edited by James O’Sullivan et al., 1st ed., Scottish Universities Press, 2025, https://doi.org/10.62637/sup.GHST9020.6.
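At its simplest, an index or sub-collection is a grouping of text identifiers by a metadata field. The Python sketch below builds such an index and adds a basic case-insensitive title search. The record structure and sample values are hypothetical, and a production collection would more likely rely on a dedicated search engine.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical records describing texts in a collection
texts = [
    {"id": "text-0001", "title": "Sample Elegy", "author": "Author A", "genre": "poetry"},
    {"id": "text-0002", "title": "Sample Chronicle", "author": "Author B", "genre": "prose"},
    {"id": "text-0003", "title": "Another Elegy", "author": "Author A", "genre": "poetry"},
]


def build_index(records: List[Dict[str, str]], key: str) -> Dict[str, List[str]]:
    """Group text ids by the value of one metadata field."""
    index = defaultdict(list)
    for record in records:
        index[record[key]].append(record["id"])
    # Sort keys and ids so the index is stable across runs
    return {value: sorted(ids) for value, ids in sorted(index.items())}


def search_titles(records: List[Dict[str, str]], query: str) -> List[str]:
    """Return ids of texts whose title contains the query, ignoring case."""
    needle = query.casefold()
    return [r["id"] for r in records if needle in r["title"].casefold()]


by_genre = build_index(texts, "genre")
by_author = build_index(texts, "author")
hits = search_titles(texts, "elegy")
```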
Existing guidelines
The National Information Standards Organization (NISO) Framework of Guidance for Building Good Digital Collections (3rd edn, 2007) provides a comprehensive set of principles that in 2007 anticipated the FAIR principles. The NISO principles are organised around four core entities: collections, collected objects, metadata, and “initiatives” (programmes or projects for creating and managing collections). We offer the complete list of principles below. Please refer to the document linked above for practical guidance on how to achieve these quality requirements.
Collections
- A good digital collection is created according to an explicit collection development policy.
- Collections should be described so that a user can discover characteristics of the collection, including scope, format, restrictions on access, ownership, and any information significant for determining the collection’s authenticity, integrity, and interpretation.
- A good collection is curated, which is to say, its resources are actively managed during their entire lifecycle.
- A good collection is broadly available and avoids unnecessary impediments to use. Collections should be accessible to persons with disabilities, and usable effectively in conjunction with adaptive technologies.
- A good collection respects intellectual property rights.
- A good collection has mechanisms to supply usage data and other data that allows standardised measures of usefulness to be recorded.
- A good collection is interoperable.
- A good collection integrates into the user’s own workflow.
- A good collection is sustainable over time.
Objects
- A good object exists in a format that supports its intended current and future use.
- A good object is preservable.
- A good object is meaningful and useful outside of its local context.
- A good object will be named with a persistent, globally unique identifier that can be resolved to the current address of the object.
- A good object can be authenticated.
- A good object has associated metadata.
Metadata
- Good metadata conforms to community standards in a way that is appropriate to the materials in the collection, users of the collection, and current and potential future uses of the collection.
- Good metadata supports interoperability.
- Good metadata uses authority control and content standards to describe objects and collocate related objects.
- Good metadata includes a clear statement of the conditions and terms of use for the digital object.
- Good metadata supports the long-term curation and preservation of objects in collections.
- Good metadata records are objects themselves and therefore should have the qualities of good objects, including authority, authenticity, archivability, persistence, and unique identification.
Initiatives
- A good digital initiative has a substantial design and planning component.
- A good digital initiative has an appropriate level of staffing with necessary expertise to achieve its objectives.
- A good digital initiative follows best practices for project management.
- A good digital initiative has an evaluation component.
- A good digital initiative markets itself and broadly disseminates information about the initiative’s process and outcomes.
- A good digital initiative considers the entire lifecycle of the digital collection and associated services.
For developing a complete IT environment to create and manage digital collections, the Reference model for an Open Archival Information System (OAIS) serves as the standard model. However, this technical and complex work typically extends beyond individual scholars’ scope.
See the OAIS Introductory Guide (2nd Edition).
Finally, the RIDE journal offers quality evaluation criteria for digital text collections, providing guidance for collection preparation and management. See the RIDE Criteria for Reviewing Digital Text Collections, version 1.0.