Machine-Readable Documents

From HandWiki

Machine-readable documents are documents whose content can be readily processed by computers. Such documents are distinguished from machine-readable data by virtue of having sufficient structure to provide the necessary context to support the business processes for which they are created.

Definition

Data without context (language use) is meaningless and lacks the four essential characteristics of trustworthy business records specified in ISO 15489 Information and documentation -- Records management: authenticity, reliability, integrity, and usability.

The vast bulk of information is unstructured data and, from a business perspective, that means it is "immature", i.e., Level 1 (chaotic) of the Capability Maturity Model. Such immaturity fosters inefficiency, diminishes quality, and limits effectiveness. Unstructured information is also ill-suited for records management functions, provides inadequate evidence for legal purposes, drives up the cost of discovery in litigation, and makes access and usage needlessly cumbersome in routine, ongoing business processes.

There are at least four aspects to machine-readability:

  • First, words or phrases should be discretely delineated (tagged) so that computer software and/or hardware logic can be applied to them as individual conceptual elements.
  • Second, the semantics of each element should be specified so that computers can help human beings achieve a common understanding of their meanings and potential usages.
  • Third, if the relationships among the individual elements are also specified, computers can automatically apply inferences to them, thereby further relieving human beings of the burden of trying to understand them, particularly for purposes of inquiry, discovery, and analysis.
  • Fourth, if the structures of the documents in which the elements occur are also specified, human understanding is further enhanced and the data becomes more reliable for legal and business-quality purposes.
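As a rough illustration, the four aspects can be sketched in Python with the standard-library xml.etree.ElementTree module. Everything specific here is an assumption made for the example: the element names (Plan, Goal, Objective, Name), the goalRef attribute, and the SEMANTICS and STRUCTURE tables are hypothetical and not taken from any particular standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: every name below is invented for illustration.
DOC = """
<Plan>
  <Goal id="g1"><Name>Improve record-keeping</Name></Goal>
  <Objective id="o1" goalRef="g1"><Name>Tag all new records</Name></Objective>
  <Objective id="o2" goalRef="g1"><Name>Publish plans as XML</Name></Objective>
</Plan>
"""

# Aspect 2 (semantics): a glossary mapping each tag to its intended meaning.
SEMANTICS = {
    "Plan": "A document setting out goals and objectives.",
    "Goal": "A desired long-term result.",
    "Objective": "A measurable step toward a goal.",
    "Name": "A short human-readable label.",
}

# Aspect 4 (structure): which child tags each element may contain.
STRUCTURE = {
    "Plan": {"Goal", "Objective"},
    "Goal": {"Name"},
    "Objective": {"Name"},
    "Name": set(),
}

def structure_errors(element):
    """Return messages for any children not allowed by STRUCTURE."""
    errors = []
    allowed = STRUCTURE.get(element.tag, set())
    for child in element:
        if child.tag not in allowed:
            errors.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        errors.extend(structure_errors(child))
    return errors

def objectives_by_goal(root):
    """Aspect 3 (relationships): follow goalRef links to group objectives by goal."""
    goals = {g.get("id"): g.findtext("Name") for g in root.findall("Goal")}
    grouped = {name: [] for name in goals.values()}
    for obj in root.findall("Objective"):
        grouped[goals[obj.get("goalRef")]].append(obj.findtext("Name"))
    return grouped

# Aspect 1 (tagging): parsing yields discrete, individually addressable elements.
root = ET.fromstring(DOC.strip())
tags = sorted({el.tag for el in root.iter()})
undefined = [t for t in tags if t not in SEMANTICS]

print(tags)
print(undefined, structure_errors(root))
print(objectives_by_goal(root))
```

Because the relationships are explicit, software can answer a question such as "which objectives support this goal?" without a person reading the text.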

As early as 1981, the U.S. Government Accountability Office (GAO) began reporting on the problem of inadequate record-keeping practices in the U.S. federal government.[1] Such deficiencies are not unique to government, and advances in information technology mean that most information is now "born digital" and thus potentially far more easily managed by automated means.[2] However, in testimony to Congress in 2010, GAO highlighted problems with managing electronic records, and as recently as 2015, GAO has continued to report inadequacies in the performance of Executive Branch agencies in meeting records management requirements.[3][4] Moreover, more than a decade after a major and formerly highly respected auditing firm, Arthur Andersen, met its demise due to a records destruction scandal, record-keeping practices became a central issue in the 2016 Presidential election.

On January 4, 2011, President Obama signed H.R. 2142, the Government Performance and Results Act (GPRA) Modernization Act of 2010 (GPRAMA), into law as P.L. 111-352. Section 10 of GPRAMA requires U.S. federal agencies to publish their strategic and performance plans and reports in searchable, machine-readable format.[5] Additionally, in 2013, he issued Executive Order 13642, Making Open and Machine Readable the New Default for Government Information, which extended the requirement to government information generally.[6] On July 28, 2016, the Office of Management and Budget (OMB) followed up by including in the revised issuance of Circular A-130 direction for agencies to use open, machine-readable formats and to publish "public information online in a manner that promotes analysis and reuse for the widest possible range of purposes", meaning that the information is both publicly accessible and machine-readable.

In support of such policy direction, technological advancement is enabling more efficient and effective management and use of machine-readable electronic records. Document-oriented databases have been developed for storing, retrieving, and managing document-oriented information, also known as semi-structured data. Extensible Markup Language (XML) is a World Wide Web Consortium (W3C) Recommendation setting forth rules for encoding documents in a format that is both human-readable and machine-readable. Many XML editor tools have been developed and most, if not all, major information technology applications support XML to greater or lesser degrees. The fact that XML itself is an open, standard, machine-readable format makes it relatively easy for application developers to do so.
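The idea of storing and querying semi-structured documents can be sketched with a toy in-memory store in Python. The TinyDocumentStore class, its method names, and the sample records are hypothetical; real document-oriented databases add persistence, indexing, and richer query languages that this sketch does not attempt.

```python
import json

class TinyDocumentStore:
    """Toy in-memory store for semi-structured documents (illustration only)."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, document):
        # Keep a serialized copy so later changes to the caller's dict do not leak in.
        self._docs[doc_id] = json.dumps(document)

    def get(self, doc_id):
        return json.loads(self._docs[doc_id])

    def find(self, **criteria):
        """Return sorted ids of documents whose top-level fields match all criteria."""
        matches = []
        for doc_id, raw in self._docs.items():
            doc = json.loads(raw)
            if all(doc.get(key) == value for key, value in criteria.items()):
                matches.append(doc_id)
        return sorted(matches)

store = TinyDocumentStore()
# Documents need not share the same fields: that is what "semi-structured" means here.
store.put("r1", {"type": "memo", "author": "A. Clerk", "subject": "Retention"})
store.put("r2", {"type": "report", "author": "A. Clerk", "pages": 12})
store.put("r3", {"type": "memo", "author": "B. Auditor"})

print(store.find(type="memo"))
print(store.find(author="A. Clerk", type="report"))
```

Each document carries its own field names, so a query can be expressed against the fields that happen to be present rather than against a fixed table layout.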

The W3C's accompanying XML Schema (XSD) Recommendation specifies how to formally describe the elements in an XML document. With respect to the specification of XML schemas, the Organization for the Advancement of Structured Information Standards (OASIS) is a leading standards-developing organization. However, many technical developers prefer to work with JSON. To define the structure of JSON data for validation, documentation, and interaction control, JSON Schema was developed, with its specifications published as Internet-Drafts through the Internet Engineering Task Force (IETF).
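What schema validation involves can be shown with a minimal Python sketch. The check function below understands only three JSON Schema keywords (type, required, and properties) and is not a conforming implementation; the schema, the sample records, and the error message wording are all hypothetical.

```python
import json

# Hypothetical schema for a record, using only a few JSON Schema keywords.
SCHEMA = {
    "type": "object",
    "required": ["title", "created"],
    "properties": {
        "title": {"type": "string"},
        "created": {"type": "string"},
        "pages": {"type": "integer"},
    },
}

# Map JSON Schema type names to Python types (subset only).
_TYPES = {"object": dict, "string": str, "integer": int, "array": list, "boolean": bool}

def check(instance, schema, path="$"):
    """Return a list of error strings; an empty list means the instance passed."""
    errors = []
    expected = schema.get("type")
    if expected is not None:
        ok = isinstance(instance, _TYPES[expected])
        # In Python, bool is a subclass of int; JSON Schema treats them as distinct.
        if expected == "integer" and isinstance(instance, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}"]
    if isinstance(instance, dict):
        for name in schema.get("required", []):
            if name not in instance:
                errors.append(f"{path}: missing required property '{name}'")
        for name, subschema in schema.get("properties", {}).items():
            if name in instance:
                errors.extend(check(instance[name], subschema, f"{path}.{name}"))
    return errors

good = json.loads('{"title": "Annual plan", "created": "2016-07-28", "pages": 40}')
bad = json.loads('{"title": 7, "pages": "forty"}')

print(check(good, SCHEMA))
for message in check(bad, SCHEMA):
    print(message)
```

In practice a maintained validator library would be used instead of a hand-written check; the sketch only shows that a schema is itself machine-readable data against which documents can be tested automatically.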

The Portable Document Format (PDF) is a file format used to present documents in a manner independent of application software, hardware, and operating systems. Each PDF file encapsulates a complete description of the presentation of the document, including the text, fonts, graphics, and other information needed to display it. PDF/A is an ISO-standardized version of PDF specialized for use in the archiving and long-term preservation of electronic documents. PDF/A-3 allows embedding of other file formats, including XML, into PDF/A conforming documents, thus potentially providing the best of both human- and machine-readability. The W3C's XSL-FO (XSL Formatting Objects) markup language is commonly used to generate PDF files.

Metadata, data about data, can be used to organize electronic resources, provide digital identification, and support the archiving and preservation of resources. In well-structured, machine-readable electronic records, the content can be repurposed as both data and metadata. In the context of electronic record-keeping systems, the terms "management" and "metadata" are virtually synonymous. Given proper metadata, records management functions can be automated, thereby reducing the risk of spoliation of evidence and other fraudulent manipulations of records. Moreover, such records can be used to automate the process of auditing data maintained in databases, thereby reducing the risk of single points of failure associated with the Machiavellian concept of a single source of truth.
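As a minimal sketch of how metadata can drive an automated records management decision, the Python function below decides a record's disposition from its metadata alone. The field names (type, created, legal_hold), the retention periods, and the returned labels are hypothetical and do not reflect any actual retention schedule.

```python
from datetime import date

# Hypothetical retention schedule: record type -> years to keep.
RETENTION_YEARS = {"invoice": 7, "memo": 3}

def disposition(metadata, today):
    """Decide what to do with a record using only its metadata."""
    # A legal hold overrides the schedule, guarding against spoliation of evidence.
    if metadata.get("legal_hold"):
        return "retain (legal hold)"
    years = RETENTION_YEARS.get(metadata["type"])
    if years is None:
        return "review (no schedule for this type)"
    created = date.fromisoformat(metadata["created"])
    try:
        expires = created.replace(year=created.year + years)
    except ValueError:
        # Created on 29 February and the target year is not a leap year.
        expires = created.replace(year=created.year + years, day=28)
    return "eligible for disposal" if today >= expires else "retain"

today = date(2016, 1, 1)
print(disposition({"type": "memo", "created": "2012-05-01"}, today))
print(disposition({"type": "invoice", "created": "2012-05-01"}, today))
print(disposition({"type": "memo", "created": "2012-05-01", "legal_hold": True}, today))
```

Because the rule reads only structured metadata, it can be applied uniformly to every record and its outcome can be audited after the fact.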

Blockchain is a technology for maintaining continuously growing lists of records secured from tampering and revision. A key feature is that every node in a decentralized system has a copy of the blockchain, so there is no single point of failure subject to manipulation and fraud.
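The tamper-evidence property can be illustrated with a minimal hash-chain sketch in Python. It shows only how linking each record to the hash of the previous one exposes later alteration; it does not model decentralized replication or consensus among nodes. The function names and record fields are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block

def block_hash(index, record, previous_hash):
    """Hash a block's contents together with the hash of the block before it."""
    payload = json.dumps(
        {"index": index, "record": record, "previous_hash": previous_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append(chain, record):
    """Add a record to the chain, linking it to the current last block."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "record": record,
        "previous_hash": previous_hash,
        "hash": block_hash(index, record, previous_hash),
    })

def is_valid(chain):
    """Recompute every hash and check each link to the previous block."""
    previous_hash = GENESIS_HASH
    for block in chain:
        if block["previous_hash"] != previous_hash:
            return False
        if block["hash"] != block_hash(block["index"], block["record"], previous_hash):
            return False
        previous_hash = block["hash"]
    return True

chain = []
append(chain, {"doc": "plan.xml", "action": "created"})
append(chain, {"doc": "plan.xml", "action": "approved"})
print(is_valid(chain))

# Altering an earlier record invalidates the chain unless every later hash is redone.
tampered = json.loads(json.dumps(chain))
tampered[0]["record"]["action"] = "deleted"
print(is_valid(tampered))
```

When many independent nodes each hold a copy, a party that rewrites its own copy produces a chain that disagrees with everyone else's, which is the sense in which there is no single point of failure.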

References