Glossary
Term | Definition |
---|---|
Annotation | Linguistic information added to a speech signal or text. It may include information on any level of language, e.g. part-of-speech labels, morphological analysis, markup of sentence boundaries, speaker turns and overlaps. A transcription can also be considered a form of annotation in relation to its audio source. |
Collection | In linguistics, this refers to a more or less structured assemblage of digitised texts with some common properties, e.g. a particular type of discourse. Also called a text archive. |
Corpus (plural: corpora) | A large, structured collection of authentic (written or spoken) texts that have been compiled in electronic form according to a specific set of criteria, to represent a language or language variety, and particular speakers/writers, at a specified date or period. |
Data cleansing | The process of correcting data errors to bring data quality to a level acceptable for the needs of AusNC information consumers. |
Data model | A representation of the data describing objects and the relationships between the objects, independent of any associated process. A data model may include a set of diagrams for each view along with the metadata defining each object in the model. A complete data model may also include state transition diagrams depicting each major entity lifecycle and value chain analysis linking the data model to processes, roles, organizations, goals, applications and projects. |
Hapax | A word or form appearing only once in the Australian National Corpus. |
Ingest | A verb used in computational engineering to refer to capturing or transferring video, audio, and metadata from one media storage system to another. |
Item | A digitised record of a linguistic event, as defined in the structure of the corpus or collection. It can be a sample text, audio file, video file, or any combination of these. |
Language database | A classified collection of linguistic elements, digitised individually, not as continuous text. |
Linguistics | The study of human language, which may be undertaken from many different aspects, for example, sounds (phonetics) or structures of words (morphology) or meanings (semantics), as well as text-types and their structure, the mediums of communication, and the interaction between participants in conversation. |
Metadata | Information used to describe items and groups of items, i.e. data about data at different levels. At the collection level, it is information about a whole collection/corpus. At the item level, it is information about features of the individual text within a collection/corpus (e.g. participant characteristics, time, text-type). |
Parsing | In computer science and linguistics, parsing is the process of analyzing a text, made of a sequence of tokens (for example, words), to determine its grammatical structure with respect to a given (more or less) formal grammar. |
Sociolinguistics | The study of how language expresses social identity, e.g. the age, gender, education, and socio-economic status of the individual. |
Transcript | A written representation of the text of an audio or video file. This can be considered an item in its own right, or a form of annotation. |
*Some terms were developed from Wikipedia, the free encyclopedia, and the online Merriam-Webster Dictionary.
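
To illustrate the Hapax entry, the following is a minimal sketch of how words occurring only once might be found in a small text. The function name `find_hapaxes`, the sample sentence, and the simple tokenisation rule (lowercase runs of letters and apostrophes) are assumptions made for this example; they do not describe how the Australian National Corpus is actually processed.

```python
import re
from collections import Counter


def find_hapaxes(text):
    """Return the sorted word forms that occur exactly once in `text`."""
    # Assumed tokenisation: lowercase the text and keep runs of
    # letters and apostrophes. Real corpora use richer tokenisers.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return sorted(word for word, n in counts.items() if n == 1)


# Hypothetical sample text, not drawn from any corpus.
sample = "the cat sat on the mat and the dog sat too"
print(find_hapaxes(sample))
```

In this sketch a form counts as a hapax relative to the text passed in; applied to a whole corpus, the same count would be taken over every item in it.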