Language.
Language is a structure of meanings systematically applied to particular sounds and symbols, which are then consistently related through an organized grammar. The structure and function of language have been studied and discussed in a number of scholarly disciplines. While we recommend that our users maintain a working knowledge of the scholarly literature on language, a few of its conclusions are of particular interest to the use of Raven’s Eye. Therefore, we explain them in these Technicals.
Foremost is the intimate and influential relationship between language, cognition, and perception. For most of the 20th century, scholars debated whether our perceptions, and the types and forms of our thoughts about them, are dictated by the concepts available in a particular language. As the scholarship in this area has grown, a consensus has arisen that, in general, variation in the concepts available within a given language on a particular topic influences (but does not absolutely determine) both the perceptions and experiences that a person may have with respect to that topic, and the thoughts expressed about those experiences. In other words, while our particular language influences both the specific thoughts we have about a given experience and the manner in which we express these thoughts, the particularities of our language do not preclude us from experiencing aspects of the topic for which we have no words¹.
Lexical Hypothesis.
The lexical hypothesis proposes that languages will contain words for objects, events, and ideas that are common to the experiences of their respective speakers. It further proposes a positive relation between the commonality, centrality, or frequency of an experience to the speakers of a language, and the number of extant words available to describe various aspects of that experience². In this way, the lexical hypothesis proposes that the words and concepts that comprise a given language will be influenced by the everyday experiences and environmental contingencies of those who speak that language.
In psychology, the lexical hypothesis has been utilized to compare the experiences of people within a given language group, and to compare such individually varying traits as cognitive tendencies, motivation, and personality. When the lexical hypothesis is applied to individuals, it is the commonality, centrality, or frequency of a particular individual’s experiences or traits, compared to those of the group of language speakers, that influences general variation in the type and frequency of particular words expressed. In this way, the lexical hypothesis is extended to propose that variation in the frequency of word usage within a given group of language speakers reflects individual variation both in experiences and in psychological traits.
The lexical hypothesis can, therefore, be utilized to identify the relative variation in word or concept use, according to groups, individuals, and experiences.
Language corpora.
A corpus is a collection of written documents; more than one such collection is referred to as corpora. A language corpus is an aggregated body of work, which is gathered together to facilitate the identification of popular words and their forms. Lists or tables of words, parts of speech, and other linguistic features are typically derived from such corpora, and are often organized according to their frequency of appearance in the corpus at hand.
While language corpora serve many functions, in Raven’s Eye, they serve primarily as a background pool of words against which to compare an acquired natural language sample. When combined with the lexical hypothesis, these corpora facilitate the identification of words and themes that are relatively essential to your acquired natural language sample (and, as an extension, to your study).
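To make the idea of comparing a sample against a background pool concrete, here is a minimal sketch in Python. It is not Raven's Eye's actual method, which is not described here. It simply assumes that a word matters more to a sample when its rate of use in the sample is high relative to its rate in the background corpus. The function names, the toy background counts, and the crude +1 adjustment for unseen words are all illustrative assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def overrepresented_words(sample_text, background_counts, background_total, top_n=3):
    """Rank sample words by how much more often they occur in the sample
    than in the background corpus (hypothetical scoring, for illustration)."""
    sample_counts = Counter(tokenize(sample_text))
    sample_total = sum(sample_counts.values())
    scores = {}
    for word, count in sample_counts.items():
        sample_rate = count / sample_total
        # Crude +1 adjustment so words missing from the background
        # corpus do not cause a division by zero.
        background_rate = (background_counts.get(word, 0) + 1) / (background_total + 1)
        scores[word] = sample_rate / background_rate
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy background corpus: invented counts out of one million total words.
background = {"the": 60000, "and": 30000, "past": 500, "stopped": 300,
              "barn": 50, "horse": 20, "rode": 10}
sample = "The horse rode past the barn and the horse stopped"
print(overrepresented_words(sample, background, 1_000_000))
```

In this toy example, "the" is the most frequent word in the sample, yet it ranks low because it is just as common in the background corpus. Words such as "horse" rise to the top because they are rare in the background but recur in the sample.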
Currently, Raven's Eye maintains corpora in 65 languages. These include:
Language | Unique Words | Total Words |
Afrikaans | 430,000 | 25,020,000 |
Amharic | 90,000 | 1,450,000 |
Arabic | 2,140,000 | 215,430,000 |
Assamese | 170,000 | 5,000,000 |
Azerbaijani | 770,000 | 40,250,000 |
Bengali | 610,000 | 29,100,000 |
Bulgarian | 1,370,000 | 146,950,000 |
Burmese | 420,000 | 6,540,000 |
Croatian | 1,270,000 | 75,360,000 |
Czech | 2,260,000 | 228,360,000 |
Dutch | 3,730,000 | 545,070,000 |
English | 16,780,000 | 4,812,300,000 |
Farsi | 2,010,000 | 188,060,000 |
Finnish | 2,600,000 | 155,040,000 |
French | 6,030,000 | 1,436,790,000 |
German | 9,670,000 | 1,441,420,000 |
Greek | 1,130,000 | 85,860,000 |
Gujarati | 230,000 | 8,910,000 |
Hebrew | 1,610,000 | 183,610,000 |
Hindi | 630,000 | 41,810,000 |
Hungarian | 2,910,000 | 246,590,000 |
Icelandic | 300,000 | 12,800,000 |
Indonesian | 1,540,000 | 140,250,000 |
Irish | 160,000 | 6,950,000 |
Italian | 4,150,000 | 934,100,000 |
Javanese | 290,000 | 11,530,000 |
Kannada | 640,000 | 17,350,000 |
Kazakh | 880,000 | 54,570,000 |
Korean | 3,250,000 | 145,340,000 |
Kurdish | 180,000 | 4,850,000 |
Luxembourgish | 370,000 | 14,150,000 |
Malagasy | 470,000 | 14,290,000 |
Malay | 820,000 | 74,620,000 |
Malayalam | 800,000 | 20,510,000 |
Marathi | 340,000 | 12,880,000 |
Marathi | 310,000 | 12,880,000 |
Nepali | 320,000 | 13,660,000 |
Norwegian | 660,000 | 38,690,000 |
Oriya | 190,000 | 5,930,000 |
Polish | 3,780,000 | 482,010,000 |
Portuguese | 2,980,000 | 503,400,000 |
Punjabi | 240,000 | 9,540,000 |
Pushto | 140,000 | 3,800,000 |
Romanian | 1,610,000 | 147,670,000 |
Russian | 7,900,000 | 999,930,000 |
Sindhi | 90,000 | 2,380,000 |
Slovenian | 1,070,000 | 72,380,000 |
Spanish | 4,620,000 | 1,021,060,000 |
Sundanese | 180,000 | 8,210,000 |
Swahili | 190,000 | 7,650,000 |
Swedish | 4,980,000 | 874,690,000 |
Tagalog | 290,000 | 13,090,000 |
Tajik | 260,000 | 14,530,000 |
Tamil | 880,000 | 32,090,000 |
Tatar | 600,000 | 24,940,000 |
Telugu | 780,000 | 30,610,000 |
Thai | 1,280,000 | 37,700,000 |
Tibetan | 60,000 | 440,000 |
Turkish | 190,000 | 3,890,000 |
Ukrainian | 3,890,000 | 353,510,000 |
Urdu | 590,000 | 38,440,000 |
Uzbek | 380,000 | 25,020,000 |
Vietnamese | 1,870,000 | 289,930,000 |
Yiddish | 90,000 | 3,680,000 |
Yoruba | 100,000 | 3,240,000 |
Automatic Speech-to-text transcription.
Raven's Eye pairs with a world-class leader in artificial intelligence to provide automated speech-to-text transcription in 9 different languages. Our automatic transcription services are available for the following languages:
- Arabic (Modern Standard)
- English (UK and US)
- French
- German
- Japanese
- Korean
- Mandarin
- Portuguese (Brazilian)
- Spanish
Automated transcription involves probabilistic matching between the patterns of sound in the audio file and samples of sound patterns associated with specific words in the language selected as your current corpus. This is essentially the same process used by voice-operated smart home devices and predictive texting technologies.
Since our business is identifying themes in language, and not transcription itself, we partner with a world-class artificial intelligence leader to produce your transcript. As a result, you can rest assured that you are receiving state-of-the-art quality in your transcription. However, the state of the art is not yet infallible, and correct transcription depends on facets of the recording beyond the control of our partner and ourselves (including quality of the recording, slurred speech or relative lack of enunciation, strong accents, loud background noise, etc.). We recommend that subscribers always download, review and edit, and re-upload their initial transcripts before making scientific claims or substantial business decisions based on their results.
Transcription testing and review: Because each subscriber's transcript is derived from unique circumstances, the results of their individual transcription may vary substantially from those acquired by others. To help users get the best results, each new subscription comes with 100 free minutes of transcription (even though we still incur a cost for providing them). This allows subscribers to experiment with our partner's transcription services without incurring initial additional expense.
Prior to uploading a large audio file for transcription, we recommend that subscribers use their audio editing software to isolate a 3 to 5 minute selection of their recording, and then save this as a separate audio file for use in transcription testing. After uploading and transcribing this test file, subscribers can then export their results and troubleshoot any difficulties by adjusting the recording itself (such as by reducing background noise, adjusting the volume and reverberation, or other means determined necessary by the subscriber). This newly adjusted audio file can then be uploaded again for further testing, and this process can be repeated until the best transcription is acquired.
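For subscribers who prefer a script to an audio editor, the test selection can also be cut programmatically. The following is a minimal sketch that assumes the recording is an uncompressed WAV file and uses only Python's standard `wave` module. The function name and the file names in the usage note are hypothetical, and other audio formats would need a different tool.

```python
import wave


def extract_clip(source_path, clip_path, start_seconds, duration_seconds):
    """Copy a short segment of an uncompressed WAV file into a new WAV file."""
    with wave.open(source_path, "rb") as source:
        params = source.getparams()
        frame_rate = source.getframerate()
        # Clamp the start position so it never points past the end of the file.
        start_frame = min(int(start_seconds * frame_rate), source.getnframes())
        source.setpos(start_frame)
        # readframes returns fewer frames if the file ends early.
        frames = source.readframes(int(duration_seconds * frame_rate))
    with wave.open(clip_path, "wb") as clip:
        # Reuse the channel count, sample width, and frame rate of the source;
        # the frame count in the header is corrected when the file is closed.
        clip.setparams(params)
        clip.writeframes(frames)
```

For example, `extract_clip("interview.wav", "interview_test.wav", 600, 240)` would save a four-minute selection beginning ten minutes into the recording, which could then be uploaded as the test file.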
Once the subscriber identifies the adjustments that produce the best transcription, these adjustments can be applied to the whole recording. This whole recording can then be uploaded for transcription.
While our partner advertises an error rate of 6% based on internal testing with recorded newscasts, some may wish to pursue options with fewer errors. In this case, high-quality human transcription might provide somewhat lower error rates (most services advertise error rates of 1% to 2%). However, we should note that such human transcription services are often incompatible with the confidentiality requirements for human subjects research established by the Institutional Review Boards of most academic and research institutions. Human transcription allows unaffiliated personnel without human subjects research training direct and unmonitored access to participant raw data.
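Subscribers who want to estimate the error rate on their own test file can compare the automated transcript with a manually corrected version. The sketch below computes word error rate, a common measure defined as the number of word substitutions, deletions, and insertions divided by the number of words in the corrected reference. We do not know whether the advertised figures above are calculated in exactly this way, so treat the result as a rough guide only. The function name is our own.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts, divided by the
    number of words in the reference (the manually corrected transcript)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


corrected = "the quick brown fox jumps over the lazy dog"
automated = "the quick brown box jumps over lazy dog"
# One substituted word and one missing word against nine reference words.
print(round(word_error_rate(corrected, automated), 3))
```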
Transcript annotation: We proceed from a phenomenological attitude, in which preserving the voice of the speaker or writer without interpretation in the raw data is paramount (interpretation instead occurs during the analysis). Therefore, annotations about non-spoken or behavioral aspects of the interview are not automatically included in the transcript. Our partner does, however, insert an annotation into the transcript in cases where verbal fillers (such as "um," "err," etc.) or false starts (i.e., a half-spoken syllable) are present. In these instances, the word "hesitation" is inserted into the transcript. Subscribers can eliminate this annotation from their results by exporting them to their computer, and then using the Find and Replace All function of their spreadsheet program to replace every instance with a space, " ", all at once. Or, if you believe that hesitations might be meaningful to the themes in the data, they may be included as an annotation in the manner discussed in the next paragraph.
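The Find and Replace step can also be scripted. This is a minimal sketch that assumes the annotation appears in the exported text as the literal word "hesitation"; the function name is our own. Note that, like Find and Replace All, it also removes the word where a speaker genuinely said "hesitation," so review the result before re-uploading.

```python
import re


def remove_hesitations(text, marker="hesitation"):
    """Replace each whole-word occurrence of the marker with a space,
    then collapse the extra whitespace left behind."""
    pattern = re.compile(r"\b" + re.escape(marker) + r"\b", flags=re.IGNORECASE)
    without_marker = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", without_marker).strip()


print(remove_hesitations("I hesitation think we hesitation should go"))
```

Because the pattern matches whole words only, related forms such as "hesitations" are left in place.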
If annotation of the transcription text is desired, subscribers may do so by following our bracketing procedure in their spreadsheet programs while reviewing and editing their transcript (or anytime thereafter). As described in those bracketing procedures, our phenomenological perspective leads us to advocate that users insert annotations as a new column in their spreadsheet (perhaps labeled "annotation") and then assign this column as a Variable when uploading the edited transcript for analysis. In this way, each type of annotation is sortable, such that one might readily investigate whether those who hesitate—or engage in some other behavior of note—express different words or themes than those who do not (all while simultaneously preserving the original voice of the respondent).
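As an illustration of the annotation-column approach, the sketch below reads an exported transcript in CSV form, moves each "hesitation" marker out of the spoken text, and records it in a new "annotation" column. The column name "text", the sample rows, and the function name are hypothetical; adjust them to match your own export.

```python
import csv
import io
import re

MARKER = re.compile(r"\bhesitation\b", flags=re.IGNORECASE)


def annotate_hesitations(csv_text, text_column="text"):
    """Return CSV text with an added 'annotation' column that flags rows
    containing the marker, and with the marker removed from the text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames) + ["annotation"]
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        found = MARKER.search(row[text_column]) is not None
        cleaned = MARKER.sub(" ", row[text_column])
        row[text_column] = re.sub(r"\s+", " ", cleaned).strip()
        row["annotation"] = "hesitation" if found else ""
        writer.writerow(row)
    return output.getvalue()


exported = "speaker,text\nA,I hesitation think so\nB,Yes we agree\n"
print(annotate_hesitations(exported))
```

The resulting "annotation" column can then be assigned as a Variable on upload, so that rows with and without hesitations can be compared while the speaker's words remain unaltered.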
Notes.
1 In such instances, one may modify existing words (e.g., “blueish,” for something similar to blue, but not quite blue in the sense that the word already describes in the language), borrow a word from another language, create neologisms, or produce similes or metaphors that utilize words used to describe somewhat similar experiences.
2 For instance, languages from areas where horses have been found historically tend to have many words to describe horses and their various attributes. However, languages from areas without the historic presence of horses often do not contain a similarly diverse set of words to describe such attributes, if at all.