Raven's Eye | Linguistic psychological foundations

Language.

Language is a structure of meanings systematically applied to particular sounds and symbols, which are then consistently related through an organized grammar. The structure and function of language have been studied and discussed in a number of scholarly disciplines. While we recommend that our users maintain a working knowledge of the scholarly literature on language, a few of its conclusions are of particular interest to the use of Raven’s Eye. Therefore, we explain them in these Technicals.

Foremost is the intimate and influential relationship between language, cognition, and perception. For much of the 20th century, scholars debated whether our perceptions, and the types and forms of our thoughts about them, are dictated by the concepts available in a particular language. As the scholarship in this area has grown, a consensus has arisen that, in general, variation in the concepts available within a given language on a particular topic influences (but does not absolutely determine) both general variation in the perceptions and experiences that a person may have with respect to that topic, and general variation in the thoughts expressed about those experiences. In other words, while our particular language influences both the specific thoughts we have about a given experience and the manner in which we express these thoughts, the particularities of our language do not preclude us from experiencing aspects of the topic for which we have no words¹.

Lexical Hypothesis.

The lexical hypothesis proposes that languages will contain words for objects, events, and ideas that are common to the experiences of their respective speakers. It further proposes a positive relation between the commonality, centrality, or frequency of an experience to the speakers of a language, and the number of extant words available to describe various aspects of that experience². In this way, the lexical hypothesis proposes that the words and concepts that comprise a given language will be influenced by the everyday experiences and environmental contingencies of those who speak that language.

In psychology, the lexical hypothesis has been utilized to compare the experiences of people within a given language group, and to compare such individually varying traits as cognitive tendencies, motivation, and personality. When applying the lexical hypothesis to individuals, it is the commonality, centrality, or frequency of a particular individual’s experiences or traits, compared to those of the group of language speakers, that influences general variation in the type and frequency of the particular words expressed. In this way, the lexical hypothesis is extended to propose that variation in the frequency of word usage within a given group of language speakers reflects individual variation in both experiences and psychological traits.

The lexical hypothesis can, therefore, be utilized to identify the relative variation in word or concept use, according to groups, individuals, and experiences.

Language corpora.

A corpus is a collection of written documents; the plural is corpora. A language corpus is an aggregated body of work gathered together to facilitate the identification of frequently used words and their forms. Lists or tables of words, parts of speech, and other linguistic features are typically derived from such corpora, and are often organized according to their frequency of appearance in the corpus at hand.
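As a rough illustration of how a frequency list might be derived from a corpus, the following Python sketch counts word occurrences across a small collection of documents. The toy documents, the `tokenize` helper, and its simple letters-only tokenization are assumptions made for illustration; they do not describe how the Raven's Eye corpora were actually built.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text.lower())

def frequency_list(documents):
    """Return (word, count) pairs for the whole corpus, most frequent first."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts.most_common()

# A toy corpus of three hypothetical documents.
corpus = [
    "The horse ran across the field.",
    "A horse and a rider crossed the river.",
    "The river was wide.",
]

freq = frequency_list(corpus)
unique_words = len(freq)                          # number of distinct words
total_words = sum(count for _, count in freq)     # number of word occurrences
```

In this sketch, `unique_words` and `total_words` are analogous to the unique-word and total-word counts reported for each corpus.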

While language corpora serve many functions, in Raven’s Eye, they serve primarily as a background pool of words against which to compare an acquired natural language sample. When combined with the lexical hypothesis, these corpora facilitate the identification of words and themes that are relatively essential to your acquired natural language sample (and, as an extension, to your study).
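The idea of comparing a sample against a background pool of words can be sketched with a simple smoothed frequency ratio. This is one common way to make such a comparison, not necessarily the method Raven's Eye uses; the function name `relative_prominence`, the add-one smoothing, and the word counts below are all hypothetical.

```python
from collections import Counter

def relative_prominence(sample_counts, background_counts, smoothing=1.0):
    """Score each sample word by how much more frequent it is in the sample
    than in the background corpus (ratio of smoothed relative frequencies)."""
    sample_total = sum(sample_counts.values())
    background_total = sum(background_counts.values())
    vocab_size = len(set(sample_counts) | set(background_counts))
    scores = {}
    for word, count in sample_counts.items():
        sample_rate = (count + smoothing) / (sample_total + smoothing * vocab_size)
        background_rate = (background_counts.get(word, 0) + smoothing) / (
            background_total + smoothing * vocab_size
        )
        scores[word] = sample_rate / background_rate
    # Highest ratio first: words most characteristic of the sample.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical counts: a background corpus and a small interview sample.
background = Counter({"the": 5000, "and": 3000, "work": 40, "stress": 5})
sample = Counter({"the": 50, "and": 30, "work": 12, "stress": 8})

ranked = relative_prominence(sample, background)
```

With these made-up counts, "stress" and "work" rank above "the" and "and": function words are about as common in the sample as in the background, while the topical words are far more common in the sample than the background would predict.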

Currently, Raven's Eye maintains corpora in 65 languages. These include:

Language        Unique Words    Total Words
Afrikaans            430,000     25,020,000
Amharic               90,000      1,450,000
Arabic             2,140,000    215,430,000
Assamese             170,000      5,000,000
Azerbaijani          770,000     40,250,000
Bengali              610,000     29,100,000
Bulgarian          1,370,000    146,950,000
Burmese              420,000      6,540,000
Croatian           1,270,000     75,360,000
Czech              2,260,000    228,360,000
Dutch              3,730,000    545,070,000
English           16,780,000  4,812,300,000
Farsi              2,010,000    188,060,000
Finnish            2,600,000    155,040,000
French             6,030,000  1,436,790,000
German             9,670,000  1,441,420,000
Greek              1,130,000     85,860,000
Gujarati             230,000      8,910,000
Hebrew             1,610,000    183,610,000
Hindi                630,000     41,810,000
Hungarian          2,910,000    246,590,000
Icelandic            300,000     12,800,000
Indonesian         1,540,000    140,250,000
Irish                160,000      6,950,000
Italian            4,150,000    934,100,000
Javanese             290,000     11,530,000
Kannada              640,000     17,350,000
Kazakh               880,000     54,570,000
Korean             3,250,000    145,340,000
Kurdish              180,000      4,850,000
Luxembourgish        370,000     14,150,000
Malagasy             470,000     14,290,000
Malay                820,000     74,620,000
Malayalam            800,000     20,510,000
Marathi              340,000     12,880,000
Marathi              310,000     12,880,000
Nepali               320,000     13,660,000
Norwegian            660,000     38,690,000
Oriya                190,000      5,930,000
Polish             3,780,000    482,010,000
Portuguese         2,980,000    503,400,000
Punjabi              240,000      9,540,000
Pushto               140,000      3,800,000
Romanian           1,610,000    147,670,000
Russian            7,900,000    999,930,000
Sindhi                90,000      2,380,000
Slovenian          1,070,000     72,380,000
Spanish            4,620,000  1,021,060,000
Sundanese            180,000      8,210,000
Swahili              190,000      7,650,000
Swedish            4,980,000    874,690,000
Tagalog              290,000     13,090,000
Tajik                260,000     14,530,000
Tamil                880,000     32,090,000
Tatar                600,000     24,940,000
Telugu               780,000     30,610,000
Thai               1,280,000     37,700,000
Tibetan               60,000        440,000
Turkish              190,000      3,890,000
Ukrainian          3,890,000    353,510,000
Urdu                 590,000     38,440,000
Uzbek                380,000     25,020,000
Vietnamese         1,870,000    289,930,000
Yiddish               90,000      3,680,000
Yoruba               100,000      3,240,000

Automatic Speech-to-text transcription.

Raven's Eye pairs with a world-class leader in artificial intelligence to provide automated speech-to-text transcription in 9 different languages. Our automatic transcription services are available for the following languages:

Arabic (Modern Standard)
English (UK and US)
French
German
Japanese
Korean
Mandarin
Portuguese (Brazilian)
Spanish

Automated transcription involves probabilistic matching between the patterns of sound in the audio file and samples of sound patterns associated with specific words in the language selected as your current corpus. This is essentially the same process used by voice-operated smart home devices and predictive texting technologies.

Since our business is identifying themes in language, and not transcription itself, we partner with a world-class artificial intelligence leader to produce your transcript. As a result, you can rest assured that you are receiving state-of-the-art quality in your transcription. However, the state of the art is not yet infallible, and correct transcription depends on facets of the recording beyond the control of our partner and ourselves (including the quality of the recording, slurred speech or poor enunciation, strong accents, loud background noise, etc.). We recommend that subscribers always download, review, edit, and re-upload their initial transcripts before making scientific claims or substantial business decisions based on their results.

Transcription testing and review: Because each subscriber's transcript is derived from unique circumstances, the results of their individual transcription may vary substantially from those acquired by others. To help users get the best results, each new subscription comes with 100 free minutes of transcription (even though we still incur a cost for providing them). This allows subscribers to experiment with our partner's transcription services without incurring initial additional expense.

Prior to uploading a large audio file for transcription, we recommend that subscribers use their audio editing software to isolate a 3-to-5-minute selection of their recording, and then save this as a separate audio file for use in transcription testing. After uploading and transcribing this test file, subscribers can then export their results and troubleshoot any difficulties by adjusting the recording itself (such as by reducing background noise, adjusting the volume and reverberation, or other means determined necessary by the subscriber). This newly adjusted audio file can then be uploaded again for further testing, and this process can be repeated until the best transcription is acquired.

Once the subscriber identifies the adjustments that produce the best transcription, these adjustments can be applied to the whole recording. This whole recording can then be uploaded for transcription.
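For subscribers who prefer a script to an audio editor, the first step of the testing procedure above (isolating a short selection of the recording) might be sketched as follows. This is a minimal sketch that assumes an uncompressed WAV file and uses only Python's standard `wave` module; the `extract_clip` helper and the file names are hypothetical, and other formats such as MP3 would need different tools.

```python
import os
import tempfile
import wave

def extract_clip(source_path, clip_path, start_seconds, duration_seconds):
    """Copy a time range from an uncompressed WAV file into a new WAV file."""
    with wave.open(source_path, "rb") as source:
        params = source.getparams()
        frame_rate = source.getframerate()
        # Clamp the start so it never points past the end of the recording.
        start_frame = min(int(start_seconds * frame_rate), source.getnframes())
        source.setpos(start_frame)
        frames = source.readframes(int(duration_seconds * frame_rate))
    with wave.open(clip_path, "wb") as clip:
        clip.setparams(params)
        clip.writeframes(frames)  # the frame count in the header is corrected on write
    return clip_path

# Demonstration on a synthetic 10-second silent mono recording.
workdir = tempfile.mkdtemp()
full_path = os.path.join(workdir, "interview.wav")
with wave.open(full_path, "wb") as full:
    full.setnchannels(1)
    full.setsampwidth(2)       # 16-bit samples
    full.setframerate(8000)
    full.writeframes(b"\x00\x00" * 8000 * 10)

test_path = extract_clip(full_path, os.path.join(workdir, "test_clip.wav"), 2, 4)
with wave.open(test_path, "rb") as clip:
    clip_seconds = clip.getnframes() / clip.getframerate()
```

On a real recording, a call such as `extract_clip("interview.wav", "test_clip.wav", 60, 240)` would save a four-minute selection beginning one minute into the file.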

While our partner advertises an error rate of 6% based on internal testing with recorded newscasts, some may wish to pursue options with fewer errors. In this case, high-quality human transcription might provide somewhat lower error rates (most advertise 1% - 2% error rates). However, we should note that such human transcription services are often incompatible with the confidentiality requirements for human subjects research established by the Institutional Review Boards of most academic and research institutions. Human transcription allows unaffiliated personnel without human subjects research training direct and unmonitored access to participant raw data.
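To gauge how an automatic transcript of a test file compares with a manually corrected version, one option is to compute a word error rate. The sketch below assumes the usual definition (word-level substitutions, insertions, and deletions divided by the number of words in the corrected transcript); the source does not state how the advertised error rates above are measured, so the results are not directly comparable to them. The function name and example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of words in the reference transcript."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1] / len(ref)

# Hypothetical example: one substituted word and one inserted word.
corrected = "we walked to the river and sat down"
automatic = "we walk to the river and and sat down"
rate = word_error_rate(corrected, automatic)
```

Here the automatic transcript differs from the eight-word corrected transcript by two word-level edits, so `rate` is 0.25.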

Transcript annotation: We proceed from a phenomenological attitude, in which preserving the voice of the speaker or writer without interpretation in the raw data is paramount (interpretation instead occurs during the analysis). Therefore, annotations about non-spoken or behavioral aspects of the interview are not automatically included in the transcript. Our partner does, however, insert an annotation into the transcript in cases where verbal fillers (such as "um," "err," etc.) or false starts (i.e., a half-spoken syllable) are present. In these instances, the word "hesitation" is inserted into the transcript. Subscribers can eliminate these annotations from their results by exporting them to their computer, and then using the Find and Replace All function of their spreadsheet program to replace every instance with a space, " ", all at once. Or, if you believe that hesitations might be meaningful to the themes in the data, they may be included as annotations in the manner discussed in the next paragraph.

If annotation of the transcription text is desired, subscribers may do so by following our bracketing procedure in their spreadsheet programs while reviewing and editing their transcript (or anytime thereafter). As described in those bracketing procedures, our phenomenological perspective leads us to advocate that users insert annotations as a new column in their spreadsheet (perhaps labeled "annotation") and then assign this column as a Variable when uploading the edited transcript for analysis. In this way, each type of annotation is sortable, such that one might readily investigate whether those who hesitate—or engage in some other behavior of note—express different words or themes than those who do not (all while simultaneously preserving the original voice of the respondent).
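As an alternative to doing this by hand in a spreadsheet, the two options described above (removing the "hesitation" markers, or recording them in a separate annotation column) could be sketched in a short script. The column names `speaker` and `text`, the one-row-per-turn layout, and the `annotate_hesitations` helper are assumptions for illustration; the actual layout of an exported transcript may differ.

```python
import csv
import io
import re

HESITATION = re.compile(r"\bhesitation\b", flags=re.IGNORECASE)

def annotate_hesitations(rows, text_column="text"):
    """Move "hesitation" markers out of the transcript text and into a
    separate "annotation" column, leaving the other columns untouched."""
    annotated = []
    for row in rows:
        text = row[text_column]
        found = HESITATION.search(text) is not None
        # Replace each marker with a space, then collapse repeated spaces.
        cleaned = re.sub(r"\s+", " ", HESITATION.sub(" ", text)).strip()
        annotated.append({**row, text_column: cleaned,
                          "annotation": "hesitation" if found else ""})
    return annotated

# A hypothetical exported transcript with one row per speaking turn.
exported = io.StringIO(
    "speaker,text\n"
    "A,I think hesitation we should start over\n"
    "B,That sounds fine to me\n"
)
rows = annotate_hesitations(list(csv.DictReader(exported)))

# Write the edited transcript back out as CSV, ready for review and re-upload.
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=["speaker", "text", "annotation"])
writer.writeheader()
writer.writerows(rows)
```

Like Find and Replace All, this pattern cannot tell an inserted marker from a speaker who actually said the word "hesitation," so the result should still be reviewed before it is uploaded.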

Notes.

¹ In such instances, one may modify existing words (e.g., “blueish,” for something similar to blue, but not quite blue in the sense that the word already describes in the language), borrow a word from another language, create neologisms, or produce similes or metaphors that utilize words used to describe somewhat similar experiences.

² For instance, languages from areas where horses have been found historically tend to have many words to describe horses and their various attributes. However, languages from areas without the historic presence of horses often do not contain a similarly diverse set of words to describe such attributes, if they contain any at all.
