In 2007 Google (the well-known lunar landings provider who did the search engine thingy) introduced a free directory inquiry service in the US called GOOG-411. Your call was digitally parsed by the ‘robot operator’, which then offered to connect you to its top results. It wasn’t clear what Google was getting out of providing this generous free service, which it even promoted on billboards.
Three years after its introduction the service was suddenly dropped. Google had already released its search-by-voice service on Android, and so the penny… dropped. GOOG-411, as Google has admitted, had been a covert phoneme-gathering operation intended to build a huge database to improve voice recognition technology for Google’s search products.
Google had amassed thousands of hours of requests for plumbers and pizza delivery, and for connections to confusingly named places like Schenectady, spoken in every accent from every state of the US. The free GOOG-411 service developed the technology and techniques behind the speech recognition software that is now amassing a vast repository of spoken words in every language on earth, improving itself in a perfect feedback learning loop every time a user corrects a faulty transcription.
Anna Barham’s video “The Squid That Hid” outlines the difficulties speech presents to speech recognition software, from accent, pronunciation and articulation to background noise. The big problem is that spoken words simply run on from each other. It’s hard for humans too: speech lacks the visual punctuation of writing, and it can be hard or impossible to parse the string of syllables into words, and the words into sentences. To the untrained ear Polish sounds like English recorded to tape and played backwards, Yiddish sounds like someone cheating at Scrabble, and English sounds like a sarcastic Swede reading words at random from a car manual (see also “Prisencolinensinainciusol”).
In the first line of Finnegans Wake we find “past Eve and Adam’s”, which can also be read “Pa, Stephen: Adams”: a reading that deliberately equates Joyce’s father John Stanislaus, and Stephen Dedalus, his fictional portrait of the artist as a young man and archetypal Son, with the Bible’s archetypal Father figure, Adam.
It all begins with this passage from Image Machine by Bridget Crone (2013). Anna Barham used it as the starting point for the film Double Screen (not quite tonight jellylike), which presents variations on Crone’s text as reworked and mangled by voice recognition software. I say “it all begins”, but Crone’s text is itself a response to Amanda Beech’s Final Machine. I daren’t look into whether this in turn derives from something else, for fear we’ll end up in some bottomless pit of recursion and influence.
Barham next fed the text into Penetrating Squid, an ongoing novel whose third chapter forms the basis of her week at fig-2. The text was generated in live reading groups where readers took it in turns to read a text into transcription software; Barham has apparently generated over a hundred versions of Crone’s text this way. She then read short sections over and over again through the software to produce more radical disruptions, and the three chapters of Penetrating Squid, which can be heard on SoundCloud.
Crone’s text starts with a description of cleaning a squid, and as bits fall onto newspaper the words distort; the text itself distorts and falls into associative chains of sounds and images. In the original we find “Tight pieces of sinewy flesh inside the squid try to hold onto this gooey mess”, one of the short phrases whose variations form the bulk of the text Barham uses. The variations start off recognizably (“tried to hold onto the screen pieces of silver reflections cybersquaring trying to hold onto the screen”), get further (“trying to hold onto the discreet/discrete maths inside the square”) and further out (“listening in the pool pieces of seemingly flesh inside”).
In the ICA studio space for Week 30 of fig-2, Barham set up a microphone plugged into a Mac running OS X dictation software, with a printer and a screen displaying the text as visitors read it. Visitors to the space were encouraged to read from the printouts they found, producing new printouts for the next visitors. As you’d expect, over the course of a week the text came to bear little resemblance to chapter three of Penetrating Squid.
Even over the course of successive readers the changes are considerable. In four steps we find, in no particular order, “Hello all this time”, “Hello I’m Harry”, “Okay hello and hurry”, “Okay hello unhurried”. It’s the old game of Chinese Whispers in electronic form.
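The electronic Chinese Whispers can be sketched as a tiny simulation: each ‘reader’ passes the phrase through a lossy substitution table that stands in for the recogniser’s mishearings. The table and phrases below are invented for illustration, loosely echoing the “Hello I’m Harry” chain above; a real recogniser’s errors are of course not a fixed lookup.

```python
import random

# Invented mishearing table standing in for the recogniser's near-homophones;
# the first alternative is always the word itself (a "correct" hearing).
MISHEARINGS = {
    "hello": ["hello", "okay hello"],
    "i'm": ["i'm", "and"],
    "harry": ["harry", "hurry", "hurried"],
}

def transcribe(phrase, choose):
    """One lossy pass: each word may come back as a near-homophone.
    `choose` picks one alternative (e.g. random.choice)."""
    return " ".join(choose(MISHEARINGS.get(word, [word]))
                    for word in phrase.split())

# Four successive 'readers', each transcribing the previous reader's output.
phrase = "hello i'm harry"
for reader in range(4):
    phrase = transcribe(phrase, random.choice)
    print(reader + 1, phrase)
```

Because each pass transcribes the previous pass’s output rather than the original, small mishearings compound, which is exactly why four steps are enough to get from “Hello I’m Harry” to something like “Okay hello unhurried”.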
The OS X dictation software has real-time correction routines that try to identify the meaning of what is being said and retrospectively correct it: for example, working out whether you were ‘being discreet’ or rather talking about ‘discrete forms of meaning’.
Intriguingly, this illustrates aspects of Wittgenstein’s theory of language: we create meaning not so much through the relation of individual words to the things we associate with them as through the relations of the words between themselves. The ‘good’ of the noun ‘the Good’ stands for something different from the ‘good’ of ‘good game’. Going on, Wittgenstein challenges us to come up with a definition of the word “game”. We can’t agree on one, yet we all know what it means in use. Meaning is use. This is the principle that Google Translate and OS X dictation exploit: context.
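A minimal sketch of that contextual principle, assuming a toy table of co-occurrence counts (the counts and phrases are made up for illustration; real dictation engines use vastly larger statistical language models):

```python
# Toy context-based homophone disambiguation, in the spirit of the
# n-gram language models behind dictation software's corrections.
# All bigram counts here are invented for illustration.
BIGRAM_COUNTS = {
    ("being", "discreet"): 30,
    ("discreet", "about"): 25,
    ("discrete", "forms"): 40,
    ("discrete", "maths"): 55,
}

def score(prev_word, candidate, next_word):
    """Score a candidate by how often it co-occurs with its neighbours."""
    return (BIGRAM_COUNTS.get((prev_word, candidate), 0)
            + BIGRAM_COUNTS.get((candidate, next_word), 0))

def disambiguate(prev_word, candidates, next_word):
    """Pick the candidate homophone with the highest context score."""
    return max(candidates, key=lambda c: score(prev_word, c, next_word))

print(disambiguate("being", ["discreet", "discrete"], "about"))  # discreet
print(disambiguate("the", ["discreet", "discrete"], "forms"))    # discrete
```

The sounds are identical; only the neighbouring words decide which spelling ‘means’ here, which is meaning-as-use reduced to arithmetic.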
It is awesomely powerful, but incomplete. While the machine understands, to an extent, meaning as generated by use, there’s still a step missing, perhaps even from Wittgenstein’s theory: something that would explain why we can’t agree on a definition of our game yet still know what it means. It’s a cognitive next step that people working on voice recognition software are struggling with, entering the realm of Artificial Intelligence in search of the breakthrough.
Even with such clever tech, and with the wealth of phoneme data gathered in exercises such as GOOG-411, it is still remarkable how hard it is for machines to transcribe speech, as Anna Barham’s work amusingly demonstrates. Never ask a robot to sell you fork handles.
In Week 29 of fig-2 I said that “You are an internet” and imagined inhabiting posthuman cyberspace, having transcended physical form. In an act of direct regression, this week I have experimented with subverting this in real time, to explore who is The Best: machines or humans? So please put your hands together for my experiment with manually performing voice recognition transcription. You might think the transcriptions of software are laughable, but wait until you see mine.
I typed out all seven minutes of “Penetrating Squid / Chapter 3 / Seemingly Fleshed Inside” from SoundCloud, first pausing the audio to type, and then typing straight through, trying to keep up as best I could. Both attempts are viewable in this Google Doc.
In the first pass, which took about forty-five minutes, I couldn’t decide between certain homophones (discreet/discrete, you’re/your, onto/on too), a task made harder by the lack of conventional running sense. My ears are pretty good, but I wasn’t sure if I heard lightly or likely. I typed silly instead of city.
The second pass, in real time, was of course a trainwreck. Certain omissions and conflations occur near the start and everything is misspelled; it gets worse as I miss more, and at some point I knock CAPS LOCK on without realizing. By this point words have bled into each other, half formed and in the wrong order, the text obliterated and repetitious. Afterwards I enjoyed finding an example of spontaneous creative accident: a Joyce-style portmanteau word, QWEAKNESS. At a couple of points I froze completely; I remember typing the letter ‘i’ about six times in a row, utterly defeated.
The texts created from these accidents can be beautiful and poetic. Is Anna Barham a poet? This is a kind of suggestive poetry, certainly if the meaningless syllables of Dada poetry can be said to be poetry. The poetry makes kinds of sense because each word has a meaning, and new meanings are created and found by the strange aleatory juxtapositions of the words. A clash of meanings is set up where there was no conscious intention; meaning is created anew by use and association, which brings us back to Wittgenstein’s notion of meaning as use.
Random associations and meanings can also occur in the physical dimension, or our perception thereof. Whenever I see or think about Anna Barham’s (amazing) anagrammatic Twitter handle “Banana_Harm” I have the sensation that I can smell foam banana sweets. For Mmm: by Anna’s tweaks.