Google Rolls Out Optical Character Recognition in over 200 Languages

Improvement in Optical Character Recognition (OCR) technology is one of Google’s lesser-known projects, at least to lay consumers. In reality, many of us have been using OCR for years without knowing what it actually is.

OCR is the technology that enables Google to digitize text captured in image format and make it legible from the computer’s perspective. So if you’ve ever uploaded a scanned PDF or other image file to Drive, then asked Drive to “Open with – Google Docs,” Google employs OCR, opening a new version of the document that displays the original image followed by the extracted text.

The big news today is that OCR has now been rolled out to over 200 languages and 25 writing systems, which is pretty dang awesome. Even if at the end of the day, Google is a company that harvests our data to sell to third parties in their quest to not be evil™, and even if OCR supports that mission, this is the sort of altruistic endeavor that gets little notice but deserves much.

And because I’m feeling saucy, I’ve provided a complete list of the supported languages below. You’re welcome.

Acehnese, Acholi, Adangme, Afrikaans, Akan, Albanian, Algonquian, Amharic, Ancient Greek, Arabic (Modern Standard), Araucanian/Mapuche, Armenian, Assamese, Asturian, Athabaskan, Aymara, Azerbaijani, Azerbaijani (Cyrillic; old orthography), Balinese, Bambara, Bantu, Bashkir, Basque, Batak, Belorussian, Bemba, Bengali, Bikol, Bislama, Bosnian, Breton, Bulgarian, Burmese, Catalan, Cebuano, Chechen, Cherokee, Chinese (Mandarin; Hong Kong), Chinese (Simplified; Mandarin), Chinese (Traditional; Mandarin), Choctaw, Chuvash, Cree, Creek, Crimean Tatar, Croatian, Czech, Dakota, Danish, Dhivehi, Duala, Dutch, Dzongkha, Efik, English (American), English (British), Esperanto, Estonian, Ewe, Faroese, Fijian, Filipino, Finnish, Fon, French (Canadian), French (European), Fulah, Ga, Galician, Ganda, Gayo, Georgian, German, Gilbertese, Gothic, Greek, Guarani, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Herero, Hiligaynon, Hindi, Hungarian, Iban, Icelandic, Igbo, Iloko, Indonesian, Irish, Italian, Japanese, Javanese, Kabyle, Kachin, Kalaallisut, Kamba, Kannada, Kanuri, Kara-Kalpak, Kazakh, Khasi, Khmer, Kikuyu, Kinyarwanda, Kirghiz, Komi, Kongo, Korean, Kosraean, Kuanyama, Lao, Latin, Latvian, Lingala, Lithuanian, Low German, Lozi, Luba-Katanga, Luo, Macedonian, Madurese, Malagasy, Malay, Malayalam, Maltese, Mandingo, Manx, Maori, Marathi, Marshallese, Mende, Middle English, Middle High German, Minangkabau, Mohawk, Mongo, Mongolian, Nahuatl, Navajo, Ndonga, Nepali, Niuean, North Ndebele, Northern Sotho, Norwegian (Bokmål), Nyanja, Nyankole, Nyasa Tonga, Nzima, Occitan, Ojibwa, Old English, Old French, Old High German, Old Norse, Old Provencal, Oriya, Ossetic, Pampanga, Pangasinan, Papiamento, Pashto, Persian, Polish, Portuguese (Brazilian), Portuguese (European), Punjabi (Gurmukhi), Quechua, Romanian, Romansh, Romany, Rundi, Russian, Russian (Old Orthography), Sakha, Samoan, Sango, Sanskrit, Scots, Scottish Gaelic, Serbian (Cyrillic), Serbian (Latin), Shona, Sinhala, Slovak, Slovenian, Songhai, Southern Sotho, Spanish (European), Spanish (Latin American), Sundanese, Swahili, Swati, Swedish, Tahitian, Tajik, Tamil, Tatar, Telugu, Temne, Thai, Tibetan, Tigrinya, Tongan, Tsonga, Tswana, Turkish, Turkmen, Udmurt, Ukrainian, Urdu, Uzbek, Uzbek (Cyrillic; old orthography), Venda, Vietnamese, Votic, Welsh, Western Frisian, Wolof, Xhosa, Yiddish, Yoruba, Zapotec, and Zulu.

The technical side of this is beyond my pay grade, but if you want to learn more, check out the link below and your dreams will be filled with Hidden Markov Models (HMMs) and Python code.

All in all, the ability to convert what is effectively “background noise,” as Google describes it, into text that a computer can recognize is hugely useful, especially as the latest language rollout reaches more developing countries.

Also, Old High German and Old Norse are supported, as well as Old English. Maybe it’ll turn out we had Beowulf wrong all along.

The update works on the desktop and mobile app versions of Drive.

Source: Google Research Blog


About the Author: Geoff Openshaw

Geoff has been an Android enthusiast for many years, starting with the original Droid, which he purchased solely because he did not yet believe in the efficiency of virtual keyboards. His current phone is a Nexus 5, though he admits the camera is wanting. He works for an international development firm, handling humanitarian projects overseas. Aside from his writings at Talk Android, he also edits the BLOCK magazine of the Missouri Star Quilt Co., owns and runs a podcast network, and occasionally writes and/or performs music. He's an avid Angels and FC Barcelona fan and pretends he knows how to cook, but his wife will say differently. He is from Southern California and resides in Washington, DC.