Improvement in Optical Character Recognition (OCR) technology is one of Google’s lesser-known projects, at least to lay consumers. In reality, many of us have been using OCR for years without knowing what it actually is.
OCR is the technology that enables Google to digitize text captured in image format and make it legible from the computer’s perspective. So if you’ve ever uploaded a scanned PDF or other image file to Drive, then asked Drive to “Open with – Google Docs,” you’ve seen OCR at work: Google opens a new version of the document that displays the original image followed by the extracted text.
The big news today is that OCR has now been rolled out to over 200 languages and 25 writing systems, which is pretty dang awesome. Even if, at the end of the day, Google is a company that harvests our data to sell to third parties in its quest to not be evil™, and even if OCR supports that mission, this is the sort of altruistic endeavor that gets little notice but deserves much.
And because I’m feeling saucy, I’ve provided a complete list of the supported languages below. You’re welcome.
Acehnese, Acholi, Adangme, Afrikaans, Akan, Albanian, Algonquinian, Amharic, Ancient Greek, Arabic (Modern Standard), Araucanian/Mapuche, Armenian, Assamese, Asturian, Athabaskan, Aymara, Azerbaijani, Azerbaijani (Cyrillic; old orthography), Balinese, Bambara, Bantu, Bashkir, Basque, Batak, Belorussian, Bemba, Bengali, Bikol, Bislama, Bosnian, Breton, Bulgarian, Burmese, Catalan, Cebuano, Chechen, Cherokee, Chinese (Mandarin; Hong Kong), Chinese (Simplified; Mandarin), Chinese (Traditional; Mandarin), Choctaw, Chuvash, Cree, Creek, Crimean Tatar, Croatian, Czech, Dakota, Danish, Dhivehi, Duala, Dutch, Dzongkha, Efik, English (American), English (British), Esperanto, Estonian, Ewe, Faroese, Fijian, Filipino, Finnish, Fon, French (Canadian), French (European), Fulah, Ga, Galician, Ganda, Gayo, Georgian, German, Gilbertese, Gothic, Greek, Guarani, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Herero, Hiligaynon, Hindi, Hungarian, Iban, Icelandic, Igbo, Iloko, Indonesian, Irish, Italian, Japanese, Javanese, Kabyle, Kachin, Kalaallisut, Kamba, Kannada, Kanuri, Kara-Kalpak, Kazakh, Khasi, Khmer, Kikuyu, Kinyarwanda, Kirghiz, Komi, Kongo, Korean, Kosraean, Kuanyama, Lao, Latin, Latvian, Lingala, Lithuanian, Low German, Lozi, Luba-Katanga, Luo, Macedonian, Madurese, Malagasy, Malay, Malayalam, Maltese, Mandingo, Manx, Maori, Marathi, Marshallese, Mende, Middle English, Middle High German, Minangkabau, Mohawk, Mongo, Mongolian, Nahuatl, Navajo, Ndonga, Nepali, Niuean, North Ndebele, Northern Sotho, Norwegian (Bokmål), Nyanja, Nyankole, Nyasa Tonga, Nzima, Occitan, Ojibwa, Old English, Old French, Old High German, Old Norse, Old Provencal, Oriya, Ossetic, Pampanga, Pangasinan, Papiamento, Pashto, Persian, Polish, Portuguese (Brazilian), Portuguese (European), Punjabi (Gurmukhi), Quechua, Romanian, Romansh, Romany, Rundi, Russian, Russian (Old Orthography), Sakha, Samoan, Sango, Sanskrit, Scots, Scottish Gaelic, Serbian (Cyrillic), Serbian (Latin), Shona, Sinhala, Slovak, Slovenian, Songhai, Southern Sotho, Spanish (European), Spanish (Latin American), Sundanese, Swahili, Swati, Swedish, Tahitian, Tajik, Tamil, Tatar, Telugu, Temne, Thai, Tibetan, Tigrinya, Tongan, Tsonga, Tswana, Turkish, Turkmen, Udmurt, Ukrainian, Urdu, Uzbek, Uzbek (Cyrillic; old orthography), Venda, Vietnamese, Votic, Welsh, Western Frisian, Wolof, Xhosa, Yiddish, Yoruba, Zapotec, and Zulu.
The technical side of this is beyond my pay grade, but if you want to learn more, check out the link below and your dreams will be filled with Hidden Markov Models (HMMs) and Python code.
All in all, the ability to convert what is effectively “background noise,” as Google describes it, to textual content that’s recognized by a computer is hugely useful, especially as the latest language rollout supports more developing countries.
Also, Old High German and Old Norse are supported, as well as Old English. Maybe it’ll turn out we had Beowulf wrong all along.
The update works on the desktop and mobile app versions of Drive.
Source: Google Research Blog