Three-letter codes for identifying languages
- Relation to standards
- Changes to the code set
- Structure of the code tables
- Using the code tables
- Giving feedback
- Downloading the code tables
One feature of the Ethnologue since its inception as a database in 1971 has been a system of three-letter codes for uniquely identifying languages. These became part of the publication in 1984. In the interest of fostering the uniform identification of all the world's languages in information systems, beginning with the 14th edition (2000), SIL International has released the complete set of three-letter codes (plus indexing information involving countries and alternate names) as downloadable data tables that the public may incorporate into their own database applications and dynamic web sites. Examples of efforts that are already using these codes as a standard for language identification are the Open Language Archives Community and its participating archives.
Any application that makes use of these language identifiers is just one click away from access to the full language descriptions that are available in the Ethnologue. That is, for any language identifier [abc] that may be stored in a database, an application may present a link to the following URL in order to give the user access to the Ethnologue's description of that language:
This 15th edition of the Ethnologue marks an important milestone in the development of the language identifiers, namely, their emergence as part of the draft international standard, ISO/DIS 639-3. (See History of the Ethnologue in the “Introduction to the Printed Volume” for a fuller discussion of the history of the language identifiers.) ISO 639-3 was devised to enable the uniform identification of all known human languages in a wide range of applications, particularly information systems. It provides as complete an enumeration of languages as possible, including living, extinct, ancient, and constructed languages, whether major or minor. The Ethnologue does not cover this entire scope; it seeks to catalog all known living languages, languages that have gone extinct since the inception of the Ethnologue (1950), and languages that no longer have native speakers but are still in use as a second language in certain communities. Ancient, historical, and constructed languages that fall outside this scope are documented by Linguist List.
The most widely used standard for identifying languages in Internet
documents (such as in HTTP headers or HTML metadata or in the XML lang
attribute) is RFC 4646 (formerly RFC 3066). In
that standard, a three-letter identifier is interpreted as being a code from
the ISO 639-2 standard.
RFC 4646 offers an extension mechanism of tags beginning with x- to
handle custom codes for languages not covered in the standard. With the 14th
edition of the Ethnologue, we recommended that an RFC 4646 compliant
language tag be formed from an SIL three-letter language identifier as follows:
x-sil-abc. The situation is now different since the identifiers
used in the Ethnologue are a subset of the codes in ISO 639-3, which in
turn includes the individual language codes of ISO 639-2 as a subset. We
anticipate that the RFC will be revised when ISO 639-3 becomes fully adopted.
In the meantime, using an ISO/DIS 639-3 code in a context where a 639-2 code
is expected will not lead to misinterpretation, since:
- If the code is found in the 639-2 code set, then it is in fact the same as that 639-2 code.
- If the code is not found in the 639-2 code set, then it could be treated as an unknown language, or the 639-3 code set could be consulted to find its denotation.
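The two cases above amount to a simple lookup order. The following is a minimal sketch rather than anything prescribed by the RFC or the ISO standards: the function name `interpret_code` and the two tiny sample dictionaries are assumptions made for illustration, and a real application would load the complete ISO 639-2 and ISO 639-3 code sets instead.

```python
from typing import Optional

# Tiny illustrative samples (assumption); the real code sets are far larger.
ISO_639_2_SAMPLE = {"eng": "English"}
ISO_639_3_SAMPLE = {"eng": "English", "aaa": "Ghotuo"}


def interpret_code(code: str) -> Optional[str]:
    """Return the language name for a three-letter code, or None if unknown."""
    # Case 1: the code is in 639-2, so it denotes the same language in 639-3.
    if code in ISO_639_2_SAMPLE:
        return ISO_639_2_SAMPLE[code]
    # Case 2: not in 639-2; consult 639-3, otherwise treat as unknown.
    return ISO_639_3_SAMPLE.get(code)


if __name__ == "__main__":
    for code in ("eng", "aaa", "zzz"):
        print(code, "->", interpret_code(code) or "unknown language")
```

An application that only knows ISO 639-2 would simply stop after the first case and report the language as unknown.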
A new edition of the Ethnologue (both in print and on the Web) is published approximately every four years. Between editions, editorial work is ongoing and the code set itself may change as our knowledge of the world's languages is refined. Between the 14th and 15th editions, a change history table was periodically released. In addition to these accumulated changes, the 15th edition involves a one-time reassignment of hundreds of codes in order to achieve alignment with the existing ISO 639-2 standard. For any sites that have used codes from the 14th edition in their own applications, complete instructions for making the update, along with a set of data tables that assist in automating the process, can be found at:
It is crucial that this update be made, since the reassignment of codes for alignment with the ISO standard means that a given three-letter code may have an entirely different meaning in the new edition. The convention formerly used by the Ethnologue was to present the codes in upper-case letters, while the ISO convention is to use lower-case letters, and this is what the 15th edition follows. Therefore, during the period of transition from old codes to new codes, it is possible to use the case distinction to distinguish between old and new codes.
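The case convention can be checked mechanically during the transition period. The sketch below is only an illustration: the function name `classify_code_by_case` and the labels `"old"` and `"new"` are assumptions, and it presumes that stored codes have kept the case in which they were originally published.

```python
def classify_code_by_case(code: str) -> str:
    """Classify a three-letter code as 'old' (upper case) or 'new' (lower case)."""
    if len(code) != 3 or not code.isascii() or not code.isalpha():
        raise ValueError(f"not a three-letter code: {code!r}")
    if code.isupper():
        return "old"  # pre-15th-edition Ethnologue convention
    if code.islower():
        return "new"  # ISO convention followed by the 15th edition
    raise ValueError(f"mixed-case code is ambiguous: {code!r}")


if __name__ == "__main__":
    for code in ("ABC", "abc"):
        print(code, "->", classify_code_by_case(code))
```

Mixed-case input is rejected rather than guessed at, since the case is the only signal available for telling the two code sets apart.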
Now that the Ethnologue is in alignment with the ISO standards, this site will no longer need to publish a change history table. Documentation on changes to the code set will be found at the ISO 639-3 site.
Three files make up the package of data tables that SIL International releases in support of the ISO 639-3 standard for language identifiers. They are tab-delimited files in which each line represents one row of a database table. The characters are encoded in the 8-bit standard known as ISO 8859-1 (which is a subset of the default Windows code page 1252).
|LanguageCodes.tab||The complete list of three-letter language identifiers used in the 15th edition of the Ethnologue (along with name, primary country, and language status).|
|CountryCodes.tab||The list of two-letter country codes that are used in the main language code table.|
|LanguageIndex.tab||An index for finding languages by country and by all known names (including primary name, alternate names, and dialect names).|
The following declarations provide the formal definitions for SQL data tables into which the tab-delimited files can be loaded:
CREATE TABLE LanguageCodes (
   LangID      char(3) NOT NULL,       -- Three-letter code
   CountryID   char(2) NOT NULL,       -- Main country where used
   LangStatus  char(1) NOT NULL,       -- L(iving), N(early extinct),
                                       -- E(xtinct), S(econd language only)
   Name        varchar(75) NOT NULL )  -- Primary name in that country
CREATE TABLE CountryCodes (
   CountryID   char(2) NOT NULL,       -- Two-letter code from ISO3166
   Name        varchar(75) NOT NULL,   -- Country name
   Area        varchar(10) NOT NULL )  -- World area
CREATE TABLE LanguageIndex (
   LangID      char(3) NOT NULL,       -- Three-letter code for language
   CountryID   char(2) NOT NULL,       -- Country where this name is used
   NameType    char(2) NOT NULL,       -- L(anguage), LA(lternate),
                                       -- D(ialect), DA(lternate)
                                       -- LP,DP (a pejorative alternate)
   Name        varchar(75) NOT NULL )  -- The name
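As a rough illustration of loading one of the files into such a table, the sketch below uses Python's standard `csv` and `sqlite3` modules. Several things here are assumptions rather than part of the release: the choice of SQLite, the function name `load_language_codes`, the file path, and the premise that the columns in the file appear in the same order as in the table declaration. The header line is skipped because the downloadable files begin with a row of column names.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

SCHEMA = """
CREATE TABLE LanguageCodes (
    LangID     char(3)     NOT NULL,
    CountryID  char(2)     NOT NULL,
    LangStatus char(1)     NOT NULL,
    Name       varchar(75) NOT NULL
)
"""


def load_language_codes(conn: sqlite3.Connection, path: Path) -> int:
    """Load a tab-delimited LanguageCodes file; return the number of rows."""
    conn.execute(SCHEMA)
    # The files are ISO 8859-1 encoded; newline="" leaves line ends to csv.
    with open(path, encoding="iso-8859-1", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader)  # skip the first line, which holds the column names
        rows = [row for row in reader if row]  # ignore blank lines
    conn.executemany("INSERT INTO LanguageCodes VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    # Build a two-row stand-in file so the sketch runs without the download.
    sample = (
        "LangID\tCountryID\tLangStatus\tName\n"
        "aaa\tNG\tL\tGhotuo\n"
        "aae\tIT\tL\tAlbanian, Arbëreshë\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "LanguageCodes.tab"
        path.write_bytes(sample.encode("iso-8859-1"))
        conn = sqlite3.connect(":memory:")
        print(load_language_codes(conn, path), "rows loaded")
```

Opening the file with the declared encoding matters for names such as "Albanian, Arbëreshë", which contain characters outside ASCII.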
LanguageCodes.tab lists the 7,299 distinct language identifiers used in the 15th edition of the Ethnologue. Of these, 360 represent extinct languages, 444 are nearly extinct, 27 are a second language only, and the remainder are listed with "living" status. (See Status in Layout of Language Entries in the “Introduction to the Printed Volume” for a fuller explanation.) The following shows the entries for the first six language identifiers:
LangID  CountryID  LangStatus  Name
------  ---------  ----------  -------------------
aaa     NG         L           Ghotuo
aab     NG         L           Alumu-Tesu
aac     PG         L           Ari
aad     PG         L           Amal
aae     IT         L           Albanian, Arbëreshë
aaf     IN         L           Aranadan
We see that aaa and aab denote living languages spoken in Nigeria, aac and aad denote living languages spoken in Papua New Guinea, and so on. When a language is spoken in more than one country, the CountryID gives the country that is considered primary, usually the country of origin or the country where most of the speakers are located.
CountryCodes.tab lists the two-letter identifier and name for 226 countries of the world. The codes are from the international standard known as ISO 3166-1 (1997. Codes for the representation of names of countries and their subdivisions--Part 1: Country codes. Geneva: International Organization for Standardization. http://www.din.de/gremien/nas/nabd/iso3166ma/). The following shows the entries for the first five codes in the list:
CountryID  Name                  Area
---------  --------------------  --------
AD         Andorra               Europe
AE         United Arab Emirates  Asia
AF         Afghanistan           Asia
AG         Antigua and Barbuda   Americas
AI         Anguilla              Americas
The CountryCodes.tab table can be used to narrow the search for an identifier to a particular country. The user would choose a country from the country list in order to select the appropriate country code. That code would then be used in a SQL query to restrict the language identifier list to just entries for that country. For instance, if the user were interested only in Afghanistan, the following SQL query would return just the table rows for that country:
SELECT * FROM LanguageCodes WHERE CountryID='AF'
Alternatively, the following link to the Ethnologue web site could be used to generate a report listing all the languages for Afghanistan:
LanguageIndex.tab documents 39,418 distinct names used for the 7,299 languages. The entries in this index of names indicate in which country each name is used. The table thus contains 52,584 records since many of the names are used in more than one country and some are used with more than one language or dialect. The following shows the entries in the name index for the first three language identifiers:
LangID  CountryID  NameType  Name
------  ---------  --------  -----------
aaa     NG         L         Ghotuo
aaa     NG         LA        Otuo
aaa     NG         LA        Otwa
aab     NG         D         Alumu
aab     NG         D         Tesu
aab     NG         DA        Arum
aab     NG         L         Alumu-Tesu
aab     NG         LA        Alumu
aab     NG         LA        Arum-Cesu
aab     NG         LA        Arum-Chessu
aab     NG         LA        Arum-Tesu
aac     PG         L         Ari
We see that aaa has two alternate names in addition to the primary name of Ghotuo; aab has four alternate names, two dialect names, and an alternate dialect name in addition to its primary name; aac has just one name.
The LanguageIndex.tab table would be used to implement a search by name. For instance, the following query returns the three-letter codes for all the languages that use the name xyz:
SELECT DISTINCT LangID FROM LanguageIndex
WHERE Name='xyz'
Note that DISTINCT is used since the same language could be known by the same name in multiple countries. To allow the user to verify that a proposed identifier is indeed the right one, the software would offer the following link to the Ethnologue web site to see a report giving detailed information about the selected language (where abc is the proposed three-letter identifier):
Another application of the LanguageIndex.tab table is to find all the countries in which a given language is spoken. For instance, the following query returns the names of all the countries in which language abc is spoken:
SELECT DISTINCT C.Name FROM CountryCodes AS C
JOIN LanguageIndex AS L ON C.CountryID=L.CountryID
WHERE L.LangID='abc'
In this case DISTINCT must be used since a language could have multiple names in a given country.
Finally, the LanguageIndex.tab table can be used to learn all the languages spoken in a particular country. Whereas the query illustrated previously retrieves all languages whose primary country is Afghanistan, the following query retrieves all languages spoken in Afghanistan:
SELECT DISTINCT LangID FROM LanguageIndex
WHERE CountryID='AF'
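The difference between a language's primary country and the countries in which it is spoken can be seen on a handful of rows. The sketch below runs both kinds of lookup, plus a search by name, against an in-memory SQLite database using parameterized queries. The function names are assumptions, and the rows for the local-use code qaa are hypothetical, added only so that one language appears in a country other than its primary one; the remaining rows come from the sample tables above.

```python
import sqlite3
from typing import List


def build_sample_db() -> sqlite3.Connection:
    """Create an in-memory database holding a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE LanguageCodes (LangID char(3), CountryID char(2),
                                    LangStatus char(1), Name varchar(75));
        CREATE TABLE LanguageIndex (LangID char(3), CountryID char(2),
                                    NameType char(2), Name varchar(75));
        """
    )
    conn.executemany("INSERT INTO LanguageCodes VALUES (?, ?, ?, ?)", [
        ("aaa", "NG", "L", "Ghotuo"),
        ("aac", "PG", "L", "Ari"),
        ("qaa", "PG", "L", "Example"),   # hypothetical local-use entry
    ])
    conn.executemany("INSERT INTO LanguageIndex VALUES (?, ?, ?, ?)", [
        ("aaa", "NG", "L", "Ghotuo"),
        ("aaa", "NG", "LA", "Otuo"),
        ("aac", "PG", "L", "Ari"),
        ("qaa", "PG", "L", "Example"),   # hypothetical
        ("qaa", "NG", "L", "Example"),   # hypothetical: also used in NG
    ])
    return conn


def primary_languages(conn: sqlite3.Connection, country: str) -> List[str]:
    """Languages whose primary country is the given country."""
    sql = "SELECT LangID FROM LanguageCodes WHERE CountryID = ? ORDER BY LangID"
    return [row[0] for row in conn.execute(sql, (country,))]


def spoken_languages(conn: sqlite3.Connection, country: str) -> List[str]:
    """All languages with at least one name recorded for the given country."""
    sql = ("SELECT DISTINCT LangID FROM LanguageIndex "
           "WHERE CountryID = ? ORDER BY LangID")
    return [row[0] for row in conn.execute(sql, (country,))]


def languages_named(conn: sqlite3.Connection, name: str) -> List[str]:
    """Three-letter codes of all languages known by the given name."""
    sql = "SELECT DISTINCT LangID FROM LanguageIndex WHERE Name = ? ORDER BY LangID"
    return [row[0] for row in conn.execute(sql, (name,))]


if __name__ == "__main__":
    db = build_sample_db()
    print(primary_languages(db, "NG"))
    print(spoken_languages(db, "NG"))
    print(languages_named(db, "Otuo"))
```

Passing the country code or name as a query parameter, rather than pasting it into the SQL text, avoids quoting problems with names that contain apostrophes.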
The Ethnologue is a work in progress; our knowledge of the world's languages is always incomplete and subject to improvement. Many people who use the Ethnologue can give feedback that will make it better, and SIL International has always valued this kind of input. Users may have more accurate information on details like locations, names, population figures, or language development status. Or they may be able to provide information that would lead to a change to the set of language identifiers. For instance, they may be able to show that what is treated as one language is really two, or vice versa, or that a listed language does not exist, or that an existing language is not listed.
If you believe any of the information in the Ethnologue is in error, send your proposed change by e-mail to [email protected]. Be sure to report the source of your information. When you want to request that a language be added because you believe it to be missing altogether, please supply as much of the information listed in Layout of Language Entries as you can.
Before a proposed change is accepted, it must meet two requirements: it needs to be in keeping with the criteria given in the Introduction to the Ethnologue, and the facts that lie behind the proposed change need to be verified. The verification process may take months as it generally involves making enquiries of individuals who are resident in the country where the language is spoken. These persons may in turn make enquiries of others in order to perform the verification. Proposals that require changes to the code set will be processed with the ISO 639-3 registrar. Such changes may take a full year to process since ISO 639-3 runs an annual cycle for reviewing and adopting proposed changes.
All three-letter codes in the range qaa to qtz are reserved for local use. That is, they will never be assigned as language identifiers in ISO 639-3. Thus, when users feel that a needed code is missing from the code set, they may freely use one of these local use codes as a temporary measure until the outcome of a change request is known.
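An application that accepts such temporary codes may want to recognize them so that they can be revisited once a change request is resolved. The following is a small sketch of that check; the function name `is_local_use` is an assumption, and the test relies on the lower-case convention for the new codes.

```python
def is_local_use(code: str) -> bool:
    """Return True if code lies in the reserved local-use range qaa..qtz."""
    return (
        len(code) == 3
        and code.isascii()
        and code.isalpha()
        and code.islower()
        # For three lower-case ASCII letters, string order matches code order.
        and "qaa" <= code <= "qtz"
    )


if __name__ == "__main__":
    for code in ("qaa", "qtz", "qua", "aaa"):
        print(code, is_local_use(code))
```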
The code tables (as tab-delimited, ISO 8859-1 encoded plain text files) may be downloaded individually by clicking the following links. In each case, the first line contains the column names rather than the first row of data. These are the tables that correspond to the 15th edition of the Ethnologue (2005):
- Language_Code_Data_2005.zip (347K)