iVia Language Metadata Assignment

This page describes the iVia Language assignment algorithm.

Language Metadata

Language metadata identifies the language a resource is written in. iVia assigns a small subset of the ISO 639-1 two-letter language codes.

The Language Assignment Algorithm

iVia assigns Language metadata from two sources, tried in order. First, if HTTP headers are available and contain a Content-Language field, iVia uses the value of that field. Otherwise, iVia assigns a language based on the content of the document, using an implementation of the N-gram-based text categorization algorithm by Cavnar and Trenkle. The N-gram method is widely used (see, for example, TextCat) and is generally considered reliable.

Language Assignment Evaluation

We have not evaluated Language assignment because we do not have a multilingual test set. However, the algorithm's authors report accuracy of over 97% when sufficient training data is available.