`text_quality.language.fasttext`

Module Contents

Classes

FastTextLanguageClassifier

LanguageClassifier implementation using FastText.

class text_quality.language.fasttext.FastTextLanguageClassifier(*, model_file: pathlib.Path = _DEFAULT_MODEL_PATH, line_threshold: float = 0.5)

Bases: text_quality.language.classifier.LanguageClassifier

LanguageClassifier implementation using FastText.

See https://fasttext.cc/docs/en/language-identification.html for more information on the FastText language classifier and models.

MODEL_URLS: dict[str, str]: URLs for the FastText language models for automatic download.

_DEFAULT_MODEL_PATH: pathlib.Path: Default location for the FastText language model file.

_LABEL_PREFIX: str = '__label__': The FastText classifier always returns labels with this prefix; the prefix is stripped before the language code is returned.

classify(text: str) → tuple[str, float]

Classify a text string.

Parameters:

text (str) – The text to classify.

Returns:

A tuple[str, float] with the language code (e.g. “nl”) and the confidence, or (“”, 0.0) if all lines were below the confidence threshold.
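The documented contract of classify() (per-line thresholding, prefix removal, and the empty result) can be illustrated with a minimal sketch. This is not the library's implementation: `classify_sketch`, `predict_line`, and `fake_predict` are hypothetical names, the model is replaced by a lookup table, and skipping blank lines and keeping lines whose confidence equals the threshold are assumptions.

```python
from typing import Callable

LABEL_PREFIX = "__label__"  # mirrors the documented _LABEL_PREFIX value

# Hypothetical stand-in for the model: maps one line to (label, confidence).
Predictor = Callable[[str], tuple[str, float]]


def classify_sketch(
    text: str, predict_line: Predictor, line_threshold: float = 0.5
) -> tuple[str, float]:
    """Classify text line by line, ignoring low-confidence lines."""
    totals: dict[str, float] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines are skipped
        label, confidence = predict_line(line)
        if confidence < line_threshold:
            continue  # line is below the threshold, so it is ignored
        totals[label] = totals.get(label, 0.0) + confidence

    if not totals:
        return "", 0.0  # every line was below the threshold

    winner = max(totals, key=totals.get)
    share = totals[winner] / sum(totals.values())
    return winner.removeprefix(LABEL_PREFIX), share


def fake_predict(line: str) -> tuple[str, float]:
    """Hypothetical predictor with fixed answers, used only for illustration."""
    table = {
        "hallo wereld": ("__label__nl", 1.0),
        "goedemorgen": ("__label__nl", 0.5),
        "hello": ("__label__en", 0.5),
        "ok": ("__label__en", 0.25),
    }
    return table[line]


print(classify_sketch("hallo wereld\ngoedemorgen\nhello\nok", fake_predict))
# -> ('nl', 0.75)
print(classify_sketch("ok", fake_predict))
# -> ('', 0.0)
```

In the first call the line "ok" is dropped by the threshold, so the result is 1.5 / (1.5 + 0.5) = 0.75 for "nl".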

static _download_model(model_file: pathlib.Path)

static _aggregate_lines(line_labels: list[list[str]], line_confidences: list[numpy.typing.ArrayLike]) → tuple[str, float]

Aggregate the results per line from the classifier.

The confidence is the winning label's share of the total confidence:

Sum the per-line confidences for each label
The label with the largest total confidence is the winning label
The confidence is the summed confidence of the winning label divided by the total confidence of all labels

Because the classifier returns only the most likely label(s) for each line, the confidences per line do not sum to 1, so this results in a higher score than the average confidence.

Furthermore, the classify() method applies a threshold to the classifier to ignore lines with low confidence.

Parameters:

line_labels (list[list[str]]) – the labels returned by the classifier; one list of labels for each input line
line_confidences (list[ArrayLike]) – the confidences returned by the classifier; one array for each input line

Returns:

A tuple[str, float] with the winning label, prefix included (e.g. __label__nl), and the confidence.

Raises:

ValueError – if the labels and confidences are empty or of different lengths.
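The aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: `aggregate_lines_sketch` is a hypothetical name, plain Python sequences stand in for the ArrayLike confidences, and raising ValueError when no label is present at all is an assumption.

```python
from collections import defaultdict
from typing import Sequence


def aggregate_lines_sketch(
    line_labels: Sequence[Sequence[str]],
    line_confidences: Sequence[Sequence[float]],
) -> tuple[str, float]:
    """Return the winning label and its share of the total confidence."""
    if not line_labels or len(line_labels) != len(line_confidences):
        raise ValueError("labels and confidences must be non-empty and of equal length")

    # Step 1: sum the per-line confidences for each label.
    totals: dict[str, float] = defaultdict(float)
    for labels, confidences in zip(line_labels, line_confidences):
        for label, confidence in zip(labels, confidences):
            totals[label] += float(confidence)

    if not totals:
        # Assumption: lines without any label count as empty input.
        raise ValueError("no labels to aggregate")

    # Step 2: the label with the largest total confidence wins.
    winner = max(totals, key=totals.get)

    # Step 3: winning total divided by the total confidence of all labels.
    return winner, totals[winner] / sum(totals.values())


labels = [["__label__nl"], ["__label__nl"], ["__label__en"]]
confidences = [[1.0], [0.5], [0.5]]
print(aggregate_lines_sketch(labels, confidences))
# -> ('__label__nl', 0.75)
```

Here the label keeps its `__label__` prefix, matching the documented return value; the score 1.5 / 2.0 = 0.75 is higher than the mean line confidence of about 0.67.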
