text_quality.language.fasttext
Module Contents
Classes
LanguageClassifier implementation using FastText.
- class text_quality.language.fasttext.FastTextLanguageClassifier(*, model_file: pathlib.Path = _DEFAULT_MODEL_PATH, line_threshold: float = 0.5)
Bases: text_quality.language.classifier.LanguageClassifier
LanguageClassifier implementation using FastText.
See https://fasttext.cc/docs/en/language-identification.html for more information on the FastText language classifier and models.
- _DEFAULT_MODEL_PATH: pathlib.Path
Default location for the FastText language model file.
- _LABEL_PREFIX: str = '__label__'
The FastText classifier always returns labels with this prefix; the prefix is removed before the label is returned.
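As a minimal sketch of what removing the prefix could look like (the helper name `strip_label_prefix` is hypothetical and not part of this module):

```python
LABEL_PREFIX = "__label__"  # mirrors the documented _LABEL_PREFIX value


def strip_label_prefix(label: str) -> str:
    """Return the bare language code from a FastText-style label.

    Hypothetical helper for illustration; the real class may strip
    the prefix differently.
    """
    # str.removeprefix (Python 3.9+) leaves the string unchanged
    # when the prefix is absent.
    return label.removeprefix(LABEL_PREFIX)
```

With this sketch, `strip_label_prefix("__label__nl")` yields the bare code `nl`.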
- classify(text: str) -> tuple[str, float]
Classify a text string.
- Parameters:
text (str) – The text to classify.
- Returns:
A tuple[str, float] with the language code (e.g. “nl”) and the confidence, or (“”, 0.0) if all lines were below the confidence threshold.
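The return contract above can be consumed as in the following sketch. It assumes only the documented `classify(text) -> (code, confidence)` shape; `SupportsClassify`, `filter_by_language`, and `StubClassifier` are hypothetical names used so the example runs without a model file.

```python
from typing import Iterable, Protocol


class SupportsClassify(Protocol):
    """Anything exposing classify(text) -> (language code, confidence)."""

    def classify(self, text: str) -> tuple[str, float]: ...


def filter_by_language(
    texts: Iterable[str],
    classifier: SupportsClassify,
    language: str = "nl",
    min_confidence: float = 0.8,
) -> list[str]:
    """Keep the texts classified as `language` with enough confidence."""
    kept = []
    for text in texts:
        code, confidence = classifier.classify(text)
        # ("", 0.0) signals that no line passed the threshold; it never
        # matches a real language code, so such texts are dropped.
        if code == language and confidence >= min_confidence:
            kept.append(text)
    return kept


class StubClassifier:
    """Hypothetical stand-in for a real classifier (not the FastText one)."""

    def classify(self, text: str) -> tuple[str, float]:
        if not text.strip():
            return "", 0.0
        return ("nl", 0.95) if "het" in text.split() else ("en", 0.9)
```

In real use, an instance of FastTextLanguageClassifier would presumably take the place of `StubClassifier`, since it documents the same `classify` signature.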
- static _download_model(model_file: pathlib.Path)
- static _aggregate_lines(line_labels: list[list[str]], line_confidences: list[numpy.typing.ArrayLike]) -> tuple[str, float]
Aggregate the results per line from the classifier.
The returned confidence is the share of the total confidence that belongs to the winning label:
Sum the confidences per label over all lines.
The label with the largest total confidence is the winning label.
The confidence is the summed confidence of the winning label divided by the total confidence of all labels.
Because the confidences per line do not sum up to 1 – only the most likely label(s) is/are returned – this results in a higher score than the average confidence.
Furthermore, the classify() method applies a threshold to the classifier to ignore lines with low confidence.
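- Parameters:
line_labels (list[list[str]]) – The labels returned by the classifier for each line.
line_confidences (list[numpy.typing.ArrayLike]) – The confidences for each line, in the same order as line_labels.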
- Returns:
A tuple[str, float] with the winning label (e.g. __label__nl) and the confidence.
- Raises:
ValueError – if the labels and confidences are empty or of different lengths.
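The aggregation described for _aggregate_lines can be sketched as follows. This is an illustration under stated assumptions, not the module's actual code: the function name `aggregate_lines` is hypothetical, plain float sequences stand in for numpy arrays, and returning ("", 0.0) when no line carries any label is an assumption.

```python
from collections import defaultdict
from typing import Sequence


def aggregate_lines(
    line_labels: list[list[str]],
    line_confidences: list[Sequence[float]],
) -> tuple[str, float]:
    """Pick the label with the largest summed confidence over all lines."""
    if not line_labels or len(line_labels) != len(line_confidences):
        raise ValueError("labels and confidences must be non-empty and of equal length")

    # Sum the confidences per label over all lines.
    totals: dict[str, float] = defaultdict(float)
    for labels, confidences in zip(line_labels, line_confidences):
        for label, confidence in zip(labels, confidences):
            totals[label] += float(confidence)

    total = sum(totals.values())
    if total == 0.0:
        # Assumption: no usable labels at all yields an empty result.
        return "", 0.0

    # The winner is the label with the largest total; its confidence is
    # its share of the total confidence of all labels.
    winner = max(totals, key=lambda label: totals[label])
    return winner, totals[winner] / total
```

For example, two lines labelled `__label__nl` with confidences 0.9 and 0.7 and one line labelled `__label__en` with confidence 0.4 give totals of 1.6 and 0.4, so the winner is `__label__nl` with a confidence of 1.6 / 2.0 = 0.8.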