text_quality.classifier.pipeline

Classification pipeline.

Module Contents

Classes

Reason

Reasons for the classification result.

Pipeline

A wrapper around an sklearn pipeline that adds a featurizer.

Functions

default_scores_dict(→ ClassifierScores)

Generate a ClassifierScores dict with default values.

Attributes

ClassifierScores

Container class for the scores returned by the classifier.

text_quality.classifier.pipeline.ClassifierScores

Container class for the scores returned by the classifier.

class text_quality.classifier.pipeline.Reason

Bases: enum.Enum

Reasons for the classification result.

CLASSIFIER
SHORT_COLUMNS
EMPTY
LANGUAGE
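
As a rough illustration of how the enum might be consumed, the sketch below defines a stand-in with the four documented member names. The member values (enum.auto()) and the helper is_classifier_reason are assumptions made for this example, not part of the library.

```python
from enum import Enum, auto


class Reason(Enum):
    """Stand-in using the documented member names; the values are assumed."""

    CLASSIFIER = auto()
    SHORT_COLUMNS = auto()
    EMPTY = auto()
    LANGUAGE = auto()


def is_classifier_reason(reason: Reason) -> bool:
    """Hypothetical helper: True only when the reason is CLASSIFIER."""
    return reason is Reason.CLASSIFIER


if __name__ == "__main__":
    for reason in Reason:
        print(reason.name, is_classifier_reason(reason))
```

Comparing members with "is" works because each Enum member is a singleton.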

text_quality.classifier.pipeline.default_scores_dict(default_value, **fields) → ClassifierScores

Generate a ClassifierScores dict with default values.

Parameters:
  • default_value – The default value for the scores.

  • fields – Keyword arguments to add to the dict with their given values; these fields do not take the default value.
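
The behaviour described above could look roughly like the following sketch. The fields of ClassifierScores are not listed on this page, so the names in SCORE_FIELDS are hypothetical, and default_scores_dict_sketch is an illustrative stand-in rather than the library function.

```python
from typing import Any, Dict

# Hypothetical field names; the real ClassifierScores fields may differ.
SCORE_FIELDS = ("confidence", "dictionary_score")


def default_scores_dict_sketch(default_value: Any, **fields: Any) -> Dict[str, Any]:
    """Fill every known field with default_value, then apply explicit fields."""
    scores = {name: default_value for name in SCORE_FIELDS}
    scores.update(fields)  # explicitly passed fields override the default
    return scores


if __name__ == "__main__":
    print(default_scores_dict_sketch(0.0, confidence=0.9))
```

Fields passed as keyword arguments keep their given values, while all remaining fields receive default_value.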

class text_quality.classifier.pipeline.Pipeline(pipeline: sklearn.pipeline.Pipeline, featurizer: text_quality.feature.featurize.Featurizer, default_language: str = DEFAULT_LANGUAGE)

A wrapper around an sklearn pipeline that adds a featurizer.

property features: List[str]

The names of the features used in the pipeline.

classify(page: text_quality.page.page.Page | str) → int

Single instance classification.

_classify_pagexml(pagexml: text_quality.page.page.Page) → int

Classify a Page object.

classify_with_scores(page: text_quality.page.page.Page | str) → tuple[int, ClassifierScores, Reason]

Single instance classification with scores.
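
To show how the returned 3-tuple might be unpacked, here is a minimal sketch built on a stand-in class. StubPipeline, its labels, its score values, the "confidence" field, and the describe helper are all made up for illustration; only the tuple shape (int, scores, Reason) and the Reason member names come from the signature above.

```python
from enum import Enum, auto
from typing import Dict, Tuple


class Reason(Enum):
    """Stand-in using the documented member names; the values are assumed."""

    CLASSIFIER = auto()
    SHORT_COLUMNS = auto()
    EMPTY = auto()
    LANGUAGE = auto()


class StubPipeline:
    """Hypothetical stand-in for Pipeline; not the library implementation."""

    def classify_with_scores(self, page: str) -> Tuple[int, Dict[str, float], Reason]:
        # Made-up rule: blank input yields label 0 with reason EMPTY.
        if not page.strip():
            return 0, {"confidence": 0.0}, Reason.EMPTY
        # Made-up fixed result for any other input.
        return 1, {"confidence": 0.5}, Reason.CLASSIFIER


def describe(result: Tuple[int, Dict[str, float], Reason]) -> str:
    """Unpack the (label, scores, reason) tuple into a readable string."""
    label, scores, reason = result
    return f"label={label} reason={reason.name} scores={scores}"


if __name__ == "__main__":
    stub = StubPipeline()
    print(describe(stub.classify_with_scores("")))
    print(describe(stub.classify_with_scores("some text")))
```

With the real class, the same unpacking would apply to the value returned by Pipeline.classify_with_scores.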

_classify_pagexml_with_scores(pagexml: text_quality.page.page.Page) → tuple[int, ClassifierScores, Reason]

Classify a Page object with scores.

static _is_short(text: str)

classmethod from_file(pipeline_file: pathlib.Path, featurizer: text_quality.feature.featurize.Featurizer)

Load a pipeline from a file.