dlk.data.postprocessors package

Submodules

dlk.data.postprocessors.identity module

class dlk.data.postprocessors.identity.IdentityPostProcessor(config: dlk.data.postprocessors.identity.IdentityPostProcessorConfig)[source]

Bases: dlk.data.postprocessors.IPostProcessor

Identity post-processor: applies no task-specific post-processing.

process(stage, outputs, origin_data) → Dict[source]

Do nothing except gather the loss.

class dlk.data.postprocessors.identity.IdentityPostProcessorConfig(config: Dict)[source]

Bases: dlk.data.postprocessors.IPostProcessorConfig

Config for IdentityPostProcessor

dlk.data.postprocessors.seq_lab module

class dlk.data.postprocessors.seq_lab.AggregationStrategy[source]

Bases: object

The strategies available for aggregating token-level predictions.

AVERAGE = 'average'

FIRST = 'first'

MAX = 'max'

NONE = 'none'

SIMPLE = 'simple'

class dlk.data.postprocessors.seq_lab.SeqLabPostProcessor(config: dlk.data.postprocessors.seq_lab.SeqLabPostProcessorConfig)[source]

Bases: dlk.data.postprocessors.IPostProcessor

Post-processor for sequence labeling tasks

aggregate(pre_entities: List[dict], aggregation_strategy: dlk.data.postprocessors.seq_lab.AggregationStrategy) → List[dict][source]

aggregate_word(entities: List[dict], aggregation_strategy: dlk.data.postprocessors.seq_lab.AggregationStrategy) → dict[source]

aggregate_words(entities: List[dict], aggregation_strategy: dlk.data.postprocessors.seq_lab.AggregationStrategy) → List[dict][source]

Override tokens from a given word that disagree to force agreement on word boundaries.

Example
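A minimal sketch of how the strategies named in AggregationStrategy could force one label per word. It assumes each sub-token is a dict with the hypothetical keys "word" and "scores" (one probability per label); the function name aggregate_word_sketch and the id2label mapping are illustrative and are not taken from the library.

```python
from typing import Dict, List


def aggregate_word_sketch(tokens: List[Dict], strategy: str, id2label: Dict[int, str]) -> Dict:
    """Collapse the sub-tokens of one word into a single labelled word."""
    rows = [token["scores"] for token in tokens]
    if strategy == "first":
        # The first sub-token decides the label of the whole word.
        word_scores = rows[0]
    elif strategy == "max":
        # The sub-token holding the single highest score decides the label.
        word_scores = max(rows, key=max)
    elif strategy == "average":
        # Average the label distribution over all sub-tokens.
        word_scores = [sum(column) / len(rows) for column in zip(*rows)]
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    label_id = max(range(len(word_scores)), key=lambda index: word_scores[index])
    return {
        "word": "".join(token["word"] for token in tokens),
        "entity": id2label[label_id],
        "score": word_scores[label_id],
    }


# Made-up scores for the two sub-tokens of the word "microsoft".
id2label = {0: "O", 1: "B-ENT", 2: "I-ENT"}
tokens = [
    {"word": "micro", "scores": [0.1, 0.6, 0.3]},
    {"word": "soft", "scores": [0.05, 0.05, 0.9]},
]
by_first = aggregate_word_sketch(tokens, "first", id2label)
by_max = aggregate_word_sketch(tokens, "max", id2label)
by_average = aggregate_word_sketch(tokens, "average", id2label)
```

With these made-up scores, "first" labels the whole word B-ENT, while "max" and "average" both label it I-ENT, which shows why the choice of strategy matters when sub-tokens disagree.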

calc_score(predict_list: List, ground_truth_list: List)[source]

Compute the scores from predict_list and ground_truth_list.

Parameters

predict_list – list of predictions
ground_truth_list – list of ground truths

Returns

precision, recall, f1
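The sketch below shows one common way to obtain precision, recall and f1 from two parallel lists. It assumes that each list holds one collection of hashable entities per sentence and that the counts are micro-averaged over all sentences; neither assumption is confirmed by this page, and calc_score_sketch is a hypothetical name.

```python
from typing import Hashable, List, Sequence, Tuple


def calc_score_sketch(
    predict_list: List[Sequence[Hashable]],
    ground_truth_list: List[Sequence[Hashable]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over per-sentence entity lists."""
    num_predict = num_truth = num_correct = 0
    for predicts, truths in zip(predict_list, ground_truth_list):
        predict_set, truth_set = set(predicts), set(truths)
        num_predict += len(predict_set)
        num_truth += len(truth_set)
        # An entity counts as correct only if it matches a ground truth exactly.
        num_correct += len(predict_set & truth_set)
    precision = num_correct / num_predict if num_predict else 0.0
    recall = num_correct / num_truth if num_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Entities are (label, start, end) tuples in this illustration.
predicted = [[("PER", 0, 5), ("LOC", 10, 15)], [("ORG", 0, 3)]]
expected = [[("PER", 0, 5)], [("ORG", 0, 3), ("LOC", 8, 12)]]
precision, recall, f1 = calc_score_sketch(predicted, expected)
```

Dropping the start and end offsets from the tuples would approximate the behaviour described for the ignore_position option, where only the entity content is compared.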

crf_predict(list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame) → List[source]

Build the prediction info from the label_ids predicted by the CRF.

Parameters

list_batch_outputs – the CRF prediction info
origin_data – the original data

Returns

the prediction info for all instances

do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) → Dict[source]

Calculate the scores from predicts or list_batch_outputs.

Parameters

predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of it cannot be converted to tensors

rt_config –

>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }

Returns

the named scores: recall, precision, f1

do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) → List[source]

Convert the model predictions to a human-readable format.

There are three predictors for different seq_lab tasks, selected by config.use_crf (the predictions are already decoded to ids) and config.word_ready (the subwords have already been gathered to the first piece).

Parameters

stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of it cannot be converted to tensors

rt_config –

>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }

Returns

all predictions

do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]

Save the predictions when save_condition is True.

Parameters

predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of it cannot be converted to tensors

rt_config –

>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }

save_condition – True to force saving; False to let rt_config decide

Returns

None

gather_pre_entities(sentence: str, input_ids: numpy.ndarray, scores: numpy.ndarray, offset_mapping: Optional[List[Tuple[int, int]]], special_tokens_mask: numpy.ndarray) → List[dict][source]

Fuse various numpy arrays into dicts with all the information needed for aggregation

get_entity_info(sub_tokens_index: List, offset_mapping: List, word_ids: List, label: str) → Dict[source]

Gather the sub-tokens of an entity to get its start and end.

Parameters

sub_tokens_index – the index list of the entity tokens
offset_mapping – the offset of every token in the text
word_ids – the word index of every token
label – the predicted label

Returns

entity_info
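As an illustration of gathering sub-tokens into a character span, the sketch below takes the start of the first sub-token and the end of the last one. The returned keys "start", "end" and "labels" are assumptions, word_ids is accepted only to mirror the documented signature, and get_entity_info_sketch is a hypothetical name rather than the library's code.

```python
from typing import Dict, List, Optional, Tuple


def get_entity_info_sketch(
    sub_tokens_index: List[int],
    offset_mapping: List[Tuple[int, int]],
    word_ids: List[Optional[int]],
    label: str,
) -> Dict:
    """Turn the token indexes of one entity into a character span."""
    if not sub_tokens_index or not label:
        return {}
    first, last = sub_tokens_index[0], sub_tokens_index[-1]
    return {
        "start": offset_mapping[first][0],  # first character of the first sub-token
        "end": offset_mapping[last][1],     # one past the last character of the last sub-token
        "labels": [label],
    }


sentence = "John lives in New York"
# One (start, end) pair per token; (0, 0) marks special tokens.
offsets = [(0, 0), (0, 4), (5, 10), (11, 13), (14, 17), (18, 22), (0, 0)]
word_ids = [None, 0, 1, 2, 3, 4, None]
info = get_entity_info_sketch([4, 5], offsets, word_ids, "LOC")
entity_text = sentence[info["start"]:info["end"]]
```

Slicing the sentence with the returned offsets recovers the entity text, here "New York".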

get_tag(entity_name: str) → Tuple[str, str][source]

group_entities(entities: List[dict]) → List[dict][source]

Find and group together the adjacent tokens with the same entity predicted.

Parameters: entities – The entities predicted by the pipeline.

group_sub_entities(entities: List[dict]) → dict[source]

Group together the adjacent tokens with the same entity predicted.

Parameters: entities – The entities predicted by the pipeline.
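The grouping described for get_tag, group_entities and group_sub_entities can be sketched as follows, assuming BIO-style labels such as "B-PER" and "I-PER" and entity dicts with the hypothetical keys "entity", "word", "start" and "end". All function names ending in _sketch are illustrative.

```python
from typing import Dict, List, Tuple


def get_tag_sketch(entity_name: str) -> Tuple[str, str]:
    """Split 'B-PER' into ('B', 'PER'); unprefixed names are treated as 'I'."""
    if entity_name.startswith("B-"):
        return "B", entity_name[2:]
    if entity_name.startswith("I-"):
        return "I", entity_name[2:]
    return "I", entity_name


def group_sub_entities_sketch(entities: List[Dict]) -> Dict:
    """Merge one run of adjacent tokens that share an entity tag."""
    _, tag = get_tag_sketch(entities[0]["entity"])
    return {
        "entity_group": tag,
        "word": " ".join(entity["word"] for entity in entities),
        "start": entities[0]["start"],
        "end": entities[-1]["end"],
    }


def group_entities_sketch(entities: List[Dict]) -> List[Dict]:
    """Find runs of adjacent tokens with the same tag and merge each run."""
    groups: List[Dict] = []
    current: List[Dict] = []
    for entity in entities:
        prefix, tag = get_tag_sketch(entity["entity"])
        if current:
            _, last_tag = get_tag_sketch(current[-1]["entity"])
            # Same tag and not a new "B-" start: the token continues the run.
            if tag == last_tag and prefix != "B":
                current.append(entity)
                continue
            groups.append(group_sub_entities_sketch(current))
        current = [entity]
    if current:
        groups.append(group_sub_entities_sketch(current))
    return groups


tokens = [
    {"entity": "B-PER", "word": "John", "start": 0, "end": 4},
    {"entity": "I-PER", "word": "Smith", "start": 5, "end": 10},
    {"entity": "B-LOC", "word": "Paris", "start": 20, "end": 25},
]
grouped = group_entities_sketch(tokens)
```

Here the two PER tokens are merged into one group spanning characters 0 to 10, and the LOC token forms a group of its own.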

predict(list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame) → List[source]

General prediction process (especially for subwords).

Parameters

list_batch_outputs – the logits info of the predicted (sub-)labels
origin_data – the original data

Returns

the prediction info for all instances

word_predict(list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame) → List[source]

Build the prediction info from the label logits predicted for the first piece or the whole word.

Parameters

list_batch_outputs – the logits info of the predicted labels
origin_data – the original data

Returns

the prediction info for all instances

class dlk.data.postprocessors.seq_lab.SeqLabPostProcessorConfig(config: Dict)[source]

Bases: dlk.data.postprocessors.IPostProcessorConfig

Config for SeqLabPostProcessor

Config Example:

>>> {
>>>     "_name": "seq_lab",
>>>     "config": {
>>>         "meta": "*@*",
>>>         "use_crf": false, // whether to use a CRF
>>>         "word_ready": false, // whether the first subword token has already been gathered as the word representation
>>>         "ignore_position": true, // when calculating the metrics, whether to ignore the position info of the ground truth and the predictions (if true, only the entity content is compared, not its position)
>>>         "ignore_char": " ", // characters to strip from the beginning and end of an entity
>>>         //"ignore_char": " ()[]-.,:", // characters to strip from the beginning and end of an entity
>>>         "meta_data": {
>>>             "label_vocab": "label_vocab",
>>>             "tokenizer": "tokenizer",
>>>         },
>>>         "input_map": {
>>>             "logits": "logits",
>>>             "predict_seq_label": "predict_seq_label",
>>>             "_index": "_index",
>>>         },
>>>         "origin_input_map": {
>>>             "uuid": "uuid",
>>>             "sentence": "sentence",
>>>             "input_ids": "input_ids",
>>>             "entities_info": "entities_info",
>>>             "offsets": "offsets",
>>>             "special_tokens_mask": "special_tokens_mask",
>>>             "word_ids": "word_ids",
>>>             "label_ids": "label_ids",
>>>         },
>>>         "save_root_path": ".",  //save data root dir
>>>         "save_path": {
>>>             "valid": "valid",  // relative dir for valid stage
>>>             "test": "test",    // relative dir for test stage
>>>         },
>>>         "start_save_step": 0,  // -1 means the last
>>>         "start_save_epoch": -1,
>>>         "aggregation_strategy": "max", // AggregationStrategy item
>>>         "ignore_labels": ["O", "X", "S", "E"], // Out, Out, Start, End
>>>     }
>>> }
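One plausible reading of start_save_step, start_save_epoch and the save_condition argument is sketched below. It is an assumption, not the library's confirmed logic, that -1 resolves to the last zero-based step or epoch and that reaching either threshold is enough to start saving; should_save_sketch is a hypothetical helper.

```python
from typing import Dict


def should_save_sketch(
    rt_config: Dict,
    start_save_step: int,
    start_save_epoch: int,
    save_condition: bool = False,
) -> bool:
    """Decide whether predictions should be written for the current status."""
    if save_condition:
        # The caller forces saving regardless of training progress.
        return True

    def threshold(start: int, total: int) -> int:
        # Assumption: -1 selects the last step/epoch and counters are zero-based.
        return total - 1 if start == -1 else start

    step_reached = rt_config["current_step"] >= threshold(start_save_step, rt_config["total_steps"])
    epoch_reached = rt_config["current_epoch"] >= threshold(start_save_epoch, rt_config["total_epochs"])
    # Assumption: either threshold being reached triggers saving.
    return step_reached or epoch_reached


early = {"current_step": 50, "current_epoch": 0, "total_steps": 1000, "total_epochs": 10}
final = {"current_step": 950, "current_epoch": 9, "total_steps": 1000, "total_epochs": 10}
save_early = should_save_sketch(early, start_save_step=200, start_save_epoch=-1)
save_final = should_save_sketch(final, start_save_step=-1, start_save_epoch=-1)
save_forced = should_save_sketch(early, start_save_step=200, start_save_epoch=-1, save_condition=True)
```

Under this reading, the values shown in the config example (start_save_step of 0) would allow saving from the very first step.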

dlk.data.postprocessors.txt_cls module

class dlk.data.postprocessors.txt_cls.TxtClsPostProcessor(config: dlk.data.postprocessors.txt_cls.TxtClsPostProcessorConfig)[source]

Bases: dlk.data.postprocessors.IPostProcessor

Post-processor for text classification

do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) → Dict[source]

Calculate the scores from predicts or list_batch_outputs.

Parameters

predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of it cannot be converted to tensors

rt_config –

>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }

Returns

the named scores, acc

do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) → List[source]

Convert the model predictions to a human-readable format.

Parameters

stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of it cannot be converted to tensors

rt_config –

>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }

Returns

all predictions

do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]

Save the predictions when save_condition is True.

Parameters

predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of it cannot be converted to tensors

rt_config –

>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }

save_condition – True to force saving; False to let rt_config decide

Returns

None

class dlk.data.postprocessors.txt_cls.TxtClsPostProcessorConfig(config: Dict)[source]

Bases: dlk.data.postprocessors.IPostProcessorConfig

Config for TxtClsPostProcessor

Config Example:

>>> {
>>>     "_name": "txt_cls",
>>>     "config": {
>>>         "meta": "*@*",
>>>         "meta_data": {
>>>             "label_vocab": "label_vocab",
>>>         },
>>>         "input_map": {
>>>             "logits": "logits",
>>>             "label_ids": "label_ids",
>>>             "_index": "_index",
>>>         },
>>>         "origin_input_map": {
>>>             "sentence": "sentence",
>>>             "sentence_a": "sentence_a", // for pair
>>>             "sentence_b": "sentence_b",
>>>             "uuid": "uuid"
>>>         },
>>>         "save_root_path": ".",  //save data root dir
>>>         "top_k": 1, // return the top k results
>>>         "data_type": "single", //single or pair
>>>         "save_path": {
>>>             "valid": "valid",  // relative dir for valid stage
>>>             "test": "test",    // relative dir for test stage
>>>         },
>>>         "start_save_step": 0,  // -1 means the last
>>>         "start_save_epoch": -1,
>>>     }
>>> }
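To illustrate what the top_k option implies, the sketch below converts one row of logits to probabilities with a softmax and keeps the k most likely labels. The label list and the function name top_k_labels_sketch are hypothetical, and whether the library applies a softmax before ranking is an assumption.

```python
import math
from typing import List, Sequence, Tuple


def top_k_labels_sketch(
    logits: Sequence[float],
    index2label: Sequence[str],
    top_k: int = 1,
) -> List[Tuple[str, float]]:
    """Return the top_k (label, probability) pairs for one instance."""
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]
    # Rank the label indexes from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda index: probs[index], reverse=True)
    return [(index2label[index], probs[index]) for index in ranked[:top_k]]


labels = ["negative", "positive", "neutral"]
top_two = top_k_labels_sketch([0.1, 2.0, -1.0], labels, top_k=2)
```

With top_k set to 1, as in the config example, only the single most probable label would be returned per instance.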

dlk.data.postprocessors.txt_reg module

class dlk.data.postprocessors.txt_reg.TxtRegPostProcessor(config: dlk.data.postprocessors.txt_reg.TxtRegPostProcessorConfig)[source]

Bases: dlk.data.postprocessors.IPostProcessor

Post-processor for text regression

do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) → Dict[source]

Calculate the scores from predicts or list_batch_outputs.

Parameters

predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of it cannot be converted to tensors

rt_config –

>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }

Returns

the named scores

do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) → List[source]

Convert the model predictions to a human-readable format.

Parameters

stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of it cannot be converted to tensors

rt_config –

>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }

Returns

all predictions

do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]

Save the predictions when save_condition is True.

Parameters

predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of it cannot be converted to tensors

rt_config –

>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }

save_condition – True to force saving; False to let rt_config decide

Returns

None

class dlk.data.postprocessors.txt_reg.TxtRegPostProcessorConfig(config: Dict)[source]

Bases: dlk.data.postprocessors.IPostProcessorConfig

Config for TxtRegPostProcessor

Config Example:

>>> {
>>>     "_name": "txt_reg",
>>>     "config": {
>>>         "input_map": {
>>>             "logits": "logits",
>>>             "values": "values",
>>>             "_index": "_index",
>>>         },
>>>         "origin_input_map": {
>>>             "sentence": "sentence",
>>>             "sentence_a": "sentence_a", // for pair
>>>             "sentence_b": "sentence_b",
>>>             "uuid": "uuid"
>>>         },
>>>         "data_type": "single", //single or pair
>>>         "save_root_path": ".",  //save data root dir
>>>         "save_path": {
>>>             "valid": "valid",  // relative dir for valid stage
>>>             "test": "test",    // relative dir for test stage
>>>         },
>>>         "log_reg": false, // whether to use logistic regression
>>>         "start_save_step": 0,  // -1 means the last
>>>         "start_save_epoch": -1,
>>>     }
>>> }

Module contents

postprocessors

class dlk.data.postprocessors.IPostProcessor[source]

Bases: object

Base interface for post-processors

average_loss(list_batch_outputs: List[Dict]) → float[source]

Average the losses of all batches in list_batch_outputs.

Parameters: list_batch_outputs – a list of outputs
Returns: average_loss
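A minimal sketch of averaging the loss over batches, assuming each batch output is a dict that stores its loss under a key such as "loss" and that every batch is weighted equally. The key name, the equal weighting and the function name average_loss_sketch are assumptions made for illustration.

```python
from typing import Dict, List


def average_loss_sketch(list_batch_outputs: List[Dict], loss_name: str = "loss") -> float:
    """Unweighted mean of the per-batch losses."""
    # float() also accepts 0-d tensors or numpy scalars that define __float__.
    losses = [float(outputs[loss_name]) for outputs in list_batch_outputs]
    return sum(losses) / len(losses) if losses else 0.0


batches = [{"loss": 0.5, "logits": []}, {"loss": 1.5, "logits": []}]
mean_loss = average_loss_sketch(batches)
```

If the batches differ in size, a mean weighted by batch size may be preferable; the plain mean is used here only to keep the example short.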

abstract do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) → Dict[source]

Calculate the scores from predicts or list_batch_outputs.

Parameters

predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of it cannot be converted to tensors

rt_config –

>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }

Returns

the named scores

abstract do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) → List[source]

Convert the model predictions to a human-readable format.

Parameters

stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of it cannot be converted to tensors

rt_config –

>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }

Returns

all predictions

abstract do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]

Save the predictions when save_condition is True.

Parameters

predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of it cannot be converted to tensors

rt_config –

>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }

save_condition – True to force saving; False to let rt_config decide

Returns

None

gather_predict_extend_data(input_data: Dict, i: int, predict_extend_return: Dict)[source]

Gather the data registered in predict_extend_return.

Parameters

input_data – the model output
i – the index of the instance
predict_extend_return – the name map of the data to be retained

Returns: a dict of the data in input_data that is registered in predict_extend_return
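The gathering step can be pictured as a dictionary lookup per registered name. The sketch assumes predict_extend_return maps an output name to a key of input_data and that each value in input_data can be indexed by the instance position i; the direction of the mapping and the name gather_predict_extend_data_sketch are assumptions.

```python
from typing import Any, Dict


def gather_predict_extend_data_sketch(
    input_data: Dict[str, Any],
    i: int,
    predict_extend_return: Dict[str, str],
) -> Dict[str, Any]:
    """Pick the i-th item of every registered field from the model output."""
    return {
        name: input_data[source_key][i]
        for name, source_key in predict_extend_return.items()
        if source_key in input_data  # silently skip fields the model did not emit
    }


batch = {
    "logits": [[0.1, 0.9], [0.8, 0.2]],
    "attention": [[1, 1], [1, 0]],
    "_index": [7, 8],
}
extra = gather_predict_extend_data_sketch(batch, 1, {"attn": "attention"})
```

Only the registered field is kept, so unrelated outputs such as the logits are not copied into the prediction record.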

loss_name_map(stage) → str[source]

Get the loss name for the given stage.

Parameters: stage – valid, train or test
Returns: loss_name

process(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False) → Union[Dict, List][source]

Entry point of the post-processing.

Parameters

stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of it cannot be converted to tensors

rt_config –

>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }

save_condition – if True, force saving the predictions in every stage except online

Returns

the log_info (metrics), or the predictions when the stage is “online”
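The relationship between process and the three abstract methods can be sketched as a template method. The order of the calls, the contents of without_ground_truth_stage and the toy subclass are assumptions made for illustration; a plain None stands in for the pandas DataFrame so that the example has no dependencies.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Union


class PostProcessorSketch(ABC):
    """Hypothetical stand-in for IPostProcessor, reduced to its control flow."""

    # Assumption: stages without labels, for which no metrics can be computed.
    without_ground_truth_stage = {"predict", "online"}

    @abstractmethod
    def do_predict(self, stage, list_batch_outputs, origin_data, rt_config) -> List: ...

    @abstractmethod
    def do_calc_metrics(self, predicts, stage, list_batch_outputs, origin_data, rt_config) -> Dict: ...

    @abstractmethod
    def do_save(self, predicts, stage, list_batch_outputs, origin_data, rt_config, save_condition=False): ...

    def process(self, stage, list_batch_outputs, origin_data, rt_config, save_condition=False) -> Union[Dict, List]:
        predicts = self.do_predict(stage, list_batch_outputs, origin_data, rt_config)
        if stage == "online":
            # Online serving returns the predictions directly.
            return predicts
        log_info: Dict = {}
        if stage not in self.without_ground_truth_stage:
            log_info = self.do_calc_metrics(predicts, stage, list_batch_outputs, origin_data, rt_config)
        self.do_save(predicts, stage, list_batch_outputs, origin_data, rt_config, save_condition)
        return log_info


class ThresholdPostProcessor(PostProcessorSketch):
    """Toy subclass: predict label 1 when the score is at least 0.5."""

    def __init__(self):
        self.saved = []

    def do_predict(self, stage, list_batch_outputs, origin_data, rt_config):
        return [int(score >= 0.5) for batch in list_batch_outputs for score in batch["scores"]]

    def do_calc_metrics(self, predicts, stage, list_batch_outputs, origin_data, rt_config):
        labels = [label for batch in list_batch_outputs for label in batch["labels"]]
        correct = sum(predict == label for predict, label in zip(predicts, labels))
        return {f"{stage}_acc": correct / len(labels)}

    def do_save(self, predicts, stage, list_batch_outputs, origin_data, rt_config, save_condition=False):
        # Keep the predictions in memory instead of writing them to disk.
        if save_condition:
            self.saved.append((stage, predicts))


outputs = [{"scores": [0.9, 0.2], "labels": [1, 0]}, {"scores": [0.6], "labels": [0]}]
rt_config = {"current_step": 10, "current_epoch": 0, "total_steps": 100, "total_epochs": 3}
processor = ThresholdPostProcessor()
metrics = processor.process("valid", outputs, None, rt_config, save_condition=True)
online = processor.process("online", outputs, None, rt_config)
```

In this sketch a "valid" call returns a metrics dict and triggers a save, whereas an "online" call returns the raw predictions and skips both metrics and saving.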

property without_ground_truth_stage: set

The set of stages for which there is no ground truth.

Returns: without_ground_truth_stage

class dlk.data.postprocessors.IPostProcessorConfig(config)[source]

Bases: dlk.utils.config.BaseConfig

Base config for post-processors

property input_map

The name map of the required model output contents.

Returns: input_map

property origin_input_map

The column name map of the required original data (before it is passed to the datamodule).

Returns: origin_input_map

property predict_extend_return

The extra data to save along with the predictions.

Returns: predict_extend_return

dlk.data.postprocessors.import_postprocessors(postprocessors_dir, namespace)[source]