dlk.data.postprocessors package

Submodules

dlk.data.postprocessors.identity module

class dlk.data.postprocessors.identity.IdentityPostProcessor(config: dlk.data.postprocessors.identity.IdentityPostProcessorConfig)[source]

Bases: dlk.data.postprocessors.IPostProcessor

Post-processor that performs no task-specific post-processing.

process(stage, outputs, origin_data) Dict[source]

Does nothing except gather the loss.

class dlk.data.postprocessors.identity.IdentityPostProcessorConfig(config: Dict)[source]

Bases: dlk.data.postprocessors.IPostProcessorConfig

Config for IdentityPostProcessor

dlk.data.postprocessors.seq_lab module

class dlk.data.postprocessors.seq_lab.AggregationStrategy[source]

Bases: object

The available strategies for aggregating token-level predictions into word-level entities.

AVERAGE = 'average'
FIRST = 'first'
MAX = 'max'
NONE = 'none'
SIMPLE = 'simple'
class dlk.data.postprocessors.seq_lab.SeqLabPostProcessor(config: dlk.data.postprocessors.seq_lab.SeqLabPostProcessorConfig)[source]

Bases: dlk.data.postprocessors.IPostProcessor

Post-processor for sequence labeling tasks

aggregate(pre_entities: List[dict], aggregation_strategy: dlk.data.postprocessors.seq_lab.AggregationStrategy) List[dict][source]
aggregate_word(entities: List[dict], aggregation_strategy: dlk.data.postprocessors.seq_lab.AggregationStrategy) dict[source]
aggregate_words(entities: List[dict], aggregation_strategy: dlk.data.postprocessors.seq_lab.AggregationStrategy) List[dict][source]

Override tokens from a given word that disagree to force agreement on word boundaries.

Example

micro|soft| com|pany| B-ENT I-NAME I-ENT I-ENT

will be rewritten with the first strategy as

microsoft| company| B-ENT I-ENT
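The first strategy on the example above can be sketched as follows. This is only an illustration: the function name and the token keys `word`, `label`, and `is_subword` are hypothetical, and the real `aggregate_words` may use different keys and also carry scores.

```python
from typing import Dict, List


def aggregate_words_first(tokens: List[Dict]) -> List[Dict]:
    """Merge sub-tokens into words, keeping the label of each word's first token."""
    words: List[Dict] = []
    for token in tokens:
        if token["is_subword"] and words:
            # Continuation piece: extend the current word and ignore its own label.
            words[-1]["word"] += token["word"]
        else:
            # First piece of a new word: its label becomes the word's label.
            words.append({"word": token["word"], "label": token["label"]})
    return words


tokens = [
    {"word": "micro", "label": "B-ENT", "is_subword": False},
    {"word": "soft", "label": "I-NAME", "is_subword": True},
    {"word": "com", "label": "I-ENT", "is_subword": False},
    {"word": "pany", "label": "I-ENT", "is_subword": True},
]
print(aggregate_words_first(tokens))
```

The disagreeing I-NAME label on the second piece is dropped because only the first piece of each word decides the label.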

calc_score(predict_list: List, ground_truth_list: List)[source]

Use predict_list and ground_truth_list to calculate the scores.

Parameters
  • predict_list – list of predictions

  • ground_truth_list – list of ground truths

Returns

precision, recall, f1
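A minimal sketch of how the three scores could be computed, assuming each list item is a set of hashable entity tuples for one sample and that the scores are micro-averaged. The function name and the data layout are assumptions; the actual `calc_score` may count matches differently.

```python
from typing import List, Set, Tuple


def calc_score_sketch(predict_list: List[Set], ground_truth_list: List[Set]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over per-sample entity sets."""
    num_predict = sum(len(p) for p in predict_list)
    num_truth = sum(len(g) for g in ground_truth_list)
    # An entity counts as correct when it appears in both sets of the same sample.
    num_match = sum(len(p & g) for p, g in zip(predict_list, ground_truth_list))
    precision = num_match / num_predict if num_predict else 0.0
    recall = num_match / num_truth if num_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


predicts = [{("PER", "Ann"), ("ORG", "Acme")}, {("LOC", "Oslo")}]
truths = [{("PER", "Ann")}, {("LOC", "Oslo"), ("PER", "Bob")}]
precision, recall, f1 = calc_score_sketch(predicts, truths)
print(precision, recall, f1)
```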

crf_predict(list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame) List[source]

Use the label ids predicted by the CRF to get the prediction info.

Parameters
  • list_batch_outputs – the CRF prediction info

  • origin_data – the original data

Returns

info for all predicted instances

do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) Dict[source]

Calculate the scores using predicts or list_batch_outputs.

Parameters
  • predicts – list of predictions

  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; some of its fields cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

Returns

the named scores, recall, precision, f1

do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) List[source]

Convert the model predictions to a human-readable format.

There are three predictors for different seq_lab tasks, selected by config.use_crf (the prediction is already decoded to ids) and config.word_ready (subwords have already been gathered to the first piece).

Parameters
  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; some of its fields cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

Returns

all predicts
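The choice between the three predictors might look like the sketch below. The method names `crf_predict`, `word_predict`, and `predict` come from this page, but the helper `select_predictor` and the order in which the two flags are checked are assumptions.

```python
def select_predictor(use_crf: bool, word_ready: bool) -> str:
    """Return the name of the predictor method assumed to handle this config."""
    if use_crf:
        # The model output is already decoded to label ids.
        return "crf_predict"
    if word_ready:
        # The logits are already one per word (first piece gathered upstream).
        return "word_predict"
    # Otherwise fall back to the general sub-word path.
    return "predict"


print(select_predictor(use_crf=False, word_ready=False))
```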

do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]

Save the predictions when save_condition is True.

Parameters
  • predicts – list of predictions

  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; some of its fields cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

  • save_condition – True to force saving; False to let rt_config decide

Returns

None
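A rough sketch of how save_condition and rt_config might combine. It assumes that the start_save_step and start_save_epoch config options act as thresholds, that -1 stands for the last step or epoch, and that reaching either threshold is enough to save. The helper name `should_save` and this exact rule are assumptions, not confirmed library behavior.

```python
from typing import Dict


def should_save(rt_config: Dict, start_save_step: int, start_save_epoch: int,
                save_condition: bool = False) -> bool:
    """Hypothetical rule: save when forced, or once a start threshold is reached."""
    if save_condition:
        return True
    # Assumption: -1 means "only at the last step/epoch".
    step_threshold = rt_config["total_steps"] - 1 if start_save_step == -1 else start_save_step
    epoch_threshold = rt_config["total_epochs"] - 1 if start_save_epoch == -1 else start_save_epoch
    return (rt_config["current_step"] >= step_threshold
            or rt_config["current_epoch"] >= epoch_threshold)


rt_config = {"current_step": 50, "current_epoch": 1, "total_steps": 200, "total_epochs": 4}
print(should_save(rt_config, start_save_step=100, start_save_epoch=-1))
```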

gather_pre_entities(sentence: str, input_ids: numpy.ndarray, scores: numpy.ndarray, offset_mapping: Optional[List[Tuple[int, int]]], special_tokens_mask: numpy.ndarray) List[dict][source]

Fuse various numpy arrays into dicts with all the information needed for aggregation

get_entity_info(sub_tokens_index: List, offset_mapping: List, word_ids: List, label: str) Dict[source]

Gather the sub-tokens of an entity to get its start and end.

Parameters
  • sub_tokens_index – the index list of the entity's tokens

  • offset_mapping – the offset of every token in the text

  • word_ids – the index of the word each token belongs to

  • label – the predicted label

Returns

entity_info
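As an illustration, the start and end of an entity can be read from the offsets of its first and last sub-token. The function name and the keys of the returned dict are hypothetical; the real `get_entity_info` may return different fields.

```python
from typing import Dict, List, Tuple


def get_entity_info_sketch(sub_tokens_index: List[int],
                           offset_mapping: List[Tuple[int, int]],
                           word_ids: List[int],
                           label: str) -> Dict:
    """Build a span description from the first and last sub-token of an entity."""
    first, last = sub_tokens_index[0], sub_tokens_index[-1]
    return {
        "start": offset_mapping[first][0],  # character offset where the entity begins
        "end": offset_mapping[last][1],     # character offset just past the entity
        "word_start": word_ids[first],      # index of the first word covered
        "word_end": word_ids[last],         # index of the last word covered
        "label": label,
    }


sentence = "Microsoft company"
offsets = [(0, 5), (5, 9), (10, 13), (13, 17)]  # micro | soft | com | pany
info = get_entity_info_sketch([0, 1], offsets, [0, 0, 1, 1], "ENT")
print(sentence[info["start"]:info["end"]])
```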

get_tag(entity_name: str) Tuple[str, str][source]
group_entities(entities: List[dict]) List[dict][source]

Find and group together the adjacent tokens with the same entity predicted.

Parameters

entities – The entities predicted by the pipeline.
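The grouping can be sketched as below, assuming BIO-style labels and entity dicts with hypothetical keys `entity` and `word`. The helper `split_tag` stands in for `get_tag`, whose exact behavior is not documented here; treating an unprefixed label as an inside tag is an assumption.

```python
from typing import Dict, List, Tuple


def split_tag(entity_name: str) -> Tuple[str, str]:
    """Split 'B-LOC' into ('B', 'LOC'); unprefixed labels are treated as inside tags."""
    if entity_name.startswith(("B-", "I-")):
        return entity_name[0], entity_name[2:]
    return "I", entity_name


def group_entities_sketch(entities: List[Dict]) -> List[Dict]:
    """Group adjacent tokens that continue the same entity type."""
    groups: List[List[Dict]] = []
    current: List[Dict] = []
    for entity in entities:
        bi, tag = split_tag(entity["entity"])
        if current and bi == "I" and tag == split_tag(current[-1]["entity"])[1]:
            # Same type and not a new beginning: extend the open group.
            current.append(entity)
        else:
            if current:
                groups.append(current)
            current = [entity]
    if current:
        groups.append(current)
    return [
        {"entity_group": split_tag(g[0]["entity"])[1], "word": " ".join(e["word"] for e in g)}
        for g in groups
    ]


tagged = [
    {"word": "New", "entity": "B-LOC"},
    {"word": "York", "entity": "I-LOC"},
    {"word": "Times", "entity": "B-ORG"},
]
print(group_entities_sketch(tagged))
```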

group_sub_entities(entities: List[dict]) dict[source]

Group together the adjacent tokens with the same entity predicted.

Parameters

entities – The entities predicted by the pipeline.

predict(list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame) List[source]

General prediction process (especially for subwords).

Parameters
  • list_batch_outputs – the logits info of the predicted (sub-)labels

  • origin_data – the original data

Returns

info for all predicted instances

word_predict(list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame) List[source]

Use the label logits predicted for the first piece or the whole word to get the prediction info.

Parameters
  • list_batch_outputs – the logits info of the predicted labels

  • origin_data – the original data

Returns

info for all predicted instances

class dlk.data.postprocessors.seq_lab.SeqLabPostProcessorConfig(config: Dict)[source]

Bases: dlk.data.postprocessors.IPostProcessorConfig

Config for SeqLabPostProcessor

Config Example:
>>> {
>>>     "_name": "seq_lab",
>>>     "config": {
>>>         "meta": "*@*",
>>>         "use_crf": false, // whether to use a CRF
>>>         "word_ready": false, // whether the first subword token has already been gathered as the word representation
>>>         "ignore_position": true, // when calculating the metrics, whether to ignore the position info of the ground truth and the predictions (if true, only the entity content is compared, not the position)
>>>         "ignore_char": " ", // characters to strip if an entity begins or ends with them
>>>         //"ignore_char": " ()[]-.,:", // characters to strip if an entity begins or ends with them
>>>         "meta_data": {
>>>             "label_vocab": "label_vocab",
>>>             "tokenizer": "tokenizer",
>>>         },
>>>         "input_map": {
>>>             "logits": "logits",
>>>             "predict_seq_label": "predict_seq_label",
>>>             "_index": "_index",
>>>         },
>>>         "origin_input_map": {
>>>             "uuid": "uuid",
>>>             "sentence": "sentence",
>>>             "input_ids": "input_ids",
>>>             "entities_info": "entities_info",
>>>             "offsets": "offsets",
>>>             "special_tokens_mask": "special_tokens_mask",
>>>             "word_ids": "word_ids",
>>>             "label_ids": "label_ids",
>>>         },
>>>         "save_root_path": ".",  //save data root dir
>>>         "save_path": {
>>>             "valid": "valid",  // relative dir for valid stage
>>>             "test": "test",    // relative dir for test stage
>>>         },
>>>         "start_save_step": 0,  // -1 means the last
>>>         "start_save_epoch": -1,
>>>         "aggregation_strategy": "max", // AggregationStrategy item
>>>         "ignore_labels": ["O", "X", "S", "E"], // Out, Out, Start, End
>>>     }
>>> }

dlk.data.postprocessors.txt_cls module

class dlk.data.postprocessors.txt_cls.TxtClsPostProcessor(config: dlk.data.postprocessors.txt_cls.TxtClsPostProcessorConfig)[source]

Bases: dlk.data.postprocessors.IPostProcessor

Post-processor for text classification

do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) Dict[source]

Calculate the scores using predicts or list_batch_outputs.

Parameters
  • predicts – list of predictions

  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; some of its fields cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

Returns

the named scores, acc

do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) List[source]

Convert the model predictions to a human-readable format.

Parameters
  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; some of its fields cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

Returns

all predicts

do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]

Save the predictions when save_condition is True.

Parameters
  • predicts – list of predictions

  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; some of its fields cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

  • save_condition – True to force saving; False to let rt_config decide

Returns

None

class dlk.data.postprocessors.txt_cls.TxtClsPostProcessorConfig(config: Dict)[source]

Bases: dlk.data.postprocessors.IPostProcessorConfig

Config for TxtClsPostProcessor

Config Example:
>>> {
>>>     "_name": "txt_cls",
>>>     "config": {
>>>         "meta": "*@*",
>>>         "meta_data": {
>>>             "label_vocab": "label_vocab",
>>>         },
>>>         "input_map": {
>>>             "logits": "logits",
>>>             "label_ids": "label_ids",
>>>             "_index": "_index",
>>>         },
>>>         "origin_input_map": {
>>>             "sentence": "sentence",
>>>             "sentence_a": "sentence_a", // for pair
>>>             "sentence_b": "sentence_b",
>>>             "uuid": "uuid"
>>>         },
>>>         "save_root_path": ".",  //save data root dir
>>>         "top_k": 1, // return the top k results
>>>         "data_type": "single", //single or pair
>>>         "save_path": {
>>>             "valid": "valid",  // relative dir for valid stage
>>>             "test": "test",    // relative dir for test stage
>>>         },
>>>         "start_save_step": 0,  // -1 means the last
>>>         "start_save_epoch": -1,
>>>     }
>>> }
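To illustrate the top_k option, the sketch below turns the logits of one sample into its k most probable labels. It assumes a softmax over the logits and a plain list of label names; the function name and the return shape are hypothetical and may differ from what `TxtClsPostProcessor` produces.

```python
import math
from typing import List, Tuple


def top_k_labels(logits: List[float], labels: List[str], top_k: int = 1) -> List[Tuple[str, float]]:
    """Return the top_k (label, probability) pairs, most probable first."""
    peak = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    return [(labels[i], probs[i]) for i in order]


result = top_k_labels([1.0, 3.0, 2.0], ["neg", "pos", "neu"], top_k=2)
print(result)
```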

dlk.data.postprocessors.txt_reg module

class dlk.data.postprocessors.txt_reg.TxtRegPostProcessor(config: dlk.data.postprocessors.txt_reg.TxtRegPostProcessorConfig)[source]

Bases: dlk.data.postprocessors.IPostProcessor

Post-processor for text regression

do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) Dict[source]

Calculate the scores using predicts or list_batch_outputs.

Parameters
  • predicts – list of predictions

  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; some of its fields cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

Returns

the named scores

do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) List[source]

Convert the model predictions to a human-readable format.

Parameters
  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; some of its fields cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

Returns

all predicts

do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]

Save the predictions when save_condition is True.

Parameters
  • predicts – list of predictions

  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; some of its fields cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

  • save_condition – True to force saving; False to let rt_config decide

Returns

None

class dlk.data.postprocessors.txt_reg.TxtRegPostProcessorConfig(config: Dict)[source]

Bases: dlk.data.postprocessors.IPostProcessorConfig

Config for TxtRegPostProcessor

Config Example:
>>> {
>>>     "_name": "txt_reg",
>>>     "config": {
>>>         "input_map": {
>>>             "logits": "logits",
>>>             "values": "values",
>>>             "_index": "_index",
>>>         },
>>>         "origin_input_map": {
>>>             "sentence": "sentence",
>>>             "sentence_a": "sentence_a", // for pair
>>>             "sentence_b": "sentence_b",
>>>             "uuid": "uuid"
>>>         },
>>>         "data_type": "single", //single or pair
>>>         "save_root_path": ".",  //save data root dir
>>>         "save_path": {
>>>             "valid": "valid",  // relative dir for valid stage
>>>             "test": "test",    // relative dir for test stage
>>>         },
>>>         "log_reg": false, // whether to use logistic regression
>>>         "start_save_step": 0,  // -1 means the last
>>>         "start_save_epoch": -1,
>>>     }
>>> }

Module contents

postprocessors

class dlk.data.postprocessors.IPostProcessor[source]

Bases: object

Base interface for all post-processors.

average_loss(list_batch_outputs: List[Dict]) float[source]

Average the losses over all batches in list_batch_outputs.

Parameters

list_batch_outputs – a list of outputs

Returns

average_loss
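A minimal sketch of the averaging, assuming each batch output is a dict that stores a scalar loss under a key named `loss`. The key name and the unweighted mean over batches are assumptions; the real method presumably looks the key up through `loss_name_map`.

```python
from typing import Dict, List


def average_loss_sketch(list_batch_outputs: List[Dict], loss_name: str = "loss") -> float:
    """Unweighted mean of the per-batch losses; batches without a loss are skipped."""
    losses = [float(out[loss_name]) for out in list_batch_outputs if loss_name in out]
    return sum(losses) / len(losses) if losses else 0.0


batches = [{"loss": 1.0}, {"loss": 2.0}, {"loss": 3.0}, {"logits": [0.1, 0.9]}]
print(average_loss_sketch(batches))
```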

abstract do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) Dict[source]

Calculate the scores using predicts or list_batch_outputs.

Parameters
  • predicts – list of predictions

  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; some of its fields cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

Returns

the named scores

abstract do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) List[source]

Convert the model predictions to a human-readable format.

Parameters
  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; some of its fields cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

Returns

all predicts

abstract do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]

Save the predictions when save_condition is True.

Parameters
  • predicts – list of predictions

  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; some of its fields cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

  • save_condition – True to force saving; False to let rt_config decide

Returns

None

gather_predict_extend_data(input_data: Dict, i: int, predict_extend_return: Dict)[source]

Gather the data registered in predict_extend_return.

Parameters
  • input_data – the model output

  • i – the index of the instance to gather

  • predict_extend_return – the name map of the data to keep

Returns

a dict of the data in input_data that is registered in predict_extend_return
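One way to read this description is sketched below. It assumes that predict_extend_return maps an output name to the key under which the data is stored in input_data, and that every stored value can be indexed by instance. Both the direction of the map and the function name are assumptions.

```python
from typing import Dict


def gather_extend_data_sketch(input_data: Dict, i: int, predict_extend_return: Dict) -> Dict:
    """Pick the i-th item of every registered field that exists in input_data."""
    return {
        name: input_data[source][i]
        for name, source in predict_extend_return.items()
        if source in input_data
    }


model_output = {"logits": [[0.1, 0.9], [0.8, 0.2]], "attention": ["a0", "a1"]}
extend = gather_extend_data_sketch(model_output, 1, {"attn": "attention", "missing": "hidden"})
print(extend)
```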

loss_name_map(stage) str[source]

Get the loss name for the given stage.

Parameters

stage – valid, train or test

Returns

loss_name

process(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False) Union[Dict, List][source]

Entry point of the post-processing.

Parameters
  • stage – train/test/etc.

  • list_batch_outputs – a list of outputs

  • origin_data – the original pd.DataFrame data; some of its fields cannot be converted to tensors

  • rt_config

    >>> current status
    >>> {
    >>>     "current_step": self.global_step,
    >>>     "current_epoch": self.current_epoch,
    >>>     "total_steps": self.num_training_steps,
    >>>     "total_epochs": self.num_training_epochs
    >>> }
    

  • save_condition – if True, force saving the predictions at every stage except online

Returns

the log_info (metrics), or the predicts when the stage is “online”
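The overall flow of the entry point might be sketched as follows. The order of the calls, the skipping of metrics for stages without ground truth, and the toy class `EchoPostProcessor` are all assumptions made for illustration; only the method names and the online behavior come from this page.

```python
from typing import Dict, List, Union


def process_sketch(post, stage: str, list_batch_outputs: List[Dict], origin_data,
                   rt_config: Dict, save_condition: bool = False) -> Union[Dict, List]:
    """Predict, then either return the predictions (online) or score and save them."""
    predicts = post.do_predict(stage, list_batch_outputs, origin_data, rt_config)
    if stage == "online":
        return predicts
    log_info: Dict = {}
    if stage not in post.without_ground_truth_stage:
        # Metrics need ground truth, so skip them for stages that have none.
        log_info = post.do_calc_metrics(predicts, stage, list_batch_outputs, origin_data, rt_config)
    post.do_save(predicts, stage, list_batch_outputs, origin_data, rt_config, save_condition)
    return log_info


class EchoPostProcessor:
    """Toy stand-in for an IPostProcessor subclass (hypothetical)."""
    without_ground_truth_stage = {"predict", "online"}

    def __init__(self):
        self.saved = []

    def do_predict(self, stage, list_batch_outputs, origin_data, rt_config):
        return [out["value"] for out in list_batch_outputs]

    def do_calc_metrics(self, predicts, stage, list_batch_outputs, origin_data, rt_config):
        return {f"{stage}_count": len(predicts)}

    def do_save(self, predicts, stage, list_batch_outputs, origin_data, rt_config, save_condition=False):
        if save_condition:
            self.saved.append((stage, predicts))


post = EchoPostProcessor()
print(process_sketch(post, "valid", [{"value": 1}, {"value": 2}], None, {}, save_condition=True))
```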

property without_ground_truth_stage: set

The stages for which no ground truth is available.

Returns

without_ground_truth_stage

class dlk.data.postprocessors.IPostProcessorConfig(config)[source]

Bases: dlk.utils.config.BaseConfig

Base config for all post-processors.

property input_map

The required name map for the contents of the model output.

Returns

input_map

property origin_input_map

The required column name map for the original data (before it is passed to the datamodule).

Returns

origin_input_map

property predict_extend_return

The extended data to save along with the predictions.

Returns

predict_extend_return

dlk.data.postprocessors.import_postprocessors(postprocessors_dir, namespace)[source]