dlk.data.postprocessors package
Submodules
dlk.data.postprocessors.identity module
- class dlk.data.postprocessors.identity.IdentityPostProcessor(config: dlk.data.postprocessors.identity.IdentityPostProcessorConfig)[source]
Bases:
dlk.data.postprocessors.IPostProcessor
docstring for IdentityPostProcessor
- class dlk.data.postprocessors.identity.IdentityPostProcessorConfig(config: Dict)[source]
Bases:
dlk.data.postprocessors.IPostProcessorConfig
docstring for IdentityPostProcessorConfig
dlk.data.postprocessors.seq_lab module
- class dlk.data.postprocessors.seq_lab.AggregationStrategy[source]
Bases:
object
docstring for AggregationStrategy
- AVERAGE = 'average'
- FIRST = 'first'
- MAX = 'max'
- NONE = 'none'
- SIMPLE = 'simple'
- class dlk.data.postprocessors.seq_lab.SeqLabPostProcessor(config: dlk.data.postprocessors.seq_lab.SeqLabPostProcessorConfig)[source]
Bases:
dlk.data.postprocessors.IPostProcessor
Post-processor for sequence labeling tasks
- aggregate(pre_entities: List[dict], aggregation_strategy: dlk.data.postprocessors.seq_lab.AggregationStrategy) List[dict] [source]
- aggregate_word(entities: List[dict], aggregation_strategy: dlk.data.postprocessors.seq_lab.AggregationStrategy) dict [source]
- aggregate_words(entities: List[dict], aggregation_strategy: dlk.data.postprocessors.seq_lab.AggregationStrategy) List[dict] [source]
Override tokens from a given word that disagree to force agreement on word boundaries.
Example
micro|soft| com|pany|
B-ENT I-NAME I-ENT I-ENT
will be rewritten with the first strategy as
microsoft| company|
B-ENT I-ENT
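The rewrite in the example above can be sketched as follows. This is only an illustration of the "first" strategy, not the library's implementation: the helper name aggregate_first and the dict keys word, entity and is_subword are hypothetical names chosen for this sketch.

```python
from typing import Dict, List


def aggregate_first(tokens: List[Dict]) -> List[Dict]:
    """Merge sub-tokens into words; each word keeps the label of its first sub-token."""
    words: List[Dict] = []
    for token in tokens:
        if token["is_subword"] and words:
            # Continuation piece: extend the current word, keep its first label.
            words[-1]["word"] += token["word"]
        else:
            # First piece of a new word: its label becomes the word's label.
            words.append({"word": token["word"], "entity": token["entity"]})
    return words


# Hypothetical token dicts mirroring the example above.
tokens = [
    {"word": "micro", "entity": "B-ENT", "is_subword": False},
    {"word": "soft", "entity": "I-NAME", "is_subword": True},
    {"word": "com", "entity": "I-ENT", "is_subword": False},
    {"word": "pany", "entity": "I-ENT", "is_subword": True},
]
print(aggregate_first(tokens))
# [{'word': 'microsoft', 'entity': 'B-ENT'}, {'word': 'company', 'entity': 'I-ENT'}]
```

The disagreeing I-NAME label on "soft" is dropped because only the first piece of each word is consulted.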
- calc_score(predict_list: List, ground_truth_list: List)[source]
Calculate scores from predict_list and ground_truth_list.
- Parameters
predict_list – list of predictions
ground_truth_list – list of ground-truth items
- Returns
precision, recall, f1
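A minimal sketch of how precision, recall and f1 can be derived from two such lists. It assumes that each item is hashable (for example a (label, start, end) tuple) and that a prediction counts as correct only when it matches a ground-truth item exactly; the library may count matches differently, for instance when ignore_position is set. The function name precision_recall_f1 is hypothetical.

```python
from typing import Hashable, List, Tuple


def precision_recall_f1(
    predict_list: List[Hashable], ground_truth_list: List[Hashable]
) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over two collections of items."""
    predicted = set(predict_list)
    gold = set(ground_truth_list)
    true_positives = len(predicted & gold)
    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One of the two predicted entities matches one of the two gold entities.
predicted = [("ORG", 0, 9), ("LOC", 20, 28)]
gold = [("ORG", 0, 9), ("PER", 30, 35)]
print(precision_recall_f1(predicted, gold))  # (0.5, 0.5, 0.5)
```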
- crf_predict(list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame) List [source]
Use the label_ids predicted by the CRF to build the prediction info.
- Parameters
list_batch_outputs – the CRF prediction info
origin_data – the original data
- Returns
prediction info for all instances
- do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) Dict [source]
Calculate the scores using predicts or list_batch_outputs.
- Parameters
predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of its data cannot be converted to tensors
rt_config –
>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }
- Returns
the named scores: recall, precision, f1
- do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) List [source]
Convert the model predictions to a human-readable format.
There are three predictors for different seq_lab tasks, chosen according to config.use_crf (the prediction is already decoded to ids) and config.word_ready (the subwords have already been gathered to the first piece).
- Parameters
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of its data cannot be converted to tensors
rt_config –
>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }
- Returns
all predictions
- do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]
Save the predictions when save_condition==True.
- Parameters
predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of its data cannot be converted to tensors
rt_config –
>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }
save_condition – True to save; False to let rt_config decide
- Returns
None
- gather_pre_entities(sentence: str, input_ids: numpy.ndarray, scores: numpy.ndarray, offset_mapping: Optional[List[Tuple[int, int]]], special_tokens_mask: numpy.ndarray) List[dict] [source]
Fuse various numpy arrays into dicts with all the information needed for aggregation
- get_entity_info(sub_tokens_index: List, offset_mapping: List, word_ids: List, label: str) Dict [source]
Gather the sub-tokens of an entity to get its start and end.
- Parameters
sub_tokens_index – the index list of the entity's tokens
offset_mapping – the offset of every token in the text
word_ids – the word index of every token
label – the predicted label
- Returns
entity_info
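A minimal sketch of the idea behind these parameters: the character span of an entity runs from the start offset of its first sub-token to the end offset of its last sub-token. The function name entity_span and the keys of the returned dict are assumptions made for this example; the actual layout of entity_info is not documented here.

```python
from typing import Dict, List, Optional, Tuple


def entity_span(
    sub_tokens_index: List[int],
    offset_mapping: List[Tuple[int, int]],
    word_ids: List[Optional[int]],
    label: str,
) -> Dict:
    """Build a span dict from the sub-token indices of one entity."""
    first, last = sub_tokens_index[0], sub_tokens_index[-1]
    return {
        "start": offset_mapping[first][0],  # char start of the first sub-token
        "end": offset_mapping[last][1],     # char end of the last sub-token
        "word_ids": sorted({word_ids[i] for i in sub_tokens_index}),
        "label": label,
    }


sentence = "Microsoft company hires"
# Hypothetical tokenization: [CLS] micro soft company hires [SEP]
offset_mapping = [(0, 0), (0, 5), (5, 9), (10, 17), (18, 23), (0, 0)]
word_ids = [None, 0, 0, 1, 2, None]

info = entity_span([1, 2], offset_mapping, word_ids, "ORG")
print(info)                                 # {'start': 0, 'end': 9, 'word_ids': [0], 'label': 'ORG'}
print(sentence[info["start"]:info["end"]])  # Microsoft
```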
- group_entities(entities: List[dict]) List[dict] [source]
Find and group together the adjacent tokens with the same entity predicted.
- Parameters
entities – The entities predicted by the pipeline.
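The grouping described above can be sketched as follows, assuming BIO-style labels and token dicts with the hypothetical keys entity, word, start and end. The output key entity_group and the function name group_adjacent are likewise assumptions for this illustration, not the library's documented format.

```python
from typing import Dict, List, Optional


def group_adjacent(entities: List[Dict]) -> List[Dict]:
    """Merge consecutive tokens that continue the same entity (BIO labels)."""
    groups: List[Dict] = []
    open_group: Optional[Dict] = None
    for ent in entities:
        prefix, _, tag = ent["entity"].partition("-")
        if prefix == "O":
            # An outside token closes whatever group was open.
            open_group = None
            continue
        if open_group is not None and prefix == "I" and open_group["entity_group"] == tag:
            # Same entity continues: extend the text and the end offset.
            open_group["word"] += " " + ent["word"]
            open_group["end"] = ent["end"]
        else:
            open_group = {
                "entity_group": tag,
                "word": ent["word"],
                "start": ent["start"],
                "end": ent["end"],
            }
            groups.append(open_group)
    return groups


tokens = [
    {"entity": "B-ORG", "word": "Hugging", "start": 0, "end": 7},
    {"entity": "I-ORG", "word": "Face", "start": 8, "end": 12},
    {"entity": "O", "word": "is", "start": 13, "end": 15},
    {"entity": "B-LOC", "word": "New", "start": 19, "end": 22},
    {"entity": "I-LOC", "word": "York", "start": 23, "end": 27},
]
for group in group_adjacent(tokens):
    print(group)
```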
- group_sub_entities(entities: List[dict]) dict [source]
Group together the adjacent tokens with the same entity predicted.
- Parameters
entities – The entities predicted by the pipeline.
- predict(list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame) List [source]
General prediction process (especially for subwords).
- Parameters
list_batch_outputs – the logits of the predicted (sub-)labels
origin_data – the original data
- Returns
prediction info for all instances
- word_predict(list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame) List [source]
Use the label logits predicted for the first piece or the whole word to build the prediction info.
- Parameters
list_batch_outputs – the logits of the predicted labels
origin_data – the original data
- Returns
prediction info for all instances
- class dlk.data.postprocessors.seq_lab.SeqLabPostProcessorConfig(config: Dict)[source]
Bases:
dlk.data.postprocessors.IPostProcessorConfig
Config for SeqLabPostProcessor
- Config Example:
>>> {
>>>     "_name": "seq_lab",
>>>     "config": {
>>>         "meta": "*@*",
>>>         "use_crf": false, // whether to use a CRF
>>>         "word_ready": false, // whether the first subword token has already been gathered as the word representation
>>>         "ignore_position": true, // whether to ignore the position info of the ground truth and the prediction when calculating the metrics (if true, only the entity content is compared, not its position)
>>>         "ignore_char": " ", // if the entity begins or ends with one of these chars, they are ignored
>>>         //"ignore_char": " ()[]-.,:", // if the entity begins or ends with one of these chars, they are ignored
>>>         "meta_data": {
>>>             "label_vocab": "label_vocab",
>>>             "tokenizer": "tokenizer",
>>>         },
>>>         "input_map": {
>>>             "logits": "logits",
>>>             "predict_seq_label": "predict_seq_label",
>>>             "_index": "_index",
>>>         },
>>>         "origin_input_map": {
>>>             "uuid": "uuid",
>>>             "sentence": "sentence",
>>>             "input_ids": "input_ids",
>>>             "entities_info": "entities_info",
>>>             "offsets": "offsets",
>>>             "special_tokens_mask": "special_tokens_mask",
>>>             "word_ids": "word_ids",
>>>             "label_ids": "label_ids",
>>>         },
>>>         "save_root_path": ".", // root dir for saved data
>>>         "save_path": {
>>>             "valid": "valid", // relative dir for valid stage
>>>             "test": "test", // relative dir for test stage
>>>         },
>>>         "start_save_step": 0, // -1 means the last
>>>         "start_save_epoch": -1,
>>>         "aggregation_strategy": "max", // AggregationStrategy item
>>>         "ignore_labels": ["O", "X", "S", "E"], // Out, Out, Start, End
>>>     }
>>> }
dlk.data.postprocessors.txt_cls module
- class dlk.data.postprocessors.txt_cls.TxtClsPostProcessor(config: dlk.data.postprocessors.txt_cls.TxtClsPostProcessorConfig)[source]
Bases:
dlk.data.postprocessors.IPostProcessor
Post-processor for text classification
- do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) Dict [source]
Calculate the scores using predicts or list_batch_outputs.
- Parameters
predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of its data cannot be converted to tensors
rt_config –
>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }
- Returns
the named scores: acc
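For reference, accuracy is the fraction of instances whose predicted label equals the ground-truth label. The helper below is a hypothetical sketch that works on plain lists of labels; it does not reproduce how the library extracts labels from predicts or list_batch_outputs.

```python
from typing import List


def accuracy(predict_labels: List[str], gold_labels: List[str]) -> float:
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predict_labels) != len(gold_labels):
        raise ValueError("predictions and ground truth must have the same length")
    if not gold_labels:
        return 0.0
    correct = sum(p == g for p, g in zip(predict_labels, gold_labels))
    return correct / len(gold_labels)


print(accuracy(["pos", "neg", "pos", "neg"], ["pos", "neg", "neg", "neg"]))  # 0.75
```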
- do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) List [source]
Convert the model predictions to a human-readable format.
- Parameters
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of its data cannot be converted to tensors
rt_config –
>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }
- Returns
all predictions
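One common way to turn classification logits into readable predictions is to apply a softmax and keep the top_k most probable labels (top_k appears in the config example of TxtClsPostProcessorConfig). The sketch below assumes the label vocabulary is a plain list indexed by class id and that a softmax is the right normalization; both are assumptions, and top_k_labels is a hypothetical name.

```python
import math
from typing import List, Tuple


def top_k_labels(
    logits: List[float], label_vocab: List[str], top_k: int = 1
) -> List[Tuple[str, float]]:
    """Return the top_k (label, probability) pairs for one instance."""
    # Numerically stable softmax: subtract the max logit before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(label_vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


labels = ["neg", "pos", "neutral"]
for label, prob in top_k_labels([0.1, 2.0, -1.0], labels, top_k=2):
    print(label, round(prob, 3))
```

With these logits the two labels printed are "pos" followed by "neg", since they have the two largest logits.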
- do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]
Save the predictions when save_condition==True.
- Parameters
predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of its data cannot be converted to tensors
rt_config –
>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }
save_condition – True to save; False to let rt_config decide
- Returns
None
- class dlk.data.postprocessors.txt_cls.TxtClsPostProcessorConfig(config: Dict)[source]
Bases:
dlk.data.postprocessors.IPostProcessorConfig
Config for TxtClsPostProcessor
- Config Example:
>>> {
>>>     "_name": "txt_cls",
>>>     "config": {
>>>         "meta": "*@*",
>>>         "meta_data": {
>>>             "label_vocab": "label_vocab",
>>>         },
>>>         "input_map": {
>>>             "logits": "logits",
>>>             "label_ids": "label_ids",
>>>             "_index": "_index",
>>>         },
>>>         "origin_input_map": {
>>>             "sentence": "sentence",
>>>             "sentence_a": "sentence_a", // for pair
>>>             "sentence_b": "sentence_b",
>>>             "uuid": "uuid"
>>>         },
>>>         "save_root_path": ".", // root dir for saved data
>>>         "top_k": 1, // return the top k results
>>>         "data_type": "single", // single or pair
>>>         "save_path": {
>>>             "valid": "valid", // relative dir for valid stage
>>>             "test": "test", // relative dir for test stage
>>>         },
>>>         "start_save_step": 0, // -1 means the last
>>>         "start_save_epoch": -1,
>>>     }
>>> }
dlk.data.postprocessors.txt_reg module
- class dlk.data.postprocessors.txt_reg.TxtRegPostProcessor(config: dlk.data.postprocessors.txt_reg.TxtRegPostProcessorConfig)[source]
Bases:
dlk.data.postprocessors.IPostProcessor
Post-processor for text regression
- do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) Dict [source]
Calculate the scores using predicts or list_batch_outputs.
- Parameters
predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of its data cannot be converted to tensors
rt_config –
>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }
- Returns
the named scores
- do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) List [source]
Convert the model predictions to a human-readable format.
- Parameters
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of its data cannot be converted to tensors
rt_config –
>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }
- Returns
all predictions
- do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]
Save the predictions when save_condition==True.
- Parameters
predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of its data cannot be converted to tensors
rt_config –
>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }
save_condition – True to save; False to let rt_config decide
- Returns
None
- class dlk.data.postprocessors.txt_reg.TxtRegPostProcessorConfig(config: Dict)[source]
Bases:
dlk.data.postprocessors.IPostProcessorConfig
Config for TxtRegPostProcessor
- Config Example:
>>> {
>>>     "_name": "txt_reg",
>>>     "config": {
>>>         "input_map": {
>>>             "logits": "logits",
>>>             "values": "values",
>>>             "_index": "_index",
>>>         },
>>>         "origin_input_map": {
>>>             "sentence": "sentence",
>>>             "sentence_a": "sentence_a", // for pair
>>>             "sentence_b": "sentence_b",
>>>             "uuid": "uuid"
>>>         },
>>>         "data_type": "single", // single or pair
>>>         "save_root_path": ".", // root dir for saved data
>>>         "save_path": {
>>>             "valid": "valid", // relative dir for valid stage
>>>             "test": "test", // relative dir for test stage
>>>         },
>>>         "log_reg": false, // whether to use logistic regression
>>>         "start_save_step": 0, // -1 means the last
>>>         "start_save_epoch": -1,
>>>     }
>>> }
Module contents
postprocessors
- class dlk.data.postprocessors.IPostProcessor[source]
Bases:
object
docstring for IPostProcessor
- average_loss(list_batch_outputs: List[Dict]) float [source]
Average the loss over all the batches in list_batch_outputs.
- Parameters
list_batch_outputs – a list of outputs
- Returns
average_loss
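A minimal sketch of such an average, assuming each batch output is a dict that stores a scalar loss under the key "loss" and that every batch is weighted equally. The key name and the equal weighting are assumptions; the real outputs may hold tensors under a stage-specific name (see loss_name_map).

```python
from typing import Dict, List


def average_loss(list_batch_outputs: List[Dict], loss_key: str = "loss") -> float:
    """Unweighted mean of the per-batch losses; 0.0 when no loss is present."""
    losses = [float(batch[loss_key]) for batch in list_batch_outputs if loss_key in batch]
    return sum(losses) / len(losses) if losses else 0.0


batches = [{"loss": 0.5}, {"loss": 1.5}, {"loss": 1.0}]
print(average_loss(batches))  # 1.0
```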
- abstract do_calc_metrics(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) Dict [source]
Calculate the scores using predicts or list_batch_outputs.
- Parameters
predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of its data cannot be converted to tensors
rt_config –
>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }
- Returns
the named scores
- abstract do_predict(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict) List [source]
Convert the model predictions to a human-readable format.
- Parameters
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of its data cannot be converted to tensors
rt_config –
>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }
- Returns
all predictions
- abstract do_save(predicts: List, stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False)[source]
Save the predictions when save_condition==True.
- Parameters
predicts – list of predictions
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of its data cannot be converted to tensors
rt_config –
>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }
save_condition – True to save; False to let rt_config decide
- Returns
None
- gather_predict_extend_data(input_data: Dict, i: int, predict_extend_return: Dict)[source]
Gather the data registered in predict_extend_return.
- Parameters
input_data – the model output
i – the index of the item to gather
predict_extend_return – the name map of the data to keep
- Returns
a dict of the data in input_data that is registered in predict_extend_return
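A minimal sketch of this kind of gathering, under the assumption that predict_extend_return maps an output name to the key in input_data holding a batch of values, and that i selects one item from that batch. The direction of the map and the function name gather_extend_data are assumptions made for this illustration.

```python
from typing import Any, Dict


def gather_extend_data(
    input_data: Dict[str, Any], i: int, predict_extend_return: Dict[str, str]
) -> Dict[str, Any]:
    """Pick the i-th item of every registered field from a batch of outputs."""
    return {
        out_name: input_data[source_key][i]
        for out_name, source_key in predict_extend_return.items()
    }


batch = {
    "logits": [[0.1, 0.9], [0.8, 0.2]],
    "embedding": [[1.0, 2.0], [3.0, 4.0]],
    "_index": [7, 8],
}
# Keep only the embedding of the second instance, under the name "sentence_vec".
print(gather_extend_data(batch, 1, {"sentence_vec": "embedding"}))
# {'sentence_vec': [3.0, 4.0]}
```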
- loss_name_map(stage) str [source]
Get the loss name for the given stage.
- Parameters
stage – valid, train or test
- Returns
loss_name
- process(stage: str, list_batch_outputs: List[Dict], origin_data: pandas.core.frame.DataFrame, rt_config: Dict, save_condition: bool = False) Union[Dict, List] [source]
Post-process entry point.
- Parameters
stage – train/test/etc.
list_batch_outputs – a list of outputs
origin_data – the original pd.DataFrame data; some of its data cannot be converted to tensors
rt_config –
>>> current status
>>> {
>>>     "current_step": self.global_step,
>>>     "current_epoch": self.current_epoch,
>>>     "total_steps": self.num_training_steps,
>>>     "total_epochs": self.num_training_epochs
>>> }
save_condition – if True, force saving the predictions on all stages except online
- Returns
the log_info (metrics); if the stage is "online", the predictions are returned instead
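The relationship between process and the three abstract methods can be sketched as a template method. This is a guess at the control flow based only on the descriptions above, not the library's code: the classes SketchPostProcessor and EchoPostProcessor are hypothetical, origin_data is typed as Any instead of a DataFrame to keep the sketch dependency-free, and the contents of without_ground_truth_stage are assumed.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Union


class SketchPostProcessor(ABC):
    """Hypothetical stand-in for IPostProcessor; not the library's code."""

    # Assumption: stages for which no ground truth is available.
    without_ground_truth_stage = {"predict", "online"}

    @abstractmethod
    def do_predict(self, stage: str, list_batch_outputs: List[Dict],
                   origin_data: Any, rt_config: Dict) -> List:
        ...

    @abstractmethod
    def do_calc_metrics(self, predicts: List, stage: str, list_batch_outputs: List[Dict],
                        origin_data: Any, rt_config: Dict) -> Dict:
        ...

    @abstractmethod
    def do_save(self, predicts: List, stage: str, list_batch_outputs: List[Dict],
                origin_data: Any, rt_config: Dict, save_condition: bool = False) -> None:
        ...

    def process(self, stage: str, list_batch_outputs: List[Dict], origin_data: Any,
                rt_config: Dict, save_condition: bool = False) -> Union[Dict, List]:
        predicts = self.do_predict(stage, list_batch_outputs, origin_data, rt_config)
        if stage == "online":
            # Online serving returns the predictions directly; nothing is saved.
            return predicts
        log_info: Dict = {}
        if stage not in self.without_ground_truth_stage:
            log_info = self.do_calc_metrics(
                predicts, stage, list_batch_outputs, origin_data, rt_config)
        self.do_save(predicts, stage, list_batch_outputs, origin_data,
                     rt_config, save_condition)
        return log_info


class EchoPostProcessor(SketchPostProcessor):
    """Toy subclass: predictions are the raw outputs, the only metric is a count."""

    def __init__(self) -> None:
        self.saved: List = []

    def do_predict(self, stage, list_batch_outputs, origin_data, rt_config):
        return [batch["output"] for batch in list_batch_outputs]

    def do_calc_metrics(self, predicts, stage, list_batch_outputs, origin_data, rt_config):
        return {f"{stage}_count": len(predicts)}

    def do_save(self, predicts, stage, list_batch_outputs, origin_data,
                rt_config, save_condition=False):
        if save_condition:
            self.saved.append((stage, predicts))


post = EchoPostProcessor()
batches = [{"output": "a"}, {"output": "b"}]
print(post.process("valid", batches, None, {}, save_condition=True))  # {'valid_count': 2}
print(post.process("online", batches, None, {}))  # ['a', 'b']
```

Keeping the orchestration in the base class means a subclass only has to implement the three do_* hooks.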
- property without_ground_truth_stage: set
The stages in the returned set have no ground truth.
- Returns
without_ground_truth_stage
- class dlk.data.postprocessors.IPostProcessorConfig(config)[source]
Bases:
dlk.utils.config.BaseConfig
docstring for PostProcessorConfigBase
- property input_map
Required: the name map for the contents of the model output.
- Returns
input_map
- property origin_input_map
Required: the column name map for the original data (before it is passed to the datamodule).
- Returns
origin_input_map
- property predict_extend_return
The extended data to save with the predictions.
- Returns
predict_extend_return