Conventions

Data format

Input

For a one-sentence processor:

The input sentence is named “sentence” and the label is named “labels”.

The output fields are named:

    "input_ids",
    "label_ids",
    "word_ids",
    "attention_mask",
    "special_tokens_mask",
    "type_ids", 
    "sequence_ids",
    "char_ids",

For a two-sentence processor:

The two input sentences are named “sentence_a” and “sentence_b”, and the label is named “labels”.

The output fields are named:

    "input_ids",
    "label_ids",
    "word_ids",
    "attention_mask",
    "special_tokens_mask",
    "type_ids", 
    "sequence_ids",
    "char_ids",

MASK

We set mask == 1 for positions that hold data to be used and mask == 0 for positions that should be ignored.

Batch First

All data uses batch_first=True, i.e. the batch dimension always comes first.
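
To illustrate the mask and batch-first conventions together, here is a minimal pure-Python sketch that pads a batch of id sequences and builds the matching mask. The function name `pad_batch` and the padding id `0` are assumptions made for illustration, not DLK's API, and the sketch assumes that padding is the data to be ignored.

```python
from typing import List, Tuple


def pad_batch(
    sequences: List[List[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad sequences to equal length; return (ids, mask), both batch-first."""
    max_len = max(len(seq) for seq in sequences)
    ids, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        ids.append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding that should be ignored.
        mask.append([1] * len(seq) + [0] * n_pad)
    return ids, mask


ids, mask = pad_batch([[5, 6, 7], [8]])
# Batch first: the outer index selects the sample, the inner index the position.
print(ids)   # [[5, 6, 7], [8, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```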

Task naming conventions

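Every problem DLK handles is treated as a task. A task is divided into multiple sub-tasks, and a sub-task can in turn have its own sub-tasks. A task is defined as follows: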

{
    "_name": "task_name", // or "_base": "base_task_name"
    "_link": {}, // this is a reserved keyword
    "_search": {}, // this is a reserved keyword
    "sub_task1":{
    },
    "sub_task2":{
    }
}
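
As a rough illustration of this structure, the sketch below separates the reserved keys of a task config from its sub-task entries. It assumes that every key starting with an underscore is reserved, which generalizes from the keys shown in the definition; `split_task_config` is a hypothetical helper, not a DLK function.

```python
def split_task_config(task: dict) -> tuple:
    """Split a task config into (reserved entries, sub-task entries)."""
    # Assumption: keys with a leading underscore ("_name", "_base",
    # "_link", "_search") are reserved; all other keys are sub-tasks.
    reserved = {k: v for k, v in task.items() if k.startswith("_")}
    sub_tasks = {k: v for k, v in task.items() if not k.startswith("_")}
    return reserved, sub_tasks


task = {
    "_name": "task_name",
    "_link": {},
    "_search": {},
    "sub_task1": {},
    "sub_task2": {},
}
reserved, sub_tasks = split_task_config(task)
print(sorted(reserved))   # ['_link', '_name', '_search']
print(sorted(sub_tasks))  # ['sub_task1', 'sub_task2']
```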

Since every task can itself be regarded as a sub-task of another task, let us look at the conventions that apply to a sub-task.

This is the configuration format of a sub-task:

{
    "sub_task_name": {
        "_name": "sub_task_config_name",
        ...config
    }
}

or

{
    "sub_task_name": {
        "_base": "base_sub_task_config_name",
        ...additional config
    }
}

The key of the config denotes the sub-task.

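The sub_task_name usually indicates the role the sub-task plays in the task, and each sub-task is generally handled by a dedicated dlk module. For example, the subprocessor sub-tasks of a processor task are all handled by the dlk.data.subprocessors module collection (which contains multiple subprocessors). To distinguish between subprocessors, we write the sub_task_name in the form subprocessor@subprocessor_name_a, which indicates that the sub-task is handled by the subprocessor in the subprocessors collection that provides the subprocessor_name_a functionality.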

When naming the _base_name module in a config file, the sub_task_name already contained in the key is omitted.

The configure of a sub-task is named in the form AA@BB#CC.

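AA is the name of the specific module, within the module collection that sub_task_name refers to, that handles the sub-task. For example, the most common value, basic, means the sub-task is handled by the basic module, i.e. by the logic defined under the name basic in the corresponding module collection.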

BB indicates which problem the config addresses (for example seq_lab, txt_cls, etc.), and CC indicates the core feature of the config file for that problem.
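
Both naming schemes can be split with plain string operations. The following is a sketch under assumptions: `parse_sub_task_key` and `parse_config_name` are hypothetical helpers rather than DLK functions, the config name `basic@seq_lab#core_feature` is made up for illustration, and parts that are absent are returned as empty strings.

```python
def parse_sub_task_key(key: str) -> tuple:
    """Split 'subprocessor@subprocessor_name_a' into (collection, name)."""
    collection, _, name = key.partition("@")
    return collection, name


def parse_config_name(name: str) -> tuple:
    """Split 'AA@BB#CC' into (module, problem, feature)."""
    module, _, rest = name.partition("@")
    problem, _, feature = rest.partition("#")
    return module, problem, feature


print(parse_sub_task_key("subprocessor@subprocessor_name_a"))
# ('subprocessor', 'subprocessor_name_a')
print(parse_config_name("basic@seq_lab#core_feature"))
# ('basic', 'seq_lab', 'core_feature')
```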

Model conventions

  • All dropout is applied to the output or the interior of a module; no dropout is applied to the module input.

The main file tree:

.
├── train.py-------------------------: train entry 
├── predict.py-----------------------: predict entry
├── process.py-----------------------: process entry
├── online.py------------------------: online entry
├── managers-------------------------: pytorch_lightning or other trainer
│   └── lightning.py-----------------: 
├── configures-----------------------: all default or specific configs
│   ├── core-------------------------: 
│   │   ├── callbacks----------------: 
│   │   ├── imodels------------------: 
│   │   ├── layers-------------------: 
│   │   │   ├── decoders-------------: 
│   │   │   ├── embeddings-----------: 
│   │   │   └── encoders-------------: 
│   │   ├── losses-------------------: 
│   │   ├── models-------------------: 
│   │   ├── modules------------------: 
│   │   └── optimizers---------------: 
│   ├── data-------------------------: 
│   │   ├── datamodules--------------: 
│   │   ├── processors---------------: 
│   │   └── subprocessors------------: 
│   ├── managers---------------------: 
│   └── tasks------------------------: 
├── core-----------------------------: *core* pytorch or other model code
│   ├── base_module.py---------------: base module for "layers"
│   ├── callbacks--------------------: 
│   ├── imodels----------------------: 
│   ├── layers-----------------------: 
│   │   ├── decoders-----------------: 
│   │   ├── embeddings---------------: 
│   │   └── encoders-----------------: 
│   ├── losses-----------------------: 
│   ├── models-----------------------: 
│   ├── modules----------------------: 
│   ├── optimizers-------------------: 
│   └── schedules--------------------: 
├── data-----------------------------: *core* code for data processing or management
│   ├── datamodules------------------: 
│   ├── postprocessors---------------: 
│   ├── processors-------------------: 
│   └── subprocessors----------------: 
└── utils----------------------------: 
    ├── config.py--------------------: process config(dict) toolkit
    ├── get_root.py------------------: get project root path
    ├── logger.py--------------------: logger
    ├── parser.py--------------------: config parser
    ├── register.py------------------: register the module to a registry
    ├── tokenizer_util.py------------: tokenizer util
    └── vocab.py---------------------: vocabulary

Config Parser Rules

Inherit

A simple example:


default.hjson
{
    _base:  parant,
    config: {
        "will_be_rewrite": 3     
    }
}

parant.hjson
{
    _name:  base_config,
    config: {
        "will_be_rewrite": 1,
        "keep": 8     
    }
}

Given the two configs default.hjson and parant.hjson, the parser result will be:
{
    _name:  base_config,
    config: {
        "will_be_rewrite": 3,
        "keep": 8     
    }
}
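
The inheritance rule in this example behaves like a recursive dict merge in which the child overrides the base. The sketch below reproduces the result above; it is not DLK's parser. The helper names `merge` and `resolve` are hypothetical, the configs are assumed to be already loaded into plain dicts keyed by file name, and cyclic `_base` chains are not detected.

```python
import copy


def merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def resolve(name: str, configs: dict) -> dict:
    """Resolve `_base` inheritance for the config called `name`."""
    config = dict(configs[name])
    base_name = config.pop("_base", None)
    if base_name is None:
        return copy.deepcopy(config)
    # The child's values win over the (already resolved) base values.
    return merge(resolve(base_name, configs), config)


configs = {
    "default": {"_base": "parant", "config": {"will_be_rewrite": 3}},
    "parant": {"_name": "base_config", "config": {"will_be_rewrite": 1, "keep": 8}},
}
print(resolve("default", configs))
# {'_name': 'base_config', 'config': {'will_be_rewrite': 3, 'keep': 8}}
```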

Focus(Representation)

The focus part exists to simplify the log file: when logging, we use the values of the focus dict to replace the corresponding keys.
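
One way to picture this is a lookup from long config keys to short display names. The sketch below is only an illustration: the shape of the focus dict (a dotted key path mapped to a short name), the helper `log_name`, and the sample values are all assumptions, not DLK's actual format.

```python
def log_name(params: dict, focus: dict) -> str:
    """Build a compact log name, using focus values in place of the keys."""
    parts = []
    for key, short_name in focus.items():
        if key in params:
            # The short name from `focus` replaces the long key in the log.
            parts.append(f"{short_name}={params[key]}")
    return "_".join(parts)


# Hypothetical flattened parameters and focus mapping.
params = {"config.will_be_rewrite": 3, "config.keep": 8}
focus = {"config.will_be_rewrite": "rewrite"}
print(log_name(params, focus))  # rewrite=3
```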

SubModule(Combination)

Because we use a dict to represent a config, and each key is regarded as the submodule name, a top-level module cannot directly hold two or more submodules of the same kind (with different configs) under the same key. In that case you can set the submodule name to ‘submodule@special_name’.
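
A minimal sketch of how such keys could be grouped by submodule is shown below. The helper `group_submodules` and the example keys (`encoder@first`, `hidden_size`, and so on) are hypothetical and chosen only for illustration; a key without `@` is stored under an empty special name.

```python
from collections import defaultdict


def group_submodules(config: dict) -> dict:
    """Group config keys of the form 'submodule@special_name' by submodule."""
    groups = defaultdict(dict)
    for key, sub_config in config.items():
        submodule, _, special_name = key.partition("@")
        groups[submodule][special_name] = sub_config
    return dict(groups)


# Two submodules of the same kind coexist because their keys differ.
config = {
    "encoder@first": {"hidden_size": 128},
    "encoder@second": {"hidden_size": 256},
    "decoder": {},
}
print(group_submodules(config))
# {'encoder': {'first': {'hidden_size': 128}, 'second': {'hidden_size': 256}}, 'decoder': {'': {}}}
```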