#+TITLE: NLP
#+STARTUP: overview
Data processing
All input should be numerical; categorical features should be one-hot coded, with indices starting at 1.
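A minimal sketch of such one-hot coding (the category values and the helper are illustrative, not from these notes):

#+begin_src python
# Minimal sketch: one-hot code a categorical feature, indices starting at 1.
categories = ["red", "green", "blue"]                    # illustrative categories
index = {c: i + 1 for i, c in enumerate(categories)}     # 1-based indices

def one_hot(category):
    vec = [0] * len(categories)
    vec[index[category] - 1] = 1                         # the 1 sits at position index-1
    return vec

print(index["green"], one_hot("green"))                  # 2 [0, 1, 0]
#+end_src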
Tokenization
- Breaking text into words (there are many steps to consider)
- Count word frequencies (into a key-value dictionary of counts)
- List the dictionary sorted by frequency
  - if the list is too big, remove infrequent words (often misspellings or rare names); this keeps one-hot coding manageable
- Encode the texts to sequences
  - using the index from the counted dictionary
  - the dictionary size is the length of the one-hot coding vector
- One-hot code all sequences
  - if the one-hot vectors are not too long, word embedding is not needed
#+begin_src python
tests = {}
tests[5] = "this is a cat and a"   # a text sample
# word -> {index: count}
tests_dict = {"this": {1: 1}, "is": {2: 1}, "a": {3: 2}, "cat": {4: 1}, "and": {5: 1}}
tests_sequences = [1, 2, 3, 4, 5, 3]   # encoded with the dictionary indices
#+end_src
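A minimal sketch of the same pipeline with the Keras Tokenizer (assuming TensorFlow/Keras as the toolkit; the texts and num_words are illustrative):

#+begin_src python
# Sketch: tokenize, count frequencies, keep only frequent words,
# encode to index sequences, then one-hot code them.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

texts = ["this is a cat and a", "this is a dog"]
tok = Tokenizer(num_words=1000)          # drops words outside the 1000 most frequent
tok.fit_on_texts(texts)                  # builds the counted word -> index dictionary
seqs = tok.texts_to_sequences(texts)     # e.g. [[2, 3, 1, 4, 5, 1], [2, 3, 1, 6]]
onehot = [to_categorical(s, num_classes=1000) for s in seqs]   # one (len, 1000) array each
#+end_src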
Word Embedding
Word embedding compresses the high-dimensional one-hot vector into a low-dimensional vector: $e_i$ is the high-dimensional vector of shape $(v, 1)$ obtained by one-hot coding the collected data, $P$ is the parameter matrix of shape $(d, v)$ trained from the data, and $x_i = P\,e_i$ is the low-dimensional vector of shape $(d, 1)$ used for further training. The dimension parameter $d$ is important and can be verified with cross validation. Each row of $P^{\top}$ (i.e., each column of $P$) is called a word vector (词向量) and can be interpreted with classification.
The Embedding layer needs the vocabulary size ($v$), the embedding_dim ($d$), and the word_num (the length of the cut/padded sequences); the layer has $v \times d$ trainable parameters.
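A minimal Keras sketch of such a layer (TensorFlow/Keras assumed; the sizes are illustrative):

#+begin_src python
# Sketch: an Embedding layer with v * d trainable parameters.
from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense

v, d, word_num = 10000, 32, 20                   # vocabulary size, embedding dim, cut length
model = Sequential([
    Input(shape=(word_num,)),                    # word_num token ids per sample
    Embedding(input_dim=v, output_dim=d),        # v * d = 320000 parameters
    GlobalAveragePooling1D(),
    Dense(1, activation="sigmoid"),              # e.g. a binary classifier head
])
model.summary()
#+end_src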
language model
- skip-gram: predicts the surrounding context words from the center word
- CBOW (continuous bag of words): predicts the center word from its surrounding context words
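A sketch of training both with the gensim library (an assumption, not named in these notes); the sg flag selects skip-gram (1) or CBOW (0):

#+begin_src python
# Sketch: word vectors via skip-gram and CBOW (gensim assumed).
from gensim.models import Word2Vec

sentences = [["this", "is", "a", "cat"], ["this", "is", "a", "dog"]]
skipgram = Word2Vec(sentences, vector_size=16, window=2, min_count=1, sg=1)
cbow     = Word2Vec(sentences, vector_size=16, window=2, min_count=1, sg=0)
print(skipgram.wv["cat"].shape, cbow.wv["dog"].shape)   # (16,) (16,)
#+end_src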
text generation
Encoder: A is an RNN or LSTM layer; all inputs ($x_1$ to $x_m$) share the same A, and $h_m$ is the last result. Giving only $h_m$ to the decoder is enough to generate text, but much of the input content will be forgotten.
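A minimal Keras sketch of such an encoder (shapes and sizes are illustrative):

#+begin_src python
# Sketch: an LSTM encoder that passes only the final state h_m (and cell state) onward.
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, LSTM

enc_in = Input(shape=(None,))                         # token ids x_1 ... x_m
x = Embedding(input_dim=10000, output_dim=32)(enc_in)
_, h_m, c_m = LSTM(64, return_state=True)(x)          # one shared A applied at every step
encoder = Model(enc_in, [h_m, c_m])                   # only the last states reach the decoder
#+end_src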
seq2seq
After one result is generated in the Decoder, cross entropy is used to update the network. All the results obtained so far are used to predict the next result, until generation is finished: the model consumes the previously generated symbols as additional input when generating the next.
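A sketch of the auto-regressive generation loop; decoder_step is a hypothetical stand-in for the trained decoder, not an API from these notes:

#+begin_src python
# Sketch: greedy auto-regressive generation, feeding each generated symbol back in.
import numpy as np

START, END, VOCAB = 1, 2, 50

def decoder_step(generated, state):
    """Hypothetical stand-in: returns next-token probabilities and a new state."""
    rng = np.random.default_rng(len(generated))
    return rng.dirichlet(np.ones(VOCAB)), state

generated, state = [START], None
while generated[-1] != END and len(generated) < 20:
    probs, state = decoder_step(generated, state)   # uses all symbols generated so far
    generated.append(int(np.argmax(probs)))         # greedy pick of the next symbol
print(generated)
#+end_src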
Transformer
simple RNN + attention
Encoder input $X = [x_1, \dots, x_m]$, Decoder input $X' = [x'_1, \dots, x'_t]$. After the RNN or LSTM we get $h_1, \dots, h_m$. Now, unlike before, we do not pass only the last element to the Decoder; we use attention to mix the information of all inputs.

Notation:
- Encoder: the lower index $i$ stands for the position of the input in the Encoder
- Decoder: the upper index $j$ stands for the index of the generated item in the Decoder
- $\alpha_i^j$ stands for the weight for generating the j-th item ($s_j$) in the Decoder with respect to the i-th input ($x_i$) in $X$

Variables
- Encoder input, $x_1, \dots, x_m$
- Encoder shared parameter, A: RNN or LSTM shared parameter
- Encoder output $h_i$, the output at each step of the RNN or LSTM
- Decoder initial state $s_0$, also denoted $h_m$
- key, $k_i = W_K h_i$
- query, $q_j = W_Q s_j$
- Query matrix, $W_Q$ (the keys use $W_K$ analogously)
- Encoder weight, $\alpha_i^j = \mathrm{Softmax}(k_i^{\top} q_j)$, normalized over $i$
- Encoder context vector, $c_j = \sum_{i=1}^{m} \alpha_i^j\, h_i$
- Decoder output, $s_j$

Update the network: softmax($c_j$) gives the prediction, and the cross-entropy loss is back-propagated to update the network.
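A numpy sketch of the weight and context computation above for one decoder step (shapes are illustrative; $W_K$ and $W_Q$ would be trained):

#+begin_src python
# Sketch: attention weights and context vector for one decoder step j.
import numpy as np

m, dim = 4, 8                                  # number of encoder steps, state size
rng = np.random.default_rng(0)
H = rng.normal(size=(m, dim))                  # encoder outputs h_1 ... h_m
s_j = rng.normal(size=dim)                     # current decoder state
W_K = rng.normal(size=(dim, dim))              # random stand-in for a trained matrix
W_Q = rng.normal(size=(dim, dim))

K = H @ W_K.T                                  # keys  k_i = W_K h_i
q = W_Q @ s_j                                  # query q_j = W_Q s_j
scores = K @ q
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                           # alpha_i^j = Softmax(k_i^T q_j)
c_j = alpha @ H                                # context c_j = sum_i alpha_i^j h_i
print(alpha, c_j.shape)
#+end_src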
simple RNN + self attention
Only an Encoder, without a Decoder or Decoder input. After the RNN or LSTM we get $h_1, \dots, h_m$; now, unlike before, we do not keep only the last element; we use attention to mix the information of all inputs.

Notation:
- Encoder: the lower index $i$ stands for the position of the input in the Encoder
- Generation: the upper index $j$ stands for the index of the generated item
- $\alpha_i^j$ stands for the weight for generating the j-th item ($h_j$) in the Encoder with respect to the i-th input ($x_i$) in $X$

Variables
- Encoder input, $x_1, \dots, x_m$
- Encoder shared parameter, A: RNN or LSTM shared parameter
- Encoder output $h_i$, the output at each step of the RNN or LSTM
- key, $k_i = W_K h_i$
- query, $q_j = W_Q h_j$
- Query matrix, $W_Q$
- Encoder weight, $\alpha_i^j = \mathrm{Softmax}(k_i^{\top} q_j)$, normalized over $i$
- Encoder context vector, $c_j = \sum_{i=1}^{m} \alpha_i^j\, h_i$

Update the network
- softmax($c_j$) gives the prediction, and the cross-entropy loss is back-propagated to update the network
Note
- attention: key $k_i = W_K h_i$, with the query $q_j = W_Q s_j$ taken from the Decoder state
- self attention: key $k_i = W_K h_i$, with the query $q_j = W_Q h_j$ taken from the Encoder outputs themselves
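A numpy sketch of the self-attention variant, where keys and queries both come from the encoder outputs (illustrative shapes; the matrices would be trained):

#+begin_src python
# Sketch: self-attention over the encoder outputs, one context vector per step.
import numpy as np

m, dim = 4, 8
rng = np.random.default_rng(1)
H = rng.normal(size=(m, dim))          # h_1 ... h_m
W_K = rng.normal(size=(dim, dim))
W_Q = rng.normal(size=(dim, dim))

K = H @ W_K.T                          # k_i = W_K h_i
Q = H @ W_Q.T                          # q_j = W_Q h_j  (queries also from H)
scores = Q @ K.T                       # scores[j, i] = k_i^T q_j
alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)    # alpha_i^j, softmax over i
C = alpha @ H                          # c_j = sum_i alpha_i^j h_i
print(C.shape)                         # (m, dim): one context per step
#+end_src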
attention layer
An attention function can be described as mapping a query and a set of key-value pairs to an output. Encoder input $X = [x_1, \dots, x_m]$, Decoder input $X' = [x'_1, \dots, x'_t]$. The RNN or LSTM is removed; only an attention layer is constructed.

Notation:
- Encoder: the lower index $i$ stands for the position of the input in the Encoder
- Decoder: the upper index $j$ stands for the index of the generated item in the Decoder
- $\alpha_i^j$ stands for the weight for generating the j-th item in the Decoder with respect to the i-th input ($x_i$) in $X$

Variables
- value, $v_i = W_V x_i$
- query, $q_j = W_Q x'_j$
- key, $k_i = W_K x_i$
- Query matrix, $W_Q$
- Encoder weight, $\alpha_i^j = \mathrm{Softmax}(k_i^{\top} q_j)$, normalized over $i$
- Encoder context vector, $c_j = \sum_{i=1}^{m} \alpha_i^j\, v_i$

Update the network
- softmax($c_j$) gives the prediction, and the cross-entropy loss is back-propagated to update the network

Note
- $X$ replaces $H$ (keys and values come from $X$ directly), but this is still a seq2seq model (the queries come from $X'$)
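A numpy sketch of this attention layer: values and keys come from $X$, queries from $X'$ (illustrative shapes; $W_Q$, $W_K$, $W_V$ would be trained):

#+begin_src python
# Sketch: an attention layer without RNN/LSTM, mapping queries and
# key-value pairs to outputs.
import numpy as np

m, t, dim = 5, 3, 8                    # encoder length, decoder length, model size
rng = np.random.default_rng(2)
X  = rng.normal(size=(m, dim))         # encoder inputs x_1 ... x_m
Xp = rng.normal(size=(t, dim))         # decoder inputs x'_1 ... x'_t
W_Q, W_K, W_V = (rng.normal(size=(dim, dim)) for _ in range(3))

Q = Xp @ W_Q.T                         # q_j = W_Q x'_j
K = X  @ W_K.T                         # k_i = W_K x_i
V = X  @ W_V.T                         # v_i = W_V x_i
scores = Q @ K.T
alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)    # alpha_i^j, softmax over i
C = alpha @ V                          # c_j = sum_i alpha_i^j v_i
print(C.shape)                         # (t, dim)
#+end_src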
self attention layer
Only an Encoder, without a Decoder or Decoder input.

Notation:
- Encoder: the lower index $i$ stands for the position of the input in the Encoder
- Generation: the upper index $j$ stands for the index of the generated item
- $\alpha_i^j$ stands for the weight for generating the j-th item in the Encoder with respect to the i-th input ($x_i$) in $X$

Variables
- Encoder input, $x_1, \dots, x_m$
- value, $v_i = W_V x_i$
- key, $k_i = W_K x_i$
- query, $q_j = W_Q x_j$
- Query matrix, $W_Q$
- Encoder weight, $\alpha_i^j = \mathrm{Softmax}(k_i^{\top} q_j)$, normalized over $i$
- Encoder context vector, $c_j = \sum_{i=1}^{m} \alpha_i^j\, v_i$

Update the network: softmax($c_j$) gives the prediction, and the cross-entropy loss is back-propagated to update the network.

Note
- in the query $q_j = W_Q x_j$, the input is $X$, not $X'$
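The same computation packaged as a self-attention function over $X$ alone (a sketch with random stand-ins for trained weights):

#+begin_src python
# Sketch: self-attention layer; keys, queries and values all come from X.
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T    # q_j, k_i, v_i from the same X
    scores = Q @ K.T
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)    # alpha_i^j, softmax over i
    return alpha @ V                             # c_j = sum_i alpha_i^j v_i

m, dim = 4, 8
rng = np.random.default_rng(3)
X = rng.normal(size=(m, dim))
W = [rng.normal(size=(dim, dim)) for _ in range(3)]
print(self_attention(X, *W).shape)               # (m, dim)
#+end_src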
transformer model
/--+--\ /-----\ /-----\ /-----\
|a_1 | |a_2 | |a_3 | |a_m |
| | | | | | | |
\-----/ \-----/ \-----/ \-----/
+-------------------------------------------+
|cBLU |
|Encoder |
| h1 h2 h3 hm |
| +------+ +-----+ +-----+ +-----+ |
| | | | | | | | | |>------\
| | A | | A | | A | | A | | |
| +--+---+ +--+--+ +--+--+ +---+-+ | |
| ^ ^ ^ ^ | |
+----+----------+----------+----------+-----+ |
| | | | |
/--+--\ /-----\ /-----\ /-----\ |
|X_1 | |X_2 | |X_3 | |X_m | |
| | | | | | | | |
\-----/ \-----/ \-----/ \-----/ |
| /--+--\ /-----\ /-----\ /-----\
| |c_1 | |c_2 | |c_3 | |c_m |
| |s_1 | |s_2 | |s_3 | |s_m |
| \-----/ \-----/ \-----/ \-----/
|
| +-------------------------------------------+
| | c1AB |
\------->| Decoder |
| +------+ +-----+ +-----+ +-----+ |
| | | | | | | | | |
| | A' | | A' | | A' | | A' | |
| +--+---+ +--+--+ +--+--+ +---+-+ |
| ^ ^ ^ ^ |
+----+----------+----------+----------+-----+
| | | |
/--+--\ /-----\ /-----\ /-----\
|X'1 | |X'2 | |X'3 | |X'm |
| | | | | | | |
\-----/ \-----/ \-----/ \-----/