The papers I have read keep piling up, and my posting is too slow to keep up. It is a real pity that the amount of information one can read, write, and process in a day is so limited. There is so much in the world to see, read, study, and try, and I keep wondering whether there is a more efficient way... I can't exactly attach a GPU to my head... At times like this I envy the geniuses who process things faster.
In that spirit(!?), even though I still have a bit more to say about InfoGAN, I would like to introduce one more paper I read recently. (If I just commit to it now, I will surely finish everything eventually... right?)
Strictly speaking, if I went in order I would first introduce f-GAN, then cover WGAN and, right before that, the paper that became the basis for WGAN, and so on; there is a lot to do...
But who knows when I would get through all of that, so, cutting to the chase, I decided to start with LSGAN, which is easy to understand and easy to write about! Even the name feels comfortable. Least squares is one of the very first things you encounter when studying regression or optimization. It is very familiar, so let's take a look at how it is used here.
Following up on the previous post, today we will focus on the theoretical side of InfoGAN. A short recap of what we discussed last time:
Unlike the original GAN, InfoGAN trains the generator with an additional latent code $c$ in its input. To prevent a trivial code from being learned, a constraint is imposed so that the mutual information (MI) between the code and the generator's distribution $G(z,c)$ is high; that is, $I(c;G(z,c))$ is kept high.
"In other words, the information in the latent code $c$ should not be lost in the generation process."
That is, we expect $c$ to learn to carry some meaningful piece of information that is not altered as it passes through the generator.
As a result, InfoGAN solves the following information-regularized minimax problem: $$\min_G \max_D V_I(D,G) = V(D,G) - \lambda I(c;G(z,c)).$$
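To make the recap concrete, here is a minimal PyTorch-style sketch of how this objective is typically implemented. It assumes hypothetical networks `G` (generator), `D` (discriminator outputting probabilities), and `Q` (an auxiliary head producing logits over a categorical code); since the exact MI term is intractable, a variational lower bound $\log Q(c\,|\,G(z,c))$ is maximized instead, in the spirit of the paper. This is my own sketch, not the authors' code.

```python
import torch
import torch.nn.functional as F

def infogan_losses(G, D, Q, real_x, z, c_onehot, lam=1.0):
    # Generator input is the concatenation of noise z and the latent code c.
    fake_x = G(torch.cat([z, c_onehot], dim=1))

    # Discriminator loss: the usual GAN value function V(D, G).
    d_real = D(real_x)
    d_fake = D(fake_x.detach())
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))

    # Generator loss (non-saturating form with respect to G).
    d_fake_g = D(fake_x)
    g_loss = F.binary_cross_entropy(d_fake_g, torch.ones_like(d_fake_g))

    # Variational lower bound on I(c; G(z,c)): log Q(c | G(z,c)) for the sampled code.
    q_logits = Q(fake_x)
    mi_lower_bound = -F.cross_entropy(q_logits, c_onehot.argmax(dim=1))

    # G (and Q) minimize g_loss - lambda * lower bound, mirroring V(D,G) - lambda * I(c; G(z,c)).
    gq_loss = g_loss - lam * mi_lower_bound
    return d_loss, gq_loss
```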
Today I would like to introduce a paper called InfoGAN. It extends GAN by borrowing an idea from information theory, and despite the intimidating-sounding qualifier (information-theoretic), the idea is quite simple and the results are very interesting. Moreover, it opens a way to analyze the original GAN from a different perspective, which I think helps in understanding GAN more deeply.
The basic idea is as follows. Whereas the original GAN feeds a single input $z$ into the generator, InfoGAN uses $(z,c)$, adding a latent variable $c$ called the code. An add-on is then attached to the GAN objective so that, while the generator trains, the Mutual Information (MI) between the code $c$ injected in the latent space ($z$-space) and the generated sample stays high.
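To make the input side concrete, here is a tiny sketch (my own illustration; the dimensions and the choice of a categorical code are arbitrary) of how $(z,c)$ could be assembled before being fed to the generator.

```python
# Illustrative construction of the InfoGAN generator input (z, c).
import torch
import torch.nn.functional as F

batch, z_dim, n_classes = 32, 62, 10
z = torch.randn(batch, z_dim)                          # unstructured noise
c_idx = torch.randint(0, n_classes, (batch,))          # sampled categorical code
c = F.one_hot(c_idx, num_classes=n_classes).float()    # one-hot latent code c
g_input = torch.cat([z, c], dim=1)                     # shape (32, 72), fed to G
```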
* A Korean version of this content is available HERE.
Today, I am going to review an educational tutorial delivered at NIPS 2016 by Prof. Andrew Ng. As far as I know, there is no official video clip available, but you can download the lecture slides by searching the internet.
You can watch a video with almost identical content (even the title is exactly the same) at the following link:
* I really recommend listening to his full lecture. However, watching the video takes too much time if you just want a quick overview, so here I have summarized what he tried to deliver in his talk. I hope this helps. ** Note that I skipped a few slides or changed their order to make them easier to explain.
How do you get deep learning to work in your business, product, or scientific study? The rise of highly scalable deep learning techniques is changing how you can best approach AI problems. This includes how you define your train/dev/test split, how you organize your data, how you should think through your search among promising model architectures, and even how you might develop new AI-enabled products. In this tutorial, you’ll learn about the emerging best practices in this nascent area. You’ll come away able to better organize your and your team’s work when developing deep learning applications.
Trend #1
Q) Why is Deep Learning working so well NOW?
A) Scale drives DL progress
The red line, which stands for traditional learning algorithms such as SVM and logistic regression, shows a performance plateau in the big-data regime (right-hand side of the x-axis). Those algorithms did not know what to do with all the data we collected.
Over the last ten years, thanks to the rise of the internet, mobile, and IoT (internet of things), we have been able to march along that x-axis. Andrew commented that this is the number one reason why DL algorithms work so well now.
So... the implication of this is:
To hit the top margin, you need a huge amount of data and a large NN model.
Trend #2
According to Ng, the second major trend is end-to-end learning.
Until recently, a lot of machine learning used real or integer numbers as output, e.g. 0 or 1 as a class score. In contrast, end-to-end learning can produce much more complex outputs than plain numbers, e.g. image captions.
It is called "end-to-end" because the input and output of the system are directly linked by a neural network, unlike traditional pipelines which have several intermediate steps. This works well in many cases where traditional models are not effective. For example, end-to-end learning shows better performance on speech recognition tasks:
While presenting this slide, he introduced the following anecdote:
"This end-to-end story really upset many people. I used to get around and say that I believe "phonemes" are the fantasy of the linguists and machines can do well without them. One day at the meeting in Stanford a linguist yelled at me in public for saying that. Well...we turned out to be right."
This story might make it sound as if end-to-end learning were a magic key for any application, but he actually warned the audience to be careful when applying the approach to their own problems.
Despite all the excitement about end-to-end learning, he does not think it is the solution for every application.
It works well in "some" cases, but not in many others. For example, given the safety-critical requirements of autonomous driving and the resulting need for extremely high accuracy, a pure end-to-end approach is still challenging to make work.
In addition, he commented that even though DL can almost always learn a mapping from X to Y given a reasonable amount of data, and you may even publish a paper about it, that does not mean using DL is actually a good idea, e.g. in medical diagnosis or imaging.
End-to-end learning works only when you have enough (x,y) data to learn a function of the needed level of complexity.
I totally agree with the above point that we should not naively rely on the learning capability of the neural network. We should exploit all the power and knowledge of hand-designed or carefully chosen features which we already have.
In the same context, however, I have a slightly different point of view on "phonemes". I think they can, and should, also be used as an additional feature in parallel, which can reduce the labor of the neural network.
Machine Learning Strategy
Now let's move on to the next phase of his lecture. Here, he tries to give a glimpse of an answer, or at least a guideline, for the following issues:
Often you will have a lot of ideas for how to improve an AI system; what will you do?
A good strategy will help you avoid months of wasted effort. So what is it?
I think this part is the gist of his lecture. I really liked his practical tips, all of which can be applied to my own situation right away.
One of the tips he proposed is a kind of "standard workflow" that guides you while training the model (a minimal code sketch follows the lists below):
When the training error is high, it implies that the bias between the output of your model and the real data is too big. To mitigate this, you need to
train longer,
use a bigger model,
or adopt a new model architecture.
Next, you should check whether your dev error is high. If it is, you need
more data,
regularization,
or a new model architecture.
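Here is the promised sketch of that workflow. It is my own illustration, not code from the slides; the error values are fractions (0.08 = 8%) and the thresholds are arbitrary.

```python
# A minimal sketch of the "standard workflow": check training error first, then dev error.
def next_step(train_error, dev_error, acceptable_error=0.05, acceptable_gap=0.02):
    if train_error > acceptable_error:
        # training error too high -> bias problem
        return "train longer, use a bigger model, or adopt a new architecture"
    if dev_error - train_error > acceptable_gap:
        # dev error much higher than training error -> variance problem
        return "get more data, add regularization, or try a new architecture"
    return "both look fine: move on to error analysis"

print(next_step(train_error=0.08, dev_error=0.10))  # -> bias advice first
```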
* Yes, I know, this seems too obvious. Still, I want to mention that everything looks simple once it is organized into a unified system; turning implicit know-how into an explicit framework is not an easy task. Here, you should be careful with the implications of the keywords bias and variance. In his talk, bias and variance have slightly different meanings than the textbook definitions (so we do not try to trade off between the two), although they share a similar concept. In the era before DL, people used to trade off bias and variance by playing with regularization, and this coupling could not be overcome because the two were tied too strongly together. Nowadays, however, the coupling seems to have become weaker, because you can deal with each one separately using simple strategies, i.e. use a bigger model (bias) and gather more data (variance).
This also implicitly shows why DL seems more applicable to various problems than traditional learning models: with DL, there are at least these two ways, as mentioned above, to get unstuck from the problems we face in real-life situations.
Use Human Level Error as a Reference
To know whether your error is high or low, you need a reference. Andrew suggests using human-level error as a stand-in for the optimal error, or Bayes error. He strongly recommended finding this number before diving deep into research, because it is the critical component that guides your next step.
Let's say our goal is to build a human-level speech system using DL. What we usually do with our data set is split it into three sets: train, dev (val), and test. Then the gaps between the corresponding errors may appear as below:
You can see that the gaps between the errors are named bias and variance. If you take a moment to think about it, it is quite intuitive why he calls the gap between human-level error and training error the bias, and the gap between training error and dev error the variance.
If you find that your model has a high bias and a low variance, try to find a new model architecture or simply increase the capacity of the model. On the other hand, if you have a low bias but a high variance, you would do better to try gathering more data as an easy remedy. To see this more clearly, let's say you have the following results:
Train error : 8%
Dev error : 10%
If I were to tell you that human-level error for such a task is on the order of 1%, you would immediately notice that this is a bias issue. On the other hand, if I told you that human-level error is around 7.5%, it would now look more like a variance problem, and you would do better to focus your efforts on methods such as data synthesis or gathering data more similar to the test set. As you can see, simply by using human-level performance as a baseline, you always have a guideline for where to focus among the several options you may have.
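To make the numbers concrete, here is a tiny worked example (my own, using the error figures quoted above) that computes both gaps for the two hypothetical human-level errors:

```python
# Worked example with the numbers above: bias = train - human, variance = dev - train.
train_error, dev_error = 0.08, 0.10

for human_error in (0.01, 0.075):
    bias = train_error - human_error
    variance = dev_error - train_error
    print(f"human={human_error:.1%}  bias={bias:.1%}  variance={variance:.1%}")

# human=1.0%  -> bias 7.0%, variance 2.0%: focus on reducing bias
# human=7.5%  -> bias 0.5%, variance 2.0%: focus on reducing variance
```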
Note that he did not say it is "easy" to train a big model or to gather a huge amount of data. What he is trying to convey here is that at least you have an "easy option to try", even if you are not an expert in this area. You know... building a new model architecture that actually works is not a trivial task even for experts. Still, there remain some unavoidable issues you need to overcome.
Data Crisis
To train the model efficiently with a finite amount of data, you need to carefully organize the data set or find a way to get more data (data synthesis). Here, I will focus on the former, which offers more intuition for practitioners.
Say you want to build a speech recognition system for a new in-car rearview mirror product. You have 50,000 hours of general speech data and 10 hours of in-car data. How would you split your data?
This is a BAD way to do it:
Having mismatched dev and test distributions is not a good idea. You may spend months optimizing for dev set performance only to find that it does not work well on the test set.
So here is his suggestion for doing better:
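Since the slide itself is not reproduced here, below is a rough sketch of the kind of split he suggests as I understand it: dev and test are both drawn from the in-car distribution you actually care about. The proportions are my own illustration, not numbers from the talk.

```python
# Illustrative split: dev and test come from the same (in-car) distribution.
general_hours, in_car_hours = 50_000, 10

split = {
    "train": f"{general_hours} h general speech (+ optionally some in-car data)",
    "dev":   f"{in_car_hours / 2:.0f} h in-car speech",
    "test":  f"{in_car_hours / 2:.0f} h in-car speech",
}

for name, contents in split.items():
    print(f"{name:5s}: {contents}")
```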
A few remarks
While performance is still worse than human level, there are many good ways to make progress:
error analysis
estimate bias/variance
etc.
After surpassing human performance, or at least getting near that point, however, what you usually observe is that progress slows down and almost gets stuck. There can be several reasons for that, such as:
Labels are made by humans (so the limit lies there)
Human-level error may already be close to the optimal error (Bayes error, the theoretical limit)
What you can do here is find a subset of the data on which the model still does worse than humans and make it do better there.
Andrew ended the presentation by emphasizing two ways one can improve one's skills in the field of deep learning:
practice, practice, practice and do the dirty work
(read a lot of papers and try to replicate the results).
which happens to match the headline of my blog exactly. I am glad that he and I share a similar point of view:
READ A LOT, THINK IN PICTURES, CODE IT, VISUALIZE MORE!
I hope you enjoyed my summary. Thank you for reading :)