The Cherry on the Cake: Thoughts on Reinforcement Learning

Tech Integer
3 min readOct 15, 2020

Yann LeCun, Chief AI Scientist at Facebook, once said:

If intelligence were a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake.

The cherry on the cake.

Farlex’s free dictionary defines this term as “an additional benefit or positive aspect to something that is already considered positive or beneficial.” In Macmillan’s thesaurus, it means “the final thing that makes something perfect.”

I agree more with the first definition. Maybe I'm just not fond of cherries. A cherry does not significantly affect the taste of my cake; it is only a decoration, eye candy.

But Yann's statement confused me. I do not see Reinforcement Learning as holding little importance in explaining intelligence.

When I was a kid, I noticed how people reacted to what I said or did: maybe their faces changed, or the way they breathed, or their posture. These reactions reinforced whether I would ever say or do those things again.

So I think humans learn through the consequences of their actions, and this learning shapes their intelligence. This learn-from-consequences method is what Reinforcement Learning is based upon.

Reinforcement Learning

Reinforcement Learning (RL) is a machine learning approach that uses consequences (rewards and punishments) to solve tasks. An agent, which embodies the RL algorithm, learns its behavior through trial-and-error interactions with its environment.
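This agent-environment loop can be sketched in a few lines of code. The example below is a minimal, illustrative toy (a five-state corridor with a reward at the far end, solved with tabular Q-learning); none of the names come from a real library, and real RL systems are far more elaborate.

```python
import random

random.seed(0)  # for reproducibility

# Toy environment: a corridor of states 0..4; reaching state 4 earns reward 1.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]  # move left / move right

def step(state, action):
    """Environment dynamics: returns (next_state, reward, done)."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    if nxt == GOAL:
        return nxt, 1.0, True
    return nxt, 0.0, False

# The agent's knowledge: an estimated value Q for every (state, action) pair.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(200):
    state, done = 0, False
    while not done:
        # Trial and error: sometimes act randomly, otherwise act greedily.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, action)
        # Learn from the consequence: nudge Q toward reward + discounted future value.
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = nxt
```

After a few hundred episodes of trial and error, the agent's Q-values make "move right" the preferred action in every state, even though it was never told the rules, only the rewards.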

RL has gained popularity in the research world over the past decade. From Google's DeepMind to Elon Musk's OpenAI, RL has proven able to find solutions humans would never think of.

Let's take a look at DeepMind's AlphaGo. This RL agent beat Go master Lee Sedol in four out of five matches in 2016. Go is a board game popular in East Asian countries such as China, Korea, and Japan.

DeepMind's AlphaGo vs Lee Sedol. Source: https://deepmind.com/alphago-korea

Many Go experts remarked that AlphaGo's style of play was atypical of humans. It produced moves that at first seemed absurd but eventually led the agent to dominate the game.

This phenomenon happens because the RL agent learns by exploring and exploiting. Exploration is when an agent tries a new strategy in the hope of earning a higher reward. Exploitation is when the agent uses the approach it already knows yields the highest reward.
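A common way to balance these two behaviors is an epsilon-greedy rule: with a small probability the agent explores a random action, and otherwise it exploits its best-known one. The sketch below demonstrates this on a two-armed bandit; the arm payout probabilities are made up for illustration.

```python
import random

random.seed(42)  # for reproducibility

TRUE_MEANS = [0.3, 0.7]  # arm 1 secretly pays off more often
values = [0.0, 0.0]      # agent's running reward estimate per arm
counts = [0, 0]          # how often each arm has been pulled
epsilon = 0.1            # explore 10% of the time

for t in range(2000):
    if random.random() < epsilon:
        arm = random.randrange(2)         # exploration: try a random arm
    else:
        arm = values.index(max(values))   # exploitation: best-known arm
    reward = 1.0 if random.random() < TRUE_MEANS[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
```

Without exploration the agent could get stuck on whichever arm looked good first; without exploitation it would never cash in on what it has learned. After 2000 pulls, the occasional random tries let it discover that arm 1 is better, and it ends up pulling that arm most of the time.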

In the context of AlphaGo, the agent’s exploration creates strategies unknown to humans.

The main challenge in reinforcement learning lies in building the simulation environment. For example, it is relatively simple to simulate Chess or Go, but simulating an environment for an autonomous car is far more complicated.

RL can only work optimally when the simulation environment is well designed and analogous to its real-world application.

Recently, Manchester City Football Club and Google launched a competition to build RL agents that play football. This challenge aims to discover new approaches to football strategy. In the future, the classic 4–4–2 formation might become irrelevant to the game because of Reinforcement Learning.

Conclusion

Back to the cherry analogy: I think the statement aims to define RL's role in explaining intelligence in biological creatures. It does not underestimate Reinforcement Learning's capability or importance.

Furthermore, we have not yet clearly defined what intelligence is. Maybe we will revise Yann LeCun's statement once we have a clearer idea of how intelligence works.

In response to Yann’s statement, Pieter Abbeel, a professor at UC Berkeley, said,

I prefer to eat a cake with a lot of cherries because I like reinforcement learning.

In the end, is it only a matter of taste?


Tech Integer

Tech Integer is a blog where I share my insights and learning about self-driving cars. Let's take a ride!