To Explore or to Exploit

Every time I go to Taco Bell, I face a dilemma. What should I order? Should I order my favorite item (a five-dollar box with a black bean chalupa, a spicy potato soft taco, cheesy fiesta potatoes, and a Baja Blast), or should I try something new on the menu, given that there is a chance that it might become my new favorite?

For most of my childhood, my go-to order every time I went to Taco Bell was two chalupas. Taco Bell runs were not too common back then as my parents instilled in me and my brother the value of having home-cooked meals. We would only get Taco Bell on road trips and such, where home-cooked options were not available. Thus, it was in my best interest to be risk averse; I did not want to take the chance that a novel item I ordered was not better than my chalupas, which would have resulted in a waste of a Taco Bell run. Unfortunately, as a result, I was missing out on a lot of the other menu items that I have now come to love.

However, in high school, after I started earning my own money and was able to go to Taco Bell slightly more often, my propensity to explore increased. I could afford to take a risk on a new menu item because I would probably just be back next week if I did not enjoy it. While I loved a lot of the new items I tried, I definitely would not order some others again, but at least now I know what I like and dislike.

In college, given that I’ve tried most of the (vegetarian) menu items by now, I’ve converged towards my current favorite order: the five-dollar box. Each time I order it, I know what I’m missing out on, and I can rest easy knowing that this item maximizes my “utility per dollar”. Sure, as new menu items come out, I order them just in case, but so far, nothing has been able to beat my five-dollar box.

While this “dilemma” of mine is very much a first-world problem and may seem trivial to most, it outlines a much larger class of decisions all of us are tasked with daily. The question may pose itself in the form: Do I do/take/make x, something I’ve done before and know what the results will be, or do I do/take/make y, something I haven’t tried before and don’t know if it will beat x or not?

Taking the route previously traveled would be considered “exploiting” (you are exploiting your past experience/knowledge), and trying something new would be “exploring,” where you take on the risk that it may or may not be worth your time or effort. This is the classic Explore/Exploit dilemma.

This dilemma emerges most clearly in reinforcement learning, a subfield of machine learning. Here, the computer (i.e., an agent) is trained in an environment (maybe a video game or the stock market) to take the actions that maximize its cumulative reward. Let’s say we are training the computer to complete the maze below. Finishing the maze grants the agent a reward of 100 points, collecting the diamond yields 25 points, and each step the agent takes costs it 1 point (to incentivize quick completion of the maze). Assume that it doesn’t know where the diamond is initially. The actions available to the agent are up, down, left, and right, provided there isn’t a wall in the way in that direction.

Maze

The most direct approach to completing the maze is shown in red. It is the shortest path from start to finish, and it grants the agent 58 points (don’t worry, I counted the number of steps). If this is the strategy the agent adopts, then every time it is put in the maze, it’ll take this direct route to the exit. However, if you take a look at the path in blue, while it is longer, the additional reward of the diamond more than compensates for the extra steps required, yielding a reward of 63 points. The agent will never know about this path if it never chooses to explore the maze instead of exploiting the current strategy it has. We want the agent to consider other possibilities rather than employing its initial strategy repeatedly. In any environment, an agent may be missing out on a novel strategy that would reward it with even more points.
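The arithmetic behind those two totals can be checked with a short sketch. The step counts used here (42 for red, 62 for blue) are not stated in the text; they are inferred from the totals quoted above, and the function name is only for illustration.

```python
# Reward scheme from the maze example above.
FINISH_REWARD = 100   # reaching the exit
DIAMOND_REWARD = 25   # picking up the diamond
STEP_PENALTY = -1     # applied on every step


def path_reward(steps: int, collects_diamond: bool) -> int:
    """Total reward for a path that reaches the exit."""
    total = FINISH_REWARD + steps * STEP_PENALTY
    if collects_diamond:
        total += DIAMOND_REWARD
    return total


# Step counts are assumptions inferred from the totals quoted in the text.
red = path_reward(steps=42, collects_diamond=False)
blue = path_reward(steps=62, collects_diamond=True)
print(red, blue)  # 58 63
```

Put differently, the diamond is worth a detour only if the detour costs fewer than 25 extra steps; under the assumed counts it costs 20.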

We see that the agent is also facing the Explore/Exploit dilemma. Out of all the paths the agent knows, it knows the red one is the best. However, it doesn’t know what paths it’s missing out on. In other words, it doesn’t know what it doesn’t know; exploration aims to reduce this disparity, but exploitation is what guarantees us rewards, albeit maybe not maximally.

How does one find the delicate balance between exploration and exploitation? One solution, used by the Deep Q-Learning algorithm among others, is the epsilon-greedy approach. It consists of a parameter, epsilon, which starts at 1. This represents the probability of exploring. Each time the agent finishes a run through the maze, we multiply epsilon by a decay factor, say 0.95. So, the first time the agent enters the maze, it takes a random action at every step and starts to learn the layout of the maze. Eventually, by pure luck, we hope that it’ll be able to complete the maze. Because it probably hasn’t explored every inch of the maze, it wouldn’t know if the path it just took is the optimal one.

The second time, the agent takes random actions 95% of the time, and the action it thinks is optimal 5% of the time (remember it technically doesn’t know the best action at this point). The third time it enters the maze, it takes a random action 90.25% (0.95 * 0.95) of the time. This gives the agent the liberty to explore at the beginning but, towards the end, converge on a solution that maximizes its reward. Note that the epsilon-greedy approach will not let epsilon get below a certain threshold, which might be around 0.05. This means even when the agent thinks it knows what actions in the maze are the best, it will always take random actions with some small probability, since there is always a chance that a better solution exists. Given that our toy maze example is pretty small, it’s easy to figure out the optimal path, but the same doesn’t hold for much more complex environments, such as a video game.
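The schedule described above can be sketched in a few lines of Python. This is a minimal illustration, not a full Deep Q-Learning implementation: it only shows how epsilon decays and how an action is picked, and it leaves out the learning step entirely. The function names and the value estimates in `q_values` are hypothetical.

```python
import random

EPSILON_START = 1.0   # explore on every step in the first episode
EPSILON_MIN = 0.05    # floor: never stop exploring entirely
DECAY = 0.95          # multiplied in after each run through the maze


def epsilon_for_episode(episode: int) -> float:
    """Exploration probability for a 0-indexed episode (one run through the maze)."""
    return max(EPSILON_MIN, EPSILON_START * DECAY ** episode)


def choose_action(q_values: dict, epsilon: float, rng: random.Random) -> str:
    """With probability epsilon pick a random action, else the best-known one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))      # explore
    return max(q_values, key=q_values.get)     # exploit


# Hypothetical value estimates for a single maze cell.
q_values = {"up": 0.2, "down": -1.0, "left": 0.0, "right": 1.5}
rng = random.Random(0)
for episode in (0, 1, 2, 100):
    eps = epsilon_for_episode(episode)
    print(episode, round(eps, 4), choose_action(q_values, eps, rng))
```

For episodes 0, 1, and 2 this gives epsilon values of 1.0, 0.95, and 0.9025, matching the percentages above; by episode 100 the floor of 0.05 has taken over.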

This dilemma extends beyond just the world of binary bits. Let’s examine the role the epsilon-greedy approach may play in our lives when we face the Explore/Exploit dilemma.

Chances are when you walk into college, you might not know too many people. UW-Madison was a pretty popular choice for students from my high school, but none of my closest friends chose Madison, so I was tasked with building my own social circles. Given that classwork and clubs leave you limited time to invest in your social life, you want to maximize the quality of the people in your social circles. Every time you meet someone new, you don’t know right off the bat if they are compatible enough with you to establish a lasting friendship; you need time to get to know them better and for them to do the same with you. Every time you decide who you want to hang out with, you need to think to yourself: Do I want to further deepen my connection with this person I already know and had a good time with (exploit), or would I rather spend my time meeting new people (explore)? The Explore/Exploit dilemma appears again in the most unexpected way. Chances are if you have never asked yourself this question before, you either usually hang out with an already established friend group, or you’re naturally a people person who always wants to meet new people. In either of those situations, you’re predisposed to explore or exploit.

But for those of us in the middle, how do you walk this fine line? Let’s see what insights applying the epsilon-greedy approach yields us.

One solution to this problem might be to go out and meet a lot of people for the first couple of months of college because your epsilon is 1. You want to get a feel for your campus and the types of people that are there. You want to keep the scope of the people you meet as broad as possible, and you shouldn’t just be limited to the ones in your classes or pursuing the same major as you.

As you explore, you will find that some people you meet might not be who you thought they were. Some people just are not meant to be friends with you, whether because of their differing values or the extra drama that a friendship with them entails.

The ratio between meeting new people and hanging out with your close friends will start to shift towards the latter as time goes on. You’ll be repeatedly multiplying epsilon by a decay factor over time. You will start to find people with whom you’ll want to deepen your connection because they seem like valuable friends. Maybe this is why many people tend to gravitate toward their already-established friend groups. This isn’t to say that people aren’t down to meet new people, but it’s definitely not the same as it was at the beginning of the semester.

As a senior in college, I find myself hanging out with the people I’ve already established a very strong friendship with, people I know that I’ll have a worthwhile time with. However, this does not mean there should be a point in college where you never meet new people. There is never a point in the epsilon-greedy approach where epsilon reaches 0. We always want to apportion some of our time to meeting new people, even if it is less than our freshman-year selves did.

Thinking back to my freshman year, I’m really glad I decided to “exploit” some people (I know this sounds really bad out of context 😭). Every time I hung out with these people, I found myself laughing until my stomach hurt or walking away with a newfound perspective on a topic that interested me. Usually both. Naturally, I came to learn that every minute spent with these people was a minute well spent. I found myself completely understood and naturally gravitating towards them.

Some of those friendships luckily became ones that have made me a better person and have further imbued me with a sense of purpose and meaning in life. I’m still really close with many of them and I can’t thank them enough for all that they have done for me. These are people that I will continue to put the effort into staying in touch with.

Some other friendships, unfortunately, fell apart after a couple of years. But that’s the nature of the epsilon-greedy approach. You’ll find that sometimes, even though you’ve been rewarded for “exploiting,” the environment dynamics change, and you’re required to change the values you assign to each action, or person in this case. This is also why I’m glad that I’ve kept my epsilon greater than 0, because I’ve met some amazing people during my junior and senior years who are everything I look for in a friend.

Beyond social connections, we face the tradeoff of exploring and exploiting daily with the vast amount of entertainment and content to consume in the world. Do I listen to my playlist or try finding a new one? Do I put on my favorite TV show or watch a new one? Do I start up my favorite video game, or do I finally buy that one I’ve heard stellar reviews about? Do I reread that book or relisten to that podcast episode that left a positive impact on me, or do I try reading/listening to a new one with the chance that I might learn even more?

It’s natural to gravitate towards content that we know is tried and tested because we don’t want to waste a couple of hours if that new TV show sucks or if that new book is long-winded and boring. However, at the same time, we don’t want to get bored. My roommate loves reading and prefers hardcover books. He and I were recently talking about how there is still value in having a Kindle even if you prefer actual paper books: a Kindle allows you to explore a large number of books through free samples found online, and it saves you a trip to the library. You can still buy the paper version if you enjoyed a book and want to peruse it further. You save money by avoiding books that would not have been worth it, and you get to enjoy your meaningful literature in the hardcover format that you prefer.

If you’re like me, then your entire library is on your Kindle, but that doesn’t spare me from deciding whether to read a new book or to deepen my understanding of a previously read book that meant a lot to me. Maybe I want to jump back into fiction and reread the Harry Potter series, something I know I’m guaranteed to enjoy. Given that the time span for my reading is the rest of my life, I’ve still been exploring new books a majority of the time and rereading only in situations where I need to further understand a concept previously presented in a book. I’m sure that as I get older, I’ll find myself rereading the books I enjoyed much more often.

Earlier this year, I made a pretty big decision to exploit rather than explore. My internship with Amazon this past spring was in California. I loved the area, and I met a lot of amazing people, all while learning a lot about the tech industry. It felt like an area that I might want to live in full-time after college. Being surrounded by very intelligent and motivated people was very important to me, and I feel like I left California a more competent and intellectual person than I was when I arrived.

When I got the TikTok summer internship offer, I was given the choice between California and Washington. Did I want to stay in California, where I had already started to establish my social connections and understanding of the area, or did I want to plunge myself into a completely new environment, acknowledging that there is a chance that I might like it even more than California? Given that I wanted to build on the friendships I had already created, along with the large overhead of moving, I ended up choosing California, a decision I don’t regret, even though the epsilon-greedy approach suggested exploring with a higher probability.

How do we choose the decay factor and minimum epsilon for the epsilon-greedy approach?

The epsilon-greedy approach only works if you have suitable values for the decay factor and minimum epsilon. My advice is to consider the timespan for which you’ll be “playing the game” and how regularly you play it. If your timespan is limited or you don’t do that thing often, exploiting might be the better choice given your limited number of attempts. However, if this is a decision you find yourself making often, you should probably take a longer period to explore first before settling on an ideal approach. Obviously, I’m not saying to walk everywhere with a calculator or generate a random number each time to determine whether to explore or exploit, but having the concept of exploring versus exploiting tucked in your back pocket can bring some clarity when facing these types of decisions.
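One rough way to connect the decay factor to a timespan is to ask how many attempts it takes for epsilon to fall from its starting value to its minimum. The sketch below assumes the simple multiplicative decay described earlier; the helper name is hypothetical.

```python
import math


def episodes_until_floor(decay: float, epsilon_min: float,
                         epsilon_start: float = 1.0) -> int:
    """Smallest n such that epsilon_start * decay**n <= epsilon_min."""
    if not (0 < decay < 1 and 0 < epsilon_min < epsilon_start):
        raise ValueError("need 0 < decay < 1 and 0 < epsilon_min < epsilon_start")
    return math.ceil(math.log(epsilon_min / epsilon_start) / math.log(decay))


# Compare a fast, a moderate, and a slow decay for a floor of 0.05.
for decay in (0.5, 0.95, 0.99):
    print(decay, episodes_until_floor(decay, epsilon_min=0.05))
```

With the numbers used earlier (decay 0.95, minimum 0.05), exploration bottoms out after 59 attempts. A decay of 0.5 gets there in 5, while 0.99 takes 299, so a decision you only get to make a handful of times calls for a much faster decay than one you make every week.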

Whether or not you agree with the solution proposed by the epsilon-greedy approach to the Explore/Exploit dilemma is up to you. Everyone lives their life differently and how you live yours is your choice. However, that doesn’t mean that you’ll never find yourself facing the dilemma. The scenarios outlined here are ones that everyone faces, and it might be in your best interest to take an approach that potentially makes you happier and more satisfied in the long term.

In the maze of life, are there potential diamonds we might be missing out on?