<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://tomsilver.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://tomsilver.github.io/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-03-05T10:20:20+00:00</updated><id>https://tomsilver.github.io/feed.xml</id><title type="html">blank</title><subtitle>Tom Silver&apos;s academic website. </subtitle><entry><title type="html">Prompt Fiddling Considered Harmful</title><link href="https://tomsilver.github.io/blog/2026/prompt-fiddling/" rel="alternate" type="text/html" title="Prompt Fiddling Considered Harmful"/><published>2026-01-06T00:00:01+00:00</published><updated>2026-01-06T00:00:01+00:00</updated><id>https://tomsilver.github.io/blog/2026/prompt-fiddling</id><content type="html" xml:base="https://tomsilver.github.io/blog/2026/prompt-fiddling/"><![CDATA[<p>In machine learning, a golden rule followed by approximately no one is that you should only touch your test set once. You should keep it locked away (don’t even glance at it!) until the moment when you are ready to render a final judgment about the model you have developed, run prediction exactly once, and report dispassionately on the outcome. In reality, when those first-final results are disappointing, it is very tempting to fiddle with something and try again. This fiddling is a form of hyperparameter optimization on the test set—a kind of <a href="https://en.wikipedia.org/wiki/Leakage_(machine_learning)">leakage</a> that, like contaminants leaking into a water supply, may be benign in small quantities but <a href="https://arxiv.org/abs/1902.10811">harmful at scale over time</a>.</p> <p>In 2026, much of this fiddling is happening in natural language—in the prompts of increasingly sophisticated models that wire together LLMs in various ways. 
In the olden days, when there were numeric hyperparameters like “learning rate” and “batch size”, there was only so much fiddling that one could do, and very little room for creativity. Prompt fiddling has no such limit. One can rephrase, add information, remove information, give examples, change languages, encourage step-by-step reasoning, sneak in hidden information, add line breaks, swap punctuation. Fiddling is more free-form, more fun, and more likely than ever to leak the test set.</p> <p>As a step towards mitigating prompt fiddling in academic work, it is standard practice to share the full prompts that were used to generate a paper’s final results. Paper reviewers can inspect the prompts and make a judgement about how much fiddling was done. Sometimes it is easy to spot very specific phrasing or information that shouldn’t have been given. But other times, it’s virtually impossible to tell.</p> <p>For example, consider the two prompt templates below for planning in <a href="https://www.cs.toronto.edu/~sheila/2542/s14/A1/introtopddl2.pdf">PDDL domains</a>. One is the result of me fiddling until I obtained <strong>80%</strong> success on 10 tasks. The other is the result of me fiddling in the opposite direction, until I obtained <strong>30%</strong>. Can you tell which is which?</p> <h5 id="prompt-1">Prompt 1</h5> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You're an AI planner. Given these PDDL domain and problem definitions, think through the planning process step by step, explaining your reasoning at each stage (e.g., selecting actions, resolving preconditions). When you've completed the plan, output only the list of action steps inside &lt;plan&gt;&lt;/plan&gt; tags, one per line, in standard PDDL action format.

DOMAIN:
{domain_text}

PROBLEM:
{problem_text}

Example:
Let's say the plan is [move a b], [pick c]. Output:

&lt;plan&gt;
(move a b)
(pick c)
&lt;/plan&gt;
</code></pre></div></div> <h5 id="prompt-2">Prompt 2</h5> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Imagine you are narrating a story where each action taken by an agent in this PDDL problem unfolds step by step. Describe the reasoning and sequence of actions as you go. At the end, summarize the plan actions in PDDL syntax within &lt;plan&gt;&lt;/plan&gt; tags so they can be extracted for execution.

Here are your materials:

DOMAIN:
{domain_text}

PROBLEM:
{problem_text}
</code></pre></div></div> <p>See footnote 1 for the answer. If I included one of these prompts in a paper, could you, as a reviewer, determine how much I fiddled, and how much it made a difference? I certainly could not!</p> <p>In this case, I was being very deliberate and unabashed in my prompt fiddling. A well-intentioned researcher would not be so brazen, but over the course of a long research project, the effects of fiddling can creep in nonetheless. A little tweak here today, a little rephrasing there next week—before long, you’ve completely overfit your prompts to your test set.</p> <p>So what should we do? Well first, let’s be clear: prompt engineering is not the problem. Like all forms of hyperparameter optimization, prompt engineering (automated or manual) is good and useful, <em>so long as it does not touch the test set</em>. The problem is <em>prompt fiddling</em>—changing the prompt to try to obtain better performance on the supposedly-held-out <em>test set</em>. To the extent that this is an old machine learning problem reskinned, we can start to address it with a renewed commitment to old practices:</p> <ol> <li>Optimize prompts with respect to a <a href="https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets">validation set</a> (not test set).</li> <li>Report as much information as possible about the optimization process.</li> <li>Clarify the kind and scope of generalization claimed.</li> <li>Participate in benchmarks where test sets are truly hidden and <a href="https://dswalter.github.io/machine-learnings-first-cheating-scandal.html">fiddling is considered cheating</a>.</li> <li>Be especially careful about generalization claims when working on reinforcement learning problems where “<a href="https://x.com/jacobandreas/status/924356906344267776">it’s socially acceptable to train on the test set</a>.”</li> </ol> <p>But these practices need to be upgraded, too. For example, for (1), how exactly should we optimize prompts? 
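</p>

<p>To make the mechanics of (1) concrete, here is a minimal sketch of prompt selection with a proper split: all fiddling is confined to a validation set, and the test set is touched exactly once at the end. The names (<code>select_prompt</code>, <code>run_model</code>, and the example prompts) are hypothetical illustrations, not from any particular tool:</p>

```python
import random

def evaluate(template, tasks, run_model):
    """Fraction of tasks solved; run_model returns True/False for one attempt."""
    return sum(bool(run_model(template.format(**t))) for t in tasks) / len(tasks)

def select_prompt(candidates, tasks, run_model, seed=0):
    """Choose among candidate prompts on a validation split, then score the
    winner on the held-out test split exactly once."""
    rng = random.Random(seed)
    shuffled = tasks[:]
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    val, test = shuffled[:mid], shuffled[mid:]
    # All prompt fiddling (manual or automated) happens against val only.
    best = max(candidates, key=lambda p: evaluate(p, val, run_model))
    # One final measurement on test, reported as-is, fiddling over.
    return best, evaluate(best, test, run_model)
```

<p>The same structure applies when the candidates come from an automated optimizer rather than a hand-written list; the crucial property is that the test split never influences which prompt is chosen.</p>

<p>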
There are tools for automated prompt optimization [e.g., <a href="https://dspy.ai/">1</a>, <a href="https://www.promptingguide.ai/techniques/ape">2</a>, <a href="https://arxiv.org/abs/2309.03409">3</a>, <a href="https://arxiv.org/abs/2309.16797">4</a>, <a href="https://github.com/microsoft/promptbench">5</a>], but there are no widely agreed-upon standards for using these tools in the research process, the way there are norms for numeric hyperparameter sweeps. In the absence of such standards, for (2), how can we write an informative and honest report when our optimization process was “I varied the words until it worked”? I don’t have the answers<sup>2,3</sup>, but I do think it’s important that we search for them. In the meantime, more than ever, we need to stick to the golden rule: lock the test set away, and remember that a result improved by fiddling is false progress.</p> <hr/> <h4 id="acknowledgements">Acknowledgements</h4> <p>I am grateful to Hannah Blumberg, Nishanth Kumar, and Zahra Bashir for feedback on this post.</p> <hr/> <p><small><sup>1</sup>The domain was Miconic, taken from <a href="https://arxiv.org/abs/2305.11014">this paper</a>. The tasks are only a subset of training tasks from the paper. Note also that the paper is about <em>generalized planning</em>, not planning. I just borrowed the domain and problems to run a quick experiment for this blog post. For this experiment, I used GPT-4.1, the “smartest non-reasoning model” from OpenAI. <i>Prompt 1 obtained 30% and Prompt 2 obtained 80%</i>.</small></p> <p><small><sup>2</sup>Nishanth Kumar offers another nice idea: ask an LLM to “paraphrase” the prompts used in the paper and report results on the paraphrased version alongside the main results.</small></p> <p><small><sup>3</sup>Zahra Bashir also shares a very interesting perspective: “I like to think of a prompt not as a single brittle string, but as representing a small <em>semantic neighborhood</em>. 
Instead of picking the prompt that just happens to perform best on a dev set, it feels more reasonable to prefer prompts that perform well and consistently across semantically close variations, like paraphrases or small structural changes. Intuitively, this means looking at something like the average performance of a prompt across its neighborhood, and possibly penalizing prompts whose performance changes a lot with small wording tweaks. Personally, when I read a paper and see that changing a few words in the prompt doesn’t drastically affect the results, I think I trust the approach more. It gives the sense that the result isn’t coming from some lucky phrasing, but from something more fundamental, maybe even that the chosen prompt is close to “(sub)optimal” in a broader sense. This also led me to an empirical question I find interesting: “if a prompt achieves high performance on the training or dev set, does that necessarily mean its semantic neighbors also have low error?” My intuition is that there should be some correlation, but it’s not obvious, and it feels like something worth studying rather than assuming. Overall, this perspective makes me prefer prompts that are both good and stable across their neighborhood. In that sense, instead of only maximizing mean accuracy, optimizing for robustness to small semantic changes feels like a more principled objective. It also gives a clean story: the prompt isn’t a single fragile string, but something that works reliably across a neighborhood of equivalent semantics.”</small></p>]]></content><author><name></name></author><summary type="html"><![CDATA[In machine learning, a golden rule followed by approximately no one is that you should only touch your test set once. You should keep it locked away (don’t even glance at it!) until the moment when you are ready to render a final judgment about the model you have developed, run prediction exactly once, and report dispassionately on the outcome. 
In reality, when those first-final results are disappointing, it is very tempting to fiddle with something and try again. This fiddling is a form of hyperparameter optimization on the test set—a kind of leakage that, like contaminants leaking into a water supply, may be benign in small quantities but harmful at scale over time.]]></summary></entry><entry><title type="html">Lessons from My First 8 Years of Research</title><link href="https://tomsilver.github.io/blog/2024/lessons/" rel="alternate" type="text/html" title="Lessons from My First 8 Years of Research"/><published>2024-07-28T00:00:01+00:00</published><updated>2024-07-28T00:00:01+00:00</updated><id>https://tomsilver.github.io/blog/2024/lessons</id><content type="html" xml:base="https://tomsilver.github.io/blog/2024/lessons/"><![CDATA[<p>Six years ago, after two years of working at a research-oriented startup and just before starting graduate school, I wrote a blog post called <strong>Lessons from My First Two Years of AI Research</strong>. That post is below. Now that I’ve finished my PhD, I’ve updated the post at the bottom with a few more lessons learned. <br/> <br/></p> <h3 id="lessons-from-my-first-two-years-of-ai-research-2018"><strong>Lessons from My First Two Years of AI Research (2018)</strong></h3> <hr/> <h2 id="starting-out">Starting out</h2> <hr/> <h4 id="find-someone-who-you-feel-comfortable-asking-dumb-questions">Find someone who you feel comfortable asking “dumb” questions</h4> <p>I was initially very intimidated by my colleagues and hesitant to ask basic questions that might betray my lack of expertise. It was many months before I felt comfortable enough with a few colleagues to ask questions, and still my questions were carefully formulated. Now I have three or four go-to people. I wish I had found them sooner! Before I was drowning in a backlog of terms to Google after work. 
Now I immediately ask a question when it comes up and my confusion is resolved before it compounds.</p> <h4 id="search-for-research-inspiration-in-different-places">Search for research inspiration in different places</h4> <p>Deciding what to work on can be the hardest part of research. Some general strategies that I have seen employed by researchers with long track records:</p> <ol> <li><strong>Talk to a researcher in a different field.</strong> Ask what problem they are excited about and try to restate the problem in computational terms. Ask if they have any datasets that they want to analyze, for which existing techniques seem insufficient. A lot of the most impactful work in machine learning results from collisions with bio/chem/physics, social sciences, or pure math. I am thinking, for example, of <a href="https://arxiv.org/pdf/1603.06277.pdf">this NIPS 2016 paper</a> by Matthew Johnson et al., which was motivated by a dataset of mouse behavior, or <a href="https://arxiv.org/pdf/1704.01212.pdf">this ICML 2017 paper</a> by Justin Gilmer et al. with applications to quantum chemistry.</li> <li><strong>Code a simple baseline to get a feel for a problem.</strong> For example, try to write some carefully calibrated code for controlling an <a href="https://gym.openai.com/envs/Pendulum-v0/">inverted pendulum</a>, or try to see how far you can push a bag-of-words model on a natural language dataset. I often run into unanticipated situations when writing baselines – bugs in my mental model or code. By the time my baseline is working, I usually have a handful of other ideas to try and a deeper understanding of the problem.</li> <li><strong>Extend the experiments section of a paper you like.</strong> Read the methods and results carefully. Try to find the duct tape. Consider the simplest extensions first and ask whether the paper’s method would suffice. 
Think about baseline methods not discussed and imagine where those might fall short.</li> </ol> <h4 id="invest-in-visualization-tools-and-skills">Invest in visualization tools and skills</h4> <p>The strategy I have come to adopt for writing research code is to start by creating visualization scripts. When the rest of the code has been written, running the visualization scripts will allow me to quickly verify whether my code matches my mental model. Even more importantly, good visualizations will often make bugs in my thinking or code far more obvious and interpretable than they would be otherwise. There is also something to be said here for self-motivation: when I finish this code, I will have a pretty figure or video to show people!</p> <p>Coming up with the right visualization for the problem at hand can be tricky. If you are iteratively optimizing a model (e.g. deep learning), plotting loss curves is always a good place to start. There are also many techniques for visualizing and interpreting learned weights of (especially convolutional) neural networks, such as <a href="https://arxiv.org/pdf/1412.6806.pdf">guided backpropagation</a>. In reinforcement learning and planning, the obvious thing to visualize is the agent acting in its environment, be it an Atari game, a robotic task, or a simple grid world (e.g. the environments in <a href="https://gym.openai.com/envs">OpenAI Gym</a>). Depending on the setup, it may also be possible to visualize the value function and how it changes over the course of training, or the tree of explored states. When dealing with graphical models, visualizing the distribution of a one- or two-dimensional variable as it changes over inference can be highly informative. One barometer for the effectiveness of a visualization technique is to estimate the amount of information that you have to hold in your head every time you analyze a visualization. 
A bad visualization will require recalling in detail the code you wrote to produce it; a good one will scream an obvious conclusion.</p> <h4 id="identify-the-fundamental-motivations-of-researchers-and-papers">Identify the fundamental motivations of researchers and papers</h4> <p>Researchers publishing in the same conferences, using the same technical jargon, and calling their field Artificial Intelligence can have polar opposite research motivations. Some folks have even suggested different names for the field in an effort to clear things up (e.g. Michael Jordan, in an excellent recent <a href="https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7">blog post</a>). There are at least three main clusters that might be called the “math”, “engineering”, and “cognitive” motivations.</p> <ul> <li>“Math” motivation: what are the fundamental properties and limits of an intelligent system?</li> <li>“Engineering” motivation: how can we develop intelligent systems that solve real problems better than alternative approaches?</li> <li>“Cognitive” motivation: how can we model natural intelligence like that found in humans and other animals?</li> </ul> <p>These motivations can be harmonious and many AI papers are interesting from multiple perspectives. Moreover, individual researchers are often driven by more than one of these motivations, which helps glue the field of AI together.</p> <p>However, the motivations can also be at odds. I have some friends and colleagues who are distinctly of the “Engineering” bent and others who are primarily interested in the “Cognitive” questions. A paper showing that some clever combination of existing techniques is sufficient to break the state of the art on a benchmark will pique the interest of the engineers but might earn yawns or even scorn from the cognitive scientists. 
The reverse will happen with a paper that has only theoretical or toy results but claims of biological plausibility or cognitive connections.</p> <p>Good papers and researchers will state their motivation at the outset, but often the fundamental impetus is buried. I have found it useful to consider papers through each lens one at a time in case the motivation is not obvious.</p> <h2 id="drinking-from-the-research-community-firehose">Drinking from the research community firehose</h2> <hr/> <h4 id="finding-papers">Finding papers</h4> <p>Papers in AI are fairly accessible and often published on <a href="https://arxiv.org/">arXiv</a>. The sheer number of papers coming out right now is exciting and overwhelming. A number of folks in the community have made it easier to sort signal from noise. Andrej Karpathy hosts the <a href="http://arxiv-sanity.com/">arXiv sanity preserver</a> with some helpful sorting, searching, and filtering features. <a href="https://twitter.com/Miles_Brundage">Miles Brundage</a> used to tweet a lightly curated list of arXiv papers each night; this duty has largely been assumed by the <a href="https://twitter.com/BrundageBot">Brundage Bot</a>. Many other tweeters share interesting references from time to time – I recommend following your favorite researchers on Twitter (<a href="https://twitter.com/tomssilver/following">here are the people I follow</a>). If Reddit is your thing, <a href="https://www.reddit.com/r/MachineLearning/">r/MachineLearning</a> is pretty good, but the posts are often geared more towards ML practitioners than academic researchers. Jack Clark publishes a weekly community newsletter called <a href="https://jack-clark.net/">“Import AI”</a> and Denny Britz has one called “<a href="https://www.getrevue.co/profile/wildml">The Wild Week in AI</a>.”</p> <p>Scrolling through conference proceedings when they are published can also be worthwhile. 
The big three conferences are <a href="https://neurips.cc/">NeurIPS</a>, <a href="http://icml.cc/">ICML</a>, and <a href="https://iclr.cc/">ICLR</a>. Other reputable general-audience conferences include <a href="https://aaai.org/Conferences/AAAI-18/">AAAI</a>, <a href="https://www.ijcai-18.org/">IJCAI</a>, and <a href="http://auai.org/uai2018/index.php">UAI</a>. Each subdiscipline has more specific conferences too. For computer vision, there is <a href="http://cvpr2018.thecvf.com/">CVPR</a>, <a href="https://eccv2018.org/">ECCV</a>, and <a href="http://iccv2017.thecvf.com/">ICCV</a>; for natural language, there is <a href="http://acl2018.org/">ACL</a>, <a href="http://emnlp2018.org/">EMNLP</a>, and <a href="http://naacl2018.org/">NAACL</a>; for robotics, there is <a href="http://www.robot-learning.org/">CoRL</a> (for learning), <a href="http://icaps18.icaps-conference.org/">ICAPS</a> (for planning, including but not limited to robotics), <a href="http://www.icra2017.org/">ICRA</a>, <a href="https://www.iros2018.org/">IROS</a>, and <a href="http://www.roboticsconference.org/">RSS</a>; for more theoretical work, there is <a href="https://www.aistats.org/">AISTATS</a>, <a href="http://www.learningtheory.org/colt2018/">COLT</a>, and <a href="http://www.kdd.org/kdd2018/">KDD</a>. Conferences are by far the dominant venue for publication, but there are journals as well. <a href="http://jair.org/">JAIR</a> and <a href="http://jmlr.org/">JMLR</a> are the two most prominent journals specific to the field. Occasionally, high-profile papers will also come out in general scientific journals like <a href="https://www.nature.com/">Nature</a> and <a href="https://www.sciencemag.org/">Science</a>.</p> <p>It is equally important but often much harder to find older papers. Those considered “classic” will often turn up from following reference trails, or from browsing the reading lists of graduate courses. 
Another way to discover older papers is to start with a senior professor in the field and find their earlier works, i.e. the research that paved the path to their professorship. Also feel free to email those professors to ask for additional references (though don’t take offense if they are too busy to reply). I don’t know of a consistent way to find older papers that are lesser known or overlooked beyond searching for keywords in <a href="https://scholar.google.com/">Google Scholar</a>.</p> <h4 id="how-much-time-should-be-spent-reading-papers">How much time should be spent reading papers?</h4> <p>I have heard two common pieces of advice regarding the amount of time one should spend with prior work. First, when just starting out, read all of the papers! People often say that the first semester or year of graduate school should be nothing but paper reading. Second, presumably beyond this initial ramp-up period, do not spend too much time reading papers! The rationale for the latter is that it is easier to creatively pose and solve problems if one is not biased towards previous approaches.</p> <p>Personally, I agree with the first bit of advice and disagree with the second. I think one should read as many papers as possible always, so long as there is still time left over for original research. The notion that I will be better equipped to come up with a novel, superior approach to a hard problem if I am unfamiliar with what others have tried seems unlikely at best and arrogant at worst. Yes, a fresh perspective on a problem can be key, and yes, stories of amateurs solving longstanding challenges because of their outside-the-box thinking are inspiring (e.g. <a href="https://curiosity.com/topics/how-george-dantzigs-late-arrival-to-class-made-math-history-curiosity/">George Dantzig</a> showing up late to lecture). But a career researcher cannot really depend on these fortunate jumps to sections of solution space not yet considered. 
The vast majority of time is spent patiently following the gradient, chipping away at a problem slowly and methodically. Reading relevant papers then is just a far more efficient way to figure out where we are and what to try next. (See also Julian Togelius on “<a href="http://togelius.blogspot.com/2016/04/the-differences-between-tinkering-and.html">tinkering versus research</a>.”)</p> <p>With regard to reading as many papers as possible, there is one important caveat: taking time to digest a paper is just as important as reading it. It is better to spend a day with a handful of papers, taking careful notes and reflecting on each, than it is to devour paper after paper in succession. Read all of the papers that you can, but no more than that.</p> <h4 id="conversations-videos--papers--conference-talks">Conversations » videos &gt; papers &gt; conference talks</h4> <p>Papers are definitely the most accessible source for understanding an unfamiliar research idea. But what path is most efficient? Different people may answer this question differently. For me, I have found that having a conversation (ideally with folks who already understand the idea in question) is by far the quickest and most effective path to understanding. In the case that such people are unavailable, videos about the subject, e.g. the author of the paper giving an invited talk, can provide very good insight. When the presenter is addressing a live audience, they tend to prioritize clarity more than concision. The priorities are swapped in most paper writing, where word count is king and background explanations may even be viewed as evidence of an author’s unfamiliarity with the field. Finally, short conference talks are often more of a formality than an educational opportunity. 
Of course, a conversation with the presenter afterwards could be invaluable.</p> <h4 id="beware-the-hype">Beware the hype</h4> <p>Successful AI research solicits public attention, which brings more people into the field, which leads to more successful AI research. This cycle is mostly virtuous, but one pernicious side effect is hype. Journalists trying to get clicks, companies vying for investors and recruits, and researchers aiming for high profile publications and citations are all guilty of inflating the hype bubble. It is important to remain mindful of these various motives when assessing a headline or press release or paper.</p> <p>At NIPS 2017, during the Q&amp;A portion of a paper talk in a room with several hundred audience members, a prominent professor took the microphone (“on behalf of the hype police”) and admonished the authors for using the word “imagination” in their paper title. I have mixed feelings about these sorts of public confrontations and I happen to have liked the particular paper in question. But I completely sympathized with the professor’s frustration. One of the most common and aggravating manifestations of hype in AI research is the renaming of old ideas with flashy new terms. Beware of these buzzwords – judge a paper based primarily on its experiments and results.</p> <h2 id="running-the-research-marathon">Running the research marathon</h2> <hr/> <h4 id="always-be-making-measurable-progress">Always be making measurable progress</h4> <p>When searching for research projects early on, I spent hours and hours brainstorming. Brainstorming, for me at the time, meant putting my head down at my desk and hoping that some vague intuitions would coalesce into a concrete insight. At the end of a day of brainstorming, I would often feel tired and discouraged. Was this research, I wondered?</p> <p>There is, of course, no recipe for research progress, and fumbling around in the dark is part of (most of) the process. 
However, I now find it much easier and more fulfilling to structure my work around measurable objectives. If I have very little idea what I’m doing next, the objective can be: write down a vague idea in the greatest detail available; if, in the course of writing the idea I rule it out, write down the reason for ruling it out (rather than scrapping the whole thing and losing the measure of progress). In the absence of any ideas, progress can take the form of papers read or conversations with colleagues had. By the end of each day, I now try to have some tangible evidence of my work. Even if the ideas are never used, my morale is much improved, and I need not worry about wasting future cycles on the same ideas that I ruled out that day.</p> <h4 id="learn-to-recognize-and-backtrack-from-dead-ends">Learn to recognize and backtrack from dead-ends</h4> <p>Strong researchers spend more time on good ideas because they spend less time on bad ideas. Being able to sort the good from the bad seems to be largely a function of experience. Nonetheless, researchers at any level constantly encounter the following decision. My research idea is flawed or inconclusive. Should I A) try to salvage or support the idea further, or B) try to justify abandoning the idea completely? I personally regret spending more time doing A) when I should have done B). Especially early on, I became stuck several times in what I now recognize as dead-ends and remained there for too long. My reluctance to leave was likely rooted in the sunk cost fallacy – in backtracking from the dead-end, I would be sacrificing the time that I had already expended.</p> <p>I still feel a twinge of disappointment when I leave research dead-ends. What I am now trying to internalize is that backtracking is forward progress, counterintuitively enough. The cost was well spent, not sunk. If I hadn’t explored this dead-end today, I might have considered it tomorrow. Dead-ends are not the end. Also they’re a healthy part of life. 
Hopefully one of these mantras will stick. If not, there’s also <a href="https://www.goodreads.com/quotes/38125-we-are-trying-to-prove-ourselves-wrong-as-quickly-as">a Feynman quote</a>.</p> <h4 id="write">Write!</h4> <p>I once had an occasion to ask a very prominent AI researcher for early career tips. His advice was simple: write! Write blog posts and papers of course, but even more importantly, write down your thoughts throughout the day. Since he said that, I have noticed an obvious difference in progress that I make when I am actively writing versus simply thinking.</p> <h4 id="mental-and-physical-health-are-prerequisites-for-research">Mental and physical health are prerequisites for research</h4> <p>There is the dangerous trope of the academic researcher who forgoes sleep and self-care in an obsessive pursuit of scientific discovery. I have often been guilty of putting such behavior on a pedestal and striving towards it myself. I now understand (at a rational level, at least) that exercise and mental breaks are investments, not distractions. If I spend 8 hours sleeping and 4 hours working, I am immensely more productive than having spent 4 hours sleeping and 8 hours working, saying nothing of the downstream effects.</p> <p>It can be very difficult to stop working in the middle of a tough problem. I still have the tendency to grind away at something even when I have passed the point of exhaustion or frustration and have no real chance of progress without a break. When I am able to step away and take a long breath, I am always happy that I did so. I hope to continue internalizing this fact as I move on to the next phase of my research career.</p> <h2 id="update-2024-lessons-from-6-more-years"><strong>Update (2024): Lessons From 6 More Years</strong></h2> <hr/> <p>Looking back at what I wrote 6 years ago, I am surprised to find that I still agree with most of it. 
I will add 6 more miscellaneous lessons that I’ve learned since then.</p> <h4 id="what-comes-with-experience">What comes with experience?</h4> <p>My 2-years-in self would have liked to know what aspects of research naturally get easier over time.</p> <ol> <li><strong>Knowing what people know.</strong> Novice researchers already have a lot of knowledge, but they don’t necessarily know what is <em>common knowledge</em>. This makes research communication difficult because it’s not clear what needs to be emphasized. It’s also tricky to find the right level of abstraction for your audience. Over the years, you start to know what people know. Communication is then much easier.</li> <li><strong>Generating ideas.</strong> There is a phase transition in one’s research career where “I have no project ideas” suddenly becomes “I have too many project ideas and not enough time.” This happens naturally because each completed project inspires multiple new directions. For me, this transition was a big relief, because I found the “no ideas” phase to be very stressful. It took about 5 years for the transition to happen.</li> <li><strong>Reading papers.</strong> It gets exponentially easier to read papers after the first few years in a field. You learn to take a quick diff between a paper and previous work. You also learn to distinguish between challenging papers that are inherently difficult, yet rewarding, and papers that are just written poorly.</li> <li><strong>Riding the rollercoaster.</strong> Research is inherently a rollercoaster—this fact doesn’t change. But once you’ve been through the ups and downs a few times, your stomach gets steelier. You start to internalize that volatility is just part of the ride and not indicative of something being wrong.</li> </ol> <p>Research is still hard. 
But some things get easier!</p> <h4 id="collaborate-with-the-right-people">Collaborate (with the right people)</h4> <p>I am happiest and do my best work when I am working very closely with 1 other person. It took a while to learn this about myself, and more time still to figure out what exactly I need in a collaborator. (For me: I like someone who is uber-reliable, over-communicative, fast, and funny.) It’s important to distinguish your <em>good-collaborator-for-me</em> classifier from your <em>good-researcher</em> classifier. I’ve learned through experience that there are people who are very good researchers—and great collaborators for other people—but not a good match for me.</p> <p>Collaborations are invaluable for sharing ideas, distributing workloads, and, when working with someone from a different field, learning the tricks of their trade. Collaborations are also extremely important for emotionally managing the research rollercoaster. When you’re feeling down, good collaborators will pick you up, and then you can later return the favor. In my 2nd year of graduate school, I felt lost and seriously considered exiting with a terminal Master’s. Starting a collaboration (with <a href="https://rohanchitnis.com/">Rohan Chitnis</a>) saved my PhD.</p> <h4 id="invest-in-software-engineering">Invest in software engineering</h4> <p>Starting a code repository for a new research project has become one of my (guilty?) pleasures. Oh, new code, unmarred by bad decisions, free of <a href="https://en.wikipedia.org/wiki/Code_smell">smells</a>—it’s a thing of beauty. I also enjoy it because over several years, and with a lot of learning from collaborators, I have found some software engineering (SWE) practices that work well for me. For example, nearly all the Python code I write these days has type checking, linting, autoformatting, unit tests, and continuous integration (e.g., see <a href="https://github.com/tomsilver/python-starter">this starter repo</a>). 
These things are par for the course in professional SWE land, but not necessarily standard in academic research.</p> <p>Good SWE practices are extremely helpful in research. Even for code that only you use; even for code that survives for only one project. Bugs are much easier to catch and code is much faster to write with a little bit of up-front investment. If you’re going to be developing the code for more than a day or two, <a href="/assets/img/good-swe-helps.png">good SWE is probably worth it</a>. As a bonus, by the time you’re ready to submit a paper, your code supplement will be trivial to include if your code is already clean (and you won’t have to worry about discovering any heart-breaking bugs at the last moment)!</p> <p>The best way to improve your SWE practices is to collaborate closely with someone whose habits you would like to emulate. A few months of code reviews (e.g., literal code reviews through GitHub) and pair programming can go a long way. In the absence of a collaborator, the next best thing is carefully reading high-quality code. I am still learning a lot by doing this periodically.</p> <h4 id="be-dogmatic-about-problems-not-approaches">Be dogmatic about problems, not approaches</h4> <p>Every researcher naturally develops an affinity for certain technical approaches. The best researchers are ready to abandon these approaches when better ones come along. This is really hard to do. 
<em>It took years for me to develop expertise in this approach, and now I’m supposed to start from scratch with a new one?</em> (There’s a lot of this going around in our current LLM era, but it’s nothing new; see nice reflections from <a href="https://colinraffel.com/blog/language-model-development-as-a-new-subfield.html">Colin Raffel</a> or <a href="https://perceiving-systems.blog/en/post/my-first-iccv">Michael J Black</a>.)</p> <p>The best researchers, or at least the ones I most admire, are ready to abandon their approaches because they’re passionate about solving <em>problems</em>. They care deeply about solving those problems and don’t care how it gets done. Russ Tedrake, who has <a href="https://supervised-robot-learning.github.io/">spoken recently</a> about changing his mind on the viability of large data-driven models for robotics, is one great example.</p> <h4 id="beware-of-cobras">Beware of <a href="https://en.wikipedia.org/wiki/Perverse_incentive">cobras</a></h4> <p>Feedback in research is incredibly sparse. When people say this, they usually mean that the paper-writing cycle takes a long time. This is true, but I want to go further and say that <em>real</em> feedback is even sparser. The point of research is to make a discovery that leads to a discovery that leads to a discovery that eventually changes the world. This takes a <em>really</em> long time. Who’s to say if we’re making progress?</p> <p>In the face of this feedback void, it is tempting to latch on to metrics that are easier to measure than what you really care about. Metrics like <em>praise from advisor</em> or <em>number of papers published</em>. Beware of <a href="https://en.wikipedia.org/wiki/Perverse_incentive">cobra effects</a>: optimizing these metrics, just because they can be easily measured, may lead you away from your real objectives.
I’m not saying to ignore them completely—sure, spend an extra hour polishing your meeting slides; find the right emojis for your paper publicity tweet—but when you’re quietly pondering what projects to pursue, or what design choices to make, think first of your <em>real</em> goals and fend off any cobras lurking about.</p> <h4 id="there-is-no-one-archetype-of-a-good-researcher">There is no one archetype of a good researcher</h4> <p>One nagging thought I had as I prepared my faculty applications: <em>I am not the professorial type</em>. I am not the center of attention at conference dinners. I am not highly opinionated. I am not prepared to deliver an extemporaneous lecture or pull a perfect quote out of my hat. Eventually I realized that this archetypal researcher was a fiction (and probably a reflection of my own insecurities).</p> <p>There is no one way to be a good researcher. Without embarrassing anyone, I am thinking of one person who is a true intellectual, deeply knowledgeable in their subjects and their meta-subjects; another who is unbelievably productive, pushing their field forward through sheer force of will; another who, yes, is the center of attention at conference dinners and has the charisma and vision to shape the future; another who does most of their work quietly and alone, periodically poking their head out with remarkable discoveries; another who is passionate about a specific application and will do anything to make it happen; the list goes on.</p> <p>I found this realization to be a good antidote to imposter syndrome. If you ever find yourself doubting your ability to grow into a good researcher, try to find <em>just one</em> person whom you admire and whose style and personality align with your own.
Researchers are a motley crew—I guarantee this person is out there.</p> <hr/> <h2 id="acknowledgements">Acknowledgements</h2> <p><strong>(2018)</strong> Many thanks to Hannah Blumberg, Kate Donahue, Mihail Eric, Ankit Gupta, Eric Purdy, Austin Stone, and Scott Swingle for reading and providing excellent feedback on an early version of the original post.</p> <p><strong>(2024)</strong> Thanks to Hannah Blumberg, Rohan Chitnis, Nishanth Kumar, and Rajat Kumar Jenamani for providing very helpful feedback on the updates to the post.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Six years ago, after two years of working at a research-oriented startup and just before starting graduate school, I wrote a blog post called Lessons from My First Two Years of AI Research. That post is below. Now that I’ve finished my PhD, I’ve updated the post at the bottom with a few more lessons learned.]]></summary></entry></feed>