Thursday, January 2, 2025

Exploring generative AI


TDD with GitHub Copilot

by Pablo Sobocinski

Will the arrival of AI coding assistants like GitHub Copilot mean we won’t need testing? Will TDD become obsolete? To answer this, let’s examine two ways TDD helps software development: by providing good feedback and a means to “divide and conquer” when solving problems.

TDD for good feedback

Good feedback is fast and accurate. In both respects, nothing beats starting with a well-written unit test: not manual testing, not documentation, not code review, and yes, not even generative AI. In fact, LLMs provide irrelevant information and even hallucinate. TDD is especially necessary when using AI coding assistants. For the same reasons we need fast, accurate feedback on the code we write, we need fast, accurate feedback on the code our AI coding assistant writes.

TDD to divide and conquer problems

Solving problems by divide and conquer means that smaller problems can be solved sooner than larger ones. This enables continuous integration, trunk-based development, and ultimately continuous delivery. But do we really need all this if AI assistants do the coding for us?

Yes. LLMs rarely provide the exact functionality we need after a single prompt. Therefore, iterative development is not going away just yet. Additionally, LLMs appear to “elicit reasoning” (see linked study) when they solve problems incrementally via chain-of-thought prompting. LLM-based AI coding assistants work best when they divide and conquer problems, and TDD is how we do that in software development.

TDD Tips for GitHub Copilot

At Thoughtworks, we’ve been using GitHub Copilot with TDD since the beginning of the year. Our objective has been to experiment, evaluate and develop a series of effective practices around the use of the tool.

0. Getting started

Starting with a blank test file does not mean starting with a blank context. We often start from a user story with some rough notes. We also talk through a starting point with our pairing partner.

This is all context that Copilot doesn’t “see” until we put it in an open file (for example, the top of our test file). Copilot can work with typos, point-form notes, bad grammar, you name it. But it can’t work with a blank file.

Some examples of initial context that have worked for us:

  • ASCII art mockup
  • Acceptance criteria
  • Guiding assumptions such as:
    • “No GUI needed”
    • “Use object-oriented programming” (vs. functional programming)

Copilot uses open files as context, so keeping both the test and implementation files open (for example, side by side) greatly improves Copilot’s code completion capabilities.
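As a rough illustration, the initial context might sit at the top of a Python test file as plain comments. This is only a sketch: the file name, user story, and acceptance criteria below are hypothetical and invented for the example, not taken from a real project.

```python
# test_cart.py (hypothetical file name)
#
# User story (hypothetical): as a shopper, I want to see my cart total
# so that I know what I will pay before checking out.
#
# Acceptance criteria:
#   - An empty cart has a total of 0.
#   - The total is the sum of price * quantity over all items.
#
# Guiding assumptions:
#   - No GUI needed.
#   - Use object-oriented programming.

import unittest


class CartTotalTest(unittest.TestCase):
    """Test examples are added below this line, one at a time."""
```

The comments do no work at runtime; their only job is to give the assistant (and the pairing partner) something to read before the first test is written.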

1. Red

TDD represented as a three-part wheel with the 'red' part highlighted in the upper left third

We start by writing a descriptive test example name. The more descriptive the name, the better Copilot’s code completion performance.

We find that a Given-When-Then structure helps in three ways. First, it reminds us to provide business context. Second, it allows Copilot to provide rich and expressive name recommendations for test examples. Third, it reveals Copilot’s “understanding” of the problem from the context at the top of the file (described in the previous section).

For example, if we are working on backend code and Copilot code-completes our test example name to “given the user… clicks the buy button”, this tells us that we should update the context at the top of the file to specify “Assume no GUI” or “This test suite interacts with the API endpoints of a Python Flask application”.
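A Given-When-Then test name could look like the following sketch. The `Cart` class and its methods are hypothetical stand-ins, included only so the example runs; in a real red step the implementation would not exist yet.

```python
import unittest

# Context at the top of the file (hypothetical): "Assume no GUI."


class Cart:
    """Hypothetical stand-in for the code under test."""

    def __init__(self):
        self._items = []

    def add(self, name):
        self._items.append(name)

    def item_count(self):
        return len(self._items)


class CartTest(unittest.TestCase):
    def test_given_empty_cart_when_item_is_added_then_item_count_is_one(self):
        # Given an empty cart
        cart = Cart()
        # When an item is added
        cart.add("book")
        # Then the cart holds exactly one item
        self.assertEqual(cart.item_count(), 1)
```

The name describes behavior in business terms and mentions no buttons or screens, which matches the “no GUI” assumption stated in the context.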

More “traps” to be aware of:

  • Copilot can code-complete multiple tests at once. These tests are often useless (we delete them).
  • As we add more tests, Copilot will code-complete multiple lines instead of one line at a time. It will often infer the correct “arrange” and “act” steps from the test names.
    • Here is the catch: it infers the correct “assert” step less often, so we take special care here that the new test fails correctly before moving on to the “green” step.

2. Green

TDD represented as a three-part wheel with the 'green' part highlighted

Now we are ready for Copilot to help us with the implementation. An existing, expressive, and readable test suite maximizes Copilot’s potential in this step.

That said, Copilot often doesn’t take “baby steps.” For example, when adding a new method, the “baby step” means returning a hardcoded value that passes the test. To date, we have been unable to convince Copilot to take this approach.
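For readers less familiar with the practice, a “baby step” of this kind might look like the following sketch. The `Cart` class and the test are hypothetical examples.

```python
import unittest


class Cart:
    """Hypothetical class under test."""

    def total(self):
        # Baby step: a hardcoded value, just enough to pass the single test below.
        return 0


class CartTotalTest(unittest.TestCase):
    def test_given_empty_cart_when_total_requested_then_zero_is_returned(self):
        self.assertEqual(Cart().total(), 0)
```

The hardcoded return is meant to be replaced only when a later test (for example, a cart with one item) forces a more general implementation.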

Backfilling tests

Instead of taking “baby steps,” Copilot leaps ahead and provides functionality that, while often relevant, is still untested. As a workaround, we “backfill” the missing tests. While this differs from the standard TDD flow, we have not yet seen any serious issues with our workaround.
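The sketch below illustrates the idea under assumed names: a hypothetical `Cart` implementation of the kind an assistant might generate in one leap, the single test that was written first, and two tests added afterwards to cover the behavior that arrived untested.

```python
import unittest


class Cart:
    """Hypothetical implementation, more general than the first test required."""

    def __init__(self):
        self._items = []

    def add(self, price, quantity=1):
        if quantity < 1:
            raise ValueError("quantity must be at least 1")
        self._items.append((price, quantity))

    def total(self):
        return sum(price * quantity for price, quantity in self._items)


class CartTotalTest(unittest.TestCase):
    # The test written first; it drove the implementation above.
    def test_given_one_item_when_total_requested_then_its_price_is_returned(self):
        cart = Cart()
        cart.add(10)
        self.assertEqual(cart.total(), 10)

    # Added afterwards: covers the quantity handling that arrived untested.
    def test_given_item_with_quantity_when_total_requested_then_price_is_multiplied(self):
        cart = Cart()
        cart.add(10, quantity=3)
        self.assertEqual(cart.total(), 30)

    # Added afterwards: covers the validation that arrived untested.
    def test_given_zero_quantity_when_item_is_added_then_value_error_is_raised(self):
        with self.assertRaises(ValueError):
            Cart().add(10, quantity=0)
```

A test added this way passes on its first run, so it is worth breaking the implementation briefly to confirm that the test can actually fail.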

Delete and regenerate

For implementation code that needs updating, the most effective way to involve Copilot is to delete the implementation and have it regenerate the code from scratch. If this fails, it may be helpful to delete the contents of the method and write out the approach step by step using code comments. Otherwise, the best way forward may be to simply turn off Copilot momentarily and code the solution manually.
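Writing the approach step by step in comments might look like the sketch below. The comments are what we would type into the emptied method; the line under each comment shows what a completion could plausibly look like. The function name and the discount table are hypothetical, invented for this example.

```python
# Hypothetical discount table, invented for this sketch.
DISCOUNT_PERCENT = {"SAVE10": 10, "HALF": 50}


def apply_discount(total, code):
    # Step 1: look up the discount percentage; unknown codes give no discount.
    percent = DISCOUNT_PERCENT.get(code, 0)
    # Step 2: reduce the total by that percentage.
    discounted = total * (100 - percent) / 100
    # Step 3: round to two decimal places.
    return round(discounted, 2)
```

For instance, `apply_discount(100, "SAVE10")` returns `90.0`. Each comment narrows what the next completion has to produce, which is the same divide-and-conquer idea applied inside a single method.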

3. Refactor

TDD represented as a three-part wheel with the 'Refactor' part highlighted in the bottom third

Refactoring in TDD means making incremental changes that improve the maintainability and extensibility of the codebase, all done while preserving behavior (and a functional codebase).

For this step, we have found that Copilot’s capabilities are limited. Consider two scenarios:

  1. “I know the refactoring move I want to try”: IDE refactoring shortcuts and features like multi-cursor selection get us where we want to go faster than Copilot.
  2. “I don’t know what refactoring move to take”: Copilot code completion cannot guide us through a refactoring. However, Copilot Chat can make code improvement suggestions directly in the IDE. We’ve started exploring that feature and see the promise of making useful suggestions in a small, localized setting. But we haven’t had much success yet with refactoring suggestions on a larger scale (i.e. beyond a single method/function).

Sometimes we know the refactoring move but we don’t know the syntax needed to carry it out. For example, creating a test mock that allows us to inject a dependency. For these situations, Copilot can help provide an inline answer when prompted via a code comment. This saves us the context switch to documentation or web search.
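As an illustration of that kind of syntax, the sketch below uses Python's standard `unittest.mock.Mock` to inject a dependency. The `CheckoutService` class and its `payment_gateway` collaborator are hypothetical; the comment inside the test stands in for the prompt we might type.

```python
import unittest
from unittest.mock import Mock


class CheckoutService:
    """Hypothetical class that receives its dependency through the constructor."""

    def __init__(self, payment_gateway):
        self._payment_gateway = payment_gateway

    def checkout(self, amount):
        return self._payment_gateway.charge(amount)


class CheckoutServiceTest(unittest.TestCase):
    def test_given_successful_charge_when_checkout_then_true_is_returned(self):
        # create a mock payment gateway whose charge() returns True and inject it
        gateway = Mock()
        gateway.charge.return_value = True
        service = CheckoutService(payment_gateway=gateway)

        self.assertTrue(service.checkout(25))
        # Verify the service delegated to the injected dependency.
        gateway.charge.assert_called_once_with(25)
```

Constructor injection is only one option; it is used here because it needs no patching of module-level names.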

Conclusion

The common saying “garbage in, garbage out” applies to data engineering as well as to generative AI and LLMs. In other words, higher-quality inputs let us make better use of what LLMs can do. In our case, TDD maintains a high level of code quality. This high-quality input leads to better Copilot performance than would otherwise be possible.

Therefore, we recommend using Copilot with TDD and hope you find the tips above helpful in doing so.

Thanks to the “Ensembling with Copilot” team started at Thoughtworks Canada; they are the main source of the findings covered in this memo: Om, Vivian, Nenad, Rishi, Zack, Eren, Janice, Yada, Geet and Matthew.

