A bit of a shorter video, but I tried throwing a Triplebyte-style programming quiz at GPT-3 to see how it would do.
Spoiler: it doesn't do very well.
My best guess for why it doesn't do very well is that the multiple-choice format of the quiz trips it up a bit. If you were to instead ask basic data-retrieval-style questions like, "What does HTML stand for?" then the results would probably be quite a bit better.
For this experiment though, I tried multiple question formats: with/without a "Reason:", labeling the choices with asterisks vs letters, different example questions/answers, etc. The highest score I got was 6/10, although with enough tries it could probably score better through sheer luck.
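For readers who want to try the same thing, here's a minimal sketch of how those prompt variations can be generated. The `build_prompt` helper and its parameters are my own illustration, not the exact code used in the video; the idea is just that the same question/choices can be rendered with letter labels or asterisks, with or without a "Reason:" line.

```python
import string

def build_prompt(question, choices, label_style="letters", include_reason=False):
    """Render one multiple-choice question as a completion prompt.

    label_style: "letters" labels choices as A., B., ...; anything else uses "* ".
    include_reason: adds a "Reason:" line before "Answer:" to elicit reasoning first.
    """
    lines = [f"Question: {question}"]
    for i, choice in enumerate(choices):
        if label_style == "letters":
            lines.append(f"{string.ascii_uppercase[i]}. {choice}")
        else:
            lines.append(f"* {choice}")
    if include_reason:
        lines.append("Reason:")
    lines.append("Answer:")
    return "\n".join(lines)

print(build_prompt(
    "What does HTML stand for?",
    ["HyperText Markup Language", "Home Tool Markup Language"],
    label_style="letters",
    include_reason=True,
))
```

The resulting text would then be sent to the completion endpoint, and you'd compare scores across the format variants.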
On the topic of luck, keep in mind that I ran all my tests with temperature=0.0, which should mean that the model behaves deterministically - given the same input, it should return the same output. However, the input was not deterministic because the choices were shuffled randomly each time. As shown in the video, the order of the choices affects the weights of the next tokens, sometimes enough to change the answer completely.
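To make that concrete: even with a deterministic model, every distinct ordering of the choices produces a distinct prompt string, so shuffling alone is enough to get different outputs. A quick sketch (toy question, not from the actual quiz):

```python
import itertools

question = "Which of these is immutable in Python?"
choices = ["list", "tuple", "dict"]

# Every permutation of the same three choices yields a different prompt string,
# so a temperature=0.0 model can still give different answers run to run.
prompts = set()
for perm in itertools.permutations(choices):
    prompt = question + "\n" + "\n".join(f"* {c}" for c in perm) + "\nAnswer:"
    prompts.add(prompt)

print(len(prompts))  # 6 distinct prompts from one question
```

Six distinct inputs means up to six distinct chances for the token weights to shift, which is exactly what the video shows happening.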
The other thing the language model seemed to have trouble with is negative questions, e.g. "Which of the following choices is not true?" or "Which of the following is not applicable?" The video has an example where GPT-3 generated a reason that was true on its own, but not true in the context of the question.
Overall, I think this ties back to a point that Gwern and others have made, which is that successful usage of GPT-3 and similar models requires some amount of "prompt engineering". And it's likely that soon we will have an industry-standard job of "people who get pre-trained models to do useful things". That is to say, intelligent language models won't eliminate programming - instead we'll be programming one level higher, "programming" prompts and prompt templates that produce good results.