Evaluation
What does it mean to do a good job?
- Generated code exactly matches a reference solution?
- Generated code resembles a reference solution? (CodeBLEU)
- Generated code passes test cases: pass@1, pass@k (this is what HumanEval measures).
- Is the generated code algorithmically optimal? Test against competitive programming problems.
- Does the generated code compile and pass static analysis checks?
- Is the generated code readable and aesthetically pleasing?
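For reference, a quick sketch of the unbiased pass@k estimator used in HumanEval-style evaluation: generate `n` samples per problem, count the `c` that pass the tests, and estimate the probability that at least one of `k` drawn samples is correct.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), computed as a stable product.
    n = samples generated per problem, c = samples that pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 12 of which pass the unit tests
print(pass_at_k(n=200, c=12, k=1))   # 0.06 (= c / n)
print(pass_at_k(n=200, c=12, k=10))  # considerably higher
```

The reported score is the average of this quantity over all problems in the benchmark.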
Training
Data Collection
The dataset is mostly source code, plus an English code-related natural language corpus:
- Public repositories on GitHub
- GitHub Markdown and StackExchange
Pre-trained on repository-level code to better learn the nuances of cross-file dependencies
Filter code
- Syntax: syntax errors and StarCoder filtering rules
- Semantic: filter out code that has poor readability and low modularity
Dependency parsing: parse the dependencies between files in a repository, then arrange the files so that dependencies come before the files that use them.
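A rough sketch of how such an ordering could be derived for Python files; the regex-based import detection and helper names are illustrative, not any particular pipeline's implementation.

```python
import re
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def local_imports(source: str, repo_modules: set[str]) -> set[str]:
    """Crude heuristic: `import x` / `from x import ...` counts as a dependency
    if module x corresponds to another file in the same repository."""
    mods = re.findall(r"^(?:from|import)\s+([\w\.]+)", source, flags=re.M)
    return {m.split(".")[0] for m in mods} & repo_modules

def order_repo_files(files: dict[str, str]) -> list[str]:
    """files maps path -> source; returns paths with dependencies first.
    (Raises graphlib.CycleError on circular imports; real pipelines break ties.)"""
    module_of = {path: path.rsplit("/", 1)[-1].removesuffix(".py") for path in files}
    graph = {
        path: {p for p in files if module_of[p] in local_imports(src, set(module_of.values()))}
        for path, src in files.items()
    }
    return list(TopologicalSorter(graph).static_order())
```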
Repository-Level De-duplication: remove near-duplicate repositories.
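Deduplication at repository granularity can be sketched with a toy exact-Jaccard comparison over token shingles; production pipelines typically use MinHash/LSH rather than the all-pairs loop below.

```python
def shingles(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Token n-grams used as the unit of comparison."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 0))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def dedup_repos(repos: dict[str, str], threshold: float = 0.85) -> list[str]:
    """Keep a repository only if it is not too similar to any repository kept so far."""
    kept, kept_shingles = [], []
    for name, text in repos.items():
        s = shingles(text)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(name)
            kept_shingles.append(s)
    return kept
```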
Techniques
Next Token Prediction (NTP): given the preceding context, predict the token that comes next.
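As a loss, this is just cross-entropy between the model's prediction at position t and the token at position t+1. A minimal PyTorch-style sketch, assuming logits of shape (batch, seq_len, vocab):

```python
import torch
import torch.nn.functional as F

def ntp_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Next-token prediction loss.
    logits: (batch, seq_len, vocab) model outputs; tokens: (batch, seq_len) inputs."""
    shifted_logits = logits[:, :-1, :]   # prediction at position t ...
    targets = tokens[:, 1:]              # ... is scored against token t+1
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
    )
```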
Fill in the Middle (FIM): split a document into a prefix, middle, and suffix, and train the model to predict the middle given both the prefix and the suffix (useful for infilling code inside an existing file).
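A sketch of the document transform, using StarCoder-style sentinel tokens as placeholders (exact token names differ across model families):

```python
import random

# Placeholder sentinel tokens; actual names vary by model family.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def to_fim_example(document: str, rng: random.Random) -> str:
    """Prefix-Suffix-Middle (PSM) transform: the middle span is moved to the end,
    so ordinary next-token prediction learns to fill the gap between prefix and suffix."""
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```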
Grounding
We want to connect code with natural language, so we also collect tutorials, blog posts, documentation, StackOverflow answers, etc. Pull requests, commit messages, and GitHub discussions are used to provide context for the code. Computer science textbooks can be used as well.
We also want the model to be aware of the world and logically sound, so we train it on a large general text corpus (the same way as standard LLMs) and on math word problems (to teach it logic).
Post Training
Alignment
Instruction fine-tuning: further train the model on instructions paired with correct outputs, usually from a manually curated dataset.
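A minimal sketch of turning one (instruction, response) pair into a training example, assuming a Hugging Face-style tokenizer and a hypothetical prompt template; masking the prompt so the loss only covers the response tokens is a common (not universal) choice.

```python
# Hypothetical prompt template; real templates differ per model family.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross_entropy by default

def build_sft_example(tokenizer, instruction: str, response: str) -> dict:
    """Tokenize prompt + response; compute loss only on the response tokens."""
    prompt = TEMPLATE.format(instruction=instruction, response="")
    full = TEMPLATE.format(instruction=instruction, response=response)
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    full_ids = tokenizer(full, add_special_tokens=False)["input_ids"]
    labels = [IGNORE_INDEX] * len(prompt_ids) + full_ids[len(prompt_ids):]
    return {"input_ids": full_ids, "labels": labels}
```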
Chain-of-Thought: append a phrase such as “You need to write a step-by-step outline and then write the code” to the initial prompt. This lets the model spend more inference-time compute working through the problem before emitting code.
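Applying this is just a prompt transformation, for example:

```python
COT_SUFFIX = "You need to write a step-by-step outline and then write the code"

def with_outline_first(user_prompt: str) -> str:
    """Append the outline-first instruction so the model plans before coding."""
    return f"{user_prompt}\n{COT_SUFFIX}"

print(with_outline_first("Write a function that merges two sorted lists."))
```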
Synthetic Instruction Datasets
We can use LLMs to generate instruction datasets as (instruction, response) pairs (a code sketch follows the list):
- Pick a code snippet from GitHub
- Ask the LLM to provide an instruction that matches this code
- Ask the LLM to provide the response to the instruction
- Ask the LLM to score this pair of instruction and response
- Create a dataset of only the highest scoring pairs.
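A sketch of this loop; `generate(prompt) -> str` is a stand-in for whichever LLM API is used, and the prompts and score threshold are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class Pair:
    instruction: str
    response: str
    score: float

def synthesize_pairs(snippets: list[str], generate, min_score: float = 8.0) -> list[Pair]:
    """Seed with real GitHub snippets; ask an LLM for a matching instruction,
    a response, and a 1-10 quality score; keep only the highest-scoring pairs."""
    kept = []
    for snippet in snippets:
        instruction = generate(f"Write an instruction that this code answers:\n{snippet}")
        response = generate(f"Respond to the following instruction with code:\n{instruction}")
        score = float(generate(
            "Score from 1 to 10 how well the response answers the instruction.\n"
            f"Instruction:\n{instruction}\nResponse:\n{response}\nScore:"
        ))
        if score >= min_score:
            kept.append(Pair(instruction, response, score))
    return kept
```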
Multilingual Code Generation
The datasets can be skewed towards a few languages (e.g., Python). How can we extend what is learned in one language to every language?
- Have one model specialized for each language
- Identify instructions that one model handles very well while others struggle. Use that model to create test cases and train the other models to improve on that instruction. Finally, identify knowledge gaps across languages and fix them.
Beam Search
Maintains multiple candidate sequences concurrently.
- Start with a set of “beams” (initial sequences)
- At each step, expand every beam by considering the top candidate tokens
- Keep only the top ‘b’ beams based on cumulative probabilities
Pros:
- Explores multiple paths, so it can find better sequences than greedy decoding
Cons:
- Computationally expensive
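A minimal sketch of the procedure above; `step(tokens) -> dict[token, logprob]` is a placeholder for one decoder forward pass, and real implementations batch the beams and work with log-probabilities for numerical stability.

```python
import heapq

def beam_search(step, start: list[int], eos: int, beam_width: int = 4, max_len: int = 50) -> list[int]:
    """Keep the `beam_width` sequences with the highest cumulative log-probability at each step."""
    beams = [(0.0, start)]                              # (cumulative log-prob, tokens)
    for _ in range(max_len):
        candidates = []
        for logp, tokens in beams:
            if tokens and tokens[-1] == eos:            # finished beams carry over unchanged
                candidates.append((logp, tokens))
                continue
            for tok, tok_logp in step(tokens).items():  # expand each beam with candidate tokens
                candidates.append((logp + tok_logp, tokens + [tok]))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        if all(t and t[-1] == eos for _, t in beams):   # stop when every beam has ended
            break
    return max(beams, key=lambda c: c[0])[1]
```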