After testing Mistral-Instruct and Zephyr, I decided to start figuring out more ways to integrate them into my workflow. I'm running some unit tests now and noting down my observations over multiple iterations. Sharing my current list:

  • give clean and specific instructions (in a direct, authoritative tone -- like “do this” or “do that”)
  • If you’re using ChatGPT to generate or improve prompts, read the generated prompt carefully and remove any unnecessary phrases. ChatGPT can get very wordy, and may inject phrases into the prompt that nudge your LLM into responding in a ChatGPT-esque manner. Smaller models are more “literal” than larger ones and can’t generalize as well: if you have “delve” in the prompt, you’re more likely to get a “delving” in the completion.
  • be careful with adjectives -- you can ask for a concise explanation, and the model may throw the word “concise” into the explanation itself. Smaller models tend to do this a lot (although GPT-3.5 is also guilty of it) -- words from your instruction bleed into the completion, whether they’re relevant or not.
  • use delimiters to indicate distinct parts of the text -- backticks, brackets, etc. Backticks are great for marking out code, because that’s the convention most websites follow.
  • use markdown to indicate different parts of the prompt -- I’ve found this to be the most reliable way to segregate different sections of the prompt.
  • markdown tends to be the preferred format in training data for these models, so it makes sense that it’s effective at inference time as well.
  • use structured input and output formats: JSON, markdown, HTML etc
  • constrain output using JSON schema
  • Use few-shot examples from different niches/use cases. Try to avoid few-shot examples in the same niche/use case as the question you’re trying to answer; that leads to answers that “overfit”.
  • Make the model “explain” its reasoning process through output tokens (chain-of-thought). This is especially useful in prompts where you’re asking the model to do some reasoning; chain-of-thought is basically procedural reasoning. To teach chain-of-thought to the model you either give it few-shot prompts or fine-tune it. Few-shot is obviously cheaper in the short run, but fine-tune for production. Few-shot is also a way to rein in base models and reduce their randomness. (Note: ChatGPT seems to do chain-of-thought all on its own, and has evidently been extensively fine-tuned for it.)
  • break down your prompt into steps, and “teach” the model each step through few-shot examples. Assume that, given enough repetitions, it’ll always make a mistake; this assumption will help you set up the necessary guardrails.
  • use “description before completion” methods: get the LLM to describe the entities in the text before it gives an answer. ChatGPT is able to do this natively, and must have been fine-tuned for it. For smaller models, this means your prompt must include a chain-of-thought (or you can use a chain of prompts) that first extracts the entities of the question, then describes them, then answers the question. Be careful with this: sometimes the model will put chunks of the description into its response, so run multiple unit tests.
  • Small models are extremely good at interpolation, and extremely bad at extrapolation (when they haven’t been given a context).
  • Direct the model towards the answer you want, give it enough context.
  • at the same time, you can’t always be sure which parts of the context the LLM will use, so give it only essential context -- dumping multiple unstructured paragraphs of context into the prompt may not get you what you want.
  • This has been my main issue with RAG + small models -- the model doesn’t always know which parts of the context are most relevant. I’m experimenting with “chain-of-density” to compress the RAG context before putting it into the LLM prompt… let’s see how that works out.
  • Test each prompt multiple times. Sometimes the model won’t falter for 20 generations, and then during an integration test it’ll spit out something you never expected.
  • E.g.: you prompt the model to generate a description based on a given JSON string. Let’s say the JSON string has the keys “name”, “gender”, “location”, “occupation”, and “hobbies”.
  • Sometimes, the LLM will respond with a perfectly valid description: “John is a designer based in New York City, and he enjoys sports and video games.”
  • Other times, you’ll get: “The object may be described as having the name “John”, the gender “Male”, the location “New York City”, the occupation “designer”, and the hobbies “sports” and “video games”.”
  • At one level, this is perfectly “logical” -- the model is technically following instructions -- but it’s also not an output you want to pass on to the next prompt in your chain. You could run verification on every completion, but that adds to the cost/time.
  • Completion ranking and reasoning: I haven’t yet come across an open-source model that can do this well, and am still using the OpenAI API for it.
  • Things like ranking 3 completions based on their “relevance”, “clarity”, or “coherence” -- these are complex tasks, and, for the time being, they seem out of reach for even the largest open models I’ve tried (Llama 2, Falcon 180B).
  • The only way to do this may be to get a ranking dataset out of GPT-4 and then fine-tune an open-source model on it. I haven’t worked this out yet; just going to use GPT-4 for now.
  • Use stories. This is a great way to control the output of a base model. I was trying to get a base model to give me JSON output, so I wrote a short story about a guy named Bob who makes an API endpoint for XYZ use case, tests it, and sees that the HTTP response body contains the JSON string … (and I let the model complete it, with “}” as the stop sequence).
  • GBNF grammars to constrain output. Just found out about this, testing it out now.
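For reference, this is roughly what a GBNF grammar looks like in llama.cpp’s format -- a sketch I haven’t battle-tested, constraining output to a one-key JSON object:

```
root   ::= "{" ws "\"name\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z ]* "\""
ws     ::= [ \t\n]*
```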
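To make the delimiter/markdown points concrete, here’s a minimal sketch in Python of how I assemble a prompt with markdown headers. The section names and wording are just illustrative, not a fixed recipe:

```python
def build_prompt(instruction: str, context: str, question: str) -> str:
    """Assemble a prompt whose sections are separated with markdown
    headers, so a small model can tell instruction, context, and
    question apart."""
    return (
        f"## Instruction\n{instruction}\n\n"
        f"## Context\n{context}\n\n"
        f"## Question\n{question}\n"
    )

prompt = build_prompt(
    instruction="Answer using only the context below.",
    context="The `fetch_user` endpoint returns a JSON object.",
    question="What does `fetch_user` return?",
)
```

The backticks inside the context do double duty here, marking out the code identifier the same way most websites would.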
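For the chain-of-thought bullet, a few-shot prompt can be as simple as the sketch below. The arithmetic examples are invented for illustration; the point is the repeated Question → Reasoning → Answer shape, which the model then continues:

```python
# Few-shot chain-of-thought: each example shows the reasoning before
# the answer, and the prompt ends right where the model should start
# reasoning about the new question.
FEW_SHOT = """\
Q: A shelf holds 3 boxes with 4 books each. How many books?
Reasoning: 3 boxes x 4 books = 12 books.
A: 12

Q: Sam buys 2 pens at $3 each and 1 pad at $5. Total cost?
Reasoning: 2 x $3 = $6, plus $5 = $11.
A: $11

Q: {question}
Reasoning:"""

prompt = FEW_SHOT.format(
    question="A train has 5 cars with 20 seats each. How many seats?"
)
```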
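Here’s a sketch of the kind of repeated-generation unit test I mean for the JSON-description example. `generate` is a stand-in for whatever completion call you use, and the leak check (quoted field values, braces) is just one heuristic among many:

```python
def valid_description(completion: str, record: dict) -> bool:
    """Reject completions that echo the JSON structure (braces, quoted
    field values) instead of describing the record in plain prose."""
    leaked = any(f'"{value}"' in completion for value in record.values())
    return bool(completion.strip()) and "{" not in completion and not leaked

def stress_test(generate, record: dict, runs: int = 20) -> list:
    """Run the same prompt many times and collect the failures."""
    return [out for out in (generate(record) for _ in range(runs))
            if not valid_description(out, record)]

record = {"name": "John", "occupation": "designer"}
good = "John is a designer based in New York City."
bad = 'The object has the name "John" and the occupation "designer".'
```

Running `stress_test` with a real model call catches the occasional structure-echoing completion before it reaches the next prompt in the chain.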
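The story trick, sketched in Python. The story text and the fake completion here are invented; in practice the completion would come from your base model with “}” as the stop sequence:

```python
import json

# The story primes a base model to "continue" realistic JSON. The
# prompt ends mid-object, and generation stops at the "}" stop
# sequence, which gets swallowed and has to be added back.
story = (
    "Bob wrote an API endpoint that returns a user profile as JSON. "
    "He tested it with curl, and the HTTP response body was:\n"
    '{"name":'
)

# Stand-in for the model's continuation of the story:
fake_completion = ' "Bob", "role": "developer"'

json_str = '{"name":' + fake_completion + "}"
profile = json.loads(json_str)
```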

Some of these may sound pretty obvious, but I like having a list that I can run through whenever I’m troubleshooting a prompt.

Copied from here