We have a fixed vocabulary or ontology and would like to match the output of an LLM to this vocabulary.
The LLM output could be a single term, a list of terms, or free-form text. For example, the vocabulary is a list of occupations, and we would like to ask "provide a list of occupations for which this skill is useful".
The solution could either be a post-processing step (which would make it useful for non-LLM texts, too), or somehow integrated into the generation.
- For a small vocabulary, I could include it in the prompt. However, the vocabulary is likely to be very large. The identifiers alone can also be ambiguous, like "Doctor" (of medicine or of physics).
- One thing I tried was to compute an embedding for each term in the ontology and for the term I would like to match, and then pick the ontology term closest to it. This works for single words or lists, but not very well for free-form text. I did manage to work around that by asking the LLM to mark occupation terms with "[]". It also doesn't work well when the word alone doesn't completely define the term.
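
For reference, the embedding approach looks roughly like the sketch below. Everything in it is illustrative: the four-term `VOCABULARY` is made up, and `embed` is a toy character-trigram vector that only stands in for a real embedding model so the example runs without dependencies. The "[]" workaround is the regular expression in `match_marked_terms`.

```python
import math
import re
from collections import Counter

# Hypothetical vocabulary; the real one would be much larger.
VOCABULARY = ["physician", "physicist", "software developer", "nurse"]


def embed(text):
    """Toy embedding: character-trigram counts (stand-in for a real model)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Embed every vocabulary term once, up front.
VOCAB_VECTORS = {term: embed(term) for term in VOCABULARY}


def match_term(candidate):
    """Return the vocabulary term whose embedding is closest to the candidate."""
    query = embed(candidate)
    return max(VOCABULARY, key=lambda term: cosine(query, VOCAB_VECTORS[term]))


def match_marked_terms(text):
    """Match every term the LLM wrapped in square brackets."""
    return [match_term(m) for m in re.findall(r"\[([^\[\]]+)\]", text)]


print(match_marked_terms("This skill helps [nurses] and [software developers]."))
```

Note that `match_term` always returns some nearest term, even for an input like "Doctor" where the word alone does not say which entry is meant. That is exactly the ambiguity problem described above.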
Is there any standard way to restrict or match the output of an LLM to a fixed vocabulary?