I work with Gemini on a project that involves large-scale bounding-box detection. Given the scale of the use case, I need structured output with minimal post-processing, so I define the response schema with Pydantic models.
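For reference, my setup looks roughly like this (a minimal sketch using the google-genai SDK; the model name and the field names are simplified placeholders, not my actual schema):

```python
from pathlib import Path

from pydantic import BaseModel
from google import genai
from google.genai import types


class BoundingBox(BaseModel):
    # Illustrative field names; the real schema is more involved.
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int


class Detections(BaseModel):
    boxes: list[BoundingBox]


client = genai.Client()  # expects the API key in the environment

image = types.Part.from_bytes(
    data=Path("example.jpg").read_bytes(), mime_type="image/jpeg"
)

response = client.models.generate_content(
    model="gemini-2.0-flash",  # placeholder model name
    contents=["Return bounding boxes for every object in the image.", image],
    config={
        "response_mime_type": "application/json",
        "response_schema": Detections,
    },
)

# With a Pydantic response_schema, .parsed is already a validated instance,
# which is what keeps post-processing minimal.
detections: Detections = response.parsed
```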
What I notice is that Gemini returns schema-valid responses with a consistency that is unusual for LLMs. Even with fairly complex, conditional, and nested schemas, the output always validates for me. When a schema is only supplied via the prompt I would normally expect occasional structural or type failures, but those do not seem to occur here.
My suspicion is that this reliability is not so much a result of training as a property of how tokens are selected during decoding. It looks as though decoding is constrained in such a way that schema compliance becomes the default path.
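To make concrete what I mean by "constrained during decoding": the idea would be to mask out, at every step, all tokens that would break the schema before sampling. The sketch below is purely illustrative and not how any particular provider implements it; `allowed_next_tokens` is a hypothetical stand-in for a grammar or schema state machine.

```python
import torch


def constrained_decode(model, tokenizer, prompt, allowed_next_tokens, max_new_tokens=256):
    """Greedy decoding where tokens that would violate the schema are masked out.

    `allowed_next_tokens(generated_ids)` is a stand-in for a grammar/schema
    state machine that returns the token ids which keep the partial output
    valid (e.g. derived from a JSON schema compiled to a CFG or FSM).
    """
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(ids).logits[:, -1, :]           # next-token scores
        mask = torch.full_like(logits, float("-inf"))  # disallow everything...
        allowed = list(allowed_next_tokens(ids[0].tolist()))
        mask[:, allowed] = 0.0                         # ...except valid continuations
        next_id = (logits + mask).argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

If something like this runs behind the API, an invalid token simply can never be emitted, which would explain why the responses always validate.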
Has anyone looked into this in more detail? I would be interested in pointers to documentation or prior discussion about how Gemini achieves this behaviour, and whether there are known edge cases where it breaks down.
Pydantic is definitely a good choice when structured responses are required. It also works with many LLM providers, which is quite handy if you have to switch models in the future.
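For example, the validation side stays identical regardless of which model produced the JSON (minimal sketch; `raw` stands in for the text a model returned):

```python
from pydantic import BaseModel, ValidationError


class BoundingBox(BaseModel):
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int


raw = '{"label": "cat", "x_min": 12, "y_min": 8, "x_max": 240, "y_max": 199}'

try:
    box = BoundingBox.model_validate_json(raw)  # same call for any provider's output
except ValidationError as err:
    # If a model ever drifts from the schema, this is where you catch it.
    print(err)
```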
Since I'm writing a paper involving bounding-box detection, I asked myself how Gemini handles structured output so well. During the PoC I sent well over 1,000 images to Gemini, and not once was the response format invalid.
From what I found on the internet, OpenAI solves this by using a context-free grammar to enforce the JSON rules.
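To make the CFG idea concrete, here is a toy grammar for a small JSON subset, written for the `lark` parser purely as an illustration. In actual constrained decoding the grammar is derived from the user's schema and applied token by token during generation, not by parsing the finished output.

```python
from lark import Lark

# A deliberately tiny JSON subset: objects with string keys and
# string/number/object values. A grammar like this defines exactly
# which continuations are legal at any point in the output.
json_subset = r"""
    ?value  : object | STRING | NUMBER
    object  : "{" [pair ("," pair)*] "}"
    pair    : STRING ":" value

    %import common.ESCAPED_STRING -> STRING
    %import common.SIGNED_NUMBER  -> NUMBER
    %import common.WS
    %ignore WS
"""

parser = Lark(json_subset, start="value")

# Parses cleanly, so this output satisfies the grammar:
parser.parse('{"label": "cat", "x_min": 12}')

# This would raise an error (trailing comma), i.e. the grammar rejects it:
# parser.parse('{"label": "cat",}')
```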