
Improve response quality with Gemini. Improve evaluation harness #1150


Conversation

@debanjum (Member) commented Apr 4, 2025

Improve Gemini usage

  • Allow the text tool to give the agent the ability to terminate research
  • Set a default context window for Gemini 2 flash models
    Double the context window for small commercial models to 120K
  • Default the temperature of Gemini models to 1.0 to reduce repetition (see the sketch below)
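
A minimal sketch of the temperature change, with hypothetical names rather than the actual Khoj code, assuming the default only applies when no temperature is configured:

```python
# Hypothetical helper: fall back to 1.0, the AI Studio default for
# non-thinking Gemini models, when no temperature is configured.
DEFAULT_GEMINI_TEMPERATURE = 1.0

def resolve_temperature(model_name: str, configured: float | None) -> float:
    if configured is not None:
        return configured
    if model_name.startswith("gemini"):
        return DEFAULT_GEMINI_TEMPERATURE
    return 0.2  # illustrative default for other model families
```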

Improve evaluation harness

  • Add more knobs to control the eval workflow
    • Allow running the eval with any chat model served over an OpenAI-compatible API
    • Control random sampling from the eval set
    • Toggle automatic web page reading
  • Use embedded Postgres instead of a Postgres server for the eval workflow
  • Use Gemini 2.0 Flash as the evaluator. Set a seed for the evaluator to reduce decision variance (see the sketch below)
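
A minimal sketch of how the evaluator could be called through an OpenAI-compatible endpoint with a fixed seed. The client is the official openai Python package; the env var names and prompt are illustrative, not the actual eval harness:

```python
import os
from openai import OpenAI

# Any OpenAI-compatible endpoint works; env var names are illustrative.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.getenv("OPENAI_BASE_URL"),
)

def grade(question: str, answer: str) -> str:
    response = client.chat.completions.create(
        model=os.getenv("KHOJ_EVAL_MODEL", "gemini-2.0-flash"),  # hypothetical env var
        seed=42,           # fixed seed to reduce evaluator decision variance
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": f"Grade this answer.\nQ: {question}\nA: {answer}",
        }],
    )
    return response.choices[0].message.content
```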

@debanjum debanjum force-pushed the reduce-repetition-by-gemini-and-add-more-knobs-to-evals-workflow branch from 6be6107 to 9156a92 on April 4, 2025 14:29
debanjum added 8 commits April 4, 2025 20:11
…iance.

The Gemini 2.0 Flash model is cheaper and better than Gemini 1.5 Pro.
- Control auto-read of web pages via the eval workflow. Prefix the env var with KHOJ_.
  Default to false, since that is the setting that will be used in prod
  going forward.
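
A minimal sketch of the env var toggle; the exact variable name is hypothetical, only the KHOJ_ prefix and the false default come from the commit:

```python
import os

def auto_read_webpage_enabled() -> bool:
    # Hypothetical variable name; defaults to False when unset.
    return os.getenv("KHOJ_AUTO_READ_WEBPAGE", "false").lower() in ("1", "true", "yes")
```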

- Set the OpenAI API key via an input param in manual eval workflow runs.
  - Simplify evaluating other chat models available over an OpenAI-compatible
    API via the eval workflow.
  - Mask the input API key as a secret in the workflow.
  - Discard unnecessary null-setting of env vars.
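
A minimal sketch of masking the key in workflow logs. `::add-mask::` is a standard GitHub Actions workflow command; how the dispatch input reaches the env var is an assumption here:

```python
import os

# Assumes the workflow passes the dispatch input to this step as an env var.
api_key = os.environ.get("OPENAI_API_KEY", "")
if api_key:
    # GitHub Actions redacts any later occurrence of this value in the logs.
    print(f"::add-mask::{api_key}")
```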

- Control randomization of samples in the eval workflow.
  If randomization is turned off, the workflow takes the first SAMPLE_SIZE
  items from the eval dataset instead of a random collection of
  SAMPLE_SIZE items.
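
A minimal sketch of the sampling knob; names are illustrative:

```python
import random

def select_samples(dataset: list, sample_size: int, randomize: bool) -> list:
    # Random sample when randomization is on; first SAMPLE_SIZE items otherwise.
    if randomize:
        return random.sample(dataset, min(sample_size, len(dataset)))
    return dataset[:sample_size]
```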
This is the default temperature for non-thinking Gemini models on AI
Studio. See if using it alleviates the repetition problem.
The queries field name in the first example isn't wrapped in double
quotes; the rest are.
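
An illustration of the inconsistency being fixed; the example content is made up, only the quoting of the queries field name matters:

```python
# Before: field name unquoted in the first few-shot example.
example_before = '{queries: ["best pizza nearby"]}'
# After: quoted consistently with the other examples.
example_after = '{"queries": ["best pizza nearby"]}'
```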
Make the research planner consistently select the tool before the query,
as the model should tune its query for the selected tool. It already has
space in the scratchpad to think about which tool to use.
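
A minimal sketch of the field ordering, using a hypothetical pydantic model rather than the actual planner schema; listing tool before query makes the model commit to a tool first, then write a query tuned to it:

```python
from pydantic import BaseModel

class PlannerStep(BaseModel):
    scratchpad: str  # room to reason about which tool to use
    tool: str        # selected tool, decided before the query is written
    query: str       # query phrased for the selected tool
```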
We'd previously restricted the research planner to tools in the schema's
enum. This enum enforcement prevented the model from terminating research
by setting the tool field to empty.

Fix the issue by adding a text tool to the research tools enum and telling
the model to use it to terminate research and start its response instead.
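
A minimal sketch of the fix, with hypothetical tool names; the point is that a text tool in the enum gives the model a schema-legal way to stop researching:

```python
from enum import Enum

class ResearchTool(str, Enum):
    NOTES = "notes"
    ONLINE = "online"
    WEBPAGE = "webpage"
    CODE = "code"
    TEXT = "text"  # selecting this terminates research and starts the response

def should_terminate(selected_tool: ResearchTool) -> bool:
    return selected_tool == ResearchTool.TEXT
```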
…odels

Previously, Gemini 2 Flash and Flash Lite were using a 10K context window
by default, as no defaults had been added for them.

Increase the default context window for small commercial models from 60K
to 120K, since they are cheaper and faster than their pro-model equivalents
at 60K context.
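
A minimal sketch of per-model context window defaults; the numbers come from the PR description, while the structure and names are assumptions:

```python
DEFAULT_CONTEXT_WINDOW = {
    "gemini-2.0-flash": 120_000,       # previously fell back to 10K with no entry
    "gemini-2.0-flash-lite": 120_000,  # small commercial default doubled from 60K
}
FALLBACK_CONTEXT_WINDOW = 10_000

def context_window_for(model_name: str) -> int:
    return DEFAULT_CONTEXT_WINDOW.get(model_name, FALLBACK_CONTEXT_WINDOW)
```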
@debanjum debanjum force-pushed the reduce-repetition-by-gemini-and-add-more-knobs-to-evals-workflow branch from 9156a92 to 7f18bc0 on April 4, 2025 14:41
@debanjum debanjum added the upgrade New feature or request label Apr 4, 2025
@debanjum debanjum changed the title from "Improve research quality with Gemini. Improve evaluation harness" to "Improve response quality with Gemini. Improve evaluation harness" on Apr 4, 2025
@debanjum debanjum merged commit 751215a into master Apr 4, 2025
10 checks passed
@debanjum debanjum deleted the reduce-repetition-by-gemini-and-add-more-knobs-to-evals-workflow branch April 4, 2025 14:48