Tool Use with Open-Source LLMs

Rick Lamers

Groq

About Me

  • Rick Lamers - AI Engineer & Researcher at Groq
  • Previous life: founder of Orchest, an Apache Airflow (ETL flows) alternative
  • Expertise: Machine Learning, Data Engineering, Full Stack
  • Notable open source projects:
    • Shell-AI (968 ⭐)
    • GPT-Code UI (3,518 ⭐)
    • Orchest (4,031 ⭐)
    • Grid Studio (8,900 ⭐)
  • Background: Studied Computer Science at TU Delft

Why tool use?

Mistake

Logic? ✅

Circuit

LLMs + Logic

Toolformer paper

Tools Schema

ISO 3166-2

Generated Tool Call

[...]
Powered by
Groq

Interlude: tool use vs structured outputs




Structured Output

AKA JSON Mode + JSON Schema

  • Adheres to a predefined schema
  • Can be easily parsed
  • Consistent format
  • Suitable for direct consumption by programs

Tool Use Adds

  • Intent detection
  • Multi-step planning
  • Gathering complete information before acting
  • Interpreting structured tool use outputs
  • Context-aware decision making


Tool Use = Agents?

Tool Use with Open Source LLMs

When considering open-source LLMs for tool use, one question decides which of two high-level paths to take:

  • Does the model natively support tool use?
    • If yes:
      • Use the model's built-in tool use capabilities
      • Example: Mixtral 8x22B
    • If no:
      • Implement tool use through a combination of prompt engineering, fine-tuning, and constrained decoding
      • Examples: Llama 3, Qwen 2, DeepSeek-Coder V2, DeepSeek-V2, ...

Mixtral 8x22B Tool Use

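Mixtral 8x22B has dedicated control tokens for tool use (for example [AVAILABLE_TOOLS], [TOOL_CALLS], and [TOOL_RESULTS]), so tool definitions, calls, and results are delimited unambiguously instead of relying on ad hoc text conventions.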

Mixtral 8x22B Tokenization

Tool Use for Models Without Native Support

For models that don't natively support function calling, we can implement it through prompt engineering. Here's a simplified example of a system prompt:

You are an AI assistant capable of using tools. When you need to use a tool, respond with a JSON object in this format:

<tool_calls>
[
  {
    "id": "pending",
    "type": "function",
    "function": {
      "name": "function_name"
    },
    "arguments": {
      "arg1": "value1",
      "arg2": "value2"
    }
  }
]
</tool_calls>

Available tools are defined as follows:

<available_tools>
[
  {
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {
          "type": "string",
          "description": "The city and state, e.g. San Francisco, CA"
        },
        "unit": {
          "type": "string",
          "enum": ["celsius", "fahrenheit"]
        }
      },
      "required": ["location"]
    }
  }
]
</available_tools>

Instructions:
- Provide all required parameters, even if you're unsure of the value.
- Don't use tools if they're not needed; respond directly in those cases.

Either use a tool as instructed above or reply with text to answer the user's question.
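
A minimal sketch of how a client might consume replies produced under a prompt like the one above. The helper name parse_tool_calls and the sample reply are illustrative assumptions, not part of the talk; the sketch only assumes the model wraps a JSON array in <tool_calls> tags as instructed.

```python
import json
import re

# Matches the <tool_calls>...</tool_calls> wrapper requested in the system prompt.
TOOL_CALLS_RE = re.compile(r"<tool_calls>\s*(.*?)\s*</tool_calls>", re.DOTALL)


def parse_tool_calls(reply: str) -> list[dict]:
    """Return the tool calls found in a model reply, or [] for a plain-text answer."""
    match = TOOL_CALLS_RE.search(reply)
    if match is None:
        return []  # the model chose to answer directly
    try:
        calls = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []  # malformed JSON; a real client might retry or re-prompt
    return calls if isinstance(calls, list) else [calls]


# Hypothetical model reply, hand-written for illustration.
reply = """<tool_calls>
[{"id": "pending", "type": "function",
  "function": {"name": "get_current_weather"},
  "arguments": {"location": "Boston, MA", "unit": "celsius"}}]
</tool_calls>"""

calls = parse_tool_calls(reply)
print(calls[0]["function"]["name"], calls[0]["arguments"])
```

Prompt-only tool use gives no guarantee that the JSON is well formed, which is why the sketch treats a parse failure as "no tool call" instead of raising.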

Beyond Prompting: Constrained Decoding for Tool Use

[Image: constrained decoding example, Outlines logo]

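Constrained decoding restricts sampling at each step to tokens that keep the output valid under a given grammar or schema, so every tool call adheres to the format specified in the system prompt. This improves reliability and consistency in tool usage.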
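
To make the idea concrete, here is a toy sketch of constrained greedy decoding. The vocabulary, the allowed outputs, and the fake_score function standing in for model logits are all invented for illustration; libraries such as Outlines apply the same masking idea to real tokenizers and to regex or JSON Schema constraints.

```python
# Toy vocabulary and the only outputs the "grammar" permits (tool names here).
VOCAB = ["get_", "set_", "current_", "weather", "time", "hello", "<eos>"]
ALLOWED = {"get_current_weather", "get_current_time"}


def is_valid_prefix(text: str) -> bool:
    """True if text can still be extended into one of the allowed outputs."""
    return any(option.startswith(text) for option in ALLOWED)


def constrained_greedy_decode(score, max_steps: int = 10) -> str:
    """Greedy decoding that masks every token the constraint rules out."""
    text = ""
    for _ in range(max_steps):
        candidates = []
        for token in VOCAB:
            if token == "<eos>":
                ok = text in ALLOWED  # may only stop on a complete output
            else:
                ok = is_valid_prefix(text + token)
            if ok:
                candidates.append(token)
        if not candidates:
            break  # the vocabulary cannot extend the text any further
        best = max(candidates, key=lambda tok: score(text, tok))
        if best == "<eos>":
            break
        text += best
    return text


def fake_score(text: str, token: str) -> float:
    """Stand-in for model logits; it prefers invalid tokens on purpose."""
    preferences = {"hello": 5.0, "set_": 4.0, "time": 3.0, "get_": 2.0,
                   "current_": 1.0, "weather": 0.5, "<eos>": 0.1}
    return preferences[token]


print(constrained_greedy_decode(fake_score))  # prints get_current_time
```

Although the fake model scores "hello" and "set_" highest, those tokens are masked at every step, so the output is always one of the allowed tool names.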

Beyond Prompting: Fine-Tuning for Tool Use

  • Training style choices
    • Full Fine-tuning
    • LoRA (Low-Rank Adaptation)
  • Fine-tuning
    • SFT (Supervised Fine-Tuning)
  • Human Feedback
    • PPO (Proximal Policy Optimization)
    • DPO (Direct Preference Optimization)
    • KTO (Kahneman-Tversky Optimization)
    • ORPO (Odds Ratio Preference Optimization)
    • SLiC (Sequence Likelihood Calibration)
    • ...
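
Of the preference methods listed above, DPO has a particularly compact objective, sketched below for a single preference pair. The function and argument names are assumptions made for this example; in a tool-use setting the "chosen" completion could be a correct tool call and the "rejected" one an incorrect call, and the inputs are summed token log-probabilities under the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of completions."""
    # How much more the policy likes each completion than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as log(1 + exp(-x)).
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: loss equals log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy shifted toward the chosen completion: loss is lower.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

This is a numerical sketch only; a training implementation would compute the same quantity over batches of tensors and use a numerically stable log-sigmoid.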

Beyond Prompting: Fine-Tuning for Tool Use

  • Post training choices
    • Model merging
      • SLERP
      • DARE
      • TIES
      • See Maxime Labonne's talk at 2.40pm (today)
    • Model quantization
      • INT4/8 quantization
      • GPTQ
      • AWQ (Activation-aware Weight Quantization)
      • SqueezeLLM
  • Data challenges
    • Data quality and relevance
    • Data scarcity
    • Bias in training data

Berkeley Function Calling Leaderboard (BFCL)

Berkeley Function Calling Leaderboard

@inproceedings{berkeley-function-calling-leaderboard,
  title={Berkeley Function Calling Leaderboard},
  author={Fanjia Yan and Huanzhi Mao and Charlie Cheng-Jie Ji and Tianjun Zhang and Shishir G. Patil and Ion Stoica and Joseph E. Gonzalez},
  year={2024},
  howpublished={\url{https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html}},
}

Dataset (BFCL)

Berkeley Function Calling Leaderboard Dataset

Example: Triangle Area Calculation (BFCL)


{
  "question": "Find the area of a triangle with a base of 10 units and height of 5 units.",
  "function": {
    "name": "calculate_triangle_area",
    "description": "Calculate the area of a triangle given its base and height.",
    "parameters": {
      "type": "dict",
      "properties": {
        "base": {
          "type": "integer",
          "description": "The base of the triangle."
        },
        "height": {
          "type": "integer",
          "description": "The height of the triangle."
        },
        "unit": {
          "type": "string",
          "description": "The unit of measure (defaults to 'units' if not specified)"
        }
      },
      "required": ["base", "height"]
    }
  }
}
                

{
  "calculate_triangle_area": {
    "base": [10],
    "height": [5],
    "unit": ["units", ""]
  }
}
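
The second object above lists the accepted values per parameter. A rough sketch of how a model's call could be checked against it follows. The function name and the call format are hypothetical, and the rule that an empty string in the accepted list means "this parameter may be omitted" is an assumption about the answer format, not the official BFCL checker.

```python
def matches_possible_answer(call: dict, possible: dict) -> bool:
    """Check one model call against a {function: {param: [accepted values]}} answer."""
    if call["name"] not in possible:
        return False
    accepted = possible[call["name"]]
    args = call["arguments"]
    if any(name not in accepted for name in args):
        return False  # the model invented a parameter
    for name, values in accepted.items():
        if name not in args:
            if "" not in values:
                return False  # parameter is missing and not marked optional
        elif args[name] not in values:
            return False  # value is not among the accepted ones
    return True


possible = {
    "calculate_triangle_area": {"base": [10], "height": [5], "unit": ["units", ""]}
}
call = {"name": "calculate_triangle_area", "arguments": {"base": 10, "height": 5}}
print(matches_possible_answer(call, possible))  # prints True
```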
                

Example: Parallel Multiple Functions (BFCL)


{
  "question": "What is the mean of the following numbers: 5, 10, 15, 20, and 25, and can you also tell me the timezone of the coordinate with longitude '120.97388' and latitude '14.6042'?",
  "function": [
    {
      "name": "get_time_zone_by_coord",
      "description": "Finds the timezone of a coordinate.",
      "parameters": {
        "type": "dict",
        "properties": {
          "long": {
            "type": "string",
            "description": "The longitude of the coordinate."
          },
          "lat": {
            "type": "string",
            "description": "The latitude of the coordinate."
          }
        },
        "required": ["long", "lat"]
      }
    },
    {
      "name": "calculate_mean",
      "description": "Calculates the mean of a list of numbers.",
      "parameters": {
        "type": "dict",
        "properties": {
          "numbers": {
            "type": "array",
            "items": {
              "type": "float"
            },
            "description": "The list of numbers."
          }
        },
        "required": ["numbers"]
      }
    }
  ],
  "execution_result": [15.0, "Asia/Manila"],
  "execution_result_type": ["exact_match", "exact_match"]
}
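
For executable categories like this one, the model's calls are run and the results compared with execution_result. The sketch below shows that idea under several assumptions: the call format is invented, get_time_zone_by_coord is a stub backed by a one-entry lookup table instead of a real timezone service, and results are compared by exact match in any order.

```python
def calculate_mean(numbers):
    """Mean of a list of numbers, always returned as a float."""
    return sum(numbers) / len(numbers)


def get_time_zone_by_coord(long, lat):
    """Stub: a real implementation would query a timezone service."""
    known = {("120.97388", "14.6042"): "Asia/Manila"}
    return known[(long, lat)]


TOOLS = {
    "calculate_mean": calculate_mean,
    "get_time_zone_by_coord": get_time_zone_by_coord,
}

# Hypothetical parallel calls emitted by a model for the question above.
calls = [
    {"name": "calculate_mean", "arguments": {"numbers": [5, 10, 15, 20, 25]}},
    {"name": "get_time_zone_by_coord",
     "arguments": {"long": "120.97388", "lat": "14.6042"}},
]


def exact_match_any_order(results, expected):
    """True if results and expected contain the same values, ignoring order."""
    remaining = list(expected)
    for value in results:
        if value not in remaining:
            return False
        remaining.remove(value)
    return not remaining


results = [TOOLS[c["name"]](**c["arguments"]) for c in calls]
print(results, exact_match_any_order(results, [15.0, "Asia/Manila"]))
```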
                

Beyond BFCL: forced sequential

Demonstrating multiple tools that must be called in a forced sequential order, because each call needs the result of the previous one.


[
  {
    "name": "multiply",
    "description": "Multiplies two numbers",
    "parameters": {
      "type": "object",
      "properties": {
        "a": {
          "type": "number",
          "description": "The first number to multiply"
        },
        "b": {
          "type": "number",
          "description": "The second number to multiply"
        }
      },
      "required": ["a", "b"]
    }
  },
  {
    "name": "add",
    "description": "Adds two numbers",
    "parameters": {
      "type": "object",
      "properties": {
        "a": {
          "type": "number",
          "description": "The first number to add"
        },
        "b": {
          "type": "number",
          "description": "The second number to add"
        }
      },
      "required": ["a", "b"]
    }
  },
  {
    "name": "exponentiate",
    "description": "Raises a number to a power",
    "parameters": {
      "type": "object",
      "properties": {
        "base": {
          "type": "number",
          "description": "The base number"
        },
        "exponent": {
          "type": "number",
          "description": "The exponent"
        }
      },
      "required": ["base", "exponent"]
    }
  }
]
                

Prompt: What is 8 times 6 to the 5th power plus 9?

Challenge: How to deal with latency of multiple round trips? (Hint: server side tools)
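
One possible reading of the hint, sketched below: if the tools run next to the model, a chain of dependent calls can be executed without a client round trip per step. The plan format, including the {"ref": i} placeholder that points at the result of an earlier step, is invented for this example, and the prompt is read as 8 × 6^5 + 9.

```python
TOOLS = {
    "multiply": lambda a, b: a * b,
    "add": lambda a, b: a + b,
    "exponentiate": lambda base, exponent: base ** exponent,
}

# Hypothetical plan for "What is 8 times 6 to the 5th power plus 9?".
plan = [
    {"name": "exponentiate", "arguments": {"base": 6, "exponent": 5}},
    {"name": "multiply", "arguments": {"a": 8, "b": {"ref": 0}}},
    {"name": "add", "arguments": {"a": {"ref": 1}, "b": 9}},
]


def run_plan(plan):
    """Execute the steps in order, substituting earlier results for references."""
    results = []
    for step in plan:
        args = {
            key: results[value["ref"]] if isinstance(value, dict) else value
            for key, value in step["arguments"].items()
        }
        results.append(TOOLS[step["name"]](**args))
    return results


print(run_plan(plan))  # prints [7776, 62208, 62217]
```

The same loop works when the model emits one call at a time; the difference is only whether each intermediate result has to travel back to the client before the next call is issued.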

Example: Relevance Detection (BFCL)

Assessing whether a model can correctly identify when a given function is not relevant to the user's query.


{
  "question": "Calculate the volume of the sphere with radius 3 units.",
  "function": {
    "name": "calculate_park_area",
    "description": "Calculate the total area of a park based on the radius of its circular part.",
    "parameters": {
      "type": "dict",
      "properties": {
        "radius": {
          "type": "float",
          "description": "The radius of the circular part of the park."
        },
        "units": {
          "type": "string",
          "description": "The units of the radius."
        },
        "shape": {
          "type": "string",
          "description": "The shape of the park. Default is 'circle'."
        }
      },
      "required": ["radius", "units"]
    }
  }
}
                

Beyond BFCL: follow-ups

Measuring when a model needs to ask a follow-up question. This can occur when the user's initial prompt doesn't provide enough information to use a given tool.


{
  "name": "book_flight",
  "description": "Books a flight based on user preferences",
  "parameters": {
    "type": "object",
    "properties": {
      "departure_city": {
        "type": "string",
        "description": "The city the user is departing from"
      },
      "arrival_city": {
        "type": "string",
        "description": "The city the user is traveling to"
      },
      "departure_date": {
        "type": "string",
        "description": "The date the user wants to depart (YYYY-MM-DD format)"
      },
      "return_date": {
        "type": "string",
        "description": "The date the user wants to return, if applicable (YYYY-MM-DD format)"
      },
      "num_passengers": {
        "type": "integer",
        "description": "The number of passengers traveling"
      },
      "class": {
        "type": "string",
        "enum": ["economy", "business", "first"],
        "description": "The class of travel"
      }
    },
    "required": ["departure_city", "arrival_city", "departure_date", "num_passengers", "class"]
  }
}
                

Prompt: Book me a flight to New York next month

Missing: departure_city, num_passengers, class

Desired response: Certainly! I'd be happy to help you book a flight to New York next month. To proceed with the booking, I'll need a few more details: 1. What is your departure city? 2. How many passengers will be traveling? 3. What class of travel would you prefer (economy, business, or first)? Once you provide this information, I'll be able to search for the best flight options for you.
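
The "Missing" line above can be derived mechanically from the schema's required list. A small sketch follows; the helper names and the provided dictionary (what a model might extract from the prompt) are assumptions for illustration, and the schema is abbreviated to the fields the sketch needs.

```python
# Abbreviated copy of the book_flight schema above.
BOOK_FLIGHT = {
    "name": "book_flight",
    "parameters": {
        "required": ["departure_city", "arrival_city", "departure_date",
                     "num_passengers", "class"],
    },
}


def missing_required(tool: dict, provided: dict) -> list[str]:
    """Required parameters the user has not supplied yet, in schema order."""
    return [name for name in tool["parameters"]["required"] if name not in provided]


def follow_up_question(tool: dict, provided: dict):
    """Return a follow-up question, or None when the tool can be called."""
    missing = missing_required(tool, provided)
    if not missing:
        return None
    return "To call " + tool["name"] + " I still need: " + ", ".join(missing)


# What "Book me a flight to New York next month" supplies.
provided = {"arrival_city": "New York", "departure_date": "next month"}
print(missing_required(BOOK_FLIGHT, provided))
# prints ['departure_city', 'num_passengers', 'class']
```

A benchmark for follow-ups could compare the parameters a model asks about with this list instead of matching the wording of its question.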

Explicit modeling of inputs and outputs

TypeChat

https://github.com/microsoft/TypeChat

export type API = {
    add(x: number, y: number): number;
    sub(x: number, y: number): number;
    mul(x: number, y: number): number;
    div(x: number, y: number): number;
    neg(x: number): number;
    id(x: number): number;
    unknown(text: string): number;
}

import { API } from "./schema";
function program(api: API) {
  const step1 = api.mul(2, 3); // -> independent step
  const step2 = api.mul(4, 5); // -> independent step
  return api.add(step1, step2); // -> type safe passing of subresults into add
}

Function Calling as ... code generation?

Idea: standard (stateful, sandboxed, WASM) REPL like Code Interpreter/Artifacts

OpenAI API Reference: a local minimum?

curl https://api.openai.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
  "model": "gpt-4-turbo",
  "messages": [
    {
      "role": "user",
      "content": "What'\''s the weather like in Boston today?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}'
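
For context on what a client does with the result of such a request, here is a sketch of handling a tool call in the response. The response dictionary is hand-written in the shape of the chat completions format (note that arguments arrive as a JSON string), and the id value, the canned weather data, and the helper names are invented for illustration.

```python
import json

# Illustrative response, not real API output.
response = {
    "choices": [{
        "finish_reason": "tool_calls",
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_abc123",
                "type": "function",
                "function": {
                    "name": "get_current_weather",
                    "arguments": "{\"location\": \"Boston, MA\"}",
                },
            }],
        },
    }]
}


def get_current_weather(location, unit="fahrenheit"):
    """Hypothetical local tool; returns canned data instead of calling a weather API."""
    return {"location": location, "temperature": 72, "unit": unit}


TOOLS = {"get_current_weather": get_current_weather}


def tool_messages(response: dict) -> list[dict]:
    """Run each requested tool and build the 'tool' messages to send back."""
    message = response["choices"][0]["message"]
    messages = []
    for call in message.get("tool_calls") or []:
        args = json.loads(call["function"]["arguments"])  # JSON string -> dict
        result = TOOLS[call["function"]["name"]](**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return messages


print(tool_messages(response))
```

Each tool call costs one extra request before the model can produce its final answer, which is part of what the question below about this format is probing.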

This is what everyone is standardizing on, but how desirable is it?

  • Pros: Widespread adoption leads to ecosystem compatibility
  • Cons: May limit innovation in tool use paradigms
  • Question: Are we settling for a suboptimal standard?

Thank You!

Any questions?

Follow me on X @RickLamers
