
Google Gemini API Integration

Overview

A comprehensive skill for building applications with Google's Gemini API — covering text generation, multimodal inputs (images, video, audio, documents), function calling, structured outputs, system instructions, and streaming. Gemini models offer competitive performance with unique capabilities like native multimodal understanding, large context windows (up to 2M tokens), and built-in Google Search grounding.

When to Use

  • Building applications with Google Gemini models
  • Need native multimodal understanding (images, video, audio, PDFs)
  • Processing long documents (up to 2M token context window)
  • Using function calling / tool use with Gemini
  • Need Google Search grounding for real-time information
  • Building with Google AI Studio or Vertex AI

Quick Start

```shell
# Install
pip install google-genai

# Set API key
export GOOGLE_API_KEY="your-key-here"
```
```python
from google import genai

client = genai.Client()

# Simple text generation
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Explain quantum computing in simple terms",
)
print(response.text)

# With system instruction
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Write a haiku about coding",
    config=genai.types.GenerateContentConfig(
        system_instruction="You are a creative poet who loves technology",
        temperature=0.9,
    ),
)
```
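
The best practices below recommend streaming for long responses. A minimal sketch of the streaming variant, assuming the same client setup as above and a valid GOOGLE_API_KEY in the environment:

```python
from google import genai

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

# generate_content_stream yields partial chunks as they are produced,
# so output can be shown to the user before the full response is done
for chunk in client.models.generate_content_stream(
    model="gemini-2.0-flash",
    contents="Explain quantum computing in simple terms",
):
    print(chunk.text, end="", flush=True)
```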

Multimodal Capabilities

Image Understanding

```python
from google import genai
from google.genai import types

client = genai.Client()

# From URL
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        types.Part.from_uri(file_uri="https://example.com/image.jpg", mime_type="image/jpeg"),
        "Describe what you see in this image",
    ],
)

# From local file
with open("photo.jpg", "rb") as f:
    image_data = f.read()

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        types.Part.from_bytes(data=image_data, mime_type="image/jpeg"),
        "What objects are in this image?",
    ],
)
```

Video and Audio

```python
# Upload video file
video_file = client.files.upload(file="video.mp4")

# Analyze video
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[video_file, "Summarize what happens in this video"],
)

# Audio transcription and analysis
audio_file = client.files.upload(file="recording.mp3")
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[audio_file, "Transcribe and summarize this audio"],
)
```

PDF Document Analysis

```python
# Upload and analyze a PDF
pdf_file = client.files.upload(file="report.pdf")
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[pdf_file, "Extract the key findings from this report"],
)
```

Function Calling

```python
from google.genai import types

# Define tools
tools = [
    types.Tool(function_declarations=[
        types.FunctionDeclaration(
            name="get_weather",
            description="Get current weather for a location",
            parameters=types.Schema(
                type="OBJECT",
                properties={
                    "location": types.Schema(type="STRING", description="City name"),
                    "unit": types.Schema(type="STRING", enum=["celsius", "fahrenheit"]),
                },
                required=["location"],
            ),
        ),
    ]),
]

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="What's the weather in Tokyo?",
    config=types.GenerateContentConfig(tools=tools),
)

# Process function calls in the response
for part in response.candidates[0].content.parts:
    if part.function_call:
        name = part.function_call.name
        args = dict(part.function_call.args)
        print(f"Call: {name}({args})")
```
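
Extracting the call is only half of the loop: the named function still has to be executed locally and its result sent back to the model. A minimal dispatch sketch — the body of get_weather and the HANDLERS registry are hypothetical stand-ins:

```python
# Hypothetical local implementation of the declared get_weather tool
def get_weather(location: str, unit: str = "celsius") -> dict:
    # A real app would query a weather service here
    return {"location": location, "temperature": 22, "unit": unit}

# Registry mapping a function_call.name to a local callable
HANDLERS = {"get_weather": get_weather}

def dispatch(name: str, args: dict) -> dict:
    """Invoke the local function a Gemini function_call refers to."""
    return HANDLERS[name](**args)

result = dispatch("get_weather", {"location": "Tokyo"})
print(result)  # {'location': 'Tokyo', 'temperature': 22, 'unit': 'celsius'}
```

The result can then be returned to the model as a function-response part so it can compose a final natural-language answer.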

Structured Output

```python
from typing import List

from pydantic import BaseModel
from google.genai import types

class Recipe(BaseModel):
    name: str
    ingredients: List[str]
    prep_time_minutes: int
    difficulty: str

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Give me a recipe for chocolate cake",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=Recipe,
    ),
)
recipe = Recipe.model_validate_json(response.text)
```

Model Comparison

| Model | Context | Speed | Cost | Best For |
|---|---|---|---|---|
| Gemini 2.0 Flash | 1M | Fast | Low | General purpose, multimodal |
| Gemini 2.0 Pro | 2M | Medium | Medium | Complex reasoning |
| Gemini 1.5 Flash | 1M | Fastest | Lowest | High-throughput tasks |
| Gemini 1.5 Pro | 2M | Medium | Medium | Long document analysis |
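
The table can be encoded as a simple lookup when model choice is made at runtime. The task keys and the fallback default below are illustrative, and the exact model IDs accepted by the API may differ by release:

```python
# Illustrative mapping from task type to a model ID, following the table above
MODEL_FOR_TASK = {
    "general": "gemini-2.0-flash",
    "complex-reasoning": "gemini-2.0-pro",
    "high-throughput": "gemini-1.5-flash",
    "long-documents": "gemini-1.5-pro",
}

def pick_model(task: str) -> str:
    """Return a model ID for the task, defaulting to Gemini 2.0 Flash."""
    return MODEL_FOR_TASK.get(task, "gemini-2.0-flash")

print(pick_model("complex-reasoning"))  # gemini-2.0-pro
```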

Configuration Reference

| Parameter | Default | Description |
|---|---|---|
| temperature | 1.0 | Sampling randomness (0-2) |
| top_p | 0.95 | Nucleus sampling threshold |
| top_k | 40 | Top-k sampling |
| max_output_tokens | 8192 | Maximum response length |
| stop_sequences | [] | Stop generation strings |
| candidate_count | 1 | Number of response candidates |
| response_mime_type | text/plain | Output format (JSON, etc.) |
| safety_settings | Default | Content safety thresholds |
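
All of the parameters above are set on GenerateContentConfig. A sketch combining several of them — the specific values are illustrative, not recommendations:

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GOOGLE_API_KEY is set

config = types.GenerateContentConfig(
    temperature=0.7,          # lower than the 1.0 default for more focused output
    top_p=0.95,
    top_k=40,
    max_output_tokens=1024,
    stop_sequences=["END"],
    candidate_count=1,
)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Summarize the theory of relativity in two paragraphs",
    config=config,
)
```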

Best Practices

  1. Use Gemini 2.0 Flash as default — Best speed/quality ratio for most tasks
  2. Leverage multimodal natively — No need for separate vision models
  3. Use structured output — response_schema with Pydantic for reliable JSON
  4. Set system instructions — Define persona and constraints upfront
  5. Use streaming for long responses — generate_content_stream for real-time output
  6. Upload large files first — Use client.files.upload() for files >20MB
  7. Enable Google Search grounding — For tasks needing current information
  8. Handle safety filters — Check response.prompt_feedback for filtered responses
  9. Use caching for repeated contexts — Context caching reduces cost for long system prompts
  10. Monitor usage — Track token usage for cost management
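
Practice 7 mentions Google Search grounding. With the google-genai SDK this is enabled by passing the google_search tool in the request config — a sketch, assuming a Gemini 2.0 model and a valid API key:

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GOOGLE_API_KEY is set

# Ground the response in current Google Search results
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Who won the most recent Nobel Prize in Physics?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```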

Troubleshooting

Response blocked by safety filters

```python
# Adjust safety settings
from google.genai import types

config = types.GenerateContentConfig(
    safety_settings=[
        types.SafetySetting(
            category="HARM_CATEGORY_HARASSMENT",
            threshold="BLOCK_ONLY_HIGH",
        ),
    ],
)
```

File upload fails

```python
import time

# Check file size (<2GB) and supported formats,
# then wait for server-side processing to finish
file = client.files.upload(file="large_video.mp4")
while file.state.name == "PROCESSING":
    time.sleep(5)
    file = client.files.get(name=file.name)
```

Context window exceeded

```python
# Count tokens before sending
count = client.models.count_tokens(
    model="gemini-2.0-flash",
    contents=long_text,
)
print(f"Token count: {count.total_tokens}")
```
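
When the count exceeds the model's window, the usual fix is to truncate or chunk the input and process the pieces separately. A naive character-based sketch — split_into_chunks is a hypothetical helper, and production code would re-check each chunk with count_tokens rather than rely on a character budget:

```python
def split_into_chunks(text: str, max_chars: int) -> list[str]:
    """Split text into consecutive pieces of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

chunks = split_into_chunks("a" * 25, 10)
print([len(c) for c in chunks])  # [10, 10, 5]
```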