React hook for Google Gemini Live API – gemini-live-react

Description:

gemini-live-react is a React hook that handles bidirectional voice streaming with the Google Gemini Live API.

It manages audio format conversion between browser standards and Gemini’s requirements, handles buffer management for smooth playback, and chains audio chunks to prevent gaps during real-time conversation.

The library runs in React applications and connects through a WebSocket proxy that keeps your API key secure.

Install the hook, deploy the included proxy function to Supabase Edge Functions, and call the connection methods to start streaming.

Audio flows from your microphone through the proxy to Gemini and back through resampling logic before reaching your speakers.
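
Gemini Live expects 16-bit PCM audio rather than the Float32 samples browsers capture, so the hook converts formats in both directions. The sketch below illustrates the general idea of that conversion; the function name and details are illustrative, not the library's internals.

function floatTo16BitPCM(float32: Float32Array): Int16Array {
  // Browsers capture Float32 samples in [-1, 1]; Gemini Live expects
  // signed 16-bit PCM, so clamp each sample and scale it.
  const pcm = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}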

Features

  • 🎙️ Full Duplex Audio Streaming: Sends microphone input to Gemini while receiving and playing AI responses simultaneously at different sample rates.
  • 📺 Screen Sharing Support: Captures video frames from screen sharing or camera streams and sends them to Gemini at 1 FPS with automatic scaling to 1024px width.
  • 🔧 Tool Calling System: Defines function schemas that Gemini can invoke during conversation and handles asynchronous result returns.
  • 🎯 Voice Activity Detection: Integrates with @ricky0123/vad-web to detect speech and only transmit audio when the user speaks.
  • 📝 Real-Time Transcription: Provides completed transcripts for both user and AI speech plus streaming partial transcripts that update as words arrive.
  • 🔄 Connection State Management: Tracks connection lifecycle through idle, connecting, connected, reconnecting, error, and disconnected states.
  • 🔌 Auto-Reconnection: Implements exponential backoff reconnection with configurable attempt limits and delay parameters.
  • 📹 Session Recording: Captures all events including transcripts, audio metadata, tool calls, and DOM snapshots for export as JSON.
  • ⚙️ Workflow Builder: Registers multi-step automation sequences that chain browser controls, waits, conditions, and AI prompts.
  • 🎯 Smart Element Detection: Scans pages for interactive elements and returns their types, text content, bounding boxes, and CSS selectors without manual configuration.
  • 🗣️ Voice Selection: Supports six voice options, selected through a proxy query parameter, ranging from bright and clear to deep and authoritative.
  • 📱 Mobile Detection Utilities: Exports helper functions to detect iOS, mobile browsers, and screen recording capabilities for fallback logic.
  • 🐛 Debug Logging: Accepts boolean flag or custom callback function to log connection events, audio processing, and state changes.
  • 📦 TypeScript Definitions: Ships with complete type definitions for all hook returns, configuration objects, and callback signatures.

Use Cases

  • Voice Assistant Interfaces: Build conversational AI applications where users speak naturally to get answers or execute commands through voice.
  • Screen Analysis Tools: Create applications where AI watches screen content or camera feeds and provides real-time guidance or annotation.
  • Customer Support Automation: Implement voice-based support systems that handle common queries and execute tool functions to retrieve account data or submit tickets.
  • Accessibility Applications: Add voice control to web applications for users who need hands-free operation or screen reader alternatives.

How to Use It

Installation

Install the package through npm.

npm install gemini-live-react

For Voice Activity Detection support, install the optional peer dependency.

npm install @ricky0123/vad-web

Proxy Deployment

Copy the included Deno proxy file to your Supabase project structure.

mkdir -p supabase/functions/gemini-live-proxy
cp node_modules/gemini-live-react/packages/proxy-deno/index.ts supabase/functions/gemini-live-proxy/

Set your Google AI API key as a secret.

supabase secrets set GOOGLE_AI_API_KEY=your-api-key-here

Deploy the function to Supabase Edge Functions.

supabase functions deploy gemini-live-proxy

The deployed function is reachable at a WebSocket URL in the format wss://your-project.supabase.co/functions/v1/gemini-live-proxy.

Basic Voice Chat

Import the hook and destructure the connection methods and state values.

import { useGeminiLive } from 'gemini-live-react';
function VoiceChat() {
  const {
    connect,
    disconnect,
    transcripts,
    isConnected,
    isSpeaking,
  } = useGeminiLive({
    proxyUrl: 'wss://your-project.supabase.co/functions/v1/gemini-live-proxy',
  });
  return (
    <div>
      <button onClick={() => isConnected ? disconnect() : connect()}>
        {isConnected ? 'End Call' : 'Start Call'}
      </button>
      {isSpeaking && <p>AI is speaking</p>}
      <div>
        {transcripts.map(transcript => (
          <p key={transcript.id}>
            <strong>{transcript.role}:</strong> {transcript.text}
          </p>
        ))}
      </div>
    </div>
  );
}

Call connect() to start the session. The hook requests microphone access, opens the WebSocket connection, and begins streaming audio. Call disconnect() to end the session and release resources.
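
As the FAQ below notes, mobile browsers only grant microphone access from a user gesture, so a safe pattern is to call connect() inside a click handler and catch permission failures:

const handleStart = async () => {
  try {
    await connect();
  } catch (err) {
    // Surfaces microphone permission denials and connection errors.
    console.error('Failed to start session:', err);
  }
};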

Screen Sharing

Create a video element ref and capture a display stream before connecting.

import { useRef } from 'react';
import { useGeminiLive } from 'gemini-live-react';
function ScreenShareAssistant() {
  const videoRef = useRef<HTMLVideoElement>(null);
  const { connect, disconnect, isConnected } = useGeminiLive({
    proxyUrl: 'wss://your-project.supabase.co/functions/v1/gemini-live-proxy',
  });
  const startWithScreen = async () => {
    const stream = await navigator.mediaDevices.getDisplayMedia({
      video: true
    });
    if (videoRef.current) {
      videoRef.current.srcObject = stream;
      await videoRef.current.play();
      await connect(videoRef.current);
    }
  };
  return (
    <div>
      <video ref={videoRef} style={{ display: 'none' }} />
      <button onClick={isConnected ? disconnect : startWithScreen}>
        {isConnected ? 'Stop Sharing' : 'Share Screen'}
      </button>
    </div>
  );
}

Pass the video element to connect() after the stream starts playing. The hook captures frames at one-second intervals, scales them to a maximum width of 1024px, and sends them as base64-encoded images.
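
Conceptually, each capture resembles the canvas-based sketch below (illustrative only; the library's internal implementation may differ):

function captureFrame(video: HTMLVideoElement): string {
  const maxWidth = 1024;
  const scale = Math.min(1, maxWidth / video.videoWidth);
  const canvas = document.createElement('canvas');
  canvas.width = Math.round(video.videoWidth * scale);
  canvas.height = Math.round(video.videoHeight * scale);
  const ctx = canvas.getContext('2d');
  if (!ctx) throw new Error('Canvas 2D context unavailable');
  ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
  // Drop the "data:image/jpeg;base64," prefix to keep only raw base64.
  // The JPEG quality value here is an assumption.
  return canvas.toDataURL('image/jpeg', 0.8).split(',')[1];
}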

Mobile and iOS Support

Import detection utilities to check browser capabilities.

import { useRef } from 'react';
import {
  useGeminiLive,
  canScreenRecord,
  isIOS,
  isMobile,
} from 'gemini-live-react';
function MobileVoiceApp() {
  const videoRef = useRef<HTMLVideoElement>(null);
  const { connect, isConnected } = useGeminiLive({
    proxyUrl: 'wss://your-project.supabase.co/functions/v1/gemini-live-proxy',
  });
  const startWithVideo = async () => {
    let stream: MediaStream;
    if (canScreenRecord()) {
      stream = await navigator.mediaDevices.getDisplayMedia({
        video: true
      });
    } else {
      stream = await navigator.mediaDevices.getUserMedia({
        video: { facingMode: 'environment' }
      });
    }
    if (videoRef.current) {
      videoRef.current.srcObject = stream;
      await videoRef.current.play();
      await connect(videoRef.current);
    }
  };
  return (
    <div>
      <video ref={videoRef} playsInline muted style={{ display: 'none' }} />
      <button onClick={startWithVideo}>Start</button>
    </div>
  );
}

Add the playsInline attribute to video elements for iOS compatibility. The canScreenRecord() function returns false on iOS Safari, which triggers the camera fallback logic.

Tool Calling

Define tool schemas in the configuration and handle execution in the callback.

import { useGeminiLive } from 'gemini-live-react';
function AgentInterface() {
  const { connect, transcripts } = useGeminiLive({
    proxyUrl: 'wss://your-project.supabase.co/functions/v1/gemini-live-proxy',
    tools: [
      {
        name: 'lookup_knowledge',
        description: 'Search the knowledge base for information',
        parameters: {
          type: 'object',
          properties: {
            query: {
              type: 'string',
              description: 'Search query to execute'
            }
          },
          required: ['query']
        }
      },
      {
        name: 'create_ticket',
        description: 'Create a support ticket',
        parameters: {
          type: 'object',
          properties: {
            title: { type: 'string', description: 'Ticket title' },
            priority: { type: 'string', enum: ['low', 'medium', 'high'] }
          },
          required: ['title', 'priority']
        }
      }
    ],
    onToolCall: async (toolName, args) => {
      if (toolName === 'lookup_knowledge') {
        const result = await searchKnowledgeBase(args.query);
        return { answer: result };
      }
      if (toolName === 'create_ticket') {
        const ticketId = await createSupportTicket(args.title, args.priority);
        return { ticketId, status: 'created' };
      }
    }
  });
  return (
    <div>
      <button onClick={() => connect()}>Start Agent</button>
      <div>
        {transcripts.map(t => (
          <p key={t.id}>{t.role}: {t.text}</p>
        ))}
      </div>
    </div>
  );
}

The callback receives the tool name and parsed arguments. Return an object that Gemini receives as the tool result; the AI continues the conversation based on the returned data.
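
The searchKnowledgeBase and createSupportTicket helpers above are application code, not part of the library. Hypothetical implementations backed by your own HTTP endpoints (the /api paths are placeholders) might look like:

async function searchKnowledgeBase(query: string): Promise<string> {
  // Placeholder endpoint; replace with your knowledge base API.
  const res = await fetch(`/api/kb/search?q=${encodeURIComponent(query)}`);
  const data = await res.json();
  return data.answer;
}

async function createSupportTicket(title: string, priority: string): Promise<string> {
  // Placeholder endpoint; replace with your ticketing API.
  const res = await fetch('/api/tickets', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ title, priority }),
  });
  const data = await res.json();
  return data.id;
}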

Voice Activity Detection

Enable VAD to transmit audio only during speech.

import { useGeminiLive } from 'gemini-live-react';
function VoiceWithVAD() {
  const {
    connect,
    isConnected,
    isUserSpeaking,
    transcripts
  } = useGeminiLive({
    proxyUrl: 'wss://your-project.supabase.co/functions/v1/gemini-live-proxy',
    vad: true,
    vadOptions: {
      threshold: 0.5,
      minSpeechDuration: 250,
      silenceDuration: 300,
    }
  });
  return (
    <div>
      <button onClick={() => connect()}>Connect</button>
      {isUserSpeaking && (
        <div className="recording-indicator">
          Recording
        </div>
      )}
      <div>
        {transcripts.map(t => (
          <p key={t.id}>{t.role}: {t.text}</p>
        ))}
      </div>
    </div>
  );
}

The threshold parameter controls speech detection sensitivity on a scale from 0 to 1. The minSpeechDuration value sets the milliseconds of speech required before transmission starts, and the silenceDuration value sets the milliseconds of silence required before transmission ends.

Welcome Messages

Trigger an AI greeting when connection establishes.

const { connect } = useGeminiLive({
  proxyUrl: 'wss://your-project.supabase.co/functions/v1/gemini-live-proxy',
  welcomeMessage: 'Hello! I am ready to help you with your questions.',
});

The welcome message is sent to Gemini after the WebSocket handshake completes, and the AI responds based on the provided text.

Streaming Transcripts

Access partial transcripts that update during speech.

const {
  transcripts,
  streamingText,
  streamingUserText,
  isConnected
} = useGeminiLive({
  proxyUrl: 'wss://your-project.supabase.co/functions/v1/gemini-live-proxy',
});
return (
  <div>
    {transcripts.map(t => (
      <p key={t.id}>{t.role}: {t.text}</p>
    ))}
    {streamingUserText && (
      <p style={{ opacity: 0.6 }}>
        You: {streamingUserText}
      </p>
    )}
    {streamingText && (
      <p style={{ opacity: 0.6 }}>
        AI: {streamingText}
      </p>
    )}
  </div>
);

The streamingUserText value updates as the user speaks, before finalization. The streamingText value updates as AI responses stream in. Both become empty strings when speech completes and the entries move to the transcripts array.

Session Recording

Record all session events for debugging or analysis.

import { useGeminiLive } from 'gemini-live-react';
function RecordedSession() {
  const {
    connect,
    isConnected,
    startRecording,
    stopRecording,
    exportRecording,
    isRecording,
  } = useGeminiLive({
    proxyUrl: 'wss://your-project.supabase.co/functions/v1/gemini-live-proxy',
    recording: {
      audio: true,
      frames: true,
      domSnapshots: true,
      snapshotInterval: 5000,
    },
    onRecordingEvent: (event) => {
      console.log(event.type, event.timestamp, event.data);
    },
  });
  const handleExport = () => {
    const recording = stopRecording();
    console.log('Total events:', recording.events.length);
    const blob = exportRecording();
    const url = URL.createObjectURL(blob);
    const link = document.createElement('a');
    link.href = url;
    link.download = `session-${Date.now()}.json`;
    link.click();
    URL.revokeObjectURL(url);
  };
  return (
    <div>
      <button onClick={() => connect()}>Connect</button>
      <button onClick={startRecording} disabled={!isConnected || isRecording}>
        Start Recording
      </button>
      <button onClick={handleExport} disabled={!isRecording}>
        Stop and Export
      </button>
    </div>
  );
}

The recording captures transcript events, audio chunk metadata, frame captures, tool calls, tool results, browser controls, UI commands, DOM snapshots, and errors. Call exportRecording() to generate a downloadable JSON blob.
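
Once downloaded, the export can be analyzed offline. The sketch below assumes the JSON mirrors the recording object returned by stopRecording(), with an events array whose entries carry the type field shown in the onRecordingEvent callback:

// Node.js script: count recorded events by type.
import { readFileSync } from 'fs';

const recording = JSON.parse(readFileSync('session-export.json', 'utf8'));
const counts: Record<string, number> = {};
for (const event of recording.events) {
  counts[event.type] = (counts[event.type] ?? 0) + 1;
}
console.log(counts); // e.g. { transcript: 42, tool_call: 3, ... }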

Workflow Builder

Register multi-step automations that execute browser actions.

import { useGeminiLive } from 'gemini-live-react';
function WorkflowExample() {
  const {
    registerWorkflow,
    executeWorkflow,
    workflowExecution,
  } = useGeminiLive({
    proxyUrl: 'wss://your-project.supabase.co/functions/v1/gemini-live-proxy',
  });
  const setupLoginWorkflow = () => {
    registerWorkflow({
      id: 'login-flow',
      name: 'Login to Dashboard',
      entryPoint: 'click-login',
      steps: {
        'click-login': {
          id: 'click-login',
          type: 'browser_control',
          action: 'click',
          args: { selector: '#login-button' },
          next: 'wait-modal',
        },
        'wait-modal': {
          id: 'wait-modal',
          type: 'wait',
          waitMs: 500,
          next: 'fill-email',
        },
        'fill-email': {
          id: 'fill-email',
          type: 'browser_control',
          action: 'type',
          args: { selector: '#email', text: '[email protected]' },
          next: 'fill-password',
        },
        'fill-password': {
          id: 'fill-password',
          type: 'browser_control',
          action: 'type',
          args: { selector: '#password', text: 'password123' },
          next: 'click-submit',
        },
        'click-submit': {
          id: 'click-submit',
          type: 'browser_control',
          action: 'click',
          args: { selector: '#submit' },
          next: 'check-success',
        },
        'check-success': {
          id: 'check-success',
          type: 'condition',
          condition: { selector: '.dashboard', check: 'visible' },
          next: ['complete', 'error-handler'],
        },
      },
    });
  };
  const runLogin = async () => {
    setupLoginWorkflow();
    const result = await executeWorkflow('login-flow', {
      customVar: 'value'
    });
    console.log(result.status);
    console.log(result.history);
  };
  return (
    <div>
      <button onClick={runLogin}>Run Login Workflow</button>
      {workflowExecution && (
        <div>
          Status: {workflowExecution.status}
          Current Step: {workflowExecution.currentStepId}
        </div>
      )}
    </div>
  );
}

Workflow steps support four types. The browser_control type executes click, type, or scroll actions. The wait type pauses for a specified duration. The condition type checks element visibility or existence and branches based on results. The ai_prompt type sends text to Gemini and waits for a response.
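
The login example above covers three of the four types. An ai_prompt step under the same schema might look like the following; the prompt field name is an assumption, so check the shipped type definitions for the exact shape:

'summarize-result': {
  id: 'summarize-result',
  type: 'ai_prompt',
  // Assumed field name; sends this text to Gemini and waits for a reply.
  prompt: 'Summarize the dashboard state for the user.',
  next: 'complete',
},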

Smart Element Detection

Scan pages for interactive elements without CSS selectors.

import { useGeminiLive } from 'gemini-live-react';
function SmartDetection() {
  const {
    detectElements,
    clickDetectedElement,
    detectedElements,
    isDetecting,
  } = useGeminiLive({
    proxyUrl: 'wss://your-project.supabase.co/functions/v1/gemini-live-proxy',
    smartDetection: {
      highlightDetections: true,
    },
  });
  const scanPage = async () => {
    const result = await detectElements();
    console.log('Found elements:', result.elements);
  };
  const handleClick = async (elementId: string) => {
    await clickDetectedElement(elementId);
  };
  return (
    <div>
      <button onClick={scanPage} disabled={isDetecting}>
        Scan Page
      </button>
      <div>
        {detectedElements.map(element => (
          <div key={element.id}>
            <span>{element.type}: {element.text}</span>
            <button onClick={() => handleClick(element.id)}>
              Click
            </button>
          </div>
        ))}
      </div>
    </div>
  );
}

The detection returns element type, visible text content, bounding box coordinates, CSS selector when available, and confidence score. The clickDetectedElement method uses the selector if present or falls back to coordinate-based clicking.

Voice Selection

Change the AI voice by adding a query parameter to the proxy URL.

const { connect } = useGeminiLive({
  proxyUrl: 'wss://your-project.supabase.co/functions/v1/gemini-live-proxy?voice=Kore',
});

Available voices include Zephyr (bright and clear), Puck (warm and friendly), Charon (deep and authoritative), Kore (soft and gentle), Fenrir (strong and confident), and Aoede (melodic and expressive). Zephyr is the default when no voice parameter is provided.
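
A small application-side helper can keep the parameter type-safe; the union below mirrors the list above and is not exported by the library:

type GeminiVoice = 'Zephyr' | 'Puck' | 'Charon' | 'Kore' | 'Fenrir' | 'Aoede';

function proxyUrlWithVoice(base: string, voice: GeminiVoice): string {
  const url = new URL(base);
  url.searchParams.set('voice', voice);
  return url.toString();
}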

Debug Logging

Enable console logging for connection and audio events.

const { connect } = useGeminiLive({
  proxyUrl: 'wss://your-project.supabase.co/functions/v1/gemini-live-proxy',
  debug: true,
});

Pass a custom callback function to route logs to external services.

const { connect } = useGeminiLive({
  proxyUrl: 'wss://your-project.supabase.co/functions/v1/gemini-live-proxy',
  debug: (level, message, data) => {
    myLogger.log({
      level,
      message,
      timestamp: Date.now(),
      ...data
    });
  },
});

The callback receives log level (info, warn, error), message string, and optional data object.

Reconnection Configuration

Configure automatic reconnection behavior.

const { connect, connectionState } = useGeminiLive({
  proxyUrl: 'wss://your-project.supabase.co/functions/v1/gemini-live-proxy',
  reconnection: {
    maxAttempts: 5,
    initialDelay: 1000,
    maxDelay: 10000,
    backoffFactor: 2,
  },
});

The hook attempts reconnection up to maxAttempts times. The first retry waits initialDelay milliseconds. Each subsequent attempt multiplies the delay by backoffFactor until reaching maxDelay. The connectionState value changes to “reconnecting” during retry attempts.
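
Assuming the conventional backoff formula delay = min(initialDelay × backoffFactor^attempt, maxDelay), the configuration above produces the schedule below; this sketch only illustrates the arithmetic:

const { maxAttempts, initialDelay, maxDelay, backoffFactor } =
  { maxAttempts: 5, initialDelay: 1000, maxDelay: 10000, backoffFactor: 2 };

const delays = Array.from({ length: maxAttempts }, (_, attempt) =>
  Math.min(initialDelay * backoffFactor ** attempt, maxDelay)
);
console.log(delays); // [1000, 2000, 4000, 8000, 10000]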

Session Management

Resume sessions using session identifiers.

const sessionId = crypto.randomUUID();
const { connect } = useGeminiLive({
  proxyUrl: 'wss://your-project.supabase.co/functions/v1/gemini-live-proxy',
  sessionId: sessionId,
});

Pass the same session ID to multiple hook instances to maintain conversation context across reconnections or component remounts.

Text Messages

Send text to Gemini without voice input.

const { sendText, transcripts } = useGeminiLive({
  proxyUrl: 'wss://your-project.supabase.co/functions/v1/gemini-live-proxy',
});
const handleSend = () => {
  sendText('What is the weather today?');
};
return <button onClick={handleSend}>Ask</button>;

The text appears in the transcripts array with role “user” and triggers an AI response.

Microphone Control

Mute and unmute the microphone during active sessions.

const { isMuted, setMuted } = useGeminiLive({
  proxyUrl: 'wss://your-project.supabase.co/functions/v1/gemini-live-proxy',
});
return (
  <button onClick={() => setMuted(!isMuted)}>
    {isMuted ? 'Unmute' : 'Mute'}
  </button>
);

Muting stops audio transmission to Gemini but maintains the WebSocket connection.

Transcript Management

Clear conversation history programmatically.

const { clearTranscripts, transcripts } = useGeminiLive({
  proxyUrl: 'wss://your-project.supabase.co/functions/v1/gemini-live-proxy',
});
return (
  <div>
    <button onClick={clearTranscripts}>Clear History</button>
    <div>
      {transcripts.map(t => (
        <p key={t.id}>{t.role}: {t.text}</p>
      ))}
    </div>
  </div>
);

Calling clearTranscripts() empties the transcripts array without disconnecting the session.

Audio Buffer Tuning

Adjust playback buffer timing to balance latency and smoothness.

const { connect } = useGeminiLive({
  proxyUrl: 'wss://your-project.supabase.co/functions/v1/gemini-live-proxy',
  minBufferMs: 200,
  transcriptDebounceMs: 1500,
});

The minBufferMs parameter controls minimum audio buffering before playback starts. Lower values reduce latency but risk choppy audio. The transcriptDebounceMs parameter sets the delay before finalizing streaming transcripts into completed entries.

API Reference

Hook Configuration

  • proxyUrl (string, required): WebSocket URL for the deployed proxy function.
  • sessionId (string, optional): Unique identifier for session resumption across reconnections.
  • welcomeMessage (string, optional): Text sent to Gemini immediately after connection establishes.
  • debug (boolean or function, optional): Enables console logging when true or routes logs to custom callback function.
  • tools (array, optional): Array of tool definition objects with name, description, and parameters properties.
  • onToolCall (function, optional): Callback that receives tool name and arguments, returns result object.
  • vad (boolean, optional): Enables Voice Activity Detection when true.
  • vadOptions (object, optional): Configuration object with threshold, minSpeechDuration, and silenceDuration properties.
  • reconnection (object, optional): Configuration object with maxAttempts, initialDelay, maxDelay, and backoffFactor properties.
  • onTranscript (function, optional): Callback that receives transcript objects when speech completes.
  • onError (function, optional): Callback that receives error messages.
  • onConnectionChange (function, optional): Callback that receives connection state changes.
  • minBufferMs (number, optional): Minimum milliseconds to buffer before starting audio playback (default 200).
  • transcriptDebounceMs (number, optional): Milliseconds to wait before finalizing streaming transcripts (default 1500).
  • recording (object, optional): Configuration object with audio, frames, domSnapshots, and snapshotInterval properties.
  • onRecordingEvent (function, optional): Callback that receives recording event objects.
  • smartDetection (object, optional): Configuration object with highlightDetections property.

Hook Return Values

  • isConnected (boolean): True when WebSocket connection is active and audio streaming.
  • isConnecting (boolean): True during connection establishment phase.
  • connectionState (string): Current connection state (idle, connecting, connected, reconnecting, error, disconnected).
  • isSpeaking (boolean): True when AI audio is playing through speakers.
  • isMuted (boolean): True when microphone transmission is disabled.
  • isUserSpeaking (boolean): True when Voice Activity Detection identifies active speech.
  • error (string or null): Current error message or null when no errors exist.
  • transcripts (array): Array of completed transcript objects with id, role, and text properties.
  • streamingText (string): Current partial AI transcript that updates in real-time.
  • streamingUserText (string): Current partial user transcript that updates in real-time.
  • connect (function): Establishes connection and starts audio streaming. Accepts optional video element parameter for screen sharing.
  • disconnect (function): Closes connection and releases audio resources.
  • sendText (function): Sends text message to Gemini. Accepts string parameter.
  • sendToolResult (function): Sends tool execution result back to AI. Accepts tool name and result object parameters.
  • setMuted (function): Controls microphone state. Accepts boolean parameter.
  • clearTranscripts (function): Removes all entries from transcripts array.
  • startRecording (function): Begins session recording when recording configuration exists.
  • stopRecording (function): Ends recording and returns recording object with events array.
  • exportRecording (function): Generates downloadable JSON blob of recorded session.
  • isRecording (boolean): True when session recording is active.
  • registerWorkflow (function): Registers workflow definition. Accepts workflow object with id, name, entryPoint, and steps properties.
  • executeWorkflow (function): Runs registered workflow. Accepts workflow id and optional context object parameters.
  • pauseWorkflow (function): Pauses workflow execution at current step.
  • resumeWorkflow (function): Resumes paused workflow from last completed step.
  • cancelWorkflow (function): Stops workflow execution and clears state.
  • workflowExecution (object or null): Current workflow execution state with status, currentStepId, and history properties.
  • detectElements (function): Scans page for interactive elements and returns detection results.
  • clickDetectedElement (function): Clicks detected element by id. Accepts string parameter.
  • detectedElements (array): Array of detected element objects with id, type, text, bounds, selector, and confidence properties.
  • isDetecting (boolean): True during element detection scan.

Related Resources

  • @ricky0123/vad-web: Web-based Voice Activity Detection library that integrates with gemini-live-react for speech detection.
  • react-speech-recognition: React hook for browser speech recognition API with transcript management and continuous listening.
  • web-audio-api: Browser API for audio processing, routing, and analysis used internally by gemini-live-react.
  • Supabase Edge Functions: Serverless Deno runtime for deploying the WebSocket proxy included with gemini-live-react.

FAQs

Q: Why does audio sound choppy or cut out during playback?
A: Increase the minBufferMs configuration value to buffer more audio before playback starts. The default 200ms works for most connections, but slower networks may need 300-500ms. You can also set debug: true and check the browser console to see whether audio chunks arrive irregularly.

Q: How do I handle microphone permission errors on mobile browsers?
A: Call connect() inside a user interaction handler such as a button click; mobile browsers block microphone requests that are not triggered by a user gesture. Wrap the connection logic in try-catch to show permission error messages to users.

Q: Why do streaming transcripts sometimes take several seconds to finalize?
A: The transcriptDebounceMs parameter defaults to 1500ms to wait for complete sentences. Lower this value to 500-800ms for faster finalization at the cost of potentially splitting sentences across multiple transcript entries.
