Divinci-AI/langextract-ts
LangExtract-TS

TypeScript port of Google's LangExtract — structured information extraction from text using LLMs with precise character-level source grounding.

Based on LangExtract v1.1.1 by Google. Ported to TypeScript with Gemini and Cloudflare Workers AI support.

Features

  • Source grounding — every extraction maps back to exact character positions in the original text
  • Sentence-aware chunking — three-strategy chunker that respects sentence boundaries
  • Two-phase alignment — exact token matching + fuzzy fallback for robust source mapping
  • Universal runtime — runs on Node.js 18+, Cloudflare Workers, Deno, and Bun
  • Minimal dependencies — only zod required; provider SDKs are optional
  • Interactive visualization — self-contained HTML with playback controls
  • Provider plugins — built-in Gemini + Cloudflare, extensible for custom providers
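The two-phase alignment mentioned above can be pictured with a small sketch. This is illustrative only — the library aligns at the token level and uses a SequenceMatcher-style fuzzy pass — but the exact-then-fuzzy shape is the same:

```typescript
// Illustrative sketch of two-phase alignment (not the library's internal code).
interface CharInterval { startPos: number; endPos: number; }

type Alignment = { interval: CharInterval; status: "exact" | "fuzzy" } | null;

function alignExtraction(source: string, extracted: string): Alignment {
  // Phase 1: exact match against the raw source text.
  const exactIdx = source.indexOf(extracted);
  if (exactIdx !== -1) {
    return {
      interval: { startPos: exactIdx, endPos: exactIdx + extracted.length },
      status: "exact",
    };
  }
  // Phase 2: fuzzy fallback -- here just case-insensitive, so the indices
  // still map back to the original string.
  const fuzzyIdx = source.toLowerCase().indexOf(extracted.toLowerCase());
  if (fuzzyIdx !== -1) {
    return {
      interval: { startPos: fuzzyIdx, endPos: fuzzyIdx + extracted.length },
      status: "fuzzy",
    };
  }
  return null; // extraction could not be grounded in the source
}
```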

Installation

npm install langextract-ts
# or
pnpm add langextract-ts

For Gemini support (optional):

npm install @google/genai

Quick Start

import { extract } from "langextract-ts";

const result = await extract(
  "The patient takes Aspirin 81mg daily for heart health.",
  {
    promptDescription: "Extract all medications with their dosage and frequency.",
    examples: [{
      text: "She takes Lisinopril 10mg once daily.",
      extractions: [{
        extractionClass: "medication",
        text: "Lisinopril",
        attributes: { dosage: "10mg", frequency: "once daily" },
      }],
    }],
    modelId: "gemini-2.0-flash",
    apiKey: process.env.GOOGLE_API_KEY,
  },
);

// result.extractions[0]:
// {
//   extractionClass: "medication",
//   text: "Aspirin",
//   charInterval: { startPos: 18, endPos: 25 },
//   alignmentStatus: "exact",
//   attributes: { dosage: "81mg", frequency: "daily" },
// }
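Because `charInterval` indexes directly into the input string, the grounded span can be recovered with a plain slice (assuming the result shape shown above):

```typescript
// Recover the grounded source span from an extraction's charInterval.
const input = "The patient takes Aspirin 81mg daily for heart health.";
const charInterval = { startPos: 18, endPos: 25 }; // from result.extractions[0]

const span = input.slice(charInterval.startPos, charInterval.endPos);
// span is the grounded source text, "Aspirin"
```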

With Cloudflare Workers AI

import { extract } from "langextract-ts";

const result = await extract(
  "Romeo professes his love for Juliet in the famous balcony scene.",
  {
    promptDescription: "Extract all characters mentioned.",
    examples: [{
      text: "Hamlet speaks to Horatio.",
      extractions: [{
        extractionClass: "character",
        text: "Hamlet",
      }],
    }],
    modelId: "@cf/meta/llama-3.3-70b-instruct-fp8-fast",
    apiKey: process.env.CF_API_TOKEN,
    accountId: process.env.CF_ACCOUNT_ID,
  },
);

API

extract(input, options)

Main entry point. Accepts strings, URLs, or Document[].

| Option | Default | Description |
| --- | --- | --- |
| `promptDescription` | required | Task instructions for the LLM |
| `examples` | required | Few-shot examples |
| `modelId` | `"gemini-2.0-flash"` | Model identifier |
| `apiKey` | env var | Provider API key |
| `maxCharBuffer` | `1000` | Maximum characters per chunk |
| `batchLength` | `10` | Chunks per inference batch |
| `maxWorkers` | `10` | Maximum concurrent requests |
| `extractionPasses` | `1` | Number of extraction passes |
| `contextWindowChars` | `0` | Characters of cross-chunk context |
| `formatType` | `"json"` | Output format (`"json"` or `"yaml"`) |
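As one illustration of how these options compose for a long document (hypothetical values, not tuned recommendations):

```typescript
// Hypothetical option set for a long document; all values are illustrative.
const longDocOptions = {
  promptDescription: "Extract all dates and the events they refer to.",
  examples: [],            // few-shot examples, as shown in Quick Start
  modelId: "gemini-2.0-flash",
  maxCharBuffer: 2000,     // larger chunks -> fewer inference calls
  batchLength: 10,         // chunks sent per inference batch
  maxWorkers: 4,           // cap concurrent requests
  extractionPasses: 2,     // a second pass can improve recall
  contextWindowChars: 200, // share context across chunk boundaries
  formatType: "json" as const,
};
// const result = await extract(longText, { ...longDocOptions, apiKey });
```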

Chunking

import { chunkDocument, createDocument } from "langextract-ts";

const doc = createDocument("Your long text here...");
for (const chunk of chunkDocument(doc, { maxCharBuffer: 500 })) {
  console.log(chunk.text, chunk.charInterval);
}
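Conceptually, sentence-aware chunking packs whole sentences until `maxCharBuffer` would be exceeded. A simplified sketch of that idea (the library's chunker uses three strategies and token-level character intervals):

```typescript
// Greedy sentence packer: never splits a sentence unless it must.
// Illustrative only -- not the library's chunker.
function packSentences(text: string, maxCharBuffer: number): string[] {
  // Naive sentence split on terminal punctuation.
  const sentences = text.match(/[^.!?]+[.!?]?\s*/g) ?? [text];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    // Start a new chunk when adding this sentence would overflow the buffer.
    if (current && current.length + sentence.length > maxCharBuffer) {
      chunks.push(current);
      current = "";
    }
    current += sentence;
  }
  if (current) chunks.push(current);
  return chunks;
}
```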

Tokenization

import { RegexTokenizer, UnicodeTokenizer } from "langextract-ts";

const tokenizer = new RegexTokenizer();
const { tokens } = tokenizer.tokenize("Hello world!");
// tokens: [{ text: "Hello", tokenType: "word", charInterval: { startPos: 0, endPos: 5 } }, ...]

// For CJK/international text:
const unicode = new UnicodeTokenizer();
const { tokens: cjkTokens } = unicode.tokenize("Hello 世界");
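The token shape shown above can be illustrated with a bare-bones regex tokenizer that records character intervals. This is a conceptual sketch, not the library's `RegexTokenizer`:

```typescript
// Minimal regex tokenizer that tags words, numbers, and punctuation
// with their character offsets. Illustrative only.
interface SimpleToken {
  text: string;
  tokenType: "word" | "number" | "punct";
  startPos: number;
  endPos: number;
}

function tokenizeSimple(text: string): SimpleToken[] {
  const tokens: SimpleToken[] = [];
  // Letters, then digit runs, then any other non-whitespace character.
  const re = /(\p{L}+)|(\d+)|(\S)/gu;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    const tokenType: SimpleToken["tokenType"] = m[1] ? "word" : m[2] ? "number" : "punct";
    tokens.push({
      text: m[0],
      tokenType,
      startPos: m.index,
      endPos: m.index + m[0].length,
    });
  }
  return tokens;
}
```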

Visualization

import { visualize } from "langextract-ts";

const html = visualize(annotatedDocument, {
  title: "Medication Extraction",
  animationSpeed: 1500,
});
// Save `html` to a file and open it in a browser

Custom Providers

import { BaseLanguageModel, registerProvider } from "langextract-ts";

class MyProvider extends BaseLanguageModel {
  async *infer(prompts) {
    for (const prompt of prompts) {
      const response = await myApi.call(prompt);
      yield [{ output: response, score: 1.0 }];
    }
  }
}

registerProvider([/^my-model/], () => MyProvider, 20);

Architecture

Input Text/URL
  -> Tokenization (RegexTokenizer or UnicodeTokenizer)
  -> Sentence-aware Chunking (3 strategies)
  -> Few-shot Prompt Construction
  -> Batched LLM Inference (concurrent with Semaphore)
  -> JSON Parsing + Extraction
  -> Two-phase Alignment (exact + fuzzy via SequenceMatcher)
  -> AnnotatedDocument with CharInterval positions
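The "concurrent with Semaphore" step above can be sketched with a minimal promise-based semaphore — an illustration of the pattern, not the library's class:

```typescript
// Minimal promise-based semaphore limiting how many async tasks run at once.
class Semaphore {
  private queue: Array<() => void> = [];
  private available: number;
  constructor(max: number) { this.available = max; }

  async acquire(): Promise<void> {
    if (this.available > 0) { this.available--; return; }
    // No slot free: wait until release() hands us one.
    await new Promise<void>((resolve) => this.queue.push(resolve));
  }

  release(): void {
    const next = this.queue.shift();
    if (next) next();          // hand the slot directly to a waiter
    else this.available++;     // or return it to the pool
  }
}

// Run fn over items with at most maxWorkers in flight, preserving order.
async function mapConcurrent<T, R>(
  items: T[],
  maxWorkers: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const sem = new Semaphore(maxWorkers);
  return Promise.all(items.map(async (item) => {
    await sem.acquire();
    try { return await fn(item); } finally { sem.release(); }
  }));
}
```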

Runtime Compatibility

| Runtime | Supported | Notes |
| --- | --- | --- |
| Node.js 18+ | Yes | Full support |
| Cloudflare Workers | Yes | Web APIs only |
| Deno | Yes | V8-based |
| Bun | Yes | JavaScriptCore |

License

Apache-2.0

This project is a derivative work of Google's LangExtract, originally licensed under Apache-2.0.
