Sitemap Integration

Sitemap Integration enables you to automatically crawl and index website content into datasets, making web pages searchable and accessible to your AI agents. By providing a sitemap.xml URL or website URL, the integration discovers and syncs all linked pages, extracting text content and metadata for knowledge retrieval.

This is particularly useful for building chatbots that answer questions based on documentation sites, knowledge bases, blogs, or any web content you want to make available to your AI agents.

Creating a Sitemap Integration

To create a sitemap integration, you need to provide the website or sitemap URL and specify which dataset should receive the crawled content. The integration will automatically discover pages through sitemap.xml files or by crawling links, extracting content and storing it as searchable records.

```http
POST /api/v1/integration/sitemap/create
Content-Type: application/json

{
  "name": "Product Documentation",
  "description": "Crawls our docs site for support chatbot",
  "datasetId": "dataset-abc123",
  "url": "https://docs.example.com/sitemap.xml",
  "glob": "**/docs/**",
  "selectors": "article.content, div.documentation",
  "javascript": false,
  "syncSchedule": "0 0 * * *",
  "expiresIn": 7776000000
}
```
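
If you create integrations from application code rather than with a raw HTTP client, the same request can be sent with fetch. The sketch below is illustrative only: the base URL and the Bearer-token Authorization header are assumptions, not part of the documented endpoint.

```typescript
// Hypothetical client sketch; adjust BASE_URL and the auth header to your environment.
const BASE_URL = "https://api.example.com"; // assumption, not documented here

async function createSitemapIntegration(apiKey: string) {
  const response = await fetch(`${BASE_URL}/api/v1/integration/sitemap/create`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Authentication header name is an assumption; use your account's scheme.
      "Authorization": `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      name: "Product Documentation",
      description: "Crawls our docs site for support chatbot",
      datasetId: "dataset-abc123",
      url: "https://docs.example.com/sitemap.xml",
      glob: "**/docs/**",
      selectors: "article.content, div.documentation",
      javascript: false,
      syncSchedule: "0 0 * * *", // daily at midnight
      expiresIn: 7776000000,     // 90 days in milliseconds
    }),
  });
  if (!response.ok) {
    throw new Error(`Create failed: ${response.status}`);
  }
  return response.json();
}
```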

Advanced Configuration Options

URL Filtering with Glob Patterns: The glob parameter filters which discovered pages are crawled. For example, "**/blog/**" crawls only blog posts, while "**/docs/**" restricts the crawl to documentation pages.
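
As an illustration, assuming conventional glob semantics where ** matches any sequence of path segments, the two patterns above would select URLs as follows:

```typescript
// Illustrative only; exact matching behavior is determined by the crawler.
const docsOnly = "**/docs/**";
// https://docs.example.com/docs/getting-started  -> crawled
// https://docs.example.com/blog/launch-post      -> skipped

const blogOnly = "**/blog/**";
// https://example.com/blog/2025/announcement     -> crawled
// https://example.com/pricing                    -> skipped
```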

Content Extraction with CSS Selectors: The selectors parameter specifies which HTML elements to extract content from using CSS selectors. This helps focus on main content and exclude navigation, footers, and other UI elements. Multiple selectors can be comma-separated.
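
A hedged illustration of how a comma-separated selector list maps onto page elements (the exact extraction rules belong to the integration; the selectors here simply target the elements shown):

```typescript
// Two selectors, applied to every crawled page.
const selectors = "article.content, div.documentation";
// <article class="content">...</article>   -> matched, content extracted
// <div class="documentation">...</div>     -> matched, content extracted
// <nav class="sidebar">...</nav>           -> not matched, ignored
// <footer>...</footer>                     -> not matched, ignored
```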

JavaScript Rendering: Set javascript: true to enable JavaScript execution during crawling, necessary for single-page applications or dynamic content. This increases crawl time but ensures complete content extraction.
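
For example, a crawl configuration for a client-rendered site might look like the following sketch; the URL and selector are hypothetical placeholders:

```typescript
// Partial create payload for a single-page application; expect slower crawls.
const spaCrawlConfig = {
  url: "https://app.example.com/sitemap.xml", // hypothetical SPA site
  javascript: true,                           // execute scripts before extracting content
  selectors: "main#app-content",              // hypothetical selector for the rendered root
};
```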

Sync Scheduling: Use cron expressions to control crawl frequency. Daily syncs ("0 0 * * *") work well for most documentation sites, while more frequent syncs may be needed for rapidly changing content.
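
A few common five-field cron expressions (minute, hour, day of month, month, day of week), for reference:

```typescript
const daily         = "0 0 * * *";   // every day at 00:00
const everySixHours = "0 */6 * * *"; // at minute 0 of every sixth hour
const weekly        = "0 0 * * 0";   // Sundays at 00:00
```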

Record Expiration: The expiresIn parameter (in milliseconds) determines how long crawled pages are retained. Set to 90 days (7776000000ms) for typical documentation, or null for permanent retention.
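
Because the value is in milliseconds, it is easier to derive it from a day count than to hard-code the figure; the arithmetic below reproduces the 90-day example:

```typescript
const DAY_MS = 24 * 60 * 60 * 1000; // 86,400,000 ms in one day
const ninetyDays = 90 * DAY_MS;     // 7,776,000,000 ms, matching the example above
const retainForever = null;         // per the docs, null means permanent retention
```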

Warning: Large websites may take significant time to crawl initially. The integration processes pages incrementally and subsequent syncs only update changed content for efficiency.

Listing Sitemap Integrations

Retrieve all sitemap integrations configured in your account to manage web crawlers, monitor crawl configurations, and review which websites are being synced into your datasets. This endpoint provides complete visibility into all active website crawling operations.

```http
GET /api/v1/integration/sitemap/list
```

Each integration entry includes the full crawl configuration: URL glob patterns, content selectors, JavaScript rendering settings, and the sync schedule, allowing you to audit and manage your web content synchronization.

Query Parameters:

  • cursor: Pagination cursor for retrieving additional results
  • order: Sort order ("asc" or "desc", default: "desc")
  • take: Number of integrations to retrieve (default: 25)
  • meta: Filter by metadata key-value pairs
  • blueprintId: Filter integrations by blueprint association
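
Putting these parameters together, a paginated listing call might look like the following sketch; the base URL and Authorization header are assumptions, not part of the documented endpoint:

```typescript
// Hedged sketch: lists integrations page by page using the documented query parameters.
async function listSitemapIntegrations(apiKey: string, cursor?: string) {
  const params = new URLSearchParams({ order: "desc", take: "25" });
  if (cursor) params.set("cursor", cursor);

  const response = await fetch(
    `https://api.example.com/api/v1/integration/sitemap/list?${params}`,
    { headers: { Authorization: `Bearer ${apiKey}` } }, // auth scheme is an assumption
  );
  if (!response.ok) {
    throw new Error(`List failed: ${response.status}`);
  }
  return response.json(); // expected shape: { items: [...] }, as in the example below
}
```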

The response includes complete crawl configurations, enabling you to verify which websites are being monitored, understand their extraction rules, and identify which datasets receive the crawled content for each integration.

Example Response:

{ "items": [ { "id": "sitemap-integration-123", "name": "Documentation Site", "url": "https://docs.example.com/sitemap.xml", "glob": "**\/docs/**", "selectors": "article.content", "javascript": false, "syncSchedule": "0 0 * * *", "expiresIn": 7776000000, "datasetId": "dataset-xyz", "createdAt": "2025-01-10T09:00:00Z" } ] }

json

Use this endpoint to maintain an inventory of your web crawling operations and ensure your documentation sites, blogs, and knowledge bases are being properly synchronized for AI agent access.