Sitemap Integrations

Fetching Sitemap Integration Details

To retrieve comprehensive information about a specific sitemap integration, including its URL configuration, crawling rules, and synchronization settings, use the fetch endpoint. This operation provides a complete view of how the integration is configured to crawl and extract content from websites, making it essential for verifying configurations, troubleshooting crawl issues, and understanding the current state of your web content synchronization.

The fetch operation returns all configuration details including the target URL, glob patterns for URL filtering, CSS selectors for content extraction, JavaScript rendering settings, sync schedule, and record expiration policies. This comprehensive view enables you to audit integration behavior, plan configuration updates, and ensure that your web content is being properly captured and synchronized into your datasets.

```http
GET /api/v1/integration/sitemap/{sitemapIntegrationId}/fetch
Content-Type: application/json
```

The response includes detailed configuration information:

{ "id": "sitemap_abc123", "name": "Documentation Site Crawler", "description": "Crawls product documentation for knowledge base", "datasetId": "dataset_xyz789", "url": "https://docs.example.com/sitemap.xml", "glob": "https://docs.example.com/**", "selectors": "article.content, main.documentation", "javascript": true, "syncSchedule": "@daily", "expiresIn": 604800000, "blueprintId": "blueprint_def456", "meta": {}, "createdAt": "2025-11-20T10:00:00Z", "updatedAt": "2025-11-22T15:30:00Z" }

json

Configuration Fields Explained:

  • url: The sitemap XML URL or starting page URL for the crawler
  • glob: Pattern matching rules to include/exclude URLs during crawling
  • selectors: CSS selectors that identify the content to extract from each page
  • javascript: Whether to execute JavaScript before extracting content (required for single-page applications)
  • syncSchedule: Automatic sync frequency (@daily, @hourly, @weekly, or cron expressions)
  • expiresIn: Time in milliseconds before records are considered stale (e.g., 604800000 = 7 days)
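
To illustrate how these fields come back together, here is a minimal sketch of calling the fetch endpoint over plain HTTPS and logging the configuration. The base URL and bearer-token header are placeholders rather than confirmed values; substitute whatever host and authentication your account actually uses.

```typescript
// Minimal sketch: fetch a sitemap integration's configuration and log the key fields.
// baseUrl and token are placeholders for your API host and credential (assumptions).
async function fetchSitemapIntegration(baseUrl: string, token: string, id: string) {
  const res = await fetch(`${baseUrl}/api/v1/integration/sitemap/${id}/fetch`, {
    headers: { Authorization: `Bearer ${token}` },
  })
  if (!res.ok) throw new Error(`Fetch failed: ${res.status}`)

  const integration = await res.json()

  // The fields documented above.
  console.log('URL:', integration.url)
  console.log('Glob:', integration.glob)
  console.log('Selectors:', integration.selectors)
  console.log('JavaScript rendering:', integration.javascript)
  console.log('Sync schedule:', integration.syncSchedule)
  console.log('Expires in (ms):', integration.expiresIn)

  return integration
}
```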

Use Cases:

  • Verifying that crawl rules are correctly configured before initiating a sync
  • Troubleshooting why certain pages are not being crawled or extracted properly
  • Auditing integration configurations across multiple website sources
  • Preparing to update glob patterns or selectors based on website structure changes

Important: The javascript setting significantly impacts crawl speed and resource usage. Enable it only when necessary (e.g., for React, Vue, or Angular applications). For static HTML sites, leave it disabled for faster, more efficient crawling.

Updating Sitemap Integration Configuration

To modify the configuration of an existing sitemap integration, including changing target URLs, adjusting crawling rules, updating content selectors, or modifying synchronization schedules, use the update endpoint. This operation enables you to adapt your web content crawling strategy as websites evolve, refine content extraction rules for better accuracy, or adjust crawl frequency based on content update patterns.

Sitemap integration updates are particularly valuable when website structures change, requiring updated CSS selectors or glob patterns to correctly identify and extract content. You can also use updates to enable or disable JavaScript rendering based on website technology changes, redirect crawling to new URLs or domains, or modify record expiration policies to align with content freshness requirements.

```http
POST /api/v1/integration/sitemap/{sitemapIntegrationId}/update
Content-Type: application/json

{
  "name": "Updated Documentation Crawler",
  "description": "Crawls product and API documentation",
  "url": "https://docs.example.com/sitemap.xml",
  "selectors": "article.docs-content, div.api-reference",
  "javascript": true,
  "syncSchedule": "@daily"
}
```

Updatable Configuration Fields:

  • name: Update the integration's display name for better organization
  • description: Modify the description to reflect current crawling scope or purpose
  • blueprintId: Reassign the integration to a different blueprint for organizational purposes
  • datasetId: Change the target dataset where crawled content is stored
  • url: Update the sitemap URL or starting page for the crawler
  • glob: Modify URL pattern matching rules
  • selectors: Update CSS selectors for content extraction (e.g., article.content, main.documentation)
  • javascript: Enable/disable JavaScript execution before content extraction
  • syncSchedule: Adjust crawl frequency (@hourly, @daily, @weekly, or cron expressions)
  • expiresIn: Modify record expiration time in milliseconds (max: 3 months)
  • meta: Update custom metadata for tracking or organizational purposes

Common Update Scenarios:

Updating Selectors After Website Redesign:

{ "selectors": "main.new-content-wrapper, article.post-body" }

json

Essential when websites change their HTML structure or CSS classes.

Enabling JavaScript for SPA Content:

{ "javascript": true }

json

Required when sites migrate from static HTML to React, Vue, or Angular.

Refining Crawl Scope with Glob Patterns:

{ "glob": "https://docs.example.com/api/** https://docs.example.com/guides/**" }

json

Useful for including specific sections while excluding others (e.g., skipping blog posts).

Adjusting Crawl Frequency:

{ "syncSchedule": "0 9 * * 1" }

json

Use cron expressions for custom schedules (this example: every Monday at 9:00 AM).

Important Considerations:

  • Configuration changes take effect on the next scheduled crawl, not immediately
  • Changing selectors may result in different content being extracted from previously crawled pages
  • Enabling JavaScript significantly increases crawl time and resource usage
  • Changing the dataset redirects future crawls; existing records in the old dataset remain
  • Glob patterns must be valid and properly escaped for special characters
  • The expiresIn value must be between 0 and three months (7,776,000,000 milliseconds)
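
Because expiresIn is easy to misread in raw milliseconds, a small sketch like the following can restate the documented bounds as constants; the inclusive boundary check is an assumption based on the limits listed above.

```typescript
// expiresIn is expressed in milliseconds; these constants restate the documented bounds.
const MS_PER_DAY = 24 * 60 * 60 * 1000 // 86,400,000
const SEVEN_DAYS = 7 * MS_PER_DAY      // 604,800,000 (the example value used earlier)
const MAX_EXPIRES_IN = 90 * MS_PER_DAY // 7,776,000,000 (three months)

// Assumption: both bounds are inclusive.
function isValidExpiresIn(expiresIn: number): boolean {
  return expiresIn >= 0 && expiresIn <= MAX_EXPIRES_IN
}

console.log(isValidExpiresIn(SEVEN_DAYS))       // true
console.log(isValidExpiresIn(120 * MS_PER_DAY)) // false: beyond the three-month maximum
```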

Testing Updated Configurations: After updating crawling rules, consider triggering a manual sync to verify that the new selectors, glob patterns, or JavaScript settings correctly extract the desired content. Monitor the crawl results to ensure accuracy before relying on automatic scheduled syncs.
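
One way to follow that advice is to apply the update and trigger the manual sync from the same script. The sketch below is illustrative only; the host, bearer-token authentication, and error handling are assumptions.

```typescript
// Rough sketch: apply an update, then trigger a manual sync to verify the new rules.
// baseUrl and token are placeholders for your API host and credential (assumptions).
async function updateAndVerify(baseUrl: string, token: string, id: string) {
  const headers = {
    Authorization: `Bearer ${token}`,
    'Content-Type': 'application/json',
  }

  // 1. Update the selectors (or glob patterns, javascript flag, and so on).
  const updateRes = await fetch(`${baseUrl}/api/v1/integration/sitemap/${id}/update`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ selectors: 'main.new-content-wrapper, article.post-body' }),
  })
  if (!updateRes.ok) throw new Error(`Update failed: ${updateRes.status}`)

  // 2. Trigger a manual sync so the new rules are exercised against real pages.
  const syncRes = await fetch(`${baseUrl}/api/v1/integration/sitemap/${id}/sync`, {
    method: 'POST',
    headers,
    body: JSON.stringify({}),
  })
  if (!syncRes.ok) throw new Error(`Sync failed: ${syncRes.status}`)

  // The sync runs asynchronously; review the resulting dataset records before
  // relying on the automatic schedule.
}
```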

Performance Impact: Changes to the javascript setting, glob patterns, or selectors can significantly affect crawl performance and resource consumption. Enable JavaScript only when necessary, use specific glob patterns to limit crawl scope, and ensure selectors are as specific as possible to avoid extracting unwanted content.

Deleting a Sitemap Integration

To permanently remove a sitemap integration and stop all web content crawling from the associated URL or domain, use the delete endpoint. This operation irreversibly removes the integration configuration, cancels any scheduled crawl operations, and disconnects the integration from its associated dataset. This is typically used when you no longer need to crawl content from a particular website, when migrating to a different content source, or when cleaning up unused integrations from your account.

Deleting a sitemap integration does not automatically remove the content that was previously crawled and stored in the associated dataset. Records that were extracted from web pages remain in the dataset unless explicitly deleted. This design ensures that valuable content is preserved even after the integration is removed, giving you the option to manually manage or migrate existing data before deletion.

```http
POST /api/v1/integration/sitemap/{sitemapIntegrationId}/delete
Content-Type: application/json

{}
```

The response confirms successful deletion:

{ "id": "sitemap_abc123" }

json

What Gets Deleted:

  • The sitemap integration configuration and all crawling settings
  • URL patterns, glob rules, and CSS selector configurations
  • JavaScript rendering settings and crawl schedules
  • All scheduled crawl tasks for this integration
  • Metadata and configuration history for the integration

What Is NOT Deleted:

  • Records previously crawled from websites and stored in the dataset
  • The associated dataset itself (remains intact and operational)
  • Any blueprints or other resources that referenced this integration
  • Audit logs and event logs documenting past crawl operations

Important Considerations:

Before Deleting:

  • Verify that you no longer need content updates from this website or domain
  • Consider whether you want to preserve crawled records in the dataset
  • Check if any bots or applications depend on content from this integration
  • Export the integration configuration (URL, selectors, glob patterns) in case you want to recreate it later (see the sketch after this list)
  • Review crawl logs to ensure all desired content has been successfully captured
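
A rough sketch of that export-then-delete flow, assuming Node.js, a placeholder API host, and bearer-token authentication:

```typescript
import { writeFile } from 'node:fs/promises'

// Rough sketch: save the integration's configuration to disk, then delete it.
// baseUrl and token are placeholders for your API host and credential (assumptions).
async function exportThenDelete(baseUrl: string, token: string, id: string) {
  const headers = {
    Authorization: `Bearer ${token}`,
    'Content-Type': 'application/json',
  }

  // 1. Fetch the current configuration (url, glob, selectors, schedule, and so on).
  const fetchRes = await fetch(`${baseUrl}/api/v1/integration/sitemap/${id}/fetch`, { headers })
  if (!fetchRes.ok) throw new Error(`Fetch failed: ${fetchRes.status}`)
  const config = await fetchRes.json()

  // 2. Keep a local copy so the integration can be recreated later if needed.
  await writeFile(`sitemap-integration-${id}.json`, JSON.stringify(config, null, 2))

  // 3. Delete the integration; previously crawled records stay in the dataset.
  const deleteRes = await fetch(`${baseUrl}/api/v1/integration/sitemap/${id}/delete`, {
    method: 'POST',
    headers,
    body: JSON.stringify({}),
  })
  if (!deleteRes.ok) throw new Error(`Delete failed: ${deleteRes.status}`)
}
```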

After Deletion:

  • All scheduled crawls will immediately stop, and no new content will be imported
  • The integration cannot be recovered; you must recreate it from scratch if needed
  • Existing dataset records remain available but will not receive updates from the website
  • You can safely delete records from the dataset separately if you no longer need them
  • Any custom glob patterns or selectors configured for this integration will be lost

Impact on Content Freshness: Once a sitemap integration is deleted, the crawled content in your dataset will gradually become outdated as the source website changes. If you need to maintain current content, ensure you have an alternative integration or manual update process in place before deleting the integration.

Alternative to Deletion: If you want to temporarily pause crawling without losing the integration configuration, consider updating the syncSchedule to a less frequent interval or to a cron expression that fires only rarely, rather than deleting the integration. This preserves your carefully configured selectors, glob patterns, and crawling rules for future use.
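
As an illustration, the sketch below switches the schedule to a once-a-year cron expression instead of deleting the integration; the host and authentication details are placeholders.

```typescript
// Rough sketch: effectively pause crawling by switching to a very infrequent
// schedule instead of deleting the integration. Placeholders are assumptions.
async function slowDownCrawling(baseUrl: string, token: string, id: string) {
  const res = await fetch(`${baseUrl}/api/v1/integration/sitemap/${id}/update`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${token}`,
      'Content-Type': 'application/json',
    },
    // Cron for 00:00 on January 1st: selectors, glob patterns, and other
    // settings are preserved, but crawls effectively stop.
    body: JSON.stringify({ syncSchedule: '0 0 1 1 *' }),
  })
  if (!res.ok) throw new Error(`Update failed: ${res.status}`)
}
```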

Data Retention: If you plan to delete both the integration and the crawled content, delete the integration first, then separately delete or archive the records from the dataset. This two-step process gives you the opportunity to verify that you have backups or exports of any valuable content before permanent removal.

Syncing Website Content

Syncing a sitemap integration initiates the process of crawling and importing content from websites into a ChatBotKit dataset, transforming your web pages, documentation sites, and online content into searchable knowledge bases for your conversational AI bots. This powerful capability enables your bots to answer questions based on your website content, documentation, blog posts, and other web-based information.

The sync operation runs asynchronously as a background task, discovering pages through sitemaps or URL patterns, extracting content using configurable methods, and populating your dataset with structured, searchable records. It supports sophisticated filtering and multiple content extraction strategies, and respects plan-based limits so it can handle websites of varying size and complexity.

```http
POST /api/v1/integration/sitemap/{sitemapIntegrationId}/sync
Content-Type: application/json

{}
```

Crawl Process and Content Extraction

When you trigger a sync operation, the system launches an asynchronous web crawler that discovers pages starting from your configured URL. The crawler intelligently follows links, respects glob patterns for URL filtering, and extracts content using your specified engine (Cheerio for static content or Puppeteer for JavaScript-rendered pages).

The crawler supports multiple content extraction strategies including HTML parsing, JSON-LD structured data, microdata, and sitemap-based discovery. You can configure which methods to use based on your website's structure and content format. The system automatically cleans and structures extracted content for optimal searchability and conversational use.

Content discovery respects glob patterns that you've configured on the integration, allowing you to include specific URL patterns and exclude others. For example, you might include /docs/** to crawl documentation while excluding /blog/** to skip blog posts. The crawler processes only URLs matching your inclusion patterns while skipping those matching exclusion patterns.

URL Filtering and Glob Patterns

The integration uses glob patterns to control which URLs get crawled and indexed. Patterns are evaluated against full URLs, allowing precise control over content inclusion. You can specify patterns like /docs/** to match all documentation pages, or !/admin/** to exclude administrative sections.

When no glob patterns are specified, the system defaults to crawling all pages under the configured URL using a /** pattern. This ensures that basic setups work without requiring pattern configuration, while still allowing sophisticated filtering when needed.
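
The platform's actual matcher is not shown here, but the following self-contained sketch approximates how include patterns, !-prefixed exclude patterns, and the /** default interact:

```typescript
// Illustration only: a simplified glob-style URL filter showing how include
// patterns, "!"-prefixed exclude patterns, and the "/**" default behave.
// This is not the platform's actual matching implementation.
function globToRegExp(glob: string): RegExp {
  const escaped = glob
    .replace(/[.+^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters
    .replace(/\*\*/g, '\u0000')           // temporary placeholder for "**"
    .replace(/\*/g, '[^/]*')              // "*" matches within a single path segment
    .replace(/\u0000/g, '.*')             // "**" matches across segments
  return new RegExp(`^${escaped}$`)
}

function shouldCrawl(url: string, globs: string[]): boolean {
  const patterns = globs.length > 0 ? globs : ['/**'] // default: crawl everything
  const includes = patterns.filter((g) => !g.startsWith('!'))
  const excludes = patterns.filter((g) => g.startsWith('!')).map((g) => g.slice(1))

  const path = new URL(url).pathname
  const matches = (g: string) => globToRegExp(g).test(g.startsWith('http') ? url : path)

  return includes.some(matches) && !excludes.some(matches)
}

console.log(shouldCrawl('https://docs.example.com/docs/getting-started', ['/docs/**'])) // true
console.log(shouldCrawl('https://docs.example.com/admin/users', ['/**', '!/admin/**'])) // false
```

Real-world matching may differ in details such as query strings or trailing slashes, so treat this purely as a mental model.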

Crawl Engines and Rendering

The integration supports two crawling engines with different capabilities:

Cheerio Engine: Fast, lightweight parsing of static HTML content. This engine works well for traditional server-rendered websites and documentation sites where content is present in the initial HTML response. It's more resource-efficient and faster for static content.

Puppeteer Engine: Full browser rendering for JavaScript-heavy websites and single-page applications. This engine executes JavaScript and waits for dynamic content to render, making it suitable for modern web applications that depend on client-side rendering. It requires more resources but handles complex websites effectively.
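
If you are unsure which engine a site needs, one quick heuristic (entirely separate from the platform) is to fetch a sample page and check how much visible text the static HTML already contains; an app shell with little server-rendered text likely needs the javascript setting enabled.

```typescript
// Heuristic sketch (not part of the platform): guess whether JavaScript
// rendering is likely required by inspecting a sample page's raw HTML.
async function probablyNeedsJavascript(sampleUrl: string): Promise<boolean> {
  const res = await fetch(sampleUrl)
  const html = await res.text()

  // Strip scripts, styles, and tags to approximate the visible text.
  const visibleText = html
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<style[\s\S]*?<\/style>/gi, '')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim()

  // A bare single-page-application shell usually carries very little
  // server-rendered text; the 500-character threshold is arbitrary.
  return visibleText.length < 500
}
```

This is only a rough signal; some server-rendered sites still hydrate important content on the client.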

Rate Limiting and Plan Limits

Manual sync operations are rate-limited to once every 15 minutes per integration, preventing excessive API usage and ensuring system stability. This rate limit applies to manual triggers but doesn't affect scheduled automatic syncs configured on the integration.

The crawler respects plan-based limits for both URL count and execution time. These limits ensure fair resource usage across the platform while providing sufficient capacity for typical websites. Higher-tier plans offer increased limits for larger websites and more comprehensive crawls.
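
A sketch of a manual trigger that tolerates the rate limit; the 429 status code is an assumption, since the documentation only guarantees that a rate limit error is returned.

```typescript
// Rough sketch: trigger a manual sync and surface a friendly message when the
// 15-minute rate limit is hit. Placeholders and the 429 status are assumptions.
async function triggerManualSync(baseUrl: string, token: string, id: string) {
  const res = await fetch(`${baseUrl}/api/v1/integration/sitemap/${id}/sync`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${token}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({}),
  })

  if (res.status === 429) {
    // Assumed status code for the documented rate limit error.
    console.log('Rate limited: wait at least 15 minutes between manual syncs.')
    return
  }
  if (!res.ok) throw new Error(`Sync failed: ${res.status}`)

  console.log('Sync started; the crawl continues asynchronously in the background.')
}
```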

Prerequisites and Validation

Before syncing, the system validates several requirements:

Dataset Association: The integration must be connected to a dataset where website content will be imported. Syncing without a dataset association results in a conflict error.

Valid URL: The configured URL must be valid and accessible. Invalid URLs cause the sync to fail and may automatically disable scheduled syncs to prevent repeated failures.

Available Resources: Your account must have sufficient dataset record limits. The system checks limits before beginning the crawl to prevent partial imports that exceed plan allocations.

Content Freshness and Updates

Each successful sync updates the lastSyncedAt timestamp on the integration, helping you track content freshness and sync frequency. If you've configured automatic sync scheduling, this timestamp reflects the most recent sync, whether manual or automatic.

Depending on your sync configuration, records may have expiration times that automatically remove outdated content from your dataset. This ensures your conversational AI always works with current information from your website.
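
To keep an eye on freshness, a small sketch like the one below compares lastSyncedAt against expiresIn; it assumes the fetch response exposes both fields, which may differ in practice.

```typescript
// Rough sketch: estimate whether an integration's content may be stale.
// Assumes the fetch response includes lastSyncedAt and expiresIn; the host
// and token are placeholders (assumptions).
async function isContentLikelyStale(baseUrl: string, token: string, id: string) {
  const res = await fetch(`${baseUrl}/api/v1/integration/sitemap/${id}/fetch`, {
    headers: { Authorization: `Bearer ${token}` },
  })
  if (!res.ok) throw new Error(`Fetch failed: ${res.status}`)

  const { lastSyncedAt, expiresIn } = await res.json()
  if (!lastSyncedAt || !expiresIn) return false // nothing to compare against

  const ageMs = Date.now() - new Date(lastSyncedAt).getTime()
  return ageMs > expiresIn // older than the expiration window: records may soon be removed
}
```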

Note: This operation is rate-limited to once every 15 minutes per integration. Attempting to sync more frequently will result in a rate limit error. For continuous updates, configure automatic sync scheduling on the integration instead of frequent manual triggers.