[AI Assist] Website and Document ingest release FAQs

Hello Community,

With last week’s release of the Website and Document ingestion features for AI Assist, here are some FAQs to help you get started:

Website Ingestion

  1. What is the new website ingestion feature?
  • Users can now input website and sitemap URLs. Web pages will be crawled, and the information will be added to the AI’s knowledge.
  2. How does the system handle URL redirects?
  • If a provided URL redirects, the system will internally store the final redirected URL.
  3. What criteria are used to determine which web pages will be ingested?
  • For websites: only pages under the provided URL’s subdirectory. For sitemaps: all pages under the sitemap URL’s domain. (See the first sketch after this list.)
  4. Is there a limit to the number of web pages that can be ingested?
  • Ingested web pages for each unique URL are currently limited to 500. You can upload multiple URLs with up to 500 unique pages each.
  5. Can users manually update or re-crawl web pages?
  • Yes, users can manually update info such as the title, content, and searchable status, and initiate re-crawling of individual web pages.
  6. Are there any tags or class names that are omitted during ingestion?
  • Yes, several tags like nav, script, and footer, and class names like navigation, menu, ads, banner, etc., are omitted. (See the second sketch after this list.)
  7. Will website ingestion automatically update if I change something on my site?
  • Currently, ingestion is a one-off event, but we are looking into automating this in future updates.
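
For anyone curious how the redirect and scope rules above play out in practice, here is a minimal Python sketch of items 2 and 3. It is an illustration only, not the actual crawler: the function names are ours, and it assumes the requests library as a stand-in for whatever HTTP client the system uses.

```python
from urllib.parse import urlparse

import requests


def resolve_final_url(url: str) -> str:
    # Follow redirects and return the final URL; per the FAQ, the system
    # stores this final redirected URL internally.
    response = requests.get(url, allow_redirects=True, timeout=10)
    return response.url


def in_scope(base_url: str, candidate_url: str, is_sitemap: bool) -> bool:
    # Websites: only pages under the provided URL's subdirectory.
    # Sitemaps: all pages under the sitemap URL's domain.
    base, cand = urlparse(base_url), urlparse(candidate_url)
    if is_sitemap:
        return cand.netloc == base.netloc
    return cand.netloc == base.netloc and cand.path.startswith(base.path)
```

So with a base URL of https://example.com/docs, a page at /docs/setup would be in scope, while /blog/news would not.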
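
And here is a rough sketch of the tag and class filtering from item 6, using BeautifulSoup as a stand-in for whatever HTML parser the crawler actually uses. The lists below contain only the names given in the FAQ; the real lists are likely longer.

```python
from bs4 import BeautifulSoup

# Tags and class names omitted during ingestion, per item 6 above.
OMITTED_TAGS = ["nav", "script", "footer"]
OMITTED_CLASSES = ["navigation", "menu", "ads", "banner"]


def strip_omitted(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag_name in OMITTED_TAGS:
        for el in soup.find_all(tag_name):
            el.decompose()  # drop the element and everything inside it
    for class_name in OMITTED_CLASSES:
        for el in soup.find_all(class_=class_name):
            el.decompose()
    return soup.get_text(separator=" ", strip=True)
```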

Document Ingestion

  1. What is the new document ingestion feature?
  • Documents can now serve as a source of knowledge for the AI assistant, supporting various types like PDF, TXT, and CSV.
  2. What are the limitations for PDF files?
  • Only text is extracted, scanned images and empty pages are ignored, and the maximum size is 10MB. Text is split into 1000-character chunks (see the sketch after this list).
  3. What are the limitations for TXT files?
  • The maximum size is 5MB, rows longer than 1000 characters are ignored, and the text is split into 1000-character chunks.
  4. What are the limitations for CSV files?
  • The maximum size is 5MB, empty rows and rows longer than 1000 characters are ignored, and each remaining row is treated as one knowledge chunk.
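
To make the size and chunking limits above concrete, here is a small Python sketch. The constants and function names are ours, for illustration only:

```python
CHUNK_SIZE = 1000   # characters per knowledge chunk (PDF and TXT)
MAX_ROW_LEN = 1000  # TXT/CSV rows longer than this are ignored


def chunk_text(text: str, size: int = CHUNK_SIZE) -> list[str]:
    # PDF/TXT: extracted text is split into fixed-size chunks.
    return [text[i:i + size] for i in range(0, len(text), size)]


def csv_rows_to_chunks(rows: list[str]) -> list[str]:
    # CSV: each non-empty row within the length limit is one chunk.
    return [row for row in rows if row.strip() and len(row) <= MAX_ROW_LEN]
```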

If you have any questions of your own about these features, feel free to reach out via a reply here on Community, and we will help you find the answer.


Thank you very much! I’ve been eagerly waiting for this feature to help fill the gap as I build my knowledge base.

What happens when there are new articles? Do we need to add each link every time, or does the tool check by itself to see if there are new links?
Also, is the limit of 500 the current limit, or is there an option to increase it? Does the limit differ by payment plan?

Hi djorbdjeb,

The 500-page limit applies per batch, so if you have more than 500 products, you can add each unique product-category URL, each of which can contain up to 500 pages.

Currently, website ingestion is a one-off event that needs to be repeated whenever website info is updated.

Our product team is looking into the auto-website-ingestion feature and other quality-of-life changes for future updates.

Let us know if you have any other questions.


There is an option to manually re-crawl the website and/or individual pages.