@@ -6,8 +6,8 @@ identity:
     zh_Hans: 单页面抓取
 description:
   human:
-    en_US: Extract data from a single URL.
-    zh_Hans: 从单个URL抓取数据。
+    en_US: Turn any URL into clean data.
+    zh_Hans: 将任何网址转换为干净的数据。
   llm: This tool is designed to scrape URL and output the content in Markdown format.
 parameters:
   - name: url
@@ -21,45 +21,35 @@ parameters:
       zh_Hans: 要抓取并提取数据的网站URL。
     llm_description: The URL of the website that needs to be crawled. This is a required parameter.
     form: llm
-############## Page Options #######################
-  - name: headers
+############## Payload #######################
+  - name: formats
     type: string
     label:
-      en_US: headers
-      zh_Hans: 请求头
+      en_US: Formats
+      zh_Hans: 结果的格式
+    placeholder:
+      en_US: Use commas to separate multiple formats
+      zh_Hans: 多个格式时使用半角逗号分隔
     human_description:
       en_US: |
-        Headers to send with the request. Can be used to send cookies, user-agent, etc. Example: {"cookies": "testcookies"}
+        Formats to include in the output. Available options: markdown, html, rawHtml, links, screenshot, extract, screenshot@fullPage
       zh_Hans: |
-        随请求发送的头部。可以用来发送cookies、用户代理等。示例:{"cookies": "testcookies"}
-    placeholder:
-      en_US: Please enter an object that can be serialized in JSON
-      zh_Hans: 请输入可以json序列化的对象
+        输出中应包含的格式。可以填入: markdown, html, rawHtml, links, screenshot, extract, screenshot@fullPage
     form: form
-  - name: includeHtml
-    type: boolean
-    default: false
-    label:
-      en_US: include Html
-      zh_Hans: 包含HTML
-    human_description:
-      en_US: Include the HTML version of the content on page. Will output a html key in the response.
-      zh_Hans: 返回中包含一个HTML版本的内容,将以html键返回。
-    form: form
-  - name: includeRawHtml
+  - name: onlyMainContent
     type: boolean
     default: false
     label:
-      en_US: include Raw Html
-      zh_Hans: 包含原始HTML
+      en_US: only Main Content
+      zh_Hans: 仅抓取主要内容
     human_description:
-      en_US: Include the raw HTML content of the page. Will output a rawHtml key in the response.
-      zh_Hans: 返回中包含一个原始HTML版本的内容,将以rawHtml键返回。
+      en_US: Only return the main content of the page excluding headers, navs, footers, etc.
+      zh_Hans: 只返回页面的主要内容,不包括头部、导航栏、尾部等。
     form: form
-  - name: onlyIncludeTags
+  - name: includeTags
     type: string
     label:
-      en_US: only Include Tags
+      en_US: Include Tags
       zh_Hans: 仅抓取这些标签
     placeholder:
       en_US: Use commas to separate multiple tags
@@ -70,20 +60,10 @@ parameters:
       zh_Hans: |
         仅在最终输出中包含HTML页面的这些标签,可以通过标签名、类或ID来设定,使用逗号分隔值。示例:script, .ad, #footer
     form: form
-  - name: onlyMainContent
-    type: boolean
-    default: false
-    label:
-      en_US: only Main Content
-      zh_Hans: 仅抓取主要内容
-    human_description:
-      en_US: Only return the main content of the page excluding headers, navs, footers, etc.
-      zh_Hans: 只返回页面的主要内容,不包括头部、导航栏、尾部等。
-    form: form
-  - name: removeTags
+  - name: excludeTags
     type: string
     label:
-      en_US: remove Tags
+      en_US: Exclude Tags
       zh_Hans: 要移除这些标签
     human_description:
       en_US: |
@@ -94,29 +74,24 @@ parameters:
       en_US: Use commas to separate multiple tags
       zh_Hans: 多个标签时使用半角逗号分隔
     form: form
-  - name: replaceAllPathsWithAbsolutePaths
-    type: boolean
-    default: false
-    label:
-      en_US: All AbsolutePaths
-      zh_Hans: 使用绝对路径
-    human_description:
-      en_US: Replace all relative paths with absolute paths for images and links.
-      zh_Hans: 将所有图片和链接的相对路径替换为绝对路径。
-    form: form
-  - name: screenshot
-    type: boolean
-    default: false
+  - name: headers
+    type: string
     label:
-      en_US: screenshot
-      zh_Hans: 截图
+      en_US: headers
+      zh_Hans: 请求头
     human_description:
-      en_US: Include a screenshot of the top of the page that you are scraping.
-      zh_Hans: 提供正在抓取的页面的顶部的截图。
+      en_US: |
+        Headers to send with the request. Can be used to send cookies, user-agent, etc. Example: {"cookies": "testcookies"}
+      zh_Hans: |
+        随请求发送的头部。可以用来发送cookies、用户代理等。示例:{"cookies": "testcookies"}
+    placeholder:
+      en_US: Please enter an object that can be serialized in JSON
+      zh_Hans: 请输入可以json序列化的对象
     form: form
   - name: waitFor
     type: number
     min: 0
+    default: 0
     label:
       en_US: wait For
       zh_Hans: 等待时间
@@ -124,57 +99,54 @@ parameters:
       en_US: Wait x amount of milliseconds for the page to load to fetch content.
       zh_Hans: 等待x毫秒以使页面加载并获取内容。
     form: form
-############## Extractor Options #######################
-  - name: mode
-    type: select
-    options:
-      - value: markdown
-        label:
-          en_US: markdown
-      - value: llm-extraction
-        label:
-          en_US: llm-extraction
-      - value: llm-extraction-from-raw-html
-        label:
-          en_US: llm-extraction-from-raw-html
-      - value: llm-extraction-from-markdown
-        label:
-          en_US: llm-extraction-from-markdown
-    label:
-      en_US: Extractor Mode
-      zh_Hans: 提取模式
-    human_description:
-      en_US: |
-        The extraction mode to use. 'markdown': Returns the scraped markdown content, does not perform LLM extraction. 'llm-extraction': Extracts information from the cleaned and parsed content using LLM.
-      zh_Hans: 使用的提取模式。“markdown”:返回抓取的markdown内容,不执行LLM提取。“llm-extractioin”:使用LLM按Extractor Schema从内容中提取信息。
-    form: form
-  - name: extractionPrompt
-    type: string
+  - name: timeout
+    type: number
+    min: 0
+    default: 30000
     label:
-      en_US: Extractor Prompt
-      zh_Hans: 提取时的提示词
+      en_US: Timeout
     human_description:
-      en_US: A prompt describing what information to extract from the page, applicable for LLM extraction modes.
-      zh_Hans: 当使用LLM提取模式时,用于给LLM描述提取规则。
+      en_US: Timeout in milliseconds for the request.
+      zh_Hans: 请求的超时时间(以毫秒为单位)。
     form: form
-  - name: extractionSchema
+############## Extractor Options #######################
+  - name: schema
     type: string
     label:
       en_US: Extractor Schema
       zh_Hans: 提取时的结构
     placeholder:
       en_US: Please enter an object that can be serialized in JSON
+      zh_Hans: 请输入可以json序列化的对象
     human_description:
       en_US: |
-        The schema for the data to be extracted, required only for LLM extraction modes. Example: {
+        The schema for the data to be extracted. Example: {
           "type": "object",
           "properties": {"company_mission": {"type": "string"}},
           "required": ["company_mission"]
         }
       zh_Hans: |
-        当使用LLM提取模式时,使用该结构去提取,示例:{
+        使用该结构去提取,示例:{
           "type": "object",
           "properties": {"company_mission": {"type": "string"}},
           "required": ["company_mission"]
         }
     form: form
+  - name: systemPrompt
+    type: string
+    label:
+      en_US: Extractor System Prompt
+      zh_Hans: 提取时的系统提示词
+    human_description:
+      en_US: The system prompt to use for the extraction.
+      zh_Hans: 用于提取的系统提示。
+    form: form
+  - name: prompt
+    type: string
+    label:
+      en_US: Extractor Prompt
+      zh_Hans: 提取时的提示词
+    human_description:
+      en_US: The prompt to use for the extraction without a schema.
+      zh_Hans: 用于无schema时提取的提示词
+    form: form
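
Review note: the new `formats`, `includeTags`, `excludeTags`, `headers`, and `schema` parameters are all declared as `type: string`, so whatever consumes this YAML presumably has to split the comma-separated values and decode the JSON strings before building a request. The sketch below illustrates one way that could look. The function names, the nesting of `schema`/`systemPrompt`/`prompt` under an `extract` key, and the fallback to `markdown` when `formats` is empty are all assumptions made for illustration; none of them is taken from this diff.

```python
import json

# Format names copied from the `formats` human_description in this diff.
ALLOWED_FORMATS = {
    "markdown", "html", "rawHtml", "links",
    "screenshot", "extract", "screenshot@fullPage",
}


def split_csv(value):
    """Split a comma-separated form value into trimmed, non-empty items."""
    return [item.strip() for item in (value or "").split(",") if item.strip()]


def parse_json_object(value, field):
    """Decode a JSON-object form value; return None when the field is empty."""
    if not value:
        return None
    parsed = json.loads(value)
    if not isinstance(parsed, dict):
        raise ValueError(f"{field} must be a JSON object")
    return parsed


def build_scrape_payload(params):
    """Assemble a request payload from raw tool parameters (hypothetical shape)."""
    # Assumption: fall back to markdown when no format is given.
    formats = split_csv(params.get("formats")) or ["markdown"]
    unknown = [f for f in formats if f not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError(f"unsupported formats: {unknown}")

    payload = {
        "url": params["url"],
        "formats": formats,
        "onlyMainContent": bool(params.get("onlyMainContent", False)),
        "includeTags": split_csv(params.get("includeTags")),
        "excludeTags": split_csv(params.get("excludeTags")),
        "headers": parse_json_object(params.get("headers"), "headers"),
        "waitFor": params.get("waitFor", 0),        # default from the YAML
        "timeout": params.get("timeout", 30000),    # default from the YAML
    }

    # Assumption: extractor options are grouped under one "extract" key.
    extract = {
        "schema": parse_json_object(params.get("schema"), "schema"),
        "systemPrompt": params.get("systemPrompt"),
        "prompt": params.get("prompt"),
    }
    extract = {k: v for k, v in extract.items() if v is not None}
    if extract:
        payload["extract"] = extract

    # Drop optional fields that were left empty; 0 and False are kept.
    return {k: v for k, v in payload.items() if v not in (None, [], "")}
```

Validating `formats` against the list in the description means a mistyped value fails before any request is sent. Raising on a non-object `headers` or `schema` value matches the placeholder text, which asks for an object that can be serialized in JSON.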