Follow the steps below to prepare and upload your document collection. This guide will help you extract text, structure your documents, chunk the content, and save it in the required format before uploading. It uses the Wikipedia article for Stanford University as a running example, but you can use any type of document collection.
If your documents are not already in text form, you need to extract their text. Use Docling, a simple tool to extract text from PDFs, DOCX files, and more. While it may struggle with complex scanned documents, it works well for most standard documents. Be sure to use the "Export to Markdown" option.
Now each document in your collection should have two fields: a document title and its extracted text content.
Breaking each document into smaller chunks (e.g., individual paragraphs, tables, infoboxes) can make it easier for retrieval systems to process.
Refer to this guide on chunking methods: Text Splitters. For most use cases, simple length-based chunking with chunk sizes between 200 and 800 tokens (roughly 150-600 words) works well.
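As a minimal sketch of length-based chunking (the 400-word default is an illustrative choice, not a requirement), a word-count splitter might look like this:

```python
def chunk_text(text: str, max_words: int = 400) -> list[str]:
    """Split text into chunks of at most max_words words.

    A minimal length-based chunker; dedicated text splitters also
    respect paragraph and sentence boundaries, which usually gives
    better retrieval results.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

chunks = chunk_text("one two three four five", max_words=2)
# chunks == ["one two", "three four", "five"]
```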
Extracting the section structure can improve search results. Aim to capture the structure in the format "document title > section level 1 > section level 2 > ...". For example:
Stanford University > Land > Central campus
You can derive this structure from the text extracted in the previous steps. By now, each document in your collection should have one or more blocks with the following fields:
{ "document_title": "Stanford University", "section_title": "Land > Central campus", "content": "The central campus is adjacent to Palo Alto ..." }
These three fields are the minimum required fields for each block. Together, they will be used to find the most relevant blocks for a given query.
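One way to derive these blocks from Markdown output is to track the current heading at each level while scanning the text. This is a hedged sketch, not a required API; the function name and the blank-line handling are illustrative:

```python
def blocks_from_markdown(document_title: str, markdown: str) -> list[dict]:
    """Build blocks with document_title, section_title, and content
    by tracking the heading hierarchy of a Markdown document."""
    blocks: list[dict] = []
    headings: list[str] = []   # current heading at each level
    paragraph: list[str] = []  # lines of the block being accumulated

    def flush():
        if paragraph:
            blocks.append({
                "document_title": document_title,
                "section_title": " > ".join(headings),
                "content": "\n".join(paragraph).strip(),
            })
            paragraph.clear()

    for line in markdown.splitlines():
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            # Keep parent headings above this level, replace the rest
            headings[:] = headings[:level - 1] + [line.lstrip("# ").strip()]
        elif line.strip():
            paragraph.append(line)
    flush()
    return blocks
```

Note this sketch keeps all text under one heading as a single block; you would typically combine it with a chunker for long sections.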
JSON Lines is a convenient format for storing structured data. Each line in a JSON Lines file is a valid JSON object.
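Writing blocks in this format needs only the standard library (the filename is illustrative):

```python
import json

blocks = [
    {"document_title": "Stanford University",
     "section_title": "Land > Central campus",
     "content": "The central campus is adjacent to Palo Alto ..."},
]

with open("collection.jsonl", "w", encoding="utf-8") as f:
    for block in blocks:
        # One JSON object per line; ensure_ascii=False keeps
        # non-ASCII text readable in the file.
        f.write(json.dumps(block, ensure_ascii=False) + "\n")
```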
Here you can add the optional fields `last_edit_date` and `url` to each block. You may set these fields to `null` if they are not available. The `url` field can be used to link back to the original document, for instance, the specific Wikipedia article for Stanford University. The `last_edit_date` is the date the document was published or, if known, the last date the information in the document was considered up-to-date. This field can be used to obtain better search results for time-sensitive queries.

You can also add other custom metadata fields to each block by providing a JSON object as `block_metadata`. All these fields will be returned to you in the search results. Each line in your JSON Lines file should follow this format.
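As a sketch, here is how such a block with an unknown edit date might be constructed before serialization (the values are illustrative):

```python
import json

block = {
    "document_title": "Stanford University",
    "section_title": "Land > Central campus",
    "content": "The central campus is adjacent to Palo Alto ...",
    "last_edit_date": None,  # Python None serializes as JSON null
    "url": "https://en.wikipedia.org/wiki/Stanford_University",
    "block_metadata": {"block_type": "text", "language": "en"},
}

line = json.dumps(block, ensure_ascii=False)
```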
All metadata fields in all blocks should be consistent, including their types. For instance, if you want a metadata field to be a `float` in one block, it should be explicitly set to a `float` in all blocks by using `5.0` instead of `5`.
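One simple way to keep a numeric field consistent across blocks is to coerce it explicitly before serialization. The `score` field below is made up purely for illustration:

```python
import json

def normalize_score(block: dict) -> dict:
    """Force the hypothetical 'score' metadata field to float so that
    every block serializes it the same way (5 -> 5.0)."""
    meta = block.setdefault("block_metadata", {})
    if meta.get("score") is not None:
        meta["score"] = float(meta["score"])
    return block

b = normalize_score({"block_metadata": {"score": 5}})
# json.dumps(b) now writes "score": 5.0 rather than 5
```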
The following example includes the optional metadata fields `block_type` and `language`.
Note: Line breaks outside quotes are for readability and should not be included.
{ "document_title", "section_title", "content", "last_edit_date": "The last edit date of the block in YYYY-MM-DD format", "url": "The URL of the block, e.g. https://en.wikipedia.org/wiki/Stanford_University" "block_metadata": { "block_type": "The type of the block, e.g. 'text', 'table', 'infobox'", "language": "The language of the block, e.g. 'en' for English", } }
If there are any formatting errors, an error message will appear. Please correct the issues and try uploading again.
After uploading your file, email genie@cs.stanford.edu to request the addition of your collection. You will receive a confirmation email with details on how to access the search API once your collection is added.