Upload Your Document Collection

Follow the steps below to prepare and upload your document collection. This guide will help you extract text, structure your documents, chunk the content, and save it in the required format before uploading. It uses the Wikipedia article for Stanford University as an example, but you can use a document collection of any type.

Step 1: Extract Text from Your Documents

If your documents are not already in text form, you need to extract their text. Use Docling, a simple tool to extract text from PDFs, DOCX files, and more. While it may struggle with complex scanned documents, it works well for most standard documents. Be sure to use the "Export to Markdown" option.
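As a rough illustration of this step, the sketch below converts one file to Markdown with Docling's Python API and wraps the result in a record. The helper names extract_markdown and make_document are hypothetical, and deriving the title from the file name is an assumption made for the example, not a requirement of this guide.

```python
from pathlib import Path


def extract_markdown(path: str) -> str:
    """Convert one file (PDF, DOCX, ...) to Markdown text using Docling."""
    # Imported lazily so the rest of this sketch works without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(path)
    return result.document.export_to_markdown()


def make_document(path: str, markdown: str) -> dict:
    """Build a record holding a document title and its Markdown content.

    Assumption: the file name (without extension, underscores read as
    spaces) is an acceptable document title.
    """
    title = Path(path).stem.replace("_", " ")
    return {"document_title": title, "content": markdown}
```

For a file named Stanford_University.pdf, calling make_document(path, extract_markdown(path)) would give a record titled "Stanford University". If your documents carry better titles elsewhere (for example in their metadata), prefer those.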

Now each document in your collection should have the following fields:

  • document_title, for example:
    Stanford University
  • content, for example:
    Stanford University (officially Leland Stanford Junior University) is a private research university in Stanford, California, United States. ...
    # History
    Stanford University was founded in 1885. ...
    # Land

    Most of Stanford is on an 8,180-acre (12.8 sq mi; 33.1 km2) campus ...
    ## Central campus
    The central campus is adjacent to Palo Alto ...

Step 2: Chunk the Document Content

Breaking each document into smaller chunks (e.g., individual paragraphs, tables, infoboxes) can make it easier for retrieval systems to process.

Refer to this guide on chunking methods: Text Splitters. For most use cases, simple length-based chunking with chunk sizes between 200-800 tokens (roughly 150-600 words) works well.
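A minimal sketch of length-based chunking follows. It counts whitespace-separated words as a stand-in for tokens, which is an assumption made to keep the example dependency-free; the function name chunk_by_length and the default of 300 words are illustrative choices within the range suggested above.

```python
def chunk_by_length(text: str, max_words: int = 300, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words. Note that splitting on
    whitespace discards the original line breaks inside each chunk.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in the range [0, max_words)")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

If you need chunk sizes measured in real tokens, or chunks that respect paragraph and table boundaries, a dedicated text splitter is a better fit than this sketch.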

Step 3: Extract Section Structure

Extracting the section structure can improve search results. Aim to capture the structure in the format "document title > section level 1 > section level 2 > ...". For example:

Stanford University > Land > Central campus

You can derive this structure from the text extracted in the previous steps. By now, each document in your collection should have one or more blocks with the following fields:

{
    "document_title": "Stanford University",
    "section_title": "Land > Central campus",
    "content": "The central campus is adjacent to Palo Alto ..."
}

These three fields are the minimum required fields for each block. Together, they will be used to find the most relevant blocks for a given query.
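One way to derive the section structure is to walk the Markdown headings and keep track of the current heading path. The sketch below does this under a few assumptions: headings use the "#" style shown in Step 1, text before the first heading gets an empty section_title, and the function name split_into_blocks is hypothetical.

```python
import re

# Matches Markdown headings such as "# Land" or "## Central campus".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_into_blocks(document_title: str, markdown: str) -> list[dict]:
    """Split Markdown into blocks, one per section, with a heading path."""
    blocks = []
    path = []    # heading titles from level 1 down to the current level
    buffer = []  # lines of the section currently being collected

    def flush() -> None:
        content = "\n".join(buffer).strip()
        if content:
            blocks.append({
                "document_title": document_title,
                "section_title": " > ".join(path),
                "content": content,
            })
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line.strip())
        if match:
            flush()  # close the previous section before changing the path
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level and deeper
            path.append(match.group(2))
        else:
            buffer.append(line)
    flush()
    return blocks
```

Applied to the Stanford University example, the last block would carry the section_title "Land > Central campus". This sketch does not handle "#" lines inside fenced code, and long sections may still need the chunking from Step 2.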

Step 4: Save Your Collection in JSON Lines Format

JSON Lines is a convenient format for storing structured data. Each line in a JSON Lines file is a valid JSON object.
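As a small sketch of what this means in practice, the functions below serialize a list of blocks so that each block occupies exactly one line. The names to_jsonl and save_jsonl are hypothetical; only Python's standard json module is used.

```python
import json
from pathlib import Path


def to_jsonl(blocks: list[dict]) -> str:
    """Serialize blocks as JSON Lines: one JSON object per line."""
    # json.dumps emits no raw line breaks; newlines inside strings are
    # escaped, so each block stays on a single line.
    return "".join(json.dumps(block, ensure_ascii=False) + "\n" for block in blocks)


def save_jsonl(blocks: list[dict], path: str) -> None:
    """Write blocks to a UTF-8 encoded JSON Lines file."""
    Path(path).write_text(to_jsonl(blocks), encoding="utf-8")
```

Python's None is written as null, and a float such as 5.0 is written as 5.0 rather than 5, which helps keep field types stable across lines.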

Here you can add optional fields last_edit_date and url to each block. You may set these fields to null if they are not available. The url field can be used to link back to the original document, for instance, the specific Wikipedia article for Stanford University. The last_edit_date is the date the document was published or, if known, the last date the information in the document was considered up-to-date. This field can be used to obtain better search results for time-sensitive queries.

You can also add other custom metadata fields to each block by providing a JSON object as block_metadata. All these fields will be returned to you in the search results. Metadata fields must be consistent across all blocks, including their types. For instance, if a metadata field is a float in one block, write it as a float in every block (5.0, not 5). Each line in your JSON Lines file should follow the format below. The example includes the optional metadata fields block_type and language.

Note: Line breaks outside quotes are for readability and should not be included.

{
    "document_title": "The title of the document, e.g. Stanford University",
    "section_title": "The section path of the block, e.g. Land > Central campus",
    "content": "The text content of the block",
    "last_edit_date": "The last edit date of the block in YYYY-MM-DD format",
    "url": "The URL of the block, e.g. https://en.wikipedia.org/wiki/Stanford_University",
    "block_metadata": {
        "block_type": "The type of the block, e.g. 'text', 'table', 'infobox'",
        "language": "The language of the block, e.g. 'en' for English"
    }
}

Step 5: Upload Your JSON Lines File

If there are any formatting errors, an error message will appear. Please correct the issues and try uploading again.
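To catch common problems before uploading, you could run a local check along the lines of the sketch below. It is not the validation performed by the upload form; the function name validate_jsonl and the specific rules (required string fields, a YYYY-MM-DD date, and consistent block_metadata types) are assumptions drawn from the format described in Step 4.

```python
import json
import re

REQUIRED = ("document_title", "section_title", "content")
DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def validate_jsonl(text: str) -> list[str]:
    """Return a list of human-readable problems found in JSON Lines text."""
    errors = []
    metadata_types = {}  # metadata field name -> type name first seen
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # this sketch silently skips blank lines
        try:
            block = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(block, dict):
            errors.append(f"line {number}: not a JSON object")
            continue
        for field in REQUIRED:
            if not isinstance(block.get(field), str):
                errors.append(f"line {number}: missing or non-string {field}")
        date = block.get("last_edit_date")
        if date is not None and not (isinstance(date, str) and DATE.match(date)):
            errors.append(f"line {number}: last_edit_date is not YYYY-MM-DD or null")
        metadata = block.get("block_metadata")
        if metadata is None:
            continue
        if not isinstance(metadata, dict):
            errors.append(f"line {number}: block_metadata is not a JSON object")
            continue
        for key, value in metadata.items():
            kind = type(value).__name__
            first = metadata_types.setdefault(key, kind)
            if kind != first:
                errors.append(
                    f"line {number}: block_metadata.{key} is {kind}, expected {first}"
                )
    return errors
```

An empty result means the sketch found nothing to flag. It only compares metadata types between blocks and does not check that every block defines the same set of metadata fields.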

Step 6: Request Collection Addition

After uploading your file, email genie@cs.stanford.edu to request the addition of your collection. You will receive a confirmation email with details on how to access the search API once your collection is added.