Document Chunking for LLMs
Chunking documents into smaller segments for LLM-based retrieval-augmented generation (RAG) and semantic search is a central challenge in building robust and useful systems. Here I outline these challenges and their competing priorities, and present a simple but effective method for chunking documents for LLMs using recursive sentence embedding and semantic similarity.
I’ve found that splitting large documents into smaller chunks requires some trial and error to find the right strategy.
There are basically three competing factors:
- Semantic coherence within chunks - each chunk should contain related information. Smaller chunks are more coherent but lose context; larger chunks carry more context but are less coherent.
- Semantic separation between chunks - chunks should be distinct enough to avoid redundancy, which means splitting at the semantic boundaries of sections and topics rather than at arbitrary sentence or paragraph breaks. A small amount of overlap between chunks can still help preserve context.
- Information preservation - chunks should be self-contained enough to be useful on their own, so related concepts end up grouped in the same chunk rather than split across chunks.
Ultimately, the appropriate strategy depends on the type of document and the intended use case. For example, a user guide is organized into chapters and sections: the beginning of a chapter introduces the topic, followed by sections that provide detailed workflows and instructions. In this case, it makes sense to keep a workflow's steps together in the same chunk; splitting a step into its own chunk strips it of its place in the workflow and, more importantly, of its context. If the document is more structured, like an invoice or a technical specification with tables, then a more sophisticated chunking strategy may be needed to preserve the relationships between data points. This article focuses on chunking textual documents, such as reports and manuals, where the text is free-form and less structured.
Algorithm Overview
Let's build a chunking algorithm based on these principles from the ground up. Conceptually, we start with sentences, since they are the basic unit of coherent meaning. From there we need a strategy for iteratively merging sentences that balances the three stated objectives: semantic coherence within chunks, semantic separation between chunks, and information preservation.
So the algorithm consists of two main stages:
- Semantic chunking breaks the document into semantically coherent chunks by:
  - Splitting the text into sentences
  - Computing embeddings for each sentence
  - Using cosine distance between consecutive sentence embeddings to detect topic shifts (the core test is sketched in code after this list)
  - Enforcing size constraints while adding overlap
- Section grouping organizes chunks into a hierarchical structure by:
  - Generating descriptive metadata (title and category) for each chunk using an LLM
  - Computing weighted similarity between chunk metadata
  - Grouping similar chunks into sections based on similarity thresholds
  - Creating subsections within sections for fine-grained organization
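To make the first stage concrete, here is a minimal sketch of the topic-shift test: embed consecutive sentences, take their cosine distance, and start a new chunk wherever the distance exceeds a threshold. The `embed` callable and the threshold value are placeholders; the full implementation below uses Ollama's `nomic-embed-text` embeddings and layers size constraints and overlap on top of this test.

```python
from typing import Callable, List

import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    # 1 - cosine similarity; larger values mean consecutive sentences are less related
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def split_points(sentences: List[str], embed: Callable[[str], np.ndarray], threshold: float = 0.25) -> List[int]:
    """Return the sentence indices where a new chunk should start."""
    embeddings = [embed(s) for s in sentences]
    splits = []
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            splits.append(i)  # sentence i begins a new chunk
    return splits
```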
Example: Berkshire Hathaway Annual Report
As an example, I've taken the 2024 Berkshire Hathaway Annual Report and passed it through the chunking process. First I generate sentences from the document using a regex, then compute embeddings and use the cosine distance between consecutive sentences to merge them into partially overlapping chunks. Next I take each chunk and use an LLM to generate metadata, including a title and category. Notice the consistency in titles and categories across chunks: the LLM is prompted with the trailing 5 metadata entries to provide context for the current chunk. Finally, I use the same embedding approach to merge chunks into sections based on the semantic similarity of each chunk's title and category. A tunable similarity threshold of 0.75 is used to combine chunks into sections. The resulting sections are more coherent and self-contained, making them easier to use in a RAG system.
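For the section-grouping step, the merge-or-split decision is driven by a weighted combination of title and category similarity. A small illustrative sketch, reusing the weights from the configuration in the implementation below (0.6 for the title, 0.4 for the category); the similarity values here are made up:

```python
TITLE_WEIGHT, CATEGORY_WEIGHT = 0.6, 0.4
SECTION_MERGE_THRESHOLD = 0.75


def should_start_new_section(title_sim: float, category_sim: float) -> bool:
    # Weighted similarity between the current chunk's metadata and the previous chunk's
    weighted = TITLE_WEIGHT * title_sim + CATEGORY_WEIGHT * category_sim
    return weighted < SECTION_MERGE_THRESHOLD


# Similar titles, nearly identical categories -> stay in the same section
print(should_start_new_section(title_sim=0.80, category_sim=0.95))  # False (0.86 >= 0.75)
# Dissimilar titles dominate the score -> start a new section
print(should_start_new_section(title_sim=0.40, category_sim=0.90))  # True (0.60 < 0.75)
```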
The table below shows the title and category generated by the LLM for each chunk. A split happens when the similarity between the current chunk and the previous chunk falls below the 0.75 threshold, or when the accumulated size exceeds a maximum character limit. Where a split occurs, the table reports the similarity of the current chunk to the previous chunk.
In the beginning of the letter to shareholders, the chunks are smaller and more granular because Buffett is introducing several concepts and giving an overview of the risks, management philosophy, and strategic focus; chunks span only a few sentences and the similarity between them is lower. As the letter progresses, the chunks become larger and more coherent, with higher similarity scores, as Buffett discusses specific topics in more detail. The final sections of the letter are the largest and most comprehensive, covering specific companies, investments, and their performance in depth. Notice that even when the similarity between chunks is high and the title and category are identical, a new split is still created once the chunk size grows too large.
Chunk | Title | Category | Previous Chunk Similarity |
---|---|---|---|
1 | Berkshire Hathaway Annual Report | Financial Report | |
2 | Berkshire Hathaway Annual Report | Financial Report | |
3 | Company Transparency & Reporting | Corporate Communication & Reporting | 0.541 |
4 | Responsibility & Communication | Corporate Communication & Reporting | 0.74 |
5 | Strategic Review & Ownership Dialogue | Corporate Communication & Reporting | 0.696 |
6 | Communication Strategy | Business & Investor Relations | 0.576 |
7 | Berkshire Hathaway’s Risk Management Approach | Financial Risk Management | 0.479 |
8 | Mistakes and Strategic Assessment | Business Strategy | 0.528 |
9 | Mistakes in Berkshire Acquisitions | Business Risk & Management | 0.641 |
10 | Mistakes in Hiring, Impacting Berkshire | Corporate Management & Risk | |
11 | Mistakes in Hiring Assessment | Personnel Management & Decision Making | 0.668 |
12 | Painful Mistakes, Diminishing Returns | Financial & Strategic | 0.506 |
13 | Mistakes and Delay | Strategic & Operational | 0.679 |
14 | Mistakes and Analysis | Business Strategy | 0.643 |
15 | Mistakes and Their Impact | Business Strategy | |
16 | Word Frequency Analysis | Text Analysis | 0.37 |
17 | Company Observations | Business & Communication | 0.405 |
18 | Behavioral Observations | Business & Risk | 0.652 |
19 | CEO Succession & Risk | Corporate Management & Risk | 0.568 |
20 | CEO Transition & Risk | Corporate Management & Risk | |
21 | CEO Succession & Berkshire’s Risk | Corporate Strategy & Risk | |
22 | Berkshire CEO Philosophy | Corporate Strategy & Leadership | |
23 | Pete Liegl’s Legacy | Business & Leadership | 0.53 |
24 | Pete Liegl - A Wealthing Story | Business & Financial | |
25 | Pete - Forest River Founder | Business & Founding | 0.572 |
26 | Forest River Acquisition | Business & Financial | 0.638 |
27 | RV Deal - Initial Communication | Business & Communication & Financial | 0.648 |
28 | Berkshire Acquisition Deal | Business & Finance | 0.646 |
29 | Meeting Details & Price Discussion | Business & Communication & Strategy | 0.513 |
30 | Meeting & Deal | Business & Communication | |
31 | Business Meeting & Financial Planning | Business & Strategy | 0.664 |
32 | Berkshire Hathaway Business Deal | Business & Financial | 0.576 |
33 | Real Estate Deal | Business & Financial | |
34 | Real Estate Lease Dispute | Business & Financial | |
35 | Meeting Dynamics | Business & Communication & Strategy | 0.514 |
36 | Compensation Structure | Financial & Human Resources | 0.429 |
37 | Compensation Structure | Financial & Strategy | |
38 | Berkshire’s Compensation Offer | Financial & Business | 0.669 |
39 | Berkshire’s Financial Strategy | Financial Strategy & Risk | 0.677 |
40 | Berkshire’s Early Success | Business & Financial | 0.741 |
41 | Simple Success | Business & Financial | 0.737 |
42 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | 0.492 |
43 | Berkshire’s Performance | Financial & Strategic | 0.736 |
44 | Mistakes in Berkshire Acquisitions | Business Risk & Management | 0.645 |
45 | Berkshire’s Strategic Imperfections | Business Strategy & Risk | |
46 | Strategic Focus & Partnership | Business & Strategy | 0.624 |
47 | Strategic Imperative | Business & Strategy | |
48 | CEO Mistakes & Analysis | Business Strategy & Risk | 0.614 |
49 | Mistakes in Berkshire Acquisitions | Business Risk & Management | 0.745 |
50 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
51 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
52 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
53 | Berkshire’s Mistakes & Strategic Challenges | Business & Risk & Strategy | |
54 | Mistakes in Berkshire Acquisitions | Business Risk & Management | |
55 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
56 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | 1 |
57 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
58 | Pete Liegl’s Performance | Business & Financial | 0.475 |
59 | GEICO Restructuring | Business & Strategy | 0.46 |
60 | GEICO Transformation | Business Strategy & Operational | |
61 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 0.536 |
62 | Property Casualty Pricing Surge | Financial & Strategic | 0.522 |
63 | Convective Storm Damage | Financial Risk & Strategic | 0.652 |
64 | Berkshire’s Recent Challenges | Business & Risk & Strategy | 0.55 |
65 | Berkshire’s Recent Financial Challenges | Financial & Strategic | |
66 | Insurance Losses & Strategic Risks | Business & Risk & Strategy | 0.559 |
67 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | 0.747 |
68 | Berkshire’s Financial Performance | Financial & Strategic | 0.74 |
69 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
70 | Berkshire’s Financial Performance | Financial & Strategic | |
71 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
72 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
73 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
74 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
75 | Berkshire’s Recent Mistakes & Challenges | Business & Risk & Strategy | 0.923 |
76 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
77 | Berkshire’s Recent Mistakes & Strategic Challenges | Business & Risk & Strategy | |
78 | Berkshire’s Recent Mistakes & Challenges | Business & Risk & Strategy | |
79 | Berkshire’s Recent Mistakes & Strategic Challenges | Business & Risk & Strategy | 0.981 |
80 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
81 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
82 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
83 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
84 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
85 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
86 | Berkshire’s Mistakes | Business & Risk & Strategy | 0.982 |
87 | Berkshire’s Early Mistakes | Business & Risk & Strategy | |
88 | Berkshire’s Early Mistakes | Business & Risk & Strategy | |
89 | Berkshire’s Early Mistakes | Business & Risk & Strategy | |
90 | Berkshire’s Strategic Missteps | Business & Risk & Strategy | |
91 | Berkshire’s Tax Burden | Financial & Strategic | 0.697 |
92 | Berkshire’s Tax Burden | Financial & Risk & Strategy | |
93 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
94 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
95 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
96 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
97 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
98 | Berkshire’s Financial Challenges | Financial & Strategic | 0.742 |
99 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 0.742 |
100 | Berkshire’s Financial Mistakes | Financial & Strategic | |
101 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
102 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
103 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
104 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
105 | Berkshire’s Recent Mistakes & Challenges | Business & Risk & Strategy | |
106 | Berkshire’s Recent Mistakes & Challenges | Business & Risk & Strategy | |
107 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 0.964 |
108 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
109 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
110 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
111 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
112 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
113 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
114 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
115 | Berkshire’s Strategic Setbacks | Business & Risk & Strategy | |
116 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
117 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
118 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
119 | Berkshire’s Strategic Challenges | Business & Risk & Strategy | |
120 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
121 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
122 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
123 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
124 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
125 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
126 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
127 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
128 | Berkshire’s Strategic Challenges | Business & Risk & Strategy | |
129 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 0.851 |
130 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
131 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
132 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
133 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
134 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
135 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
136 | Berkshire’s Early Struggles | Business & Risk & Strategy | |
137 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
138 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
139 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
140 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
141 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
142 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
143 | Berkshire’s Recent Mistakes & Challenges | Business & Risk & Strategy | |
144 | Berkshire’s Early Struggles | Business & Risk & Strategy | |
145 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 0.86 |
146 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
147 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
148 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
149 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
150 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
151 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
152 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
153 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
154 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
155 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
156 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
157 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
158 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
159 | CEOs’ Mistakes | Business Strategy & Risk | |
160 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
161 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
162 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
163 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
164 | Berkshire’s Strategic Challenges | Business & Risk & Strategy | 0.851 |
165 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
166 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
167 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
168 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
169 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
170 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
171 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
172 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
173 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
174 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
175 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
176 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
177 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
178 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
179 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
180 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
181 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
182 | Strategic Setbacks | Business & Risk & Strategy | |
183 | Recent Mistakes in Berkshire’s Operations | Business & Risk & Strategy | 0.743 |
184 | Hurricane, Tornado, and Wildfire Risks | Financial & Strategic | 0.547 |
185 | Strategic Mistakes | Business & Risk & Strategy | 0.55 |
186 | Auto Insurance Transition | Financial & Strategic | 0.481 |
187 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 0.49 |
188 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
189 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
190 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
191 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
192 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
193 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
194 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
195 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
196 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
197 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
198 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
199 | Berkshire’s Strategic Shifts | Business & Strategy | |
200 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
201 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
202 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
203 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
204 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
205 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
206 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
207 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
208 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
209 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
210 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
211 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
212 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
213 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
214 | Strategic Missteps | Business & Risk & Strategy | |
215 | Strategic Missteps | Business & Risk & Strategy | |
216 | Strategic Missteps | Business & Risk & Strategy | |
217 | Strategic Missteps | Business & Risk & Strategy | |
218 | Strategic Challenges | Business & Risk & Strategy | 0.803 |
219 | Strategic Missteps | Business & Risk & Strategy | |
220 | Strategic Setbacks | Business & Risk & Strategy | |
221 | Recent Mistakes | Business & Risk & Strategy | |
222 | Recent Mistakes | Business & Risk & Strategy | |
223 | Recent Mistakes | Business & Risk & Strategy | |
224 | Recent Mistakes | Business & Risk & Strategy | |
225 | Recent Mistakes | Business & Risk & Strategy | |
226 | Recent Mistakes | Business & Risk & Strategy | 1 |
227 | Recent Mistakes | Business & Risk & Strategy | |
228 | Strategic Missteps | Business & Risk & Strategy | |
229 | Recent Mistakes | Business & Risk & Strategy | |
230 | Strategic Mishaps | Business & Risk & Strategy | |
231 | Recent Mistakes | Business & Risk & Strategy | |
232 | Recent Mistakes | Business & Risk & Strategy | |
233 | Recent Mistakes | Business & Risk & Strategy | |
234 | Financial Risks & Challenges | Business & Risk & Strategy | 0.671 |
235 | Recent Business Mistakes | Business & Risk & Strategy | 0.694 |
236 | Recent Mistakes | Business & Risk & Strategy | |
237 | Strategic Challenges | Business & Risk & Strategy | 0.678 |
238 | Strategic Challenges | Business & Risk & Strategy | |
239 | Recent Business Mistakes | Business & Risk & Strategy | 0.671 |
240 | Strategic Challenges | Business & Risk & Strategy | 0.671 |
241 | Recent Business Mistakes | Business & Risk & Strategy | 0.671 |
242 | Strategic Challenges | Business & Risk & Strategy | 0.671 |
243 | Recent Business Mistakes | Business & Risk & Strategy | 0.671 |
244 | Recent Business Mistakes | Business & Risk & Strategy | |
245 | Recent Business Mistakes | Business & Risk & Strategy | |
246 | Recent Business Mistakes | Business & Risk & Strategy | |
247 | Recent Business Mistakes | Business & Risk & Strategy | |
248 | Recent Business Mistakes | Business & Risk & Strategy | |
249 | Recent Business Mistakes | Business & Risk & Strategy | 1 |
250 | Recent Business Mistakes | Business & Risk & Strategy | |
251 | Recent Business Mistakes | Business & Risk & Strategy | |
252 | Strategic Challenges | Business & Risk & Strategy | 0.671 |
253 | Recent Business Mistakes | Business & Risk & Strategy | 0.671 |
254 | Recent Business Mistakes | Business & Risk & Strategy | |
255 | Recent Business Mistakes | Business & Risk & Strategy | |
256 | Recent Business Mistakes | Business & Risk & Strategy | |
257 | Recent Business Mistakes | Business & Risk & Strategy | |
258 | Recent Business Mistakes | Business & Risk & Strategy | 1 |
Implementation Code
import re
import json
import ollama
import logging
import numpy as np
from typing import List
from pathlib import Path
from dataclasses import dataclass
from pydantic import BaseModel, Field
from sklearn.metrics.pairwise import cosine_similarity
logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger(__name__)
# Don't show noisy logging from httpcore, httpx, pdfminer, or urllib3
logging.getLogger("httpcore").setLevel(logging.WARNING)
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("pdfminer").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)
@dataclass
class ChunkingConfig:
    max_chunk_size: int = 1000  # characters
    min_chunk_size: int = 100  # characters
    similarity_threshold: float = 0.15  # cosine distance threshold for splitting (lower means more splits)
    ollama_model: str = "nomic-embed-text:v1.5"  # Ollama model for embeddings
    overlap_sentences: int = 1  # sentences to overlap between chunks
@dataclass
class SectioningConfig:
    # Model settings
    ollama_model: str = "gemma3:1b"
    max_tokens: int = 200
    temperature: float = 0.1  # Low for consistency
    # Similarity thresholds
    section_merge_threshold: float = 0.75
    title_weight: float = 0.6
    category_weight: float = 0.4
    # Size constraints
    max_section_size: int = 3000
    min_section_size: int = 300


class SimpleChunkMetadata(BaseModel):
    title: str = Field(..., description="Concise descriptive title for this content")
    category: str = Field(..., description="Broad topic category")


class Subsection(BaseModel):
    subsection_title: str
    chunks: List[str]
    combined_content: str
    chunk_count: int
    total_length: int


class DocumentSection(BaseModel):
    section_title: str
    category: str
    subsections: List[Subsection]
    total_chunks: int
    total_length: int
class SemanticChunker:
    def __init__(self, config: ChunkingConfig):
        self.config = config

    def get_embedding(self, text: str) -> np.ndarray:
        """Get embedding from Ollama using the Python client"""
        # Ensure the text is not empty, as empty strings might cause issues with embeddings
        if not text.strip():
            return np.zeros(768)  # Return a zero vector for empty text, assuming 768 dimensions for nomic-embed-text
        response = ollama.embeddings(model=self.config.ollama_model, prompt=text)
        return np.array(response["embedding"])

    def split_into_sentences(self, text: str) -> List[str]:
        """Split text into sentences using regex"""
        # Regex to handle common sentence endings while avoiding splitting on abbreviations (e.g., Mr. Smith).
        # It looks for . ! or ? followed by whitespace and an uppercase letter, but not if preceded by a common abbreviation pattern.
        sentence_pattern = r"(?<!\b[A-Z]\.)(?<!\b[A-Z][a-z]\.)(?<=[.!?])\s+(?=[A-Z])"
        sentences = re.split(sentence_pattern, text)
        return [s.strip() for s in sentences if s.strip()]
    def chunk_sentences(self, sentences: List[str]) -> List[str]:
        """
        Chunks sentences based on semantic similarity, respecting max_chunk_size,
        and adding overlap.
        """
        if not sentences:
            return []
        chunks: List[str] = []
        current_chunk_sentences: List[str] = []
        current_chunk_char_length = 0
        # Get embeddings for all sentences
        sentence_embeddings = [self.get_embedding(s) for s in sentences]
        for i, sentence in enumerate(sentences):
            sentence_length = len(sentence)
            # Check if adding the current sentence would exceed max_chunk_size
            # and if we have enough content to form a valid chunk
            if current_chunk_char_length + sentence_length > self.config.max_chunk_size and current_chunk_sentences:
                # If we're at the beginning of a potential new chunk and the first sentence itself is too long,
                # we'll add it as a standalone chunk and handle it.
                if current_chunk_char_length == 0:
                    chunks.append(sentence)
                    current_chunk_sentences = []
                    current_chunk_char_length = 0
                    continue  # Move to the next sentence
                else:
                    # Finalize the current chunk
                    chunks.append(" ".join(current_chunk_sentences))
                    # Reset for the next chunk, adding overlap
                    current_chunk_sentences = sentences[max(0, i - self.config.overlap_sentences) : i]
                    current_chunk_char_length = sum(len(s) for s in current_chunk_sentences)
            # Add sentence to current chunk
            current_chunk_sentences.append(sentence)
            current_chunk_char_length += sentence_length
            # Semantic split check (only if there's more than one sentence in current_chunk_sentences)
            if len(current_chunk_sentences) > 1:
                # Compare the last two sentences in current_chunk_sentences.
                # We're interested in the similarity between the last added sentence and the one before it
                # to detect a potential topic shift at the boundary.
                embed1 = sentence_embeddings[i - 1]  # Embedding of the sentence before the current one
                embed2 = sentence_embeddings[i]  # Embedding of the current sentence
                # Calculate cosine distance (1 - cosine_similarity)
                # A higher distance means less similarity.
                if np.linalg.norm(embed1) == 0 and np.linalg.norm(embed2) == 0:
                    # Handle case where both embeddings are zero vectors (e.g., from empty strings), distance is 0
                    distance = 0
                elif np.linalg.norm(embed1) == 0 or np.linalg.norm(embed2) == 0:
                    # If one is zero and the other is not, they are dissimilar
                    distance = 1.0
                else:
                    distance = 1 - cosine_similarity(embed1.reshape(1, -1), embed2.reshape(1, -1))[0][0]
                if distance > self.config.similarity_threshold:
                    log.debug(f"Semantic split detected between sentences {i - 1} and {i} with distance {distance:.4f}")
                    # Semantic split detected!
                    # If the chunk is long enough, finalize it before the current sentence.
                    if current_chunk_char_length - sentence_length >= self.config.min_chunk_size:
                        # Append chunk excluding the current sentence, which starts a new one
                        chunks.append(" ".join(current_chunk_sentences[:-1]))
                        # Reset for the next chunk, adding overlap
                        current_chunk_sentences = sentences[max(0, i - self.config.overlap_sentences) : i + 1]
                        current_chunk_char_length = sum(len(s) for s in current_chunk_sentences)
                    # If the chunk is too short, we don't split yet and continue to build it.
                    # This prevents very small chunks due to minor semantic shifts.
        # Add any remaining sentences as the last chunk
        if current_chunk_sentences:
            chunks.append(" ".join(current_chunk_sentences))
        # Post-processing: ensure no chunks are empty
        final_chunks = [chunk for chunk in chunks if chunk.strip()]
        return final_chunks
class SectionGrouper:
    def __init__(self, sectioning_config: SectioningConfig, embedding_model: str = "nomic-embed-text:v1.5"):
        self.config = sectioning_config
        self.embedding_model = embedding_model

    def generate_chunk_metadata(self, chunk: str, previous_metadata) -> SimpleChunkMetadata:
        """Generate structured metadata for a chunk using Ollama"""
        # Prepare previous metadata as a string for context
        metadata = ""
        if len(previous_metadata) > 0:
            metadata = (
                "\n".join(
                    f"{i + 1}. Title: {meta.title}, Category: {meta.category}"
                    for i, (chunk, meta) in enumerate(previous_metadata)
                )
                if previous_metadata
                else ""
            )
        if metadata != "":
            metadata = f"Previous metadata:\n{metadata}\n\n"
        prompt = f"""Analyze the following text and provide structured metadata:
Text: {chunk}
{metadata}
Generate a JSON response with:
- title: A concise 3-8 word descriptive title
- category: A broad 1-3 word topic category
Be specific and descriptive but concise. Keep consistency in section titles and categories across similar content."""
        try:
            response = ollama.generate(
                model=self.config.ollama_model,
                prompt=prompt,
                format=SimpleChunkMetadata.model_json_schema(),
                options={"temperature": self.config.temperature, "num_predict": self.config.max_tokens},
                keep_alive=True,
            )
            # Parse the JSON response
            metadata_dict = json.loads(response["response"])
            return SimpleChunkMetadata(**metadata_dict)
        except Exception as e:
            log.warning(f"Failed to generate metadata for chunk, using fallback: {e}")
            # Fallback metadata
            return SimpleChunkMetadata(
                title=f"Section {hash(chunk[:100]) % 1000}",
                category="General",
            )
    def get_embedding(self, text: str) -> np.ndarray:
        """Get embedding for text"""
        if not text.strip():
            return np.zeros(768)
        response = ollama.embeddings(model=self.embedding_model, prompt=text)
        return np.array(response["embedding"])

    def calculate_similarity(self, metadata1: SimpleChunkMetadata, metadata2: SimpleChunkMetadata) -> float:
        """Calculate weighted similarity between two chunk metadata objects"""
        # Get embeddings for each field
        title1_emb = self.get_embedding(metadata1.title)
        title2_emb = self.get_embedding(metadata2.title)
        category1_emb = self.get_embedding(metadata1.category)
        category2_emb = self.get_embedding(metadata2.category)

        # Calculate cosine similarities
        def safe_cosine_similarity(emb1, emb2):
            if np.linalg.norm(emb1) == 0 or np.linalg.norm(emb2) == 0:
                return 0.0
            return cosine_similarity(emb1.reshape(1, -1), emb2.reshape(1, -1))[0][0]

        title_sim = safe_cosine_similarity(title1_emb, title2_emb)
        category_sim = safe_cosine_similarity(category1_emb, category2_emb)
        # Weighted similarity
        weighted_sim = self.config.title_weight * title_sim + self.config.category_weight * category_sim
        return weighted_sim
    def group_chunks_into_sections(self, chunks: List[str]) -> List[DocumentSection]:
        """Group chunks into coherent sections based on metadata similarity"""
        if not chunks:
            return []
        log.info(f"Generating metadata for {len(chunks)} chunks...")
        # Stage 1: Generate metadata for all chunks
        chunk_metadata = []
        for i, chunk in enumerate(chunks):
            # We pass in the last 5 metadata items as context for the next chunk
            metadata = self.generate_chunk_metadata(
                chunk, chunk_metadata[-5:] if len(chunk_metadata) >= 5 else chunk_metadata
            )
            chunk_metadata.append((chunk, metadata))
            log.debug(f"Chunk {i + 1}: {metadata.title} | {metadata.category}")
        log.info("Grouping chunks into sections...")
        # Stage 2: Group chunks based on similarity
        sections = []
        current_section_chunks = [(chunk_metadata[0][0], chunk_metadata[0][1])]
        current_section_category = chunk_metadata[0][1].category
        for i in range(1, len(chunk_metadata)):
            chunk, metadata = chunk_metadata[i]
            prev_metadata = chunk_metadata[i - 1][1]
            # Calculate similarity with previous chunk
            similarity = self.calculate_similarity(metadata, prev_metadata)
            # Check if we should start a new section
            should_split = (
                # Similarity is below threshold so the current chunk is likely different
                # from the previous section
                similarity < self.config.section_merge_threshold
                # Or if the current section is too long
                or sum(len(c[0]) for c in current_section_chunks) + len(chunk) > self.config.max_section_size
            )
            if should_split and len(current_section_chunks) > 0:
                # Finalize current section
                section = self._create_section(current_section_chunks, current_section_category)
                sections.append(section)
                # Start new section
                current_section_chunks = [(chunk, metadata)]
                current_section_category = metadata.category
                log.debug(f"New section started at chunk {i + 1}, similarity: {similarity:.3f}")
            else:
                # Add to current section
                current_section_chunks.append((chunk, metadata))
        # Add final section
        if current_section_chunks:
            section = self._create_section(current_section_chunks, current_section_category)
            sections.append(section)
        log.info(f"Created {len(sections)} sections from {len(chunks)} chunks")
        return sections
    def _create_section(self, section_chunks: List[tuple], category: str) -> DocumentSection:
        """Create a DocumentSection from a list of (chunk, metadata) tuples"""
        # Group chunks into subsections based on title similarity
        subsections = []
        current_subsection = []
        current_title = section_chunks[0][1].title
        for chunk, metadata in section_chunks:
            if not current_subsection:
                current_subsection = [(chunk, metadata)]
                current_title = metadata.title
            else:
                # Check if we should group with current subsection
                title_sim = self.calculate_similarity(metadata, current_subsection[-1][1])
                if title_sim > 0.8:  # High threshold for subsection grouping
                    current_subsection.append((chunk, metadata))
                else:
                    # Create subsection from current group
                    subsection = self._create_subsection(current_subsection, current_title)
                    subsections.append(subsection)
                    # Start new subsection
                    current_subsection = [(chunk, metadata)]
                    current_title = metadata.title
        # Add final subsection
        if current_subsection:
            subsection = self._create_subsection(current_subsection, current_title)
            subsections.append(subsection)
        # Create section title from most common category and representative title
        section_title = self._generate_section_title(section_chunks)
        return DocumentSection(
            section_title=section_title,
            category=category,
            subsections=subsections,
            total_chunks=len(section_chunks),
            total_length=sum(len(chunk) for chunk, _ in section_chunks),
        )

    def _create_subsection(self, subsection_chunks: List[tuple], title: str) -> Subsection:
        """Create a Subsection from a list of (chunk, metadata) tuples"""
        chunks = [chunk for chunk, _ in subsection_chunks]
        combined_content = "\n\n".join(chunks)
        return Subsection(
            subsection_title=title,
            chunks=chunks,
            combined_content=combined_content,
            chunk_count=len(chunks),
            total_length=len(combined_content),
        )

    def _generate_section_title(self, section_chunks: List[tuple]) -> str:
        """Generate a representative title for the section"""
        # Use the title from the first chunk, or create from category
        if section_chunks:
            first_metadata = section_chunks[0][1]
            if len(section_chunks) == 1:
                return first_metadata.title
            else:
                # For multi-chunk sections, use category-based title
                return f"{first_metadata.category} Overview"
        return "Untitled Section"
def pdf_to_markdown(pdf_file: Path) -> str:
    from markitdown import MarkItDown

    md = MarkItDown()
    result = md.convert(pdf_file)
    text_content = result.text_content
    # Keep only ascii characters
    text_content = "".join(c for c in text_content if ord(c) < 128)
    # Remove leading and trailing whitespace
    text_content = text_content.strip()
    # Remove multiple newlines
    text_content = "\n".join(line.strip() for line in text_content.splitlines() if line.strip())
    return text_content


def get_pdf():
    import requests

    # Download the 2024 shareholder letter as sample text
    url = "https://www.berkshirehathaway.com/letters/2024ltr.pdf"
    response = requests.get(url)
    if response.status_code != 200:
        raise ValueError(f"Failed to download PDF: {response.status_code}")
    pdf_file = Path("./example_cache/2024ltr.pdf")
    pdf_file.parent.mkdir(parents=True, exist_ok=True)
    pdf_file.write_bytes(response.content)
    # Convert PDF to markdown text
    text = pdf_to_markdown(pdf_file)
    return text
# Usage example
def main():
    # Example usage
    sample_text = get_pdf()

    # Initialize chunker
    chunking_config = ChunkingConfig(
        max_chunk_size=750,
        min_chunk_size=150,
        similarity_threshold=0.25,
        ollama_model="nomic-embed-text:v1.5",
        overlap_sentences=2,
    )
    # Initialize section grouper
    sectioning_config = SectioningConfig(
        ollama_model="gemma3:1b", section_merge_threshold=0.75, max_section_size=3000, min_section_size=300
    )

    # Stage 1: Semantic chunking
    chunker = SemanticChunker(chunking_config)
    sentences = chunker.split_into_sentences(sample_text)
    chunks = chunker.chunk_sentences(sentences)
    print(f"Stage 1 Complete - Generated {len(chunks)} semantic chunks")

    # Stage 2: Section grouping
    grouper = SectionGrouper(sectioning_config)
    sections = grouper.group_chunks_into_sections(chunks)

    # Write sections to JSON file
    output_file = Path("./example_cache/sections.json")
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with output_file.open("w", encoding="utf-8") as f:
        json.dump([section.model_dump() for section in sections], f, indent=2, ensure_ascii=False)
    log.info(f"Sections written to {output_file}")
    print(f"Stage 2 Complete - Generated {len(sections)} sections")

    # Display results
    print("\n" + "=" * 80)
    print("DOCUMENT STRUCTURE")
    print("=" * 80)
    for i, section in enumerate(sections, 1):
        print(f"\nSECTION {i}: {section.section_title}")
        print(f"Category: {section.category}")
        print(f"Total chunks: {section.total_chunks} | Length: {section.total_length} chars")
        print("-" * 60)
        for j, subsection in enumerate(section.subsections, 1):
            print(f"  {i}.{j} {subsection.subsection_title}")
            print(f"      Chunks: {subsection.chunk_count} | Length: {subsection.total_length} chars")
            # Show first chunk preview
            if subsection.chunks:
                preview = (
                    subsection.chunks[0][:200] + "..." if len(subsection.chunks[0]) > 200 else subsection.chunks[0]
                )
                print(f"      Preview: {preview}")
            print()
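Note that the listing defines main() but never invokes it. To run it as a script (assuming a local Ollama instance with the nomic-embed-text:v1.5 and gemma3:1b models pulled), add the usual entry point:

```python
if __name__ == "__main__":
    main()
```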