Document Chunking for LLMs
Chunking documents into smaller segments for LLM-based retrieval-augmented generation (RAG) and semantic search is a central challenge in building robust and useful systems. Here I outline these challenges and their competing priorities, and present a simple but effective method for chunking documents for LLMs using recursive sentence embedding and semantic similarity.
I’ve found that splitting large documents into smaller chunks requires some trial and error to find the right strategy.
There are basically three competing factors:
- Semantic coherence within chunks - each chunk should contain related information. Smaller chunks are more coherent but lose context; larger chunks carry more context but are less coherent.
- Semantic separation between chunks - chunks should be distinct enough to avoid redundancy, which means splitting at the semantic boundaries of sections and topics rather than at arbitrary sentence or paragraph breaks. A small amount of overlap between chunks can still help preserve context.
- Information preservation - chunks should be self-contained enough to be useful on their own, so related concepts end up grouped in the same chunk rather than split across chunks.
Ultimately, the appropriate strategy depends on the type of document and the intended use case. For example, a user guide is organized into chapters and sections: the beginning of a chapter introduces the topic, followed by sections that provide detailed workflows and instructions. In this case, it makes sense to keep a workflow's steps together in the same chunk; splitting a step into its own chunk strips it of its place in the workflow and, more importantly, of its context. If the document is more structured, like an invoice or a technical specification with tables, then a more sophisticated chunking strategy may be needed to preserve the relationships between data points. This article focuses on chunking textual documents, such as reports and manuals, where the text is free-form and less structured.
Algorithm Overview
Let's build a chunking algorithm based on these principles from the ground up. Conceptually, we start with sentences, since they are the basic unit of coherent meaning. From there we need a strategy for iteratively merging sentences that balances the three stated objectives: semantic coherence within chunks, semantic separation between chunks, and information preservation.
So the algorithm consists of two main stages:
- Semantic chunking breaks the document into semantically coherent chunks by:
  - Splitting the text into sentences
  - Computing embeddings for each sentence
  - Using cosine distance between consecutive sentence embeddings to detect topic shifts (the core test is sketched in code after this list)
  - Enforcing size constraints while adding overlap
- Section grouping organizes chunks into a hierarchical structure by:
  - Generating descriptive metadata (title and category) for each chunk using an LLM
  - Computing weighted similarity between chunk metadata
  - Grouping similar chunks into sections based on similarity thresholds
  - Creating subsections within sections for fine-grained organization
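To make the first stage concrete, here is a minimal sketch of the topic-shift test: embed consecutive sentences, take their cosine distance, and start a new chunk wherever the distance exceeds a threshold. The `embed` callable and the threshold value are placeholders; the full implementation below uses Ollama's `nomic-embed-text` embeddings and layers size constraints and overlap on top of this test.

```python
from typing import Callable, List

import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    # 1 - cosine similarity; larger values mean consecutive sentences are less related
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def split_points(sentences: List[str], embed: Callable[[str], np.ndarray], threshold: float = 0.25) -> List[int]:
    """Return the sentence indices where a new chunk should start."""
    embeddings = [embed(s) for s in sentences]
    splits = []
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            splits.append(i)  # sentence i begins a new chunk
    return splits
```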
Example: Berkshire Hathaway Annual Report
As an example, I've taken the 2024 Berkshire Hathaway Annual Report and passed it through the chunking process. First I generate sentences from the document using a regex, then compute embeddings and use the cosine distance between consecutive sentences to merge them into partially overlapping chunks. Next I take each chunk and use an LLM to generate metadata, including a title and category. Notice the consistency in titles and categories across chunks: the LLM is prompted with the trailing 5 metadata entries to provide context for the current chunk. Finally, I use the same embedding approach to merge chunks into sections based on the semantic similarity of each chunk's title and category. A tunable similarity threshold of 0.75 is used to combine chunks into sections. The resulting sections are more coherent and self-contained, making them easier to use in a RAG system.
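For the section-grouping step, the merge-or-split decision is driven by a weighted combination of title and category similarity. A small illustrative sketch, reusing the weights from the configuration in the implementation below (0.6 for the title, 0.4 for the category); the similarity values here are made up:

```python
TITLE_WEIGHT, CATEGORY_WEIGHT = 0.6, 0.4
SECTION_MERGE_THRESHOLD = 0.75


def should_start_new_section(title_sim: float, category_sim: float) -> bool:
    # Weighted similarity between the current chunk's metadata and the previous chunk's
    weighted = TITLE_WEIGHT * title_sim + CATEGORY_WEIGHT * category_sim
    return weighted < SECTION_MERGE_THRESHOLD


# Similar titles, nearly identical categories -> stay in the same section
print(should_start_new_section(title_sim=0.80, category_sim=0.95))  # False (0.86 >= 0.75)
# Dissimilar titles dominate the score -> start a new section
print(should_start_new_section(title_sim=0.40, category_sim=0.90))  # True (0.60 < 0.75)
```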
The table below shows the title and category generated by the LLM for each chunk. A split happens when the similarity between the current chunk and the previous chunk falls below the 0.75 threshold, or when the accumulated size exceeds a maximum character limit. Where a split occurs, the table reports the similarity of the current chunk to the previous chunk.
In the beginning of the letter to shareholders, the chunks are smaller and more granular because Buffett is introducing several concepts and giving an overview of the risks, management philosophy, and strategic focus; chunks span only a few sentences and the similarity between them is lower. As the letter progresses, the chunks become larger and more coherent, with higher similarity scores, as Buffett discusses specific topics in more detail. The final sections of the letter are the largest and most comprehensive, covering specific companies, investments, and their performance in depth. Notice that even when the similarity between chunks is high and the title and category are identical, a new split is still created once the chunk size grows too large.
Chunk | Title | Category | Previous Chunk Similarity |
---|---|---|---|
1 | Berkshire Hathaway Annual Report | Financial Report | |
2 | Berkshire Hathaway Annual Report | Financial Report | |
3 | Company Transparency & Reporting | Corporate Communication & Reporting | 0.541 |
4 | Responsibility & Communication | Corporate Communication & Reporting | 0.74 |
5 | Strategic Review & Ownership Dialogue | Corporate Communication & Reporting | 0.696 |
6 | Communication Strategy | Business & Investor Relations | 0.576 |
7 | Berkshire Hathaway’s Risk Management Approach | Financial Risk Management | 0.479 |
8 | Mistakes and Strategic Assessment | Business Strategy | 0.528 |
9 | Mistakes in Berkshire Acquisitions | Business Risk & Management | 0.641 |
10 | Mistakes in Hiring, Impacting Berkshire | Corporate Management & Risk | |
11 | Mistakes in Hiring Assessment | Personnel Management & Decision Making | 0.668 |
12 | Painful Mistakes, Diminishing Returns | Financial & Strategic | 0.506 |
13 | Mistakes and Delay | Strategic & Operational | 0.679 |
14 | Mistakes and Analysis | Business Strategy | 0.643 |
15 | Mistakes and Their Impact | Business Strategy | |
16 | Word Frequency Analysis | Text Analysis | 0.37 |
17 | Company Observations | Business & Communication | 0.405 |
18 | Behavioral Observations | Business & Risk | 0.652 |
19 | CEO Succession & Risk | Corporate Management & Risk | 0.568 |
20 | CEO Transition & Risk | Corporate Management & Risk | |
21 | CEO Succession & Berkshire’s Risk | Corporate Strategy & Risk | |
22 | Berkshire CEO Philosophy | Corporate Strategy & Leadership | |
23 | Pete Liegl’s Legacy | Business & Leadership | 0.53 |
24 | Pete Liegl - A Wealthing Story | Business & Financial | |
25 | Pete - Forest River Founder | Business & Founding | 0.572 |
26 | Forest River Acquisition | Business & Financial | 0.638 |
27 | RV Deal - Initial Communication | Business & Communication & Financial | 0.648 |
28 | Berkshire Acquisition Deal | Business & Finance | 0.646 |
29 | Meeting Details & Price Discussion | Business & Communication & Strategy | 0.513 |
30 | Meeting & Deal | Business & Communication | |
31 | Business Meeting & Financial Planning | Business & Strategy | 0.664 |
32 | Berkshire Hathaway Business Deal | Business & Financial | 0.576 |
33 | Real Estate Deal | Business & Financial | |
34 | Real Estate Lease Dispute | Business & Financial | |
35 | Meeting Dynamics | Business & Communication & Strategy | 0.514 |
36 | Compensation Structure | Financial & Human Resources | 0.429 |
37 | Compensation Structure | Financial & Strategy | |
38 | Berkshire’s Compensation Offer | Financial & Business | 0.669 |
39 | Berkshire’s Financial Strategy | Financial Strategy & Risk | 0.677 |
40 | Berkshire’s Early Success | Business & Financial | 0.741 |
41 | Simple Success | Business & Financial | 0.737 |
42 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | 0.492 |
43 | Berkshire’s Performance | Financial & Strategic | 0.736 |
44 | Mistakes in Berkshire Acquisitions | Business Risk & Management | 0.645 |
45 | Berkshire’s Strategic Imperfections | Business Strategy & Risk | |
46 | Strategic Focus & Partnership | Business & Strategy | 0.624 |
47 | Strategic Imperative | Business & Strategy | |
48 | CEO Mistakes & Analysis | Business Strategy & Risk | 0.614 |
49 | Mistakes in Berkshire Acquisitions | Business Risk & Management | 0.745 |
50 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
51 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
52 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
53 | Berkshire’s Mistakes & Strategic Challenges | Business & Risk & Strategy | |
54 | Mistakes in Berkshire Acquisitions | Business Risk & Management | |
55 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
56 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | 1 |
57 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
58 | Pete Liegl’s Performance | Business & Financial | 0.475 |
59 | GEICO Restructuring | Business & Strategy | 0.46 |
60 | GEICO Transformation | Business Strategy & Operational | |
61 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 0.536 |
62 | Property Casualty Pricing Surge | Financial & Strategic | 0.522 |
63 | Convective Storm Damage | Financial Risk & Strategic | 0.652 |
64 | Berkshire’s Recent Challenges | Business & Risk & Strategy | 0.55 |
65 | Berkshire’s Recent Financial Challenges | Financial & Strategic | |
66 | Insurance Losses & Strategic Risks | Business & Risk & Strategy | 0.559 |
67 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | 0.747 |
68 | Berkshire’s Financial Performance | Financial & Strategic | 0.74 |
69 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
70 | Berkshire’s Financial Performance | Financial & Strategic | |
71 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
72 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
73 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
74 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
75 | Berkshire’s Recent Mistakes & Challenges | Business & Risk & Strategy | 0.923 |
76 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
77 | Berkshire’s Recent Mistakes & Strategic Challenges | Business & Risk & Strategy | |
78 | Berkshire’s Recent Mistakes & Challenges | Business & Risk & Strategy | |
79 | Berkshire’s Recent Mistakes & Strategic Challenges | Business & Risk & Strategy | 0.981 |
80 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
81 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
82 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
83 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
84 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
85 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
86 | Berkshire’s Mistakes | Business & Risk & Strategy | 0.982 |
87 | Berkshire’s Early Mistakes | Business & Risk & Strategy | |
88 | Berkshire’s Early Mistakes | Business & Risk & Strategy | |
89 | Berkshire’s Early Mistakes | Business & Risk & Strategy | |
90 | Berkshire’s Strategic Missteps | Business & Risk & Strategy | |
91 | Berkshire’s Tax Burden | Financial & Strategic | 0.697 |
92 | Berkshire’s Tax Burden | Financial & Risk & Strategy | |
93 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
94 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
95 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
96 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
97 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
98 | Berkshire’s Financial Challenges | Financial & Strategic | 0.742 |
99 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 0.742 |
100 | Berkshire’s Financial Mistakes | Financial & Strategic | |
101 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
102 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
103 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
104 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
105 | Berkshire’s Recent Mistakes & Challenges | Business & Risk & Strategy | |
106 | Berkshire’s Recent Mistakes & Challenges | Business & Risk & Strategy | |
107 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 0.964 |
108 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
109 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
110 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
111 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
112 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
113 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
114 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
115 | Berkshire’s Strategic Setbacks | Business & Risk & Strategy | |
116 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
117 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
118 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
119 | Berkshire’s Strategic Challenges | Business & Risk & Strategy | |
120 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
121 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
122 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
123 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
124 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
125 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
126 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
127 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
128 | Berkshire’s Strategic Challenges | Business & Risk & Strategy | |
129 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 0.851 |
130 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
131 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
132 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
133 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
134 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
135 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
136 | Berkshire’s Early Struggles | Business & Risk & Strategy | |
137 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
138 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
139 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
140 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
141 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
142 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
143 | Berkshire’s Recent Mistakes & Challenges | Business & Risk & Strategy | |
144 | Berkshire’s Early Struggles | Business & Risk & Strategy | |
145 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 0.86 |
146 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
147 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
148 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
149 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
150 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
151 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
152 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
153 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
154 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
155 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
156 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
157 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
158 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
159 | CEOs’ Mistakes | Business Strategy & Risk | |
160 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
161 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
162 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
163 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
164 | Berkshire’s Strategic Challenges | Business & Risk & Strategy | 0.851 |
165 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
166 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
167 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
168 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
169 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
170 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
171 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
172 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
173 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
174 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
175 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
176 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
177 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
178 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
179 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
180 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
181 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
182 | Strategic Setbacks | Business & Risk & Strategy | |
183 | Recent Mistakes in Berkshire’s Operations | Business & Risk & Strategy | 0.743 |
184 | Hurricane, Tornado, and Wildfire Risks | Financial & Strategic | 0.547 |
185 | Strategic Mistakes | Business & Risk & Strategy | 0.55 |
186 | Auto Insurance Transition | Financial & Strategic | 0.481 |
187 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 0.49 |
188 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
189 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
190 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
191 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
192 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
193 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
194 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
195 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
196 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
197 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
198 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
199 | Berkshire’s Strategic Shifts | Business & Strategy | |
200 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
201 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
202 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
203 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
204 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
205 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
206 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
207 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
208 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
209 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
210 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
211 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
212 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
213 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
214 | Strategic Missteps | Business & Risk & Strategy | |
215 | Strategic Missteps | Business & Risk & Strategy | |
216 | Strategic Missteps | Business & Risk & Strategy | |
217 | Strategic Missteps | Business & Risk & Strategy | |
218 | Strategic Challenges | Business & Risk & Strategy | 0.803 |
219 | Strategic Missteps | Business & Risk & Strategy | |
220 | Strategic Setbacks | Business & Risk & Strategy | |
221 | Recent Mistakes | Business & Risk & Strategy | |
222 | Recent Mistakes | Business & Risk & Strategy | |
223 | Recent Mistakes | Business & Risk & Strategy | |
224 | Recent Mistakes | Business & Risk & Strategy | |
225 | Recent Mistakes | Business & Risk & Strategy | |
226 | Recent Mistakes | Business & Risk & Strategy | 1 |
227 | Recent Mistakes | Business & Risk & Strategy | |
228 | Strategic Missteps | Business & Risk & Strategy | |
229 | Recent Mistakes | Business & Risk & Strategy | |
230 | Strategic Mishaps | Business & Risk & Strategy | |
231 | Recent Mistakes | Business & Risk & Strategy | |
232 | Recent Mistakes | Business & Risk & Strategy | |
233 | Recent Mistakes | Business & Risk & Strategy | |
234 | Financial Risks & Challenges | Business & Risk & Strategy | 0.671 |
235 | Recent Business Mistakes | Business & Risk & Strategy | 0.694 |
236 | Recent Mistakes | Business & Risk & Strategy | |
237 | Strategic Challenges | Business & Risk & Strategy | 0.678 |
238 | Strategic Challenges | Business & Risk & Strategy | |
239 | Recent Business Mistakes | Business & Risk & Strategy | 0.671 |
240 | Strategic Challenges | Business & Risk & Strategy | 0.671 |
241 | Recent Business Mistakes | Business & Risk & Strategy | 0.671 |
242 | Strategic Challenges | Business & Risk & Strategy | 0.671 |
243 | Recent Business Mistakes | Business & Risk & Strategy | 0.671 |
244 | Recent Business Mistakes | Business & Risk & Strategy | |
245 | Recent Business Mistakes | Business & Risk & Strategy | |
246 | Recent Business Mistakes | Business & Risk & Strategy | |
247 | Recent Business Mistakes | Business & Risk & Strategy | |
248 | Recent Business Mistakes | Business & Risk & Strategy | |
249 | Recent Business Mistakes | Business & Risk & Strategy | 1 |
250 | Recent Business Mistakes | Business & Risk & Strategy | |
251 | Recent Business Mistakes | Business & Risk & Strategy | |
252 | Strategic Challenges | Business & Risk & Strategy | 0.671 |
253 | Recent Business Mistakes | Business & Risk & Strategy | 0.671 |
254 | Recent Business Mistakes | Business & Risk & Strategy | |
255 | Recent Business Mistakes | Business & Risk & Strategy | |
256 | Recent Business Mistakes | Business & Risk & Strategy | |
257 | Recent Business Mistakes | Business & Risk & Strategy | |
258 | Recent Business Mistakes | Business & Risk & Strategy | 1 |
Implementation Code
import re
import json
import ollama
import logging
import numpy as np
from typing import List
from pathlib import Path
from dataclasses import dataclass
from pydantic import BaseModel, Field
from sklearn.metrics.pairwise import cosine_similarity
logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger(__name__)
# Don't show noisy logging from httpcore, httpx, pdfminer, or urllib3
logging.getLogger("httpcore").setLevel(logging.WARNING)
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("pdfminer").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)
@dataclass
class ChunkingConfig:
    max_chunk_size: int = 1000  # characters
    min_chunk_size: int = 100  # characters
    similarity_threshold: float = 0.15  # cosine distance threshold for splitting (lower means more splits)
    ollama_model: str = "nomic-embed-text:v1.5"  # Ollama model for embeddings
    overlap_sentences: int = 1  # sentences to overlap between chunks
@dataclass
class SectioningConfig:
    # Model settings
    ollama_model: str = "gemma3:1b"
    max_tokens: int = 200
    temperature: float = 0.1  # Low for consistency
    # Similarity thresholds
    section_merge_threshold: float = 0.75
    title_weight: float = 0.6
    category_weight: float = 0.4
    # Size constraints
    max_section_size: int = 3000
    min_section_size: int = 300


class SimpleChunkMetadata(BaseModel):
    title: str = Field(..., description="Concise descriptive title for this content")
    category: str = Field(..., description="Broad topic category")


class Subsection(BaseModel):
    subsection_title: str
    chunks: List[str]
    combined_content: str
    chunk_count: int
    total_length: int


class DocumentSection(BaseModel):
    section_title: str
    category: str
    subsections: List[Subsection]
    total_chunks: int
    total_length: int
class SemanticChunker:
    def __init__(self, config: ChunkingConfig):
        self.config = config

    def get_embedding(self, text: str) -> np.ndarray:
        """Get embedding from Ollama using the Python client"""
        # Ensure the text is not empty, as empty strings might cause issues with embeddings
        if not text.strip():
            return np.zeros(768)  # Return a zero vector for empty text, assuming 768 dimensions for nomic-embed-text
        response = ollama.embeddings(model=self.config.ollama_model, prompt=text)
        return np.array(response["embedding"])

    def split_into_sentences(self, text: str) -> List[str]:
        """Split text into sentences using regex"""
        # Regex to handle common sentence endings while avoiding splitting on abbreviations (e.g., Mr. Smith).
        # It looks for . ! or ? followed by whitespace and an uppercase letter, but not if preceded by a common abbreviation pattern.
        sentence_pattern = r"(?<!\b[A-Z]\.)(?<!\b[A-Z][a-z]\.)(?<=[.!?])\s+(?=[A-Z])"
        sentences = re.split(sentence_pattern, text)
        return [s.strip() for s in sentences if s.strip()]
    def chunk_sentences(self, sentences: List[str]) -> List[str]:
        """
        Chunks sentences based on semantic similarity, respecting max_chunk_size,
        and adding overlap.
        """
        if not sentences:
            return []
        chunks: List[str] = []
        current_chunk_sentences: List[str] = []
        current_chunk_char_length = 0
        # Get embeddings for all sentences
        sentence_embeddings = [self.get_embedding(s) for s in sentences]
        for i, sentence in enumerate(sentences):
            sentence_length = len(sentence)
            # Check if adding the current sentence would exceed max_chunk_size
            # and if we have enough content to form a valid chunk
            if current_chunk_char_length + sentence_length > self.config.max_chunk_size and current_chunk_sentences:
                # If we're at the beginning of a potential new chunk and the first sentence itself is too long,
                # we'll add it as a standalone chunk and handle it.
                if current_chunk_char_length == 0:
                    chunks.append(sentence)
                    current_chunk_sentences = []
                    current_chunk_char_length = 0
                    continue  # Move to the next sentence
                else:
                    # Finalize the current chunk
                    chunks.append(" ".join(current_chunk_sentences))
                    # Reset for the next chunk, adding overlap
                    current_chunk_sentences = sentences[max(0, i - self.config.overlap_sentences) : i]
                    current_chunk_char_length = sum(len(s) for s in current_chunk_sentences)
            # Add sentence to current chunk
            current_chunk_sentences.append(sentence)
            current_chunk_char_length += sentence_length
            # Semantic split check (only if there's more than one sentence in current_chunk_sentences)
            if len(current_chunk_sentences) > 1:
                # Compare the last two sentences in current_chunk_sentences.
                # We're interested in the similarity between the last added sentence and the one before it
                # to detect a potential topic shift at the boundary.
                embed1 = sentence_embeddings[i - 1]  # Embedding of the sentence before the current one
                embed2 = sentence_embeddings[i]  # Embedding of the current sentence
                # Calculate cosine distance (1 - cosine_similarity)
                # A higher distance means less similarity.
                if np.linalg.norm(embed1) == 0 and np.linalg.norm(embed2) == 0:
                    # Handle case where both embeddings are zero vectors (e.g., from empty strings), distance is 0
                    distance = 0
                elif np.linalg.norm(embed1) == 0 or np.linalg.norm(embed2) == 0:
                    # If one is zero and the other is not, they are dissimilar
                    distance = 1.0
                else:
                    distance = 1 - cosine_similarity(embed1.reshape(1, -1), embed2.reshape(1, -1))[0][0]
                if distance > self.config.similarity_threshold:
                    log.debug(f"Semantic split detected between sentences {i - 1} and {i} with distance {distance:.4f}")
                    # Semantic split detected!
                    # If the chunk is long enough, finalize it before the current sentence.
                    if current_chunk_char_length - sentence_length >= self.config.min_chunk_size:
                        # Append chunk excluding the current sentence, which starts a new one
                        chunks.append(" ".join(current_chunk_sentences[:-1]))
                        # Reset for the next chunk, adding overlap
                        current_chunk_sentences = sentences[max(0, i - self.config.overlap_sentences) : i + 1]
                        current_chunk_char_length = sum(len(s) for s in current_chunk_sentences)
                    # If the chunk is too short, we don't split yet and continue to build it.
                    # This prevents very small chunks due to minor semantic shifts.
        # Add any remaining sentences as the last chunk
        if current_chunk_sentences:
            chunks.append(" ".join(current_chunk_sentences))
        # Post-processing: ensure no chunks are empty
        final_chunks = [chunk for chunk in chunks if chunk.strip()]
        return final_chunks
class SectionGrouper:
    def __init__(self, sectioning_config: SectioningConfig, embedding_model: str = "nomic-embed-text:v1.5"):
        self.config = sectioning_config
        self.embedding_model = embedding_model

    def generate_chunk_metadata(self, chunk: str, previous_metadata) -> SimpleChunkMetadata:
        """Generate structured metadata for a chunk using Ollama"""
        # Prepare previous metadata as a string for context
        metadata = ""
        if len(previous_metadata) > 0:
            metadata = (
                "\n".join(
                    f"{i + 1}. Title: {meta.title}, Category: {meta.category}"
                    for i, (chunk, meta) in enumerate(previous_metadata)
                )
                if previous_metadata
                else ""
            )
        if metadata != "":
            metadata = f"Previous metadata:\n{metadata}\n\n"
        prompt = f"""Analyze the following text and provide structured metadata:
Text: {chunk}
{metadata}
Generate a JSON response with:
- title: A concise 3-8 word descriptive title
- category: A broad 1-3 word topic category
Be specific and descriptive but concise. Keep consistency in section titles and categories across similar content."""
        try:
            response = ollama.generate(
                model=self.config.ollama_model,
                prompt=prompt,
                format=SimpleChunkMetadata.model_json_schema(),
                options={"temperature": self.config.temperature, "num_predict": self.config.max_tokens},
                keep_alive=True,
            )
            # Parse the JSON response
            metadata_dict = json.loads(response["response"])
            return SimpleChunkMetadata(**metadata_dict)
        except Exception as e:
            log.warning(f"Failed to generate metadata for chunk, using fallback: {e}")
            # Fallback metadata
            return SimpleChunkMetadata(
                title=f"Section {hash(chunk[:100]) % 1000}",
                category="General",
            )
    def get_embedding(self, text: str) -> np.ndarray:
        """Get embedding for text"""
        if not text.strip():
            return np.zeros(768)
        response = ollama.embeddings(model=self.embedding_model, prompt=text)
        return np.array(response["embedding"])

    def calculate_similarity(self, metadata1: SimpleChunkMetadata, metadata2: SimpleChunkMetadata) -> float:
        """Calculate weighted similarity between two chunk metadata objects"""
        # Get embeddings for each field
        title1_emb = self.get_embedding(metadata1.title)
        title2_emb = self.get_embedding(metadata2.title)
        category1_emb = self.get_embedding(metadata1.category)
        category2_emb = self.get_embedding(metadata2.category)

        # Calculate cosine similarities
        def safe_cosine_similarity(emb1, emb2):
            if np.linalg.norm(emb1) == 0 or np.linalg.norm(emb2) == 0:
                return 0.0
            return cosine_similarity(emb1.reshape(1, -1), emb2.reshape(1, -1))[0][0]

        title_sim = safe_cosine_similarity(title1_emb, title2_emb)
        category_sim = safe_cosine_similarity(category1_emb, category2_emb)
        # Weighted similarity
        weighted_sim = self.config.title_weight * title_sim + self.config.category_weight * category_sim
        return weighted_sim
    def group_chunks_into_sections(self, chunks: List[str]) -> List[DocumentSection]:
        """Group chunks into coherent sections based on metadata similarity"""
        if not chunks:
            return []
        log.info(f"Generating metadata for {len(chunks)} chunks...")
        # Stage 1: Generate metadata for all chunks
        chunk_metadata = []
        for i, chunk in enumerate(chunks):
            # We pass in the last 5 metadata items as context for the next chunk
            metadata = self.generate_chunk_metadata(
                chunk, chunk_metadata[-5:] if len(chunk_metadata) >= 5 else chunk_metadata
            )
            chunk_metadata.append((chunk, metadata))
            log.debug(f"Chunk {i + 1}: {metadata.title} | {metadata.category}")
        log.info("Grouping chunks into sections...")
        # Stage 2: Group chunks based on similarity
        sections = []
        current_section_chunks = [(chunk_metadata[0][0], chunk_metadata[0][1])]
        current_section_category = chunk_metadata[0][1].category
        for i in range(1, len(chunk_metadata)):
            chunk, metadata = chunk_metadata[i]
            prev_metadata = chunk_metadata[i - 1][1]
            # Calculate similarity with previous chunk
            similarity = self.calculate_similarity(metadata, prev_metadata)
            # Check if we should start a new section
            should_split = (
                # Similarity is below threshold so the current chunk is likely different
                # from the previous section
                similarity < self.config.section_merge_threshold
                # Or if the current section is too long
                or sum(len(c[0]) for c in current_section_chunks) + len(chunk) > self.config.max_section_size
            )
            if should_split and len(current_section_chunks) > 0:
                # Finalize current section
                section = self._create_section(current_section_chunks, current_section_category)
                sections.append(section)
                # Start new section
                current_section_chunks = [(chunk, metadata)]
                current_section_category = metadata.category
                log.debug(f"New section started at chunk {i + 1}, similarity: {similarity:.3f}")
            else:
                # Add to current section
                current_section_chunks.append((chunk, metadata))
        # Add final section
        if current_section_chunks:
            section = self._create_section(current_section_chunks, current_section_category)
            sections.append(section)
        log.info(f"Created {len(sections)} sections from {len(chunks)} chunks")
        return sections
    def _create_section(self, section_chunks: List[tuple], category: str) -> DocumentSection:
        """Create a DocumentSection from a list of (chunk, metadata) tuples"""
        # Group chunks into subsections based on title similarity
        subsections = []
        current_subsection = []
        current_title = section_chunks[0][1].title
        for chunk, metadata in section_chunks:
            if not current_subsection:
                current_subsection = [(chunk, metadata)]
                current_title = metadata.title
            else:
                # Check if we should group with current subsection
                title_sim = self.calculate_similarity(metadata, current_subsection[-1][1])
                if title_sim > 0.8:  # High threshold for subsection grouping
                    current_subsection.append((chunk, metadata))
                else:
                    # Create subsection from current group
                    subsection = self._create_subsection(current_subsection, current_title)
                    subsections.append(subsection)
                    # Start new subsection
                    current_subsection = [(chunk, metadata)]
                    current_title = metadata.title
        # Add final subsection
        if current_subsection:
            subsection = self._create_subsection(current_subsection, current_title)
            subsections.append(subsection)
        # Create section title from most common category and representative title
        section_title = self._generate_section_title(section_chunks)
        return DocumentSection(
            section_title=section_title,
            category=category,
            subsections=subsections,
            total_chunks=len(section_chunks),
            total_length=sum(len(chunk) for chunk, _ in section_chunks),
        )

    def _create_subsection(self, subsection_chunks: List[tuple], title: str) -> Subsection:
        """Create a Subsection from a list of (chunk, metadata) tuples"""
        chunks = [chunk for chunk, _ in subsection_chunks]
        combined_content = "\n\n".join(chunks)
        return Subsection(
            subsection_title=title,
            chunks=chunks,
            combined_content=combined_content,
            chunk_count=len(chunks),
            total_length=len(combined_content),
        )

    def _generate_section_title(self, section_chunks: List[tuple]) -> str:
        """Generate a representative title for the section"""
        # Use the title from the first chunk, or create from category
        if section_chunks:
            first_metadata = section_chunks[0][1]
            if len(section_chunks) == 1:
                return first_metadata.title
            else:
                # For multi-chunk sections, use category-based title
                return f"{first_metadata.category} Overview"
        return "Untitled Section"
def pdf_to_markdown(pdf_file: Path) -> str:
    from markitdown import MarkItDown

    md = MarkItDown()
    result = md.convert(pdf_file)
    text_content = result.text_content
    # Keep only ascii characters
    text_content = "".join(c for c in text_content if ord(c) < 128)
    # Remove leading and trailing whitespace
    text_content = text_content.strip()
    # Remove multiple newlines
    text_content = "\n".join(line.strip() for line in text_content.splitlines() if line.strip())
    return text_content


def get_pdf():
    import requests

    # Download the 2024 shareholder letter as sample text
    url = "https://www.berkshirehathaway.com/letters/2024ltr.pdf"
    response = requests.get(url)
    if response.status_code != 200:
        raise ValueError(f"Failed to download PDF: {response.status_code}")
    pdf_file = Path("./example_cache/2024ltr.pdf")
    pdf_file.parent.mkdir(parents=True, exist_ok=True)
    pdf_file.write_bytes(response.content)
    # Convert PDF to markdown text
    text = pdf_to_markdown(pdf_file)
    return text
# Usage example
def main():
    # Example usage
    sample_text = get_pdf()

    # Initialize chunker
    chunking_config = ChunkingConfig(
        max_chunk_size=750,
        min_chunk_size=150,
        similarity_threshold=0.25,
        ollama_model="nomic-embed-text:v1.5",
        overlap_sentences=2,
    )
    # Initialize section grouper
    sectioning_config = SectioningConfig(
        ollama_model="gemma3:1b", section_merge_threshold=0.75, max_section_size=3000, min_section_size=300
    )

    # Stage 1: Semantic chunking
    chunker = SemanticChunker(chunking_config)
    sentences = chunker.split_into_sentences(sample_text)
    chunks = chunker.chunk_sentences(sentences)
    print(f"Stage 1 Complete - Generated {len(chunks)} semantic chunks")

    # Stage 2: Section grouping
    grouper = SectionGrouper(sectioning_config)
    sections = grouper.group_chunks_into_sections(chunks)

    # Write sections to JSON file
    output_file = Path("./example_cache/sections.json")
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with output_file.open("w", encoding="utf-8") as f:
        json.dump([section.model_dump() for section in sections], f, indent=2, ensure_ascii=False)
    log.info(f"Sections written to {output_file}")
    print(f"Stage 2 Complete - Generated {len(sections)} sections")

    # Display results
    print("\n" + "=" * 80)
    print("DOCUMENT STRUCTURE")
    print("=" * 80)
    for i, section in enumerate(sections, 1):
        print(f"\nSECTION {i}: {section.section_title}")
        print(f"Category: {section.category}")
        print(f"Total chunks: {section.total_chunks} | Length: {section.total_length} chars")
        print("-" * 60)
        for j, subsection in enumerate(section.subsections, 1):
            print(f"  {i}.{j} {subsection.subsection_title}")
            print(f"      Chunks: {subsection.chunk_count} | Length: {subsection.total_length} chars")
            # Show first chunk preview
            if subsection.chunks:
                preview = (
                    subsection.chunks[0][:200] + "..." if len(subsection.chunks[0]) > 200 else subsection.chunks[0]
                )
                print(f"      Preview: {preview}")
            print()
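Note that the listing defines main() but never invokes it. To run it as a script (assuming a local Ollama instance with the nomic-embed-text:v1.5 and gemma3:1b models pulled), add the usual entry point:

```python
if __name__ == "__main__":
    main()
```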