Asif Rahman

Document Chunking for LLMs

Chunking documents into smaller segments for LLM-based retrieval-augmented generation and semantic search is a central challenge in building robust and useful systems. Here I outline some of these challenges and the competing priorities, and provide a simple but effective method for chunking documents for LLMs using recursive sentence embedding and semantic similarity.

I’ve found that splitting large documents into smaller chunks requires some trial and error to find the right strategy.

There are basically three competing factors:

  1. Semantic coherence within chunks - each chunk should contain related information. A small chunk size is more coherent but loses context, while a large chunk size carries more context but is less coherent.
  2. Semantic separation between chunks - chunks should be distinct enough to avoid overlap but not so distinct that they lose context. This means we want to avoid splitting at arbitrary sentence or paragraph boundaries and instead split at the semantic boundaries of sections and topics. Some overlap between chunks can even be helpful for keeping context.
  3. Information preservation - chunks should be self-contained enough to be useful on their own. This means we need to split text at context boundaries and ensure that similar concepts are grouped together in the same chunk rather than split across chunks.

Ultimately, the appropriate strategy depends on the type of document and the intended use case. For example, a user guide is organized into chapters and sections. The beginning of a chapter introduces the topic, followed by sections that provide detailed workflows and instructions. In this case, it makes sense to keep the workflows and instructions together in the same chunk. We want to avoid splitting a step in a workflow into two independent chunks, where it loses its place in the workflow and, more importantly, the context of the step. If the document is more structured, like an invoice or a technical specification with tables, then a more sophisticated chunking strategy may be needed to preserve the relationships between the data points. This article focuses on chunking textual documents, such as reports and manuals, where the text is more free-form and less structured.

Algorithm Overview

Let's build a chunking algorithm based on these principles from the ground up. Conceptually, we want to start with sentences, since they represent the basic unit of coherent meaning. From there we need a strategy for iteratively merging sentences that balances the three stated objectives: semantic coherence within chunks, semantic separation between chunks, and information preservation.

So the algorithm consists of two main stages:

  1. Semantic chunking breaks the document into semantically coherent chunks by:
    • Splitting into sentences
    • Computing embeddings for each sentence
    • Using cosine distance between consecutive sentence embeddings to detect topic shifts (see the sketch after this list)
    • Enforcing size constraints while adding overlap
  2. Section grouping organizes chunks into a hierarchical structure by:
    • Generating descriptive metadata (title and category) for each chunk using an LLM
    • Computing weighted similarity between chunk metadata
    • Grouping similar chunks into sections based on similarity thresholds
    • Creating subsections within sections for fine-grained organization
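
Below is a minimal sketch of the stage-1 boundary test, assuming a locally running Ollama server: embed two consecutive sentences and flag a topic shift when the cosine distance between them exceeds a threshold. It uses the same embedding model as the full implementation later in this article; the threshold and the sample sentences are illustrative.

import ollama
import numpy as np


def embed(text: str) -> np.ndarray:
    # Get a sentence embedding from a locally running Ollama server
    response = ollama.embeddings(model="nomic-embed-text:v1.5", prompt=text)
    return np.array(response["embedding"])


def is_topic_shift(prev_sentence: str, next_sentence: str, threshold: float = 0.25) -> bool:
    # Cosine distance = 1 - cosine similarity; a larger distance means less similar
    a, b = embed(prev_sentence), embed(next_sentence)
    distance = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return distance > threshold


sentences = [
    "Berkshire reported record operating earnings this year.",
    "Insurance underwriting was the largest contributor to those earnings.",
    "Separately, the board approved a new CEO succession plan.",
]
for prev, nxt in zip(sentences, sentences[1:]):
    print(is_topic_shift(prev, nxt), "|", nxt)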

Example: Berkshire Hathaway Annual Report

As an example, I’ve taken the 2024 Berkshire Hathaway Annual Report and passed it through the chunking process. First I generate sentences from the document using a regex, then compute embeddings and use cosine similarity to merge sentences into partially overlapping chunks. Next I take each chunk and use an LLM to generate metadata, including a title and category, for each chunk. Notice how there is consistency in the titles and categories across chunks. This is because the LLM is prompted with the trailing 5 metadata entries to provide context for the current chunk. Finally, I use the embedding approach again to merge chunks into sections based on the semantic similarity of each chunk’s title and category. A tunable similarity threshold of 0.75 is used to combine chunks into sections. The resulting sections are more coherent and self-contained, making them easier to use in a RAG system.
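
To make the consistency mechanism concrete, here is a minimal sketch of how the trailing metadata entries can be folded into the prompt (the field names follow the SimpleChunkMetadata model in the implementation below; the entries shown are taken from the table):

# Trailing (title, category) pairs from previously processed chunks
previous_metadata = [
    ("Berkshire Hathaway Annual Report", "Financial Report"),
    ("Company Transparency & Reporting", "Corporate Communication & Reporting"),
]
context = "\n".join(
    f"{i + 1}. Title: {title}, Category: {category}"
    for i, (title, category) in enumerate(previous_metadata)
)
prompt = f"Previous metadata:\n{context}\n\nAnalyze the following text..."
print(prompt)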

The table below shows the title and category generated by an LLM for each chunk. Splits happen when the similarity between the current chunk and the previous chunk falls below a threshold of 0.75 or when the accumulated section size exceeds a maximum character limit. When a split occurs, the table reports the similarity of the current chunk to the previous chunk.
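
As a concrete illustration of the split rule, suppose two consecutive chunks have a title similarity of 0.80 and a category similarity of 0.60 (illustrative numbers, not taken from the table; the weights and threshold match the SectioningConfig defaults in the implementation below):

# Weighted metadata similarity used in the section split decision
title_weight, category_weight = 0.6, 0.4
title_sim, category_sim = 0.80, 0.60

weighted_sim = title_weight * title_sim + category_weight * category_sim
print(weighted_sim)  # 0.72, below the 0.75 threshold, so a new section starts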

At the beginning of the letter to shareholders, the chunks are smaller and more granular because Buffett is introducing several concepts and providing an overview of the risks, management philosophy, and strategic focus. These chunks span only a few sentences, and the similarity between consecutive chunks is lower. As the letter progresses, the chunks become larger and more coherent, with higher similarity scores, as Buffett discusses specific topics in more detail. The final sections of the letter are larger and more comprehensive, covering specific companies, investments, and their performance in depth. Notice that even when the similarity between chunks is high and the title and category are identical, a new split is still created once the accumulated size grows past the maximum limit.

Chunk Title Category Previous Chunk Similarity
1 Berkshire Hathaway Annual Report Financial Report
2 Berkshire Hathaway Annual Report Financial Report
3 Company Transparency & Reporting Corporate Communication & Reporting 0.541
4 Responsibility & Communication Corporate Communication & Reporting 0.74
5 Strategic Review & Ownership Dialogue Corporate Communication & Reporting 0.696
6 Communication Strategy Business & Investor Relations 0.576
7 Berkshire Hathaway’s Risk Management Approach Financial Risk Management 0.479
8 Mistakes and Strategic Assessment Business Strategy 0.528
9 Mistakes in Berkshire Acquisitions Business Risk & Management 0.641
10 Mistakes in Hiring, Impacting Berkshire Corporate Management & Risk
11 Mistakes in Hiring Assessment Personnel Management & Decision Making 0.668
12 Painful Mistakes, Diminishing Returns Financial & Strategic 0.506
13 Mistakes and Delay Strategic & Operational 0.679
14 Mistakes and Analysis Business Strategy 0.643
15 Mistakes and Their Impact Business Strategy
16 Word Frequency Analysis Text Analysis 0.37
17 Company Observations Business & Communication 0.405
18 Behavioral Observations Business & Risk 0.652
19 CEO Succession & Risk Corporate Management & Risk 0.568
20 CEO Transition & Risk Corporate Management & Risk
21 CEO Succession & Berkshire’s Risk Corporate Strategy & Risk
22 Berkshire CEO Philosophy Corporate Strategy & Leadership
23 Pete Liegl’s Legacy Business & Leadership 0.53
24 Pete Liegl - A Wealthing Story Business & Financial
25 Pete - Forest River Founder Business & Founding 0.572
26 Forest River Acquisition Business & Financial 0.638
27 RV Deal - Initial Communication Business & Communication & Financial 0.648
28 Berkshire Acquisition Deal Business & Finance 0.646
29 Meeting Details & Price Discussion Business & Communication & Strategy 0.513
30 Meeting & Deal Business & Communication
31 Business Meeting & Financial Planning Business & Strategy 0.664
32 Berkshire Hathaway Business Deal Business & Financial 0.576
33 Real Estate Deal Business & Financial
34 Real Estate Lease Dispute Business & Financial
35 Meeting Dynamics Business & Communication & Strategy 0.514
36 Compensation Structure Financial & Human Resources 0.429
37 Compensation Structure Financial & Strategy
38 Berkshire’s Compensation Offer Financial & Business 0.669
39 Berkshire’s Financial Strategy Financial Strategy & Risk 0.677
40 Berkshire’s Early Success Business & Financial 0.741
41 Simple Success Business & Financial 0.737
42 Berkshire’s Strategic Mistakes Business & Risk & Strategy 0.492
43 Berkshire’s Performance Financial & Strategic 0.736
44 Mistakes in Berkshire Acquisitions Business Risk & Management 0.645
45 Berkshire’s Strategic Imperfections Business Strategy & Risk
46 Strategic Focus & Partnership Business & Strategy 0.624
47 Strategic Imperative Business & Strategy
48 CEO Mistakes & Analysis Business Strategy & Risk 0.614
49 Mistakes in Berkshire Acquisitions Business Risk & Management 0.745
50 Berkshire’s Strategic Mistakes Business & Risk & Strategy
51 Berkshire’s Strategic Mistakes Business & Risk & Strategy
52 Berkshire’s Strategic Mistakes Business & Risk & Strategy
53 Berkshire’s Mistakes & Strategic Challenges Business & Risk & Strategy
54 Mistakes in Berkshire Acquisitions Business Risk & Management
55 Berkshire’s Strategic Mistakes Business & Risk & Strategy
56 Berkshire’s Strategic Mistakes Business & Risk & Strategy 1
57 Berkshire’s Strategic Mistakes Business & Risk & Strategy
58 Pete Liegl’s Performance Business & Financial 0.475
59 GEICO Restructuring Business & Strategy 0.46
60 GEICO Transformation Business Strategy & Operational
61 Berkshire’s Recent Mistakes Business & Risk & Strategy 0.536
62 Property Casualty Pricing Surge Financial & Strategic 0.522
63 Convective Storm Damage Financial Risk & Strategic 0.652
64 Berkshire’s Recent Challenges Business & Risk & Strategy 0.55
65 Berkshire’s Recent Financial Challenges Financial & Strategic
66 Insurance Losses & Strategic Risks Business & Risk & Strategy 0.559
67 Berkshire’s Strategic Mistakes Business & Risk & Strategy 0.747
68 Berkshire’s Financial Performance Financial & Strategic 0.74
69 Berkshire’s Recent Mistakes Business & Risk & Strategy
70 Berkshire’s Financial Performance Financial & Strategic
71 Berkshire’s Recent Mistakes Business & Risk & Strategy
72 Berkshire’s Strategic Mistakes Business & Risk & Strategy
73 Berkshire’s Recent Mistakes Business & Risk & Strategy
74 Berkshire’s Strategic Mistakes Business & Risk & Strategy
75 Berkshire’s Recent Mistakes & Challenges Business & Risk & Strategy 0.923
76 Berkshire’s Recent Mistakes Business & Risk & Strategy
77 Berkshire’s Recent Mistakes & Strategic Challenges Business & Risk & Strategy
78 Berkshire’s Recent Mistakes & Challenges Business & Risk & Strategy
79 Berkshire’s Recent Mistakes & Strategic Challenges Business & Risk & Strategy 0.981
80 Berkshire’s Recent Mistakes Business & Risk & Strategy
81 Berkshire’s Recent Mistakes Business & Risk & Strategy 1
82 Berkshire’s Recent Mistakes Business & Risk & Strategy
83 Berkshire’s Recent Mistakes Business & Risk & Strategy
84 Berkshire’s Recent Mistakes Business & Risk & Strategy
85 Berkshire’s Recent Mistakes Business & Risk & Strategy
86 Berkshire’s Mistakes Business & Risk & Strategy 0.982
87 Berkshire’s Early Mistakes Business & Risk & Strategy
88 Berkshire’s Early Mistakes Business & Risk & Strategy
89 Berkshire’s Early Mistakes Business & Risk & Strategy
90 Berkshire’s Strategic Missteps Business & Risk & Strategy
91 Berkshire’s Tax Burden Financial & Strategic 0.697
92 Berkshire’s Tax Burden Financial & Risk & Strategy
93 Berkshire’s Recent Mistakes Business & Risk & Strategy
94 Berkshire’s Recent Mistakes Business & Risk & Strategy
95 Berkshire’s Recent Mistakes Business & Risk & Strategy
96 Berkshire’s Recent Mistakes Business & Risk & Strategy 1
97 Berkshire’s Recent Mistakes Business & Risk & Strategy
98 Berkshire’s Financial Challenges Financial & Strategic 0.742
99 Berkshire’s Recent Mistakes Business & Risk & Strategy 0.742
100 Berkshire’s Financial Mistakes Financial & Strategic
101 Berkshire’s Recent Mistakes Business & Risk & Strategy
102 Berkshire’s Recent Mistakes Business & Risk & Strategy
103 Berkshire’s Recent Mistakes Business & Risk & Strategy
104 Berkshire’s Recent Mistakes Business & Risk & Strategy
105 Berkshire’s Recent Mistakes & Challenges Business & Risk & Strategy
106 Berkshire’s Recent Mistakes & Challenges Business & Risk & Strategy
107 Berkshire’s Recent Mistakes Business & Risk & Strategy 0.964
108 Berkshire’s Recent Mistakes Business & Risk & Strategy
109 Berkshire’s Recent Mistakes Business & Risk & Strategy
110 Berkshire’s Recent Mistakes Business & Risk & Strategy
111 Berkshire’s Recent Mistakes Business & Risk & Strategy
112 Berkshire’s Recent Mistakes Business & Risk & Strategy
113 Berkshire’s Recent Mistakes Business & Risk & Strategy
114 Berkshire’s Recent Mistakes Business & Risk & Strategy 1
115 Berkshire’s Strategic Setbacks Business & Risk & Strategy
116 Berkshire’s Recent Mistakes Business & Risk & Strategy
117 Berkshire’s Recent Mistakes Business & Risk & Strategy
118 Berkshire’s Recent Mistakes Business & Risk & Strategy
119 Berkshire’s Strategic Challenges Business & Risk & Strategy
120 Berkshire’s Strategic Mistakes Business & Risk & Strategy
121 Berkshire’s Recent Mistakes Business & Risk & Strategy
122 Berkshire’s Recent Mistakes Business & Risk & Strategy 1
123 Berkshire’s Recent Mistakes Business & Risk & Strategy
124 Berkshire’s Recent Mistakes Business & Risk & Strategy
125 Berkshire’s Strategic Mistakes Business & Risk & Strategy
126 Berkshire’s Strategic Mistakes Business & Risk & Strategy
127 Berkshire’s Strategic Mistakes Business & Risk & Strategy
128 Berkshire’s Strategic Challenges Business & Risk & Strategy
129 Berkshire’s Recent Mistakes Business & Risk & Strategy 0.851
130 Berkshire’s Strategic Mistakes Business & Risk & Strategy
131 Berkshire’s Strategic Mistakes Business & Risk & Strategy
132 Berkshire’s Recent Mistakes Business & Risk & Strategy
133 Berkshire’s Recent Mistakes Business & Risk & Strategy
134 Berkshire’s Recent Mistakes Business & Risk & Strategy
135 Berkshire’s Recent Mistakes Business & Risk & Strategy 1
136 Berkshire’s Early Struggles Business & Risk & Strategy
137 Berkshire’s Recent Mistakes Business & Risk & Strategy
138 Berkshire’s Recent Mistakes Business & Risk & Strategy
139 Berkshire’s Recent Mistakes Business & Risk & Strategy
140 Berkshire’s Recent Mistakes Business & Risk & Strategy
141 Berkshire’s Recent Mistakes Business & Risk & Strategy 1
142 Berkshire’s Recent Mistakes Business & Risk & Strategy
143 Berkshire’s Recent Mistakes & Challenges Business & Risk & Strategy
144 Berkshire’s Early Struggles Business & Risk & Strategy
145 Berkshire’s Recent Mistakes Business & Risk & Strategy 0.86
146 Berkshire’s Recent Mistakes Business & Risk & Strategy
147 Berkshire’s Recent Mistakes Business & Risk & Strategy
148 Berkshire’s Recent Mistakes Business & Risk & Strategy
149 Berkshire’s Recent Mistakes Business & Risk & Strategy
150 Berkshire’s Recent Mistakes Business & Risk & Strategy
151 Berkshire’s Recent Mistakes Business & Risk & Strategy
152 Berkshire’s Recent Mistakes Business & Risk & Strategy
153 Berkshire’s Recent Mistakes Business & Risk & Strategy
154 Berkshire’s Recent Mistakes Business & Risk & Strategy 1
155 Berkshire’s Recent Mistakes Business & Risk & Strategy
156 Berkshire’s Recent Mistakes Business & Risk & Strategy
157 Berkshire’s Recent Mistakes Business & Risk & Strategy
158 Berkshire’s Recent Mistakes Business & Risk & Strategy
159 CEOs’ Mistakes Business Strategy & Risk
160 Berkshire’s Recent Mistakes Business & Risk & Strategy
161 Berkshire’s Recent Mistakes Business & Risk & Strategy 1
162 Berkshire’s Recent Mistakes Business & Risk & Strategy
163 Berkshire’s Recent Mistakes Business & Risk & Strategy
164 Berkshire’s Strategic Challenges Business & Risk & Strategy 0.851
165 Berkshire’s Recent Mistakes Business & Risk & Strategy
166 Berkshire’s Recent Mistakes Business & Risk & Strategy
167 Berkshire’s Recent Mistakes Business & Risk & Strategy
168 Berkshire’s Recent Mistakes Business & Risk & Strategy
169 Berkshire’s Recent Mistakes Business & Risk & Strategy
170 Berkshire’s Recent Mistakes Business & Risk & Strategy
171 Berkshire’s Recent Mistakes Business & Risk & Strategy
172 Berkshire’s Recent Mistakes Business & Risk & Strategy 1
173 Berkshire’s Recent Mistakes Business & Risk & Strategy
174 Berkshire’s Recent Mistakes Business & Risk & Strategy
175 Berkshire’s Recent Mistakes Business & Risk & Strategy
176 Berkshire’s Recent Mistakes Business & Risk & Strategy
177 Berkshire’s Recent Mistakes Business & Risk & Strategy
178 Berkshire’s Recent Mistakes Business & Risk & Strategy
179 Berkshire’s Recent Mistakes Business & Risk & Strategy
180 Berkshire’s Recent Mistakes Business & Risk & Strategy 1
181 Berkshire’s Recent Mistakes Business & Risk & Strategy
182 Strategic Setbacks Business & Risk & Strategy
183 Recent Mistakes in Berkshire’s Operations Business & Risk & Strategy 0.743
184 Hurricane, Tornado, and Wildfire Risks Financial & Strategic 0.547
185 Strategic Mistakes Business & Risk & Strategy 0.55
186 Auto Insurance Transition Financial & Strategic 0.481
187 Berkshire’s Recent Mistakes Business & Risk & Strategy 0.49
188 Berkshire’s Recent Mistakes Business & Risk & Strategy
189 Berkshire’s Recent Mistakes Business & Risk & Strategy
190 Berkshire’s Recent Mistakes Business & Risk & Strategy
191 Berkshire’s Recent Mistakes Business & Risk & Strategy
192 Berkshire’s Recent Mistakes Business & Risk & Strategy
193 Berkshire’s Recent Mistakes Business & Risk & Strategy
194 Berkshire’s Recent Mistakes Business & Risk & Strategy
195 Berkshire’s Recent Mistakes Business & Risk & Strategy
196 Berkshire’s Recent Mistakes Business & Risk & Strategy 1
197 Berkshire’s Recent Mistakes Business & Risk & Strategy
198 Berkshire’s Recent Mistakes Business & Risk & Strategy
199 Berkshire’s Strategic Shifts Business & Strategy
200 Berkshire’s Recent Mistakes Business & Risk & Strategy
201 Berkshire’s Recent Mistakes Business & Risk & Strategy
202 Berkshire’s Recent Mistakes Business & Risk & Strategy
203 Berkshire’s Recent Mistakes Business & Risk & Strategy 1
204 Berkshire’s Recent Mistakes Business & Risk & Strategy
205 Berkshire’s Recent Mistakes Business & Risk & Strategy
206 Berkshire’s Recent Mistakes Business & Risk & Strategy
207 Berkshire’s Recent Mistakes Business & Risk & Strategy
208 Berkshire’s Recent Mistakes Business & Risk & Strategy
209 Berkshire’s Recent Mistakes Business & Risk & Strategy
210 Berkshire’s Recent Mistakes Business & Risk & Strategy
211 Berkshire’s Recent Mistakes Business & Risk & Strategy 1
212 Berkshire’s Recent Mistakes Business & Risk & Strategy
213 Berkshire’s Recent Mistakes Business & Risk & Strategy
214 Strategic Missteps Business & Risk & Strategy
215 Strategic Missteps Business & Risk & Strategy
216 Strategic Missteps Business & Risk & Strategy
217 Strategic Missteps Business & Risk & Strategy
218 Strategic Challenges Business & Risk & Strategy 0.803
219 Strategic Missteps Business & Risk & Strategy
220 Strategic Setbacks Business & Risk & Strategy
221 Recent Mistakes Business & Risk & Strategy
222 Recent Mistakes Business & Risk & Strategy
223 Recent Mistakes Business & Risk & Strategy
224 Recent Mistakes Business & Risk & Strategy
225 Recent Mistakes Business & Risk & Strategy
226 Recent Mistakes Business & Risk & Strategy 1
227 Recent Mistakes Business & Risk & Strategy
228 Strategic Missteps Business & Risk & Strategy
229 Recent Mistakes Business & Risk & Strategy
230 Strategic Mishaps Business & Risk & Strategy
231 Recent Mistakes Business & Risk & Strategy
232 Recent Mistakes Business & Risk & Strategy
233 Recent Mistakes Business & Risk & Strategy
234 Financial Risks & Challenges Business & Risk & Strategy 0.671
235 Recent Business Mistakes Business & Risk & Strategy 0.694
236 Recent Mistakes Business & Risk & Strategy
237 Strategic Challenges Business & Risk & Strategy 0.678
238 Strategic Challenges Business & Risk & Strategy
239 Recent Business Mistakes Business & Risk & Strategy 0.671
240 Strategic Challenges Business & Risk & Strategy 0.671
241 Recent Business Mistakes Business & Risk & Strategy 0.671
242 Strategic Challenges Business & Risk & Strategy 0.671
243 Recent Business Mistakes Business & Risk & Strategy 0.671
244 Recent Business Mistakes Business & Risk & Strategy
245 Recent Business Mistakes Business & Risk & Strategy
246 Recent Business Mistakes Business & Risk & Strategy
247 Recent Business Mistakes Business & Risk & Strategy
248 Recent Business Mistakes Business & Risk & Strategy
249 Recent Business Mistakes Business & Risk & Strategy 1
250 Recent Business Mistakes Business & Risk & Strategy
251 Recent Business Mistakes Business & Risk & Strategy
252 Strategic Challenges Business & Risk & Strategy 0.671
253 Recent Business Mistakes Business & Risk & Strategy 0.671
254 Recent Business Mistakes Business & Risk & Strategy
255 Recent Business Mistakes Business & Risk & Strategy
256 Recent Business Mistakes Business & Risk & Strategy
257 Recent Business Mistakes Business & Risk & Strategy
258 Recent Business Mistakes Business & Risk & Strategy 1

Implementation Code

import re
import json
import ollama
import logging
import numpy as np
from typing import List
from pathlib import Path
from dataclasses import dataclass
from pydantic import BaseModel, Field
from sklearn.metrics.pairwise import cosine_similarity

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger(__name__)

# Don't show verbose logging from third-party HTTP and PDF libraries
logging.getLogger("httpcore").setLevel(logging.WARNING)
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("pdfminer").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)


@dataclass
class ChunkingConfig:
    max_chunk_size: int = 1000  # characters
    min_chunk_size: int = 100  # characters
    similarity_threshold: float = 0.15  # cosine distance threshold for splitting (higher means more splits)
    ollama_model: str = "nomic-embed-text:v1.5"  # Ollama model for embeddings
    overlap_sentences: int = 1  # sentences to overlap between chunks


@dataclass
class SectioningConfig:
    # Model settings
    ollama_model: str = "gemma3:1b"
    max_tokens: int = 200
    temperature: float = 0.1  # Low for consistency

    # Similarity thresholds
    section_merge_threshold: float = 0.75
    title_weight: float = 0.6
    category_weight: float = 0.4

    # Size constraints
    max_section_size: int = 3000
    min_section_size: int = 300


class SimpleChunkMetadata(BaseModel):
    title: str = Field(..., description="Concise descriptive title for this content")
    category: str = Field(..., description="Broad topic category")


class Subsection(BaseModel):
    subsection_title: str
    chunks: List[str]
    combined_content: str
    chunk_count: int
    total_length: int


class DocumentSection(BaseModel):
    section_title: str
    category: str
    subsections: List[Subsection]
    total_chunks: int
    total_length: int


class SemanticChunker:
    def __init__(self, config: ChunkingConfig):
        self.config = config

    def get_embedding(self, text: str) -> np.ndarray:
        """Get embedding from Ollama using the Python client"""
        # Ensure the text is not empty, as empty strings might cause issues with embeddings
        if not text.strip():
            return np.zeros(768)  # Return a zero vector for empty text, assuming 768 dimensions for nomic-embed-text
        response = ollama.embeddings(model=self.config.ollama_model, prompt=text)
        return np.array(response["embedding"])

    def split_into_sentences(self, text: str) -> List[str]:
        """Split text into sentences using regex"""
        # Improved regex to handle common sentence endings while avoiding splitting on abbreviations (e.g., Mr. Smith)
        # It looks for . ! or ? followed by whitespace and an uppercase letter, but not if preceded by a common abbreviation pattern.
        # This is a common pattern for sentence splitting.
        sentence_pattern = r"(?<!\b[A-Z]\.)(?<!\b[A-Z][a-z]\.)(?<=[.!?])\s+(?=[A-Z])"
        sentences = re.split(sentence_pattern, text)
        return [s.strip() for s in sentences if s.strip()]

    def chunk_sentences(self, sentences: List[str]) -> List[str]:
        """
        Chunks sentences based on semantic similarity, respecting max_chunk_size,
        and adding overlap.
        """
        if not sentences:
            return []

        chunks: List[str] = []
        current_chunk_sentences: List[str] = []
        current_chunk_char_length = 0

        # Get embeddings for all sentences
        sentence_embeddings = [self.get_embedding(s) for s in sentences]

        for i, sentence in enumerate(sentences):
            sentence_length = len(sentence)

            # If adding this sentence would exceed max_chunk_size, finalize
            # the current chunk. A single sentence longer than max_chunk_size
            # simply becomes its own oversized chunk on a later iteration.
            if current_chunk_char_length + sentence_length > self.config.max_chunk_size and current_chunk_sentences:
                chunks.append(" ".join(current_chunk_sentences))
                # Reset for the next chunk, carrying over the overlap sentences
                current_chunk_sentences = sentences[max(0, i - self.config.overlap_sentences) : i]
                current_chunk_char_length = sum(len(s) for s in current_chunk_sentences)

            # Add sentence to current chunk
            current_chunk_sentences.append(sentence)
            current_chunk_char_length += sentence_length

            # Semantic split check (only if there's more than one sentence in current_chunk_sentences)
            if len(current_chunk_sentences) > 1:
                # Compare the last two sentences in the current_chunk_sentences
                # We're interested in the similarity between the last added sentence and the one before it
                # to detect a potential topic shift at the boundary.
                embed1 = sentence_embeddings[i - 1]  # Embedding of the sentence before the current one
                embed2 = sentence_embeddings[i]  # Embedding of the current sentence

                # Calculate cosine distance (1 - cosine similarity).
                # A higher distance means less similarity.
                norm1, norm2 = np.linalg.norm(embed1), np.linalg.norm(embed2)
                if norm1 == 0 and norm2 == 0:
                    # Both embeddings are zero vectors (e.g., from empty strings); treat as identical
                    distance = 0.0
                elif norm1 == 0 or norm2 == 0:
                    # One zero vector and one non-zero vector are maximally dissimilar
                    distance = 1.0
                else:
                    distance = 1 - cosine_similarity(embed1.reshape(1, -1), embed2.reshape(1, -1))[0][0]

                if distance > self.config.similarity_threshold:
                    log.debug(f"Semantic split detected between sentences {i - 1} and {i} with distance {distance:.4f}")
                    # Semantic split detected!
                    # If the chunk is long enough, finalize it before the current sentence.
                    if current_chunk_char_length - sentence_length >= self.config.min_chunk_size:
                        # Append chunk excluding the current sentence, which starts a new one
                        chunks.append(" ".join(current_chunk_sentences[:-1]))
                        # Reset for the next chunk, adding overlap
                        current_chunk_sentences = sentences[max(0, i - self.config.overlap_sentences) : i + 1]
                        current_chunk_char_length = sum(len(s) for s in current_chunk_sentences)
                    # If the chunk is too short, we don't split yet and continue to build it.
                    # This prevents very small chunks due to minor semantic shifts.

        # Add any remaining sentences as the last chunk
        if current_chunk_sentences:
            chunks.append(" ".join(current_chunk_sentences))

        # Post-processing: drop any empty chunks
        final_chunks = [chunk for chunk in chunks if chunk.strip()]
        return final_chunks


class SectionGrouper:
    def __init__(self, sectioning_config: SectioningConfig, embedding_model: str = "nomic-embed-text:v1.5"):
        self.config = sectioning_config
        self.embedding_model = embedding_model

    def generate_chunk_metadata(self, chunk: str, previous_metadata: List[tuple]) -> SimpleChunkMetadata:
        """Generate structured metadata for a chunk using Ollama"""

        # Format the trailing metadata entries as context for the prompt
        metadata = ""
        if previous_metadata:
            entries = "\n".join(
                f"{i + 1}. Title: {meta.title}, Category: {meta.category}"
                for i, (_chunk, meta) in enumerate(previous_metadata)
            )
            metadata = f"Previous metadata:\n{entries}\n\n"

        prompt = f"""Analyze the following text and provide structured metadata:

Text: {chunk}

{metadata}

Generate a JSON response with:
- title: A concise 3-8 word descriptive title
- category: A broad 1-3 word topic category

Be specific and descriptive but concise. Keep consistency in section titles and categories across similar content."""

        try:
            response = ollama.generate(
                model=self.config.ollama_model,
                prompt=prompt,
                format=SimpleChunkMetadata.model_json_schema(),
                options={"temperature": self.config.temperature, "num_predict": self.config.max_tokens},
                keep_alive=True,
            )

            # Parse the JSON response
            metadata_dict = json.loads(response["response"])
            return SimpleChunkMetadata(**metadata_dict)

        except Exception as e:
            log.warning(f"Failed to generate metadata for chunk, using fallback: {e}")
            # Fallback metadata
            return SimpleChunkMetadata(
                title=f"Section {hash(chunk[:100]) % 1000}",
                category="General",
            )

    def get_embedding(self, text: str) -> np.ndarray:
        """Get embedding for text"""
        if not text.strip():
            return np.zeros(768)
        response = ollama.embeddings(model=self.embedding_model, prompt=text)
        return np.array(response["embedding"])

    def calculate_similarity(self, metadata1: SimpleChunkMetadata, metadata2: SimpleChunkMetadata) -> float:
        """Calculate weighted similarity between two chunk metadata objects"""

        # Get embeddings for each field
        title1_emb = self.get_embedding(metadata1.title)
        title2_emb = self.get_embedding(metadata2.title)

        category1_emb = self.get_embedding(metadata1.category)
        category2_emb = self.get_embedding(metadata2.category)

        # Calculate cosine similarities
        def safe_cosine_similarity(emb1, emb2):
            if np.linalg.norm(emb1) == 0 or np.linalg.norm(emb2) == 0:
                return 0.0
            return cosine_similarity(emb1.reshape(1, -1), emb2.reshape(1, -1))[0][0]

        title_sim = safe_cosine_similarity(title1_emb, title2_emb)
        category_sim = safe_cosine_similarity(category1_emb, category2_emb)

        # Weighted similarity
        weighted_sim = self.config.title_weight * title_sim + self.config.category_weight * category_sim

        return weighted_sim

    def group_chunks_into_sections(self, chunks: List[str]) -> List[DocumentSection]:
        """Group chunks into coherent sections based on metadata similarity"""

        if not chunks:
            return []

        log.info(f"Generating metadata for {len(chunks)} chunks...")

        # Stage 1: Generate metadata for all chunks
        chunk_metadata = []
        for i, chunk in enumerate(chunks):
            # Pass the trailing 5 metadata entries as context for the next chunk
            metadata = self.generate_chunk_metadata(chunk, chunk_metadata[-5:])
            chunk_metadata.append((chunk, metadata))
            log.debug(f"Chunk {i + 1}: {metadata.title} | {metadata.category}")

        log.info("Grouping chunks into sections...")

        # Stage 2: Group chunks based on similarity
        sections = []
        current_section_chunks = [(chunk_metadata[0][0], chunk_metadata[0][1])]
        current_section_category = chunk_metadata[0][1].category

        for i in range(1, len(chunk_metadata)):
            chunk, metadata = chunk_metadata[i]
            prev_metadata = chunk_metadata[i - 1][1]

            # Calculate similarity with previous chunk
            similarity = self.calculate_similarity(metadata, prev_metadata)

            # Check if we should start a new section
            should_split = (
                # Similarity is below threshold so the current chunk is likely different
                # from the previous section
                similarity < self.config.section_merge_threshold
                # Or if the current section is too long
                or sum(len(c[0]) for c in current_section_chunks) + len(chunk) > self.config.max_section_size
            )

            if should_split and len(current_section_chunks) > 0:
                # Finalize current section
                section = self._create_section(current_section_chunks, current_section_category)
                sections.append(section)

                # Start new section
                current_section_chunks = [(chunk, metadata)]
                current_section_category = metadata.category

                log.debug(f"New section started at chunk {i + 1}, similarity: {similarity:.3f}")
            else:
                # Add to current section
                current_section_chunks.append((chunk, metadata))

        # Add final section
        if current_section_chunks:
            section = self._create_section(current_section_chunks, current_section_category)
            sections.append(section)

        log.info(f"Created {len(sections)} sections from {len(chunks)} chunks")
        return sections

    def _create_section(self, section_chunks: List[tuple], category: str) -> DocumentSection:
        """Create a DocumentSection from a list of (chunk, metadata) tuples"""

        # Group chunks into subsections based on title similarity
        subsections = []
        current_subsection = []
        current_title = section_chunks[0][1].title

        for chunk, metadata in section_chunks:
            if not current_subsection:
                current_subsection = [(chunk, metadata)]
                current_title = metadata.title
            else:
                # Check if we should group with current subsection
                title_sim = self.calculate_similarity(metadata, current_subsection[-1][1])

                if title_sim > 0.8:  # High threshold for subsection grouping
                    current_subsection.append((chunk, metadata))
                else:
                    # Create subsection from current group
                    subsection = self._create_subsection(current_subsection, current_title)
                    subsections.append(subsection)

                    # Start new subsection
                    current_subsection = [(chunk, metadata)]
                    current_title = metadata.title

        # Add final subsection
        if current_subsection:
            subsection = self._create_subsection(current_subsection, current_title)
            subsections.append(subsection)

        # Create section title from most common category and representative title
        section_title = self._generate_section_title(section_chunks)

        return DocumentSection(
            section_title=section_title,
            category=category,
            subsections=subsections,
            total_chunks=len(section_chunks),
            total_length=sum(len(chunk) for chunk, _ in section_chunks),
        )

    def _create_subsection(self, subsection_chunks: List[tuple], title: str) -> Subsection:
        """Create a Subsection from a list of (chunk, metadata) tuples"""
        chunks = [chunk for chunk, _ in subsection_chunks]
        combined_content = "\n\n".join(chunks)

        return Subsection(
            subsection_title=title,
            chunks=chunks,
            combined_content=combined_content,
            chunk_count=len(chunks),
            total_length=len(combined_content),
        )

    def _generate_section_title(self, section_chunks: List[tuple]) -> str:
        """Generate a representative title for the section"""
        # Use the title from the first chunk, or create from category
        if section_chunks:
            first_metadata = section_chunks[0][1]
            if len(section_chunks) == 1:
                return first_metadata.title
            else:
                # For multi-chunk sections, use category-based title
                return f"{first_metadata.category} Overview"
        return "Untitled Section"


def pdf_to_markdown(pdf_file: Path) -> str:
    from markitdown import MarkItDown

    md = MarkItDown()
    result = md.convert(pdf_file)
    text_content = result.text_content
    # Keep only ascii characters
    text_content = "".join(c for c in text_content if ord(c) < 128)
    # Remove leading and trailing whitespace
    text_content = text_content.strip()
    # Remove multiple newlines
    text_content = "\n".join(line.strip() for line in text_content.splitlines() if line.strip())
    return text_content


def get_pdf():
    import requests

    # Download the 2024 Berkshire Hathaway shareholder letter
    url = "https://www.berkshirehathaway.com/letters/2024ltr.pdf"
    response = requests.get(url)
    if response.status_code != 200:
        raise ValueError(f"Failed to download PDF: {response.status_code}")
    pdf_file = Path("./example_cache/2024ltr.pdf")
    pdf_file.parent.mkdir(parents=True, exist_ok=True)
    pdf_file.write_bytes(response.content)
    # Convert PDF to markdown text
    text = pdf_to_markdown(pdf_file)
    return text


# Usage example
def main():
    sample_text = get_pdf()

    # Initialize chunker
    chunking_config = ChunkingConfig(
        max_chunk_size=750,
        min_chunk_size=150,
        similarity_threshold=0.25,
        ollama_model="nomic-embed-text:v1.5",
        overlap_sentences=2,
    )

    # Initialize section grouper
    sectioning_config = SectioningConfig(
        ollama_model="gemma3:1b", section_merge_threshold=0.75, max_section_size=3000, min_section_size=300
    )

    # Stage 1: Semantic chunking
    chunker = SemanticChunker(chunking_config)
    sentences = chunker.split_into_sentences(sample_text)
    chunks = chunker.chunk_sentences(sentences)

    print(f"Stage 1 Complete - Generated {len(chunks)} semantic chunks")

    # Stage 2: Section grouping
    grouper = SectionGrouper(sectioning_config)
    sections = grouper.group_chunks_into_sections(chunks)

    # Write sections to JSON file
    output_file = Path("./example_cache/sections.json")
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with output_file.open("w", encoding="utf-8") as f:
        json.dump([section.model_dump() for section in sections], f, indent=2, ensure_ascii=False)
    log.info(f"Sections written to {output_file}")

    print(f"Stage 2 Complete - Generated {len(sections)} sections")

    # Display results
    print("\n" + "=" * 80)
    print("DOCUMENT STRUCTURE")
    print("=" * 80)

    for i, section in enumerate(sections, 1):
        print(f"\nSECTION {i}: {section.section_title}")
        print(f"Category: {section.category}")
        print(f"Total chunks: {section.total_chunks} | Length: {section.total_length} chars")
        print("-" * 60)

        for j, subsection in enumerate(section.subsections, 1):
            print(f"  {i}.{j} {subsection.subsection_title}")
            print(f"       Chunks: {subsection.chunk_count} | Length: {subsection.total_length} chars")

            # Show first chunk preview
            if subsection.chunks:
                preview = (
                    subsection.chunks[0][:200] + "..." if len(subsection.chunks[0]) > 200 else subsection.chunks[0]
                )
                print(f"       Preview: {preview}")
            print()

if __name__ == "__main__":
    main()
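
Running the example end to end assumes a local Ollama server with the nomic-embed-text:v1.5 and gemma3:1b models already pulled (for example, via ollama pull nomic-embed-text:v1.5), along with the ollama, markitdown, requests, scikit-learn, pydantic, and numpy Python packages.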