Asif Rahman

Document Chunking for LLMs

Chunking documents into smaller segments for LLM-based retrieval-augmented generation (RAG) and semantic search engines is a central challenge in building robust and useful systems. Here I outline some of these challenges and their competing priorities, and provide a simple but effective method for chunking documents for LLMs using recursive sentence embedding and semantic similarity.

I’ve found that splitting large documents into smaller chunks requires some trial and error to find the right strategy.

There are basically three competing factors:

  1. Semantic coherence within chunks - each chunk should contain related information. A small chunk size loses context but is more coherent, while a large chunk size is less coherent but carries more context.
  2. Semantic separation between chunks - chunks should be distinct enough to avoid redundancy but not so distinct that they lose context. This means we want to avoid splitting at arbitrary points, such as fixed sentence or paragraph boundaries, and instead split at the semantic boundaries of sections and topics. Some overlap between chunks can even help preserve context.
  3. Information preservation - chunks should be self-contained enough to be useful on their own. This means we need to split text at context boundaries and ensure that similar concepts are grouped together in the same chunk rather than split across chunks.

Ultimately, the appropriate strategy depends on the type of document and the intended use case. For example, a user guide is organized into chapters and sections. The beginning of a chapter introduces the topic, followed by sections that provide detailed workflows and instructions. In this case, it makes sense to keep the workflows and instructions together in the same chunk. We want to avoid splitting a step in a workflow into two independent chunks, where it loses its place in the workflow and, more importantly, the context of the step. If the document is more structured, like an invoice or a technical specification with tables, then a more sophisticated chunking strategy may be needed to preserve the relationships between the data points. This article focuses on chunking textual documents, such as reports and manuals, where the text is more free-form and less structured.

Table of contents:

  1. Algorithm Overview
  2. Example: Berkshire Hathaway Annual Report
  3. Implementation Code

Algorithm Overview

Let's build a chunking algorithm based on these principles from the ground up. Conceptually, we want to start with sentences, since they are the basic unit of coherent meaning. From there we need a strategy for iteratively merging sentences that balances the three stated objectives: semantic coherence within chunks, semantic separation between chunks, and information preservation.

So the algorithm consists of two main stages:

  1. Semantic chunking breaks the document into semantically coherent chunks by:
    • Splitting into sentences
    • Computing embeddings for each sentence
    • Using cosine distance between consecutive sentence embeddings to detect topic shifts (sketched just after this list)
    • Enforcing size constraints while adding overlap
  2. Section grouping organizes chunks into a hierarchical structure by:
    • Generating descriptive metadata (title and category) for each chunk using an LLM
    • Computing weighted similarity between chunk metadata
    • Grouping similar chunks into sections based on similarity thresholds
    • Creating subsections within sections for fine-grained organization
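
The heart of stage 1 is a single test on consecutive sentence embeddings. Here is a minimal sketch of that boundary test; the function name is_topic_shift and the 0.25 default threshold are illustrative, not part of the implementation further below:

import numpy as np

def is_topic_shift(prev_emb: np.ndarray, curr_emb: np.ndarray, threshold: float = 0.25) -> bool:
    """Flag a semantic boundary when consecutive sentence embeddings diverge."""
    denom = np.linalg.norm(prev_emb) * np.linalg.norm(curr_emb)
    if denom == 0:
        return False  # zero embeddings (e.g., from empty text): never split on them
    cosine_distance = 1.0 - float(np.dot(prev_emb, curr_emb) / denom)
    return cosine_distance > threshold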

Example: Berkshire Hathaway Annual Report

As an example, I’ve taken the 2024 Berkshire Hathaway Annual Report and passed it through the chunking process. First I generate sentences from the document using a regex, then compute embeddings and use cosine similarity to merge sentences into partially overlapping chunks. Next, I take each chunk and use an LLM to generate metadata, including a title and category, for each chunk. Notice the consistency in titles and categories across chunks: the LLM is prompted with the trailing 5 metadata entries to provide context for the current chunk. Finally, I use the same embedding approach to merge chunks into sections based on the semantic similarity of each chunk’s title and category. A tunable similarity threshold of 0.75 is used to combine chunks into sections. The resulting sections are more coherent and self-contained, making them easier to use in a RAG system.
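
As a rough sketch of that rolling context window (the helper name recent_context is illustrative; the actual plumbing lives in SectionGrouper.group_chunks_into_sections in the implementation below):

def recent_context(history):
    """Format the trailing 5 (title, category) pairs as prompt context."""
    lines = [f"{i + 1}. Title: {t}, Category: {c}" for i, (t, c) in enumerate(history[-5:])]
    return "\n".join(lines)

# e.g. recent_context([("Mistakes and Delay", "Strategic & Operational")])
# returns "1. Title: Mistakes and Delay, Category: Strategic & Operational"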

The table below shows the title and category generated by an LLM for each chunk. Splits happen when the similarity between the current chunk and the previous chunk falls below a threshold of 0.75, or when the accumulated section size exceeds a maximum character limit. When there is a split, the table reports the similarity of the current chunk to the previous chunk.
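
The similarity in the last column blends title and category similarity using the weights from the configuration below (0.6 for the title, 0.4 for the category). A hypothetical worked example of the split decision:

# Hypothetical similarity values for two consecutive chunks
title_sim, category_sim = 0.80, 0.60
weighted = 0.6 * title_sim + 0.4 * category_sim  # = 0.72
starts_new_section = weighted < 0.75  # True: below the merge threshold, so a split occurs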

At the beginning of the letter to the shareholders, the chunks are smaller and more granular because Buffett is introducing several concepts and providing an overview of the risks, management philosophy, and strategic focus. Chunks span only a few sentences, and the similarity between chunks is lower. As the letter progresses, the chunks become larger and more coherent, with higher similarity scores as Buffett discusses specific topics in more detail. The final sections of the letter are larger and more comprehensive, covering specific companies, investments, and their performance in depth. Notice that even when the similarity between chunks is high and the title and category are the same, the chunk size grows and a new split is created once the size limit is reached.

| Chunk | Title | Category | Previous Chunk Similarity |
| --- | --- | --- | --- |
| 1 | Berkshire Hathaway Annual Report | Financial Report | |
| 2 | Berkshire Hathaway Annual Report | Financial Report | |
| 3 | Company Transparency & Reporting | Corporate Communication & Reporting | 0.541 |
| 4 | Responsibility & Communication | Corporate Communication & Reporting | 0.74 |
| 5 | Strategic Review & Ownership Dialogue | Corporate Communication & Reporting | 0.696 |
| 6 | Communication Strategy | Business & Investor Relations | 0.576 |
| 7 | Berkshire Hathaway’s Risk Management Approach | Financial Risk Management | 0.479 |
| 8 | Mistakes and Strategic Assessment | Business Strategy | 0.528 |
| 9 | Mistakes in Berkshire Acquisitions | Business Risk & Management | 0.641 |
| 10 | Mistakes in Hiring, Impacting Berkshire | Corporate Management & Risk | |
| 11 | Mistakes in Hiring Assessment | Personnel Management & Decision Making | 0.668 |
| 12 | Painful Mistakes, Diminishing Returns | Financial & Strategic | 0.506 |
| 13 | Mistakes and Delay | Strategic & Operational | 0.679 |
| 14 | Mistakes and Analysis | Business Strategy | 0.643 |
| 15 | Mistakes and Their Impact | Business Strategy | |
| 16 | Word Frequency Analysis | Text Analysis | 0.37 |
| 17 | Company Observations | Business & Communication | 0.405 |
| 18 | Behavioral Observations | Business & Risk | 0.652 |
| 19 | CEO Succession & Risk | Corporate Management & Risk | 0.568 |
| 20 | CEO Transition & Risk | Corporate Management & Risk | |
| 21 | CEO Succession & Berkshire’s Risk | Corporate Strategy & Risk | |
| 22 | Berkshire CEO Philosophy | Corporate Strategy & Leadership | |
| 23 | Pete Liegl’s Legacy | Business & Leadership | 0.53 |
| 24 | Pete Liegl - A Wealthing Story | Business & Financial | |
| 25 | Pete - Forest River Founder | Business & Founding | 0.572 |
| 26 | Forest River Acquisition | Business & Financial | 0.638 |
| 27 | RV Deal - Initial Communication | Business & Communication & Financial | 0.648 |
| 28 | Berkshire Acquisition Deal | Business & Finance | 0.646 |
| 29 | Meeting Details & Price Discussion | Business & Communication & Strategy | 0.513 |
| 30 | Meeting & Deal | Business & Communication | |
| 31 | Business Meeting & Financial Planning | Business & Strategy | 0.664 |
| 32 | Berkshire Hathaway Business Deal | Business & Financial | 0.576 |
| 33 | Real Estate Deal | Business & Financial | |
| 34 | Real Estate Lease Dispute | Business & Financial | |
| 35 | Meeting Dynamics | Business & Communication & Strategy | 0.514 |
| 36 | Compensation Structure | Financial & Human Resources | 0.429 |
| 37 | Compensation Structure | Financial & Strategy | |
| 38 | Berkshire’s Compensation Offer | Financial & Business | 0.669 |
| 39 | Berkshire’s Financial Strategy | Financial Strategy & Risk | 0.677 |
| 40 | Berkshire’s Early Success | Business & Financial | 0.741 |
| 41 | Simple Success | Business & Financial | 0.737 |
| 42 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | 0.492 |
| 43 | Berkshire’s Performance | Financial & Strategic | 0.736 |
| 44 | Mistakes in Berkshire Acquisitions | Business Risk & Management | 0.645 |
| 45 | Berkshire’s Strategic Imperfections | Business Strategy & Risk | |
| 46 | Strategic Focus & Partnership | Business & Strategy | 0.624 |
| 47 | Strategic Imperative | Business & Strategy | |
| 48 | CEO Mistakes & Analysis | Business Strategy & Risk | 0.614 |
| 49 | Mistakes in Berkshire Acquisitions | Business Risk & Management | 0.745 |
| 50 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
| 51 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
| 52 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
| 53 | Berkshire’s Mistakes & Strategic Challenges | Business & Risk & Strategy | |
| 54 | Mistakes in Berkshire Acquisitions | Business Risk & Management | |
| 55 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
| 56 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | 1 |
| 57 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
| 58 | Pete Liegl’s Performance | Business & Financial | 0.475 |
| 59 | GEICO Restructuring | Business & Strategy | 0.46 |
| 60 | GEICO Transformation | Business Strategy & Operational | |
| 61 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 0.536 |
| 62 | Property Casualty Pricing Surge | Financial & Strategic | 0.522 |
| 63 | Convective Storm Damage | Financial Risk & Strategic | 0.652 |
| 64 | Berkshire’s Recent Challenges | Business & Risk & Strategy | 0.55 |
| 65 | Berkshire’s Recent Financial Challenges | Financial & Strategic | |
| 66 | Insurance Losses & Strategic Risks | Business & Risk & Strategy | 0.559 |
| 67 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | 0.747 |
| 68 | Berkshire’s Financial Performance | Financial & Strategic | 0.74 |
| 69 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 70 | Berkshire’s Financial Performance | Financial & Strategic | |
| 71 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 72 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
| 73 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 74 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
| 75 | Berkshire’s Recent Mistakes & Challenges | Business & Risk & Strategy | 0.923 |
| 76 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 77 | Berkshire’s Recent Mistakes & Strategic Challenges | Business & Risk & Strategy | |
| 78 | Berkshire’s Recent Mistakes & Challenges | Business & Risk & Strategy | |
| 79 | Berkshire’s Recent Mistakes & Strategic Challenges | Business & Risk & Strategy | 0.981 |
| 80 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 81 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
| 82 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 83 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 84 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 85 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 86 | Berkshire’s Mistakes | Business & Risk & Strategy | 0.982 |
| 87 | Berkshire’s Early Mistakes | Business & Risk & Strategy | |
| 88 | Berkshire’s Early Mistakes | Business & Risk & Strategy | |
| 89 | Berkshire’s Early Mistakes | Business & Risk & Strategy | |
| 90 | Berkshire’s Strategic Missteps | Business & Risk & Strategy | |
| 91 | Berkshire’s Tax Burden | Financial & Strategic | 0.697 |
| 92 | Berkshire’s Tax Burden | Financial & Risk & Strategy | |
| 93 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 94 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 95 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 96 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
| 97 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 98 | Berkshire’s Financial Challenges | Financial & Strategic | 0.742 |
| 99 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 0.742 |
| 100 | Berkshire’s Financial Mistakes | Financial & Strategic | |
| 101 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 102 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 103 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 104 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 105 | Berkshire’s Recent Mistakes & Challenges | Business & Risk & Strategy | |
| 106 | Berkshire’s Recent Mistakes & Challenges | Business & Risk & Strategy | |
| 107 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 0.964 |
| 108 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 109 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 110 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 111 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 112 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 113 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 114 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
| 115 | Berkshire’s Strategic Setbacks | Business & Risk & Strategy | |
| 116 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 117 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 118 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 119 | Berkshire’s Strategic Challenges | Business & Risk & Strategy | |
| 120 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
| 121 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 122 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
| 123 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 124 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 125 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
| 126 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
| 127 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
| 128 | Berkshire’s Strategic Challenges | Business & Risk & Strategy | |
| 129 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 0.851 |
| 130 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
| 131 | Berkshire’s Strategic Mistakes | Business & Risk & Strategy | |
| 132 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 133 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 134 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 135 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
| 136 | Berkshire’s Early Struggles | Business & Risk & Strategy | |
| 137 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 138 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 139 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 140 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 141 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
| 142 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 143 | Berkshire’s Recent Mistakes & Challenges | Business & Risk & Strategy | |
| 144 | Berkshire’s Early Struggles | Business & Risk & Strategy | |
| 145 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 0.86 |
| 146 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 147 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 148 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 149 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 150 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 151 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 152 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 153 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 154 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
| 155 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 156 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 157 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 158 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 159 | CEOs’ Mistakes | Business Strategy & Risk | |
| 160 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 161 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
| 162 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 163 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 164 | Berkshire’s Strategic Challenges | Business & Risk & Strategy | 0.851 |
| 165 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 166 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 167 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 168 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 169 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 170 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 171 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 172 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
| 173 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 174 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 175 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 176 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 177 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 178 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 179 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 180 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
| 181 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 182 | Strategic Setbacks | Business & Risk & Strategy | |
| 183 | Recent Mistakes in Berkshire’s Operations | Business & Risk & Strategy | 0.743 |
| 184 | Hurricane, Tornado, and Wildfire Risks | Financial & Strategic | 0.547 |
| 185 | Strategic Mistakes | Business & Risk & Strategy | 0.55 |
| 186 | Auto Insurance Transition | Financial & Strategic | 0.481 |
| 187 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 0.49 |
| 188 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 189 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 190 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 191 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 192 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 193 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 194 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 195 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 196 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
| 197 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 198 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 199 | Berkshire’s Strategic Shifts | Business & Strategy | |
| 200 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 201 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 202 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 203 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
| 204 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 205 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 206 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 207 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 208 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 209 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 210 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 211 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | 1 |
| 212 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 213 | Berkshire’s Recent Mistakes | Business & Risk & Strategy | |
| 214 | Strategic Missteps | Business & Risk & Strategy | |
| 215 | Strategic Missteps | Business & Risk & Strategy | |
| 216 | Strategic Missteps | Business & Risk & Strategy | |
| 217 | Strategic Missteps | Business & Risk & Strategy | |
| 218 | Strategic Challenges | Business & Risk & Strategy | 0.803 |
| 219 | Strategic Missteps | Business & Risk & Strategy | |
| 220 | Strategic Setbacks | Business & Risk & Strategy | |
| 221 | Recent Mistakes | Business & Risk & Strategy | |
| 222 | Recent Mistakes | Business & Risk & Strategy | |
| 223 | Recent Mistakes | Business & Risk & Strategy | |
| 224 | Recent Mistakes | Business & Risk & Strategy | |
| 225 | Recent Mistakes | Business & Risk & Strategy | |
| 226 | Recent Mistakes | Business & Risk & Strategy | 1 |
| 227 | Recent Mistakes | Business & Risk & Strategy | |
| 228 | Strategic Missteps | Business & Risk & Strategy | |
| 229 | Recent Mistakes | Business & Risk & Strategy | |
| 230 | Strategic Mishaps | Business & Risk & Strategy | |
| 231 | Recent Mistakes | Business & Risk & Strategy | |
| 232 | Recent Mistakes | Business & Risk & Strategy | |
| 233 | Recent Mistakes | Business & Risk & Strategy | |
| 234 | Financial Risks & Challenges | Business & Risk & Strategy | 0.671 |
| 235 | Recent Business Mistakes | Business & Risk & Strategy | 0.694 |
| 236 | Recent Mistakes | Business & Risk & Strategy | |
| 237 | Strategic Challenges | Business & Risk & Strategy | 0.678 |
| 238 | Strategic Challenges | Business & Risk & Strategy | |
| 239 | Recent Business Mistakes | Business & Risk & Strategy | 0.671 |
| 240 | Strategic Challenges | Business & Risk & Strategy | 0.671 |
| 241 | Recent Business Mistakes | Business & Risk & Strategy | 0.671 |
| 242 | Strategic Challenges | Business & Risk & Strategy | 0.671 |
| 243 | Recent Business Mistakes | Business & Risk & Strategy | 0.671 |
| 244 | Recent Business Mistakes | Business & Risk & Strategy | |
| 245 | Recent Business Mistakes | Business & Risk & Strategy | |
| 246 | Recent Business Mistakes | Business & Risk & Strategy | |
| 247 | Recent Business Mistakes | Business & Risk & Strategy | |
| 248 | Recent Business Mistakes | Business & Risk & Strategy | |
| 249 | Recent Business Mistakes | Business & Risk & Strategy | 1 |
| 250 | Recent Business Mistakes | Business & Risk & Strategy | |
| 251 | Recent Business Mistakes | Business & Risk & Strategy | |
| 252 | Strategic Challenges | Business & Risk & Strategy | 0.671 |
| 253 | Recent Business Mistakes | Business & Risk & Strategy | 0.671 |
| 254 | Recent Business Mistakes | Business & Risk & Strategy | |
| 255 | Recent Business Mistakes | Business & Risk & Strategy | |
| 256 | Recent Business Mistakes | Business & Risk & Strategy | |
| 257 | Recent Business Mistakes | Business & Risk & Strategy | |
| 258 | Recent Business Mistakes | Business & Risk & Strategy | 1 |

Implementation Code

import re
import json
import ollama
import logging
import numpy as np
from typing import List
from pathlib import Path
from dataclasses import dataclass
from pydantic import BaseModel, Field
from sklearn.metrics.pairwise import cosine_similarity

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger(__name__)

# Don't show noisy logs from the HTTP and PDF parsing libraries
logging.getLogger("httpcore").setLevel(logging.WARNING)
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("pdfminer").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)


@dataclass
class ChunkingConfig:
    max_chunk_size: int = 1000  # characters
    min_chunk_size: int = 100  # characters
    similarity_threshold: float = 0.15  # cosine distance threshold for splitting (a higher threshold means fewer splits)
    ollama_model: str = "nomic-embed-text:v1.5"  # Ollama model for embeddings
    overlap_sentences: int = 1  # sentences to overlap between chunks


@dataclass
class SectioningConfig:
    # Model settings
    ollama_model: str = "gemma3:1b"
    max_tokens: int = 200
    temperature: float = 0.1  # Low for consistency

    # Similarity thresholds
    section_merge_threshold: float = 0.75
    title_weight: float = 0.6
    category_weight: float = 0.4

    # Size constraints
    max_section_size: int = 3000
    min_section_size: int = 300


class SimpleChunkMetadata(BaseModel):
    title: str = Field(..., description="Concise descriptive title for this content")
    category: str = Field(..., description="Broad topic category")


class Subsection(BaseModel):
    subsection_title: str
    chunks: List[str]
    combined_content: str
    chunk_count: int
    total_length: int


class DocumentSection(BaseModel):
    section_title: str
    category: str
    subsections: List[Subsection]
    total_chunks: int
    total_length: int


class SemanticChunker:
    def __init__(self, config: ChunkingConfig):
        self.config = config

    def get_embedding(self, text: str) -> np.ndarray:
        """Get embedding from Ollama using the Python client"""
        # Ensure the text is not empty, as empty strings might cause issues with embeddings
        if not text.strip():
            return np.zeros(768)  # Return a zero vector for empty text, assuming 768 dimensions for nomic-embed-text
        response = ollama.embeddings(model=self.config.ollama_model, prompt=text)
        return np.array(response["embedding"])

    def split_into_sentences(self, text: str) -> List[str]:
        """Split text into sentences using regex"""
        # Improved regex to handle common sentence endings while avoiding splitting on abbreviations (e.g., Mr. Smith)
        # It looks for . ! or ? followed by whitespace and an uppercase letter, but not if preceded by a common abbreviation pattern.
        # This is a common pattern for sentence splitting.
        sentence_pattern = r"(?<!\b[A-Z]\.)(?<!\b[A-Z][a-z]\.)(?<=[.!?])\s+(?=[A-Z])"
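        # Example: "Mr. Smith arrived. He sat down." -> ["Mr. Smith arrived.", "He sat down."]
        # (no split after the abbreviation "Mr.", only after "arrived.")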
        sentences = re.split(sentence_pattern, text)
        return [s.strip() for s in sentences if s.strip()]

    def chunk_sentences(self, sentences: List[str]) -> List[str]:
        """
        Chunks sentences based on semantic similarity, respecting max_chunk_size,
        and adding overlap.
        """
        if not sentences:
            return []

        chunks: List[str] = []
        current_chunk_sentences: List[str] = []
        current_chunk_char_length = 0

        # Get embeddings for all sentences
        sentence_embeddings = [self.get_embedding(s) for s in sentences]

        for i, sentence in enumerate(sentences):
            sentence_length = len(sentence)

            # Check if adding the current sentence would exceed max_chunk_size
            if current_chunk_char_length + sentence_length > self.config.max_chunk_size:
                if not current_chunk_sentences:
                    # The sentence alone exceeds max_chunk_size; emit it as a standalone chunk
                    chunks.append(sentence)
                    continue  # Move to the next sentence
                # Finalize the current chunk
                chunks.append(" ".join(current_chunk_sentences))
                # Reset for the next chunk, carrying over the overlap sentences
                current_chunk_sentences = sentences[max(0, i - self.config.overlap_sentences) : i]
                current_chunk_char_length = sum(len(s) for s in current_chunk_sentences)

            # Add sentence to current chunk
            current_chunk_sentences.append(sentence)
            current_chunk_char_length += sentence_length

            # Semantic split check (only if there's more than one sentence in current_chunk_sentences)
            if len(current_chunk_sentences) > 1:
                # Compare the last two sentences in the current_chunk_sentences
                # We're interested in the similarity between the last added sentence and the one before it
                # to detect a potential topic shift at the boundary.
                embed1 = sentence_embeddings[i - 1]  # Embedding of the sentence before the current one
                embed2 = sentence_embeddings[i]  # Embedding of the current sentence

                # Calculate cosine distance (1 - cosine similarity);
                # a higher distance means less similarity.
                norm1, norm2 = np.linalg.norm(embed1), np.linalg.norm(embed2)
                if norm1 == 0 and norm2 == 0:
                    # Both embeddings are zero vectors (e.g., from empty strings): treat as identical
                    distance = 0.0
                elif norm1 == 0 or norm2 == 0:
                    # One is a zero vector and the other is not: treat as maximally dissimilar
                    distance = 1.0
                else:
                    distance = 1 - cosine_similarity(embed1.reshape(1, -1), embed2.reshape(1, -1))[0][0]

                if distance > self.config.similarity_threshold:
                    log.debug(f"Semantic split detected between sentences {i - 1} and {i} with distance {distance:.4f}")
                    # Semantic split detected!
                    # If the chunk is long enough, finalize it before the current sentence.
                    if current_chunk_char_length - sentence_length >= self.config.min_chunk_size:
                        # Append chunk excluding the current sentence, which starts a new one
                        chunks.append(" ".join(current_chunk_sentences[:-1]))
                        # Reset for the next chunk, adding overlap
                        current_chunk_sentences = sentences[max(0, i - self.config.overlap_sentences) : i + 1]
                        current_chunk_char_length = sum(len(s) for s in current_chunk_sentences)
                    # If the chunk is too short, we don't split yet and continue to build it.
                    # This prevents very small chunks due to minor semantic shifts.

        # Add any remaining sentences as the last chunk
        if current_chunk_sentences:
            chunks.append(" ".join(current_chunk_sentences))

        # Post-processing: Ensure no chunks are empty and optionally merge very small chunks
        final_chunks = [chunk for chunk in chunks if chunk.strip()]
        return final_chunks


class SectionGrouper:
    def __init__(self, sectioning_config: SectioningConfig, embedding_model: str = "nomic-embed-text:v1.5"):
        self.config = sectioning_config
        self.embedding_model = embedding_model

    def generate_chunk_metadata(self, chunk: str, previous_metadata) -> SimpleChunkMetadata:
        """Generate structured metadata for a chunk using Ollama"""

        # Prepare previous metadata as a string for context
        metadata = ""
        if previous_metadata:
            lines = "\n".join(
                f"{i + 1}. Title: {meta.title}, Category: {meta.category}"
                for i, (_, meta) in enumerate(previous_metadata)
            )
            metadata = f"Previous metadata:\n{lines}\n\n"

        prompt = f"""Analyze the following text and provide structured metadata:

Text: {chunk}

{metadata}

Generate a JSON response with:
- title: A concise 3-8 word descriptive title
- category: A broad 1-3 word topic category

Be specific and descriptive but concise. Keep consistency in section titles and categories across similar content."""

        try:
            response = ollama.generate(
                model=self.config.ollama_model,
                prompt=prompt,
                format=SimpleChunkMetadata.model_json_schema(),
                options={"temperature": self.config.temperature, "num_predict": self.config.max_tokens},
                keep_alive=True,
            )

            # Parse the JSON response
            metadata_dict = json.loads(response["response"])
            return SimpleChunkMetadata(**metadata_dict)

        except Exception as e:
            log.warning(f"Failed to generate metadata for chunk, using fallback: {e}")
            # Fallback metadata
            return SimpleChunkMetadata(
                title=f"Section {hash(chunk[:100]) % 1000}",
                category="General",
            )

    def get_embedding(self, text: str) -> np.ndarray:
        """Get embedding for text"""
        if not text.strip():
            return np.zeros(768)
        response = ollama.embeddings(model=self.embedding_model, prompt=text)
        return np.array(response["embedding"])

    def calculate_similarity(self, metadata1: SimpleChunkMetadata, metadata2: SimpleChunkMetadata) -> float:
        """Calculate weighted similarity between two chunk metadata objects"""

        # Get embeddings for each field
        title1_emb = self.get_embedding(metadata1.title)
        title2_emb = self.get_embedding(metadata2.title)

        category1_emb = self.get_embedding(metadata1.category)
        category2_emb = self.get_embedding(metadata2.category)

        # Calculate cosine similarities
        def safe_cosine_similarity(emb1, emb2):
            if np.linalg.norm(emb1) == 0 or np.linalg.norm(emb2) == 0:
                return 0.0
            return cosine_similarity(emb1.reshape(1, -1), emb2.reshape(1, -1))[0][0]

        title_sim = safe_cosine_similarity(title1_emb, title2_emb)
        category_sim = safe_cosine_similarity(category1_emb, category2_emb)

        # Weighted similarity
        weighted_sim = self.config.title_weight * title_sim + self.config.category_weight * category_sim

        return weighted_sim

    def group_chunks_into_sections(self, chunks: List[str]) -> List[DocumentSection]:
        """Group chunks into coherent sections based on metadata similarity"""

        if not chunks:
            return []

        log.info(f"Generating metadata for {len(chunks)} chunks...")

        # Stage 1: Generate metadata for all chunks
        chunk_metadata = []
        for i, chunk in enumerate(chunks):
            # Pass the trailing 5 metadata items as context for the next chunk
            metadata = self.generate_chunk_metadata(chunk, chunk_metadata[-5:])
            chunk_metadata.append((chunk, metadata))
            log.debug(f"Chunk {i + 1}: {metadata.title} | {metadata.category}")

        log.info("Grouping chunks into sections...")

        # Stage 2: Group chunks based on similarity
        sections = []
        current_section_chunks = [(chunk_metadata[0][0], chunk_metadata[0][1])]
        current_section_category = chunk_metadata[0][1].category

        for i in range(1, len(chunk_metadata)):
            chunk, metadata = chunk_metadata[i]
            prev_metadata = chunk_metadata[i - 1][1]

            # Calculate similarity with previous chunk
            similarity = self.calculate_similarity(metadata, prev_metadata)

            # Check if we should start a new section
            should_split = (
                # Similarity is below threshold so the current chunk is likely different
                # from the previous section
                similarity < self.config.section_merge_threshold
                # Or if the current section is too long
                or sum(len(c[0]) for c in current_section_chunks) + len(chunk) > self.config.max_section_size
            )

            if should_split and len(current_section_chunks) > 0:
                # Finalize current section
                section = self._create_section(current_section_chunks, current_section_category)
                sections.append(section)

                # Start new section
                current_section_chunks = [(chunk, metadata)]
                current_section_category = metadata.category

                log.debug(f"New section started at chunk {i + 1}, similarity: {similarity:.3f}")
            else:
                # Add to current section
                current_section_chunks.append((chunk, metadata))

        # Add final section
        if current_section_chunks:
            section = self._create_section(current_section_chunks, current_section_category)
            sections.append(section)

        log.info(f"Created {len(sections)} sections from {len(chunks)} chunks")
        return sections

    def _create_section(self, section_chunks: List[tuple], category: str) -> DocumentSection:
        """Create a DocumentSection from a list of (chunk, metadata) tuples"""

        # Group chunks into subsections based on title similarity
        subsections = []
        current_subsection = []
        current_title = section_chunks[0][1].title

        for chunk, metadata in section_chunks:
            if not current_subsection:
                current_subsection = [(chunk, metadata)]
                current_title = metadata.title
            else:
                # Check if we should group with current subsection
                title_sim = self.calculate_similarity(metadata, current_subsection[-1][1])

                if title_sim > 0.8:  # High threshold for subsection grouping
                    current_subsection.append((chunk, metadata))
                else:
                    # Create subsection from current group
                    subsection = self._create_subsection(current_subsection, current_title)
                    subsections.append(subsection)

                    # Start new subsection
                    current_subsection = [(chunk, metadata)]
                    current_title = metadata.title

        # Add final subsection
        if current_subsection:
            subsection = self._create_subsection(current_subsection, current_title)
            subsections.append(subsection)

        # Create section title from most common category and representative title
        section_title = self._generate_section_title(section_chunks)

        return DocumentSection(
            section_title=section_title,
            category=category,
            subsections=subsections,
            total_chunks=len(section_chunks),
            total_length=sum(len(chunk) for chunk, _ in section_chunks),
        )

    def _create_subsection(self, subsection_chunks: List[tuple], title: str) -> Subsection:
        """Create a Subsection from a list of (chunk, metadata) tuples"""
        chunks = [chunk for chunk, _ in subsection_chunks]
        combined_content = "\n\n".join(chunks)

        return Subsection(
            subsection_title=title,
            chunks=chunks,
            combined_content=combined_content,
            chunk_count=len(chunks),
            total_length=len(combined_content),
        )

    def _generate_section_title(self, section_chunks: List[tuple]) -> str:
        """Generate a representative title for the section"""
        # Use the title from the first chunk, or create from category
        if section_chunks:
            first_metadata = section_chunks[0][1]
            if len(section_chunks) == 1:
                return first_metadata.title
            else:
                # For multi-chunk sections, use category-based title
                return f"{first_metadata.category} Overview"
        return "Untitled Section"


def pdf_to_markdown(pdf_file: Path) -> str:
    from markitdown import MarkItDown

    md = MarkItDown()
    result = md.convert(pdf_file)
    text_content = result.text_content
    # Keep only ascii characters
    text_content = "".join(c for c in text_content if ord(c) < 128)
    # Remove leading and trailing whitespace
    text_content = text_content.strip()
    # Remove multiple newlines
    text_content = "\n".join(line.strip() for line in text_content.splitlines() if line.strip())
    return text_content


def get_pdf():
    import requests

    # Sample text chunks
    url = "https://www.berkshirehathaway.com/letters/2024ltr.pdf"
    response = requests.get(url)
    if response.status_code != 200:
        raise ValueError(f"Failed to download PDF: {response.status_code}")
    pdf_file = Path("./example_cache/2024ltr.pdf")
    pdf_file.parent.mkdir(parents=True, exist_ok=True)
    pdf_file.write_bytes(response.content)
    # Convert PDF to markdown text
    text = pdf_to_markdown(pdf_file)
    return text


# Usage example
def main():
    # Example usage
    sample_text = get_pdf()

    # Initialize chunker
    chunking_config = ChunkingConfig(
        max_chunk_size=750,
        min_chunk_size=150,
        similarity_threshold=0.25,
        ollama_model="nomic-embed-text:v1.5",
        overlap_sentences=2,
    )

    # Initialize section grouper
    sectioning_config = SectioningConfig(
        ollama_model="gemma3:1b", section_merge_threshold=0.75, max_section_size=3000, min_section_size=300
    )

    # Stage 1: Semantic chunking
    chunker = SemanticChunker(chunking_config)
    sentences = chunker.split_into_sentences(sample_text)
    chunks = chunker.chunk_sentences(sentences)

    print(f"Stage 1 Complete - Generated {len(chunks)} semantic chunks")

    # Stage 2: Section grouping
    grouper = SectionGrouper(sectioning_config)
    sections = grouper.group_chunks_into_sections(chunks)

    # Write sections to JSON file
    output_file = Path("./example_cache/sections.json")
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with output_file.open("w", encoding="utf-8") as f:
        json.dump([section.model_dump() for section in sections], f, indent=2, ensure_ascii=False)
    log.info(f"Sections written to {output_file}")

    print(f"Stage 2 Complete - Generated {len(sections)} sections")

    # Display results
    print("\n" + "=" * 80)
    print("DOCUMENT STRUCTURE")
    print("=" * 80)

    for i, section in enumerate(sections, 1):
        print(f"\nSECTION {i}: {section.section_title}")
        print(f"Category: {section.category}")
        print(f"Total chunks: {section.total_chunks} | Length: {section.total_length} chars")
        print("-" * 60)

        for j, subsection in enumerate(section.subsections, 1):
            print(f"  {i}.{j} {subsection.subsection_title}")
            print(f"       Chunks: {subsection.chunk_count} | Length: {subsection.total_length} chars")

            # Show first chunk preview
            if subsection.chunks:
                preview = (
                    subsection.chunks[0][:200] + "..." if len(subsection.chunks[0]) > 200 else subsection.chunks[0]
                )
                print(f"       Preview: {preview}")
            print()

if __name__ == "__main__":
    main()