CNNSum: Exploring Long-Context Summarization with Large Language Models in Chinese Novels (2025)

Lingxiao Wei¹, He Yan², Xiangju Lu², Junmin Zhu², Jun Wang¹, Wei Zhang¹*
¹East China Normal University   ²iQIYI Inc.
51265901053@stu.ecnu.edu.cn   zhangwei.thu2011@gmail.com
*Corresponding author.

Abstract

Large Language Models (LLMs) have been well studied on many long-context tasks. However, because annotation is costly, high-quality long-context summarization datasets for training or evaluation remain scarce, limiting further research. In this work, we introduce CNNSum, a new multi-scale Chinese long-context novel summarization benchmark with four subsets covering lengths from 16k to 128k tokens, 695 samples in total, whose annotations are human-driven. We evaluate commercial and open-source models on CNNSum and conduct a detailed analysis. Based on these observations, we further explore fine-tuning with short-context summary data. Our study shows: (1) GPT-4o underperforms due to excessive subjective commentary. (2) Long-context summarization currently relies mainly on memory ability; small LLMs with stable longer context lengths are the most cost-effective, and long data concatenated from short-context summaries yields significant improvement. (3) Prompt templates may cause a large performance gap, which can be mitigated through fine-tuning. (4) Fine-tuned Chat or Instruction versions may harm the Base model, and further fine-tuning cannot bridge the gap. (5) While models with RoPE base scaling exhibit strong extrapolation potential, their performance can vary significantly when combined with other interpolation methods and requires careful selection. (6) CNNSum provides more reliable and insightful evaluation results than other benchmarks. We release CNNSum to advance research in this field (https://github.com/CxsGhost/CNNSum).



1 Introduction

The context length of Large Language Models (LLMs) has been significantly extended (Pawar et al., 2024; Wang et al., 2024b). Long-text summarization is a crucial task for assessing the long-context ability of LLMs and forms the basis for understanding complex contextual facts (Pagnoni et al., 2021; Bhandari et al., 2020; Kim et al., 2024). Unlike QA or reasoning tasks, summarization is more subjective and lacks an objective gold standard. Obtaining high-quality summary data is challenging and annotation is costly. This is especially true for long texts, which require a global understanding and memory of the context; even human experts struggle to annotate them directly (Qiu et al., 2024). Therefore, long-text summarization datasets for training or evaluation are scarce, particularly in Chinese. The latest open-source book-length dataset remains BookSum (Kryściński et al., 2021). Recent studies on book-length summarization (Chang et al., 2023; Kim et al., 2024; Song et al., 2024) focus on evaluating sentence coherence or faithfulness, and most require expensive human annotation. However, if outputs are disorganized, meaningless, or fail to follow instructions by generating context-related content rather than a summary, these evaluations hold limited significance. The core question is how LLMs handle long-context summarization tasks. Due to the lack of datasets, systematic research and guidance remain insufficient.

Recently, many benchmarks for evaluating long-context LLMs have emerged (An et al., 2023; Bai et al., 2023b; Ni et al., 2024; Qiu et al., 2024; Zhang et al., 2024b). However, their summarization tasks do not appear to be valued or carefully constructed. Regardless of language and domain, we summarize their shortcomings: (1) They are built on previous short summarization datasets (Fabbri et al., 2019; Huang et al., 2021; Sharma et al., 2019; Wu et al., 2023; Kryściński et al., 2021), leading to a high risk of data leakage (Xu et al., 2024; Li et al., 2024). (2) The sample size is very small, sometimes only a few dozen. (3) The average and maximum lengths are short (e.g., below 16k). (4) The absence of multi-scale subsets based on length limits evaluation of LLMs at different context lengths. (5) Annotations are collected from the web or synthesized by LLMs. Web-collected annotations carry a high data-leakage risk and low quality, while LLM-generated book-length summaries tend to have issues such as coherence errors (Chang et al., 2023) and factual errors (Song et al., 2024; Kim et al., 2024). Each of these benchmarks exhibits at least two of the above shortcomings.

To better explore and evaluate long-context summarization with LLMs and address the scarcity of Chinese datasets, we introduce Chinese Novel Summarization (CNNSum), a newly constructed multi-scale benchmark that addresses all the shortcomings above. Based on a newly collected Chinese novel corpus and preset target token lengths, we built four subsets: L (16k), XL (32k), 2XL (64k), and 3XL (128k), with 695 samples in total. The annotations were completed by human experts with assistance from LLMs; details are given in Section 3. Sclar et al. (2024) demonstrate that prompt templates significantly impact LLM performance, and Han et al. (2024) and Liu et al. (2024c) show that beginning and ending tokens are crucial in long contexts. For summarization, the key is the relative position of the context and the "summary instruction". We therefore define two prompt types: Instruction at the Beginning (Prompt-IB) and Instruction at the End (Prompt-IE). Using the prompts in Appendix E.1, we benchmark many commercial and open-source LLMs on CNNSum. According to the results, GPT-4o (OpenAI, 2024b) unexpectedly underperformed other commercial models, and we analyzed specific cases to identify the reasons. The two prompts can result in a significant quality gap in LLM outputs, and the Chat or Instruction version may undermine the Base model's long-context summarization ability. Memory capacity over a longer context length is critical for performance, as large LLMs may struggle to fully leverage their logical reasoning and comprehension abilities, suggesting that small LLMs are cost-effective.

Based on observations from the CNNSum benchmark, we further explore fine-tuning LLMs specifically for summarization. However, training directly on long-context summary data is costly and challenging due to data scarcity. Xiong et al. (2023) and Fu et al. (2024) demonstrate that training on short data can yield longer-context ability, requiring only a small amount of long data to activate it. Therefore, we redesigned the prompts (Appendix E.2) and concatenated summary data into longer sequences for training. Our findings indicate that: (1) The performance gap caused by prompt templates can be mitigated, and Base models are better suited for further fine-tuning and extrapolation. (2) Fine-tuning with short-context summary data can significantly improve long-context summarization performance. (3) Current open-source models widely use Adjusted Base Frequency (ABF) (Xiong et al., 2023), which improves the extrapolation potential of the original RoPE and maintains decent performance at several times the training length. However, when combined with other interpolation methods, their extrapolation characteristics change, showing interesting differences depending on the RoPE base. (4) The only work similar to CNNSum is CLongEval-LStSum (Qiu et al., 2024), constructed by translating BookSum (Kryściński et al., 2021) and merging short annotations with GPT-4; we also use it in our fine-tuning experiments. It likewise has multi-scale subsets, but its sample lengths span a wide range and are linearly and uniformly distributed. While this approach easily covers a wide length range, we find it may lead to misleading results, especially when evaluating extrapolation performance. In contrast, CNNSum employs a more rigorous and reasonable sampling strategy for its multi-scale subsets, providing more reliable and insightful evaluation results.

In summary, we present CNNSum, a multi-scale Chinese long-context novel summarization benchmark that significantly improves on existing datasets' shortcomings in design and construction. We evaluate and analyze the factors affecting long-context summarization with LLMs, further explore fine-tuning LLMs for long-context summarization, and demonstrate the advantages of CNNSum in evaluation. We hope to advance research in this field and provide valuable insights.

2 Related Work

2.1 Long-Context Extension for LLMs

RoPE (Su et al., 2024) is commonly used in LLMs, but its extrapolation beyond the training length is limited. Position Interpolation (PI) (Chen et al., 2023b) achieves efficient extension within the pretraining range but still requires fine-tuning at the target length. NTK-Aware scaling (u/LocalLLaMA, 2023) and DynamicNTK (r/LocalLLaMA, 2024) mitigate the loss of high-frequency information caused by interpolation, enabling direct extrapolation but causing an overall perplexity increase. YaRN (Peng et al., 2023) combines these methods and achieves 2x extrapolation. ResonanceRoPE (Wang et al., 2024a) can be integrated with other methods to further enhance performance. CLEX (Chen et al., 2023a) is a more efficient dynamic scaling method, and LongRoPE (Ding et al., 2024) leverages RoPE's non-uniformity to search for better interpolations; both achieve 4x or 8x extrapolation, but their iteration processes are complex. The above methods are mostly validated on Llama (Touvron et al., 2023a) or Llama2 (Touvron et al., 2023b), with good performance. Adjusted Base Frequency (ABF) (Xiong et al., 2023) has been widely adopted in advanced LLMs (Yang et al., 2024; Team, 2024; Young et al., 2024; Cai et al., 2024), significantly improving extrapolation potential, and Liu et al. (2023) further propose scaling laws for the RoPE base. There has been no systematic study of applying other interpolation methods on top of this.
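To make the differences among these schemes concrete, the sketch below shows how PI, NTK-Aware scaling, and ABF each modify the RoPE frequencies. The base values, head dimension, and scale factor are illustrative only and are not tied to any specific model discussed in this paper.

```python
import numpy as np

def rope_inv_freq(base: float, head_dim: int) -> np.ndarray:
    """Per-pair inverse frequencies of RoPE: theta_i = base^(-2i/d)."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

d, s = 128, 4                                 # illustrative head dimension and scale factor
inv_freq = rope_inv_freq(10_000, d)           # original RoPE (Llama-style base)

# Position Interpolation (PI): shrink position indices by s, which is
# equivalent to dividing every frequency by s (all dimensions compressed).
inv_freq_pi = inv_freq / s

# NTK-Aware scaling: enlarge the base instead, b' = b * s^(d / (d - 2)),
# compressing low frequencies while nearly preserving high-frequency dimensions.
inv_freq_ntk = rope_inv_freq(10_000 * s ** (d / (d - 2)), d)

# ABF: pretrain or fine-tune directly with a much larger fixed base (e.g., 500,000)
# rather than rescaling an existing model at inference time.
inv_freq_abf = rope_inv_freq(500_000, d)
```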

Efficient attention is another direction of focus. StreamingLLM (Xiao et al., 2023) supports infinite inputs but may be less suitable for summarization due to its lack of long-term memory; infinite-length methods such as LM-Infinite (Han et al., 2024; Xiao et al., 2024a; Jiang et al., 2024b) share the same limitation of being unable to access the full context. LongLoRA (Chen et al., 2023c) and related methods (Zhang et al., 2024a; Han et al., 2023; Xiao et al., 2024b) can be used in the fine-tuning phase, but their impact on extrapolation ability needs further investigation. Other approaches (Liu et al., 2024b; Liu and Abbeel, 2023; Ding et al., 2023; Ao et al., 2024) are designed for large-scale hardware training.

2.2 Long-Context Summarization Evaluation

ROUGE (Lin, 2004) remains the most popular metric for summarization tasks in benchmarks because it is simple to implement and low cost; it measures the information overlap between outputs and reference summaries. Chang et al. (2023) propose a protocol for evaluating the coherence of book-length summaries generated by LLMs and implement an LLM-based automatic evaluation method. This work does not focus on faithfulness, so it relies not on gold-standard summaries but on advanced LLMs such as GPT-4o (OpenAI, 2024b). Kim et al. (2024) are the first to conduct a large-scale evaluation of faithfulness and content selection, demonstrating that the faithfulness of LLM summarization still needs improvement. However, that study mainly focuses on commercial models, while many open-source models still struggle to generate normal outputs for long-context summarization. Both studies involve highly costly human annotation. Krishna et al. (2023) propose guidelines for human evaluation of faithfulness in long-form summarization but do not extend them to automatic evaluation. Song et al. (2024) introduce a fine-grained summarization evaluator built through prompt construction, capable of assessing faithfulness, completeness, and more; it also relies on advanced LLMs for reliable evaluation. All these studies are based on English.

2.3 Long-Context LLMs for Chinese

Although the Chinese language domain possesses one of the largest and richest corpora in the world, earlier open-source LLMs often provided poor support for Chinese. A specific issue is that their tokenizers exhibit low efficiency when converting Chinese text into tokens, meaning more tokens are required to represent the same Chinese text. Since the context length of LLMs is defined at the token level, this limits the actual length of Chinese text that can be processed. Moreover, as the attention mechanism has a computational complexity of $O(n^2)$, more tokens mean significantly increased training and inference costs when processing Chinese text.

We performed a rough calculation of the Chinese encoding efficiency of several open-source LLMs on the Chinese book corpus we collected, as shown in Table 1. Llama (Touvron et al., 2023a) and Llama2 (Touvron et al., 2023b) have extremely low Chinese encoding efficiency, limiting the effective number of Chinese characters they can process to around 2k. The derived Mistral (Jiang et al., 2023, 2024a) and LWM-Text (Liu et al., 2024a) support context lengths of up to 128k and 1M, respectively, but still exhibit low Chinese encoding efficiency. The recently released Llama3.1 (Dubey et al., 2024) and Ministral (AI, 2023) support a 128k context length and benefit from a substantial increase in vocabulary size to over one hundred thousand, resulting in significantly improved, though still not ideal, Chinese encoding efficiency. The bilingual LLMs Yi (Young et al., 2024) and ChatGLM3 (Du et al., 2021; Zeng et al., 2022), as well as InternLM (Cai et al., 2024), all developed by Chinese teams, achieve higher Chinese encoding efficiency with vocabularies of fewer than one hundred thousand, supporting context lengths from 128k to 1M. Baichuan2 (Yang et al., 2023) has the highest Chinese encoding efficiency, with a vocabulary size exceeding one hundred thousand, but it mainly supports Chinese and English. Qwen (Bai et al., 2023a; Yang et al., 2024; Team, 2024) and GLM4 (GLM et al., 2024) support multiple languages and 128k context lengths, achieving relatively high Chinese encoding efficiency with vocabularies under two hundred thousand. The vocabulary sizes of Command R (Cohere For AI, 2024) and Gemma 2 (Team et al., 2024b) exceed two hundred thousand, leading to a higher proportion of parameters in the embedding layer. This may significantly increase training costs for methods that require fine-tuning the embedding layer (Tao et al., 2024), such as LongLoRA (Chen et al., 2023c).

Table 1: Vocabulary size and Chinese encoding efficiency (Chinese characters per token).

Model Series | Vocab Size | ZH Chars / Token
Llama / Llama2 | 32,000 | 0.65
Mistral | 32,768 | 0.82
Llama3 / Llama3.1 | 128,256 | 1.12
Ministral-8B-Instruct-2410* | 131,072 | 1.02
Yi / Yi-1.5 | 64,000 | 1.36
ChatGLM2 / ChatGLM3 | 65,024 | 1.43
InternLM2 / InternLM2.5 | 92,544 | 1.42
Baichuan2 | 125,696 | 1.49
Qwen1.5 / Qwen2 / Qwen2.5 | 152,064 | 1.40
GLM4 | 151,552 | 1.45
Command R | 256,000 | 1.23
Gemma 2 | 256,000 | 1.37
GPT-4o-2024-08-06* | ? | 1.15
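The figures in Table 1 come from the authors' own book corpus; as a reference, the sketch below shows one way such a characters-per-token ratio can be estimated. The Hugging Face model id and the corpus variable are placeholders.

```python
from transformers import AutoTokenizer

def chars_per_token(model_name: str, texts: list[str]) -> float:
    """Rough encoding efficiency: Chinese characters represented per token."""
    tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tok.encode(t, add_special_tokens=False)) for t in texts)
    return total_chars / total_tokens

# e.g. chars_per_token("01-ai/Yi-6B", book_chunks)  # book_chunks: list of Chinese passages
```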

3 CNNSum Construction

3.1 Chinese Book Corpus Collection

We collected approximately 100 Chinese books from open-source data on the Chinese internet, each with a clear chapter structure. We first excluded books composed of multiple short stories lacking a fixed or explicit main storyline, since their context exhibits no long-distance dependencies or causal relationships, making it difficult to create a cohesive summary. We then randomly selected multiple chapters from the remaining books and used Qwen2-7B-Instruct to generate summaries for them. If the model identified the book a chapter belonged to or included additional, unrelated book information, we eliminated these "popular books" to ensure the fairness and validity of our benchmark, because such books are likely to have been extensively leaked into the training data of other advanced LLMs. Ultimately, we filtered out about 20 books.

Our collection includes many web-serialized novels, which often exhibit the author's unique writing habits and formatting styles. Common cases include non-standard punctuation, interactive content for readers randomly interspersed in the text, or additional annotations. These distinctive formats and contents may increase the difficulty of understanding for LLMs or disrupt the coherence of the context, as illustrated by the case in Appendix D.1. We formatted these web-serialized novels by designing regular expressions, supplemented by manual checks, to obtain a cleaner corpus.
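The paper does not list the exact expressions used; the sketch below only illustrates the kind of rule-based cleaning described above, with hypothetical patterns for reader-facing filler and decorative punctuation.

```python
import re

# Hypothetical cleaning rules; the actual patterns were designed per book and
# supplemented by manual checks, so treat these purely as illustrations.
CLEANING_RULES = [
    (re.compile(r"[ \t\u3000]{2,}"), " "),                      # collapse repeated spaces
    (re.compile(r"[~～]{2,}"), "～"),                            # runs of decorative tildes
    (re.compile(r"^.*(求收藏|求推荐票|作者的话).*$", re.M), ""),  # reader-facing filler lines (examples)
]

def clean_chapter(text: str) -> str:
    for pattern, repl in CLEANING_RULES:
        text = pattern.sub(repl, text)
    # drop the blank lines left behind by removed filler
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```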

Table 2: Token-length statistics (Min / Q1 / Mean / Max) of the CNNSum subsets under different tokenizers.

Model Series | CNNSum-L (Count=190) | CNNSum-XL (Count=195) | CNNSum-2XL (Count=200) | CNNSum-3XL (Count=110)
Yi / Yi-1.5 | 13,447 / 14,841 / 15,721 / 18,323 | 27,586 / 30,932 / 31,844 / 34,703 | 56,323 / 63,683 / 64,303 / 67,403 | 115,351 / 128,673 / 128,751 / 132,688
Llama3 / Llama3.1 | 15,128 / 18,203 / 19,097 / 23,408 | 31,795 / 37,344 / 38,557 / 42,659 | 67,952 / 76,326 / 77,845 / 85,690 | 140,920 / 152,603 / 155,339 / 166,504
Qwen1.5 / Qwen2 / Qwen2.5 | 12,938 / 14,529 / 15,324 / 18,501 | 27,130 / 30,189 / 31,026 / 34,104 | 54,760 / 61,718 / 62,526 / 66,844 | 110,373 / 123,792 / 124,995 / 130,707
GPT-4o-2024-08-06 | 15,564 / 17,974 / 18,814 / 22,792 | 32,428 / 36,676 / 37,984 / 42,177 | 68,641 / 74,996 / 76,598 / 85,231 | 138,417 / 150,226 / 152,875 / 164,939

3.2 Multi-Scale Sampling

We use the Yi (Young et al., 2024) tokenizer to process all chapters in the book corpus and set a token-level target length $T$ for each subset. We allow the data length to fluctuate within a specified range around $T$. Each target length $T$ has a distinct lower bound to avoid scoring bias from overly short data, while the upper bound is uniformly set to $T + 2k$ with $k = 1024$, allowing limited extrapolation without causing score degradation. The settings for the four subsets are as follows:

$$
\begin{aligned}
T_{L} &= 16k, & Range_{L} &= [16k - 4k,\; 16k + 2k] \\
T_{XL} &= 32k, & Range_{XL} &= [32k - 6k,\; 32k + 2k] \\
T_{2XL} &= 64k, & Range_{2XL} &= [64k - 10k,\; 64k + 2k] \\
T_{3XL} &= 128k, & Range_{3XL} &= [128k - 16k,\; 128k + 2k]
\end{aligned}
$$

Each book in the corpus has a chapter structure. A book $B$ composed of $i$ chapters can be represented as $B = \{c_0, c_1, \ldots, c_i\}$. For a target interval $R$ and a book $B$, the sampling method is as follows: (1) Initialize a variable sliding window $w$ starting from chapter $c_0$, expanding chapter by chapter. (2) When $w = [c_j, c_{j+1}, \ldots, c_{k-1}, c_k]$ with $j < k < i$:

  • If the length of $w < \inf R$, continue expanding with chapter $c_{k+1}$.

  • If the length of $w \in R$, record $w$ as a sample and begin a new sampling from chapter $c_{k+1}$.

  • If the length of $w > \sup R$, remove chapters sequentially from $c_j$ onward until $w$ falls within or below $R$.

(3) Follow this procedure until no chapters remain in $B$, then proceed to the next book. This sampling method is applied for each interval $R$ across our book corpus, generating a candidate set for each range.
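A minimal sketch of this procedure is given below; `token_len` stands for counting tokens with the Yi tokenizer, and the exact handling of a window that still overshoots after trimming is an implementation choice not specified above.

```python
def sample_windows(chapters, token_len, lo, hi):
    """Chapter-level sliding-window sampling for one subset range R = [lo, hi]:
    expand the window chapter by chapter, trim from the front when it overshoots,
    and record every window whose total length lands inside R."""
    samples, window = [], []
    for chapter in chapters:                      # chapters of one book, in order
        window.append(chapter)
        while sum(token_len(c) for c in window) > hi:
            window.pop(0)                         # remove chapters from c_j onward
        if lo <= sum(token_len(c) for c in window) <= hi:
            samples.append(list(window))          # record a sample
            window = []                           # start a new sampling afterwards
    return samples

# Subset ranges in tokens (k = 1024), following the definition above:
RANGES = {
    "L":   (12 * 1024, 18 * 1024),
    "XL":  (26 * 1024, 34 * 1024),
    "2XL": (54 * 1024, 66 * 1024),
    "3XL": (112 * 1024, 130 * 1024),
}
```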

We perform a final screening of the candidate sets. Although the volume of each candidate set significantly exceeds our intended volume, variations in total length and chapter length across books result in considerable differences in the amount of data sampled per book. For example, books with a longer total length and shorter chapters tend to yield more samples, whereas books with a shorter total length may yield very little or no data for the 3XL candidate set. For each candidate set, we prioritize selecting data from books with fewer samples. Regarding length, although we set ranges, we still prefer data closer to the target length. We evaluated the length of CNNSum with the tokenizers of advanced open-source model series, with detailed statistics shown in Table 2. Under the Yi tokenizer, each subset contains some shorter data to maintain moderate diversity in length, but we strictly control the first quartile to align closely with the preset target length $T$, ensuring the average length is approximately $T$. The final data volume was determined by two considerations: (1) the annotation cost of long-context summaries, and (2) the differences in sampled data volume across books; increasing the total volume could lead to an imbalanced representation of books, potentially introducing new content-related biases.

3.3 Summary Annotation

Chang et al. (2023) summarize the methods for generating long-text summaries with LLMs, namely hierarchical merging (Wu et al., 2021) and incremental updating, while noting their limitations. These methods trust LLMs' short-context summarization ability (e.g., 4k or 8k). Building on them, our approach involves more human guidance. For each sampled data point, we first generate a "plot synopsis" for each chapter via commercial LLMs, retaining more of the original plot and character information; the prompt is given in Appendix E.3. Annotators then read each synopsis, select and retain key plots based on their judgment, and merge them into a final summary. If they find conflicting content, annotators ask the LLMs to locate the related plots in the original chapter and correct them manually. We require annotators not only to delete parts of the LLM output and merge the rest, but also to rewrite the result in their own words. We suggest annotators avoid adding subjective commentary and focus instead on objective content. Although this may sacrifice some coherence, it increases the density of effective information. For L and XL we limit the maximum summary length to 500 words, and to 600 words for 2XL and 3XL. On one hand, adding more words for coherence is not cost-effective; on the other hand, as observed in Section 4, most models do not generate overly lengthy summaries. We use crowdsourcing for annotation to improve efficiency and reduce individual bias.

4 Experiments

4.1 Baselines

Commercial Models

(1) OpenAI's flagship model GPT-4o, i.e., GPT-4o-2024-08-06 (OpenAI, 2024b), and the lite version GPT-4o-mini-2024-07-18 (OpenAI, 2024a), both with a 128k context length. (2) Moonshot-v1-128k (https://api.moonshot.cn). (3) Qwen-plus-2024-09-19 (https://www.aliyun.com/product/tongyi) with a 128k context length. (4) Doubao-pro-128k (https://www.volcengine.com/product/doubao). (5) DeepMind's Gemini-1.5-pro (Team et al., 2024a), with a maximum context length of up to 2 million tokens.

Open-source Models

(1) Yi (Young et al., 2024) extends context via ABF (Xiong et al., 2023). Yi-6B and Yi-6B-Chat default to a 4k context, consistent with the training phase, and theoretically support extrapolation. Yi-6B-200K and Yi-34B-200K use a larger RoPE base to extend the context to 200k. Yi-1.5-34B-32K and Yi-1.5-34B-Chat-16K are the largest long-context models in the Yi-1.5 series. (2) InternLM2.5 (Cai et al., 2024) also extends the context by adjusting the RoPE base (Liu et al., 2023) and defaults to DynamicNTK (r/LocalLLaMA, 2024). For context length, InternLM2.5-20B defaults to 256k, InternLM2.5-20B-Chat to 32k, and the special version InternLM2.5-7B-Chat-1M supports 1 million tokens. (3) ChatGLM3-6B-128K is the 128k version of ChatGLM3-6B (Du et al., 2021). GLM4-9B-Chat-1M is a special version of GLM4 (GLM et al., 2024) using LongAlign (Bai et al., 2024) and ABF. (4) The Llama3.1 series is the 128k version of Llama3 (Dubey et al., 2024) with RoPE modified by a YaRN-like (Peng et al., 2023) method. (5) LWM-Text-1M is the language-only version of the Large World Model (Liu et al., 2024a), trained with RingAttention (Liu et al., 2024b). (6) Ministral-8B-Instruct-2410 (AI, 2023) is an edge model with a 128k context. (7) Qwen1.5, Qwen2 (Yang et al., 2024), and Qwen2.5 (Team, 2024) all extend context using ABF.

Table 3: Average ROUGE-L scores on CNNSum (averaged over Prompt-IB and Prompt-IE) and the MSE between the scores of the two prompts.

Model | Ctx. Len. | L (16k) | XL (32k) | 2XL (64k) | 3XL (128k) | MSE (P-IE vs P-IB)

Commercial Models
GPT-4o-2024-08-06 | 128k | 15.5 | 14.2 | 12.5 | - | 0.0
GPT-4o-mini-2024-07-18 | 128k | 15.7 | 13.7 | 12.8 | - | 0.3
Gemini-1.5-pro | 2m | 19.3 | 18.1 | 16.8 | 14.6 | 0.1
Qwen-plus-2024-09-19 | 128k | 20.5 | 18.5 | 16.4 | 14.8 | 0.1
Doubao-pro-128k | 128k | 19.6 | 17.8 | 15.5 | 13.2 | 0.6
Moonshot-v1-128k | 128k | 22.4 | 20.3 | 18.0 | 15.2 | 0.9

Open-source Models
Yi-6B | 4k | 8.8 | 4.4 | 2.7 | 2.3 | 1.1
Yi-6B-Chat | 4k | 13.5 | 6.9 | 2.1 | 2.1 | 0.4
Yi-6B-200K | 200k | 9.9 | 9.4 | 8.8 | 4.0 | 4.5
Yi-34B-200K | 200k | 12.1 | 11.3 | 10.8 | 10.0 | 0.1
Yi-1.5-34B-32K | 32k | 11.6 | 10.5 | 9.6 | 0.1 | 14.5
Yi-1.5-34B-Chat-16K | 16k | 13.8 | 12.3 | 10.9 | 0.0 | 7.8
InternLM2.5-7B-Chat-1M | 1m | 18.0 | 17.1 | 14.7 | 13.0 | 0.9
InternLM2.5-20B | 256k | 18.2 | 16.3 | 14.4 | 12.7 | 0.1
InternLM2.5-20B-Chat | 32k | 18.9 | 17.3 | 16.2 | 14.2 | 0.1
ChatGLM3-6B-128K | 128k | 17.2 | 16.1 | 14.8 | 13.7 | 0.1
GLM4-9B-Chat-1M | 1m | 19.0 | 17.9 | 16.5 | 15.4 | 0.2
Llama3.1-8B | 128k | 7.8 | 8.0 | 8.4 | 3.1 | 7.2
Llama3.1-8B-Instruct | 128k | 15.6 | 14.3 | 12.8 | 9.9 | 1.4
Llama3.1-70B | 128k | 12.2 | 8.9 | 6.9 | 7.5 | 5.0
Llama3.1-70B-Instruct | 128k | 17.9 | 15.7 | 13.8 | 10.6 | 0.5
LWM-Text-1M | 1m | 3.3 | 3.0 | 2.5 | 1.1 | 0.2
Ministral-8B-Instruct-2410 | 128k | 16.2 | 14.0 | 11.3 | 3.5 | 0.1
Qwen1.5-7B | 32k | 12.1 | 10.9 | 5.7 | 2.6 | 34.4
Qwen1.5-7B-Chat | 32k | 15.1 | 13.6 | 10.3 | 2.7 | 0.3
Qwen1.5-32B-Chat | 32k | 15.7 | 14.2 | 5.0 | 4.2 | 0.3
Qwen2-7B | 128k | 9.8 | 8.4 | 8.0 | 6.9 | 1.5
Qwen2-7B-Instruct | 32k | 15.2 | 13.3 | 12.3 | 11.3 | 1.5
Qwen2-72B | 128k | 14.8 | 13.2 | 11.7 | 10.4 | 16.3
Qwen2-72B-Instruct | 32k | 15.5 | 13.2 | 11.5 | 9.2 | 19.0
Qwen2.5-7B-Instruct | 32k | 15.7 | 14.0 | 12.7 | 9.6 | 3.4
Qwen2.5-72B-Instruct | 32k | 19.6 | 17.6 | 13.6 | 13.4 | 1.3

4.2 Experimental Setup

Based on the statistics of the annotated summary lengths, we set the maximum generation length to 400 tokens for L and XL and 500 tokens for 2XL and 3XL. For automatic evaluation, we first segment the Chinese outputs with jieba (https://github.com/fxsjy/jieba), then compute ROUGE-L (Lin, 2004) to measure the information overlap between the outputs and the reference summaries. All experiments were conducted on NVIDIA A100 (80GB) GPUs.
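The scoring step can be sketched as follows; the paper does not name a specific ROUGE implementation, so the `rouge` package here is one possible choice.

```python
import jieba
from rouge import Rouge  # one possible ROUGE implementation (pip install rouge)

def rouge_l_zh(prediction: str, reference: str) -> float:
    """Segment Chinese text with jieba, then compute ROUGE-L F1 over the
    space-joined word sequences."""
    pred = " ".join(jieba.cut(prediction))
    ref = " ".join(jieba.cut(reference))
    if not pred.strip() or not ref.strip():
        return 0.0
    return Rouge().get_scores(pred, ref)[0]["rouge-l"]["f"]
```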

Baseline Evaluation

The prompt templates we used are shown in Appendix E.1. For commercial model APIs, we set temperature = 0 to reduce randomness in generation. GPT-4o is limited by its Chinese encoding efficiency (https://github.com/openai/tiktoken) and can only complete up to 2XL. Gemini and Qwen have strict content safety checks, which blocked 29% and 2% of samples, respectively. Doubao excluded 8% of samples on 3XL due to excessive length. These cases all introduce score bias. For open-source models, we apply greedy sampling for deterministic results. We use vLLM (Kwon et al., 2023) for inference and modify the source code related to positional embeddings for correct extrapolation. For models that use RoPE scaling methods by default, we keep their original settings. The Qwen series Instruct models can optionally enable YaRN for long contexts; for consistency and extrapolation evaluation, we keep the default RoPE settings.
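A minimal sketch of the open-source inference setup, assuming a Hugging Face model id, a context limit chosen per subset, and a list of prompts built from the Appendix E.1 templates; extrapolating beyond a model's native length relies on the positional-embedding modifications mentioned above and is not shown here.

```python
from vllm import LLM, SamplingParams

# Greedy decoding for deterministic outputs; max_tokens follows the per-subset
# generation limits (400 for L/XL, 500 for 2XL/3XL).
llm = LLM(model="Qwen/Qwen2-7B-Instruct", max_model_len=32 * 1024)  # placeholders
params = SamplingParams(temperature=0.0, max_tokens=400)

outputs = llm.generate(prompts, params)          # `prompts` built from Appendix E.1 templates
summaries = [o.outputs[0].text for o in outputs]
```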

Fine-tuning Datasets

Our training dataset contains about 9,000 short novel-excerpt summaries that do not overlap with CNNSum's corpus. Under the Yi tokenizer, their lengths range from 2k to 4k tokens. The prompt templates are shown in Appendix E.2. We randomly concatenated the samples, similarly to Section 3.2, targeting lengths of 14k~18k and 30k~34k, with average lengths of about 16k and 32k. The main motivation is that longer sequences let LLMs adapt to more positions and activate their extrapolation ability. A recent study proposed a similar data augmentation strategy (Tian et al., 2024) and showed that it trains LLMs to focus more on relevant contexts, enhancing long-context ability.
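A sketch of this concatenation strategy, assuming each short sample is already rendered as a full prompt–summary training text; how overly long combinations are handled and the exact packing format are assumptions rather than details from the paper.

```python
import random

def pack_samples(rendered_samples, token_len, lo=14 * 1024, hi=18 * 1024):
    """Randomly group short summary samples so that each pack falls in
    [lo, hi] tokens (~16k on average); the 32k setting would use lo=30k, hi=34k."""
    random.shuffle(rendered_samples)
    packs, current, length = [], [], 0
    for sample in rendered_samples:
        current.append(sample)
        length += token_len(sample)
        if length >= lo:
            if length <= hi:
                packs.append("\n".join(current))   # one long training sequence
            current, length = [], 0                # overshoots are simply discarded here
    return packs
```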

Fine-tuning Experiments

Following the findings of Chen et al. (2023c), we use LoRA (rank = 8) (Hu et al., 2021) and unfreeze the embedding and normalization layers. We used FlashAttention-2 (Dao, 2024), DeepSpeed ZeRO2 / ZeRO3 with CPU offloading (Ren et al., 2021), and Adam (Loshchilov, 2017) with a learning rate of 2e-5 and a linear warmup of 30 steps. Inference again used vLLM. For fine-tuning on concatenated data with an average length of 16k, we set the global batch size to 16. Because RoPE-based scaling methods with different scales s converge at different rates (Peng et al., 2023), we evaluated multiple checkpoints between 400 and 500 steps and selected the best result. For fine-tuning on concatenated data with an average length of 32k, we set the global batch size to 8 and started from a checkpoint fine-tuned for 300 steps on the 16k data. We continued fine-tuning for another 200 to 300 steps and selected the best result. Each experiment was repeated at least three times to minimize randomness.
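A sketch of this fine-tuning configuration with the PEFT library: only the LoRA rank and the decision to keep the embedding and normalization layers trainable come from the setup above, while the target module names, alpha value, and model id are assumptions.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("01-ai/Yi-6B", torch_dtype=torch.bfloat16)

lora_cfg = LoraConfig(
    r=8,                       # rank from the setup above
    lora_alpha=16,             # assumed
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    modules_to_save=["embed_tokens", "norm"],  # keep embedding / normalization layers trainable
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```

Keeping the embedding matrix fully trainable is also what makes very large vocabularies costly for this kind of fine-tuning, echoing the note in Section 2.3.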

4.3 Main Results on CNNSum

The detailed results are shown in Appendix Table 7. We calculated the average scores and used the mean squared error (MSE) between the two prompts' scores to measure the performance differences caused by prompts, as shown in Table 3. Overall, commercial models significantly outperform most open-source models. A lower MSE indicates smaller performance differences between the two prompts, i.e., more stable long-context summarization ability. Moonshot-v1 performed best, with its advantage most evident from L to 2XL; however, its performance dropped by 32.1% at 3XL. Gemini-pro, Qwen-plus, and Doubao-pro performed similarly. Doubao-pro showed the largest decrease from L to 3XL, at 32.7%, while Gemini-pro showed the smallest, at 24.4%. Unexpectedly, GPT-4o performed the worst, lagging behind the other models by 20% to 30% on average from L to 2XL.
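For reference, the prompt-sensitivity metric reduces to a plain MSE over the per-subset score pairs; the example numbers below are the two prompt score rows of Qwen2-72B (see Table 4) and reproduce its MSE of 16.3 in Table 3.

```python
def prompt_mse(scores_ie, scores_ib):
    """MSE between per-subset ROUGE-L scores under Prompt-IE and Prompt-IB;
    lower values mean more stable behaviour across prompt templates."""
    pairs = list(zip(scores_ie, scores_ib))
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)

print(prompt_mse([12.1, 10.7, 10.3, 9.5], [17.5, 15.7, 13.1, 11.3]))  # ≈ 16.3
```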

Several open-source models are strongly competitive. GLM4-9B-Chat-1M performed best overall, comparable to or better than commercial models across all subsets, with only an 18.9% performance decrease from L to 3XL. InternLM2.5-20B-Chat showed robust extrapolation capability, with a performance decrease of 24.8%. Among the Qwen series, Qwen2.5-72B-Instruct performed best, but its performance decreased by 31.6% when extrapolating to 3XL. Other Qwen models performed similarly on L and XL, with the Qwen2 series decreasing 20% to 40% on 3XL; Qwen1.5 produced only a few meaningful outputs on 2XL. Due to lower Chinese encoding efficiency, Llama3.1 and Ministral required extrapolation at 3XL, with Llama3.1 showing an average performance decrease exceeding 40% and Ministral reaching 78.3%. The Yi-200K series demonstrated weak generation capability: Yi-34B-200K generally output very short summaries of one or two sentences, or meaningless repetitions, and Yi-6B-200K performed even worse. LWM-Text-1M's insufficient Chinese capability led to mostly meaningless outputs.

4.4 Analysis and Case Study

Why Did GPT-4o Fail?

One obvious reason is that, according to Tables 1 and 2, GPT-4o's Chinese encoding efficiency is relatively low, so it must process more tokens, which degrades performance. Even so, GPT-4o's performance on L still trails other commercial models on XL by an average of 16%. Comparing outputs, we find that GPT-4o and Qwen-plus generated brief summaries, while Moonshot-v1 and Gemini-pro generated longer ones. Moonshot-v1's outputs are significantly longer than all other models', with many samples continuing until the maximum generation limit without repetition. Longer outputs contain richer information and are more likely to overlap with reference summaries, leading to higher ROUGE scores. For GPT-4o, we find that its already shorter outputs often include "commentary" on the novel's plot; examples are shown in Appendix D.2. Such content extends the plot rather than recounting it and rarely overlaps with the reference summary, reducing the valuable information in the output and lowering ROUGE scores. Other models rarely exhibit this and focus on generating plot-related content, regardless of output length.

Table 4: Detailed scores of models with large prompt-induced MSE, under Prompt-IE (P-IE) and Prompt-IB (P-IB).

Model | Ctx. Len. | Prompt | L | XL | 2XL | 3XL
Yi-1.5-34B-32K | 32k | P-IE | 14.4 | 12.7 | 10.8 | 0.1
Yi-1.5-34B-32K | 32k | P-IB | 8.8 | 8.2 | 8.3 | 0.1
Yi-1.5-34B-Chat-16K | 16k | P-IE | 15.4 | 14.2 | 12.1 | 0.0
Yi-1.5-34B-Chat-16K | 16k | P-IB | 12.1 | 10.4 | 9.7 | 0.0
Llama3.1-8B | 128k | P-IE | 6.1 | 6.6 | 7.0 | 2.2
Llama3.1-8B | 128k | P-IB | 9.4 | 9.4 | 9.7 | 3.9
Llama3.1-70B | 128k | P-IE | 11.8 | 7.5 | 5.2 | 7.0
Llama3.1-70B | 128k | P-IB | 12.5 | 10.2 | 8.6 | 7.9
Qwen1.5-7B | 32k | P-IE | 16.5 | 14.5 | 7.0 | 2.5
Qwen1.5-7B | 32k | P-IB | 7.7 | 7.2 | 4.4 | 2.6
Qwen2-72B | 128k | P-IE | 12.1 | 10.7 | 10.3 | 9.5
Qwen2-72B | 128k | P-IB | 17.5 | 15.7 | 13.1 | 11.3
Qwen2-72B-Instruct | 32k | P-IE | 12.4 | 10.5 | 10.6 | 10.6
Qwen2-72B-Instruct | 32k | P-IB | 18.5 | 15.8 | 12.3 | 7.8

Prompt-IB vs Prompt-IE

Several open-source models exhibit large MSE values between the scores of the two prompts; detailed results for these models are shown in Table 4. Yi-1.5-34B and Qwen1.5-7B have relatively short context lengths, yet both performed much better with Prompt-IE. Checking their outputs, we find that because the "summary instruction" in Prompt-IB is placed before the novel excerpt, they often failed to follow the instruction after processing such a long context: most outputs copied the excerpt or degenerated into meaningless repetition. Using Prompt-IE within the default context length produces many high-quality, clean summaries (see Appendix D.3). This advantage persisted when extrapolating to 2XL but disappeared completely on 3XL. Yi-1.5-34B-Chat-16K, a further fine-tuned version, outperformed the Base model when using Prompt-IB, possibly because its fine-tuning data contained similar summarization tasks. However, fine-tuning somewhat damaged the Base model's continuation ability under Prompt-IE, even though the ROUGE score still improved. We find that the summaries generated by Yi-1.5-34B-32K tend to be more concise and clean, whereas the Chat version generates more content that is sometimes disorganized, prone to repetition, and unable to stop properly. The same observation holds for models with lower MSE, such as InternLM2.5-20B and its Chat version; examples are shown in Appendix D.4.

Qwen2-72B has higher overall output quality. Unlike Yi-1.5 and Qwen1.5, when using Prompt-IE it was more likely to copy content from the excerpt rather than generate a summary. Notably, when using Prompt-IB and extrapolating on 3XL, Qwen2-72B-Instruct gradually forgot the instruction and randomly output other content related to the excerpt, so its ROUGE score was overtaken by Prompt-IE (Appendix D.5). The Llama3.1 Base models have relatively poor overall output quality; the gap stems from poor continuation ability with Prompt-IE, with some outputs merely repeating the last sentence of the prompt, usually the "summary instruction" (Appendix D.6). As the context length increases, such meaningless repetitions occur more frequently. With Prompt-IB this did not happen at all; however, since the model is not instruction-tuned, it generates excerpt-related but usually disorganized content. The repetition is also partly caused by the greedy sampling strategy (Meister et al., 2023; Xu et al., 2022).

Rise of Smaller Long-Context LLMs

Among open-source models, aside from GLM4-9B-Chat-1M leading in performance, ChatGLM3-6B-128K has the smallest parameter size yet outperformed larger LLMs on 2XL and 3XL. For InternLM2.5, the 7B-Chat-1M achieved 94% of 20B-Chat's average performance, slightly outperforming the 20B Base version. Similar observations hold for Llama3.1 and Qwen: smaller Instruct or Chat models can match the performance of larger Base models. We summarize the main points: (1) A stable and sufficiently long context length matters most for long-context summarization. (2) Models with instruction fine-tuning may better recognize and remember the instruction after reading a long novel context.

Although larger LLMs are better at reasoning, they hardly apply it to novel plots in a long context. The main challenge is global memory over long text, i.e., producing more of the key plots. The results of GPT-4o and GPT-4o-mini also support this view. Even on L and 2XL, there is no large gap between high-scoring models. Although CNNSum is not a gold standard, it reflects this point. Training small long-context LLMs remains significantly less costly and challenging, making them the most cost-effective choice for long-context summarization. This conclusion should also apply to other content-linear summarization settings such as papers, meetings, and reports.

Table 5: Results of fine-tuned models on CLongEval-LStSum and CNNSum, under Prompt-IE (P-IE) and Prompt-IB (P-IB).

Base Model | Prompt | CLongEval Small | CLongEval Medium | CLongEval Large | CNNSum L | CNNSum XL | CNNSum 2XL | CNNSum 3XL
Qwen1.5-7B | P-IE | 17.2 | 15.2 | 10.0 | 18.4 | 16.1 | 11.4 | 3.1
Qwen1.5-7B | P-IB | 17.4 | 14.8 | 11.1 | 18.1 | 15.7 | 12.1 | 2.9
Qwen1.5-7B-Chat | P-IE | 16.7 | 14.7 | 7.4 | 17.7 | 15.6 | 8.5 | 2.6
Qwen1.5-7B-Chat | P-IB | 16.8 | 14.2 | 8.8 | 17.2 | 15.0 | 9.2 | 2.6
Yi-6B | P-IE | 17.2 | 14.8 | 11.9 | 16.8 | 15.6 | 12.4 | 9.0
Yi-6B | P-IB | 17.0 | 15.0 | 12.0 | 17.1 | 15.5 | 12.5 | 8.3
Yi-6B-Chat | P-IE | 15.5 | 14.6 | 12.9 | 16.6 | 15.5 | 12.5 | 6.0
Yi-6B-Chat | P-IB | 15.6 | 14.5 | 12.4 | 16.1 | 15.5 | 12.3 | 6.0

[Figure 1: Performance of fine-tuned Base and Chat models on CNNSum.]

4.5 Ablation Study for Fine-Tuning

We have analyzed the factors affecting long-context summarization in general-purpose LLMs. However, how to configure these factors when fine-tuning LLMs specifically for summarization requires further exploration. Using the reconstructed prompts (Appendix E.2), we fine-tuned Yi-6B and Qwen1.5-7B on concatenated datasets with an average length of 16k. Results are shown in Table 5.

Extrapolation Potential

Compared to the results in Table 3, Yi-6B's performance on L and XL improved by 2 to 3 times, while Qwen1.5 improved by approximately 50%. We attribute this to two points: (1) During pre-training, the RoPE base was scaled up via ABF (Xiong et al., 2023; Liu et al., 2023), giving the model greater long-context potential, and fine-tuning on long-text data further activated this ability. (2) The model's ability to follow summarization instructions was enhanced, reducing meaningless outputs. For Yi-6B, with its short default context length, both points contributed to the improvement; in contrast, Qwen1.5, with limited long-context ability, benefited more from point (2). With the largest RoPE base among the open-source baselines in Section 4.1, Yi-6B shows more extrapolation potential than Qwen1.5 on 2XL and 3XL.

Chat vs Base

Within each model series, the best results on each subset were almost always achieved by the Base models. Figure 1 illustrates the performance on CNNSum, indicating that Base models have better extrapolation potential and are better suited for fine-tuning summarization LLMs.

Prompt-IB vs Prompt-IE

In Table 4, Qwen1.5-7B shows the most significant gaps between prompts. After fine-tuning, these differences nearly disappeared, and evaluations on CLongEval-LStSum further confirm this. Figure 1 intuitively shows that a model fine-tuned with a specific prompt exhibits consistent long-text summarization performance regardless of the prompt template. We also carefully checked and compared the model outputs but did not observe the quality differences discussed in Section 4.4.

[Figure 2: Output quality versus sample length on CLongEval Large. Figure 3: RoPE scaling results on CLongEval and CNNSum.]

4.6 Reliable Evaluation of Extrapolation

Models' scores on CLongEval Large (50k~100k) and CNNSum 2XL (64k) are higher than on 3XL (128k) in Table 5. However, we find that output quality has already declined noticeably on 2XL: many samples generate some valid content but then degenerate into meaningless repetition, and this worsens on 3XL. In CLongEval Large, the shorter samples have much better outputs, as shown in Figure 2, which makes the final scores misleading. To further explore long-context summarization extrapolation and the importance of reliable evaluation, we fine-tuned Yi-6B and Qwen1.5-7B with RoPE scaling methods; detailed results are in Appendix Table 8.

The misleading nature of CLongEval evaluations increases further. As shown in Figure 3 (a), CLongEval suggests that the original RoPE and PI (Chen et al., 2023b) with different scales s perform comparably, both obtaining high scores; PI shows a slight overall decrease with large s, a known limitation. In contrast, CNNSum shows that the original RoPE's extrapolation ability is much weaker, with a 66% decrease on 3XL. For PI, larger s performs worse on L and XL, but as the context grows, smaller s (e.g., 2 and 4) decreases faster due to insufficient interpolation. While PI theoretically cannot extrapolate without training, models with strong extrapolation ability seem able to adapt to variations in distance resolution. In Figure 3 (b), the CLongEval results for NTK (u/LocalLLaMA, 2023) with different s values are almost identical, whereas CNNSum shows much clearer differences. Moreover, with the same s, NTK interpolates more moderately than PI, leading to a 22% decrease in extrapolation performance on 3XL (e.g., s = 2).

Table 6: Yi-6B fine-tuned on 14k~18k (avg. 16k) concatenated data with different RoPE scaling settings, evaluated on CNNSum.

Training Len. (Avg.) | Method | α (Slow) | β (Fast) | L | XL | 2XL | 3XL
14k~18k (16k) | YaRN (s=16) | 1 | 32 | 17.7 | 16.0 | 14.0 | 12.3
14k~18k (16k) | YaRN (s=16) | 1 | 8 | 18.1 | 16.5 | 14.3 | 12.4
14k~18k (16k) | RoPE | - | - | 16.8 | 15.6 | 12.4 | 9.0
14k~18k (16k) | PI (s=16) | - | - | 16.1 | 13.9 | 12.0 | 4.1
14k~18k (16k) | NTK (s=16) | - | - | 16.2 | 14.7 | 11.3 | 5.1

Yi-6B shows some interesting differences. The original RoPE with a larger base extrapolates better, especially on 3XL, but appears highly sensitive to variations in distance resolution. For PI with s = 2, 4, CLongEval shows a faster decrease in Figure 4 (a) and (b), and CNNSum provides a clearer explanation. In Figure 4 (a), the model adapts well to s = 2 and can extrapolate to 2XL (64k), but for s = 4, 8, 16 it extrapolates only to 16k, 32k, and 64k respectively, exactly matching the interpolated positions of the default 4k context. The better performance with NTK in Figure 4 (c) and (d) indicates that Yi-6B relies more on the high-frequency information of RoPE for extrapolation. Although fine-tuning on longer data can mitigate this, the model still cannot extrapolate to 3XL. Training on longer data without interpolation appears to be a stable choice for models with a large RoPE base: the large base already greatly reduces RoPE's rotation speed. With default YaRN settings, more than 50% of dimensions (37/64) use PI, while only 20% (13/64) are left uninterpolated, yet YaRN still significantly improves extrapolation performance in Table 6. When we adjusted the settings to raise the uninterpolated proportion to 30%, the gains were minor, indicating that the high-frequency information is far more valuable.

We also made observations during training that differ from Peng et al. (2023). With the same scale s, PI led to a higher initial loss and slower convergence than NTK and the original RoPE; as s increases, the loss rises significantly, whereas NTK remains nearly unchanged. Additionally, we conducted an initial exploration of sparse attention, and the results indicate that it may harm the model's extrapolation performance. Detailed results are provided in Appendix C.

[Figure 4 (a)–(d): Yi-6B with PI and NTK scaling on CLongEval and CNNSum.]

5 Conclusion

We introduce CNNSum, a Chinese long-context novel summarization benchmark. We conduct rigorous multi-scale sampling on a newly collected corpus, creating four subsets from L (16k) to 3XL (128k), with 695 samples in total. The annotations are mainly human-driven, with LLMs assisting for efficiency. Compared to existing benchmarks, we resolve most shortcomings, with a new corpus and high-quality human annotations that largely prevent leakage. We conduct comprehensive experiments on CNNSum, analyzing the factors that influence long-context summarization: we examine why GPT-4o underperforms and how prompt patterns affect output quality, and we find that stable long-context memory ability is more important than parameter size. We also find that using only short-context summary data activates LLMs' extrapolation potential for long-context summarization and significantly improves output quality. We confirm that CNNSum provides more reliable and insightful evaluation results, and our observations on basic RoPE scaling methods should generalize to advanced ones. We hope CNNSum contributes valuable evaluation insights for future research.

6 Limitations

Because advanced automatic evaluations (Chang et al., 2023) require powerful LLMs such as GPT-4o, we only use ROUGE and additionally conduct manual inspections. The effects of complex prompts, such as the one in Appendix E.3, need further exploration. Although concatenated long data achieved generally favorable results, its potential drawbacks have not been investigated, such as interference between the short samples, which may affect the convergence upper bound. Some observations based on basic RoPE scaling methods should theoretically generalize to more advanced ones, but many possibilities remain unexplored. We leave these investigations for future work.

7 Acknowledgment

We thank the annotators from iQIYI for providing high-quality manual annotations and corrections. We also appreciate the high-quality basic summary data for fine-tuning provided by iQIYI, as well as the corpus used to build CNNSum.

References

  • AI (2023)Mistral AI. 2023.Mistralai announcesministrax.Accessed: 2024-11-10.
  • An etal. (2023)Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang,Lingpeng Kong, and Xipeng Qiu. 2023.L-eval: Instituting standardized evaluation for long context languagemodels.arXiv preprint arXiv:2307.11088.
  • Ao etal. (2024)Sun Ao, Weilin Zhao, XuHan, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun,Shengnan Wang, and Teng Su. 2024.Burstattention: An efficient distributed attention framework forextremely long sequences.arXiv preprint arXiv:2403.09347.
  • Bai etal. (2023a)Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan,Wenbin Ge, YuHan, Fei Huang, etal. 2023a.Qwen technical report.arXiv preprint arXiv:2309.16609.
  • Bai etal. (2024)Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, JiQi, Lei Hou, Jie Tang, YuxiaoDong, and Juanzi Li. 2024.Longalign: A recipe for long context alignment of large languagemodels.arXiv preprint arXiv:2401.18058.
  • Bai etal. (2023b)Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang,Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, etal. 2023b.Longbench: A bilingual, multitask benchmark for long contextunderstanding.arXiv preprint arXiv:2308.14508.
  • Bhandari etal. (2020)Manik Bhandari, Pranav Gour, Atabak Ashfaq, Pengfei Liu, and Graham Neubig.2020.Re-evaluating evaluation in text summarization.arXiv preprint arXiv:2010.07100.
  • Cai etal. (2024)Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen,Zehui Chen, Zhi Chen, Pei Chu, etal. 2024.Internlm2 technical report.arXiv preprint arXiv:2403.17297.
  • Chang etal. (2023)Yapei Chang, Kyle Lo, Tanya Goyal, and Mohit Iyyer. 2023.Booookscore: A systematic exploration of book-length summarization inthe era of llms.arXiv preprint arXiv:2310.00785.
  • Chen etal. (2023a)Guanzheng Chen, Xin Li, Zaiqiao Meng, Shangsong Liang, and Lidong Bing.2023a.Clex: Continuous length extrapolation for large language models.arXiv preprint arXiv:2310.16450.
  • Chen etal. (2023b)Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian.2023b.Extending context window of large language models via positionalinterpolation.arXiv preprint arXiv:2306.15595.
  • Chen etal. (2023c)Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, andJiaya Jia. 2023c.Longlora: Efficient fine-tuning of long-context large languagemodels.arXiv preprint arXiv:2309.12307.
  • Cohere For AI (2024)Cohere For AI. 2024.c4ai-command-r-08-2024.
  • Dao (2024)Tri Dao. 2024.FlashAttention-2: Faster attention with better parallelism and workpartitioning.In International Conference on Learning Representations(ICLR).
  • Ding etal. (2023)Jiayu Ding, Shuming Ma, LiDong, Xingxing Zhang, Shaohan Huang, Wenhui Wang,Nanning Zheng, and Furu Wei. 2023.Longnet: Scaling transformers to 1,000,000,000 tokens.arXiv preprint arXiv:2307.02486.
  • Ding etal. (2024)Yiran Ding, LiLyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, JiahangXu, Fan Yang, and Mao Yang. 2024.Longrope: Extending llm context window beyond 2 million tokens.arXiv preprint arXiv:2402.13753.
  • Du etal. (2021)Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, andJie Tang. 2021.Glm: General language model pretraining with autoregressive blankinfilling.arXiv preprint arXiv:2103.10360.
  • Dubey etal. (2024)Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, AhmadAl-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan,etal. 2024.The llama 3 herd of models.arXiv preprint arXiv:2407.21783.
  • Fabbri etal. (2019)AlexanderR Fabbri, Irene Li, Tianwei She, Suyi Li, and DragomirR Radev. 2019.Multi-news: A large-scale multi-document summarization dataset andabstractive hierarchical model.arXiv preprint arXiv:1906.01749.
  • Fu etal. (2024)Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim,and Hao Peng. 2024.Data engineering for scaling language models to 128k context.arXiv preprint arXiv:2402.10171.
  • GLM etal. (2024)Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, DaYin, Diego Rojas,Guanyu Feng, Hanlin Zhao, Hanyu Lai, etal. 2024.Chatglm: A family of large language models from glm-130b to glm-4 alltools.arXiv preprint arXiv:2406.12793.
  • Han etal. (2024)Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, YuChen, Heng Ji, and Sinong Wang.2024.Lm-infinite: Zero-shot extreme length generalization for largelanguage models.In Proceedings of the 2024 Conference of the North AmericanChapter of the Association for Computational Linguistics: Human LanguageTechnologies (Volume 1: Long Papers), pages 3991–4008.
  • Han et al. (2023) Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David P. Woodruff, and Amir Zandieh. 2023. HyperAttention: Long-context attention in near-linear time. arXiv preprint arXiv:2310.05869.
  • Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  • Huang et al. (2021) Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. 2021. Efficient attentions for long document summarization. arXiv preprint arXiv:2104.02112.
  • Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.
  • Jiang et al. (2024a) Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024a. Mixtral of experts. arXiv preprint arXiv:2401.04088.
  • Jiang et al. (2024b) Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, et al. 2024b. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. arXiv preprint arXiv:2407.02490.
  • Kim et al. (2024) Yekyung Kim, Yapei Chang, Marzena Karpinska, Aparna Garimella, Varun Manjunatha, Kyle Lo, Tanya Goyal, and Mohit Iyyer. 2024. FABLES: Evaluating faithfulness and content selection in book-length summarization. arXiv preprint arXiv:2404.01261.
  • Krishna et al. (2023) Kalpesh Krishna, Erin Bransom, Bailey Kuehl, Mohit Iyyer, Pradeep Dasigi, Arman Cohan, and Kyle Lo. 2023. LongEval: Guidelines for human evaluation of faithfulness in long-form summarization. arXiv preprint arXiv:2301.13298.
  • Kryściński et al. (2021) Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, and Dragomir Radev. 2021. BookSum: A collection of datasets for long-form narrative summarization. arXiv preprint arXiv:2105.08209.
  • Kwon et al. (2023) Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
  • Li et al. (2024) Yucheng Li, Yunhao Guo, Frank Guerin, and Chenghua Lin. 2024. An open-source data contamination report for large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 528–541.
  • Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
  • Liu and Abbeel (2023) Hao Liu and Pieter Abbeel. 2023. Blockwise parallel transformer for large context models. Advances in Neural Information Processing Systems.
  • Liu et al. (2024a) Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. 2024a. World model on million-length video and language with RingAttention. arXiv preprint arXiv:2402.08268.
  • Liu et al. (2024b) Hao Liu, Matei Zaharia, and Pieter Abbeel. 2024b. Ring attention with blockwise transformers for near-infinite context. In International Conference on Learning Representations.
  • Liu et al. (2024c) Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024c. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173.
  • Liu et al. (2023) Xiaoran Liu, Hang Yan, Shuo Zhang, Chenxin An, Xipeng Qiu, and Dahua Lin. 2023. Scaling laws of RoPE-based extrapolation. arXiv preprint arXiv:2310.05209.
  • Loshchilov (2017) Ilya Loshchilov. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • Meister et al. (2023) Clara Meister, Tiago Pimentel, Gian Wiher, and Ryan Cotterell. 2023. Locally typical sampling. Transactions of the Association for Computational Linguistics, 11:102–121.
  • Ni et al. (2024) Xuanfan Ni, Hengyi Cai, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, and Piji Li. 2024. XL²Bench: A benchmark for extremely long context understanding with long-range dependencies. arXiv preprint arXiv:2404.05446.
  • OpenAI (2024a) OpenAI. 2024a. GPT-4o mini: Advancing cost-efficient intelligence. Accessed: 2024-11-08.
  • OpenAI (2024b) OpenAI. 2024b. Hello GPT-4o. Accessed: 2024-11-08.
  • Pagnoni et al. (2021) Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. 2021. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. arXiv preprint arXiv:2104.13346.
  • Pawar et al. (2024) Saurav Pawar, S. M. Tonmoy, S. M. Zaman, Vinija Jain, Aman Chadha, and Amitava Das. 2024. The what, why, and how of context length extension techniques in large language models – a detailed survey. arXiv preprint arXiv:2401.07872.
  • Peng et al. (2023) Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2023. YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071.
  • Qiu et al. (2024) Zexuan Qiu, Jingjing Li, Shijue Huang, Xiaoqi Jiao, Wanjun Zhong, and Irwin King. 2024. CLongEval: A Chinese benchmark for evaluating long-context large language models. arXiv preprint arXiv:2403.03514.
  • Ren et al. (2021) Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551–564.
  • r/LocalLLaMA (2024) r/LocalLLaMA. 2024. Dynamically scaled RoPE further increases context length. Accessed: 2024-11-08.
  • Sclar et al. (2024) Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations.
  • Sharma et al. (2019) Eva Sharma, Chen Li, and Lu Wang. 2019. BIGPATENT: A large-scale dataset for abstractive and coherent summarization. arXiv preprint arXiv:1906.03741.
  • Song et al. (2024) Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, and Saab Mansour. 2024. FineSurE: Fine-grained summarization evaluation using LLMs. arXiv preprint arXiv:2407.00908.
  • Su et al. (2024) Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063.
  • Tao et al. (2024) Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, and Ngai Wong. 2024. Scaling laws with vocabulary: Larger models deserve larger vocabularies. arXiv preprint arXiv:2407.13623.
  • Team et al. (2024a) Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024a. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
  • Team et al. (2024b) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024b. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.
  • Team (2024) Qwen Team. 2024. Qwen2.5: A party of foundation models.
  • Tian et al. (2024) Junfeng Tian, Da Zheng, Yang Cheng, Rui Wang, Colin Zhang, and Debing Zhang. 2024. Untie the knots: An efficient data augmentation strategy for long-context pre-training in language models. arXiv preprint arXiv:2409.04774.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • u/LocalLLaMA (2023) u/LocalLLaMA. 2023. NTK-aware scaled RoPE allows LLaMA models to have context lengths exceeding 4k tokens. https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/. Accessed: 2024-11-23.
  • Wang et al. (2024a) Suyuchen Wang, Ivan Kobyzev, Peng Lu, Mehdi Rezagholizadeh, and Bang Liu. 2024a. Resonance RoPE: Improving context length generalization of large language models. arXiv preprint arXiv:2403.00071.
  • Wang et al. (2024b) Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, and Armaghan Eshaghi. 2024b. Beyond the limits: A survey of techniques to extend the context length in large language models. arXiv preprint arXiv:2402.02244.
  • Wu et al. (2023) Han Wu, Mingjie Zhan, Haochen Tan, Zhaohui Hou, Ding Liang, and Linqi Song. 2023. VCSum: A versatile Chinese meeting summarization dataset. arXiv preprint arXiv:2305.05280.
  • Wu et al. (2021) Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. 2021. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.
  • Xiao et al. (2024a) Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, and Maosong Sun. 2024a. InfLLM: Training-free long-context extrapolation for LLMs with an efficient context memory. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
  • Xiao et al. (2024b) Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. 2024b. DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads. arXiv preprint arXiv:2410.10819.
  • Xiao et al. (2023) Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.
  • Xiong et al. (2023) Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. 2023. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039.
  • Xu et al. (2022) Jin Xu, Xiaojiang Liu, Jianhao Yan, Deng Cai, Huayang Li, and Jian Li. 2022. Learning to break the loop: Analyzing and mitigating repetitions for neural text generation. Advances in Neural Information Processing Systems, 35:3082–3095.
  • Xu et al. (2024) Ruijie Xu, Zengzhi Wang, Run-Ze Fan, and Pengfei Liu. 2024. Benchmarking benchmark leakage in large language models. arXiv preprint arXiv:2404.18824.
  • Yang et al. (2023) Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, et al. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305.
  • Yang et al. (2024) An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671.
  • Young et al. (2024) Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. 2024. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652.
  • Zeng et al. (2022) Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.
  • Zhang et al. (2024a) Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, and Zhicheng Dou. 2024a. Long context compression with activation beacon. arXiv preprint arXiv:2401.03462.
  • Zhang et al. (2024b) Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, et al. 2024b. ∞Bench: Extending long context evaluation beyond 100k tokens. arXiv preprint arXiv:2402.13718.

Appendix A Baseline Results of Two Prompts

Model (Ctx. Len.) — CNNSum scores on L (16k) / XL (32k) / 2XL (64k) / 3XL (128k). Each model has two score groups, one for each baseline prompt (see Appendix E.1), separated by "|".

GPT-4o-2024-08-06 (128k): 15.3 / 14.2 / 12.4 / -  |  15.7 / 14.2 / 12.5 / -
GPT-4o-mini-2024-07-18 (128k): 16.0 / 13.8 / 13.1 / -  |  15.3 / 13.7 / 12.5 / -
Gemini-1.5-pro (2m): 19.1 / 17.9 / 16.7 / 14.6  |  19.5 / 18.4 / 16.9 / 14.6
Doubao-pro-128k (128k): 20.2 / 17.5 / 15.2 / 12.8  |  19.1 / 18.1 / 15.7 / 13.6
Qwen-plus-2024-09-19 (128k): 20.3 / 18.5 / 16.2 / 14.5  |  20.6 / 18.6 / 16.7 / 15.0
Moonshot-v1-128k (128k): 21.8 / 19.7 / 17.6 / 14.7  |  22.9 / 20.8 / 18.4 / 15.6
Yi-6B (4k): 9.3 / 3.5 / 2.4 / 2.2  |  8.3 / 5.3 / 2.9 / 2.4
Yi-6B-Chat (4k): 13.5 / 7.5 / 2.0 / 2.0  |  13.4 / 6.2 / 2.2 / 2.2
Yi-6B-200K (200k): 11.0 / 10.8 / 9.6 / 4.6  |  8.7 / 7.9 / 8.0 / 3.3
Yi-34B-200K (200k): 12.0 / 11.4 / 10.8 / 10.3  |  12.1 / 11.2 / 10.8 / 9.6
Yi-1.5-34B-32K (32k): 14.4 / 12.7 / 10.8 / 0.1  |  8.8 / 8.2 / 8.3 / 0.1
Yi-1.5-34B-Chat-16K (16k): 15.4 / 14.2 / 12.1 / 0.0  |  12.1 / 10.4 / 9.7 / 0.0
InternLM2.5-7B-Chat-1M (1m): 17.8 / 17.3 / 15.0 / 13.8  |  18.2 / 16.9 / 14.3 / 12.1
InternLM2.5-20B (256k): 18.0 / 16.3 / 14.6 / 12.9  |  18.3 / 16.3 / 14.1 / 12.5
InternLM2.5-20B-Chat (32k): 18.8 / 17.5 / 16.3 / 14.2  |  19.0 / 17.1 / 16.1 / 14.2
ChatGLM3-6B-128k (128k): 17.3 / 16.2 / 14.9 / 13.9  |  17.1 / 16.0 / 14.6 / 13.5
GLM4-9B-Chat-1M (1m): 19.2 / 18.2 / 16.6 / 15.6  |  18.8 / 17.6 / 16.4 / 15.2
Llama3.1-8B (128k): 6.1 / 6.6 / 7.0 / 2.2  |  9.4 / 9.4 / 9.7 / 3.9
Llama3.1-8B-Instruct (128k): 16.1 / 15.0 / 12.8 / 10.6  |  15.0 / 13.5 / 12.7 / 9.2
Llama3.1-70B (128k): 11.8 / 7.5 / 5.2 / 7.0  |  12.5 / 10.2 / 8.6 / 7.9
Llama3.1-70B-Instruct (128k): 18.6 / 15.8 / 13.9 / 10.5  |  17.3 / 15.6 / 13.7 / 10.7
LWM-Text-1M (1m): 3.1 / 2.8 / 2.2 / 1.1  |  3.5 / 3.2 / 2.8 / 1.0
Ministral-8B-Instruct-2410 (128k): 16.1 / 14.1 / 11.6 / 3.5  |  16.3 / 14.0 / 11.0 / 3.5
Qwen1.5-7B (32k): 16.5 / 14.5 / 7.0 / 2.5  |  7.7 / 7.2 / 4.4 / 2.6
Qwen1.5-7B-Chat (32k): 14.8 / 14.1 / 10.3 / 2.7  |  15.4 / 13.1 / 10.3 / 2.7
Qwen1.5-32B-Chat (32k): 15.8 / 14.2 / 4.5 / 4.2  |  15.6 / 14.1 / 5.5 / 4.2
Qwen2-7B (128k): 8.8 / 8.2 / 8.6 / 7.1  |  10.8 / 8.5 / 7.3 / 6.6
Qwen2-7B-Instruct (32k): 15.1 / 13.7 / 13.3 / 11.7  |  15.3 / 12.9 / 11.2 / 10.8
Qwen2-72B (128k): 12.1 / 10.7 / 10.3 / 9.5  |  17.5 / 15.7 / 13.1 / 11.3
Qwen2-72B-Instruct (32k): 12.4 / 10.5 / 10.6 / 10.6  |  18.5 / 15.8 / 12.3 / 7.8
Qwen2.5-7B-Instruct (32k): 17.0 / 15.2 / 12.7 / 10.1  |  14.4 / 12.8 / 12.7 / 9.1
Qwen2.5-72B-Instruct (32k): 19.7 / 18.3 / 13.8 / 12.5  |  19.5 / 16.9 / 13.3 / 14.2

Appendix B Detailed Results of Extrapolation

Base Model, Training Len. (Ave.) | Pos. Emb. | CLongEval-LStSum: Small / Medium / Large | CNNSum: L (16k) / XL (32k) / 2XL (64k) / 3XL (128k)

Qwen1.5-7B, 14k~18k (16k)
  RoPE:        17.2 / 15.2 / 10.0  |  18.4 / 16.1 / 11.4 / 3.1
  PI (s=2):    16.9 / 14.9 / 15.1  |  18.3 / 16.6 / 14.6 / 7.4
  PI (s=4):    15.7 / 14.5 / 14.1  |  17.3 / 15.8 / 14.3 / 12.7
  PI (s=8):    16.1 / 14.1 / 13.6  |  17.4 / 15.6 / 14.1 / 12.8
  PI (s=16):   15.2 / 13.1 / 12.3  |  16.7 / 14.9 / 13.3 / 12.1
  NTK (s=2):   17.1 / 14.5 / 14.8  |  18.1 / 16.4 / 14.2 / 2.9
  NTK (s=8):   16.9 / 14.3 / 13.5  |  18.4 / 16.2 / 14.5 / 13.3
  NTK (s=16):  16.9 / 14.0 / 13.1  |  17.8 / 15.8 / 14.5 / 12.6

Qwen1.5-7B, 30k~34k (32k)
  RoPE:        16.9 / 15.9 / 13.8  |  18.5 / 16.5 / 13.4 / 4.1
  PI (s=2):    17.0 / 16.0 / 15.3  |  18.2 / 16.3 / 14.6 / 12.3
  PI (s=4):    16.7 / 15.2 / 15.0  |  18.5 / 16.5 / 15.1 / 13.0
  PI (s=8):    16.1 / 14.9 / 14.1  |  18.5 / 16.6 / 14.9 / 14.0
  PI (s=16):   15.1 / 13.7 / 13.2  |  17.5 / 16.3 / 14.9 / 13.8
  NTK (s=2):   17.7 / 15.9 / 15.4  |  18.5 / 16.6 / 14.4 / 9.6
  NTK (s=8):   17.9 / 15.9 / 15.9  |  18.6 / 16.9 / 15.7 / 13.7
  NTK (s=16):  17.7 / 16.2 / 15.6  |  18.5 / 17.0 / 15.2 / 14.5

Yi-6B, 14k~18k (16k)
  RoPE:        17.2 / 14.8 / 11.9  |  16.8 / 15.6 / 12.4 / 9.0
  PI (s=2):    16.5 / 14.3 / 12.5  |  17.2 / 15.0 / 12.6 / 5.4
  PI (s=4):    15.6 / 11.2 / 6.5   |  17.0 / 9.1 / 6.6 / 5.5
  PI (s=8):    15.0 / 13.7 / 6.5   |  16.2 / 14.4 / 5.7 / 4.1
  PI (s=16):   14.5 / 12.8 / 11.1  |  16.1 / 13.9 / 12.0 / 4.1
  NTK (s=2):   16.7 / 14.3 / 11.9  |  16.7 / 15.3 / 13.0 / 5.3
  NTK (s=8):   16.4 / 14.8 / 11.7  |  17.5 / 15.2 / 12.7 / 5.1
  NTK (s=16):  15.9 / 13.5 / 10.6  |  16.2 / 14.7 / 11.3 / 5.1

Yi-6B, 30k~34k (32k)
  RoPE:        16.9 / 15.2 / 13.6  |  16.8 / 16.2 / 13.6 / 11.4
  PI (s=2):    16.0 / 14.4 / 13.6  |  17.2 / 15.4 / 14.1 / 5.7
  PI (s=4):    14.8 / 13.6 / 12.0  |  16.6 / 14.7 / 12.4 / 5.5
  PI (s=8):    14.6 / 13.0 / 8.3   |  16.0 / 15.0 / 8.1 / 3.9
  PI (s=16):   15.0 / 13.3 / 12.7  |  15.9 / 14.2 / 13.0 / 6.2
  NTK (s=2):   17.0 / 14.6 / 13.5  |  16.6 / 15.9 / 14.2 / 6.2
  NTK (s=8):   16.2 / 14.5 / 12.8  |  17.5 / 15.3 / 13.8 / 6.3
  NTK (s=16):  15.9 / 14.0 / 12.4  |  16.6 / 15.9 / 12.6 / 6.1
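
For readers unfamiliar with the scale factor s in the table above, the following is a minimal, illustrative sketch of how Position Interpolation (PI) and NTK-aware scaling modify vanilla RoPE. It assumes a head dimension of 128 and the commonly used base-scaling rule b' = b * s^(d/(d-2)); it is not taken from our training code, and the exact variants used in practice may differ.

import torch

def rope_inv_freq(dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies: theta_i = base^(-2i/dim)
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

def pi_positions(positions: torch.Tensor, s: float) -> torch.Tensor:
    # Position Interpolation (PI): compress position indices by s so that an
    # s-times longer sequence falls back into the original position range.
    return positions / s

def ntk_scaled_base(dim: int, base: float = 10000.0, s: float = 8.0) -> float:
    # NTK-aware scaling: enlarge the RoPE base instead of compressing positions,
    # leaving high-frequency dimensions nearly unchanged while interpolating
    # the low-frequency ones.
    return base * s ** (dim / (dim - 2))

# Rotation angles at each position under the two schemes (head_dim = 128 assumed).
dim = 128
pos = torch.arange(65536).float()  # positions up to 64k
angles_pi = torch.outer(pi_positions(pos, s=8.0), rope_inv_freq(dim))
angles_ntk = torch.outer(pos, rope_inv_freq(dim, base=ntk_scaled_base(dim, s=8.0)))

The key contrast is that PI rescales every frequency uniformly by squeezing positions, whereas NTK-aware scaling redistributes the interpolation across dimensions by inflating the base, which is why the two behave differently when combined with base-scaled models in the table.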

Appendix C Results of Training with Sparse Attention

Training Len. (Ave) | Pos. Emb. | S²-Attn | CLongEval-LStSum: Small / Medium / Large | CNNSum: L (16k) / XL (32k) / 2XL (64k) / 3XL (128k)

14k~18k (16k)
  PI (s=16), S²-Attn ✓:  14.4 / 12.1 / 9.8   |  15.4 / 13.4 / 11.5 / 0.5
  PI (s=16), S²-Attn ×:  14.5 / 12.8 / 11.1  |  16.1 / 13.9 / 12.0 / 4.1
  RoPE, S²-Attn ✓:       16.7 / 14.3 / 10.2  |  16.5 / 15.3 / 11.0 / 7.6
  RoPE, S²-Attn ×:       17.2 / 14.8 / 11.9  |  16.8 / 15.6 / 12.4 / 9.0

30k~34k (32k)
  PI (s=16), S²-Attn ✓:  14.6 / 12.0 / 10.1  |  15.9 / 13.6 / 12.6 / 0.6
  PI (s=16), S²-Attn ×:  15.0 / 13.3 / 12.7  |  15.9 / 14.2 / 13.0 / 6.2
  RoPE, S²-Attn ✓:       16.8 / 14.5 / 10.0  |  16.5 / 15.3 / 11.2 / 8.0
  RoPE, S²-Attn ×:       16.9 / 15.2 / 13.6  |  16.8 / 16.2 / 13.6 / 11.4
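
As a rough illustration of the shifted sparse attention (S²-Attn) setting toggled in the table, the sketch below shows only the core grouping-and-shifting step: the sequence is split into fixed-size groups that attend locally, and half of the heads are shifted by half a group so information can flow across group boundaries. The function name, shapes, and group size are ours for illustration; this is not the implementation used in our experiments, and a full implementation would also roll the shifted heads back after attention.

import torch

def s2_attn_group(x: torch.Tensor, group_size: int) -> torch.Tensor:
    # x: (batch, seq_len, num_heads, head_dim); seq_len must be divisible by group_size.
    b, n, h, d = x.shape
    x = x.clone()
    # Shift the second half of the heads by half a group so that their local
    # attention windows straddle the group boundaries of the first half.
    x[:, :, h // 2:] = torch.roll(x[:, :, h // 2:], shifts=-group_size // 2, dims=1)
    # Fold each group into the batch dimension; attention is then computed only
    # within each group, giving cost linear in sequence length for a fixed group size.
    return x.reshape(b * (n // group_size), group_size, h, d)

# Example: queries for a 32k-token sequence split into 2k-token groups.
q = torch.randn(1, 32768, 32, 128)
q_grouped = s2_attn_group(q, group_size=2048)  # shape: (16, 2048, 32, 128)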

Appendix D Examples of Cases

D.1 Cases in Web Serialized Novels Corpus

D.2 Commercial Models’ Outputs Comparison

D.3 Outputs of Base Models with Prompt-IE vs Prompt-IB

D.4 Outputs of Base vs Chat Models with Prompt-IE

D.5 Output of Qwen2-72B-Instruct with Prompt-IE and Prompt-IB

D.6 Bad Cases of Llama3.1 with Prompt-IE

Appendix E All Used Prompts

E.1 Baseline Prompts

E.2 Training Prompts

E.3 Annotation Prompt
