Changyu Lee

Novel AutoGen: A Memory-Augmented Multi-LLM Agent for Long-Form Web Novel Generation

Tags: Nestyle Intelligence
Year: 2023

Summary

Novel AutoGen is an AI co-writing agent that supports human writers in planning, scene design, and writing long-form web novels. It combines LLMs, RAG, and a custom agent memory system that mimics human cognitive processes.
Starting from an initial sequential generation pipeline (v1), the system evolved through v2, which introduced long-term memory and a forgetting mechanism, into v3, which has a modular architecture and multi-LLM integration.
This project was conducted while I was working at Nestyle Intelligence, and I was responsible for the entire development process from the initial PoC to the full system implementation.

Problem Definition

The problem: AI cannot sustain long-form context when writing novels
The biggest challenge in generating long-form web novels is not simply writing natural sentences, but consistently maintaining character personalities, relationship changes, worldbuilding settings, foreshadowing, and emotional arcs across dozens of episodes. In my internal experiments, even when using high-performance LLMs available at the time, problems such as character collapse, missing relationship settings, forgetting past events, and premature exposure of plot spoilers repeatedly occurred after around three episodes. Therefore, the core problem of this project was not to design “a model that generates long stories,” but to design “an agent system that can remember, control, and collaboratively write long stories.”
In these experiments, even GPT-4o, the strongest model available at the time, could write only about three episodes before these problems appeared.

My Role

1. AI Agent Architecture and Multi-LLM Pipeline Design (Agent Architect)

Designing a human-cognition-based creative pipeline: I divided the novel creation process into three stages, Planning (Plot) -> Scene Design (Beat) -> Writing (Story), and built them as parallel asynchronous chains using LangChain’s LCEL (LangChain Expression Language).
Purpose-driven Multi-LLM routing: I designed and implemented model orchestration logic so that GPT-4 would handle planning stages that require advanced reasoning and structural design, while HyperClovaX/Claude would handle the writing stage that requires the distinctive style and emotional expression of web novels.
Introducing an Information Hiding structure: Injecting the entire plot into the LLM at once caused spoilers to leak, so I designed a pacing architecture that divides the story into Beat-level units and injects only limited short-term context memory.

2. Agent Memory System and Hybrid RAG Construction (Memory System Engineering)

Implementing a Forgetting Curve algorithm: Using a Vector DB (Chroma DB), I developed a custom long-term/short-term memory integration system in which each memory fades as turns (time) pass and is reinforced whenever it is retrieved.
Optimizing SelfQueryRetriever based on four-dimensional metadata: To address hallucinations caused by plain k-NN similarity search, I built a hybrid RAG system in which the LLM automatically converts natural-language queries into metadata filters, similar to SQL WHERE clauses. Character relationships and context are then retrieved by exact metadata match rather than by approximate similarity alone.
Building an autonomous ecosystem for dynamic character generation: When a character not stored in the DB appears, the system detects the exception, instantly creates a new character profile that fits the existing story context, and stores it in the DB through automated logic.

3. Backend API and Interactive UI Development (API & Web Architecture)

FastAPI-based backend for session and concurrency control: To prevent agent contexts from being mixed across multiple users, I built an API server with independent memory sandboxes for each user using SessionMiddleware.
Cache-tree-based human-AI collaborative UI (AIZAC): Using Gradio, I completed an interactive co-writing web application where users can review and edit AI-generated Beats in real time.

4. LLM Response Control and Evaluation Logic Design (Output Control & Evaluation)

Strict output control based on Pydantic: I introduced PydanticOutputParser to force unstable LLM text outputs into strict JSON schemas that can be parsed internally by the system, including fields such as location, time, characters, and emotions, thereby preventing system errors.
AI model fine-tuning (SFT/DPO): Using datasets built on the custom NovelBeat/Story schemas, I ran SFT/DPO experiments to make the model follow scene structures and web novel writing styles more reliably, and I reviewed directions for further improvement. The model achieved 94% accuracy.
Context window limitation and RAG evaluation: Through internal experiments, I found that GPT-4o could generate at most about three episodes. I also identified RAG hallucination as the limiting factor for long-form serialization beyond ten episodes, and I led the evaluation logic used to set improvement directions such as introducing Core Memory.

Experimentation and Development Process

Version 1: Establishing the Basic Pipeline

Experiment: I tested a fully top-down sequential generation pipeline that followed the flow of genre -> synopsis -> main characters -> plot -> episode -> scene -> beat.
Results and limitations: I attempted to maintain context using a single LangChain memory module, but as the text grew longer, the system encountered context window/token limit issues, resulting in character collapse (OOC: Out of Character) and inconsistencies in story settings.

Version 2: Introducing the Ebbinghaus Memory System (Vector DB) and Enhancing Beat Design

Experiment
Designed a custom memory structure that categorizes and stores character states and relationships in Chroma DB and continuously tracks them
Tested a forgetting curve formula, exp(-time/memory_strength), where memory fades over time and is reinforced upon repeated retrieval
Refined the Beat schema by explicitly defining emotions and conflicts, and applied a recommendation logic that predicts the next development using beats.csv, which was built by parsing Beats from existing works
Results: Long-term memory was successfully preserved, significantly improving narrative consistency. However, the Korean novel-writing capability of a single model (GPT) had limitations in capturing the distinctive style of actual serialized web novels, especially the unique “flavor” of the genre.

Version 3: Multi-LLM Integration and Interactive AIZAC UI

Experiment
Introduced LangChain’s SelfQueryRetriever to maximize retrieval accuracy by converting natural language queries into metadata filters during Memory search
Tested a multi-model division of labor in which GPT-4 handled structural planning while Naver HyperClovaX or Claude handled final novel sentence writing
Built an interactive UI (AIZAC) where users and AI communicate at the Beat level and collaboratively write novels
Results: The system achieved both the logical planning ability of GPT-4 and the fluent Korean prose style of HyperClovaX. With a memory system inspired by the human brain, the project successfully built a production-level AI writing assistant system capable of supporting long-form serialization.

Tech Stack

Language: Python 3.11
LLM / AI: OpenAI API (GPT-4-1106-preview / GPT-4o), Claude API, Naver HyperClovaX
Framework & Libraries: LangChain, Pydantic
Vector Database: Chroma DB
Web / UI: FastAPI, Starlette, Gradio, HTML/CSS (Jinja2 Templates)

System Architecture

1) Planning and Narrative Agent (Writer Agent)

Role: The Writer Agent controls the entire creative pipeline, from plot construction to Beat design, which is the smallest unit of a scene, and final story writing.
Multi-LLM division strategy
Design and logic (GPT-4 Turbo / GPT-4 preview, etc.): Used for tasks that require advanced reasoning and planning, such as designing NovelPlot and NovelBeat structures, extracting character information, and defining relationships.
Story writing (HyperClovaX/Claude Opus, etc.): When writing the actual novel manuscript based on Beats, including events, emotions, and conflicts, the system used Naver HyperClovaX, which is strong in Korean context and the delicate writing style unique to web novels, or Claude Opus, which excels at contextual understanding.
LangChain pipeline: I constructed the workflow by combining prompt chains optimized for each stage, such as beatWriterChain, complementBeatChain, and storyWriterChain, with Pydantic Output Parser.

2) Custom Agent Memory System (Writer Memory Architecture)

Beyond simple Vector DB RAG search, I directly designed and implemented a custom long-term/short-term memory integration system that mimics Ebbinghaus’s Forgetting Curve and human memory mechanisms.
Four-dimensional metadata classification system (Tagging & Categorization)
When the agent generates a new scene (Beat), an internal extraction module automatically parses it, decomposes it into four categories, and stores it in the Vector DB (Chroma).
1. Story: Summary of the developed story flow
2. Character: Character profiles and changes in psychological/status states
3. CharacterRelation: Changes in relationships between characters
4. NovelSetting: Key setting information such as objects, era, culture, and worldbuilding
When storing data in the DB, I assigned explicit metadata fields such as tag, character1, character2, and character_name, enabling not only semantic search but also pinpoint retrieval through filtering.
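The tagging scheme can be pictured with a minimal, stdlib-only sketch. It assumes the four categories are stored as a `tag` metadata value alongside the fields named above. The `MemoryRecord` class, the `pinpoint` helper, and the second character name are hypothetical. In the project these records live in Chroma, where the same exact-match lookup is expressed as a metadata filter.

```python
from dataclasses import dataclass, field

# The four categories described above, stored as the `tag` metadata value
TAGS = {"Story", "Character", "CharacterRelation", "NovelSetting"}


@dataclass
class MemoryRecord:
    """One memory: text for semantic search plus explicit metadata for filtering."""
    doc_id: str
    text: str
    metadata: dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject records that do not belong to one of the four categories
        if self.metadata.get("tag") not in TAGS:
            raise ValueError(f"unknown tag: {self.metadata.get('tag')!r}")


def pinpoint(records: list[MemoryRecord], **conditions: str) -> list[MemoryRecord]:
    """Exact-match metadata lookup: every given field must be equal (no embeddings)."""
    return [r for r in records
            if all(r.metadata.get(key) == value for key, value in conditions.items())]


records = [
    MemoryRecord("c1", "Lee Youngmin has grown wary of strangers.",
                 {"tag": "Character", "character_name": "Lee Youngmin"}),
    MemoryRecord("r1", "Lee Youngmin and Park Jisoo have become rivals.",
                 {"tag": "CharacterRelation", "character1": "Lee Youngmin",
                  "character2": "Park Jisoo"}),
    MemoryRecord("s1", "The academy forbids magic after sundown.",
                 {"tag": "NovelSetting"}),
]

hits = pinpoint(records, tag="CharacterRelation", character1="Lee Youngmin")
print([r.doc_id for r in hits])  # ['r1']
```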
Forgetting curve and memory reinforcement mechanism (Forgetting & Reinforcement)
Every memory element has two attributes: time, the number of turns elapsed since it was created, and memory_strength, which determines how slowly it fades.
Forgetting: Whenever the story turn advances, the time value of every memory increases by 1, and its retention score decays according to the formula memory_factor = exp(-time / memory_strength) * 10.
Threshold: If the memory_factor falls below the threshold (mf_threshold), that information is considered forgotten from the agent’s mind and is no longer retrieved.
Reinforcement: When the agent recalls and uses a specific memory, the memory_strength of that memory increases by +1, meaning it will be forgotten more slowly next time. This allows important settings and main-character relationships that frequently appear in the novel to remain consistent until the end of the work.
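Below is a minimal sketch of the forgetting and reinforcement rules, using the formula above. The initial memory_strength of 1 and the threshold of 1.0 are assumptions, because the text does not give the project's actual settings. Persistence to Chroma is omitted.

```python
import math
from dataclasses import dataclass

MF_THRESHOLD = 1.0  # hypothetical; the project's mf_threshold value is not stated


@dataclass
class Memory:
    text: str
    time: int = 0                 # turns elapsed since creation
    memory_strength: float = 1.0  # assumed starting strength

    @property
    def memory_factor(self) -> float:
        # memory_factor = exp(-time / memory_strength) * 10
        return math.exp(-self.time / self.memory_strength) * 10


def advance_turn(memories: list[Memory]) -> None:
    """Forgetting: every memory ages by one turn (the batch time + 1 update)."""
    for m in memories:
        m.time += 1


def recall(candidates: list[Memory]) -> list[Memory]:
    """Return the candidates still above the threshold and reinforce each one used."""
    alive = [m for m in candidates if m.memory_factor >= MF_THRESHOLD]
    for m in alive:
        m.memory_strength += 1  # reinforcement: recalled memories fade more slowly
    return alive


rival = Memory("Youngmin and Jisoo are sworn enemies.")
minor = Memory("A merchant sold Youngmin a lantern.")
for _ in range(3):
    advance_turn([rival, minor])
    recall([rival])  # only the rivalry comes up in each scene

print(round(rival.memory_factor, 2), round(minor.memory_factor, 2))  # 4.72 0.5
```

Under these assumptions, a memory that is never recalled drops below the threshold after three turns. A memory that is recalled every turn stays well above the threshold, because its strength grows as fast as its age.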
LLM-based autonomous query transformation (Self-Query Retriever)
Natural-language questions such as “What is the relationship between Lee Youngmin and his old friend?” are analyzed by the LLM and automatically converted into Vector DB metadata filter queries such as character1='Lee Youngmin' AND character2='old friend'. Because the match is exact rather than based on embedding similarity alone, the agent retrieves the right character information and is less prone to hallucinated details.
Short-term context memory system
Separately from long-term memory (Vector DB), I operated LangChain’s ConversationSummaryBufferMemory in parallel to preserve the vivid flow of the currently ongoing scene. This structure protects against token limits while ensuring that the latest development summaries are not lost.
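The idea behind a summary buffer can be sketched without LangChain: keep the latest turns verbatim, and fold the oldest ones into a running summary once a token budget is exceeded. This sketch makes two assumptions. Tokens are approximated by whitespace-separated words, and the `summarize` callable stands in for the LLM summarization call. Neither reflects ConversationSummaryBufferMemory's actual internals.

```python
from typing import Callable


class SummaryBuffer:
    """Recent turns kept verbatim; older turns folded into a running summary."""

    def __init__(self, max_tokens: int, summarize: Callable[[str, list[str]], str]) -> None:
        self.max_tokens = max_tokens
        self.summarize = summarize  # (old_summary, pruned_turns) -> new_summary
        self.summary = ""
        self.buffer: list[str] = []

    def _tokens(self) -> int:
        # Rough token count: whitespace-separated words (assumption)
        return sum(len(turn.split()) for turn in self.buffer)

    def add(self, turn: str) -> None:
        self.buffer.append(turn)
        pruned: list[str] = []
        # Move the oldest turns out until the buffer fits, always keeping the newest turn
        while self._tokens() > self.max_tokens and len(self.buffer) > 1:
            pruned.append(self.buffer.pop(0))
        if pruned:
            self.summary = self.summarize(self.summary, pruned)

    def context(self) -> str:
        header = [f"Summary: {self.summary}"] if self.summary else []
        return "\n".join(header + self.buffer)


def join_summary(old: str, turns: list[str]) -> str:
    # Stand-in for an LLM summarizer: simply concatenates the pruned turns
    return " ".join([old, *turns]).strip()


mem = SummaryBuffer(max_tokens=6, summarize=join_summary)
mem.add("Youngmin enters the academy gate.")
mem.add("Jisoo blocks his path.")
print(mem.context())
```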

3) Reference Knowledge Database (Aizac DB Connector)

Role: Existing web novel clichés and standardized Beat developments (beats.csv) are loaded into a Vector DB and used as Inspiration RAG. This allows the system to naturally expand user-provided keywords into web novel clichés or recommend appropriate events (Beats) when the next development is blocked.

4) Overall System Workflow (Workflow & Sub-module Interaction)

This shows the full cycle of how submodules such as UI, Writer, Memory, DB, and LLM interact with one another to complete a single novel.
1. User request (UI Layer): The user enters a genre and keywords on the frontend or clicks “Generate Next.”
2. Context collection (Memory Layer): The Writer module does not send a prompt immediately. Instead, it queries the WriterMemory module to retrieve the previous summary and character information.
3. Structured creation (LLM Layer - GPT): Based on the retrieved memory, GPT-4 designs a detailed structure (Beat) without logical contradictions and returns it as JSON (Pydantic).
4. Information extraction and memory update (Data Layer): Once the user reviews and confirms the Beat, the Memory extraction module is activated, parses new information from the Beat, updates the DB, and runs the entire forgetting algorithm (time + 1).
5. Final writing (LLM Layer - Clova): For emotional flow and stylistic description, the refined Beat is finally injected into HyperClovaX to complete the actual text that the user reads.
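The five steps above can be condensed into one function. In this sketch, stub callables stand in for GPT-4, HyperClovaX, and the WriterMemory module. The dict-based memory, the `review` hook, and all names are hypothetical. The `turn` counter stands in for the full forgetting update.

```python
import json
from typing import Callable

LLM = Callable[[str], str]  # prompt -> completion


def generate_next(memory: dict, planner: LLM, writer: LLM,
                  review: Callable[[dict], dict]) -> str:
    """Run one 'Generate Next' cycle (step numbers follow the workflow above)."""
    # 2. Context collection: read memory before any prompt is sent
    context = f"Summary: {memory['summary']}\nCharacters: {', '.join(memory['characters'])}"
    # 3. Structured creation: the planning model returns the Beat as JSON
    beat = json.loads(planner(f"{context}\nDesign the next beat as JSON."))
    # 4. The user reviews the Beat; then memory is updated and time advances by one turn
    beat = review(beat)
    memory["summary"] = f"{memory['summary']} {beat['event']}"
    memory["turn"] += 1
    # 5. Final writing: only the confirmed Beat goes to the prose model
    return writer(f"Write the scene for this beat:\n{json.dumps(beat, ensure_ascii=False)}")


memory = {"summary": "Youngmin arrives at the academy.", "characters": ["Lee Youngmin"], "turn": 0}


def planner(prompt: str) -> str:
    # Stub for the planning model
    return '{"event": "Youngmin meets an old friend.", "emotion": "nostalgia"}'


def writer(prompt: str) -> str:
    # Stub for the prose model
    return "PROSE << " + prompt


text = generate_next(memory, planner, writer, review=lambda beat: beat)
```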

Core Features

1. Dynamic character generation system
When an extra or a new character who was not defined in advance needs to suddenly intervene in the story, the Memory module detects this and autonomously creates the character’s proper name, profile, traits, and other information according to the current story flow, then registers it in the DB.
2. Automatic plot and worldbuilding planning
When a user enters a genre and keywords, such as regression or possession, the system passes them through the cliché DB and designs a solid novel plot and key settings following the structure of introduction, development, crisis, climax, and resolution.
3. Beat-level scene design and interactive refinement
The system divides a novel into detailed Beat units, which are broken down by time, space, characters, events, dialogue, emotions, and conflicts. Users can review the generated Beats, directly modify them, or instruct the AI to improve them.
4. Web-novel-optimized manuscript writing
Based on the completed Beats, the system automatically generates natural Korean web novel manuscripts with detailed situational descriptions and dialogue-driven prose. The final output can be viewed through FastAPI and the Gradio-based AIZAC UI web application, downloaded as a text file, or sent by email.

Technical Implementation Details

1. LLM Chain Optimization and Dynamic Data Pipelining (LCEL & Output Parsing)

I designed the internal chains of the agent using a concise LCEL (LangChain Expression Language) structure that combines LangChain’s RunnablePassthrough and PromptTemplate.
Parallel data mapping: When executing the Beat design chain (beatWriterChain), I maximized the LLM’s contextual awareness by injecting multiple contexts in parallel, such as {"memory": RunnablePassthrough(), "relations": RunnablePassthrough(), "stories": RunnablePassthrough(), ...}.
By using PydanticOutputParser, I ensured that the LLM’s response is always forcibly converted into a strict NovelBeat object structure, including location, time, characters, dialogue arrays, emotions, and conflicts. This fundamentally prevents internal data parsing errors in the system.
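PydanticOutputParser itself requires LangChain and Pydantic. The sketch below is a stdlib stand-in that shows the same contract: it pulls the JSON object out of a free-form reply and rejects anything that does not match the NovelBeat fields. The exact field names are assumptions based on the attributes listed above.

```python
import json
import re
from dataclasses import dataclass, fields


@dataclass
class NovelBeat:
    location: str
    time: str
    characters: list[str]
    dialogue: list[str]
    emotion: str
    conflict: str


def parse_beat(raw: str) -> NovelBeat:
    """Pull the JSON object out of an LLM reply and validate it against NovelBeat."""
    # Tolerate chatter or code fences around the JSON by grabbing the outermost braces
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object in model output")
    data = json.loads(match.group(0))  # JSONDecodeError is a subclass of ValueError
    expected = {f.name: f.type for f in fields(NovelBeat)}
    missing = expected.keys() - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for name, typ in expected.items():
        value = data[name]
        if typ is str and not isinstance(value, str):
            raise ValueError(f"{name} must be a string")
        if typ == list[str] and not (isinstance(value, list)
                                     and all(isinstance(v, str) for v in value)):
            raise ValueError(f"{name} must be a list of strings")
    # Keep only the schema fields; extra keys from the model are ignored
    return NovelBeat(**{name: data[name] for name in expected})


reply = ('Here is the beat: {"location": "academy gate", "time": "dawn", '
         '"characters": ["Lee Youngmin"], "dialogue": ["You came back."], '
         '"emotion": "tension", "conflict": "an old grudge resurfaces"}')
beat = parse_beat(reply)
print(beat.location, beat.characters)  # academy gate ['Lee Youngmin']
```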

2. Forgetting Curve Data Update Trigger and Dynamic Character Generation Algorithm

VectorDB updates were implemented not as simple text appends, but through ID-based Document replacement (update_document). Whenever an agent turn runs, the system queries the entire memory DB, increments every memory's time by 1 in a single batch, recalculates memory_factor = exp(-time / memory_strength) * 10, and writes the results back.
Dynamic Character Generation: When extracting character status information from the generated Beat using __extractCharacterInfo, if the existing character (tag="Character", character_name=target) does not exist in Chroma DB, the system raises an error (AnyDocumentNotFoundException) and immediately triggers the __generateCharacter chain in the exception handler. The LLM then instantly creates a new character profile, including personality, trauma, and other details that do not overlap with previous context, and stores it in the DB, forming a fully autonomous ecosystem.
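The exception-driven flow can be sketched with a plain dict standing in for Chroma. Only the `AnyDocumentNotFoundException` name comes from the text. `find_character`, `get_or_create_character`, and the fake generator that replaces the `__generateCharacter` LLM chain are hypothetical.

```python
from typing import Callable

Profile = dict[str, str]


class AnyDocumentNotFoundException(Exception):
    """No stored document matched the lookup (exception name taken from the text)."""


def find_character(db: dict[str, Profile], name: str) -> Profile:
    # Stand-in for a Chroma lookup filtered by tag="Character", character_name=name
    if name not in db:
        raise AnyDocumentNotFoundException(name)
    return db[name]


def get_or_create_character(db: dict[str, Profile], name: str, story_context: str,
                            generate_profile: Callable[[str, str, list[Profile]], Profile]) -> Profile:
    try:
        return find_character(db, name)
    except AnyDocumentNotFoundException:
        # Unknown character: ask the generator for a profile that fits the story so far,
        # passing existing profiles so it can avoid duplicating them, then persist it.
        profile = generate_profile(name, story_context, list(db.values()))
        db[name] = profile
        return profile


calls: list[str] = []


def fake_generate(name: str, story_context: str, existing: list[Profile]) -> Profile:
    # Stand-in for the LLM chain; a real one would prompt with the context and existing cast
    calls.append(name)
    return {"name": name, "role": "extra", "first_seen": story_context}


db = {"Lee Youngmin": {"name": "Lee Youngmin", "role": "protagonist"}}
innkeeper = get_or_create_character(db, "Innkeeper", "a rainy night at the inn", fake_generate)
again = get_or_create_character(db, "Innkeeper", "the next morning", fake_generate)
print(calls)  # ['Innkeeper']
```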

3. Hybrid RAG Based on AttributeInfo (SelfQueryRetriever)

To address hallucination and inaccuracy caused by simple K-NN (K-Nearest Neighbor) similarity search, I injected a sophisticated AttributeInfo schema into SelfQueryRetriever.
I defined and registered the type and meaning of metadata fields such as tag, time, id, memory_factor, memory_strength, character1, character2, and keyword in the Retriever. With this schema, the model can take a user prompt such as “What was the past conflict between these two?” and compile it into precise metadata filters instead of simply finding semantically similar text. These filters are analogous to SQL WHERE clauses and are built from operators such as AND, OR, and EQ.
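To make the compilation step concrete, here is a stdlib sketch that translates a structured filter, of the kind an LLM emits in a self-query setup, into Chroma's `where` syntax. That syntax supports operators such as `$eq`, `$ne`, `$and`, and `$or`. The `Comparison`/`Operation` classes are hypothetical and only loosely mirror LangChain's structured-query objects. In the project, SelfQueryRetriever handles both the LLM prompt and the translation. The `matches` helper evaluates a filter locally so the sketch runs without a database.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Comparison:
    attribute: str   # e.g. "character1"
    op: str          # "eq" or "ne" in this sketch
    value: str


@dataclass
class Operation:
    op: str          # "and" or "or"
    args: list["Filter"]


Filter = Union[Comparison, Operation]


def to_chroma_where(f: Filter) -> dict:
    """Translate a structured filter into a Chroma `where` dict."""
    if isinstance(f, Comparison):
        return {f.attribute: {f"${f.op}": f.value}}
    return {f"${f.op}": [to_chroma_where(arg) for arg in f.args]}


def matches(where: dict, metadata: dict) -> bool:
    """Evaluate a `where` dict against one record's metadata, to check a filter locally."""
    key, cond = next(iter(where.items()))
    if key == "$and":
        return all(matches(w, metadata) for w in cond)
    if key == "$or":
        return any(matches(w, metadata) for w in cond)
    op, value = next(iter(cond.items()))
    if op == "$eq":
        return metadata.get(key) == value
    if op == "$ne":
        return metadata.get(key) != value
    raise ValueError(f"unsupported operator: {op}")


# "What is the relationship between Lee Youngmin and his old friend?"
query = Operation("and", [Comparison("character1", "eq", "Lee Youngmin"),
                          Comparison("character2", "eq", "old friend")])
where = to_chroma_where(query)
print(where)
# {'$and': [{'character1': {'$eq': 'Lee Youngmin'}}, {'character2': {'$eq': 'old friend'}}]}
```

With a real collection, the resulting dict would be passed as the `where` argument of a Chroma query.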

4. Purpose-Driven Multi-LLM Integration

I placed the GPT-4-1106-preview model with a high temperature setting (1.1 to 1.2) at the front of the pipeline so that it could handle detailed planning stages such as generating unpredictable conflict structures, describing character psychology, and creating plot twists.
In contrast, for the stage of weaving actual novel sentences (generateStory), I separately implemented a custom wrapper class for Naver HyperClovaX (aizac.llms.HyperClovaX) to control HTTP requests and streaming. By separating and combining a Korean-specialized model that is strong at predicate variation and exaggerated emotional descriptions unique to web novels, I maximized the quality of the final output.
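The division of labor amounts to a routing table from pipeline stage to model. In this sketch, the planning entries use the model and temperature range stated above. The provider keys, the HyperClovaX label, and the 0.7 writing temperature are hypothetical. The client callables stand in for the OpenAI client and the custom HyperClovaX wrapper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Route:
    provider: str
    model: str
    temperature: float


# Planning stages use the GPT-4 model and 1.1-1.2 range named in the text;
# the writing-stage model label and its temperature are hypothetical.
ROUTES: dict[str, Route] = {
    "plot": Route("openai", "gpt-4-1106-preview", 1.1),
    "beat": Route("openai", "gpt-4-1106-preview", 1.2),
    "story": Route("clova", "HyperClovaX", 0.7),
}

Client = Callable[[str, str, float], str]  # (model, prompt, temperature) -> completion


def dispatch(stage: str, prompt: str, clients: dict[str, Client]) -> str:
    """Send the prompt to whichever provider and model the stage is routed to."""
    route = ROUTES[stage]
    return clients[route.provider](route.model, prompt, route.temperature)


clients: dict[str, Client] = {
    "openai": lambda model, prompt, t: f"[{model} t={t}] plan for: {prompt}",
    "clova": lambda model, prompt, t: f"[{model} t={t}] prose for: {prompt}",
}
print(dispatch("story", "beat #3", clients))  # [HyperClovaX t=0.7] prose for: beat #3
```

Keeping the table separate from the dispatch code means a stage can be moved to another model, such as Claude for writing, by changing one entry.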

5. Interface Modularization and Session State Management (Web Architecture)

For interactions between the AI turn and the writer turn, the Gradio web UI (demo.py) used a List and an index cursor (beats_answer_curIndex) to implement an undo/redo cache tree for generated outputs, optimizing the user experience.
In the separated FastAPI server (main.py) prepared for commercialization, I used SessionMiddleware to build concurrency sandboxes for each individual user (request.session['user_id']) and separated frontend-backend dependencies through Jinja2 template rendering.
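The undo/redo behavior can be sketched as a list of generated Beats plus a cursor, the role the text assigns to beats_answer_curIndex. This is a linear simplification with hypothetical names. Generating a new Beat after an undo discards the redo branch, whereas a full cache tree would keep it as a sibling.

```python
class BeatHistory:
    """Generated Beat versions in a list plus a cursor (the role of beats_answer_curIndex)."""

    def __init__(self) -> None:
        self.versions: list[str] = []
        self.cur = -1  # index of the Beat currently shown; -1 means nothing generated yet

    def push(self, beat: str) -> None:
        # A fresh generation after an undo drops the redo branch (linear history)
        del self.versions[self.cur + 1:]
        self.versions.append(beat)
        self.cur = len(self.versions) - 1

    def undo(self) -> str:
        if self.cur > 0:
            self.cur -= 1
        return self.versions[self.cur]

    def redo(self) -> str:
        if self.cur < len(self.versions) - 1:
            self.cur += 1
        return self.versions[self.cur]


history = BeatHistory()
for draft in ["beat v1", "beat v2", "beat v3"]:
    history.push(draft)
print(history.undo(), history.undo(), history.redo())  # beat v2 beat v1 beat v2
```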

Technical Challenges and Solutions

1. The Zero-To-One Problem and the Difficulty of Securing Original Novel Data

Problem: To create a high-quality writer AI, millions of high-quality original web novel texts are required. However, due to copyright issues, it is nearly impossible to legally crawl or use original web novel texts as fine-tuning data. In addition, when the model is asked to write a novel from an empty prompt (Zero-To-One), the output tends to be generic and lacks plausibility.
Solution
Instead of training on the original text itself, I focused on the structural skeleton of novels, namely “clichés and Beats” as templates.
I parsed only the development structures of existing works into metadata formats, such as location, time, conflict, and emotional flow, and built them into a knowledge DB (AizacDB) in the form of beats.csv.
During generation, I first provided this skeleton to the LLM through RAG-based retrieval (Keyword Expansion & Beat Prediction), encouraging the model to receive structural inspiration.
In addition, for the final text writing stage, I adopted Naver HyperClovaX, which had already been heavily pre-trained on Korean novels and internet corpora. This gave the system web-novel-specific style and word choice without fine-tuning on copyrighted text.

2. The Problem of Story Progression Breaking Due to Foreshadowing That Should Appear Later

Problem: When the entire plot structure, including introduction, development, crisis, climax, and resolution, was given to the LLM at the beginning, the model would suddenly hint at major spoilers from the ending in episode one, such as “the trusted master was actually the mastermind,” or prematurely resolve foreshadowing. This completely destroyed the dramatic tension and story progression.
Solution (Information Hiding & Beat Architecture)
I strictly separated the generation pipeline for the “Plot,” which is the full blueprint, from the “Beat,” which is the current scene.
Instead of allowing the agent to write a long text all at once, I introduced a Beat-splitting structure, which resembles how actual writers write stories.
When designing Beats and writing prose, I did not inject the entire plot into the LLM. Instead, I injected only short-term context memory (ConversationSummaryBufferMemory) and recently updated character/setting state information in a limited way (Information Hiding).
This focused the LLM’s view only on the development and conflict of the current scene, preventing unnecessary spoiler leakage and enabling tension-filled pacing.
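Information hiding comes down to what goes into the prompt. This minimal sketch assumes the plot is stored as one text per stage. The Beat prompt receives only the current stage, the recent summary, and current state lines, so later reveals never reach the model. The function, the stage keys, and the example plot are hypothetical; only the "trusted master was the mastermind" twist comes from the text.

```python
PLOT_STAGES = ("introduction", "development", "crisis", "climax", "resolution")


def build_beat_prompt(plot: dict[str, str], stage: str,
                      recent_summary: str, states: list[str]) -> str:
    """Assemble a Beat prompt that exposes only the current plot stage."""
    if stage not in PLOT_STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return "\n".join([
        f"Current stage ({stage}): {plot[stage]}",  # later stages are never included
        f"Recent events: {recent_summary}",          # e.g. the short-term summary memory
        "Current character and setting state:",
        *states,
        "Design the next beat for this stage only.",
    ])


plot = {
    "introduction": "Youngmin becomes the disciple of a trusted master.",
    "development": "Youngmin rises through the sect's ranks.",
    "crisis": "Disciples begin to disappear.",
    "climax": "The trusted master is revealed as the mastermind.",
    "resolution": "Youngmin leaves the sect to start anew.",
}
prompt = build_beat_prompt(plot, "introduction", "Youngmin arrived at the sect gate.",
                           ["Lee Youngmin: hopeful, naive"])
print("mastermind" in prompt)  # False
```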

3. The Problem of Character Relationships Not Being Maintained in the Long Term

Problem: As the number of novel episodes increased and the prompt token count exceeded the context window limit, the AI began to forget past events. A sworn enemy who fought fiercely yesterday would suddenly speak politely and kindly, or the character’s personality would collapse (OOC: Out of Character).
Solution (Four-Dimensional Tagging Memory and Forgetting/Reinforcement Curve)
Instead of simply concatenating past summaries as text, I built an independent Vector DB-based agent memory system (WriterMemory).
At the end of each Beat, the internal extraction module clearly separates and updates CharacterRelation, meaning relationships between characters, and Character, meaning personality and psychological state, in JSON format.
When writing a scene where two characters appear together, SelfQueryRetriever explicitly retrieves the latest relationship Matrix between the two characters from Chroma DB through precise filtering.
I applied the Forgetting Curve algorithm so that meaningless old information is forgotten through a time penalty, while important hostile or romantic relationships that frequently reappear increase in memory strength by +1 whenever they are retrieved. This was designed to keep character identity and emotional arcs stable until the final part of the story.

Lessons Learned

1) Technical Insights

The limitations of simple RAG and the power of structured data: I experienced firsthand that conventional RAG, which simply embeds text and retrieves it through K-NN search, causes critical hallucinations when tracking complex character relationships and timelines in novels. To solve this, I implemented a hybrid retrieval method that classifies metadata such as tag and character and converts queries into DB queries through SelfQueryRetriever. Through this, I deeply understood the importance of structured data pipelines in LLM systems.
Multi-LLM orchestration: I learned that rather than relying on a single best-performing model for everything, separating models according to the nature of each task is much more efficient. By assigning logical planning and design to GPT-4 and emotional, delicate Korean prose writing to HyperClovaX/Claude, I gained practical knowledge in maximizing output quality through model specialization.
LCEL and strict Output Parsing: I learned how to elegantly construct complex asynchronous parallel chains using LangChain’s LCEL, and acquired stability control techniques for forcing the LLM’s unpredictable outputs into fully parseable JSON formats using PydanticOutputParser.

2) Architectural Insights

Domain-driven design inspired by humans: Instead of solving AI’s long-context limitation simply by adopting a model with a larger context window, I fundamentally addressed the problem by translating the writing process of human authors, such as Beat splitting and Information Hiding, and the memory mechanism of the human brain, such as the Forgetting Curve, into software architecture. I realized that deep understanding of the domain, in this case novel creation, leads to strong system architecture.

3) Actionable Takeaways for Future Projects

Complete separation of the agent’s logic (Brain) and memory (Memory): The modular pattern of managing the agent’s state not through prompts but by fully separating it into an external Vector DB and sessions is something I will apply as a basic architecture to any future AI agent, such as CS chatbots or coding assistants.
Sandboxed session management: I plan to actively apply the concurrency control experience I gained from thoroughly separating sessions and independently operating memory at the backend level (FastAPI), so that contexts do not mix when multiple users access the system simultaneously and run their own agents.

Future Improvements

1) Features I Want to Add (Future Features)

Multi-modal automatic illustration pipeline: I am considering integrating image generation models such as Stable Diffusion and DALL-E 3 by automatically converting NovelSetting, such as worldbuilding and costumes, and NovelBeat, such as location and situation, stored in the current Vector DB into prompts, so that web novel covers or illustrations suitable for specific scenes can be generated in real time.
User writing style personalization (Personalization Learning): When a user manually edits a Beat or manuscript written by the AI, I want the agent to capture that and store the user’s unique writing style or plot development patterns as Few-Shot data through self-learning. Through this, the system can evolve from “an AI that writes the same way for everyone” into “a personalized writing assistant that gradually resembles a specific author’s style.”

2) Performance Optimization

Need for a RAG optimization architecture: I continued to observe that once the story goes beyond ten episodes (assuming around 5,000 characters per episode), RAG alone still does not fully eliminate hallucinations.
Cost and latency optimization through local sLLM adoption: Currently, the entire pipeline uses heavy commercial APIs such as GPT-4, which can create token cost and latency burdens during commercialization. I plan to improve the routing logic so that lightweight tasks such as metadata parsing, summarization, and simple emotion extraction are handled by locally served fine-tuned open-source sLLMs in the 8B–14B range.

3) Challenges to Solve

Exception handling for the forgetting curve (introducing Core Memory): The current forgetting curve inevitably weakens memory strength as time passes. Because of this, critical foreshadowing that appears only occasionally but runs through the entire story, such as “the One Ring that appeared in episode one,” may be deleted or forgotten by the system. To prevent this, an additional Core Memory layer must be introduced so that users can pin specific memories and prevent them from ever being forgotten.
Dialogue control in complex multi-character scenes: In scenes where three or more characters appear simultaneously, talk to one another, and come into conflict, speakers may become mixed up. Relationship Matrix queries also multiply, because every pair of characters has its own relationship: n characters form n(n-1)/2 pairs, so five characters already have ten. This requires architectural consideration, such as separating out a sub-agent specialized in multi-party dialogue, namely a Dialogue Controller.
