Unlocking the Potential of Large Language Models: A Deep Dive into Long Writer

azhar
6 min read · Sep 12, 2024

The progression of large language models (LLMs) over the past few years has been nothing short of remarkable. From the early days of modest token context windows to today’s expansive architectures capable of understanding and generating vast amounts of text, the AI landscape is rapidly evolving. At the forefront of this evolution is a new project: Long Writer (styled LongWriter by its authors), developed by researchers at Tsinghua University and Zhipu AI. This article explores Long Writer’s capabilities, the methods employed to achieve long-form content generation, and how developers can utilize these advancements in their projects.

Before we proceed, let’s stay connected! Please consider following me on Medium, and don’t forget to connect with me on LinkedIn for a regular dose of data science and deep learning insights. 🚀📊🤖

The Evolution of Context Windows in Large Language Models

Context windows in LLMs, which determine how much text a model can consider at any given time, have expanded significantly in recent years. Early models could handle around 2,000 tokens, while Google’s Gemini 1.5 pushed the figure as high as 1 million tokens. However, a larger context window mostly affects how much text a model can read, not how much it can write, and many models still struggle to output long-form content. OpenAI’s models, for example, accept far longer inputs than before, but the amount of text they coherently generate in a single response has not grown at the same pace, often topping out around 4,000 tokens.

The inherent limitation in generating extensive and cohesive articles poses challenges for use cases requiring lengthy outputs, such as document writing, technical manuals, and comprehensive guides. Enter Long Writer, aimed at addressing these issues directly.

Introducing Long Writer

Long Writer is a fine-tuned version of existing large language models that aims to generate roughly 10,000 words in a single response. The model is trained on a dataset built specifically for long-form generation tasks, which allows it to produce coherent content over extended sections. The researchers have released the fine-tuned models along with a detailed paper describing the methodology behind the project.

Key aspects of Long Writer include:

  1. Model Options: Two primary models are available: GLM-4 9B Long Writer and Llama 3.1 8B Long Writer.
  2. Supervised Fine-Tuning: By employing a dataset of 6,000 examples whose outputs range from 2,000 to 32,000 words, the authors improved model performance through supervised fine-tuning. This helps the model learn to generate extended outputs coherently.
  3. Data-Centric Approach: The researchers used a novel approach to create the dataset, leveraging an agent to orchestrate and control the text generation process.

Understanding the Process of Long Writer

The Dataset Creation Methodology

One of the novel aspects of Long Writer is how the dataset was constructed. The research team developed an agent-based approach dubbed “AgentWrite,” which plans and then executes long-form content generation. The agent acts as a control unit, guiding the model through the necessary stages of writing. This involves splitting the task into segments, which allows for logical organization and extended detail. For instance, a piece on the Roman Empire might be structured into sections such as:

  1. Introduction
  2. Historical Background
  3. Key Figures
  4. Major Events
  5. Influence on Modern Culture
  6. Conclusion

Creating outputs this way keeps each section coherent and relevant to the overall topic, even in very long documents. The documents produced by this pipeline then serve as training examples, which is what ultimately lets the fine-tuned model produce extensive content on its own.
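The plan-then-write idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the prompt wording, and the `fake_llm` stand-in are all assumptions made so the example runs without a model.

```python
from typing import Callable, List


def plan_sections(topic: str, llm: Callable[[str], str]) -> List[str]:
    """Ask the model for an outline, one section title per line."""
    outline = llm(f"List the section titles for a long article on: {topic}")
    return [line.strip() for line in outline.splitlines() if line.strip()]


def write_long_document(topic: str, llm: Callable[[str], str]) -> str:
    """Write each planned section in turn, passing along the text written so far."""
    sections = plan_sections(topic, llm)
    written: List[str] = []
    for title in sections:
        context = "\n\n".join(written)
        prompt = (
            f"Topic: {topic}\n"
            f"Outline: {', '.join(sections)}\n"
            f"Text so far:\n{context}\n"
            f"Now write the section titled: {title}"
        )
        written.append(f"{title}\n{llm(prompt)}")
    return "\n\n".join(written)


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call so the sketch runs offline."""
    if prompt.startswith("List the section titles"):
        return "Introduction\nHistorical Background\nConclusion"
    title = prompt.rsplit("titled: ", 1)[-1]
    return f"(draft text for {title})"


document = write_long_document("The Roman Empire", fake_llm)
print(document)
```

Swapping `fake_llm` for a function that calls an actual model would turn this into a working, if basic, long-form pipeline.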

Training with Supervised Fine-Tuning

The researchers used a carefully curated dataset to train Long Writer. The training strategy relied on supervised fine-tuning, which teaches the model to reproduce coherent long-form texts from examples. The dataset included examples from diverse domains, increasing the model’s versatility across various writing tasks.

Training on such extensive texts, as outlined in the research paper, involved:

  1. High-Volume Examples: Using a dataset that spans various genres and lengths to reduce overfitting and encourage a generalized understanding of long-form content generation.
  2. Alignment Training: Incorporating techniques like Direct Preference Optimization (DPO) to align model output more closely with user expectations.
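To make the alignment step more concrete, here is a small sketch of the standard DPO loss for a single preference pair. It is not taken from the Long Writer code: the function name, the example log-probabilities, and the `beta` value are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift from the reference
) -> float:
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more likely each response is under the policy than under the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as log(1 + exp(-x)).
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: the loss sits at log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy now favours the chosen response more than the reference does: the loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(baseline, improved)
```

The loss falls as the policy raises the likelihood of the preferred response relative to the rejected one, measured against a frozen reference model.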

Model Evaluation and Benchmarks

Long Writer has been benchmarked using the LongBench-Write and LongWrite-Ruler evaluations. These benchmarks measure how well a model follows a requested output length and how good the resulting text is, which makes it possible to compare Long Writer against proprietary models. According to the paper’s evaluations, Long Writer’s outputs compare favorably with existing models at generating long-form content coherently.
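One ingredient of such an evaluation is a score for how closely the output length matches the requested length. The sketch below is an illustrative piecewise-linear score; the function name and the penalty slopes are assumptions and should be checked against the paper before being treated as the benchmark's metric.

```python
def length_score(required_words: int, actual_words: int) -> float:
    """Return a 0-100 score that peaks when the output matches the requested length."""
    if required_words <= 0:
        raise ValueError("required_words must be positive")
    if actual_words <= 0:
        return 0.0
    if actual_words > required_words:
        # Overshooting is penalised more gently (assumed slope).
        penalty = (actual_words / required_words - 1) / 3
    else:
        # Undershooting is penalised more sharply (assumed slope).
        penalty = (required_words / actual_words - 1) / 2
    return 100.0 * max(0.0, 1.0 - penalty)


print(length_score(10_000, 10_000))  # exact match
print(length_score(10_000, 5_000))   # half the requested length
print(length_score(10_000, 40_000))  # four times the requested length
```

A real benchmark would pair a length score like this with a separate quality judgment, since a long output is not necessarily a good one.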

Practical Applications of Long Writer

The practical applications of Long Writer are diverse and versatile. Here are some scenarios where the capabilities of this model can be leveraged:

  1. Content Creation: Generate extensive articles for blogs, websites, or publications.
  2. Technical Writing: Develop comprehensive manuals and documentation for software, tools, or machinery.
  3. Educational Resources: Produce vast resources for learning, such as textbooks or detailed guides.
  4. Creative Writing: Explore fiction by creating novels or short stories with expansive plots and subplots.

Code Implementation

For developers interested in getting started with Long Writer, the model is readily available on platforms like Hugging Face. Below, we provide a sample implementation using the Long Writer model with the Hugging Face Transformers library.

Install necessary packages

!pip -q install accelerate 
!pip -q install git+https://github.com/huggingface/transformers
!pip install -q sentencepiece

Import libraries

from transformers import AutoTokenizer, AutoModelForCausalLM 
import torch

Load the tokenizer and model

tokenizer = AutoTokenizer.from_pretrained("THUDM/LongWriter-llama3.1-8b", trust_remote_code=True) 
model = AutoModelForCausalLM.from_pretrained("THUDM/LongWriter-llama3.1-8b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
model.eval()

Define the device

device = model.device  # device_map="auto" has already placed the model, so reuse its device

Prepare a prompt for the model

query = "Write a 10000-word China travel guide" 
prompt = f"[INST]{query}[/INST]"

Tokenize input

inputs = tokenizer(prompt, truncation=False, return_tensors="pt").to(device)  # "inputs" avoids shadowing the built-in input()

Generate output with specified parameters

context_length = inputs.input_ids.shape[-1]  # prompt length in tokens, used later to strip the prompt
output = model.generate(
    **inputs,
    max_new_tokens=32768,
    num_beams=1,
    do_sample=True,
    temperature=0.5,
)[0]

Decode and print the response

response = tokenizer.decode(output[context_length:], skip_special_tokens=True) 
print(response)

In this sample code:

  1. We initialize the Long Writer model and tokenizer.
  2. We prepare a query asking for a 10,000-word travel guide to China and pass that to the model.
  3. We generate text based on our prompt, specifying parameters like maximum new tokens and sampling temperature for output variation.
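To check whether a response actually reaches the requested length, a few helper functions are enough. This is a minimal sketch: the helper names, the 10% tolerance, and the rule of thumb of about 1.3 tokens per English word are assumptions, not values from the Long Writer project.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def estimated_tokens(words: int, tokens_per_word: float = 1.3) -> int:
    """Rough token estimate; 1.3 tokens per word is an assumed rule of thumb."""
    return round(words * tokens_per_word)


def meets_target(text: str, target_words: int, tolerance: float = 0.1) -> bool:
    """True if the text is no more than `tolerance` short of the target length."""
    return word_count(text) >= target_words * (1 - tolerance)


sample = "word " * 9500  # stand-in for a generated response
print(word_count(sample))            # 9500
print(meets_target(sample, 10_000))  # True
print(estimated_tokens(10_000))      # 13000
```

Applied to the `response` string from the code above, `word_count(response)` gives a quick sanity check. The token estimate also suggests why `max_new_tokens=32768` leaves comfortable headroom for a 10,000-word request.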

Extensive Testing

By running the Long Writer model across diverse topics, you can observe how well it sustains coherent output. For example, testing it with a prompt for a knitting guide yielded over 10,000 words, even though the topic presumably has less training data available.

To illustrate further, here’s how we can tweak our code to generate different topics:

topics = [
    "Write a 10000-word guide to knitting",
    "Write a 10000-word guide to underwater kickboxing",
    "Write a 10000-word story on Aliens impersonating people on earth. It should be in a serious style and be based around a murder of someone who discovered the alien shape shifter.",
]

for topic in topics:
    prompt = f"[INST]{topic}[/INST]"
    inputs = tokenizer(prompt, truncation=False, return_tensors="pt").to(device)
    context_length = inputs.input_ids.shape[-1]
    output = model.generate(**inputs, max_new_tokens=32768, num_beams=1, do_sample=True, temperature=0.5)[0]
    response = tokenizer.decode(output[context_length:], skip_special_tokens=True)
    print(f"Response for topic '{topic}':\n{response[:500]}...\n")  # Print the first 500 characters

With this approach, we encapsulate multiple queries in a loop, leading to varied and potentially extensive outputs for nearly any requested topic.

Conclusion

The Long Writer project represents a significant step forward for large language models, particularly in terms of generating long-form content coherently and meaningfully. By leveraging supervised fine-tuning, a well-thought-out dataset, and an innovative agent-based approach, Long Writer enhances LLM capabilities, making it a powerful tool for writers, educators, and organizations.

For developers, the open availability of the Long Writer models and datasets on platforms like Hugging Face offers the opportunity to explore and leverage these advancements in their applications. By integrating these tools, we can push the boundaries of what’s possible in automated content creation, paving the way for future developments in artificial intelligence and natural language processing.

As the landscape of AI continues to evolve, projects like Long Writer exemplify the potential for scalable and impactful technologies that can transform how we create and consume information. I encourage all curious developers and researchers to explore this technology further and consider how it might fit into their workflows. Let’s embrace these advancements and shape the future of content together.

For feedback, discussions, or community support, consider joining Discord servers dedicated to AI, where you can learn from others’ experiences. The AI landscape is vast, and continuous learning is key to maximizing its potential.
