Mergoo: Open-Source Model Merging Across Architectures and Sizes

Introduction

Large Language Model (LLM) development is shifting from training alone toward model composition. As the community produces thousands of fine-tuned variants of Llama, Mistral, and Qwen, combining them into a single, more capable model has become an attractive alternative to training from scratch. Mergoo, developed by Leeroo-AI, targets this space. While traditional merging tools require models to share identical architectures, Mergoo is designed for merging across different architectures and parameter counts. This lets developers build hybrid models that draw on the strengths of each source model, and it makes this kind of architectural experimentation accessible to the open-source community.

What Is Mergoo?

Mergoo is a specialized Python library that facilitates the merging of heterogeneous Large Language Models. In deep learning, merging usually means combining the weights of two or more models into a single checkpoint that retains the capabilities of its source models. Mergoo is designed to merge models even when they have different architectures (e.g., a Llama-based model with a Mistral-based model) or different sizes (e.g., a 7B model with a 1B model). It does this through layer-mapping techniques and Mixture of Experts (MoE) construction strategies that handle the discrepancies in hidden dimensions and attention mechanisms.
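To make "combining the weights" concrete, here is a minimal sketch of the simplest merging method, uniform weight averaging, written in plain Python. It is not Mergoo's implementation: the `average_weights` helper and the toy parameter names are illustrative assumptions, and the sketch only covers the easy case where every checkpoint has the same parameter names and shapes.

```python
from typing import Dict, List

# Parameter name -> flattened parameter values (toy stand-in for a state dict).
Weights = Dict[str, List[float]]


def average_weights(checkpoints: List[Weights]) -> Weights:
    """Uniformly average parameters that share a name and length (hypothetical helper)."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    reference = checkpoints[0]
    merged: Weights = {}
    for name, values in reference.items():
        # Plain averaging only works when every checkpoint has an aligned parameter.
        for other in checkpoints[1:]:
            if name not in other or len(other[name]) != len(values):
                raise ValueError(f"parameter {name!r} does not align across checkpoints")
        merged[name] = [
            sum(ckpt[name][i] for ckpt in checkpoints) / len(checkpoints)
            for i in range(len(values))
        ]
    return merged


model_a = {"mlp.weight": [1.0, 2.0], "attn.weight": [0.0, 4.0]}
model_b = {"mlp.weight": [3.0, 6.0], "attn.weight": [2.0, 0.0]}
soup = average_weights([model_a, model_b])
print(soup)
```

The alignment check is the point of the example: as soon as two models differ in shape or naming, plain averaging fails, which is the gap that layer mapping and MoE construction are meant to close.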

Released under the Apache 2.0 license, the project provides a unified interface for researchers to experiment with “Model Soups,” TIES merging, and MoE conversion without needing to write custom boilerplate for every new architecture pair. It integrates deeply with the Hugging Face ecosystem, making it easy to pull base models and push merged results directly to the Hub.

Why Mergoo Matters

The primary constraint in modern AI development is the cost of compute. Training a new model from scratch to gain the capabilities of two existing models is prohibitively expensive for most organizations. Model merging offers a shortcut, but until Mergoo, that shortcut was limited to models from the same family. If you wanted the reasoning capability of a specific Llama-3 fine-tune and the efficiency of a Mistral variant, you were largely out of luck. Mergoo removes this barrier, allowing for the creation of “architectural hybrids” that can bridge the gap between different model families.

Furthermore, Mergoo provides the tools to create Mixture of Experts (MoE) models from dense models of different sizes. This means a developer can take a large, highly capable model and use a smaller, faster model as a secondary expert, creating a system that balances performance and inference speed dynamically. This is a critical development for deploying AI on edge devices or in resource-constrained environments where a full-sized dense model might be too slow, but a tiny model is too inaccurate.
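The routing idea can be sketched in a few lines of plain Python. This is a generic top-k softmax router, not code taken from Mergoo: the function names `route_top_k` and `moe_output` are hypothetical, and each expert's output is reduced to a single number to keep the arithmetic visible.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_output(expert_outputs: List[float], routing: List[Tuple[int, float]]) -> float:
    """Weighted sum of the selected experts' outputs; unselected experts are skipped."""
    return sum(weight * expert_outputs[index] for index, weight in routing)


# Three experts; only the two with the highest router scores are evaluated.
routing = route_top_k([2.0, 0.5, 1.0], k=2)
combined = moe_output([10.0, 20.0, 30.0], routing)
print(routing, combined)
```

Because only `k` experts run per input, the cost of a forward pass depends on which experts the router selects rather than on the total number of experts in the model.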

Key Features

Cross-Architecture Support: Mergoo can merge models with different underlying architectures, such as Llama-3 and Mistral-7B, by mapping corresponding layers and handling dimensional mismatches.
Heterogeneous Size Merging: The library allows for the combination of models with different parameter counts, so a hybrid can draw on both a large model and a smaller one.
Mixture of Experts (MoE) Construction: Users can convert multiple dense models into a single MoE model, where a router determines which “expert” layers to activate for a given input.
Config-Driven Workflow: Mergoo uses simple YAML configuration files to define the merging strategy, model paths, and layer mappings, making experiments reproducible.
Integrated Weight Averaging: Support for classic merging techniques like Model Soups and TIES is built in and adapted to cross-architecture scenarios.
Hugging Face Compatibility: Seamlessly loads models and tokenizers from the Hugging Face Hub and supports standard AutoModel classes.
Extensible Router Training: When creating MoE models, Mergoo provides frameworks for training or initializing routers to effectively navigate the new expert layers.

How Mergoo Compares

To understand the value of Mergoo, it is essential to compare it with the current industry standard, Mergekit. While Mergekit is incredibly powerful for same-family merging, it often struggles when architectural components do not align perfectly. Mergoo fills the gap for heterogeneous merging.

Feature	Mergoo	Mergekit	Manual Scripting
Cross-Architecture	Native Support	Limited	Very Difficult
Different Model Sizes	Yes	No	Complex Math Required
MoE Creation	Advanced/Heterogeneous	Standard/Homogeneous	Manual Implementation
Ease of Use	High (YAML)	High (YAML)	Low

While Mergekit remains the go-to for merging multiple Llama-3 checkpoints to create a slightly better Llama-3, Mergoo is the tool you use when you want to experiment with new architectural paradigms altogether. It is less a tool for simple optimization and more a tool for architectural innovation.

Getting Started: Installation

Mergoo is a Python-based library and can be installed via pip. It requires a modern Python environment and the PyTorch ecosystem.

Standard Installation

pip install mergoo

Installation from Source

For those who want to access the latest experimental features or contribute to the project, installing from source is recommended:

git clone https://github.com/Leeroo-AI/mergoo.git
cd mergoo
pip install -e .

Prerequisites

Ensure you have the following installed to avoid compatibility issues:

Python 3.9+
PyTorch 2.0+
Transformers library (latest version recommended)
Accelerate library for efficient weight handling
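A quick way to confirm the prerequisites before installing is a small standard-library check like the sketch below. It only tests that the Python version meets the minimum and that each package can be found; it does not verify package versions (for example, that PyTorch is 2.0 or newer). The `check_environment` helper is an illustrative name, not part of Mergoo.

```python
import sys
from importlib.util import find_spec
from typing import Dict

REQUIRED_PACKAGES = ("torch", "transformers", "accelerate")
MIN_PYTHON = (3, 9)


def check_environment() -> Dict[str, bool]:
    """Report whether the interpreter and each required package are available."""
    report = {"python": sys.version_info >= MIN_PYTHON}
    for package in REQUIRED_PACKAGES:
        # find_spec returns None when a top-level package is not installed.
        report[package] = find_spec(package) is not None
    return report


report = check_environment()
for name, ok in report.items():
    print(f"{name:<14}{'ok' if ok else 'MISSING'}")
```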

How to Use Mergoo

The workflow in Mergoo revolves around the ModelMerge or MoEMerge classes and a configuration object. The process typically involves defining which models to merge and how their layers should be mapped if they aren’t identical.

For basic merging, you define a list of model paths and a merging method. For more complex cross-architecture merging, you will need to specify the layer mappings in a YAML configuration file. This file tells Mergoo which layers in the source models correspond to each other, allowing the library to handle any necessary dimensionality conversions behind the scenes.
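As an illustration of what a layer mapping expresses, the sketch below pairs each layer of a shallower model with the layer at the same relative depth in a deeper one. This is one simple heuristic chosen for the example, not the mapping scheme Mergoo uses, and `proportional_layer_map` is a hypothetical helper.

```python
from typing import Dict


def proportional_layer_map(num_source_layers: int, num_target_layers: int) -> Dict[int, int]:
    """Map each target layer index to the source layer at the same relative depth."""
    if num_source_layers < 1 or num_target_layers < 1:
        raise ValueError("both models need at least one layer")
    if num_target_layers == 1:
        return {0: 0}
    # Stretch target indices so the first and last layers line up in both models.
    scale = (num_source_layers - 1) / (num_target_layers - 1)
    return {target: round(target * scale) for target in range(num_target_layers)}


# Align a 16-layer model with a 32-layer model.
layer_map = proportional_layer_map(num_source_layers=32, num_target_layers=16)
print(layer_map)
```

A mapping like this only answers which layers correspond. Reconciling different hidden dimensions within each pair of layers is a separate step.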

Code Examples

Below is a conceptual example of loading an already-merged hybrid checkpoint with the Mergoo API.

Basic Cross-Architecture Setup

from mergoo.models.modeling_mergoo import MergooForCausalLM
from transformers import AutoConfig

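# Conceptual sketch: confirm the module path and checkpoint ID against your installed version
# Load the configuration saved alongside the merged checkpoint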
config = AutoConfig.from_pretrained("leeroo/mergoo-l3-8b-m-v1")
model = MergooForCausalLM.from_pretrained(
    "leeroo/mergoo-l3-8b-m-v1",
    config=config
)

# Print the module tree to inspect the merged layers
print(model)

MoE Merge Configuration

A typical YAML configuration for creating a Mixture of Experts model from two different base models might look like this:

base_model: "meta-llama/Meta-Llama-3-8B"
experts:
  - source_model: "mistralai/Mistral-7B-v0.1"
    positive_prompts: ["coding", "math"]
  - source_model: "meta-llama/Meta-Llama-3-8B-Instruct"
    positive_prompts: ["chat", "general"]
method: moe
device: cuda
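Before launching a long merge, it can help to sanity-check a configuration like the one above. The sketch below mirrors the same fields as a Python dictionary and checks them with a small validator. The `validate_moe_config` function and its rules (at least two experts, non-empty prompts) are assumptions made for this example, not checks that Mergoo itself performs.

```python
from typing import Any, Dict, List

# The YAML fields above, expressed as a Python dictionary.
config: Dict[str, Any] = {
    "base_model": "meta-llama/Meta-Llama-3-8B",
    "experts": [
        {"source_model": "mistralai/Mistral-7B-v0.1",
         "positive_prompts": ["coding", "math"]},
        {"source_model": "meta-llama/Meta-Llama-3-8B-Instruct",
         "positive_prompts": ["chat", "general"]},
    ],
    "method": "moe",
    "device": "cuda",
}


def validate_moe_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems: List[str] = []
    for key in ("base_model", "experts", "method"):
        if key not in cfg:
            problems.append(f"missing key: {key}")
    experts = cfg.get("experts", [])
    if not isinstance(experts, list) or len(experts) < 2:
        problems.append("experts must be a list with at least two entries")
        return problems
    for index, expert in enumerate(experts):
        if not isinstance(expert, dict) or not expert.get("source_model"):
            problems.append(f"expert {index} has no source_model")
        elif not expert.get("positive_prompts"):
            problems.append(f"expert {index} has no positive_prompts")
    return problems


print(validate_moe_config(config) or "config looks valid")
```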

Advanced Configuration

Mergoo shines in its ability to handle fine-grained configuration. You can specify different merging weights for specific layers, or even use different merging methods for the attention mechanism versus the MLP (Multi-Layer Perceptron) blocks. This level of control is necessary because different architectures store knowledge in different components. For example, you might want to preserve the attention heads of a Llama-3 model while using the MLP layers from a specialized Mistral fine-tune.
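The attention-versus-MLP idea can be sketched as a per-component interpolation. The example assumes both models already have matching parameter names and shapes, and it classifies parameters using the `.self_attn.` and `.mlp.` name fragments found in Hugging Face Llama and Mistral checkpoints. `component_of` and `merge_by_component` are hypothetical helpers, not Mergoo functions.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def component_of(param_name: str) -> str:
    """Classify a parameter by the block it belongs to, based on its name."""
    if ".self_attn." in param_name:
        return "attention"
    if ".mlp." in param_name:
        return "mlp"
    return "other"


def merge_by_component(model_a: Weights, model_b: Weights,
                       weight_on_a: Dict[str, float]) -> Weights:
    """Interpolate each parameter with a weight chosen by its component type."""
    merged: Weights = {}
    for name, a_values in model_a.items():
        b_values = model_b[name]
        if len(a_values) != len(b_values):
            raise ValueError(f"shape mismatch for {name!r}")
        alpha = weight_on_a[component_of(name)]
        merged[name] = [alpha * a + (1.0 - alpha) * b
                        for a, b in zip(a_values, b_values)]
    return merged


model_a = {"layers.0.self_attn.q_proj.weight": [1.0, 1.0],
           "layers.0.mlp.up_proj.weight": [1.0, 1.0]}
model_b = {"layers.0.self_attn.q_proj.weight": [3.0, 5.0],
           "layers.0.mlp.up_proj.weight": [3.0, 5.0]}

# Keep attention entirely from model A and take the MLP entirely from model B.
hybrid = merge_by_component(model_a, model_b,
                            {"attention": 1.0, "mlp": 0.0, "other": 0.5})
print(hybrid)
```

Intermediate values such as `0.7` blend the two sources instead of picking one outright.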

Real-World Use Cases

Domain-Specific MoE: Combine a general-purpose model with a highly specialized medical or legal model of a different size to create a versatile expert system that is smaller than a full ensemble.
Architecture Upcycling: Take the weights from an older, well-understood architecture and merge them into a newer, more efficient architecture to give the new model a “head start” in knowledge.
Resource Optimization: Create hybrid models that use large experts for complex reasoning tasks and small experts for simple linguistic processing, significantly reducing inference latency.
Cross-Lingual Transfer: Merge a strong English-centric Llama model with a smaller but linguistically diverse model from another family to improve multi-lingual performance without full retraining.

Contributing to Mergoo

The Mergoo project is actively looking for contributors to expand its library of supported architectures. If you wish to contribute, you can start by checking the GitHub issues for “good first issues.” The project follows a standard PR-based workflow. Because Mergoo deals with complex weight manipulations, new merging methods should come with unit tests. The maintainers also welcome documentation improvements and new YAML templates for popular model combinations.

Community and Support

Leeroo-AI maintains an active presence for the Mergoo community. You can find technical discussions and support through the following channels:

GitHub Discussions: For architectural questions and feature requests.
Discord: For real-time troubleshooting and collaboration with other Mergoo users.
Official Documentation: Provides in-depth API references and advanced tutorials on layer mapping.

Conclusion

Mergoo is more than just another merging tool; it is a framework for architectural experimentation. By breaking the constraint that merged models must share the same skeleton, it opens up a new frontier in model design. Whether you are looking to create a custom MoE that balances speed and intelligence, or you want to experiment with hybrid Llama-Mistral architectures, Mergoo provides the necessary primitives to do so efficiently. As the open-source community continues to iterate on LLMs, tools like Mergoo will be essential for synthesizing the vast amount of individual progress into unified, highly capable systems. Star the repository on GitHub, explore the existing YAML templates, and start building the next generation of hybrid AI models today.

Frequently Asked Questions

What is Mergoo and what problem does it solve?

Mergoo is an open-source Python library for merging Large Language Models with different architectures and parameter sizes. It solves the limitation found in most merging tools where models must have identical structures to be combined, allowing for the creation of cross-architecture hybrid models.

How does Mergoo handle merging models of different sizes?

Mergoo uses advanced layer mapping and dimensionality reduction/expansion techniques to align weights between models of different sizes. It can also utilize Mixture of Experts (MoE) strategies where models of different sizes act as separate experts within a unified framework.

Is Mergoo compatible with Hugging Face models?

Yes, Mergoo is built to integrate seamlessly with the Hugging Face Transformers library. It supports loading models and tokenizers using standard AutoModel classes and allows for pushing merged checkpoints directly to the Hugging Face Hub.

Can I use Mergoo to create a Mixture of Experts (MoE)?

Absolutely. One of Mergoo’s primary features is the MoEMerge capability, which allows you to take several dense models (even with different architectures) and combine them into a single MoE model with a learnable or initialized router.

What merging methods does Mergoo support?

Mergoo supports several popular merging methods including Model Soups, TIES merging, and MoE conversion. These methods are adapted to handle the complexities of heterogeneous architectural components.
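For readers unfamiliar with TIES, the sketch below walks through its three steps on flat lists of numbers: trim each task vector (fine-tuned weights minus base weights) to its largest-magnitude entries, elect a sign per parameter, and average only the values that agree with that sign. It is a simplified, same-shape illustration of the published method, not Mergoo's code, and the `keep_fraction` and `scale` parameter names are assumptions.

```python
from typing import List


def trim(task_vector: List[float], keep_fraction: float) -> List[float]:
    """Zero out all but the largest-magnitude entries of a task vector."""
    keep = max(1, round(len(task_vector) * keep_fraction))
    threshold = sorted((abs(v) for v in task_vector), reverse=True)[keep - 1]
    return [v if abs(v) >= threshold else 0.0 for v in task_vector]


def ties_merge(base: List[float], finetuned: List[List[float]],
               keep_fraction: float = 0.5, scale: float = 1.0) -> List[float]:
    """Merge fine-tuned models into the base using trim, sign election, and disjoint mean."""
    task_vectors = [
        trim([f - b for f, b in zip(model, base)], keep_fraction)
        for model in finetuned
    ]
    merged = []
    for i, base_value in enumerate(base):
        values = [tv[i] for tv in task_vectors]
        # Elect the sign with the larger total magnitude across models.
        sign_is_positive = sum(values) >= 0
        agreeing = [v for v in values if v != 0.0 and (v > 0) == sign_is_positive]
        delta = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(base_value + scale * delta)
    return merged


base = [1.0, 1.0, 1.0, 1.0]
model_a = [5.0, 0.0, 1.5, 3.0]    # task vector: [4.0, -1.0, 0.5, 2.0]
model_b = [-1.5, -2.0, 1.2, 3.0]  # task vector: roughly [-2.5, -3.0, 0.2, 2.0]
merged = ties_merge(base, [model_a, model_b], keep_fraction=0.5)
print(merged)
```

In the first parameter the two models disagree in sign (+4.0 versus -2.5). The positive direction has the larger magnitude, so only model A's update is applied there instead of a diluted average.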

How does Mergoo compare to Mergekit?

While Mergekit is excellent for merging models of the same architecture, Mergoo is specifically optimized for cross-architecture and cross-size merging. Mergoo provides the mathematical mapping needed to combine models such as Llama and Mistral, a combination that Mergekit does not natively support to the same depth.

Can I use Mergoo for non-Transformer architectures?

Currently, Mergoo is focused on Transformer-based Large Language Models as they are the dominant architecture in the community. However, the library is designed to be extensible to other architectures in the future as the deep learning landscape evolves.

Does Mergoo require a lot of GPU memory?

Merging can be memory-intensive, but Mergoo leverages the Accelerate library to load model weights efficiently. For very large models, you may need a high-memory GPU or significant system RAM to perform the weight averaging and mapping operations.
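A rough lower bound on the memory needed is the size of the weights that must be held at once: parameter count times bytes per parameter. The sketch below does that arithmetic, using 1 GB = 10^9 bytes. It ignores the merged output, temporary buffers, and framework overhead, so treat the result as a floor rather than a requirement. The helper names are illustrative.

```python
from typing import Iterable

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str = "float16") -> float:
    """Memory taken by the raw weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def source_models_memory_gb(param_counts: Iterable[int], dtype: str = "float16") -> float:
    """Memory needed to hold every source model's weights at the same time."""
    return sum(weight_memory_gb(n, dtype) for n in param_counts)


# Holding a 7B and a 1B model in half precision at the same time.
print(weight_memory_gb(7_000_000_000, "float16"))                     # 14.0
print(source_models_memory_gb([7_000_000_000, 1_000_000_000], "float16"))  # 16.0
```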

Can I merge a Llama-3 model with a Mistral model using Mergoo?

Yes, this is one of the key use cases for Mergoo. By defining a layer mapping in the configuration, Mergoo can align the attention and MLP blocks of a Llama-3 model with those of a Mistral model to create a functional hybrid.

Is the router in a Mergoo-created MoE trainable?

Yes, Mergoo provides the functionality to initialize or train the gate/router for a created MoE model, ensuring that the model learns how to effectively route inputs to the most appropriate expert layers.