Meituan Longcat Switches On LongCat Flash Lite

Meituan Longcat has introduced LongCat Flash Lite, a 68.5B-parameter model with approximately 3B activated parameters, designed to address scaling inefficiencies in Mixture-of-Experts (MoE) architectures. The model supports a 256k context length via the YaRN method and integrates an N-gram embedding table that accounts for over 30B of its parameters.
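To put those figures in proportion, the short calculation below works out what share of the model is activated and what share sits in the N-gram embedding table. It uses only the numbers quoted above; because the table is described as "over 30B", the 30B value is treated as a lower bound, and the variable names are illustrative.

```python
# Figures quoted in the announcement, in billions of parameters.
TOTAL_B = 68.5        # total parameters
ACTIVE_B = 3.0        # "approximately 3B" activated parameters
NGRAM_TABLE_B = 30.0  # "over 30B", so treated here as a lower bound

active_share = ACTIVE_B / TOTAL_B        # fraction of weights that is activated
ngram_share = NGRAM_TABLE_B / TOTAL_B    # minimum fraction held by the N-gram table
remaining_b = TOTAL_B - NGRAM_TABLE_B    # upper bound on all other weights

print(f"activated share: ~{active_share:.1%}")
print(f"N-gram table share: at least {ngram_share:.1%}")
print(f"all other weights: at most {remaining_b:.1f}B")
```

On these numbers, roughly 4% of the weights are activated, while the embedding table holds at least about 44% of the total.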

By shifting focus from expert scaling to embedding scaling, the developers aim to overcome the diminishing returns often found in traditional sparse models. This release targets developers and researchers looking for high-performance inference in agentic and coding applications without the full computational load of dense models.

Core Features & Technical Capabilities

  • N-gram embedding table integration for enhanced performance.
  • Support for 256k context length utilizing the YaRN method.
  • Specialized N-gram Cache and synchronized kernels for optimized latency.
  • Mitigation of I/O bottlenecks typically associated with FFN-based experts.
  • Competitive proficiency in agentic tool use and coding tasks.
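The announcement does not spell out how the N-gram embedding table is indexed. As a rough illustration of the general idea only, the sketch below assumes a common approach: the N-grams ending at each position are hashed into a fixed number of buckets, and the matching rows are added to the ordinary token embedding. All sizes, the hash function, and the function names are hypothetical and are not taken from LongCat Flash Lite.

```python
import random

# Hypothetical toy sizes, chosen only to keep the example small.
VOCAB_SIZE = 1000      # rows in the ordinary token embedding table
NGRAM_BUCKETS = 4096   # rows in the hashed N-gram embedding table
DIM = 8                # embedding width
MAX_N = 3              # use 2-grams and 3-grams ending at each position

rng = random.Random(0)

def make_table(rows, dim):
    """Create a randomly initialised embedding table as a list of rows."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]

token_table = make_table(VOCAB_SIZE, DIM)
ngram_table = make_table(NGRAM_BUCKETS, DIM)

def ngram_bucket(ngram):
    """Map a tuple of token ids to a table row with a polynomial rolling hash."""
    h = 0
    for tok in ngram:
        h = (h * 1_000_003 + tok + 1) % (2**61 - 1)
    return h % NGRAM_BUCKETS

def embed(token_ids):
    """Return one vector per position: token embedding plus hashed N-gram rows."""
    out = []
    for i, tok in enumerate(token_ids):
        vec = list(token_table[tok])
        for n in range(2, MAX_N + 1):
            start = i - n + 1
            if start < 0:
                continue  # not enough left context for an n-gram of this order
            row = ngram_table[ngram_bucket(tuple(token_ids[start:i + 1]))]
            vec = [a + b for a, b in zip(vec, row)]
        out.append(vec)
    return out

vectors = embed([5, 17, 17, 42])
print(len(vectors), len(vectors[0]))
```

The point of the sketch is that the table can grow to a very large number of rows while each position still reads only a handful of them, which is the sense in which embedding scaling adds parameters without adding much per-token compute.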

Benchmark Results & Performance Metrics

The model's clearest advantage is in agentic tasks. On the Tau2-Retail benchmark, LongCat Flash Lite scored 73.10, 15.8 points above the 57.3 recorded by Qwen3-Next-80B-A3B-Instruct. It also scored 72.80 on Tau2-Telecom and 54.40 on SWE-Bench for agentic coding. General-domain results remain competitive, and the developers credit the N-gram embedding design with lower inference latency than comparable MoE systems.

Expert Analysis & Developer Insights

The development team argues that traditional scaling methods are hitting a ceiling. The technical paper states,

'While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks.'

This insight drove the creation of the N-gram embedding approach, which serves as an orthogonal dimension for scaling. The project documentation highlights the system-level benefits, noting,

'In contrast to FFN-based experts, the N-gram embedding table inherently mitigates I/O bottlenecks within MoE layers, yielding substantial improvements in inference latency.'

Furthermore, the researchers claim they

'identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to increasing the number of experts.'
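One way to read the I/O claim quoted above is as a difference in how many weight bytes each component must fetch for a single token. The back-of-the-envelope sketch below compares the two under that reading. Every size in it is hypothetical (hidden width, expert width, number of routed experts, rows fetched) and none is taken from the model. It also models the worst case, where expert weights are not shared across a batch of tokens.

```python
# All sizes below are hypothetical and chosen only for illustration.
D_MODEL = 2048            # hidden width
D_FF = 1024               # expert intermediate width
BYTES_PER_PARAM = 2       # 16-bit weights
EXPERTS_PER_TOKEN = 8     # routed FFN experts activated per token
NGRAM_ROWS_PER_TOKEN = 4  # embedding rows fetched per token

# A gated FFN expert holds three D_MODEL x D_FF matrices (gate, up, down).
expert_bytes = 3 * D_MODEL * D_FF * BYTES_PER_PARAM
ffn_bytes_per_token = EXPERTS_PER_TOKEN * expert_bytes

# An embedding lookup reads one D_MODEL-wide row per fetched entry.
embedding_bytes_per_token = NGRAM_ROWS_PER_TOKEN * D_MODEL * BYTES_PER_PARAM

ratio = ffn_bytes_per_token / embedding_bytes_per_token
print(f"FFN experts:   {ffn_bytes_per_token / 1e6:.1f} MB of weights per token")
print(f"N-gram lookup: {embedding_bytes_per_token / 1e3:.1f} KB of weights per token")
print(f"ratio:         {ratio:.0f}x")
```

With larger batches, tokens routed to the same expert share one weight read, so the real gap depends heavily on batch size and routing. The sketch shows only why parameters stored in a lookup table are cheap to read compared with parameters stored in expert matrices.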