
Berkeley | Blockwise Parallel Transformer for Long-Context Large Models

Large Models · Published 2023 · 智源社区 (BAAI Community)

Blockwise Parallel Transformer for Long Context Large Models

Hao Liu, Pieter Abbeel
[UC Berkeley]


Key points:

  • Motivation: address the memory demands imposed by the self-attention mechanism and the large feedforward network in Transformers, so that tasks involving long sequences and long-range dependencies become tractable.
  • Method: propose the Blockwise Parallel Transformer (BPT), which computes self-attention blockwise and fuses the feedforward network into the same blockwise pass to minimize memory cost (see the sketch after the abstract below).
  • Advantages: BPT can train on sequences up to 32 times longer than a vanilla Transformer and 2 to 4 times longer than previous memory-efficient methods. Extensive experiments on language modeling and reinforcement learning tasks demonstrate its effectiveness in reducing memory requirements and improving performance.

In short, BPT lowers memory requirements for long sequences and long-range dependency tasks by computing self-attention blockwise and fusing the feedforward network into that blockwise computation.

Transformers have emerged as the cornerstone of state-of-the-art natural language processing models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands posed by the self-attention mechanism and the large feedforward network in Transformers limit their ability to handle long sequences, thereby creating challenges for tasks involving multiple long sequences or long-term dependencies. We present a distinct approach, Blockwise Parallel Transformer (BPT), that leverages blockwise computation of self-attention and feedforward network fusion to minimize memory costs. By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences up to 32 times longer than vanilla Transformers and 2 to 4 times longer than previous memory-efficient methods. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of BPT in reducing memory requirements and improving performance.

https://arxiv.org/abs/2305.19370 
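To make the fused blockwise computation concrete, below is a minimal single-head sketch in JAX: attention is accumulated over key/value blocks with a numerically stable online softmax, and each query block's attention output is passed through the feedforward network immediately, so neither the full attention matrix nor the full [seq_len, d_ff] activation is ever materialized. The function name, block sizes, and parameters (blockwise_attention_ffn, q_block, kv_block, w1, w2) are illustrative assumptions rather than the authors' reference implementation; residual connections, layer normalization, multiple heads, and causal masking are omitted for brevity.

```python
# Minimal single-head sketch of BPT-style blockwise attention with a fused
# feedforward pass, written in JAX. Block sizes, parameter names, and the
# single-head / no-residual simplifications are assumptions for illustration.
import jax
import jax.numpy as jnp


def blockwise_attention_ffn(q, k, v, w1, w2, q_block=128, kv_block=128):
    """q, k, v: [seq_len, d_model]; w1: [d_model, d_ff]; w2: [d_ff, d_model]."""
    seq_len, d_model = q.shape
    scale = 1.0 / jnp.sqrt(d_model)
    outputs = []
    # Outer loop over query blocks: each block's attention output goes through
    # the feedforward network right away, so the full attention matrix and the
    # full [seq_len, d_ff] activation are never materialized.
    for qs in range(0, seq_len, q_block):
        qb = q[qs:qs + q_block] * scale                    # [bq, d_model]
        acc = jnp.zeros((qb.shape[0], d_model))            # running sum of softmax weights @ V
        row_sum = jnp.zeros((qb.shape[0], 1))              # running softmax denominator
        row_max = jnp.full((qb.shape[0], 1), -jnp.inf)     # running max of logits
        # Inner loop over key/value blocks with online softmax rescaling.
        for ks in range(0, seq_len, kv_block):
            kb = k[ks:ks + kv_block]
            vb = v[ks:ks + kv_block]
            logits = qb @ kb.T                             # [bq, bk]
            new_max = jnp.maximum(row_max, logits.max(axis=-1, keepdims=True))
            correction = jnp.exp(row_max - new_max)        # rescale old accumulators
            p = jnp.exp(logits - new_max)
            acc = acc * correction + p @ vb
            row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
            row_max = new_max
        attn_out = acc / row_sum                           # attention output for this block
        # Fused feedforward on the block (residuals and LayerNorm omitted).
        outputs.append(jax.nn.gelu(attn_out @ w1) @ w2)
    return jnp.concatenate(outputs, axis=0)


if __name__ == "__main__":
    key = jax.random.PRNGKey(0)
    kq, kp = jax.random.split(key)
    x = jax.random.normal(kq, (1024, 64))
    w1 = 0.02 * jax.random.normal(kp, (64, 256))
    w2 = 0.02 * jax.random.normal(kp, (256, 64))
    print(blockwise_attention_ffn(x, x, x, w1, w2).shape)  # (1024, 64)
```

In this sketch the per-step memory scales with the block sizes q_block and kv_block rather than with the full sequence length, which is the general mechanism by which blockwise computation trades a different execution order for a much smaller activation footprint.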
