persiaml/persia

WARNING: THIS PROJECT IS CURRENTLY IN MAINTENANCE MODE, DUE TO COMPANY REORGANIZATION


PERSIA (Parallel rEcommendation tRaining System with hybrId Acceleration) is developed by the AI Platform at Kuaishou Technology, in collaboration with ETH. It is a PyTorch-based system (the first public one, to the best of our knowledge) for training large-scale deep learning recommendation models on commodity hardware. It is capable of training recommendation models with up to 100 trillion parameters; to the best of our knowledge, this is the largest model size in recommendation systems so far. Empirical studies on public datasets indicate PERSIA's significant advantage over several other existing training systems for recommendation [1]. Its efficiency and robustness have also been validated by multiple applications with more than 100 million DAU at Kuaishou.

Disclaimer: The program is usable and has served several important businesses. However, the official English documentation and tutorials are still under heavy construction and a bit raw for now. We encourage adventurers to try out PERSIA and contribute!

News

  • Training Deep Learning-based recommender models of 100 trillion parameters over Google Cloud
  • 突破百万亿参数规模,追求极致的效率和性价比:华人团队开源首个异构并行推荐系统训练框架 PERSIA (In Chinese. Title: Breaking Through the 100-Trillion-Parameter Scale in Pursuit of Ultimate Efficiency and Cost-Effectiveness: Chinese Team Open-Sources PERSIA, the First Heterogeneous Parallel Recommendation Training Framework)
  • 参数量卷到一百万亿!华人团队开源史上最大的推荐训练系统 PERSIA (In Chinese. Title: Parameter Count Reaches 100 Trillion! Chinese Team Open-Sources PERSIA, the Largest Recommendation Training System to Date)
  • AI Engines in the "Short-video" Era: Eating 100 Trillion Parameters, Invited talk, Facebook, 2021.
  • 单机训练速度提升 640 倍!独家解读快手商业广告模型 GPU 训练平台 PERSIA (In Chinese. Title: 640x Faster GPU Based Learning System for Ad Recommendation)
    • [AI Front] [中国日报] [InfoQ] [CSDN] [Tencent Cloud News] [AcFun]
  • 创新、平衡与大格局:快手商业化的慢与快 (In Chinese. Title: Innovation, Balance, and the Big Picture: The Slow and the Fast of Kuaishou Commercialization)
    • [TechSir] [China Daily] [Sohu]

Links

  • GitHub Repository
  • Tutorials
  • API documentation (Under Construction)

Discussion

Feel free to join our Telegram Group for discussion!

References

  1. Xiangru Lian, Binhang Yuan, Xuefeng Zhu, Yulong Wang, Yongjun He, Honghuan Wu, Lei Sun, Haodong Lyu, Chengjun Liu, Xing Dong, Yiqiao Liao, Mingnan Luo, Congfei Zhang, Jingru Xie, Haonan Li, Lei Chen, Renjie Huang, Jianying Lin, Chengchun Shu, Xuezhong Qiu, Zhishan Liu, Dongying Kong, Lei Yuan, Hai Yu, Sen Yang, Ce Zhang, & Ji Liu. (2021). Persia: A Hybrid System Scaling Deep Learning Based Recommenders up to 100 Trillion Parameters.

  2. Ji Liu & Ce Zhang. (2021). Distributed Learning Systems with First-order Methods.

License

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

Issues

Collection of the latest Issues

zxgx

I adopted PERSIA to implement a DLRM and ran it on the Criteo Kaggle dataset.
I set the batch size to 1024; below is the content of docker-compose.yml:

Here's a screenshot of the running process.

As you can see, the throughput is about 30 it/s. Since the batch size is 1024, the throughput is only half of the result reported in your paper. I also noticed that the logger kept warning that the local forwarded queue is empty, and these processes didn't consume any GPU memory. Is there a problem with my settings, or do you have any suggestions on how to improve the throughput?

williamstar

Add an EmbeddingModel class to manage the current embedding tensors. It is a torch.nn.Module instantiated from embedding_config.yml.

Features:

  • The EmbeddingModel can be initialized from embedding_config.yml.
  • Support more attention operations on attention_embedding_tensor, such as mean and max.
  • Define the embedding initialization method in embedding_config.yml.
  • Abstract attention_embedding and raw_embedding into an Embedding Python class. This makes the EmbeddingModel more interpretable to users, especially in the forward and backward phases.
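A minimal pure-Python sketch of the proposed interface (the class shape, config schema, and all names here are assumptions for illustration, not the actual PERSIA API; the real class would be a torch.nn.Module):

```python
import random

class EmbeddingModel:
    """Sketch: embedding tables configured by a parsed embedding_config.yml."""

    def __init__(self, config, seed=0):
        # config mirrors a hypothetical embedding_config.yml after YAML
        # parsing: {slot_name: {"dim": int, "init": "zeros" | "uniform"}}.
        self.config = config
        self.tables = {slot: {} for slot in config}
        self.rng = random.Random(seed)

    def _new_vector(self, slot):
        # Initialization method is chosen per slot in the config.
        dim = self.config[slot]["dim"]
        if self.config[slot].get("init", "zeros") == "uniform":
            return [self.rng.uniform(-0.01, 0.01) for _ in range(dim)]
        return [0.0] * dim  # default: zero initialization

    def lookup(self, slot, feature_id):
        # Embeddings are created lazily the first time an ID is seen.
        table = self.tables[slot]
        if feature_id not in table:
            table[feature_id] = self._new_vector(slot)
        return table[feature_id]

    def pooled(self, slot, feature_ids, op="mean"):
        # Pool the embeddings of one slot with a configurable reduction,
        # as proposed for attention_embedding_tensor (mean, max, sum).
        vecs = [self.lookup(slot, f) for f in feature_ids]
        if op == "max":
            return [max(col) for col in zip(*vecs)]
        if op == "mean":
            return [sum(col) / len(vecs) for col in zip(*vecs)]
        return [sum(col) for col in zip(*vecs)]  # "sum"
```

For example, `EmbeddingModel({"user_id": {"dim": 8, "init": "uniform"}})` would pool all `user_id` features of a sample into a single 8-dimensional vector.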

nealgavin

Each feature has a certain probability of being created. The probability is accumulated and calculated from the positive and negative sample probabilities of the feature: a common slot uses the shared positive sample probability create_clk_prob and negative sample probability create_nonclk_prob, while special slots (those specified in select_prob_slots) are calculated according to the create_clk_prob and create_nonclk_prob of the corresponding select_prob_slot.
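The admission rule above can be sketched as follows (a hedged illustration: the field names create_clk_prob, create_nonclk_prob, and select_prob_slots come from the description; the function signature and config structure are assumptions):

```python
import random

def should_create_feature(slot, is_click, config, rng=random):
    """Decide whether a new feature for `slot` is created (admitted).

    Common slots use the shared create_clk_prob / create_nonclk_prob;
    special slots listed in select_prob_slots carry their own pair of
    probabilities. The shape of `config` is illustrative only.
    """
    special = config.get("select_prob_slots", {})
    probs = special.get(slot, config["common"])
    p = probs["create_clk_prob"] if is_click else probs["create_nonclk_prob"]
    return rng.random() < p
```

With create_clk_prob higher than create_nonclk_prob, a clicked (positive) sample is more likely to instantiate a new feature than a negative one.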

nealgavin
  1. feature_score: Each feature has a feature_score, calculated as feature_score = clk_coeff * pos_ins_num + nonclk_coeff * neg_ins_num. On push, the feature_score is always accumulated.
  2. time_decay: end_day (the sample time) triggers time_decay; the feature_score is attenuated by CVM_plugin.decay_ratio.
  3. shrink_table: Triggered after time_decay, or when feature_num exceeds max_features during training. A feature is deleted if one of the following conditions is met:
     ● score < _delete_threshold
     ● value.unseen_days > _delete_after_unseen_days
     ● _select_prob_slot_set.get(value.slot) == 1 && score < _select_delete_threshold (ID / group-ID type features, which get an increased delete threshold)
     ● _photoid_slot_set.get(value.slot) == 1 && value.unseen_days > 2 (combined photo_id features are deleted after two unseen days)
     If feature_num is still greater than max_features after deletion, features are deleted in ascending order of feature_score until feature_num < max_features is satisfied.
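The three steps can be sketched in plain Python (a simplified illustration that keeps only the common-slot thresholds and omits the special-slot cases; apart from the names taken from the description, everything here is an assumption):

```python
def accumulate_score(score, clk_coeff, pos_ins_num, nonclk_coeff, neg_ins_num):
    # On push: feature_score += clk_coeff * pos_ins_num + nonclk_coeff * neg_ins_num
    return score + clk_coeff * pos_ins_num + nonclk_coeff * neg_ins_num

def time_decay(table, decay_ratio):
    # Triggered by end_day: attenuate every feature_score by decay_ratio.
    for value in table.values():
        value["score"] *= decay_ratio
        value["unseen_days"] += 1

def shrink_table(table, delete_threshold, delete_after_unseen_days, max_features):
    # Delete features below the score threshold or unseen for too long; if the
    # table is still too large, evict the lowest-scoring features first.
    doomed = [fid for fid, v in table.items()
              if v["score"] < delete_threshold
              or v["unseen_days"] > delete_after_unseen_days]
    for fid in doomed:
        del table[fid]
    if len(table) > max_features:
        by_score = sorted(table, key=lambda fid: table[fid]["score"])
        for fid in by_score[: len(table) - max_features]:
            del table[fid]
```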
NOBLES5E

Documentation: @karoka

  • Tutorials
    • Introduction
    • Installation @williamstar
    • Getting Started @williamstar
    • Benchmark (experiments * 3)
      • XDL @snowpeakz
      • Paddle @snowpeakz
    • Data processing @snowpeakz
    • Monitoring @snowpeakz
    • Inference @snowpeakz
    • Model checkpointing @williamstar
    • Configuration @snowpeakz
    • Troubleshooting @williamstar
    • Kubernetes Integration @williamstar
    • Advanced: @williamstar
      • Raw embedding
      • HashStack
      • Separate data loader
  • API documentation @williamstar https://persiaml.pages.dev/
  • add system test (with buildkite)
    • train, test (assert number of samples, train and test auc etc.) @williamstar
    • load, dump @snowpeakz
    • inference @snowpeakz
    • move e2e test script to examples @williamstar

Bug:

  • persia-core PyForward && PyBackward thread handler exit with exceptions
  • insert embedding in offline inference phase

Feature:

  • python launcher @williamstar
  • cpu backward with gloo backend @williamstar
  • ddp launch with tcp to replace env_file @snowpeakz
  • support epoch level dataloader @williamstar
  • easy pip install persiaml @williamstar
  • unified storage trait, saving both dense & sparse parameters @snowpeakz
  • support s3 @williamstar

Experiments:

  • Single card + force enable communication
  • Change figure titles

Release

  • merge server code with python code @NOBLES5E
  • GitHub releases @williamstar
  • all persia container images @williamstar
  • kubernetes operator @williamstar

  • NATS integration @snowpeakz
  • multi-node multi-card usable demo @williamstar
  • fix remaining bugs (like in data processing) @williamstar
  • publish PyPI wheels on CI @NOBLES5E
  • API documentation website @NOBLES5E

Information - Updated Sep 14, 2022

Stars: 311
Forks: 40
Issues: 22
