
Research

Where do you get the numbers from?

Widely used generative AI models such as ChatGPT are proprietary and/or accessed through APIs to cloud-based services. As a result, directly measuring the energy consumption or carbon footprint of their underlying infrastructure is not possible. To address this, we primarily focus on open-source models.

 

For each of the tasks or modalities found in the corpus (text-to-text, text-to-image, etc.), we chose appropriate open-source models and performed inference experiments on local hardware. We used a desktop workstation with an Intel i7 processor, 16 GB of RAM, and an NVIDIA RTX 3090 GPU with 24 GB of memory. Energy consumption and carbon footprint were measured with Carbontracker [1].
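Carbontracker wraps the measured workload in start/stop calls around one or more "epochs". The minimal sketch below shows how a single inference run can be instrumented with the carbontracker Python package; run_inference is a hypothetical placeholder for one forward pass of whichever proxy model is being measured, and the log directory name is only illustrative.

    from carbontracker.tracker import CarbonTracker

    def run_inference():
        ...  # placeholder: one forward pass of the proxy model under test

    # One "epoch" here corresponds to one measured block of work.
    tracker = CarbonTracker(epochs=1, log_dir="./ct_logs")  # log_dir is illustrative

    tracker.epoch_start()
    run_inference()
    tracker.epoch_end()

    tracker.stop()  # writes the measured energy (kWh) and CO2eq estimate to the logs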

 

For the most popular tasks identified in the CHI 2024 corpus (which we assumed to be representative of HCI research more broadly), we performed comprehensive measurements across different settings; a code sketch of one such sweep follows the list below.

 

  • text-to-image: Stable Diffusion XL [2] was the proxy model for all text-to-image generation tasks. Inference runs were carried out with varying generated image resolutions: [128, 256, …, 1024].

  • text-to-text: Llama-3.1-Instruct [3] was the proxy model for all text-to-text tasks. Inference runs were performed with varying token lengths: [128, 256, …, 1024].

  • audio-to-text: Whisper [4], specifically the large-v2 version, was the proxy model for all audio-to-text tasks. Inference runs were performed for transcribing audio signals of different lengths: [60, 120, …, 600] seconds.
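The sketch below illustrates one of these sweeps, the text-to-image case; the Hugging Face checkpoint name, the prompt, and the subset of resolutions shown are assumptions for illustration rather than the exact configuration used. The token-length sweep for Llama-3.1-Instruct and the audio-duration sweep for Whisper large-v2 follow the same pattern.

    import torch
    from diffusers import StableDiffusionXLPipeline
    from carbontracker.tracker import CarbonTracker

    # Load the proxy model once so that loading costs are not attributed
    # to any single measured setting.
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",  # assumed checkpoint
        torch_dtype=torch.float16,
    ).to("cuda")

    resolutions = [128, 256, 512, 1024]  # illustrative subset of the sweep above
    tracker = CarbonTracker(
        epochs=len(resolutions),
        monitor_epochs=-1,  # measure every setting, not only the first epoch
        log_dir="./ct_logs/text_to_image",  # illustrative path
    )

    for res in resolutions:
        tracker.epoch_start()
        pipe("a watercolor sketch of a wind turbine", height=res, width=res)  # illustrative prompt
        tracker.epoch_end()

    tracker.stop()  # measured energy and CO2eq are written to the log directory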

 

The inference energy consumption of these three models for different task settings is shown in the figure below.

[Figure: results.png, inference energy consumption of the three proxy models across task settings.]

The energy consumption is measured for an aggregate of 10 model runs to amortize the initial model-loading energy cost. The standard deviation is computed over 10 repeats per setting. We see an almost linear increase in energy consumption with increasing task load, whether that is the number of prompts or the duration of the audio.
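To make this protocol concrete, the sketch below shows the structure for a single setting: the model is loaded once beforehand, each measured block aggregates 10 inference calls, and the block is repeated 10 times to obtain a mean and standard deviation. run_inference and the log path are hypothetical placeholders, and the aggregate is read back with Carbontracker's log parser.

    from carbontracker.tracker import CarbonTracker
    from carbontracker import parser

    RUNS_PER_BLOCK = 10  # inference calls aggregated per measured block
    REPEATS = 10         # repeats per setting, used for the standard deviation

    def run_inference():
        ...  # placeholder: one forward pass at a fixed setting (model already loaded)

    tracker = CarbonTracker(
        epochs=REPEATS,
        monitor_epochs=-1,              # measure every repeat
        log_dir="./ct_logs/aggregate",  # illustrative path
    )
    for _ in range(REPEATS):
        tracker.epoch_start()
        for _ in range(RUNS_PER_BLOCK):
            run_inference()
        tracker.epoch_end()
    tracker.stop()

    # Total consumption across the repeats; per-repeat values for the standard
    # deviation can be recovered from the per-epoch entries in the log files.
    parser.print_aggregate(log_dir="./ct_logs/aggregate")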


For the remaining tasks that are less common in the CHI 2024 papers (text-to-3D, image-to-video, etc.), we measured the energy consumption for a fixed task load (19 repeats of a single inference) instead of exploring several settings; a sketch of this fixed-load loop follows the model list below.


We used the following models for the less common tasks:

  • text-to-video: AnimateDiff [5]
  • text-to-3D: Shap-E [6]
  • text-to-audio: MusicGen [7]
  • image-to-text: BLIP [8]
  • image-to-image: Instruct-Pix2Pix [9]
  • image-to-3D: One-2-3-45 [10]
  • video-to-text: XCLIP [11]
  • video-to-video: RIFE [12]
  • audio-to-video: FreeVC [13]
  • image-to-video: SadTalker [14]
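A compact way to organise these runs is a registry mapping each task to its proxy model, driven by the same fixed-load measurement loop. The sketch below is illustrative only: load_proxy and run_once are hypothetical stand-ins for each model's own loading and inference code, and the log paths are assumptions.

    from carbontracker.tracker import CarbonTracker

    # Task-to-proxy-model registry mirroring the list above.
    PROXY_MODELS = {
        "text-to-video": "AnimateDiff",
        "text-to-3D": "Shap-E",
        "text-to-audio": "MusicGen",
        "image-to-text": "BLIP",
        "image-to-image": "Instruct-Pix2Pix",
        "image-to-3D": "One-2-3-45",
        "video-to-text": "XCLIP",
        "video-to-video": "RIFE",
        "audio-to-video": "FreeVC",
        "image-to-video": "SadTalker",
    }

    N_REPEATS = 19  # repeats of a single inference at a fixed task load, as above

    def load_proxy(name):
        ...  # placeholder: load the named proxy model

    def run_once(model):
        ...  # placeholder: one inference at the fixed task load

    for task, name in PROXY_MODELS.items():
        model = load_proxy(name)
        tracker = CarbonTracker(
            epochs=N_REPEATS,
            monitor_epochs=-1,            # measure every repeat
            log_dir=f"./ct_logs/{task}",  # illustrative per-task log path
        )
        for _ in range(N_REPEATS):
            tracker.epoch_start()
            run_once(model)
            tracker.epoch_end()
        tracker.stop()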

 

References

  1. Anthony, Lasse F. Wolff, Benjamin Kanding, and Raghavendra Selvan. "Carbontracker: Tracking and predicting the carbon footprint of training deep learning models." arXiv preprint arXiv:2007.03051 (2020).

  2. Podell, Dustin, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. "SDXL: Improving latent diffusion models for high-resolution image synthesis." arXiv preprint arXiv:2307.01952 (2023).

  3. Grattafiori, Aaron, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman et al. "The Llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024).

  4. Radford, Alec, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. "Robust speech recognition via large-scale weak supervision." In International conference on machine learning, pp. 28492-28518. PMLR, 2023.

  5. Lin, Shanchuan, and Xiao Yang. "AnimateDiff-Lightning: Cross-model diffusion distillation." arXiv preprint arXiv:2403.12706 (2024).

  6. Jun, Heewoo, and Alex Nichol. "Shap-E: Generating conditional 3D implicit functions." arXiv preprint arXiv:2305.02463 (2023).

  7. Copet, Jade, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre Défossez. "Simple and controllable music generation." Advances in Neural Information Processing Systems 36 (2023): 47704-47720.

  8. Li, Junnan, Dongxu Li, Caiming Xiong, and Steven Hoi. "BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation." In International conference on machine learning, pp. 12888-12900. PMLR, 2022.

  9. Brooks, Tim, Aleksander Holynski, and Alexei A. Efros. "InstructPix2Pix: Learning to follow image editing instructions." In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 18392-18402. 2023.

  10. Liu, Minghua, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. "One-2-3-45: Any single image to 3D mesh in 45 seconds without per-shape optimization." Advances in Neural Information Processing Systems 36 (2023): 22226-22246.

  11. Ni, Bolin, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. "Expanding language-image pretrained models for general video recognition." In European conference on computer vision, pp. 1-18. Cham: Springer Nature Switzerland, 2022.

  12. Huang, Zhewei, Tianyuan Zhang, Wen Heng, Boxin Shi, and Shuchang Zhou. "Real-time intermediate flow estimation for video frame interpolation." In European Conference on Computer Vision, pp. 624-642. Cham: Springer Nature Switzerland, 2022.

  13. Li, Jingyi, Weiping Tu, and Li Xiao. "FreeVC: Towards high-quality text-free one-shot voice conversion." In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5. IEEE, 2023.

  14. Zhang, Wenxuan, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, and Fei Wang. "SadTalker: Learning realistic 3D motion coefficients for stylized audio-driven single image talking face animation." In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8652-8661. 2023.
