
Posit AI Blog: Hugging Face Integrations



We are happy to announce that the first releases of hfhub and tok are now on CRAN. hfhub is an R interface to the Hugging Face Hub that lets users download and cache files from the Hub, while tok implements R bindings for the Hugging Face tokenizers library.

Hugging Face quickly became the platform for creating, sharing, and collaborating on deep learning applications, and we hope these integrations will help R users get started with Hugging Face tools as well as build novel applications.

We have also previously announced the safetensors package, which allows reading and writing files in the safetensors format.

hfhub

hfhub is an R interface for Hugging Face Hub. Currently, hfhub implements a single functionality: downloading files from Hub repositories. Model Hub repositories are primarily used to store pre-trained model weights along with any other metadata needed to load the model, such as hyperparameter settings and tokenizer vocabulary.

Downloaded files are stored using the same layout as the Python library, so cached files can be shared between the R and Python implementations, which makes switching between languages easier and faster.

We already use hfhub in the minhub package and in the blog post ‘GPT-2 from scratch with torch’ to download pre-trained weights from the Hugging Face Hub.

You can use hub_download() to download any file from a Hugging Face Hub repository by specifying the repository ID and the path to the file you want to download. If the file is already in the cache, the function returns the file path immediately; otherwise the file is downloaded, cached, and then the path is returned.
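As a rough sketch of that flow, assuming hfhub is installed from CRAN and that the gpt2 repository contains a config.json file (both the repository and the file name are picked here only for illustration):

```r
library(hfhub)

# Repository ID and file path are illustrative choices, not requirements.
# config.json is small, so this is a cheap way to try the cache behavior.
path <- hub_download("gpt2", "config.json")

# The first call downloads and caches the file; the return value is a local path.
print(path)

# A second call should find the file in the cache and return the same path.
path_again <- hub_download("gpt2", "config.json")
print(identical(path, path_again))
```

The returned path can be passed to whatever reader fits the file type, for example a JSON parser for configuration files.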

tok

For background on where tokenization fits into NLP pipelines, see the blog post ‘What are large language models? What are they not?’.

When using a pre-trained model (for both inference and fine-tuning), it is very important to use the exact same tokenization process that was used during training, and the Hugging Face team has done an incredible job making sure their algorithms match the tokenization strategies used by the majority of LLMs.

tok provides R bindings to the 🤗 tokenizers library. The tokenizers library is itself implemented in Rust for performance, and our bindings use the extendr project to help interface with R. Using tok we can tokenize text in exactly the same way that most NLP models do, which makes it easier to load pre-trained models in R and to share our models with the wider NLP community.

tok can be installed from CRAN, and its use is currently restricted to loading tokenizer vocabularies from files. For example, you can load the tokenizer for the GPT2 model and use it to encode and decode text.
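A minimal sketch of what that can look like, assuming tok is installed from CRAN and that the tokenizer is fetched from the gpt2 repository on the Hub; the example sentence and the variable names are only illustrative:

```r
# Load the GPT-2 tokenizer; "gpt2" is the Hub repository ID.
gpt2_tokenizer <- tok::tokenizer$from_pretrained("gpt2")

text <- "Hello world! You can use tokenizers from R"

# encode() returns an encoding object; $ids holds the integer token ids.
ids <- gpt2_tokenizer$encode(text)$ids
print(ids)

# decode() maps the ids back to text.
decoded <- gpt2_tokenizer$decode(ids)
print(decoded)
```

The ids are what a pre-trained model expects as input, which is why the tokenizer has to match the one used during training.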

Remember that you can also host Shiny apps (for R and Python) on Hugging Face Spaces. As an example, we have built a Shiny app that uses:

  • torch to implement GPT-NeoX (the neural network architecture of StableLM, the model used for chatting)
  • hfhub to download and cache pre-trained weights from the StableLM repository
  • tok to tokenize and preprocess text as input to the torch model. tok also uses hfhub to download the tokenizer vocabulary.
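The interplay described in the last bullet can be sketched by hand: download the tokenizer file with hfhub, then load it with tok. This is not the app's actual code. It assumes the gpt2 repository ships a tokenizer.json file and uses the from_file() method of tok's tokenizer class; the prompt text is made up for illustration.

```r
# Fetch the serialized tokenizer from the Hub (cached after the first call).
# "gpt2" and "tokenizer.json" are illustrative assumptions.
tokenizer_path <- hfhub::hub_download("gpt2", "tokenizer.json")

# Build a tokenizer from the local file.
tokenizer <- tok::tokenizer$from_file(tokenizer_path)

# Turn a prompt into the integer ids a torch model would take as input.
prompt_ids <- tokenizer$encode("What is R?")$ids
print(prompt_ids)
```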

The application is hosted on this Space. It currently runs on the CPU, but you can easily change the Docker image if you want to run it on a GPU for faster inference.

The app’s source code is also open source and can be found in the Space’s Files tab.

Thinking about the future

These are the early days of hfhub and tok, and there is still a lot of work to do and functionality to implement. We hope to get help from the community to prioritize the work, so if a feature you need is missing, please open an issue in the GitHub repositories.

Re-use

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Figures that have been reused from other sources are not covered by this license and can be recognized by a note in their caption: “Figure from…”.

Citation

For attribution, please cite this work as

Falbel (2023, July 12). Posit AI Blog: Hugging Face Integrations. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2023-07-12-hugging-face-integrations/

BibTeX Citation

@misc{hugging-face-integrations,
  author = {Falbel, Daniel},
  title = {Posit AI Blog: Hugging Face Integrations},
  url = {https://blogs.rstudio.com/tensorflow/posts/2023-07-12-hugging-face-integrations/},
  year = {2023}
}
