
State-of-the-art NLP models from R



Introduction

The Transformers repository from Hugging Face contains a number of ready-to-use, state-of-the-art models, which are straightforward to download and fine-tune with TensorFlow and Keras.

To do so, users typically need to obtain:

  • The model itself (e.g. BERT, ALBERT, RoBERTa, GPT-2, etc.)
  • The tokenizer object
  • The weights of the model.

In this post, we will work on a classic binary classification task and train our dataset on 3 models: GPT-2, RoBERTa, and ELECTRA.

However, readers should be aware that transformers can be applied to a variety of downstream tasks, such as:

  1. feature extraction
  2. sentiment analysis
  3. text classification
  4. question answering
  5. summarization
  6. translation and many more.

Prerequisites

Our first job is to install the transformers package via reticulate.

reticulate::py_install('transformers', pip = TRUE)

Then, as usual, load the standard 'Keras' and 'TensorFlow' (>= 2.0) packages along with some classic R libraries.
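A minimal setup sketch follows; the exact set of libraries is an assumption based on the functions used later in this post (tensor_slices_dataset() and friends come from tfdatasets, and the Python module is imported via reticulate under the name transformer):

library(keras)
library(tensorflow)
library(dplyr)
library(tfdatasets)

# import the Python transformers module; the code below refers to it as `transformer`
transformer = reticulate::import('transformers')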

Please note that if you run TensorFlow on a GPU, the following parameters can be specified to avoid memory issues.

physical_devices = tf$config$list_physical_devices('GPU')
tf$config$experimental$set_memory_growth(physical_devices[[1]], TRUE)

tf$keras$backend$set_floatx('float32')

Template

We already mentioned that to train data on a specific model, users need to download the model, its tokenizer object, and its weights. For example, to get a RoBERTa model you have to do the following:

# get Tokenizer
transformer$RobertaTokenizer$from_pretrained('roberta-base', do_lower_case=TRUE)

# get Model with weights
transformer$TFRobertaModel$from_pretrained('roberta-base')

Data preparation

A dataset for binary classification is provided in the text2vec package. Let's load the dataset and take a sample for fast model training.
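As a sketch (this assumes the movie_review data shipped with text2vec; the columns are renamed here so they match the comment_text/target fields used in the training loop below):

library(text2vec)

# take a small sample so the models train quickly
df = movie_review %>% 
  dplyr::rename(comment_text = review, target = sentiment) %>% 
  dplyr::sample_n(2000)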

Split our data into 2 parts:

idx_train = sample.int(nrow(df)*0.8)

train = df[idx_train, ]
test = df[-idx_train, ]

Data input for Keras

So far, we have only covered data import and the train-test split. To feed input to the network, we have to convert our plain text into indices via the imported tokenizer, and then adapt the model to perform binary classification by adding a dense layer with a single unit at the end.
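For illustration, assuming the RoBERTa tokenizer from the Template section, a single sentence becomes a vector of token ids (the example sentence is made up):

tokenizer = transformer$RobertaTokenizer$from_pretrained('roberta-base', do_lower_case=TRUE)

# plain text -> integer token ids, truncated to at most 50 tokens
tokenizer$encode("This movie was surprisingly good!", max_length = 50L, truncation = TRUE)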

However, we want to train our data on 3 models: GPT-2, RoBERTa, and ELECTRA. We need to write a loop for that.

Note: a generic model requires 500-700 MB.

# list of 3 models
ai_m = list(
  c('TFGPT2Model',       'GPT2Tokenizer',       'gpt2'),
  c('TFRobertaModel',    'RobertaTokenizer',    'roberta-base'),
  c('TFElectraModel',    'ElectraTokenizer',    'google/electra-small-generator')
)

# parameters
max_len = 50L
epochs = 2
batch_size = 10

# create a list for model results
gather_history = list()

for (i in 1:length(ai_m)) {
  
  # tokenizer
  tokenizer = glue::glue("transformer${ai_m[[i]][2]}$from_pretrained('{ai_m[[i]][3]}',
                         do_lower_case=TRUE)") %>% 
    rlang::parse_expr() %>% eval()
  
  # model
  model_ = glue::glue("transformer${ai_m[[i]][1]}$from_pretrained('{ai_m[[i]][3]}')") %>% 
    rlang::parse_expr() %>% eval()
  
  # inputs
  text = list()
  # outputs
  label = list()
  
  data_prep = function(data) {
    for (i in 1:nrow(data)) {
      
      txt = tokenizer$encode(data[['comment_text']][i], max_length = max_len, 
                             truncation = T) %>% 
        t() %>% 
        as.matrix() %>% list()
      lbl = data[['target']][i] %>% t()
      
      text = text %>% append(txt)
      label = label %>% append(lbl)
    }
    list(do.call(plyr::rbind.fill.matrix, text), do.call(plyr::rbind.fill.matrix, label))
  }
  
  train_ = data_prep(train)
  test_ = data_prep(test)
  
  # slice dataset
  tf_train = tensor_slices_dataset(list(train_[[1]], train_[[2]])) %>% 
    dataset_batch(batch_size = batch_size, drop_remainder = TRUE) %>% 
    dataset_shuffle(128) %>% dataset_repeat(epochs) %>% 
    dataset_prefetch(tf$data$experimental$AUTOTUNE)
  
  tf_test = tensor_slices_dataset(list(test_[[1]], test_[[2]])) %>% 
    dataset_batch(batch_size = batch_size)
  
  # create an input layer
  input = layer_input(shape = c(max_len), dtype = 'int32')
  hidden_mean = tf$reduce_mean(model_(input)[[1]], axis = 1L) %>% 
    layer_dense(64, activation = 'relu')
  # create an output layer for binary classification
  output = hidden_mean %>% layer_dense(units = 1, activation = 'sigmoid')
  model = keras_model(inputs = input, outputs = output)
  
  # compile with AUC score
  model %>% compile(optimizer = tf$keras$optimizers$Adam(learning_rate = 3e-5, epsilon = 1e-08, clipnorm = 1.0),
                    loss = tf$losses$BinaryCrossentropy(from_logits = F),
                    metrics = tf$metrics$AUC())
  
  print(glue::glue('{ai_m[[i]][1]}'))
  # train the model
  history = model %>% keras::fit(tf_train, epochs = epochs, #steps_per_epoch=len/batch_size,
                validation_data = tf_test)
  gather_history[[i]] <- history
  names(gather_history)[i] = ai_m[[i]][1]
}



Extract the results to see the benchmarks:
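A minimal sketch for pulling the per-epoch validation AUC out of the stored history objects (the exact metric name, e.g. val_auc vs. val_auc_1, depends on how TensorFlow names the AUC metric on each run, so it is looked up rather than hard-coded):

res = do.call(rbind, lapply(seq_along(gather_history), function(i) {
  m = gather_history[[i]]$metrics
  # grab whichever validation AUC column is present for this run
  auc_col = grep("^val_auc", names(m), value = TRUE)[1]
  data.frame(model = names(gather_history)[i],
             epoch = seq_len(epochs),
             val_auc = m[[auc_col]])
}))
res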

Both the RoBERTa and ELECTRA models show some additional improvement after 2 training epochs, which cannot be said of GPT-2. In this case, it is clear that it can be enough to train a state-of-the-art model for even a single epoch.

Conclusion

In this post, we showed how to use state-of-the-art NLP models from R. To understand how to apply them to more complex tasks, it is highly recommended to review the transformers tutorial.

We encourage readers to try these models and share their results below in the comments section!

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Re-use

Text and figures are licensed under a Creative Commons Attribution license, CC BY 4.0. Source code is available at https://github.com/henry090/transformers, unless otherwise noted. Figures that have been reused from other sources are not covered by this license and can be recognized by a note in their caption: "Figure from …".

Citation

For attribution, please cite this work as

Abdullayev (2020, July 30). Posit AI Blog: State-of-the-art NLP models from R. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2020-07-30-state-of-the-art-nlp-models-from-r/

BibTeX citation

@misc{abdullayev2020state-of-the-art,
  author = {Abdullayev, Turgut},
  title = {Posit AI Blog: State-of-the-art NLP models from R},
  url = {https://blogs.rstudio.com/tensorflow/posts/2020-07-30-state-of-the-art-nlp-models-from-r/},
  year = {2020}
}
