Simple audio classification with torch


This article translates Daniel Falbel's 'Simple Audio Classification' article from tensorflow/keras to torch/torchaudio. The main goal is to introduce torchaudio and illustrate its contributions to the torch ecosystem. Here, we focus on a popular dataset, the audio loader and the spectrogram transformer. An interesting side product is the parallel between torch and tensorflow, sometimes showing the differences and sometimes the similarities between them.

Downloading and importing

torchaudio has the speechcommand_dataset built in. It filters out the background noise folder by default and lets us choose between versions v0.01 and v0.02.

library(torch)
library(torchaudio)

# set an existing folder here to cache the dataset
DATASETS_PATH <- "~/datasets/"

# 1.4GB download
df <- speechcommand_dataset(
  root = DATASETS_PATH, 
  url = "speech_commands_v0.01",
  download = TRUE
)

# expect folder: _background_noise_
df$EXCEPT_FOLDER
# [1] "_background_noise_"

# number of audio files
length(df)
# [1] 64721

# a sample
sample <- df[1]

sample$waveform[, 1:10]
torch_tensor
0.0001 *
 0.9155  0.3052  1.8311  1.8311 -0.3052  0.3052  2.4414  0.9155 -0.9155 -0.6104
[ CPUFloatType{1,10} ]
sample$sample_rate
# 16000
sample$label
# bed

plot(sample$waveform[1], type = "l", col = "royalblue", main = sample$label)

Figure 1: A sample waveform for a "bed".

Classes

df$classes
 [1] "bed"    "bird"   "cat"    "dog"    "down"   "eight"  "five"  
 [8] "four"   "go"     "happy"  "house"  "left"   "marvin" "nine"  
[15] "no"     "off"    "on"     "one"    "right"  "seven"  "sheila"
[22] "six"    "stop"   "three"  "tree"   "two"    "up"     "wow"   
[29] "yes"    "zero"  

Generator dataloader

torch::dataloader has the same task as data_generator defined in the original article. It is responsible for preparing batches (including shuffling, padding, one-hot encoding, etc.) and for taking care of parallelism and device I/O orchestration.

In torch we do this by passing the train/test subset to torch::dataloader and encapsulating all the batch setup logic inside a collate_fn() function.
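The code in this text uses train_subset and test_subset without showing how they are built. Below is a minimal sketch, assuming a random 70/30 split. The 354 training batches of size 128 in the training log further down are consistent with roughly 70% of the 64721 files. The helper name split_indices and the seed are illustrative and not taken from the text.

```r
# Hypothetical helper: split the indices 1..n into train and test sets.
split_indices <- function(n, train_frac = 0.7, seed = 6) {
  set.seed(seed)  # arbitrary seed, only for reproducibility
  id_train <- sample(n, size = floor(train_frac * n))
  id_test <- setdiff(seq_len(n), id_train)
  list(train = id_train, test = id_test)
}

ids <- split_indices(64721)
length(ids$train)                  # 45304
ceiling(length(ids$train) / 128)   # 354 batches of size 128

# With the dataset loaded as `df`, the subsets could then be created as:
# train_subset <- torch::dataset_subset(df, ids$train)
# test_subset  <- torch::dataset_subset(df, ids$test)
```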

At this point, dataloader(train_subset) would not work because the samples are not padded. So we need to build our own collate_fn() with the padding strategy.

I suggest using the following approach when implementing the collate_fn():

  1. begin with collate_fn <- function(batch) browser().
  2. instantiate the dataloader with that collate_fn().
  3. create an environment by calling enumerate(dataloader) so you can ask the dataloader to retrieve a batch.
  4. run environment[[1]][[1]]. You should now be dropped inside collate_fn() with access to the batch input object.
  5. build the logic.

collate_fn <- function(batch) {
  browser()
}

ds_train <- dataloader(
  train_subset, 
  batch_size = 32, 
  shuffle = TRUE, 
  collate_fn = collate_fn
)

ds_train_env <- enumerate(ds_train)
ds_train_env[[1]][[1]]

The final collate_fn() pads the waveform to length 16001 and then stacks everything together. At this point there are still no spectrograms. We are going to make the spectrogram transformation a part of the model architecture.

pad_sequence <- function(batch) {
    # Make all tensors in a batch the same length by padding with zeros
    batch <- sapply(batch, function(x) (x$t()))
    batch <- torch::nn_utils_rnn_pad_sequence(batch, batch_first = TRUE, padding_value = 0.)
    return(batch$permute(c(1, 3, 2)))
  }

# Final collate_fn
collate_fn <- function(batch) {
 # Input structure:
 # list of 32 lists: list(waveform, sample_rate, label, speaker_id, utterance_number)
 # Transpose it
 batch <- purrr::transpose(batch)
 tensors <- batch$waveform
 targets <- batch$label_index

 # Group the list of tensors into a batched tensor
 tensors <- pad_sequence(tensors)
 
 # target encoding
 targets <- torch::torch_stack(targets)

 list(tensors = tensors, targets = targets) # (64, 1, 16001)
}

The batch structure is:

  • batch[[1]]: waveforms, a tensor with dimension (32, 1, 16001)
  • batch[[2]]: targets, a tensor with dimension (32, 1)
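What the padding step does can be seen without torch. The following base R sketch mimics the idea behind pad_sequence() on plain numeric vectors. The name pad_to_longest is made up for this illustration; the real code above works on tensors and also adds a channel dimension.

```r
# Hypothetical helper: zero-pad numeric vectors to the length of the longest one
# and stack them as the rows of a matrix (one row per sample).
pad_to_longest <- function(waves, padding_value = 0) {
  max_len <- max(lengths(waves))
  padded <- lapply(waves, function(w) {
    c(w, rep(padding_value, max_len - length(w)))
  })
  do.call(rbind, padded)
}

# Three toy "waveforms" of different lengths
waves <- list(c(0.1, 0.2, 0.3), c(0.5), c(0.7, 0.8))
batch <- pad_to_longest(waves)
dim(batch)    # 3 3
batch[2, ]    # 0.5 0.0 0.0
```

Padding to a common length is what makes stacking possible: samples of unequal length cannot be combined into one rectangular batch.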

Also, torchaudio comes with 3 loaders, av_loader, tuneR_loader and audiofile_loader, with more to come. set_audio_backend() is used to set one of them as the audio loader. Their performance differs depending on the audio format (mp3 or wav). There is no perfect world yet: tuneR_loader is best for mp3, audiofile_loader is best for wav, but neither of them has the option of partially loading a sample from an audio file without first bringing all the data into memory.

For a given audio backend, we need to pass it to each worker through the worker_init_fn() argument.

ds_train <- dataloader(
  train_subset, 
  batch_size = 128, 
  shuffle = TRUE, 
  collate_fn = collate_fn,
  num_workers = 16,
  worker_init_fn = function(.) {torchaudio::set_audio_backend("audiofile_loader")},
  worker_globals = c("pad_sequence") # pad_sequence is needed by collate_fn
)

ds_test <- dataloader(
  test_subset, 
  batch_size = 64, 
  shuffle = FALSE, 
  collate_fn = collate_fn,
  num_workers = 8,
  worker_globals = c("pad_sequence") # pad_sequence is needed by collate_fn
)

Model definition

Instead of keras::keras_model_sequential(), we are going to define a torch::nn_module(). As mentioned in the original article, the model is based on this architecture for MNIST from this tutorial, and I'll call it 'DanielNN'.

dan_nn <- torch::nn_module(
  "DanielNN",
  
  initialize = function(
    window_size_ms = 30, 
    window_stride_ms = 10
  ) {
    
    # spectrogram spec
    window_size <- as.integer(16000*window_size_ms/1000)
    stride <- as.integer(16000*window_stride_ms/1000)
    fft_size <- as.integer(2^trunc(log(window_size, 2) + 1))
    n_chunks <- length(seq(0, 16000, stride))
    
    self$spectrogram <- torchaudio::transform_spectrogram(
      n_fft = fft_size, 
      win_length = window_size, 
      hop_length = stride, 
      normalized = TRUE, 
      power = 2
    )
    
    # convs 2D
    self$conv1 <- torch::nn_conv2d(in_channels = 1, out_channels = 32, kernel_size = c(3,3))
    self$conv2 <- torch::nn_conv2d(in_channels = 32, out_channels = 64, kernel_size = c(3,3))
    self$conv3 <- torch::nn_conv2d(in_channels = 64, out_channels = 128, kernel_size = c(3,3))
    self$conv4 <- torch::nn_conv2d(in_channels = 128, out_channels = 256, kernel_size = c(3,3))
    
    # denses
    self$dense1 <- torch::nn_linear(in_features = 14336, out_features = 128)
    self$dense2 <- torch::nn_linear(in_features = 128, out_features = 30)
  },
  
  forward = function(x) {
    x %>% # (64, 1, 16001)
      self$spectrogram() %>% # (64, 1, 257, 101)
      torch::torch_add(0.01) %>%
      torch::torch_log() %>%
      self$conv1() %>%
      torch::nnf_relu() %>%
      torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%
      
      self$conv2() %>%
      torch::nnf_relu() %>%
      torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%
      
      self$conv3() %>%
      torch::nnf_relu() %>%
      torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%
      
      self$conv4() %>%
      torch::nnf_relu() %>%
      torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%
      
      torch::nnf_dropout(p = 0.25) %>%
      torch::torch_flatten(start_dim = 2) %>%
      
      self$dense1() %>%
      torch::nnf_relu() %>%
      torch::nnf_dropout(p = 0.5) %>%
      self$dense2() 
  }
)

model <- dan_nn()

device <- torch::torch_device(if(torch::cuda_is_available()) "cuda" else "cpu")
model$to(device = device)

print(model)
An `nn_module` containing 2,226,846 parameters.

── Modules ──────────────────────────────────────────────────────
● spectrogram:  #0 parameters
● conv1:  #320 parameters
● conv2:  #18,496 parameters
● conv3:  #73,856 parameters
● conv4:  #295,168 parameters
● dense1:  #1,835,136 parameters
● dense2:  #3,870 parameters
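The in_features = 14336 of dense1 and the parameter counts printed above can be checked by hand. Here is a base R sketch that follows the shapes through the network, assuming the default settings of the layers as used here (no padding and stride 1 in the convolutions, 2x2 pooling that rounds down). The helper names are made up for this illustration.

```r
# Spectrogram spec, using the same formulas as initialize()
sample_rate <- 16000
window_size <- as.integer(sample_rate * 30 / 1000)           # 480 samples
stride      <- as.integer(sample_rate * 10 / 1000)           # 160 samples
fft_size    <- as.integer(2^trunc(log(window_size, 2) + 1))  # 512
n_freq      <- fft_size %/% 2 + 1                            # 257 frequency bins
n_chunks    <- length(seq(0, sample_rate, stride))           # 101 time frames

# Each stage: 3x3 convolution (size - 2), then 2x2 max pooling (integer halving)
after_stage <- function(size) (size - 2) %/% 2
hw <- c(n_freq, n_chunks)
for (i in 1:4) hw <- after_stage(hw)
hw                                # 14 4
flat_features <- 256 * prod(hw)   # 14336, the in_features of dense1

# Parameter counts: weights plus biases
conv_params  <- function(c_in, c_out, k = 3) c_in * c_out * k * k + c_out
dense_params <- function(n_in, n_out) n_in * n_out + n_out
params <- c(
  conv1  = conv_params(1, 32),
  conv2  = conv_params(32, 64),
  conv3  = conv_params(64, 128),
  conv4  = conv_params(128, 256),
  dense1 = dense_params(flat_features, 128),
  dense2 = dense_params(128, 30)
)
params
sum(params)                       # 2226846
```

If the window size or stride is changed, the spectrogram shape changes too, and in_features of dense1 has to be recomputed in the same way.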

Model fitting

Unlike tensorflow, there is no model %>% compile(...) step in torch, so we are going to set the loss criterion, optimizer strategy and evaluation metrics explicitly in the training loop.

loss_criterion <- torch::nn_cross_entropy_loss()
optimizer <- torch::optim_adadelta(model$parameters, rho = 0.95, eps = 1e-7)
metrics <- list(acc = yardstick::accuracy_vec)
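nn_cross_entropy_loss() works on raw scores (logits) and integer class indices, which is why the model ends with dense2 and no softmax layer. As a rough illustration of the quantity being minimized, here is a base R sketch of the same formula with mean reduction. The function name cross_entropy is made up here; it is not the torch implementation.

```r
# Hypothetical base R version of cross-entropy on raw scores (logits).
# logits:  matrix with one row per sample and one column per class
# targets: class indices (1-based), one per sample
cross_entropy <- function(logits, targets) {
  # log-softmax, shifted by the row maximum for numerical stability
  shifted <- logits - apply(logits, 1, max)
  log_probs <- shifted - log(rowSums(exp(shifted)))
  # negative log-probability of the true class, averaged over the batch
  picked <- log_probs[cbind(seq_along(targets), targets)]
  -mean(picked)
}

# All 30 classes equally likely: the loss equals log(30)
logits <- matrix(0, nrow = 4, ncol = 30)
cross_entropy(logits, c(1, 5, 12, 30))   # log(30), about 3.4012
```

A loss close to log(30) therefore corresponds to guessing uniformly among the 30 classes, which gives a reference point for reading the training log.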

Training loop

library(glue)
library(progress)

pred_to_r <- function(x) {
  classes <- factor(df$classes)
  classes[as.numeric(x$to(device = "cpu"))]
}

set_progress_bar <- function(total) {
  progress_bar$new(
    total = total, clear = FALSE, width = 70,
    format = ":current/:total [:bar] - :elapsed - loss: :loss - acc: :acc"
  )
}
epochs <- 20
losses <- c()
accs <- c()

for(epoch in seq_len(epochs)) {
  pb <- set_progress_bar(length(ds_train))
  pb$message(glue("Epoch {epoch}/{epochs}"))
  coro::loop(for(batch in ds_train) {
    optimizer$zero_grad()
    predictions <- model(batch[[1]]$to(device = device))
    targets <- batch[[2]]$to(device = device)
    loss <- loss_criterion(predictions, targets)
    loss$backward()
    optimizer$step()
    
    # eval reports
    prediction_r <- pred_to_r(predictions$argmax(dim = 2))
    targets_r <- pred_to_r(targets)
    acc <- metrics$acc(targets_r, prediction_r)
    accs <- c(accs, acc)
    loss_r <- as.numeric(loss$item())
    losses <- c(losses, loss_r)
    
    pb$tick(tokens = list(loss = round(mean(losses), 4), acc = round(mean(accs), 4)))
  })
}



# test
predictions_r <- c()
targets_r <- c()
coro::loop(for(batch_test in ds_test) {
  predictions <- model(batch_test[[1]]$to(device = device))
  targets <- batch_test[[2]]$to(device = device)
  predictions_r <- c(predictions_r, pred_to_r(predictions$argmax(dim = 2)))
  targets_r <- c(targets_r, pred_to_r(targets))
})
val_acc <- metrics$acc(factor(targets_r, levels = 1:30), factor(predictions_r, levels = 1:30))
cat(glue("val_acc: {val_acc}\n\n"))
Epoch 1/20
[W SpectralOps.cpp:590] Warning: The function torch.rfft is deprecated and will be removed in a future PyTorch release. Use the new torch.fft module functions, instead, by importing torch.fft and calling torch.fft.fft or torch.fft.rfft. (function operator())
354/354 [=========================] -  1m - loss: 2.6102 - acc: 0.2333
Epoch 2/20
354/354 [=========================] -  1m - loss: 1.9779 - acc: 0.4138
Epoch 3/20
354/354 [============================] -  1m - loss: 1.62 - acc: 0.519
Epoch 4/20
354/354 [=========================] -  1m - loss: 1.3926 - acc: 0.5859
Epoch 5/20
354/354 [==========================] -  1m - loss: 1.2334 - acc: 0.633
Epoch 6/20
354/354 [=========================] -  1m - loss: 1.1135 - acc: 0.6685
Epoch 7/20
354/354 [=========================] -  1m - loss: 1.0199 - acc: 0.6961
Epoch 8/20
354/354 [=========================] -  1m - loss: 0.9444 - acc: 0.7181
Epoch 9/20
354/354 [=========================] -  1m - loss: 0.8816 - acc: 0.7365
Epoch 10/20
354/354 [=========================] -  1m - loss: 0.8278 - acc: 0.7524
Epoch 11/20
354/354 [=========================] -  1m - loss: 0.7818 - acc: 0.7659
Epoch 12/20
354/354 [=========================] -  1m - loss: 0.7413 - acc: 0.7778
Epoch 13/20
354/354 [=========================] -  1m - loss: 0.7064 - acc: 0.7881
Epoch 14/20
354/354 [=========================] -  1m - loss: 0.6751 - acc: 0.7974
Epoch 15/20
354/354 [=========================] -  1m - loss: 0.6469 - acc: 0.8058
Epoch 16/20
354/354 [=========================] -  1m - loss: 0.6216 - acc: 0.8133
Epoch 17/20
354/354 [=========================] -  1m - loss: 0.5985 - acc: 0.8202
Epoch 18/20
354/354 [=========================] -  1m - loss: 0.5774 - acc: 0.8263
Epoch 19/20
354/354 [==========================] -  1m - loss: 0.5582 - acc: 0.832
Epoch 20/20
354/354 [=========================] -  1m - loss: 0.5403 - acc: 0.8374
val_acc: 0.876705979296493
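
One detail worth noting when reading this log: losses and accs are created before the epoch loop and never reset, so the values shown in the progress bar are running means over all batches since the start of training, not per-epoch figures. A toy sketch with made-up per-batch accuracies (not values from the run above) shows how such a running mean lags behind the most recent epoch:

```r
# Made-up per-batch accuracies for 3 epochs of 4 batches each (illustration only)
acc_by_epoch <- list(
  c(0.20, 0.30, 0.40, 0.50),
  c(0.60, 0.65, 0.70, 0.75),
  c(0.80, 0.82, 0.84, 0.86)
)

accs <- c()
for (epoch in seq_along(acc_by_epoch)) {
  # accumulated across epochs, as in the training loop
  accs <- c(accs, acc_by_epoch[[epoch]])
  cat(sprintf(
    "epoch %d: running mean = %.4f, this epoch only = %.4f\n",
    epoch, mean(accs), mean(acc_by_epoch[[epoch]])
  ))
}
```

This may be part of the reason why the last training figure in the log (0.8374) sits below the test accuracy (0.8767). Resetting losses and accs at the start of each epoch would give per-epoch values instead.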

Making predictions

We already have all the predictions calculated for test_subset. Let's recreate the alluvial plot from the original article.

library(dplyr)
library(alluvial)
df_validation <- data.frame(
  pred_class = df$classes[predictions_r],
  class = df$classes[targets_r]
)
x <-  df_validation %>%
  mutate(correct = pred_class == class) %>%
  count(pred_class, class, correct)

alluvial(
  x %>% select(class, pred_class),
  freq = x$n,
  col = ifelse(x$correct, "lightblue", "red"),
  border = ifelse(x$correct, "lightblue", "red"),
  alpha = 0.6,
  hide = x$n < 20
)


Figure 2: Model performance: true labels <--> predicted labels.

The accuracy of the model is 87.7%, slightly worse than the tensorflow version from the original post. Nevertheless, all conclusions from the original post still hold.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Figures that have been reused from other sources are not covered by this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Damiani (2021, Feb. 4). Posit AI Blog: Simple audio classification with torch. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2021-02-04-simple-audio-classification-with-torch/

BibTeX citation

@misc{athossimpleaudioclassification,
  author = {Damiani, Athos},
  title = {Posit AI Blog: Simple audio classification with torch},
  url = {https://blogs.rstudio.com/tensorflow/posts/2021-02-04-simple-audio-classification-with-torch/},
  year = {2021}
}
