
Posit AI Blog: torch Time Series, Take Three: Sequence-to-Sequence Prediction


Today we continue our exploration of multi-step time-series forecasting with torch. This post is the third in a series.

  • Initially, we covered the basics of recurrent neural networks (RNNs) and trained a model to predict the next value in a sequence. We also found that we could forecast quite a few steps ahead by feeding back individual predictions in a loop.

  • Next, we built a model "natively" for multi-step prediction. A small multilayer perceptron (MLP) was used to project the RNN output to several time points in the future.

Of both approaches, the latter was the more successful. But conceptually, it has an unsatisfying touch: when the MLP extrapolates and generates output for, say, ten consecutive points in time, there is no causal relation between them. (Imagine a ten-day weather forecast that was never updated.)

Now we would like to try something more intuitively appealing. The input is a sequence; the output is a sequence. In natural language processing (NLP), this type of task is very common: it is exactly the kind of situation we see with machine translation or summarization.

Fittingly, the types of models used for these purposes are called sequence-to-sequence models (often abbreviated seq2seq). In a nutshell, they split the task into two components: an encoding part and a decoding part. The former is done just once per input-target pair. The latter is done in a loop, as in our first attempt. But the decoder has more information at its disposal: at each iteration, its processing is based on the previous prediction as well as the previous state. That previous state will be the encoder's when the loop is started, and the decoder's own thereafter.

Before discussing the model in detail, we need to adapt our data input mechanism.

We continue working with vic_elec, provided by tsibbledata.

Again, the dataset definition in the current post looks a bit different from the way it did before; it is the shape of the target that differs. This time, y equals x, shifted one step to the left.

The reason we do this is owed to the way we are going to train the network. With seq2seq, people often use a technique called "teacher forcing" where, instead of feeding back its own prediction into the decoder module, you pass it the value it should have predicted. To be clear, this is done only during training, and to a configurable degree.
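As a tiny illustration of that x/y relationship (with made-up numbers), take a toy series and a window of length five; y is simply x shifted one step to the left:

series <- c(10, 12, 15, 13, 14, 16)
x <- series[1:5]  # 10 12 15 13 14
y <- series[2:6]  # 12 15 13 14 16 -- at each position, the value that follows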

library(torch)
library(tidyverse)
library(tsibble)
library(tsibbledata)
library(lubridate)
library(fable)
library(zeallot)

n_timesteps <- 7 * 24 * 2
n_forecast <- n_timesteps

vic_elec_get_year <- function(year, month = NULL) {
  vic_elec %>%
    filter(year(Date) == year, month(Date) == if (is.null(month)) month(Date) else month) %>%
    as_tibble() %>%
    select(Demand)
}

elec_train <- vic_elec_get_year(2012) %>% as.matrix()
elec_valid <- vic_elec_get_year(2013) %>% as.matrix()
elec_test <- vic_elec_get_year(2014, 1) %>% as.matrix()

train_mean <- mean(elec_train)
train_sd <- sd(elec_train)

elec_dataset <- dataset(
  name = "elec_dataset",
  
  initialize = function(x, n_timesteps, sample_frac = 1) {
    
    self$n_timesteps <- n_timesteps
    self$x <- torch_tensor((x - train_mean) / train_sd)
    
    n <- length(self$x) - self$n_timesteps - 1
    
    self$starts <- sort(sample.int(
      n = n,
      size = n * sample_frac
    ))
    
  },
  
  .getitem = function(i) {
    
    start <- self$starts[i]
    end <- start + self$n_timesteps - 1
    lag <- 1
    
    list(
      x = self$x[start:end],
      y = self$x[(start + lag):(end + lag)]$squeeze(2)
    )
    
  },
  
  .length = function() {
    length(self$starts) 
  }
)

Dataset and dataloader instantiations can proceed as before.

batch_size <- 32

train_ds <- elec_dataset(elec_train, n_timesteps, sample_frac = 0.5)
train_dl <- train_ds %>% dataloader(batch_size = batch_size, shuffle = TRUE)

valid_ds <- elec_dataset(elec_valid, n_timesteps, sample_frac = 0.5)
valid_dl <- valid_ds %>% dataloader(batch_size = batch_size)

test_ds <- elec_dataset(elec_test, n_timesteps)
test_dl <- test_ds %>% dataloader(batch_size = 1)

Technically, the model consists of three modules: the aforementioned encoder and decoder, and the seq2seq module that orchestrates them.

Encoder

The encoder takes its input and runs it through an RNN. Of the two things a recurrent neural network returns, outputs and state, so far we have only been using the output. This time, we do the opposite: we discard the outputs and return only the state.

If the RNN in question is a GRU (and assuming that of the outputs, we take just the final time step, which is what we have been doing throughout), there really is no difference: the final state equals the final output. If it is an LSTM, however, there is a second kind of state, the "cell state". In that case, returning the state instead of the final output will convey more information.
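To see the difference concretely, here is a small sketch (dimensions are arbitrary) that inspects what the two RNN types return when called on a dummy batch:

gru <- nn_gru(input_size = 1, hidden_size = 4, batch_first = TRUE)
lstm <- nn_lstm(input_size = 1, hidden_size = 4, batch_first = TRUE)

x <- torch_randn(2, 5, 1)  # (batch_size, seq_len, input_size)

# for the GRU, the second element returned is a single state tensor
dim(gru(x)[[2]])      # (num_layers, batch_size, hidden_size)

# for the LSTM, it is a list holding hidden state and cell state
length(lstm(x)[[2]])  # 2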

encoder_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, num_layers = 1, dropout = 0) {
    
    self$type <- type
    
    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    }
    
  },
  
  forward = function(x) {
    
    x <- self$rnn(x)
    
    # return last states for all layers
    # per layer, a single tensor for a GRU, a list of 2 tensors for an LSTM
    x <- x[[2]]
    x
    
  }
  
)

Decoder

In the decoder, just as in the encoder, the main component is an RNN. In contrast to the architectures shown previously, though, it does not just return a prediction. It also reports back the RNN's final state.

decoder_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, num_layers = 1) {
    
    self$type <- type
    
    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    }
    
    self$linear <- nn_linear(hidden_size, 1)
    
  },
  
  forward = function(x, state) {
    
    # input to forward():
    # x is (batch_size, 1, 1)
    # state is (1, batch_size, hidden_size)
    x <- self$rnn(x, state)
    
    # split up RNN return values
    # output is (batch_size, 1, hidden_size)
    # next_hidden is (1, batch_size, hidden_size)
    c(output, next_hidden) %<-% x
    
    output <- output$squeeze(2)
    output <- self$linear(output)
    
    list(output, next_hidden)
    
  }
  
)
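Just to make the shapes tangible, here is a sketch of a single decoder step (batch size and state values are arbitrary, for illustration only): one time step of input plus a state to condition on go in; a one-value prediction plus an updated state come out.

dec <- decoder_module("gru", input_size = 1, hidden_size = 32)

x_step <- torch_randn(8, 1, 1)   # (batch_size, 1, 1)
state <- torch_randn(1, 8, 32)   # (num_layers, batch_size, hidden_size)

c(pred, next_state) %<-% dec(x_step, state)

dim(pred)        # (8, 1)
dim(next_state)  # (1, 8, 32)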

seq2seq module

seq2seq is where the action happens. The plan is to encode once, then call the decoder in a loop.

If you look back at the decoder's forward(), you see that it takes two arguments: x and state.

Depending on the context, x corresponds to one of three things: final input, preceding prediction, or prior ground truth.

  • The very first time the decoder is called on an input sequence, x maps to the final input value. This is different from a task like machine translation, where a start token would be passed. With time series, though, we would like to continue where the actual measurements leave off.

  • In further calls, we want the decoder to continue from its most recent prediction. It is only logical, then, to pass back the preceding forecast.

  • That said, a technique called "teacher forcing" is commonly used in NLP to speed up training. With teacher forcing, instead of the forecast we pass the ground truth, the value the decoder should have predicted. We do this only in a configurable fraction of cases and, of course, only during training. The rationale behind this technique is that without this kind of re-calibration, consecutive prediction errors can quickly erase any remaining signal.

state, too, is flexible. But there are only two possibilities here: encoder state and decoder state.

  • The first time the decoder is called, it is "seeded" with the final state from the encoder. Note how this is the only time we make use of the encoding.

  • From then on, the decoder's own previous state is passed. Remember how it returns two values, forecast and state?

seq2seq_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, n_forecast, num_layers = 1, encoder_dropout = 0) {
    
    self$encoder <- encoder_module(type = type, input_size = input_size,
                                   hidden_size = hidden_size, num_layers, encoder_dropout)
    self$decoder <- decoder_module(type = type, input_size = input_size,
                                   hidden_size = hidden_size, num_layers)
    self$n_forecast <- n_forecast
    
  },
  
  forward = function(x, y, teacher_forcing_ratio) {
    
    # prepare empty output
    outputs <- torch_zeros(dim(x)[1], self$n_forecast)$to(device = device)
    
    # encode current input sequence
    hidden <- self$encoder(x)
    
    # prime decoder with final input value and hidden state from the encoder
    out <- self$decoder(x[ , n_timesteps, , drop = FALSE], hidden)
    
    # decompose into predictions and decoder state
    # pred is (batch_size, 1)
    # state is (1, batch_size, hidden_size)
    c(pred, state) %<-% out
    
    # store first prediction
    outputs[ , 1] <- pred$squeeze(2)
    
    # iterate to generate remaining forecasts
    for (t in 2:self$n_forecast) {
      
      # call decoder on either ground truth or previous prediction, plus previous decoder state
      teacher_forcing <- runif(1) < teacher_forcing_ratio
      input <- if (teacher_forcing) y[ , t - 1, drop = FALSE] else pred
      input <- input$unsqueeze(3)
      out <- self$decoder(input, state)
      
      # again, decompose decoder return values
      c(pred, state) %<-% out
      # and store current prediction
      outputs[ , t] <- pred$squeeze(2)
    }
    outputs
  }
  
)

net <- seq2seq_module("gru", input_size = 1, hidden_size = 32, n_forecast = n_forecast)

# training RNNs on the GPU currently prints a warning that may clutter 
# the console
# see https://github.com/mlverse/torch/issues/461
# alternatively, use 
# device <- "cpu"
device <- torch_device(if (cuda_is_available()) "cuda" else "cpu")

net <- net$to(device = device)

The training routine is mostly unchanged. However, we need to decide about teacher_forcing_ratio, the proportion of input sequences we want to perform re-calibration on. In valid_batch(), this should always be 0, while in train_batch(), it is up to us (or rather, to experimentation). Here, we set it to 0.3.

optimizer <- optim_adam(net$parameters, lr = 0.001)

num_epochs <- 50

train_batch <- function(b, teacher_forcing_ratio) {
  
  optimizer$zero_grad()
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)
  
  loss <- nnf_mse_loss(output, target)
  loss$backward()
  optimizer$step()
  
  loss$item()
  
}

valid_batch <- function(b, teacher_forcing_ratio = 0) {
  
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)
  
  loss <- nnf_mse_loss(output, target)
  
  loss$item()
  
}

for (epoch in 1:num_epochs) {
  
  net$train()
  train_loss <- c()
  
  coro::loop(for (b in train_dl) {
    loss <- train_batch(b, teacher_forcing_ratio = 0.3)
    train_loss <- c(train_loss, loss)
  })
  
  cat(sprintf("\nEpoch %d, training: loss: %3.5f \n", epoch, mean(train_loss)))
  
  net$eval()
  valid_loss <- c()
  
  coro::loop(for (b in valid_dl) {
    loss <- valid_batch(b)
    valid_loss <- c(valid_loss, loss)
  })
  
  cat(sprintf("\nEpoch %d, validation: loss: %3.5f \n", epoch, mean(valid_loss)))
}
Epoch 1, training: loss: 0.37961 

Epoch 1, validation: loss: 1.10699 

Epoch 2, training: loss: 0.19355 

Epoch 2, validation: loss: 1.26462 

# ...
# ...

Epoch 49, training: loss: 0.03233 

Epoch 49, validation: loss: 0.62286 

Epoch 50, training: loss: 0.03091 

Epoch 50, validation: loss: 0.54457

It is interesting to compare performance for different settings of teacher_forcing_ratio. With a setting of 0.5, training loss decreases a lot more slowly; the opposite is seen with a setting of 0. Validation loss, however, is not affected significantly.

The code to inspect test-set forecasts is unchanged.

net$eval()

test_preds <- vector(mode = "list", length = length(test_dl))

i <- 1

coro::loop(for (b in test_dl) {
  
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio = 0)
  preds <- as.numeric(output)
  
  test_preds[[i]] <- preds
  i <<- i + 1
  
})

vic_elec_jan_2014 <- vic_elec %>%
  filter(year(Date) == 2014, month(Date) == 1)

test_pred1 <- test_preds[[1]]
test_pred1 <- c(rep(NA, n_timesteps), test_pred1, rep(NA, nrow(vic_elec_jan_2014) - n_timesteps - n_forecast))

test_pred2 <- test_preds[[408]]
test_pred2 <- c(rep(NA, n_timesteps + 407), test_pred2, rep(NA, nrow(vic_elec_jan_2014) - 407 - n_timesteps - n_forecast))

test_pred3 <- test_preds[[817]]
test_pred3 <- c(rep(NA, nrow(vic_elec_jan_2014) - n_forecast), test_pred3)


preds_ts <- vic_elec_jan_2014 %>%
  select(Demand) %>%
  add_column(
    mlp_ex_1 = test_pred1 * train_sd + train_mean,
    mlp_ex_2 = test_pred2 * train_sd + train_mean,
    mlp_ex_3 = test_pred3 * train_sd + train_mean) %>%
  pivot_longer(-Time) %>%
  update_tsibble(key = name)


preds_ts %>%
  autoplot() +
  scale_colour_manual(values = c("#08c5d1", "#00353f", "#ffbf66", "#d46f4d")) +
  theme_minimal()

Figure 1: Predictions for one week in January 2014.

Comparing this to the forecast obtained from last time's RNN-MLP combo, we don't see much of a difference. Is this surprising? To me it is. If asked to speculate about the reason, I would probably say this: in all of the architectures we have used so far, the main carrier of information has been the final hidden state of the RNN (the one and only RNN in the previous two setups, the encoder RNN in this one). It will be interesting to see what happens in the last part of this series, when we augment the encoder-decoder architecture with attention.

Thanks for reading!

Photo by Suzuha Kozuki on Unsplash
