Technical Brief

Neural Probabilistic Forecasting of Symbolic Sequences With Long Short-Term Memory

Michael Hauser

Department of Mechanical Engineering,
The Pennsylvania State University,
University Park, PA 16802
e-mail: mzh190@psu.edu

Yiwei Fu

Department of Mechanical Engineering,
The Pennsylvania State University,
University Park, PA 16802
e-mail: yxf118@psu.edu

Shashi Phoha

Applied Research Laboratory,
The Pennsylvania State University,
University Park, PA 16802
e-mail: sxp26@arl.psu.edu

Asok Ray

Fellow ASME
Department of Mechanical Engineering,
The Pennsylvania State University,
University Park, PA 16802
e-mail: axr2@psu.edu

1Corresponding author.

Contributed by the Dynamic Systems Division of ASME for publication in the JOURNAL OF DYNAMIC SYSTEMS, MEASUREMENT, AND CONTROL. Manuscript received April 17, 2017; final manuscript received January 8, 2018; published online March 30, 2018. Assoc. Editor: Dumitru I. Caruntu.

J. Dyn. Sys., Meas., Control 140(8), 084502 (Mar 30, 2018) (6 pages) Paper No: DS-17-1199; doi: 10.1115/1.4039281 History: Received April 17, 2017; Revised January 08, 2018

This paper uses long short-term memory (LSTM) neural networks to forecast probability distributions of time series in terms of discrete symbols that are quantized from real-valued data. The developed framework casts the forecasting problem in a probabilistic paradigm as $h_\Theta : X \times Y \to [0, 1]$ such that $\sum_{y \in Y} h_\Theta(x, y) = 1$, where $X$ is the finite-dimensional state space, $Y$ is the symbol alphabet, and $\Theta$ is the set of model parameters. The proposed method differs from standard formulations of time series modeling, such as autoregressive moving average (ARMA). The main advantage of formulating the problem in the symbolic setting is that density predictions are obtained without significantly restrictive assumptions (e.g., second-order statistics). The efficacy of the proposed method is demonstrated by forecasting probability distributions on chaotic time series data collected from a laboratory-scale experimental apparatus. Three neural architectures are compared, each with 100 different combinations of symbol-alphabet size and forecast length, resulting in a comprehensive evaluation of their relative performances.
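The two ingredients of the symbolic formulation, quantizing real values into an alphabet and producing a distribution over that alphabet that sums to one, can be illustrated with a minimal sketch. The equal-width partition, the softmax normalization, and the names `quantize` and `softmax` are assumptions made for illustration; the abstract does not state how the paper partitions the data or normalizes the network output.

```python
import math


def quantize(series, alphabet_size):
    """Map each real value to a symbol index in {0, ..., alphabet_size - 1}.

    Assumption: equal-width bins between the series minimum and maximum.
    """
    lo, hi = min(series), max(series)
    width = (hi - lo) / alphabet_size or 1.0  # guard against a constant series
    return [min(int((v - lo) / width), alphabet_size - 1) for v in series]


def softmax(scores):
    """Turn arbitrary real scores into probabilities that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical data: five samples quantized into a four-symbol alphabet.
symbols = quantize([0.0, 0.2, 0.5, 0.9, 1.0], alphabet_size=4)

# Hypothetical model scores for one state x, one score per symbol y.
probs = softmax([1.0, 2.0, 0.5, -1.0])
```

Here `probs` plays the role of the values of h_Θ(x, ·) for a single state x: each entry lies in [0, 1] and the entries sum to one.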

Copyright © 2018 by ASME




Fig. 1

Schematic of the LSTM neural network structure. The nonlinear activations, σ and tanh, act elementwise on their respective input vectors. Similarly, multiplication and addition blocks operate on their input pairs elementwise.
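For reference, a commonly used form of the LSTM update with a forget gate is written out below. This is the standard textbook formulation, not necessarily the exact variant drawn in Fig. 1; the symbol names ($W$, $U$, $b$, $\tilde{c}_t$) are assumptions introduced here, and $\odot$ denotes the elementwise product described in the caption.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

The addition in the cell-state update $c_t$ and the products with the gates $f_t$, $i_t$, and $o_t$ correspond to the elementwise addition and multiplication blocks of the schematic.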

Fig. 2

Schematic diagram of the combustor apparatus

Fig. 3

Forecasts from the neural probabilistic framework; a dark background corresponds to low probability and white corresponds to high probability. The LSTM and feedforward networks are compared at forecast lengths of 4 and 10 time steps. Solid lines show the true trajectory, and dotted lines show the expectation over the forecasted probability distributions.

Fig. 4

Test error rates for different combinations of forecast length and number of symbols. The lighter shade corresponds to a lower error rate while a darker shade corresponds to a higher error rate. When averaged over all pairwise combinations, the relative reductions in error can be seen in Table 1.

Fig. 5

Weighted error rates for different combinations of forecast length and number of symbols, where the weighting factor is determined by the distance between the predicted symbol and the true symbol (determined by partition centroid). The weighted error can be interpreted as the average distance between the predicted and true symbol, scaled between 0 and 1. The lighter shade corresponds to a lower error rate while a darker shade corresponds to a higher error rate. When averaged over all pairwise combinations, the relative reductions in weighted error can be seen in Table 2.
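The difference between the plain error rate of Fig. 4 and the weighted error rate of Fig. 5 can be made concrete with a small sketch. The caption says only that the weight is the centroid distance between predicted and true symbols, scaled between 0 and 1; dividing by the largest centroid gap is an assumption made here, and the function names, centroids, and symbol sequences are hypothetical.

```python
def error_rate(predicted, actual):
    """Fraction of forecasts whose predicted symbol differs from the true one."""
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)


def weighted_error_rate(predicted, actual, centroids):
    """Mean centroid distance between predicted and true symbols, in [0, 1].

    Assumption: distances are divided by the largest possible centroid gap.
    """
    span = max(centroids) - min(centroids)
    total = sum(abs(centroids[p] - centroids[a]) for p, a in zip(predicted, actual))
    return total / (span * len(actual))


centroids = [0.0, 1.0, 2.0, 3.0]  # hypothetical partition centroids
predicted = [0, 1, 3, 2]          # hypothetical predicted symbol indices
actual = [0, 2, 0, 2]             # hypothetical true symbol indices

plain = error_rate(predicted, actual)
weighted = weighted_error_rate(predicted, actual, centroids)
```

In this example two of four forecasts are wrong, so the plain error rate is 0.5, while the weighted error is 1/3 because one miss is off by a single partition cell and the other by the full range. A near miss is therefore penalized less than a distant one, which the plain error rate cannot express.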


