Character-Aware Neural Language Models
The paper "Character-Aware Neural LLMs" by Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush presents a novel approach in the field of neural LLMing. The authors introduce a model that uniquely leverages character-level inputs yet produces predictions at the word level, distinguishing it from traditional word-level neural LLMs (NLMs). The model employs a convolutional neural network (CNN) and a highway network over characters, with the output being fed into a long short-term memory (LSTM) recurrent neural network LLM (RNN-LM).
Model Architecture and Methodology
The model architecture is designed to balance efficacy with parameter efficiency. It integrates the following components (a minimal code sketch follows the list):
- Character-Level Convolutional Neural Network (CharCNN): This processes the raw characters of each word with convolutional filters of several widths, followed by max-over-time pooling. The pooled filter responses act as character n-gram features, capturing local dependencies within each word.
- Highway Network: The output from the CharCNN is fed into a highway network, which facilitates the learning of both complex transformations and direct pass-throughs of the input. This step is crucial for combining local character-level features into coherent word representations.
- LSTM Recurrent Neural Network: The resultant word representations are then input into a multi-layer LSTM, which captures temporal dependencies within sequences of words to model language effectively.
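As a concrete illustration of how these three pieces fit together, here is a minimal PyTorch-style sketch; the class names and hyperparameters are illustrative choices, not the paper's exact configuration. Character embeddings of each word are convolved with filters of several widths, max-pooled over time, passed through a highway layer, and the resulting word vectors are fed to an LSTM whose softmax is still taken over the word vocabulary.

```python
import torch
import torch.nn as nn


class CharWordEncoder(nn.Module):
    """CharCNN plus one highway layer: characters of a word -> word representation."""

    def __init__(self, num_chars, char_dim=15,
                 filter_widths=(1, 2, 3, 4, 5), filters_per_width=25):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim, padding_idx=0)
        # One 1-D convolution per filter width; each filter responds to a character n-gram.
        self.convs = nn.ModuleList([
            nn.Conv1d(char_dim, filters_per_width * w, kernel_size=w)
            for w in filter_widths
        ])
        self.out_dim = sum(filters_per_width * w for w in filter_widths)
        # Highway layer: y = t * g(W_H x + b_H) + (1 - t) * x
        self.transform = nn.Linear(self.out_dim, self.out_dim)
        self.gate = nn.Linear(self.out_dim, self.out_dim)

    def forward(self, char_ids):
        # char_ids: (N, max_word_len) integer ids, padded at least to the widest filter.
        x = self.char_emb(char_ids).transpose(1, 2)      # (N, char_dim, max_word_len)
        # Max-over-time pooling keeps each filter's strongest response over the word.
        feats = [torch.tanh(conv(x)).max(dim=2).values for conv in self.convs]
        y = torch.cat(feats, dim=1)                      # (N, out_dim)
        t = torch.sigmoid(self.gate(y))
        return t * torch.relu(self.transform(y)) + (1.0 - t) * y


class CharAwareLM(nn.Module):
    """Word-level LSTM language model whose input vectors come from the CharCNN."""

    def __init__(self, num_chars, vocab_size, hidden_dim=300, num_layers=2):
        super().__init__()
        self.encoder = CharWordEncoder(num_chars)
        self.lstm = nn.LSTM(self.encoder.out_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)  # softmax is still over words

    def forward(self, char_ids):
        # char_ids: (batch, seq_len, max_word_len)
        b, s, l = char_ids.shape
        words = self.encoder(char_ids.reshape(b * s, l)).reshape(b, s, -1)
        hidden, _ = self.lstm(words)
        return self.decoder(hidden)                      # (batch, seq_len, vocab_size)
```

Training then proceeds exactly as for a word-level RNN-LM: the logits are scored against the next word's index with a cross-entropy loss; only the input side of the network changes.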
Unlike previous works that utilized embeddings at the word or morpheme levels, this model eschews such embeddings at the input stage, thereby reducing the total number of parameters significantly.
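A rough back-of-the-envelope comparison shows where the savings come from; the numbers below are illustrative, not figures reported in the paper. The input word-embedding table scales with the word vocabulary, whereas the character-level parameters scale only with the much smaller character vocabulary and filter set.

```python
# Illustrative input-layer parameter counts (not the paper's reported figures).
vocab_size, word_dim = 10_000, 650            # word-embedding lookup table
num_chars, char_dim = 50, 15                  # character-embedding lookup table
filters = {w: 25 * w for w in range(1, 6)}    # filters per width, as in the sketch above

word_input_params = vocab_size * word_dim
char_input_params = num_chars * char_dim + sum(
    n * char_dim * w + n for w, n in filters.items()    # conv weights + biases
)
print(word_input_params, char_input_params)   # 6,500,000 vs. 21,750
```

The highway and LSTM layers add parameters of their own, but apart from the output softmax, none of them grows with the size of the word vocabulary.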
Experimental Results
The authors conducted comprehensive evaluations across several language corpora, focusing on both English (using the Penn Treebank) and a variety of morphologically rich languages, including Arabic, Czech, French, German, Spanish, and Russian.
Results on the English Penn Treebank
The model demonstrated state-of-the-art performance with a substantial reduction in parameters:
- LSTM-Char-Large Model: Achieved a perplexity of 78.9 on the Penn Treebank, on par with the existing state of the art while using approximately 60% fewer parameters (perplexity is the exponential of the average per-word negative log-likelihood; see the snippet after this list).
- LSTM-Char-Small Model: Achieved notably lower perplexities than other neural language models of comparable size.
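Since perplexity is just the exponential of the mean per-word negative log-likelihood, it can be read directly off a cross-entropy loss. A minimal helper, assuming logits shaped as in the earlier sketch:

```python
import torch
import torch.nn.functional as F


def perplexity(logits, targets):
    """exp of the mean per-word cross-entropy (natural-log base)."""
    # logits: (batch, seq_len, vocab_size); targets: (batch, seq_len) next-word indices
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return torch.exp(nll)
```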
Results on Morphologically Rich Languages
The character-based models displayed significant improvements over word-level and morpheme-level baselines:
- Perplexity Reductions: The model consistently outperformed Kneser-Ney baselines and log-bilinear models incorporating morphological information across various languages.
- Parameter Efficiency: Even with fewer parameters, the character models surpassed the traditional baselines, benefiting from their ability to capture subword information and to represent arbitrary, previously unseen word forms without any morphological analysis (illustrated in the short example after this list).
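The open-vocabulary point has a concrete reading: because only characters need to live in a lookup table, any string, including a rare inflected form or a misspelling, can be mapped to a representation. Continuing the illustrative CharWordEncoder sketch from above (char_to_id and encode_word are hypothetical helpers, not part of the paper):

```python
import torch

# 0 is reserved for padding; unknown characters also fall back to 0 in this sketch.
char_to_id = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}


def encode_word(word, encoder, max_len=20):
    ids = [char_to_id.get(c, 0) for c in word.lower()[:max_len]]
    ids += [0] * (max_len - len(ids))                   # pad to a fixed word length
    return encoder(torch.tensor([ids]))                 # (1, out_dim) word vector


# A form that never appeared in the training vocabulary still gets a representation.
vec = encode_word("unfriendliest", CharWordEncoder(num_chars=27))
```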
Implications and Future Directions
The results indicate that character inputs alone are sufficient for effective language modeling across diverse languages, particularly those with rich morphological structure. The ability to encode both semantic and orthographic information at the character level suggests potential applications beyond language modeling, such as text normalization and handling noisy textual data.
The theoretical implications extend to questioning the necessity of word embeddings, long held to be fundamental in neural language models. Further inquiries could also explore character-level inputs with other neural architectures, such as transformers, or in other NLP tasks such as neural machine translation.
Thus, the proposed character-aware neural language model not only achieves competitive performance with fewer parameters but also underscores the utility of fine-grained, character-level representations in language understanding tasks. Future work may examine how well this approach generalizes to a broader range of NLP applications and evaluate CharCNNs combined with alternative highway network configurations.