0. The Architectural Challenge
Standard sentiment analysis treats a document as a single "Bag of Words" with one unified sentiment score. This breaks down on real-world reviews, where users routinely express mixed feelings about different aspects of the same product.
Input: "The camera is amazing, but the battery drains instantly."
Prediction: "Neutral"
(Positive and Negative words cancel out)
Camera -> Positive
Battery -> Negative
1. Linguistic Feature Extraction
Before any Deep Learning, we must linguistically deconstruct the sentence. We use the spaCy dependency parser to isolate "Noun Chunks" (potential aspects) and "Adjectives/Verbs" (potential sentiment terms).
import spacy
import pandas as pd

# Load English tokenizer, tagger, parser and NER
nlp = spacy.load('en_core_web_sm')

def extract_linguistic_features(reviews):
    aspect_terms = []
    sentiment_terms = []
    for doc in nlp.pipe(reviews, batch_size=50):
        # 1. Extract noun chunks for aspect classification
        #    Logic: aspects are almost always nouns (e.g. "Screen", "Lens")
        chunks = [chunk.root.text for chunk in doc.noun_chunks
                  if chunk.root.pos_ == 'NOUN']
        aspect_terms.append(' '.join(chunks))

        # 2. Extract adjectives/verbs for sentiment polarity
        #    Logic: sentiment carriers are descriptive (e.g. "Bad", "Loved", "Hate")
        sentiments = [token.lemma_ for token in doc
                      if token.pos_ in ['ADJ', 'VERB'] and not token.is_stop]
        sentiment_terms.append(' '.join(sentiments))
    return aspect_terms, sentiment_terms
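On the example review from Section 0, this function yields output along these lines (the exact tokens depend on the spaCy model version):

aspects, sentiments = extract_linguistic_features(
    ["The camera is amazing, but the battery drains instantly."]
)
print(aspects)     # e.g. ['camera battery']
print(sentiments)  # e.g. ['amazing drain']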
2. Model Architecture: Aspect Classifier
We need to map thousands of raw terms (e.g., "mAh", "charging", "plug") into 11 distinct categories (e.g., BATTERY#OPERATION). We employ a fully connected (dense) neural network over a bag-of-words representation.
Data Vectorization (Bag of Words)
Neural networks cannot consume raw text. We use the Keras Tokenizer to build a frequency matrix (Bag of Words) over the 6,000 most frequent terms.
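A minimal sketch of this step, assuming aspect_terms from Step 1; num_words=6000 matches the input_shape of the network below, and mode='count' produces the frequency matrix described above:

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=6000)
tokenizer.fit_on_texts(aspect_terms)

# Document-term frequency matrix, shape (n_reviews, 6000)
X_aspect = tokenizer.texts_to_matrix(aspect_terms, mode='count')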
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Define the architecture
model = Sequential()

# Layer 1: Dense layer for high-dimensional feature mapping
# Input shape matches our vocabulary size (6000)
model.add(Dense(512, input_shape=(6000,), activation='relu'))

# Dropout for regularization (preventing overfitting on small datasets)
model.add(Dropout(0.25))

# Layer 2: Output layer
# Softmax is critical here for a multi-class probability distribution
model.add(Dense(11, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
Model Training Logs
We train for 50 epochs with a batch size of 32. Notice the rapid convergence of training accuracy; the gap to validation accuracy by epoch 25 (0.98 vs. 0.86) shows mild overfitting, which the Dropout layer helps contain.
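The logs below correspond to a fit call like this sketch, where X_aspect comes from the vectorization step, y_aspect is an assumed name for the one-hot aspect labels, and the validation split is a guess that accounts for the val_loss/val_acc columns:

history = model.fit(X_aspect, y_aspect,
                    epochs=50,
                    batch_size=32,
                    validation_split=0.1,
                    verbose=1)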
Epoch 1/50 4500/4500 [================] - 1s 230us/step - loss: 2.3012 - acc: 0.1840 - val_loss: 1.8901 - val_acc: 0.4500
Epoch 5/50 4500/4500 [================] - 0s 190us/step - loss: 0.8501 - acc: 0.7620 - val_loss: 0.9200 - val_acc: 0.7100
Epoch 10/50 4500/4500 [================] - 0s 190us/step - loss: 0.3204 - acc: 0.9150 - val_loss: 0.6504 - val_acc: 0.8420
Epoch 25/50 4500/4500 [================] - 0s 185us/step - loss: 0.0820 - acc: 0.9840 - val_loss: 0.7201 - val_acc: 0.8600

✔ Model training complete. Weights saved to disk.
3. Model Architecture: Sentiment Polarity
The second model is architecturally similar but solves a simpler 3-class problem: Positive, Negative, or Neutral. The input here differs: we feed it the Adjective/Verb vectors extracted in Step 1.
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# Encode target variable (Positive/Negative/Neutral -> 0/1/2)
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(dataset['sentiment'])
y_dummy = to_categorical(y_encoded)

# Sentiment model
sentiment_model = Sequential()
sentiment_model.add(Dense(512, input_shape=(6000,), activation='relu'))
sentiment_model.add(Dropout(0.5))  # Higher dropout due to noise in adjectives
sentiment_model.add(Dense(3, activation='softmax'))
sentiment_model.compile(loss='categorical_crossentropy', optimizer='adam')
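Wiring the Step 1 sentiment terms into this model takes the same bag-of-words treatment as the aspects. A sketch, where sentiment_tokenizer and the fit parameters are assumptions mirroring the aspect model:

from tensorflow.keras.preprocessing.text import Tokenizer

# Separate tokenizer: the adjective/verb vocabulary differs from the noun vocabulary
sentiment_tokenizer = Tokenizer(num_words=6000)
sentiment_tokenizer.fit_on_texts(sentiment_terms)
X_sentiment = sentiment_tokenizer.texts_to_matrix(sentiment_terms, mode='count')

sentiment_model.fit(X_sentiment, y_dummy, epochs=50, batch_size=32, validation_split=0.1)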
4. Integration & Inference Pipeline
We glue both models together: the pipeline takes raw text, predicts the aspect category, predicts the sentiment polarity, and synthesizes a human-readable report.
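A minimal sketch of the glue code, assuming the tokenizers and trained models from the previous steps; aspect_encoder is a hypothetical LabelEncoder fitted on the 11 aspect labels, while the sentiment side can reuse encoder from Step 3:

import numpy as np

def predict_review(review):
    # Step 1: linguistic split of the raw text
    aspects, sentiments = extract_linguistic_features([review])
    # Vectorize with the SAME tokenizers used at training time
    x_a = tokenizer.texts_to_matrix(aspects, mode='count')
    x_s = sentiment_tokenizer.texts_to_matrix(sentiments, mode='count')
    aspect_probs = model.predict(x_a)[0]
    sent_probs = sentiment_model.predict(x_s)[0]
    return (aspect_encoder.classes_[np.argmax(aspect_probs)],  # hypothetical encoder
            encoder.classes_[np.argmax(sent_probs)],
            float(sent_probs.max()))

aspect, polarity, confidence = predict_review("The unit works bad, slow charge")
print(f"-> Aspect: {aspect}")
print(f"-> Sentiment: {polarity.upper()} (Confidence: {confidence:.1%})")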
Terminal Output: Batch Inference
Loading models... [OK]
Processing 17 reviews...
[Review 01]: "The unit works bad, slow charge" -> Aspect: BATTERY#OPERATION -> Sentiment: NEGATIVE (Confidence: 98.2%)
[Review 02]: "Innovative and good camera product" -> Aspect: CAMERA#GENERAL -> Sentiment: POSITIVE (Confidence: 99.1%)
[Review 03]: "It's a moderate performance phone" -> Aspect: PERFORMANCE#GENERAL -> Sentiment: NEUTRAL (Confidence: 65.4%)
------------------------------------------------
AGGREGATE REPORT:
  Total Positive: 13
  Total Negative: 4
  Dominant Issue: BATTERY (3 Negative mentions)
5. Competitive Differentiation Logic
Using the aggregated sentiment data, we can programmatically compare Product A vs. Product B to generate automated market intelligence reports.
def compare_products(prod_a_stats, prod_b_stats):
    diff_pos = prod_a_stats['positive'] - prod_b_stats['positive']
    if diff_pos > 0:
        print(f"✅ Product A has {diff_pos} more positive reviews than Product B.")
    elif diff_pos < 0:
        print(f"✅ Product B leads with {abs(diff_pos)} more positive reviews.")

    # Critical issue detection
    if prod_a_stats['neg_battery'] > 5:
        print("⚠ WARNING: Product A shows significant Battery Quality issues.")

# Output:
# > Product B leads with 4 more positive reviews.
# > WARNING: Product A shows significant Battery Quality issues.
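A hypothetical call that reproduces the commented output; the stats dictionaries are illustrative (only Product A's positive/negative totals come from the aggregate report above, and neg_battery is set past the alert threshold so the warning fires):

compare_products(
    {'positive': 13, 'negative': 4, 'neg_battery': 6},
    {'positive': 17, 'negative': 2, 'neg_battery': 1},
)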