# The problem with dialogue datasets
2026-03-08
## What is actually missing

Most dialogue datasets used to train and evaluate language models contain only text. A speaker label. A message. Sometimes a sentiment tag. That is the standard format, and for many tasks it is fine.

But if you are building systems that need to reason about people, not just respond to them, text alone is not enough. Real conversations are not just sequences of messages. They are driven by internal state that never appears in the transcript:

- Beliefs about the other person that evolve with each exchange
- Goals behind each message (seek validation, assert control, repair trust)
- Relationship dynamics that shift across the conversation: trust, tension, connection
- Psychological identity that shapes how someone communicates under pressure

Take a single message:

> "I'm not upset about the meeting, I'm upset you didn't tell me earlier."

The text is visible. But what drove that message is not:

- The speaker's belief that the other person withholds information (confidence: 0.74)
- A goal to seek validation rather than escalate
- A relationship state where trust has been eroding across the last four turns

None of that is in the transcript. And without it, the dataset cannot tell you why that message happened, only that it did.

## Why this matters for training and evaluation

If you train a conversational model on text-only data, it learns to imitate surface patterns. It learns what responses look like, not what drives them. That works well enough for simple tasks. But it creates a ceiling for anything that requires:

- Tracking beliefs across a multi-turn conversation
- Understanding how trust changes during conflict
- Simulating how different personalities handle the same situation
- Evaluating whether an agent's internal reasoning matches its output

For these tasks, you need datasets where the internal structure is explicit, not inferred after the fact from the text.

## A different approach: simulate cognition before generating language

We have been exploring a different approach with a project called StrataSynth. Instead of prompting an LLM to generate a conversation directly, the system simulates a minimal cognitive model first. The language model is only used at the final step to render decisions into text.

The pipeline looks like this:

```
PsycheGraph        → identity, attachment style, biases, voice
Belief Engine      → evolving beliefs with confidence scores
Relationship State → trust, tension, connection, dominance
Decision Engine    → intent, goal, communication act
LLM Rendering      → natural language
```

The LLM cannot decide what to believe or how to relate to the other agent. Those are determined upstream by the state model. The LLM only renders the decision into text. This separation means the internal state is always explicit: it is not something you try to extract from the output after the fact, it is the input that produced the output.

## What the output looks like

Each conversation turn includes the full internal state that produced it:

```json
{
  "speaker": "A",
  "text": "I'm not upset about the meeting. I'm upset you didn't tell me.",
  "intent": "reveal",
  "goal": "seek_validation",
  "communication_act": "accusation",
  "belief_delta": { "trust_other": -0.07 },
  "relationship_state": { "trust": 0.62, "tension": 0.44, "connection": 0.38 }
}
```

Across a full conversation, this produces trajectories such as:

- Belief trajectory: how each belief changes turn by turn
- Relationship trajectory: how trust and tension evolve across the arc
- Behavioral entropy: how varied the speaker's communication acts are

## Evaluation without LLM self-scoring

One problem we wanted to avoid was evaluating synthetic data with the same LLM that generated it. LLM self-evaluation can hide problems instead of revealing them: a model that generates structurally inconsistent data will often rate it as high quality.

All quality metrics in StrataSynth are computed deterministically:

- `belief_consistency`: correlation between communication acts and belief deltas (numpy)
- `identity_stability`: cosine similarity of communication distributions across turns (sentence-transformers)
- `behavioral_entropy`: Shannon entropy over communication act distributions
- `noise_rejection_rate`: fraction of injected noise correctly isolated

No LLM scoring. No circular evaluation.

## Current state

We have published three initial datasets on Hugging Face:

- `stratasynth-social-reasoning`: family conflict, romantic trust repair, caregiver stress
- `stratasynth-agent-stress-test`: jealousy escalation, performance reviews, estrangement
- `stratasynth-belief-dynamics`: career transitions, mentorship conflict, relationship dissolution

They are small and prototype-grade: 15 conversations each. The structure is what we wanted to share, not the volume.

- Datasets: https://huggingface.co/StrataSynth
- Platform: https://www.stratasynth.com

## The open question

Structured social datasets could be useful for:

- Evaluating whether an agent tracks belief changes correctly
- Training models that need to reason about trust and conflict
- Stress-testing conversational systems with psychologically defined personas
- Alignment research that requires explicit internal state as ground truth

But we are not sure this is the right abstraction yet. The cognitive model is minimal by design:

- 4 relationship dimensions
- 10 communication acts

Whether that is enough signal or a crude approximation is something we want to understand better.

If you have worked on structured dialogue datasets, agent evaluation, or social reasoning benchmarks, I would be very interested in hearing where this approach seems wrong.
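To make the decide-then-render separation concrete, here is a minimal Python sketch. The class names, fields, and thresholds are illustrative assumptions on my part, not StrataSynth's actual API; the point is only that the state model fixes the decision before any language model is involved.

```python
from dataclasses import dataclass

# Hypothetical sketch of the decide-then-render split.
# Names and thresholds are illustrative, not StrataSynth's real code.

@dataclass
class RelationshipState:
    trust: float
    tension: float
    connection: float
    dominance: float

@dataclass
class Decision:
    intent: str
    goal: str
    communication_act: str
    belief_delta: dict

def decide(state: RelationshipState) -> Decision:
    """Deterministic state model: chooses what the turn *is* before any LLM runs."""
    if state.tension > 0.4 and state.trust < 0.7:
        return Decision("reveal", "seek_validation", "accusation",
                        {"trust_other": -0.07})
    return Decision("share", "build_rapport", "disclosure",
                    {"trust_other": 0.02})

def render(decision: Decision) -> str:
    """Stand-in for the LLM call: it only turns an already-fixed decision into text."""
    return f"<{decision.communication_act} pursuing {decision.goal}>"

state = RelationshipState(trust=0.62, tension=0.44, connection=0.38, dominance=0.5)
decision = decide(state)
print(render(decision))
```

Because `decide` never sees text and `render` never sees raw state transitions, the "internal state is the input that produced the output" property holds by construction: swapping the renderer cannot change what the agent believed or intended.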
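The deterministic metrics can likewise be computed with no LLM in the loop. Below is a hedged illustration of two of them, behavioral entropy and belief consistency; the signed act encoding and the exact formulas are my assumptions, and StrataSynth's implementation may differ.

```python
from collections import Counter

import numpy as np

def behavioral_entropy(acts):
    """Shannon entropy (in bits) of a speaker's communication-act distribution.
    Higher means more varied behavior; 0 means the speaker repeats one act."""
    counts = np.array(list(Counter(acts).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def belief_consistency(act_signs, belief_deltas):
    """Pearson correlation between a signed encoding of each turn's act
    (e.g. -1 hostile, +1 affiliative) and that turn's belief delta."""
    return float(np.corrcoef(act_signs, belief_deltas)[0, 1])

acts = ["accusation", "reveal", "repair", "repair", "accusation"]
print(round(behavioral_entropy(acts), 3))  # → 1.522

signs = [-1, -1, 1, 1, -1]                 # hostile vs. affiliative per turn
deltas = [-0.07, -0.03, 0.05, 0.04, -0.06]  # trust_other deltas per turn
print(round(belief_consistency(signs, deltas), 3))
```

A conversation where hostile acts reliably co-occur with negative trust deltas scores a consistency near 1.0, which is exactly the kind of structural check an LLM judge could silently get wrong.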