Am I a Bad Friend?

drobinin.com • by valzevul • 255 points • 156 comments on HN

In 2014, Tim Urban of WaitButWhy published Your Life in Weeks - a grid where each square is one week of one's life, and most of the grid is already filled. The image bothered me for years. I started tracking things partly because of it - I wanted the grid to mean something, not just count down. But the biometric data is an odd representation of how fulfilling my life has been. The grid suggests it's the events that matter - jobs, trips, schools, marriages - and those are easy to mark. But they hardly tell how I felt during those weeks, or what I was like to the people around me. That was what I wanted to measure. So I tried journaling. Paper first, then text files, then daily notes in Obsidian. The journal captured what I thought was important on the day I wrote it. It missed the conversations I forgot to jot down or the slow-moving patterns I couldn't see at the time.

My notes and their connections growing over the years.

Tired of being bad at maintaining relationships[1] and wanting the data to compensate, I set off on a quest to build a personal CRM of sorts, built from the record rather than from memory - thanks to the trail left by my prolific time-wasting on the Internet for the past few decades.

[1] Not bad per se - I just procrastinate a lot. Once I learnt to shoot and stalk deer because I wanted to cook a steak - and cooking is way easier than human interactions.

My digital history

My online presence breaks into roughly three eras:

ICQ, IRC, DC++ in the 2000s: midnight channels for script kiddies and banter - all gone, and probably for the best. The ten-year-old I was in those chats doesn't need a structured archive.

VK[2], Twitter, Facebook in the 2010s: school, university, early career - evenly spread.

Instagram and Telegram in the 2010s-2020s: surprisingly, even though I don't post much on Instagram, it's often easier to catch up with people in DMs, and there are more and more people swapping WhatsApp for Telegram too.

[2] A now-obscure social network, popular in the post-Soviet space in the noughties. I haven't been to Russia for a decade or so, but the archives going back to 2008 are still there. Gotta love totalitarian states, eh?

Armed with GDPR and data access laws, I got myself archives with all my messages, reactions, and social graphs.

Data archives

Parsing a bunch of JSONs and HTMLs wasn't hard but wasn't fun either. Instagram double-encodes Cyrillic through latin-1. Telegram assigns different internal message IDs between exports taken at different dates. Facebook introduced E2E encryption at some point, so the same messages show up in three different folders. Telegram lets you export group chats or just your own messages. VK exports everything without asking. Instagram doesn't differentiate between broadcasts and personal chats at all.

Once parsed into a uniform tab-separated format, the five exports produce different kinds of signal. Telegram and VK are mostly DMs. Instagram adds story interactions and a follower graph. Twitter is its own thing: standalone tweets are a publication corpus, DMs are half support requests and half conference coordination, so I needed the reply/mention graph to catch real signals.

I wanted to capture a daily note per conversation-day, a profile per person, a stub per place, a life timeline, and whatever else surfaces - recipes, cocktails, meeting notes.

Drowning in noise

Before worrying about classification, you have to deal with the fact that much of the data is noise. In my longest thread - 486,000+ messages with my partner across ten years - the content is 2.4% links, 9.1% media, 1.5% emoji-only messages, 28.4% short fillers, and 58.7% substantive text. That means about 41% is noise for the purpose of this exercise. Emojis, links, and media were easy to filter, but catching conversational filler - short words that look like content until you see them hundreds of times per month - is harder. My first idea was filtering out all messages shorter than three words, but there is a lot that can be said in two ("he died", "we lost", etc.).
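To make the failure of the "shorter than three words" idea concrete, here is a minimal sketch of a naive bucketing pass. The function names, the `has_media` flag, and the Unicode-category test for emoji are assumptions made for illustration, not the pipeline used in the article.

```python
import re
import unicodedata

# A message that is nothing but a single URL.
URL_ONLY = re.compile(r"^\s*https?://\S+\s*$")
# Unicode categories that typically make up emoji sequences:
# So = symbols, Sk = skin-tone modifiers, Mn = variation selectors, Cf = joiners.
EMOJI_CATEGORIES = {"So", "Sk", "Mn", "Cf"}


def is_emoji_only(text: str) -> bool:
    """True if every non-space character looks like part of an emoji."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return False
    categories = [unicodedata.category(ch) for ch in chars]
    return "So" in categories and all(c in EMOJI_CATEGORIES for c in categories)


def naive_bucket(text: str, has_media: bool = False) -> str:
    """Assign a message to one bucket using the naive length rule."""
    if has_media:
        return "media"
    if URL_ONLY.match(text):
        return "link"
    if is_emoji_only(text):
        return "emoji"
    if len(text.split()) < 3:
        return "short"  # treated as filler by the naive rule
    return "substantive"


for msg in ["haha", "he died", "we moved to Lisbon yesterday"]:
    print(repr(msg), "->", naive_bucket(msg))
```

Under this rule "haha" and "he died" land in the same bucket, which is exactly the problem: length alone cannot separate filler from a two-word life event.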

Building a denylist of hahahas and noices didn't work either, especially across languages. What worked was sampling from five offset positions across the chat, frequency-counting every short token, reviewing the top 80 manually, and pairing the denylist with a protected set for short messages that are life events.

Across all platforms and years, the cleaned corpus contains roughly 52,000 unique lemmas. The novelty rate - the share of words I hadn't used before in any chat - has been declining since 2008 and plateaued at 6% six years ago. Most of my vocabulary was locked in my early 20s.
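The sampling-and-denylist approach described above could be sketched roughly as follows. Window sizes, function names, and the choice to count whole short messages (rather than individual words) are assumptions. The manual review of the top 80 candidates still happens outside the code.

```python
from collections import Counter


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so 'Haha ' and 'haha' match."""
    return " ".join(text.lower().split())


def sample_windows(messages: list[str], n_windows: int = 5,
                   window_size: int = 2000) -> list[str]:
    """Take n_windows evenly offset, non-overlapping slices of the chat."""
    if not messages:
        return []
    step = max(len(messages) // n_windows, 1)
    size = min(window_size, step)  # never let windows overlap
    sample: list[str] = []
    for i in range(n_windows):
        start = i * step
        sample.extend(messages[start:start + size])
    return sample


def filler_candidates(messages: list[str], max_words: int = 2,
                      top_n: int = 80) -> list[tuple[str, int]]:
    """Frequency-count short messages and return the top_n for manual review."""
    counts = Counter(
        normalise(m) for m in messages if 0 < len(m.split()) <= max_words
    )
    return counts.most_common(top_n)


def is_filler(text: str, denylist: set[str], protected: set[str]) -> bool:
    """Protected short messages (life events) always win over the denylist."""
    key = normalise(text)
    if key in protected:
        return False
    return key in denylist
```

The protected set is checked first on purpose: a short life event that accidentally ends up on the reviewed denylist is still kept.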

Bars: new unique words per year (never used before). Line: those new words as a share of that year's total vocabulary. 2016 has the most new words but a low novelty rate because the total vocabulary that year was enormous - I guess I was very social.
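As a rough illustration of how the two series in the chart relate, the sketch below computes new words per year and the novelty rate as defined in the caption (new words as a share of that year's vocabulary). The input shape, a mapping from year to the set of lemmas used that year, is an assumption.

```python
def novelty_by_year(lemmas_by_year: dict[int, set[str]]) -> dict[int, tuple[int, float]]:
    """Return {year: (new_word_count, novelty_rate)} in chronological order."""
    seen: set[str] = set()
    result: dict[int, tuple[int, float]] = {}
    for year in sorted(lemmas_by_year):
        vocab = lemmas_by_year[year]
        new_words = vocab - seen  # never used in any earlier year
        rate = len(new_words) / len(vocab) if vocab else 0.0
        result[year] = (len(new_words), rate)
        seen |= vocab
    return result
```

This also shows why a year can have the most new words and still a low rate: the denominator is that year's whole vocabulary, so a very chatty year dilutes the share.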

With the noise filtered, the cleaned messages need classification: what's a life event, what's banter, who's being mentioned, what's the emotional temperature. But before any of that, there's a more basic problem.

Which Sasha

Most people I interact with use more than one platform, and often don't share usernames across them. If I were to maintain a profile for each known person, I'd need to map them (and mentions of them) across all chats. Cue diminutives and nicknames: the same Alexander might turn into Al, Alex, Xander, Sandy, and Alec(k). It can also be Sasha, if they're from Eastern Europe - and in Slavic languages Sasha is gender-neutral[3]. Morphological analysers help with case inflection but won't handle slang, and "Sasha" in my chats means a handful of different people depending on when the message was sent and who I'm talking to. Heuristics and NER models won't cut it for thousands of first-name-only mentions in group chats. A classifier trained on message content could work[4], but the training set would need to be hand-labelled from my own chats - exactly the kind of work I was trying to avoid.

[3] Slavic languages often use a "-sha" suffix to create endearing diminutives, e.g. Paul = Pavel = Pasha, Maria = Masha, Innokentiy = Kesha.

[4] Fine-tune a BERT model on labelled name-resolution pairs, predict which "Sasha" based on surrounding topics.

Parsing them all

Classifying what matters has the same problem. The obvious approach is keyword matching on first-person verbs (bought, moved, signed) piped through NER to extract names and places, but it produces a lot of false positives. "I moved" in a message to my mom is a relocation, while "I moved" in a friends' chat is interior design, and "I moved" after a breakup is an emotional milestone.

Fine-tuning a classifier on hand-labelled messages would give me ~70-80% accuracy at best[5] - and at 1.2 million messages, even a 1% false-positive rate means 12,000 fake events in the vault. So I ended up using LLMs[6] for both name-resolution and classification. Measured against a 200-event holdout set, the false-positive rate was under 1% when processing chunks below 6,000 messages.

[5] BERT tops out at 75.6% F1 on event detection (Xi et al., MUSIED 2022) with a professionally annotated corpus in a single domain - I suspect multilingual banter with a small hand-labelled training set would do much worse.

[6] In total I ran 200+ sessions, roughly 15-20 billion tokens including context. On Opus, that's around $15k. On an M5 Pro 32 GB running Qwen3-30B-A3B locally via MLX, it's around 10-15 weeks of continuous inference. Pick your poison.

The LLM doesn't write to the vault. It reads a chunk of messages and produces a structured JSON manifest - daily note bullets with dates and sentiment tags, entity profile facts, life timeline events, place updates, and a list of ambiguities it couldn't resolve ("msg 833006: 'John' without surname - which John?"). A deterministic script reads this JSON and injects the bullets. Each bullet carries a (chat:: tg/chat_NNN) (msg:: 730372 - 730650) provenance marker pointing back to the source.
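A minimal sketch of the deterministic half of that split: a script that reads a manifest and renders bullets with provenance markers. Only the `(chat:: ...) (msg:: ...)` marker format comes from the text. The manifest field names, the `(sentiment:: ...)` tag format, and the chat ID are hypothetical.

```python
import json


def render_bullet(chat: str, note: dict) -> str:
    """Format one daily-note bullet with its sentiment tag and provenance marker."""
    return (
        f"- {note['text']} (sentiment:: {note['sentiment']}) "
        f"(chat:: {chat}) (msg:: {note['msg_from']} - {note['msg_to']})"
    )


def bullets_by_date(manifest_json: str) -> dict[str, list[str]]:
    """Group rendered bullets by the daily note they belong to."""
    manifest = json.loads(manifest_json)
    grouped: dict[str, list[str]] = {}
    for note in manifest.get("daily_notes", []):
        bullet = render_bullet(manifest["chat"], note)
        grouped.setdefault(note["date"], []).append(bullet)
    return grouped


# Hypothetical manifest, shaped the way an LLM session might emit it.
EXAMPLE_MANIFEST = json.dumps({
    "chat": "tg/chat_042",
    "daily_notes": [
        {"date": "2019-03-02", "text": "Booked flights to Lisbon",
         "sentiment": "excited", "msg_from": 730372, "msg_to": 730650},
    ],
    "ambiguities": ["msg 833006: 'John' without surname - which John?"],
})

for date, bullets in bullets_by_date(EXAMPLE_MANIFEST).items():
    print(date, bullets)
```

Keeping the model's output as data and the writing as plain code means a malformed manifest fails loudly at `json.loads` or a missing key instead of silently corrupting a note.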

An SQLite provenance store tracks every output bullet back to its source message, so a bad session can be rolled back surgically. Everything deterministic - parsing, filtering, deduplication, provenance tracking - stays in Python, so no actual messages make it into the vault, but I can always track their content down using the original archives as the source of truth.

Training the prompt

The prompt file that governs the LLM's behaviour started at 8 KB but quickly grew tenfold, primarily from mistakes. For example, the model read a thread where I walked a friend through iPhone Upgrade Program pricing math and wrote a purchase event to my life timeline, so I had to add a first-person possession test - no life-event classification without explicit first-person markers in the source ("I bought", "I signed").

A closure gate - a validation script that runs before marking any chat as done - catches some of this mechanically: orphan wikilinks, duplicate citations, language bleed. But it can't catch confabulation, so I've added sampling: pick 5-10 outputs at random after each batch, check them against the source. The model's self-reported confidence should never be a quality signal.
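The mechanical part of such a closure gate might look like the sketch below. It covers orphan wikilinks, duplicate citations, and drawing a random sample for manual review. Language-bleed detection is left out, and the regexes and function names are assumptions about the note format, not the author's script.

```python
import random
import re
from collections import Counter

# [[Target]], [[Target|alias]] and [[Target#heading]] all resolve to "Target".
WIKILINK = re.compile(r"\[\[([^\]|#]+)")
# Provenance markers in the form shown earlier in the article.
CITATION = re.compile(r"\(chat:: (\S+)\) \(msg:: (\d+) - (\d+)\)")


def orphan_wikilinks(note_text: str, existing_titles: set[str]) -> list[str]:
    """Wikilink targets that have no corresponding note."""
    targets = {t.strip() for t in WIKILINK.findall(note_text)}
    return sorted(targets - existing_titles)


def duplicate_citations(note_text: str) -> list[tuple[str, str, str]]:
    """Provenance markers that appear more than once in the same note."""
    counts = Counter(CITATION.findall(note_text))
    return sorted(c for c, n in counts.items() if n > 1)


def sample_for_review(bullets: list, k: int = 5, seed=None) -> list:
    """Pick up to k bullets at random to check against the source by hand."""
    rng = random.Random(seed)
    return rng.sample(bullets, min(k, len(bullets)))
```

The sample is drawn uniformly rather than by model confidence, in line with the point that self-reported confidence should not steer the review.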

Directional sentiment

At this point I had structured data - people, places, events, hobbies, recipes. But I also wanted to know how my relationships felt. Standard sentiment analysis assigns one polarity per message: positive, negative, neutral. If one person is enthusiastic and the other is giving one-word replies, VADER would tag the conversation as positive, but the reality is asymmetric[7]: one side is warm, the other is flat, and that delta is what makes it interesting.

[7] Poria et al. (2019) showed you can't assign emotion to a conversation without tracking who said what. A message from Person A might read as angry in isolation but is sarcasm in context of their usual tone with Person B.

You could build this with classical ML - per-speaker emotion classification, then combine into pairs - but close friendships are warm by default. The signal isn't absolute emotion, it's departure from baseline. A message tagged joy means nothing if every message in this relationship gets tagged joy. You need the model to understand what normal looks like for this specific pair, or you'll get friendly banter tagged as "flirting".
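One way to express "departure from baseline" numerically is a per-direction z-score. The sketch assumes each message already has a numeric warmth score from some upstream scorer, which is hypothetical here. It illustrates the idea only and is not the method used in the article.

```python
from collections import defaultdict
from statistics import mean, pstdev

Direction = tuple[str, str]  # (sender, recipient)


def directional_baselines(
    scored: list[tuple[str, str, float]]
) -> dict[Direction, tuple[float, float]]:
    """Mean and spread of scores, kept separately for each direction of a pair."""
    by_direction: dict[Direction, list[float]] = defaultdict(list)
    for sender, recipient, score in scored:
        by_direction[(sender, recipient)].append(score)
    return {d: (mean(v), pstdev(v)) for d, v in by_direction.items()}


def departure(score: float, baseline: tuple[float, float]) -> float:
    """How many standard deviations a score sits from this direction's normal."""
    mu, sigma = baseline
    if sigma == 0:
        return 0.0  # no variation observed, so nothing counts as a departure
    return (score - mu) / sigma


# Toy data: "me" is consistently warm towards "ann", "ann" is flatter in return.
history = [
    ("me", "ann", 0.8), ("me", "ann", 0.6),
    ("ann", "me", 0.1), ("ann", "me", 0.3),
]
baselines = directional_baselines(history)
print(departure(0.3, baselines[("me", "ann")]))  # well below my usual warmth
print(departure(0.3, baselines[("ann", "me")]))  # within her normal range
```

The same score of 0.3 reads as a sharp drop in one direction and as ordinary in the other, which is the asymmetry a single per-conversation polarity would hide.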