FOUNDER & CEO @ MAXORA AI · PH.D., NTU EE

Yun-Yen Chuang

Teaching AI to learn how to explore on its own

I build explorer–exploiter systems that learn how to explore — a second network that schedules noise, samples, and probes, so the generator can focus on what it does best. Diffusion, GANs, and meta reinforcement learning for natural language generation.

Meta-Exploration · Text Diffusion · Language GANs · Meta RL · NLG
01

About

Yun-Yen Chuang (莊昀諺)
“Kloud” Yun-Yen Chuang · Data Scientist · Founder & CEO @ Maxora AI · Ph.D., NTU EE

I am the Founder & CEO of Maxora AI, where I turn research into product. My doctoral research at National Taiwan University, advised by Prof. Hung-yi Lee, works at the intersection of generative modeling and reinforcement learning for language.

My work asks a single question: instead of hand-crafting how a model adds noise, samples tokens, or explores — what if a second network learned that policy? The generator becomes the exploiter; a meta-trained scheduler or explorer does the searching.

▸ Research thread · one idea across the work

A recurring move runs through my papers: a second network that learns how to explore or schedule, so the main model can focus on generating or detecting. Meta-DiffuB and MetaEx-GAN make this explicit with a scheduler / explorer–exploiter pair; QMVDet uses a query-based-learning scheduler to decide when to guide a detector; and it traces back to RapGAN and SODM. Each method below is laid out as an instrument you can scroll through.

02

What we're building

At Maxora AI I'm turning the meta-exploration idea into a long-running growth agent: an autonomous system that designs, ships, measures, and re-designs marketing — and never stops learning from what it launches.

Landing-page generation · Ad-creative generation · Trend sensing · Meta / Google performance feedback · On-device audience tuning · Harness-engineered agents
▸ RESEARCH NOTES · Harness Engineering in production
How I keep a long-running AI agent stable, observable, and self-improving: the engineering behind the loop.

▸ ENGINEERING NOTES · Building a local Claude Code
Self-hosting a small model, beating a tiny context window, and an async agent loop: how to build a privacy-first coding agent at Claude-Code level.
03

How the methods work

NeurIPS 2024

Meta-DiffuB

A contextualized sequence-to-sequence text diffusion model with meta-exploration. Chuang et al., 2024.

METHOD · scheduler–exploiter
[Figure: Meta-DiffuB scheduler–exploiter pipeline. Baseline: one fixed √-schedule, every sentence treated alike. The Scheduler Bψ (meta-explorer) reads wˣ and emits Meta-Instructions ι = Bψ(wˣ), e.g. T F T T F T F; the skipping rule (T: step the noise up, F: hold the level) gives the contextualized schedule βˣ = skipping(ι, β√), so easy sentences get more noise and hard sentences get less. The Exploiter Dθ (S2S diffusion) runs reverse diffusion with βˣ from z_T ~ N(0, I) to z₀ and rounds z₀ to the discrete target sentence ŷ. The meta-reward R_β = r′ − r (BLEU after minus BLEU before the update of Dθ) trains the Scheduler through the policy gradient ∇ψ J(ψ). Result panel: state of the art on 4 / 4 Seq2Seq benchmarks (CC, QT, WA, QQP) vs DiffuSeq, SeqDiffuSeq, Dinoiser, and PLMs; the scheduler is plug-and-play, with no fine-tuning at inference.]
DiffuSeq imposes one fixed noise schedule on every sentence — regardless of how hard it is to generate.
STEP 00 · the problem

One schedule for every sentence

Standard S2S-diffusion (DiffuSeq) adds noise on a fixed √-schedule. A trivial paraphrase and a hard open-domain reply are corrupted exactly the same way.

But sentences differ in difficulty — non-contextualized noise leaves performance on the table.

STEP 01 · the scheduler

A network that reads the sentence

The Scheduler Bψ — a small Seq2Seq model — reads the conditioning sentence wˣ and emits a sequence of Meta-Instructions ι, each labelled True or False.

A ‘skipping’ rule turns those labels into noise: True steps the noise up, False holds it.

ι = Bψ(wˣ) → βˣ = skipping(ι, β√)
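The skipping rule can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the released Meta-DiffuB code: `sqrt_schedule` is a hypothetical stand-in for the base schedule β√, and the convention that the pointer starts at the first level and never runs past the last one is an assumption made here for simplicity.

```python
import math
from typing import List


def sqrt_schedule(num_steps: int) -> List[float]:
    """Hypothetical sqrt-shaped noise levels that increase with the step index."""
    return [math.sqrt(t / num_steps) for t in range(1, num_steps + 1)]


def skipping(instructions: List[bool], base: List[float]) -> List[float]:
    """Turn True/False Meta-Instructions into a per-sentence noise schedule.

    True  -> step up to the next level of the base schedule.
    False -> hold the current level.
    """
    level = 0  # assumed starting point: the first level of the base schedule
    schedule = []
    for step_up in instructions:
        if step_up and level < len(base) - 1:
            level += 1
        schedule.append(base[level])
    return schedule


base = sqrt_schedule(8)
# The instruction pattern shown in the figure: T F T T F T F
contextual = skipping([True, False, True, True, False, True, False], base)
```

A sentence whose instructions are mostly False stays on low noise levels for longer, which matches the "hold" reading of False in the text.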
STEP 02 · contextualized noise

Less noise for hard sentences

The result is a per-sentence schedule βˣ that bends away from the fixed baseline. Harder sentences get less noise to preserve signal; easier ones get more to boost diversity.

This is the move that non-contextualized schedulers can't make.

STEP 03 · the exploiter

Generate with the scheduled noise

The Exploiter Dθ — the S2S-diffusion model — diffuses and denoises using βˣ, recovering z₀ step by step, then rounds it back into a discrete target sentence ŷ.
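The final rounding step can be pictured as a nearest-neighbour lookup in the token-embedding table. The sketch below assumes Euclidean distance and a tiny hypothetical vocabulary; the rounding actually used by the model may differ.

```python
import math
from typing import Dict, List, Sequence


def round_to_tokens(
    z0: Sequence[Sequence[float]],
    embeddings: Dict[str, Sequence[float]],
) -> List[str]:
    """Map each denoised vector in z0 to the token with the closest embedding."""
    tokens = []
    for vec in z0:
        nearest = min(embeddings, key=lambda tok: math.dist(vec, embeddings[tok]))
        tokens.append(nearest)
    return tokens


# Hypothetical 2-d embedding table, for illustration only.
toy_embeddings = {"hello": (0.0, 0.0), "world": (1.0, 1.0)}
sentence = round_to_tokens([(0.1, -0.2), (0.9, 1.2)], toy_embeddings)
```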

STEP 04 · meta-reward

Generation quality teaches the scheduler

How much did the exploiter improve? Compare BLEU before and after an update: the meta-reward R_β = r′ − r flows back through a policy gradient to train the scheduler.

The scheduler learns how to noise — never touching the generator's loss directly.
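One way to read this update is as REINFORCE with the meta-reward as the return. The sketch below makes a strong simplification that is not taken from the paper: each Meta-Instruction is an independent Bernoulli draw with its own logit, whereas the real Scheduler is a Seq2Seq model. The inputs `bleu_before` and `bleu_after` are hypothetical scores.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def scheduler_update(
    logits: List[float],
    instructions: List[bool],
    bleu_before: float,
    bleu_after: float,
    lr: float = 0.1,
) -> List[float]:
    """One REINFORCE step on per-instruction Bernoulli logits.

    The meta-reward is R = r' - r. For a Bernoulli policy with p = sigmoid(z),
    d log pi(a) / dz = a - p, so each logit moves by lr * R * (a - p).
    """
    meta_reward = bleu_after - bleu_before
    updated = []
    for z, a in zip(logits, instructions):
        grad_log_prob = (1.0 if a else 0.0) - sigmoid(z)
        updated.append(z + lr * meta_reward * grad_log_prob)
    return updated


# The exploiter improved (0.20 -> 0.25), so the sampled instructions are reinforced.
new_logits = scheduler_update([0.0, 0.0], [True, False], 0.20, 0.25)
```

The generator's own loss never appears here: only the change in its score reaches the scheduler.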

STEP 05 · the result

State of the art, plug-and-play

Meta-DiffuB reaches state-of-the-art results on all four Seq2Seq benchmarks (CC, QT, WA, QQP), beating prior diffusion models (DiffuSeq, SeqDiffuSeq, Dinoiser) and fine-tuned PLMs.

Better still, the trained scheduler drops into existing models like DiffuSeq as a plug-and-play module — no fine-tuning required.

IEEE/ACM TASLP

MetaEx-GAN

Meta-exploration to improve natural language generation via generative adversarial networks. Chuang et al.

METHOD · explorer–generator–discriminator
[Figure: MetaEx-GAN explorer–generator–discriminator loop. In a classic Language GAN one network both samples and learns (explore + exploit), which gives sparse reward, weak diversity, and mode collapse. MetaEx-GAN splits the two jobs: the Explorer (teacher, meta-exploration) handles sampling and rolls out a diverse batch, the Generator Gθ (student) handles learning, and the Discriminator Dϕ scores real vs. fake and returns the reward for the policy-gradient update ∇J(θ). The meta-reward is the student's learning effectiveness, i.e. how much the generator improved from the explorer's batch, and it updates the explorer's exploration policy. Result panel: quality and diversity improve together without more sampling; MetaEx-GAN generalizes to GPT-2-based generators.]
In a classic Language GAN, a single Generator both samples and learns — the same policy explores and exploits.
STEP 00 · the problem

One network, two jobs

Language GANs train a generator with reinforcement learning. But that single generator must both sample (explore) and learn (exploit) — with the same policy.

Rewards are sparse, so exploration is poor: quality and diversity can't improve together, and the model drifts toward mode collapse.

STEP 01 · split the roles

A dedicated explorer

MetaEx-GAN adds a meta-trained Explorer (the teacher), whose only job is sampling. The Generator (the student) is freed to just learn.

STEP 02 · explore

Sample a richer space

The explorer rolls out a diverse batch of candidate sequences — searching parts of the space the generator's own policy would never reach.

STEP 03 · learn & judge

Generator learns, discriminator scores

The generator learns from that batch; the Discriminator scores real vs. generated and returns a reward that updates the generator by policy gradient.

STEP 04 · meta-reward

The student's progress trains the teacher

The generator's learning effectiveness — how much it improved on the explorer's batch — becomes the meta-reward that updates the explorer's policy.

The teacher learns to explore exactly where the student learns most.
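The whole teacher and student loop can be played out on a toy problem. The sketch below is an illustration under heavy assumptions, not the MetaEx-GAN implementation: "sequences" are single tokens, both policies are plain categorical distributions, the discriminator is a fixed hypothetical reward table, and the generator learns from the explorer's samples without importance weighting.

```python
import math
import random
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits: List[float], reward: List[float]) -> float:
    """Average discriminator score under the policy defined by `logits`."""
    return sum(p * r for p, r in zip(softmax(logits), reward))


def reinforce(logits: List[float], token: int, signal: float, lr: float) -> List[float]:
    """REINFORCE step for a categorical policy: d log pi(token)/dz_k = 1[k == token] - p_k."""
    probs = softmax(logits)
    return [
        z + lr * signal * ((1.0 if k == token else 0.0) - p)
        for k, (z, p) in enumerate(zip(logits, probs))
    ]


def train(reward: List[float], steps: int = 200, lr: float = 0.5,
          seed: int = 0) -> Tuple[List[float], List[float]]:
    rng = random.Random(seed)
    explorer = [0.0] * len(reward)   # teacher: decides what to sample
    generator = [0.0] * len(reward)  # student: learns from the samples
    for _ in range(steps):
        # 1. The explorer samples.
        token = rng.choices(range(len(reward)), weights=softmax(explorer))[0]
        # 2. The generator learns from that sample, using the discriminator score.
        before = expected_reward(generator, reward)
        generator = reinforce(generator, token, reward[token], lr)
        after = expected_reward(generator, reward)
        # 3. The generator's improvement is the meta-reward for the explorer.
        explorer = reinforce(explorer, token, after - before, lr)
    return softmax(explorer), softmax(generator)


# Hypothetical discriminator scores for a 3-token vocabulary.
explorer_probs, generator_probs = train([0.1, 0.9, 0.2])
```

In this toy setting the explorer drifts toward the token that helps the generator most, which is the intuition behind "explore where the student learns most".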

STEP 05 · the result

Quality and diversity, together

MetaEx-GAN reaches state-of-the-art NLG, improving sampling quality and diversity at once — without generating more sequences.

It also generalizes to large pre-trained generators like GPT-2.

SENSORS ’24

QMVDet

Query-based multiview detection with a camera-aware attention scheduler. Hsu, Yuan, Chuang, Sun, Chang — co-author.

METHOD · 2D guides 3D via QBL
[Figure: QMVDet pipeline, 2D guides 3D via query-based learning (QBL). Many cameras are fused onto one bird's-eye ground plane; MVDetr weights every camera equally (w = 1/C) although occlusion varies. A 2D single-view detection-by-tracking network, FairMOT (DLA-34) with heatmap and box heads, gives reliable 2D foot points, and tracklets fill missed detections. Feature maps are projected with γ[u v 1]ᵀ = P[x y z 1]ᵀ and encoded into per-camera BEV feature maps by a deformable transformer. Camera reliability is the 2D–3D consistency: the predicted 3D point ĝ is projected to 2D and compared with the 2D detection g̃, d = argminⱼ ‖g̃²ᴰ − f₃ᴅ→₂ᴅ(ĝ)‖, where small d means reliable; the average discrepancy gives the per-camera attention Ac. Camera-aware attention uses softmax(Ac·ξc) and the fusion F = (1/C) Σ Ac·fc. The QBL scheduler activates if (1 − Hₜ/Hₜ₋₁) > 0.1, steering learning only when the camera weight order shifts, which saves the costly 2D–3D consistency computation. Result panel: state-of-the-art MODA vs MVDet, MVDetr, and 3DROM (Wildtrack 93.1%, MultiviewX 95.1%), robust even when cameras drop out.]
Multiview detection fuses many cameras onto one ground plane — but weighting them equally ignores occlusion.
STEP 00 · the problem

Not every camera is equally reliable

Multiview detection projects feature maps from many cameras onto one ground plane to beat occlusion. The strong baseline, MVDetr, weights every camera equally.

But occlusion depends on object position and camera angle — equal weighting leaves accuracy on the table.

STEP 01 · 2D detection

A 2D network to anchor the truth

A single-view detection-by-tracking network (FairMOT on DLA-34) produces reliable 2D foot points, using tracklets to fill in missed detections under occlusion.

STEP 02 · project & encode

Lift each view to the ground plane

Perspective transformation maps each camera's feature map onto the bird's-eye ground plane; a deformable transformer encodes per-camera BEV features.

STEP 03 · 2D–3D consistency

Reliability = agreement between 2D and 3D

Project the predicted 3D foot point back to 2D and measure its discrepancy against the 2D detection. Small discrepancy means a trustworthy camera.

Averaged per camera, this becomes a camera-aware attention weight Ac.
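The consistency check can be sketched as a reprojection followed by a nearest-detection distance. The projection follows γ[u v 1]ᵀ = P[x y z 1]ᵀ; everything else is an assumption made for illustration: the toy matrix `P`, the Euclidean pixel distance, and especially the mapping from average discrepancy to attention, which is written here as a softmax over negative discrepancies because the page does not spell it out.

```python
import math
from typing import List, Sequence, Tuple


def project(P: Sequence[Sequence[float]], point3d: Sequence[float]) -> Tuple[float, float]:
    """Apply a 3x4 projection matrix to a 3D point and dehomogenize."""
    x, y, z = point3d
    h = (x, y, z, 1.0)
    u, v, w = (sum(P[r][c] * h[c] for c in range(4)) for r in range(3))
    return (u / w, v / w)


def discrepancy(P, foot3d, detections2d) -> float:
    """Distance from the reprojected 3D foot point to the closest 2D detection."""
    reprojected = project(P, foot3d)
    return min(math.dist(reprojected, det) for det in detections2d)


def camera_attention(mean_discrepancies: List[float]) -> List[float]:
    """Assumed mapping: smaller average discrepancy -> larger weight (softmax of -d)."""
    scores = [-d for d in mean_discrepancies]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy projection that simply drops z, for illustration only.
P = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
d_good = discrepancy(P, (2.0, 3.0, 0.0), [(2.0, 3.0), (10.0, 10.0)])  # camera agrees
d_bad = discrepancy(P, (2.0, 3.0, 0.0), [(5.0, 7.0)])                 # camera is off
weights = camera_attention([d_good, d_bad])
```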

STEP 04 · QBL scheduler

Guide learning — only when it matters

Cameras are aggregated by attention-weighted averaging, F = (1/C)·Σ Ac·fc. Because the consistency computation is expensive, a query-based-learning (QBL) scheduler steps in only when the camera weight order shifts, which it detects as a relative entropy change 1 − Hₜ/Hₜ₋₁ above 0.1.

The 2D network acts as an oracle that schedules when to teach the 3D detector — a scheduler, just like in my generation work.
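The fusion and the gate fit in a short sketch. The fusion formula F = (1/C)·Σ Ac·fc and the 0.1 threshold on 1 − Hₜ/Hₜ₋₁ come from the text; treating H as the Shannon entropy of the normalized camera weights is an assumption, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def entropy(weights: Sequence[float]) -> float:
    """Shannon entropy of the camera weights after normalizing them to sum to 1."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_query(prev_weights: Sequence[float], curr_weights: Sequence[float],
                 threshold: float = 0.1) -> bool:
    """Activate the 2D-3D consistency guidance if 1 - H_t / H_{t-1} > threshold."""
    h_prev = entropy(prev_weights)
    if h_prev == 0.0:
        return False  # degenerate case, handled conservatively in this sketch
    return (1.0 - entropy(curr_weights) / h_prev) > threshold


def fuse(attention: Sequence[float], features: Sequence[Sequence[float]]) -> List[float]:
    """F = (1/C) * sum_c A_c * f_c over C per-camera feature vectors."""
    num_cameras = len(features)
    dim = len(features[0])
    return [
        sum(a * f[i] for a, f in zip(attention, features)) / num_cameras
        for i in range(dim)
    ]


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.7, 0.1, 0.1, 0.1]
trigger = should_query(uniform, peaked)   # the weights sharpened noticeably
fused = fuse([1.0, 1.0], [[2.0, 4.0], [6.0, 8.0]])
```

As written, the gate is one-sided: it fires when the entropy drops by more than 10% relative to the previous step.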

STEP 05 · the result

State of the art on both datasets

QMVDet sets state-of-the-art MODA on Wildtrack (93.1%) and MultiviewX (95.1%), beating MVDet, MVDetr and 3DROM.

It even stays robust when some cameras drop out.

IEEE · RLG

RapGAN

Adversarial rap lyric generation with phrase-level roll-out and attention rewards. Chuang, Hsu, Chang, Lee.

METHOD · PRO + AREGS
[Figure: RapGAN, PRO + AREGS. The Generator Gθ writes word by word and the attention-LSTM Discriminator returns the reward for ∇J(θ). SeqGAN rolls out the whole sentence (MCTS), costing O(N) roll-outs at every step, and using D(x, ŷ) directly leads to mode collapse and monotonous lyrics. PRO segments a line such as "I am doing wonderful, thank you" into meaningful phrases with TextRank, ρ′(y) = (p₁, p₂, …, p_T′), drops meaningless cuts to preserve fluency, and rolls out only to the phrase end t′_end rather than the sentence end, which streamlines training. AREGS uses attention-LSTM weights α = softmax(wᵀ·tanh(H)) with r = αᵀH to capture local phrases and global sentence semantics; the reward is the cosine feature matching S(α*, α̂) between real and generated attention weights instead of D directly, which prevents mode collapse. Result panel: state-of-the-art rap lyric generation (RLG) vs SeqGAN, MaliGAN, and Ghost Writer on diversity, originality, and fluency, plus an open 160k-song Chinese RLG dataset.]
SeqGAN rolls out entire sentences by Monte Carlo search at every step, which is expensive, and rewarding the generator with the raw discriminator score invites mode collapse.
STEP 00 · the problem

Rolling out whole sentences is expensive

To reward a partial sequence, SeqGAN runs Monte-Carlo roll-outs to the end of the sentence at every generation step — computationally heavy.

And using the discriminator's score directly invites mode collapse: rap lyrics that all sound the same.

STEP 01 · PRO

Roll out meaningful phrases

Phrase Roll-Out segments each line into meaningful phrases with TextRank, ρ′(y) = (p₁…p_T′), instead of arbitrary or full-sentence cuts.

Meaningless fragments are filtered out, which keeps the generated lyric fluent.

STEP 02 · streamlined reward

Reward to the phrase boundary

Each phrase is rewarded only out to its own end step t′_end — not the end of the whole sentence. The roll-out is far cheaper while staying meaningful.
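A rough way to see the saving is to count how many tokens each roll-out has to generate. The sketch below assumes the phrase segmentation is already given (the TextRank step is not reproduced), and the segmentation of the example line is hypothetical.

```python
from typing import List


def phrase_end_steps(phrases: List[List[str]]) -> List[int]:
    """For every token position t, return t'_end: the index of the last token of its phrase."""
    ends: List[int] = []
    start = 0
    for phrase in phrases:
        end = start + len(phrase) - 1
        ends.extend([end] * len(phrase))
        start += len(phrase)
    return ends


def rollout_tokens(ends: List[int], to_sentence_end: bool = False) -> int:
    """Total tokens generated by one roll-out per position.

    Phrase-level: roll out from t to t'_end. Sentence-level: roll out from t to the last token.
    """
    last = len(ends) - 1
    return sum((last if to_sentence_end else end) - t for t, end in enumerate(ends))


# Hypothetical segmentation of the example line from the figure.
phrases = [["I", "am"], ["doing", "wonderful"], [","], ["thank", "you"]]
ends = phrase_end_steps(phrases)
phrase_cost = rollout_tokens(ends)
sentence_cost = rollout_tokens(ends, to_sentence_end=True)
```

For this seven-token line the phrase-level roll-outs generate 3 tokens in total, against 21 when every position is rolled out to the end of the sentence.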

STEP 03 · AREGS

An attention-LSTM discriminator

The discriminator learns attention alignment weights α = softmax(wᵀ·tanh(H)), capturing both local phrase features and global sentence meaning across variable-length phrases.

STEP 04 · feature matching

Match the weights, not the score

Instead of the raw score D(x,ŷ), the generator is rewarded by the cosine similarity S(α*, α̂) between real and generated attention weights.

This feature-matching signal prevents mode collapse and lifts diversity and originality.
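The attention weights and the matching reward can be written out directly from the two formulas above. This is a minimal sketch with hypothetical inputs: `H` is a list of hidden-state vectors, `w` is the attention vector, and the real and generated sequences are assumed to have the same length so that their weight vectors can be compared position by position.

```python
import math
from typing import List, Sequence


def attention_weights(w: Sequence[float], H: Sequence[Sequence[float]]) -> List[float]:
    """alpha = softmax(w^T tanh(H)), with one score per time step."""
    scores = [sum(wi * math.tanh(hi) for wi, hi in zip(w, h)) for h in H]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """S(a, b) = a.b / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical hidden states for a real and a generated 3-step sequence.
w = [0.5, -0.3]
H_real = [[0.2, 0.1], [1.5, -0.4], [0.0, 0.9]]
H_generated = [[0.1, 0.2], [1.2, -0.1], [0.3, 0.8]]

alpha_real = attention_weights(w, H_real)
alpha_generated = attention_weights(w, H_generated)
reward = cosine_similarity(alpha_real, alpha_generated)
```

Because the reward depends on where the discriminator attends rather than on its final score, two lyrics can earn similar rewards without being identical.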

STEP 05 · the result

Better lyrics — and a dataset

RapGAN beats SeqGAN, MaliGAN and Ghost Writer on diversity, originality and fluency, validated against human evaluation.

It also releases an open 160,000-song Chinese rap dataset for future research.

ASONAM ’17

SODM & SOS

Detect & rearrange social-overloaded posts to prevent social overload. Chuang, Hsu, Lin, Chang.

SYSTEM · detect → rearrange
[Figure: SODM & SOS, detect then rearrange. In a first-in-first-out feed, support-seeking posts cluster and reader load spikes (the "cost of caring": stress, depression). Posts are labelled by board (overload = {Prozac, Hate}, normal = others) with balanced sampling to avoid board imbalance, and words are embedded with Word2vec Skip-Gram. The SODM detector (CKDGNN) stacks word embedding → CNN filters → document-level K-max pooling → GRNN → softmax and reaches 95.15% overload-detection accuracy in 5-fold CV, beating DCNN, CNN-GRNN, and LSTM-GRNN. Every post is scored (e.g. 0.18, 0.91, 0.78, 0.32, 0.83, 0.27) and posts above θ = 0.5 are flagged as social-overload. SOS allows at most 3 consecutive overload posts before inserting a calm post, so reader load stays under the tolerance line. Result panel: 95.15% detection accuracy; 75% of participants reported reduced stress.]
A first-in-first-out feed lets support-seeking posts cluster — overwhelming the reader.
STEP 00 · the problem

Feeds let negativity pile up

Social feeds display posts first-in-first-out. When many support-seeking, negative posts land together, readers absorb them all at once.

Psychologists call it the cost of caring — repeated exposure drives stress and depression.

STEP 01 · the data

Label boards, learn embeddings

Posts crawled from a BBS are labelled by source: the Prozac and Hate boards as social-overload, others as normal — balanced to avoid skew.

Words are embedded with Word2vec Skip-Gram.

STEP 02 · SODM

A document-level detector

CKDGNN stacks word embeddings → CNN filters → K-max pooling at the document level → a GRNN → softmax, scoring each post's overload probability.

It reaches 95.15% accuracy, beating DCNN, CNN-GRNN and LSTM-GRNN.
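K-max pooling keeps the k largest activations of a feature map while preserving their original order, which gives a fixed-size output for inputs of any length. The sketch below shows the operation on a single one-dimensional feature map; how CKDGNN wires it between the CNN and the GRNN is not reproduced here.

```python
from typing import List, Sequence


def k_max_pool(values: Sequence[float], k: int) -> List[float]:
    """Return the k largest values, in the order they appear in the input."""
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in sorted(top)]


pooled = k_max_pool([0.1, 0.9, 0.3, 0.7, 0.2], k=3)
```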

STEP 03 · scoring

Flag with a threshold

Every post gets a probability. Anything above the 0.5 threshold is flagged as social-overload — the rest are normal load.

STEP 04 · SOS

Rearrange to protect the reader

The Social-Overload prevention System re-sorts the feed so no more than three overload posts appear in a row — inserting a calmer post to break the streak.

Reader load stays under the tolerance line.
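The flagging and re-sorting steps can be sketched as a single greedy pass. The 0.5 threshold and the limit of three consecutive overload posts come from the text; the greedy look-ahead that pulls the next calm post forward is an assumption about how the streak is broken, and the scores are made up.

```python
from collections import deque
from typing import List, Sequence


def rearrange(scores: Sequence[float], threshold: float = 0.5,
              max_streak: int = 3) -> List[int]:
    """Return post indices in display order, breaking long overload streaks.

    A post is flagged as overload when its score is above `threshold`. When the
    next post would extend a streak past `max_streak`, the earliest remaining
    calm post is shown first. If no calm post is left, the order is kept.
    """
    pending = deque(range(len(scores)))
    order: List[int] = []
    streak = 0
    while pending:
        head = pending[0]
        if scores[head] > threshold and streak >= max_streak:
            calm = next((j for j in pending if scores[j] <= threshold), None)
            if calm is not None:
                pending.remove(calm)
                order.append(calm)
                streak = 0
                continue
        pending.popleft()
        order.append(head)
        streak = streak + 1 if scores[head] > threshold else 0
    return order


# Hypothetical overload probabilities for a seven-post feed.
feed_scores = [0.9, 0.8, 0.7, 0.95, 0.6, 0.1, 0.2]
display_order = rearrange(feed_scores)
```

Here the first calm post (index 5) is moved up to sit between the third and fourth overload posts, and the rest of the feed keeps its original order.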

STEP 05 · the result

A measurably calmer feed

Detection hits 95.15% accuracy, and after rearrangement 75% of participants reported that social-overload stress was reduced.

04

Education

Ph.D., Electrical Engineering

National Taiwan University

SEP 2022 – 2025 · SPEECH LAB · ADVISOR — PROF. HUNG-YI LEE (李宏毅)

M.S., Engineering Science & Ocean Engineering

National Taiwan University

SEP 2015 – JUN 2017 · iCAN LAB · ADVISOR — PROF. RAY-I CHANG (張瑞益)

B.S., Computer Science

National Changhua University of Education

SEP 2011 – JUN 2015

05

Selected publications

06

Get in touch

Open to research collaborations, talks, and conversations about meta-exploration, generative models, and applied AI.