Thursday, February 24, 2011

A reservoir of time constants for memory traces in cortical neurons.

Nat Neurosci. 2011 Feb 13.
Bernacchia A, Seo H, Lee D, Wang XJ.

For learning to work well, rewards need to be evaluated on an appropriate timescale: when the environment changes rapidly (slowly), rewards should be evaluated on a short (long) timescale.

In all three recorded areas, ACC, dlPFC, and LIP, there are neurons that hold reward information at different timescales, and the distribution of those timescales follows a power law.

According to reinforcement learning theory of decision making, reward expectation is computed by integrating past rewards with a fixed timescale. In contrast, we found that a wide range of time constants is available across cortical neurons recorded from monkeys performing a competitive game task. By recognizing that reward modulates neural activity multiplicatively, we found that one or two time constants of reward memory can be extracted for each neuron in prefrontal, cingulate and parietal cortex. These timescales ranged from hundreds of milliseconds to tens of seconds, according to a power law distribution, which is consistent across areas and reproduced by a 'reservoir' neural network model. These neuronal memory timescales were weakly, but significantly, correlated with those of monkey's decisions. Our findings suggest a flexible memory system in which neural subpopulations with distinct sets of long or short memory timescales may be selectively deployed according to the task demands.
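To make the contrast concrete: standard RL integrates past rewards with one fixed timescale, while here each neuron acts as a leaky integrator with its own time constant, and the time constants are power-law distributed. A toy Python sketch of that idea (my own illustration, not the authors' reservoir model; the unit count, seed, and power-law exponent are made up):

import numpy as np

rng = np.random.default_rng(0)

# Toy reservoir of reward memories: each "neuron" integrates the
# trial-by-trial reward sequence with its own time constant.
n_units = 100
# Heavy-tailed (power-law-like) time constants, in units of trials.
taus = 1.0 / rng.power(0.5, n_units)                  # tau >= 1

rewards = rng.integers(0, 2, size=200).astype(float)  # binary reward history

traces = np.zeros(n_units)
for r in rewards:
    # Leaky integration: each trace decays by exp(-1/tau) per trial,
    # so each unit holds reward memory over its own timescale.
    traces = traces * np.exp(-1.0 / taus) + r

# Short-tau units reflect the most recent rewards; long-tau units
# reflect the long-run reward rate. Reading out different subsets
# gives reward estimates at different timescales.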

Monday, February 21, 2011

Separate value comparison and learning mechanisms in macaque medial and lateral orbitofrontal cortex.

A lesion study in monkeys. In reward-based decision making, the mOFC (medial orbitofrontal cortex) is involved in the decision itself, while the lOFC (lateral orbitofrontal cortex) is involved in learning. http://www.ncbi.nlm.nih.gov/pubmed/21059901 http://www.ncbi.nlm.nih.gov/pubmed/20346766

The finding that credit assignment fails without the lOFC is really interesting. Even plain old "conditioning" apparently still has frontiers left.

Proc Natl Acad Sci U S A. 2010 Nov 23;107(47):20547-52. Epub 2010 Nov 8.
Separate value comparison and learning mechanisms in macaque medial and lateral orbitofrontal cortex.
Noonan MP, Walton ME, Behrens TE, Sallet J, Buckley MJ, Rushworth MF.

Uncertainty about the function of orbitofrontal cortex (OFC) in guiding decision-making may be a result of its medial (mOFC) and lateral (lOFC) divisions having distinct functions. Here we test the hypothesis that the mOFC is more concerned with reward-guided decision making, in contrast with the lOFC's role in reward-guided learning. Macaques performed three-armed bandit tasks and the effects of selective mOFC lesions were contrasted against lOFC lesions. First, we present analyses that make it possible to measure reward-credit assignment--a crucial component of reward-value learning--independently of the decisions animals make. The mOFC lesions do not lead to impairments in reward-credit assignment that are seen after lOFC lesions. Second, we examined how the reward values of choice options were compared. We present three analyses, one of which examines reward-guided decision making independently of reward-value learning. Lesions of the mOFC, but not the lOFC, disrupted reward-guided decision making. Impairments after mOFC lesions were a function of the multiple option contexts in which decisions were made. Contrary to axiomatic assumptions of decision theory, the mOFC-lesioned animals' value comparisons were no longer independent of irrelevant alternatives.

Neuron. 2010 Mar 25;65(6):927-39.
Separable learning systems in the macaque brain and the role of orbitofrontal cortex in contingent learning.
Walton ME, Behrens TE, Buckley MJ, Rudebeck PH, Rushworth MF.

Orbitofrontal cortex (OFC) is widely held to be critical for flexibility in decision-making when established choice values change. OFC's role in such decision making was investigated in macaques performing dynamically changing three-armed bandit tasks. After selective OFC lesions, animals were impaired at discovering the identity of the highest value stimulus following reversals. However, this was not caused either by diminished behavioral flexibility or by insensitivity to reinforcement changes, but instead by paradoxical increases in switching between all stimuli. This pattern of choice behavior could be explained by a causal role for OFC in appropriate contingent learning, the process by which causal responsibility for a particular reward is assigned to a particular choice. After OFC lesions, animals' choice behavior no longer reflected the history of precise conjoint relationships between particular choices and particular rewards. Nonetheless, OFC-lesioned animals could still approximate choice-outcome associations using a recency-weighted history of choices and rewards.
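The key distinction in the Walton et al. paper, contingent learning versus a recency-weighted approximation, can be written down in a few lines. A hedged Python sketch (my own simplification of the idea, not the authors' analysis code; the learning rate and decay are invented):

import numpy as np

alpha = 0.2          # learning rate (illustrative value)
n_options = 3

def contingent_update(values, choice, reward):
    # Contingent learning: credit the outcome precisely to the
    # choice that actually produced it (intact-OFC behavior).
    values[choice] += alpha * (reward - values[choice])
    return values

def recency_weighted_update(values, recent_choices, reward, decay=0.5):
    # Spread credit over the recency-weighted history of choices,
    # regardless of which choice caused the reward -- roughly what
    # OFC-lesioned animals appear to fall back on.
    weights = decay ** np.arange(len(recent_choices))   # most recent first
    weights /= weights.sum()
    for c, w in zip(recent_choices, weights):
        values[c] += alpha * w * (reward - values[c])
    return values

values = np.full(n_options, 0.5)
values = contingent_update(values, choice=0, reward=1.0)
values = recency_weighted_update(values, recent_choices=[0, 2, 1], reward=1.0)

Under the second rule, choice behavior still tracks reward statistics approximately, but the precise conjoint choice-reward relationships are lost, matching the lesion result.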

Ventral Striatum and Orbitofrontal Cortex Are Both Required for Model-Based, But Not Model-Free, Reinforcement Learning

The Journal of Neuroscience, February 16, 2011, 31(7):2700-2705
Michael A. McDannald, Federica Lucantonio, Kathryn A. Burke, Yael Niv, and Geoffrey Schoenbaum

A lesion study in rats. Using an unblocking procedure, they show that the ventral striatum matters for both model-based and model-free reinforcement learning, while the OFC matters only for the former. Elegant, but I can't say it deepened my understanding of model-based RL...

In many cases, learning is thought to be driven by differences between the value of rewards we expect and rewards we actually receive. Yet learning can also occur when the identity of the reward we receive is not as expected, even if its value remains unchanged. Learning from changes in reward identity implies access to an internal model of the environment, from which information about the identity of the expected reward can be derived. As a result, such learning is not easily accounted for by model-free reinforcement learning theories such as temporal difference reinforcement learning (TDRL), which predicate learning on changes in reward value, but not identity. Here, we used unblocking procedures to assess learning driven by value- versus identity-based prediction errors. Rats were trained to associate distinct visual cues with different food quantities and identities. These cues were subsequently presented in compound with novel auditory cues and the reward quantity or identity was selectively changed. Unblocking was assessed by presenting the auditory cues alone in a probe test. Consistent with neural implementations of TDRL models, we found that the ventral striatum was necessary for learning in response to changes in reward value. However, this area, along with orbitofrontal cortex, was also required for learning driven by changes in reward identity. This observation requires that existing models of TDRL in the ventral striatum be modified to include information about the specific features of expected outcomes derived from model-based representations, and that the role of orbitofrontal cortex in these models be clearly delineated.
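The logic of the unblocking design is easy to express as code. A minimal sketch (my own toy formalization with a made-up identity representation, not the authors' model): a model-free learner carries only a scalar value, so it registers no error when reward identity changes at constant value, whereas a learner with an outcome-identity model does.

import numpy as np

def value_prediction_error(expected_value, received_value):
    # Model-free TDRL: an error arises only when scalar value changes.
    return received_value - expected_value

def identity_prediction_error(expected_identity, received_identity):
    # Model-based: an error arises whenever the expected outcome
    # features change, even at unchanged value (e.g., one food pellet
    # type swapped for another).
    expected = np.asarray(expected_identity, dtype=float)
    received = np.asarray(received_identity, dtype=float)
    return np.abs(received - expected).sum()

# Identity switch at equal value: the model-free learner sees nothing
# (no unblocking), the model-based learner sees a large error (unblocking).
print(value_prediction_error(1.0, 1.0))             # 0.0
print(identity_prediction_error([1, 0], [0, 1]))    # 2.0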

Friday, February 4, 2011

Prefrontal coding of temporally discounted values during intertemporal choice.

Kim S, Hwang J, Lee D.
Neuron. 2008 Jul 10;59(1):161-72.

Reward from a particular action is seldom immediate, and the influence of such delayed outcome on choice decreases with delay. It has been postulated that when faced with immediate and delayed rewards, decision makers choose the option with maximum temporally discounted value. We examined the preference of monkeys for delayed reward in an intertemporal choice task and the neural basis for real-time computation of temporally discounted values in the dorsolateral prefrontal cortex. During this task, the locations of the targets associated with small or large rewards and their corresponding delays were randomly varied. We found that prefrontal neurons often encoded the temporally discounted value of reward expected from a particular option. Furthermore, activity tended to increase with discounted values for targets presented in the neuron's preferred direction, suggesting that activity related to temporally discounted values in the prefrontal cortex might determine the animal's behavior during intertemporal choice.

Intertemporal choice in monkeys. http://bit.ly/eHK9LF Choice behavior is well described by hyperbolic discounting, and neurons in the dorsolateral prefrontal cortex (DLPFC) encode the temporally discounted value. Still, Daeyeol Lee goes for brute force as usual.
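For reference, the hyperbolic discounting model fitted to choice behavior in this literature: a reward of magnitude A delayed by D seconds is worth DV = A / (1 + kD). A quick sketch (the value of k is made up for illustration):

def discounted_value(magnitude, delay, k=0.2):
    # Hyperbolic temporal discounting: DV = A / (1 + k * D).
    # k (per second) is an illustrative value, not from the paper.
    return magnitude / (1.0 + k * delay)

# Small-immediate vs large-delayed: preference can flip with delay.
print(discounted_value(1.0, 0.0))   # small reward now     -> 1.00
print(discounted_value(2.0, 8.0))   # large reward in 8 s  -> ~0.77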

Thursday, February 3, 2011

Temporal discounting predicts risk sensitivity in rhesus macaques.

Hayden BY, Platt ML.
Curr Biol. 2007 Jan 9;17(1):49-53.

Monkeys' risk preferences can be predicted from their time preferences.
Risk can be read as "a longer wait until the reward is obtained": when the time to the next choice (the ITI) is long, the monkeys become risk averse, which temporal discounting can explain (a toy version of this logic is sketched below). As an overall tendency, monkeys seem to be risk seeking. A bit surprising.

Reading this paper makes clear that experiments on humans and experiments on animals use frameworks that look similar but are actually quite different (economics experiments have no such concept as an ITI). Comparing and interpreting them calls for caution.

If risk preferences can be predicted from time preferences, then how should the origin of temporal discounting itself be explained? Explaining it by the existence of interest rates feels tautological. One professor once offered the blunt explanation that discounting exists "to keep present values from diverging", but...
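Here is the toy version of "risk as a longer wait", continuing the hyperbolic-discounting sketch from the Kim et al. post above (again an illustration with invented parameters, not the paper's analysis): if the jackpot arrives on a random future trial, a longer ITI stretches the expected delay, and the discounted value of gambling falls.

def discounted_value(magnitude, delay, k=0.2):
    # Hyperbolic discounting, as in the sketch above (illustrative k).
    return magnitude / (1.0 + k * delay)

def risky_option_value(jackpot, p_win, iti, k=0.2, horizon=50):
    # Expected discounted value of repeatedly gambling for a jackpot
    # that hits with probability p_win per trial, trials spaced by iti (s).
    value = 0.0
    for n in range(1, horizon + 1):
        p_first_win_at_n = (1 - p_win) ** (n - 1) * p_win   # geometric wait
        value += p_first_win_at_n * discounted_value(jackpot, n * iti, k)
    return value

# Longer ITI -> the risky option is worth less -> more risk-averse choices.
print(risky_option_value(jackpot=2.0, p_win=0.5, iti=1.0))   # short ITI
print(risky_option_value(jackpot=2.0, p_win=0.5, iti=8.0))   # long ITI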

Wednesday, February 2, 2011

Dopamine-Mediated Reinforcement Learning Signals in the Striatum and Ventromedial Prefrontal Cortex Underlie Value-Based Choices

Gerhard Jocham, Tilmann A. Klein, and Markus Ullsperger
The Journal of Neuroscience, February 2, 2011, 31(5):1606-1613; doi:10.1523/JNEUROSCI.3904-10.2011

A large body of evidence exists on the role of dopamine in reinforcement learning. Less is known about how dopamine shapes the relative impact of positive and negative outcomes to guide value-based choices. We combined administration of the dopamine D2 receptor antagonist amisulpride with functional magnetic resonance imaging in healthy human volunteers. Amisulpride did not affect initial reinforcement learning. However, in a later transfer phase that involved novel choice situations requiring decisions between two symbols based on their previously learned values, amisulpride improved participants' ability to select the better of two highly rewarding options, while it had no effect on choices between two very poor options. During the learning phase, activity in the striatum encoded a reward prediction error. In the transfer phase, in the absence of any outcome, ventromedial prefrontal cortex (vmPFC) continually tracked the learned value of the available options on each trial. Both striatal prediction error coding and tracking of learned value in the vmPFC were predictive of subjects' choice performance in the transfer phase, and both were enhanced under amisulpride. These findings show that dopamine-dependent mechanisms enhance reinforcement learning signals in the striatum and sharpen representations of associative values in prefrontal cortex that are used to guide reinforcement-based decisions. 

Reinforcement learning and dopamine. Administering a D2 antagonist leaves learning itself unaffected but partially improves choices that use what was learned; fMRI signals related to prediction errors and values are also enhanced. The data are clean, but the interpretation doesn't feel settled. The discussion is tough...
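A compact sketch of the task logic as I read it (invented parameters and reward probabilities, not the authors' code): standard prediction-error learning during the learning phase, then a transfer phase where choices between novel pairings are driven purely by the learned values, with no further updating.

import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 0.3, 5.0          # illustrative learning rate, inverse temperature
p_reward = {"A": 0.8, "B": 0.2, "C": 0.7, "D": 0.3}
Q = {s: 0.5 for s in p_reward}

# Learning phase: fixed pairs, values updated by reward prediction error
# (the striatal signal in the paper).
for _ in range(200):
    for pair in (("A", "B"), ("C", "D")):
        probs = np.exp(beta * np.array([Q[s] for s in pair]))
        choice = pair[rng.choice(2, p=probs / probs.sum())]   # softmax choice
        reward = float(rng.random() < p_reward[choice])
        Q[choice] += alpha * (reward - Q[choice])             # prediction error

# Transfer phase: a novel pairing with no outcome delivered; the choice
# relies only on the learned values (the quantity vmPFC is reported to track).
novel_pair = ("A", "C")
probs = np.exp(beta * np.array([Q[s] for s in novel_pair]))
print(novel_pair[int(np.argmax(probs))], Q)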

From today through next Tuesday: behavioral experiments x 15 participants. Time to push through.