Say I place a quarter, face up, in my palm. Then I close my hand and shake it. I hold out my hand again, still closed. Now neither of us knows whether the quarter is face up or not. The energy of shaking my hand has changed the amount of information we know about the state of the quarter.
The point is that entropy (or "missing information"), energy, information, and heat are all related. You can create entropy by adding energy.
It's less intuitive, but you can also go the other way and get back energy from pure information. Strangely, if I know the state of the quarter, I can make my hand shake. (I guess it's not unusual for money to make people's hands shake.)
Coarse graining
Ignoring some historical detail, here's a primer about entropy. As long as the energy of a system is fixed, the entropy is equal to the logarithm of the number of states a system can take.
$$ S = \log ( \Omega ) $$
The change in entropy is also equal to the amount of heat absorbed divided by the temperature, at constant temperature.
$$ dS = \frac{dQ}{T} $$
It's hard to see how these are related. We'll talk about that later.
So let's say you want to start computing the entropy. The simplest case is a single particle in a box. Fix the energy so that the phase space is constrained to a finite volume. Straightaway you can see that there are an infinite number of states (or points), \( \Omega \), in this volume, because there are infinitely-many points in a continuous volume. So that's no good.
Here's how people typically handle this. They partition the phase space \( \Gamma \) into finitely many chunks \( \{\gamma_i\} \) and then say that every state within a single chunk \( \gamma_i \) can be considered, for all intents and purposes, the same state. There -- now you only have one state per \( \gamma_i \), so you have finitely many states. The problem is that someone else can choose a different partition \( \{\gamma^\prime_i\} \), and now the entropy you calculate is different,
$$ S = \log(\Omega) \neq \log (\Omega^\prime) = S^\prime $$
The partitioning scheme is totally arbitrary, and there's nothing physical about it. What's also weird is that the entropy associated with the choice of \( \{\gamma_i\} \) sort of doesn't matter. For example, suppose you're interested in the entropy associated with the volume of an ideal gas. You discretize the volume into uniform chunks of size \( \gamma \) so that the volume entropy is
$$ S_\text{Vol} = \log (V/\gamma) = \log(V) + \log(1/\gamma) $$
and you get an entropy associated with the total size of the volume, as well as a contribution from the size of bins you picked. The smaller you make \( \gamma \), the more "entropy" you get from that part, despite the fact that nothing material was changed. (I'll go into detail there in a second.)
Note that the separability is what's key here. After all is said and done, choosing a different partitioning scheme just adds a constant to the entropy. When you go back to the Clausius entropy example,
$$ \Delta S = \Delta Q / T $$
any \( \Delta S \) is going to subtract off the entropy of partitioning (or coarse-graining), so the choice of partition drops out entirely. It only matters insofar as it defines a zero of entropy.
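Here's a minimal numeric sketch of that cancellation (the volumes and partition sizes below are just made-up example values): each entropy depends on the partition size \( \gamma \), but the difference between two entropies doesn't.

```python
import math

def volume_entropy(V, gamma):
    # S_Vol = log(V / gamma) = log(V) + log(1 / gamma)
    return math.log(V / gamma)

V1, V2 = 1.0, 3.0  # "before" and "after" volumes (arbitrary example values)

for gamma in (1e-3, 1e-6, 1e-9):  # three different partition sizes
    S1, S2 = volume_entropy(V1, gamma), volume_entropy(V2, gamma)
    # S1 and S2 each shift with gamma, but dS = log(V2/V1) never does
    print(f"gamma={gamma:.0e}  S1={S1:.3f}  S2={S2:.3f}  dS={S2 - S1:.6f}")
# dS is always log(3) ≈ 1.0986, independent of the partition size.
```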
Notably, in order for this coarse graining to make sense, the volume of the phase space \( \Gamma \) should be much bigger than the coarse-graining size. This is where the Sackur-Tetrode formula fails: when the temperature gets low, the entropy goes to negative infinity instead of zero, because the occupied volume of phase space becomes much smaller than \( h \) per degree of freedom.
Coarse-graining is also equivalent to throwing away information that you consider useless. Pick some phase space point inside some \( \gamma_i \). You throw away all the other information about that point -- the exact details of where it is inside that \( \gamma_i \) -- and keep only that one fact. By the end of the coarse-graining procedure, all you know is how many points were in each partition of phase space, nothing more. Because it takes infinitely many bits to specify a point in phase space, this is equivalent to removing an infinite amount of information. Then you let your system time-evolve. Any change in entropy will be relative to this baseline of finite entropy.
So it may seem like coarse-graining is a bad idea. On the other hand if you get rid of coarse-graining, you get even bigger problems.
The relative entropy
Gibbs/Von Neumann/Shannon have an equation for the entropy of a system whose finitely many states occur with probabilities \( \{ p_i \}\),
$$ S = - \sum p_i \log p_i. $$
In our single-particle-in-a-box example, we might associate a single \( p_i \) with the probability that the particle is in phase-space volume \( \gamma_i \). In other words, if you start with a particle with a continuous degree of freedom, you're already assuming coarse-graining when you write down this entropy formula.
Can we make this continuous? Let's see what happens when we try. We can use the ordinary Riemann-style summation procedure. We take \( p_i \approx p(\gamma_i) \Delta \gamma_i \), where \( p(\gamma_i) \) is a continuous function, and \( \Delta \gamma_i \) is the volume of that phase space chunk, so the discrete sum can be written, exactly, as
$$ S = -\sum \Delta \gamma_i p(\gamma_i) \log (p(\gamma_i) \Delta \gamma_i) $$
Just as before, you can separate out the entropy of the coarse-graining. For a very fine coarse-grained mesh, the result limits to
$$ S = -\int d \gamma p \log (p) + S_\text{coarse-graining} $$
where \( p = p(\gamma) \) and \( S_\text{coarse-graining} = \langle - \log (\Delta \gamma) \rangle\). The coarse-graining entropy diverges to infinity as the \(\gamma_i\) get smaller. As before, the entropy changes under a different choice of binning; this is the same issue, just seen from another point of view.
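To see the separation concretely, here's a small sketch; a standard normal stands in for \( p(\gamma) \) and the bins are uniform, both my own choices. The full binned sum keeps growing as the bins shrink, while the piece left over after subtracting \( \langle -\log\Delta\gamma \rangle \) settles down to the integral.

```python
import numpy as np

def binned_entropy(pdf, a, b, n_bins):
    """Coarse-grained entropy -sum dx * p(x_i) * log(p(x_i) * dx), plus the bin width."""
    x, dx = np.linspace(a, b, n_bins, endpoint=False, retstep=True)
    p = pdf(x + dx / 2)                 # evaluate at bin centers
    p = p[p > 0]
    return -np.sum(dx * p * np.log(p * dx)), dx

gaussian = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

for n_bins in (10, 100, 1000, 10000):
    S, dx = binned_entropy(gaussian, -10, 10, n_bins)
    # Subtracting the coarse-graining term <-log dx> leaves -∫ p log p ≈ 1.419
    print(f"bins={n_bins:6d}  S={S:8.4f}  S - <-log dx> = {S + np.log(dx):.4f}")
# S diverges as the bins shrink; the leftover piece converges to about 1.4189.
```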
However, what this does show us is that we can remove the entropy of coarse-graining by considering a "relative entropy"
$$ S_\text{rel} = -\sum \Delta \gamma_i p(\gamma_i) \log \left( \frac{p(\gamma_i) \Delta \gamma_i}{\mu(\gamma_i) \Delta \gamma_i} \right) $$
The cancellation in the \( \Delta \gamma_i \) means that this actually has no divergences, so we can safely take the limit
$$ -\sum \Delta \gamma_i p(\gamma_i) \log \left( \frac{p(\gamma_i)}{\mu(\gamma_i)} \right) \rightarrow - \int d\gamma p \log (p / \mu) $$
but that raises another question: what is the function \( \mu \) in the denominator? Apparently we can get rid of the entropy of coarse-graining by considering the relative entropy against any arbitrary function \( \mu \)! This equates to a Shannon entropy of
$$ S_\text{rel} = - \sum p_i \log \left( p_i / \mu_i \right) $$
We call this the relative entropy because it measures the entropy of the "true" distribution \( p \) relative to what we'd get with some reference distribution \( \mu \).
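And here's the companion sketch for the relative entropy (same made-up Gaussian \( p \), with a uniform \( \mu \) on the interval): because the bin size cancels inside the log, the answer settles down instead of diverging.

```python
import numpy as np

gaussian = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
uniform  = lambda x: np.full_like(x, 1 / 20)    # reference mu on [-10, 10]

def relative_entropy(p_pdf, mu_pdf, a, b, n_bins):
    """-sum dx * p(x_i) * log(p(x_i) / mu(x_i)): the bin size has cancelled."""
    x, dx = np.linspace(a, b, n_bins, endpoint=False, retstep=True)
    x = x + dx / 2
    p, mu = p_pdf(x), mu_pdf(x)
    keep = p > 0
    return -np.sum(dx * p[keep] * np.log(p[keep] / mu[keep]))

for n_bins in (10, 100, 1000, 10000):
    print(n_bins, relative_entropy(gaussian, uniform, -10, 10, n_bins))
# Converges to about -1.58 nats; no divergence, no matter how fine the bins are.
```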
Going back to our single particle, let's take \( \mu \) to be the uniform distribution. For the particle in the box at fixed energy, there's no entropy associated with the energy itself, but there is some associated with the momentum and position: the momentum can point anywhere, uniformly, on a circle of constant magnitude, and the particle can be anywhere, uniformly, inside the box. Therefore the \( p_i \) are all equal to the \( \mu_i \) and the relative entropy is just zero.
In other words, all the entropy of a classical particle in a box at fixed energy, relative to the uniform distribution, is due to coarse-graining.
Maxwell's simulation demon
If you have perfect knowledge about the microphysics of a system at any one point in time, you can reverse the second law. Specifically, if you had a gas in a box with a membrane down the middle, you could take a snapshot of the system at some time \( t \) and simulate it for any future time. Then you could remove the membrane just as a particle was moving to the left across it, and thus concentrate the gas on one side. This is Maxwell's simulation demon.
You're not a man until you have your own Maxwell's demon. Here's mine (though it's pretty much the same as Charles Bennett's).
Suppose you have a single particle in a box (again!) at fixed energy. It can be anywhere in the box. This Maxwell's demon is a computer program that can measure the position and momentum of the particle, reversibly, to whatever accuracy it wants. The information is stored as ones and zeros on her magnetic tape, which we assume is as long as desired.
Once the phase-space point of the particle is known, the computer can predict, to arbitrary accuracy, when the particle is in the left side of the box. Then it can quickly insert a membrane and trap it. The result is that the particle's entropy has decreased by \( \log 2 \) -- one bit.
The only difference now is in the mind of the demon. Her magnetic tape is now full of information.
Getting rid of coarse-graining?
There's a problem with getting rid of coarse-graining. Liouville's theorem says that the phase space volume is conserved for any Hamiltonian system, and since the entropy is the logarithm of the phase space volume, it never changes -- so it certainly never increases, and you get no sense of the second law of thermodynamics. Scrambled eggs get unscrambled, spilled water gets unspilled, sand spontaneously forms into mountains, etc.
It feels like we're doomed either way. Introduce an arbitrariness in the entropy, or else it won't behave like we expect it to. But we know entropy is a functioning concept. Something's gotta give.
Let's talk about information and coarse graining. Entropy is equal to the amount of information you need to know the exact state the system is really in. If the logarithm is base-two, this information is measured in bits. If it's base \( e \), it's nats. You can always convert from one to another, so it's the same idea either way. Let's choose to measure entropy in bits, so that each piece of information is either a \( 1 \) or a \( 0 \).
Picture a particle in a square box. Here's how you might go about mapping its position information to a string of ones and zeros. You start by asking whether the particle is in the left half or the right half of the box. If it's on the left, you start the string with a \( 0 \), otherwise a \( 1 \). This gives you one bit of information. So if the particle is on the right, we can say the position is
$$ \text{state} = 1... $$
where the dots indicate missing information. So it's on the right. Now we ask whether it's in the upper right or the lower right. If it's in the lower right we add a zero, and if it's in the upper right, we add a one.
$$ \text{state} = 10... $$
Now we're back to putting the particle in a square box and we can repeat the process as many times as we like.
$$ \text{state} = 10111001101... $$
If we define the entropy to be "the information necessary to know the particle's exact position" then the entropy is still infinite (that's the trailing ellipsis.) It takes an infinite amount of information to know the position to perfect precision, so the "missing" information to specify that position will always be infinite.
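Here's roughly what that halving procedure looks like in code (a sketch only; the unit square and the x-then-y bit ordering are my own conventions):

```python
def encode_position(x, y, n_bits):
    """Encode a point in the unit square by repeated halving.

    Even bits split left/right in x (0 = left, 1 = right); odd bits split
    lower/upper in y (0 = lower, 1 = upper), as described in the text above.
    """
    bits = []
    for i in range(n_bits):
        if i % 2 == 0:                       # halve in x
            bits.append("1" if x >= 0.5 else "0")
            x = 2 * x - 1 if x >= 0.5 else 2 * x
        else:                                # halve in y
            bits.append("1" if y >= 0.5 else "0")
            y = 2 * y - 1 if y >= 0.5 else 2 * y
    return "".join(bits)

# 12 bits pins the particle to a box 2**12 times smaller than the original.
print(encode_position(0.7, 0.3, 12))
```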
You can think about adding a membrane along the axis of division, so that the particle's position gets further and further constrained every time you add a one or a zero to the state function. Every time you measure which half of the space the particle is in, you do the measurement reversibly, so that you don't increase the entropy.
By the end of all of this, you'll have \( n \) bits of information describing the particle's state, and you'll have it in a tiny box that's \( 2^n \) times smaller than before. As
$$ pV = kT $$
for this one-particle ideal gas, obviously, the pressure will be extremely high if V is small, even at this fixed temperature. We can take advantage of this high pressure by letting the particle do work. By taking the part of the box that the particle is in, translating it to the edge of the heat bath, and isothermally expanding it again, you can recover work. If we allow the particle to do work on the membrane walls, its entropy will increase by \( n \ln(2) \), i.e. by \( n \) bits. The end result is that the entropy has increased (we've lost all the information about the position of the particle) but we've done useful work. The amount of work we did was
$$ W = kT n \ln(2) $$
To recap: we reversibly measured the position of the particle in the box down to a string of \( n \) bits. We put it in a tiny box of volume \( \omega = \Omega / 2^n \), where \( \Omega \) here is the volume of the original box. Then we put that tiny box in contact with a reservoir and allowed it to isothermally expand. We recovered work equal to \( n k T \ln (2) \).
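Plugging in numbers (a back-of-the-envelope sketch; the temperature and bit count are arbitrary example values):

```python
import math

k_B = 1.380649e-23          # Boltzmann constant, J/K
T = 300.0                   # temperature of the heat bath, K (example value)
n = 10                      # bits of position information we measured (example)

W = n * k_B * T * math.log(2)   # work recovered from the isothermal expansion
print(f"W = {W:.2e} J")         # about 2.9e-20 J for 10 bits at 300 K
```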
In the ideal case -- the exact, fine-grained description -- the number of microstates associated with the system is just one, so the entropy is zero. It stays zero for all time, because the system's state can always be inferred from the initial state plus the dynamics.
Put a few dozen racers on a starting line, and tell them to start running at a constant speed. After a while, they'll be mixed up all over the track. But if you measured their speeds accurately enough, then by blowing the whistle and telling them to turn around, eventually, they'll all end up lined up at the starting line, just as they started. (This is analogous to spin-echo experiments.) It's also like mixing two high viscosity fluids -- you can always unmix them, just by stirring the other way. The number of microstates appears to increase, and the system appears to be more chaotic, but this is just an illusion.
The important point here is that there's an observer watching this system with perfect information about the starting conditions. But what makes entropy "entropic" is that the imperfect information about the starting conditions makes it very hard to reverse course and bring the system back to its initial state. The very best you can do is bring it back to one of the states in the same slice of phase space as the initial point.
The next question I want to address is whether the Clausius entropy can be related to the informational entropy. The Clausius relation says that "heating" a system at constant temperature increases the number of microstates associated with that system, while keeping the temperature constant. An example of this is heating the (even single-particle) ideal gas at constant temperature, increasing the volume.
Newton's laws are time-symmetric, but thermodynamics is not. The solutions that have been proposed to resolve this issue involve coarse graining to remove information. The part that doesn't sit well with me is: coarse graining only removes my knowledge of the system. How is what I think at all relevant to the arrow of time? Entropy should increase regardless of what I think or know.
Let's frame the problem a la Shannon. The problem of increasing entropy is the same as saying the amount of "missing information", the information needed to determine the exact state of a system, should increase over time.
But suppose you had perfect information about a system. Then you could make the entropy get lower over time. This is Maxwell's demon.
Consider the humble light switch. It can be OFF or ON. It takes exactly one bit of information to completely specify the state of the switch. That amount of information is called the entropy, \( S\). If there are \(n\) switches, then \(S\) equals \(n\) bits. Simple enough.
Let's change the situation a little. You have the same \(n\) switches, but now the first switch is frozen in the ON position. How much more information do you need to completely specify the state of the system (in other words, what's \( S\))? It's now \( n-1 \) bits. What if someone installed a rod connecting all the switches, so that they all move together? Then you're back to only needing one bit.
In general, the number of bits needed to specify a system with \( N \) equally likely configurations is
$$ S = \log_2(N) $$
where \(N\) is the number of configurations. When there are \(n\) ON/OFF switches, \(N = 2^n\), and \(\log_2(2^n) = n\), so everything works out nicely.
\( S \) is the missing information about a system. Whether we know a little bit about a system, or a lot, we know that with \(S \) more bits, we could know the exact state that the system is currently in. When \( S = 0\), we've specified everything about the system, i.e. there's no missing information.
What stands out about this, and something Gibbs noticed back in the stone ages of physics (the 1800s) is that how much information is "missing" can depend on how much information you assume is there. Here's what I mean:
Let's take your system to be a sequence of \(n\) US quarters. Say you know in advance that \(h\) are heads and the rest are tails. Any given configuration looks like one of
$$ \text{HHTHTTTTHTHTT...} $$
How many of these are there? You can arrange the \(n\) coins \(n!\) different ways, but since swapping a heads with another heads (or tails with tails) doesn't change the configuration, you divide by \(h!\) and \((n-h)!\). The total number of configurations is \( n!/(h!(n-h)!)\), and so the missing information \(S \) is
$$ S = \log\left(\frac{n!}{h!(n-h)!}\right) $$
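As a quick worked example (the numbers are arbitrary), for \( n = 10 \) coins with \( h = 4 \) heads:

```python
import math

n, h = 10, 4                  # example: 10 coins, 4 of them heads
configs = math.comb(n, h)     # n! / (h! (n-h)!) = 210
S = math.log2(configs)        # missing information, in bits
print(configs, round(S, 2))   # 210 configurations, about 7.71 bits
```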
Then someone comes along and points something out. They're looking at your coins, and they notice that one says 1964, another says 1997... so there are actually more unique configurations than you initially thought. "Not every coin is identical," they say, "and so if I swap the 1964 heads with the 1997 heads, I'll get a different configuration. The number of configurations is actually higher." They claim that a sequence looks more like

$$ \text{H}_{1964}\ \text{H}_{1997}\ \text{T}\ \text{H}\ \text{T}\ \text{T} \dots $$
Then a metallurgist comes along and says, "It's worse than that. I've measured each of these coins, and each one has a distinct radioactive signature. In fact, all the coins are unique!" The metallurgist claims
$$ S = \log n! $$
Who is right? And then what if someone else comes along and says, "Well actually, each of the coins is slightly rotated off axis... and each small rotation of the coin should count as a different state. So there's actually infinitely many states the coins could be in." Then what? That's actually true. The concept of missing information feels doomed.
This is something I was wondering about. Is it always true that you can keep abstracting more and more? That there's always infinite missing information in a system? Or is there a kind of limit for a closed system where once you start talking about it at a certain level of detail, you've described all the information?
So this is what I want to think about regarding the concept of entropy. There are two questions:
- The classical case: why is it that entropy seems to depend on how much we know about a system?
- The quantum case: is there a floor for how much entropy there is in the world, a limit to how much we can know about a system? And if there is, is the amount of "missing information" something we can measure?
Relative missing information
Let's generalize the concept of missing information, \( S\), a little.
You may not need an integer number of bits to specify the state of a system. Take a deck of cards, and draw one card from it. This card is now your system. This card has 52 possible "states" it can be in. That means it takes
$$ S = \log_2 (52) \approx 5.7 \ \text{bits} $$
to specify the system (card's) state.
There's the missing information of the number of the card (Ace, Two, Jack) and the suit. We can see how those break down:
$$ S = \log_2 (52) = \log_2(4 \cdot 13) = \log_2(4) + \log_2(13) $$
so we have \( \log_2(4) = 2 \) bits of information associated with the suit, and \( \log_2(13) \approx 3.7 \) bits associated with the number, for a total of 5.7 bits. Missing information works this way. The missing information of the whole is just the sum of the missing information of the parts.
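A two-line check of that additivity (just the arithmetic, in Python):

```python
import math

print(math.log2(52))                  # 5.70... bits for the whole card
print(math.log2(4) + math.log2(13))   # 2 bits (suit) + 3.70... bits (number): same total
```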
So this allows us to generalize. If we have discrete pieces of missing information \( S_i \), then the total missing information is the sum of all of them:
$$ S = S_1 + S_2 + S_3 + \dots $$
The problem is, you can reasonably argue that for most systems, this sum is infinite. But if we have two similar systems, we can compare the differences in their missing information by subtracting them.
For our card example, let's call \( S_1 \) the missing information associated with the suit, and \( S_2 \) the missing information associated with the number. Suppose \( S_3 \) and onward measure the missing information associated with stuff we don't care about, like the orientation of the card when it's placed on the table.
Now suppose I did know the orientation of the card on the table, and all the rest of that stuff. Call the missing information associated with all of it \( S^\prime = S_3 + S_4 + \dots \). This means
$$ S - S^\prime = S_1 + S_2 $$
which is what we wanted: the missing information of \( S\) relative to \(S^\prime \). If there are \(n \) states associated with \(S\) and \( m \) states associated with \( S^\prime \), then this is the same as defining the relative missing information as

$$ S - S^\prime = \log n - \log m = \log \left( \frac{n}{m} \right) $$

This is called the entropy of \( S \) relative to \( S^\prime \), or: how much more information is missing from \( S\) than from \( S^\prime \).
Probabilities
So far we've talked a lot about the missing information for a system with equally-likely states. Let's try to generalize this.
In our original two-state system, the switch, we said
$$ S = \log_2(2) $$
which we can suggestively rewrite as
$$ S = -\frac{1}{2}\log_2\left(\frac{1}{2}\right) -\frac{1}{2}\log_2\left(\frac{1}{2}\right). $$
This is still one bit, just like before, but this expression breaks up the (missing) information into two parts, each worth 0.5 bits. Each half-bit represents the information associated with one of the states the switch can be in: the fact that the switch can be in the ON state contributes half a bit, as does the OFF state. In terms of the probabilities of each state,
$$ S = -p_\text{ON}\log_2(p_\text{ON})-p_\text{OFF}\log_2(p_\text{OFF}) $$
This reduces to our original "missing information" formula, but it's more general.
For example, let's suppose that in a given room with two light switches, at any given time, it's 1/2 likely that both switches are ON, 1/6 likely that only the first switch is OFF, 1/6 likely that only the second switch is OFF, and 1/6 likely that both are OFF.
This is all you know about the room. How much missing information is there? How much information do we need to completely specify the system?
How do we even reason about this?
First what's important is to recognize that we have four distinct states: ON/ON, ON/OFF, OFF/ON, and OFF/OFF, and that they have different probabilities attached to them.
What are the qualities of missing information? We want to preserve the same feature as before: if we have two separate systems, the missing information adds,
$$ S(X, Y) = S(X) + S(Y) $$
if X and Y are independent. This points to the logarithm, just as before. Similarly, if something happens with probability one, we have complete information about it, and \( \log(1) = 0 \), so there's no missing information. Again, the logarithm.
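Coming back to the room with the two switches, the \( p \log p \) formula above answers the question directly; here's the quick check (probabilities as stated):

```python
import math

# ON/ON, ON/OFF, OFF/ON, OFF/OFF
probs = [1/2, 1/6, 1/6, 1/6]

S = -sum(p * math.log2(p) for p in probs)
print(S)   # about 1.79 bits, less than the 2 bits of two unconstrained switches
```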
Let's suppose that we've drawn a card, but this time we're told that the card as a \( 4/5 \) probability of being a heart, and only \( 1/5 \) probability of being some other suit (diamonds, spades, or clubs.)
Intuition says that there should be less missing information. After all, if we knew for sure that the card was a heart, then we'd have eliminated all the missing information about the suit. So knowing more about the suit must decrease the missing information.
With no prior information the missing information associated with a suit is \( S = 2 \) bits of information.
If the card is indeed a heart (probability 4/5), the remaining missing information is \( \log_2 13 \approx 3.7 \) bits; if it isn't (probability 1/5), the card is one of the other 39, and the remaining missing information is \( \log_2 39 \approx 5.3 \) bits. Weighting each case by its probability, the expected missing information is

$$ S = \frac{4}{5} \log_2 13 + \frac{1}{5} \log_2 39 \approx 4.0 \ \text{bits}. $$

Adding back the roughly \( 0.7 \) bits needed to say whether the card is a heart at all, the total missing information is about \( 4.7 \) bits, which is indeed less than the \( \log_2 52 \approx 5.7 \) bits we'd have with no prior information. (For comparison, \( \log_2 26 \approx 4.7 \) bits is also what would be missing if all we knew was the card's color.)

This is the general pattern: weight the missing information of each state by the probability of that state. If all \( n \) states of a system are equally likely, then each one has a probability of occurring of \( p = 1/n \), every term in the sum is the same, and we recover

$$ S = - \log (p) = \log(n). $$
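The same check works for the biased deck (a sketch; the per-card probabilities just spread the 4/5-hearts prior evenly within each group):

```python
import math

# 13 hearts share probability 4/5; the other 39 cards share probability 1/5
probs = [4/5 / 13] * 13 + [1/5 / 39] * 39

S = -sum(p * math.log2(p) for p in probs)
print(S)   # about 4.7 bits, compared with log2(52) ≈ 5.7 with no prior
```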
Adam G. (as usual) pointed me to an interesting problem in the definition of entropy. Let's say you have some discrete probability distribution \( \{p_i\} \). The entropy \( S \) of that distribution is
$$ S(p) = - \sum p_i \log p_i. $$
Consider the case where your discrete \( \{p_i\} \) come from a continuous distribution \( p(x) \) that you approximate by binning, so that \( p_i \approx p(x_i) \Delta x \).
$$ S(p) \approx -\sum p(x_i) \log (p(x_i) \Delta x) \Delta x $$
The question is: is it meaningful to talk about \( S(p) \), when \( p \) is continuous?
This is a more subtle question than I thought it would be. In physics, we're used to being able to compute integrals with a Riemann sum. We take the bin spacing to be \( \Delta x \), and make that bin spacing smaller and smaller until we're happy with the result.
If you could do that with \( S \), then the entropy of \( p \) could be viewed as the limit of taking a finer and finer mesh. Alas, passing to the continuum limit by taking \( \Delta x \rightarrow 0 \) increases the entropy until eventually it becomes infinite. The entropy is sensitive to the choice of the bin size \( \Delta x \).
You could have seen this right away. With large bins, lots of observations get lumped into the same bin and can't be distinguished from each other. Conversely, if you make the bins smaller, you can distinguish more events. With small enough bins, you can distinguish all of them. This ability to distinguish events increases the entropy you measure. This is why you have to coarse-grain the entropy for an ideal gas, a la Sackur-Tetrode.
We can get a sense for how quickly the thing diverges by separating out the divergent term. Let's do this carefully to make sure that we don't run into any problems with units, since it absolutely must be the case that whatever is inside the log is dimensionless. Since \( p(x) \) must have inverse units of \( x \) (call those units \( \left[ x \right] \), which could be meters, inches, pounds, whatever), we can't separate the log of the product into a sum of logs without first addressing that. What we'll do is say
$$ \log(p(x_i) \Delta x) = \log \left( p(x_i) \Delta x \cdot \frac{ 1 \left[ x \right]}{1 \left[ x \right]} \right) $$
before we separate out the log, so that we get something like
$$ \log(p(x_i) \Delta x) = \log \left( p(x_i) \cdot 1 \left[ x \right] \right) + \log \left( \frac{\Delta x} { 1 \left[ x \right]} \right) $$
so that everything inside the log is unitless. I'll use \( p(x_i) ^\star \) to mean the unitless sibling of \( p(x_i) \), and same for \( \Delta x ^\star \). Then
$$ S \approx -\sum_i p(x_i) \log \left( p(x_i)^\star \right) \Delta x - \log \Delta x^\star $$
The first term can be written as an integral,

$$ -\int p(x) \log \left( p(x)^\star \right) dx. $$
That doesn't look too bad. No matter what binning you use, the sum limits to this integral if you take \( \Delta x \) small enough. It's tempting to call this the "continuous entropy" and walk away. Do that, though, and you end up with an entropy with some properties that aren't so nice. It can be negative, for one, which seems odd. It also loses its invariance under coordinate transformations, as in, it's literally equal to something else if you choose to use inches instead of meters in your probability distribution.
How does the invariance get lost, exactly? For any change of variables \( y(x),\) the part \( p(x_i) \Delta x = p(y_i) \Delta y\), so that part is invariant. The loss of invariance comes from separating out the \( \log \Delta x^\star \) part. Include that, and the invariance comes back. Of course, if you include it, you also get infinite entropy. There's no way to pass to the continuum limit and keep all the properties you'd like an entropy to have.
If you choose \( \Delta x = 2^{-n}\) and take \( n \rightarrow \infty \), then \( -\log_2 \Delta x^\star \) is just \( n \), the number of bits you've used to specify each bin. In other words, the "bintropy." With infinite precision, the bintropy is infinite. From a physical point of view, the bintropy tells you about the apparatus you use to measure \( p(x) \), but nothing at all about the shape of \( p(x) \) itself. Yet you need to keep it in for the entropy to have the nice properties you'd like it to have.
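A small sketch of that last claim (the two test distributions are arbitrary choices): with \( \Delta x = 2^{-n} \), the bintropy is exactly \( n \) bits no matter what \( p(x) \) looks like, while only the shape term cares about the distribution.

```python
import numpy as np

def split_entropy(pdf, a, b, n):
    """Return (shape term, bintropy) in bits for bin width 2**-n."""
    dx = 2.0 ** -n
    x = np.arange(a, b, dx) + dx / 2
    p = pdf(x)
    p = p[p > 0]
    shape = -np.sum(p * np.log2(p)) * dx    # depends on the shape of p(x)
    bintropy = -np.log2(dx)                 # = n, independent of p(x)
    return round(float(shape), 3), bintropy

gaussian   = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
triangular = lambda x: np.clip(1 - np.abs(x), 0, None)

for n in (4, 8, 12):
    print(n, split_entropy(gaussian, -8, 8, n), split_entropy(triangular, -8, 8, n))
# The bintropy column reads 4, 8, 12 bits for both distributions;
# only the shape term differs between them.
```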
Thinking about how to get around this, the divergence comes from the "naked \(p_i \)" in the log. If we could consider ratios of probabilities instead, the problem would go away.
You can consider another quantity called the "mutual information". The mutual information is the expected value of the log of the ratio of two quantities:
- \( P(x, y) \), the combined probability of drawing x and y, and
- the product \( P(x)P(y) \), which is the probability of drawing x and y if they were independent.
In symbols,
$$ I(X;Y) = \sum_{x \in X} \sum_{y \in Y} P(x, y) \ln \frac{P(x, y)}{P(x)P(y)} $$
This isn't sensitive to a choice of binning; passing to the continuum limit gives
$$ I(X;Y) = \int \int P(x, y) \ln \frac{P(x, y)}{P(x)P(y)} dx dy $$
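Here's a sketch of that binning-insensitivity (the bivariate Gaussian with correlation 0.8 and the integration grid are my own example choices). The Riemann-sum estimate settles near the analytic value \( -\tfrac{1}{2}\ln(1-\rho^2) \) rather than running off to infinity.

```python
import numpy as np

rho = 0.8   # correlation of the example bivariate Gaussian

def joint(x, y):
    """Standard bivariate normal density with correlation rho."""
    z = (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
    return np.exp(-z / 2) / (2 * np.pi * np.sqrt(1 - rho**2))

marginal = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

def mutual_information(n_bins, lim=6.0):
    x, dx = np.linspace(-lim, lim, n_bins, endpoint=False, retstep=True)
    x = x + dx / 2
    X, Y = np.meshgrid(x, x)
    P = joint(X, Y)
    return np.sum(P * np.log(P / (marginal(X) * marginal(Y)))) * dx * dx

for n_bins in (20, 50, 200, 500):
    print(n_bins, mutual_information(n_bins))
# All close to -0.5 * log(1 - rho**2) ≈ 0.511 nats, whatever binning you choose.
```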
Entropy as blur
- A microscopic physical system has no increase in entropy, ever
- If you don't know the initial conditions exactly, there's a little blur that accumulates over time
- Landauer says that to erase a bit you need a logically irreversible process, and the only way you can have an irreversible bit change is if you lose energy to dissipation (heat)
- PROVE: he says the minimum heat loss is \( kT\ln(2) \) per erased bit
Landauer's argument
Consider a statistical ensemble of bits in thermal equilibrium. If these are all reset to ONE, the number of states covered in the ensemble has been cut in half. The entropy therefore has been reduced by k ln(2) = 0.6931 k per bit. The entropy of a closed system, e.g., a computer with its own batteries, cannot decrease; hence this entropy must appear elsewhere as a heating effect, supplying 0.6931 kT per restored bit to the surroundings. This is, of course, a minimum heating effect, and our method of reasoning gives no guarantee that this minimum is in fact achievable.
The "set to one" operation clearly deletes information. If you take a bit that can be in either zero or one (thermally) and set it to one, you lower the entropy of the computer by a half a bit. Since a computer cannot lower its entropy, there must be an increase in entropy elsewhere, in the form of heat, equal to half a bit times T.
Question
What if you KNOW a bit is in the one position, but you want to erase it by putting it in either the one or the zero position at random? Isn't this also information erasure? How would the argument work here?
In other words, you want to erase good information by setting it equal to random (unknown) information. Blur everything.
The initial entropy is zero (we know the state of the bit).

The final entropy, for \( N \) such bits, is \( N k \ln(2) \).

So the entropy change per bit is \( k\ln(2) \). It takes energy to jiggle the bits to increase their entropy:

$$ dU = T\,dS = kT\ln(2) $$
Question
What about the Clausius definition of entropy applied to a single computer bit?
$$ dU = T\,dS - P\,dV = 0 $$

For a reversible process, then, the entropy change of a closed system is zero. More generally,

$$ dS \geq \frac{dQ}{T} \quad \Rightarrow \quad T\,dS \geq dQ $$

for an irreversible process.
Erasing a bit of information adds \( k \ln(2) \) of entropy, i.e. one bit: \( dS = k\ln(2) \). Then

$$ kT\ln(2) \geq dQ, \qquad \text{i.e.} \qquad dQ \leq kT\ln(2). $$

This is for an additional \( dQ \) added to the system from outside.