Commit 4eff252

Update 287-find-the-duplicate-number.js
1 parent 86a93f8 commit 4eff252

287-find-the-duplicate-number.js

Lines changed: 192 additions & 0 deletions
@@ -59,3 +59,195 @@ const findDuplicate = function(nums) {
}
}
};


/**

# File: FindDuplicate.py
# Author: Keith Schwarz ([email protected])
#
# An algorithm for solving the following (classic) hard interview problem:
#
# "You are given an array of integers of length n, where each element ranges
# from 0 to n - 2, inclusive. Prove that at least one duplicate element must
# exist, and give an O(n)-time, O(1)-space algorithm for finding some
# duplicated element. You must not modify the array elements during this
# process."
#
# This problem (reportedly) took CS legend Don Knuth twenty-four hours to solve
# and I have only met one person (Keith Amling) who could solve it in less time
# than this.
#
# The first part of this problem - proving that at least one duplicate element
# must exist - is a straightforward application of the pigeonhole principle.
# If the values range from 0 to n - 2, inclusive, then there are only n - 1
# different values. If we have an array of n elements, at least one value must
# necessarily be duplicated.
#
# The second part of this problem - finding the duplicated element subject to
# the given constraints - is much harder. To solve this, we're going to need a
# series of nonobvious insights that transform the problem into an instance of
# something entirely different.
#
# The main trick we need to use to solve this problem is to notice that because
# we have an array of n elements ranging from 0 to n - 2, we can think of the
# array as defining a function f from the set {0, 1, ..., n - 1} into itself.
# This function is defined by f(i) = A[i]. Given this setup, a duplicated
# value corresponds to a pair of indices i != j such that f(i) = f(j). Our
# challenge, therefore, is to find this pair (i, j). Once we have it, we can
# easily find the duplicated value by just picking f(i) = A[i].
#
# But how are we to find this repeated value? It turns out that this is a
# well-studied problem in computer science called cycle detection. The general
# form of the problem is as follows. We are given a function f. Define the
# sequence x_i as
#
#    x_0     = k       (for some k)
#    x_1     = f(x_0)
#    x_2     = f(f(x_0))
#    ...
#    x_{n+1} = f(x_n)
#
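# As a minimal sketch of this definition (the helper name iterate_sequence and
# the sample array are illustrative assumptions, not part of the original
# file), the sequence can be generated for the array-backed function
# f(i) = A[i] like this:
#
```python
def iterate_sequence(f, k, steps):
    """Return [x_0, x_1, ..., x_steps] where x_0 = k and x_{i+1} = f(x_i)."""
    xs = [k]
    for _ in range(steps):
        xs.append(f(xs[-1]))
    return xs


# Hypothetical sample input: n = 5 elements, each value in 0 .. n - 2 = 0 .. 3.
A = [1, 3, 0, 1, 2]
f = lambda i: A[i]

# Start at the last index (n - 1 = 4) and keep following the stored indices.
sequence = iterate_sequence(f, len(A) - 1, 6)
print(sequence)
```
#
# For this sample the walk leaves index 4, passes through 2 and 0, and then
# alternates between 1 and 3, so the value 1 (stored at indices 0 and 3) is the
# duplicate.
#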
# Assuming that f maps from a domain into itself, this sequence will have one
# of three forms. First, if the domain is infinite, then the sequence could be
# infinitely long and nonrepeating. For example, the function f(n) = n + 1 on
# the integers has this property - no number is ever duplicated. Second, the
# sequence could be a closed loop, which means that there is some i > 0 so that
# x_0 = x_i. In this case, the sequence cycles through some fixed set of
# values indefinitely. Finally, the sequence could be "rho-shaped." In this
# case, the sequence looks something like this:
#
#     x_0 -> x_1 -> ... x_k -> x_{k+1} ... -> x_{k+j}
#                        ^                       |
#                        |                       |
#                        +-----------------------+
#
# That is, the sequence begins with a chain of elements that enters a cycle,
# then cycles around indefinitely. We'll denote the first element of the cycle
# that is reached in the sequence the "entry" of the cycle.
#
# For our particular problem of finding a duplicated element in the array,
# consider the sequence formed by starting at position n - 1 and then
# repeatedly applying f. That is, we start at the last position in the array,
# then go to the indicated index, repeating this process. My claim is that
# this sequence is rho-shaped. To see this, note that it must contain a cycle
# because the array is finite and after visiting n elements, we necessarily
# must visit some element twice. This is true no matter where we start off in
# the array. Moreover, note that since the array elements range from 0 to
# n - 2 inclusive, there is no array index that contains n - 1 as a value.
# Consequently, when we leave index n - 1 after applying the function f one
# time, we can never get back there. This means that n - 1 can't be part of a
# cycle, but if we follow indices starting there we must eventually hit some
# other node twice. The concatenation of the chain starting at n - 1 with the
# cycle it hits must be rho-shaped.
#
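# To make the rho shape concrete, here is a small sketch that measures the
# chain length and the cycle length by brute force. The helper name
# chain_and_cycle_lengths and the sample array are assumptions made for this
# illustration, and the helper deliberately uses O(n) extra memory, so it is
# not the constant-space algorithm this file is about.
#
```python
def chain_and_cycle_lengths(array):
    """Follow indices from n - 1 and return (chain length, cycle length)."""
    first_seen = {}              # index -> step at which it was first visited
    position = len(array) - 1
    step = 0
    while position not in first_seen:
        first_seen[position] = step
        position = array[position]
        step += 1
    chain = first_seen[position]  # steps taken before entering the cycle
    cycle = step - chain          # steps needed to come back to the entry
    return chain, cycle


A = [1, 3, 0, 1, 2]   # hypothetical sample: n = 5, values in 0 .. 3
chain, cycle = chain_and_cycle_lengths(A)
print(chain, cycle)
```
#
# Because index n - 1 can never be revisited, the chain length reported for a
# valid input is always at least 1.
#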
# Moreover, think about the node at the entry of the cycle. Since this node is
# where the chain joins the cycle, it is reached from two different indices:
# the last index of the chain and the last index of the cycle. In other words,
# there are two inputs to the function f that both produce that index. For
# this to be possible, it must be that there are indices i != j with
# f(i) = f(j), meaning that A[i] = A[j]. Thus the index of the entry of the
# cycle must be one of the values that is duplicated in the array.
#
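# A quick way to see this on a concrete input is to locate the entry with a
# visited set and then list every index that maps to it. This is only a
# sketch: cycle_entry_with_memory is a hypothetical helper that spends O(n)
# memory, which the real problem forbids.
#
```python
def cycle_entry_with_memory(array):
    """Return the first index visited twice when walking from index n - 1."""
    seen = set()
    position = len(array) - 1
    while position not in seen:
        seen.add(position)
        position = array[position]
    return position


A = [1, 3, 0, 1, 2]   # hypothetical sample: n = 5, values in 0 .. 3
entry = cycle_entry_with_memory(A)

# Every index i with A[i] == entry is an input of f that produces the entry.
preimages = [i for i, value in enumerate(A) if value == entry]
print(entry, preimages)
```
#
# In this sample one preimage (index 0) lies on the chain and the other
# (index 3) lies on the cycle, which is exactly the pair i != j described
# above.
#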
# There is a famous algorithm due to Robert Floyd that, given a rho-shaped
# sequence, finds the entry point of the cycle in linear time and using only
# constant space. This algorithm is often referred to as the "tortoise and
# hare" algorithm, for reasons that will become clearer shortly.
#
# The idea behind the algorithm is to define a few quantities. First, let c be
# the length of the chain that enters the cycle, and let l be the length of the
# cycle. Next, let l' be the smallest positive multiple of l that is at least
# c. I claim that for any rho-shaped sequence, with l' defined as above,
#
#    x_{l'} = x_{2l'}
#
# The proof is actually straightforward and very illustrative - it's one of my
# favorite proofs in computer science. The idea is that since l' is at least
# c, x_{l'} must be contained in the cycle. Moreover, since l' is a multiple
# of the length of the loop, we can write it as ml for some integer m. If we
# start at position x_{l'}, which is inside the loop, then take l' more steps
# forward to get to x_{2l'}, then we will just walk around the loop m times,
# ending up right back where we started.
#
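# The claim can be checked numerically on a small input. In the sketch below
# the helper nth_term, the sample array, and its hand-computed values c = 3
# and l = 2 are all assumptions made for illustration; l' is taken to be the
# smallest multiple of l that is no smaller than c.
#
```python
def nth_term(array, steps):
    """Return x_steps, where x_0 = n - 1 and x_{i+1} = array[x_i]."""
    position = len(array) - 1
    for _ in range(steps):
        position = array[position]
    return position


A = [1, 3, 0, 1, 2]   # hypothetical sample: walk is 4, 2, 0, 1, 3, 1, 3, ...
c, l = 3, 2           # chain length and cycle length, worked out by hand

# Smallest multiple of l that is at least c, via ceiling division.
l_prime = l * -(-c // l)

at_l_prime = nth_term(A, l_prime)
at_twice_l_prime = nth_term(A, 2 * l_prime)
print(l_prime, at_l_prime, at_twice_l_prime)
```
#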
# One key trick of Floyd's algorithm is that even if we don't explicitly know l
# or c, we can still find the value l' in O(l') time. The idea is as follows.
# We begin by keeping track of two values "slow" and "fast," both starting at
# x_0. We then iteratively compute
#
#    slow = f(slow)
#    fast = f(f(fast))
#
# We repeat this process until we find that slow and fast are equal to one
# another. When this happens, we know that slow = x_j for some j, and
# fast = x_{2j} for that same j. Since x_j = x_{2j}, we know that j must be at
# least c, since it has to be contained in the cycle. Moreover, we know that j
# must be a multiple of l, since the fact that x_j = x_{2j} means that taking j
# steps while in the cycle ends up producing the same result. Finally, j must
# be the smallest multiple of l that is at least c, since if there were a
# smaller such multiple then we would have reached that multiple before we
# reached j. Consequently, we must have that j = l', meaning that we can
# find l' without knowing anything about the length or shape of the cycle!
#
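# The loop just described can be sketched on its own as follows. The function
# name meeting_step and the sample arrays are hypothetical; the function only
# reports the step count j at which the two pointers first coincide.
#
```python
def meeting_step(array):
    """Return the first j >= 1 with x_j == x_{2j}, starting from x_0 = n - 1."""
    slow = fast = len(array) - 1
    j = 0
    while True:
        slow = array[slow]          # slow advances one step: x_j
        fast = array[array[fast]]   # fast advances two steps: x_{2j}
        j += 1
        if slow == fast:
            return j


# Hypothetical samples, with chain length c and cycle length l found by hand.
print(meeting_step([1, 3, 0, 1, 2]))   # c = 3, l = 2
print(meeting_step([1, 2, 1, 0]))      # c = 2, l = 2
```
#
# In the second sample c is itself a multiple of l, and the pointers meet
# after exactly c steps, so the meeting step can equal c rather than exceed it.
#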
# To complete the construction, we need to show how to use our information
# about l' to find the entry to the cycle (which is at position x_c). To do
# this, we start off one final variable, which we call "finder," at x_0. We
# then iteratively repeat the following:
#
#    finder = f(finder)
#    slow   = f(slow)
#
# until finder = slow. We claim that (1) the two will eventually hit each
# other, and (2) they will hit each other at the entry to the cycle. To see
# this, we remark that since slow is at position x_{l'}, if we take c steps
# forward, then we have that slow will be at position x_{l' + c}. Since l' is
# a multiple of the loop length, this is equivalent to taking c steps forward,
# then walking around the loop some number of times back to where you started.
# In other words, x_{l' + c} = x_c. Moreover, consider the position of the
# finder variable after c steps. It starts at x_0, so after c steps it will be
# at position x_c. Before that point finder is still on the chain while slow
# is inside the cycle, so the two cannot meet any earlier. This proves both
# (1) and (2), since we've shown that the two must eventually hit each other,
# and when they do they hit at position x_c at the entry to the cycle.
#
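# Both phases also work for any function whose iterates eventually repeat, not
# just for array lookups. The sketch below is one possible generic version;
# the name floyd_cycle_entry and the example function are assumptions made for
# illustration. Unlike the description above, its second loop compares before
# stepping, which additionally covers the case c = 0.
#
```python
def floyd_cycle_entry(f, x0):
    """Return (c, entry): the chain length and the first element of the cycle.

    Assumes that repeatedly applying f from x0 eventually repeats a value.
    """
    # Phase 1: advance until slow = x_j and fast = x_{2j} coincide.
    slow, fast = f(x0), f(f(x0))
    while slow != fast:
        slow = f(slow)
        fast = f(f(fast))

    # Phase 2: walk a fresh pointer from x_0 in lockstep with slow.
    finder = x0
    c = 0
    while finder != slow:
        finder = f(finder)
        slow = f(slow)
        c += 1
    return c, finder


# Hypothetical example function on {0, ..., 9}; not taken from the original.
g = lambda x: (x * x + 1) % 10
print(floyd_cycle_entry(g, 3))
```
#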
# The beauty of this algorithm is that it uses only O(1) external memory to
# keep track of two different pointers - the slow pointer, and then the fast
# pointer (for the first half) and the finder pointer (for the second half).
# But on top of that, it runs in O(n) time. To see this, note that the time
# required for the slow pointer to hit the fast pointer is O(l'). Since l' is
# the smallest multiple of l that is at least c, we have two cases to consider.
# First, if l >= c, then this is l. Otherwise, if l < c, then we have that
# there must be some multiple of l between c and 2c. To see this, note that
# in the range from c to 2c there are c different values, and since l < c at
# least one of them must be equal to 0 mod l. Finally, the time required to
# find the start of the cycle from this point is O(c). This gives a total
# runtime of at most O(c + max{l, 2c}). All of these values are at most n, so
# this algorithm runs in time O(n).
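#
# As a rough empirical check of the linear bound, the sketch below counts the
# loop iterations of both phases over every valid input of length 2 to 6. The
# helper count_steps and the bound 2 * n used in the assertion are assumptions
# made for this illustration, not constants taken from the analysis above.
#
```python
from itertools import product


def count_steps(array):
    """Run both phases from index n - 1; return (phase 1 steps, phase 2 steps)."""
    start = len(array) - 1
    slow = fast = start
    phase1 = 0
    while True:
        slow = array[slow]
        fast = array[array[fast]]
        phase1 += 1
        if slow == fast:
            break
    finder = start
    phase2 = 0
    while True:
        slow = array[slow]
        finder = array[finder]
        phase2 += 1
        if slow == finder:
            break
    return phase1, phase2


# Exhaustively try every array of length n whose values lie in 0 .. n - 2.
worst_ratio = 0.0
for n in range(2, 7):
    for candidate in product(range(n - 1), repeat=n):
        p1, p2 = count_steps(list(candidate))
        assert p1 + p2 <= 2 * n   # total iterations stay linear in n
        worst_ratio = max(worst_ratio, (p1 + p2) / n)
print(worst_ratio)
```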

def findArrayDuplicate(array):
    # A valid input has n >= 2 elements, since values must lie in 0 .. n - 2.
    assert len(array) > 1

    # The "tortoise and hare" step.  We start at the end of the array and try
    # to find an intersection point in the cycle.
    slow = len(array) - 1
    fast = len(array) - 1

    # Keep advancing 'slow' by one step and 'fast' by two steps until they
    # meet inside the loop.
    while True:
        slow = array[slow]
        fast = array[array[fast]]

        if slow == fast:
            break

    # Start up another pointer from the end of the array and march it forward
    # until it hits the pointer inside the cycle.
    finder = len(array) - 1
    while True:
        slow = array[slow]
        finder = array[finder]

        # If the two hit, the intersection index is the duplicate element.
        if slow == finder:
            return slow


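# A possible way to gain confidence in the routine is to compare it against a
# simple counting check on every small input. So that the snippet runs on its
# own, the sketch restates the two-phase search under the hypothetical name
# find_duplicate_floyd; is_duplicated is likewise an illustrative helper.
#
```python
from collections import Counter
from itertools import product


def find_duplicate_floyd(array):
    """Compact restatement of the two-phase search, starting at index n - 1."""
    slow = fast = finder = len(array) - 1
    while True:
        slow, fast = array[slow], array[array[fast]]
        if slow == fast:
            break
    while True:
        slow, finder = array[slow], array[finder]
        if slow == finder:
            return slow


def is_duplicated(array, value):
    """Reference check that uses O(n) extra memory."""
    return Counter(array)[value] >= 2


# Try every array of length 2 .. 6 whose values lie in 0 .. n - 2.
checked = 0
for n in range(2, 7):
    for candidate in product(range(n - 1), repeat=n):
        array = list(candidate)
        assert is_duplicated(array, find_duplicate_floyd(array))
        checked += 1
print(checked)
```
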
*/
