Problem
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
Example:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC", "CCCCCAAAAA"]
Summary
Given a string s, find all substrings of length 10 that occur at least twice.
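Approach
The most direct method uses two HashSets: one stores the substrings already visited, and the other stores the substrings that belong in the answer.
On top of this, bit manipulation can be used as an optimization. Encode 'A', 'C', 'G', and 'T' as the 2-bit values 00, 01, 10, and 11, so that a string of length 10 is represented by a 20-bit integer. To obtain each new substring, shift the previous value left by 2 bits, keep only the low 20 bits, and put the code of the newly added character into the lowest two bits, much like a sliding window. The remaining steps are the same as in the HashSet method.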
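The encoding step can be sketched in isolation as follows. This is only an illustration under stated assumptions: the class name DnaEncodingSketch and the helpers code, encode, and slide are hypothetical names introduced here and do not appear in the solutions below.

```java
class DnaEncodingSketch {
    // Hypothetical helper: maps a nucleotide to its 2-bit code (A=00, C=01, G=10, T=11).
    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Hypothetical helper: encodes the first 10 characters of s as a 20-bit integer.
    static int encode(String s) {
        int cur = 0;
        for (int i = 0; i < 10; i++) {
            cur = (cur << 2) | code(s.charAt(i));
        }
        return cur;
    }

    // Hypothetical helper: slides the window one character to the right.
    // The shift pushes the oldest character above bit 19, the mask drops it,
    // and the new character's code fills the lowest two bits.
    static int slide(int cur, char next) {
        return ((cur << 2) & 0xFFFFF) | code(next);
    }

    public static void main(String[] args) {
        String s = "AAAAACCCCCA";
        int first = encode(s);                    // window "AAAAACCCCC"
        int second = slide(first, s.charAt(10));  // window "AAAACCCCCA"
        System.out.println(Integer.toBinaryString(first));     // prints 101010101
        System.out.println(second == encode(s.substring(1)));  // prints true
    }
}
```

Sliding the window costs a constant number of integer operations, whereas re-encoding the window from scratch would touch all 10 characters each time.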
Implementation
Java
HashSet
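import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        Set<String> one = new HashSet<>(); // substrings seen so far
        Set<String> two = new HashSet<>(); // substrings seen at least twice (the answer)
        for (int i = 0; i < s.length() - 9; i++) {
            String t = s.substring(i, i + 10);
            if (two.contains(t)) {
                continue;
            } else if (one.contains(t)) {
                two.add(t);
            } else {
                one.add(t);
            }
        }
        return new ArrayList<>(two);
    }
}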
Bit manipulation optimization
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        if (s.length() < 10) {
            return new ArrayList<>();
        }
        Set<String> two = new HashSet<>();
        Set<Integer> one = new HashSet<>(); // the key type is now an integer
        int[] hash = new int[26];
        hash['A' - 'A'] = 0;
        hash['C' - 'A'] = 1;
        hash['G' - 'A'] = 2;
        hash['T' - 'A'] = 3;
        int cur = 0;
        // Build the encoding of the initial substring of length 9
        for (int i = 0; i < 9; i++) {
            cur = cur << 2 | hash[s.charAt(i) - 'A'];
        }
        for (int i = 9; i < s.length(); i++) {
            // Only the low 20 bits need to be kept each time
            cur = cur << 2 & 0xfffff | hash[s.charAt(i) - 'A'];
            if (one.contains(cur)) {
                two.add(s.substring(i - 9, i + 1));
            } else {
                one.add(cur);
            }
        }
        return new ArrayList<>(two);
    }
}
JavaScript
/**
 * @param {string} s
 * @return {string[]}
 */
var findRepeatedDnaSequences = function (s) {
  let set1 = new Set() // substrings seen at least twice (the answer)
  let set2 = new Set() // substrings seen so far
  for (let i = 0; i + 10 <= s.length; i++) {
    let t = s.slice(i, i + 10)
    if (set1.has(t)) {
      continue
    } else if (set2.has(t)) {
      set1.add(t)
    } else {
      set2.add(t)
    }
  }
  return Array.from(set1)
}
