- Blogging is dead, but have we fixed anything?
- More quick links
- Quick links
- Kids, programming, and doing more
- Google and the right database for the job
- Quick links
- Code Maven and programming for teens
- Quick links
- Code Monster and teaching programming to kids
- Will tablets replace PCs?
- Quick links
- The computer scientist CEO
- Puzzling outcomes in A/B testing
- Quick links
- The game Stick Portal
- More quick links
- Quick links
- Ad targeting at Yahoo
- More quick links
- Quick links
- Browsing behavior for web crawling
- What mobile location data looks like to Google
- Even more quick links
- More quick links
- Quick links
Google Reader is shutting down, but most people moved on long ago. Blogging is dead. To the extent that it lives, it is dominated by professional journalists and writers backed by major organizations, or it has transformed into microblogging. The original objective of an amateur form of journalism -- long articles written and published without an organization or editor -- has become archaic.

I have been writing on this blog since 2004. At its peak, this blog had about 10k regular readers. Over a decade, I have watched blogging rise and fall. Nowadays, my posts here on this blog often get less attention than my tweets on Twitter. 140 characters that take two minutes to spew out sometimes get more attention than an article that takes four hours of thoughtful analysis, careful reading, and tight writing.

There is nothing wrong with people moving on. Professional journalists now use blogs to air early research or analysis that will later make it into a full print article. Companies use blogs to announce changes or new features. Many use microblogging as a useful means of quick communication. That is good.

But there was something charming about so many people trying to be amateur journalists. Journalistic writing is a skill; it emphasizes clear, tight, concise writing. That so many were attempting it and practicing it had a lot of value, both in the skills bloggers gained and in the sometimes candid and insightful articles produced.

I find my blogging here to be too useful to me to stop doing it. I have also embraced microblogging in its many forms. Yet I am left wondering if there is something we are all missing, something shorter than blogging and longer than tweets and different from both, that would encourage thoughtful, useful, relevant mass communication.

We are still far from ideal. A few years ago, millions of blog and press articles flew past, some of which might pile up in an RSS reader, a few of which might get read. Now, millions of tweets, thousands of Facebook posts, and millions of articles fly past, some of which might be seen in an app, a few of which might get read. Attention is random; being seen is luck of the draw.

We are far from ideal. Attention should flow to relevant and useful writing. I should see writings that are personally relevant and useful to me. When a friend does something I want to know about, when a colleague reads an article I should read too, when a company announces a useful change to a product I use, when a well-written article important for my work is published from a reputable source, when a major event occurs in the world, those should be brought to my attention.

Blogging wasn't that, but neither is microblogging. We need to build something that focuses our attention, improves our communication, and finally solves the problems blogging and microblogging failed to solve.
Again, it has been too long, but here you go, what has caught my attention lately:

* "Employees who ate at cafeteria tables designed for 12 were more productive than those at tables for four, thanks to more chance conversations and larger social networks. That, along with things like companywide lunch hours and the cafes Google is so fond of, can boost individual productivity by as much as 25 percent."
* "Managers avoid dealing with low performers (because they believe the conversation will be difficult), and instead assign work to the employees they enjoy -- i.e. high performers ... They end up burning out those same high performers."
* "Is it really true that using someone else's invention is the actually the same thing as stealing their sheep? If I steal your sheep, you don't have them any more. If I use your idea, you still have the idea, but are less able to profit from using it. The two concepts may be cousins, but they not identical."
* Clever and simple idea: Attach a little flash memory and a small battery to memory chips
* Another clever and simple idea: On touchscreens (like your phone), make a knuckle or nail tap like a right mouse click so it does something different
* Most data visualizations would be clearer done as a simple bar chart
* When someone comes back to a search result page after hitting the back button, you should add more search results to the bottom of the page
* For the first time, more smartphones ship than dumbphones, which has big implications, especially for the developing world
* You can identify people based on just four locations sampled from a mobility trace (cell towers and Wifi nearby) from their cell phone
* "The problem is that Apple has not been able to sustain its high margin levels"
* Humor (from The Onion): Weeping Tim Cook spotted screaming for help at Steve Jobs' tombstone
* Amazingly arrogant executive hired from Apple didn't understand the customer base or think he had to, and destroyed a major retailer
* Amazon moves against Google and Google moves against Amazon
* Very soon, only big players -- like Amazon, Facebook, and Google -- will be able to do personalized advertising. A change to third-party cookies will kill off all startups working on personalized advertising, but major websites get an exemption.
* A new compression library from Google designed for web content. Its output can be decompressed by existing software, so no changes are required on the client side; you just need to recompress the static content on the server to save about 5% in bandwidth
* eBay successfully moves away from auctions. "Auctions ... are less than 10% of what we do."
* "At this point, unfortunately, it seems clear that the Windows 8 launch not only failed to provide a positive boost to the PC market, but appears to have slowed the market ... Radical changes to elements like the user interface and higher costs had made PCs less attractive compared with tablets and other devices."
* A MacBook Pro runs Windows faster than any PC laptop (but only because PCs have so much crapware installed)
* "Aereo's founders realized that [a court] ruling offered a blueprint for building [an IPTV] service that wouldn't require the permission of broadcasters. In Aereo's server rooms are row after row of tiny antennas mounted on circuit boards. When a user wants to view or record a television program, Aereo assigns him an antenna exclusively for his own use."
* The vast majority of people have simple taxes, so simple that the IRS could just mail you a tax return, you'd look it over to make sure everything is correct and sign it, and you'd be done. Why don't we have that? Apparently, "it's been opposed for years by the company behind the most popular consumer tax software -- Intuit, maker of TurboTax."
* Why Redfin has been unable to undermine the absurdly high 6% commission when you sell your home
* "Personal finance courses ... have no effect on financial outcomes ... [but] additional training in mathematics [does]"
* "Graduate school in the humanities: Just don't go"
* At least so far, MOOCs (like Coursera and Udacity) seem to only work for people who are already highly motivated, which isn't the group in the most need
* Seems to be increasing evidence that some autoimmune diseases (including allergies) are rooted in a bored immune system incorrectly prioritizing threats. Almost a parallel with anxiety disorders, your immune system is seeing threats where none exist, incorrectly prioritizing dangers.
* "Deep waters have absorbed a surprising amount of heat -- and they are doing so at an increasing rate over the last decade"
* "Resilience -- building systems able to survive unexpected and devastating attacks -- is the best answer we have right now."
* The web-based version of blackmailing people who have done something embarrassing
* Little known fact, the second most used web server is something called Allegro RomPager
* For most people in the US, the vast majority of entertainment time is still spent watching normal, live TV
* Odd similarities between distributed denial of service attacks and pollution. As Ed Felten writes, misconfigured DNS servers allow massive DDoS attacks, but it's hard to get people to fix it, because "the resulting harm falls mostly on people outside the organization."
It's been a while since I did a Quick Links post, so there's a lot to cover. Here's the latest of what has caught my attention:

* First Netflix wanted to be Blockbuster (DVDs), then a replacement for cable (streaming video), now they want to be HBO (making content).
* "For raw bandwidth, the internet will probably never beat SneakerNet"
* Data caps are "a strategy for ISPs to increase their revenue per user ... The trend is driven in large part by a woefully uncompetitive market that allows the nation's largest providers to generate enormous profits"
* "Maybe it will eventually dawn on [ISPs] that the only way to fight the scourge of cheap, fast broadband is to provide it themselves"
* "Too many companies think of their call centers as a cost to minimize ... it's a huge untapped opportunity ... [for] word-of-mouth marketing ... [and] to increase the lifetime value of the customer"
* Mary Jo Foley says, "I keep scratching my head over who Microsoft expects to buy the Surface Pro"
* "Taking the bitter pill would mean backing off the Surface idea while smoothing over the worst parts of Windows 8. Admit that being different just for the sake of being different is a losing strategy. Go back to software engineering 101. But I don't see Ballmer making that tough decision. It's just not how he rolls. Then it'll be up to the board of directors to hold him responsible when this dogmatic strategy fails."
* "Dell outsourced the management of its supply chain, and then the design of its computers themselves. Dell essentially outsourced everything inside its personal-computer business -- everything except its brand -- to Asus ... Then, in 2005, Asus announced the creation of its own brand of computers. In this Greek-tragedy tale, Asus had taken everything it had learned from Dell and applied it for itself."
* "The Dreamliner was supposed to become famous for its revolutionary design. Instead, it's become an object lesson in how not to build an airplane"
* A deal protects Apple, Google, and a few others from being sued over Kodak's patents, but no one else. "Kodak patents may well be popping up in future patent troll suits in the future."
* Mark Cuban says, "Dumbass patents are crushing small businesses"
* Detailed technical discussion of the Super Bowl power outage and what could have been done to prevent it
* The book "Thinking, Fast and Slow" and implications for artificial intelligence
* "We understand the meaning of an object in terms of the meanings of other objects -- other chunks of reality to which our brains have assigned certain characteristics. In the brain's taxonomy, there are no discrete entries or files -- just associations that are more strongly or more weakly correlated with other associations ... Might meaning itself simply be another word for association?"
* On global warming: "There is only one thing we can do: develop renewable technologies that are substantially cheaper than coal, and give these technologies to the developing countries."
* Good summary of a Davos panel on education
* Funding at Garfield High School in Seattle is just $5,600/year/student
* Fascinating example of novel work in a field (in this case, literature) by blending it with computer science.
* Companies should stop talking about "mobile" and start splitting out tablets and smartphones separately.
* People talk about tablets killing the PC, but should be talking about tablets killing the e-reader
* Clever optimization idea from Google: "sending a hedging request after a 10ms delay reduces the 99.9th-percentile latency for retrieving all 1,000 values from 1,800ms to 74ms while sending just 2% more requests."
* "Any time you access Google, you probably are in a dozen or more experiments"
* What could we do in a distributed database if we could rely on all servers having exactly the right time?
* Spotify rediscovers what others found a decade ago, social recommendations don't work, that "no matter who you are, someone you don't know has found the coolest stuff."
* "Amazon sells things to people at prices that seem impossible because it actually is impossible to make money that way .... Competition is always scary, but competition against a juggernaut that seems to have permission from its shareholders to not turn any profits is really frightening."
* Amazon goes after personalized ads: "This platform lets the company retarget its users across the Web based on their browsing and purchase habits on Amazon's owned-and-operated properties. That could be a game changer ... given Amazon's recommendation engine"
* "Consumers want more targeted and humorous ads ... 67 percent of respondents would be willing to be answer a question to make their ads more personalized and enjoyable ... Consumers understand the exchange of free content for advertising, but they want to make sure their time tradeoff of watching ads also benefits them. They found coupons, contests and links as the most positive forms of engagement."
* "Advertisements are 182 times as likely to deliver malicious content than pornography"
* Dilbert on effective mobile advertising
* The future of maps on smartphones: "It'll be like you're a local everywhere you go. You'll know your way through the back alleys and hutongs of Beijing, you'll know your way all around Paris even if you've never been before. Signs will seem to translate themselves for you. This kind of extra-smartness is coming to people."
* Shocking to see Acer bragging about Google Chromebook sales while lambasting slow Windows 8 sales
* Chromebook is the #1 selling laptop on Amazon.com right now, not Apple, not Microsoft's Windows 8.
* Marissa Mayer says, "In the future, you'll be the query"
* Recommendation algorithms work by finding things other people loved that you haven't found yet and bringing them to your attention. It's computers helping humans help humans.
* A good UX can make people very forgiving of high error rates
* Stephen Wolfram says, "If heuristics are done well, with serious computation and knowledge behind them, they actually do work, and people like them very much ... So long as everything just works, people never think about the heuristics, never try to deconstruct them, and never notice or get confused by the lack of ultimate consistency."
* Google discovered the optimal length of an interview loop is 4 interviews. Any more hits diminishing returns.
* "Granting mothers five months of leave doesn't cost Google any more money."
* "Software development at Google is big and fast. The code base receives 20+ code changes per minute and 50% of the files change every month"
* Worth knowing and understanding: Android has 42% market share of computing devices, but only generates 5% of Wikipedia's traffic
* "Why the Google+ long game is brilliant"
* Snarky: "The real sign of Google Apps making a big dent in the business world will be when its own hiring managers are able to stop treating Microsoft Office as the de facto standard."
* "When everything is in flux, predicting what will be hot a year from now -- skating to where the puck is going to be, to quote Steve Jobs quoting Wayne Gretzky -- becomes all but impossible. Samsung's strategy is to put a man at every spot on the ice. Be in enough places and you're bound to catch something no one was predicting -- like, for instance, the world's bizarre love affair with phablets."
* Much lower power consumption on GPS trails on smartphones by offloading processing to the cloud
* Clever combination of GPS trails and a game: "The idea of cyclists recording ride data is nothing new ... What Strava did was turn ... [that] into a rigorously measured, database-matched, global community with the sudden ability to turn the most banal ride into a race ... Get that satisfaction without turning up at the starting line, in the rain, on a Saturday morning at 6 a.m."
* Interesting theory: "I had a small epiphany. The cyclists were hated because they are [viewed as] cheats. They are getting away with something that car drivers cannot."
* I love this idea of a bicycle frame completely covered in reflective paint
* Out of control: "The American Civil Liberties Union filed a Freedom of Information Act request with the FBI seeking details of its surveillance policy -- who it spies upon, and how, and under what circumstances. The FBI sent back two 50+ page memos in reply, each of them totally blacked out except for some information on the title page"
* On hedge funds: "The S&P 500 has now outperformed its hedge-fund rival for ten straight years, with the exception of 2008 when both fell sharply. A simple-minded investment portfolio -- 60% of it in shares and the rest in sovereign bonds -- has delivered returns of more than 90% over the past decade, compared with a meagre 17% after fees for hedge funds (see chart). As a group, the supposed sorcerers of the financial world have returned less than inflation. Gallingly, the profits passed on to their investors are almost certainly lower than the fees creamed off by the managers themselves."
* Appears both Vikings and Polynesians reached the Americas around 1000, well before Christopher Columbus
* The weight of glaciers during ice ages might cause an increase in volcanic eruptions
* Moderate amounts of play of first person shooters (and similar action games) improve vision, attention, and spatial skills
* Randall Munroe (author of xkcd): "I've never seen the Icarus story as a lesson about the limitations of humans. I see it as a lesson about the limitations of wax as an adhesive."
* An art project with a visible pile of pennies and a crank, that "allows anyone to work for minimum wage for as long as they like." Absolutely brilliant.
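
For the curious, the hedged request idea quoted in the list above can be sketched in a few lines of Python. This is only my illustration of the general technique, not Google's implementation: `hedged_request` and `make_fake_backend` are hypothetical names, and the fake backend just sleeps to simulate one slow replica and one fast one.

```python
import asyncio


async def hedged_request(make_request, hedge_delay):
    """Send one request; if it is still running after hedge_delay seconds,
    send a second copy and return whichever answer arrives first."""
    primary = asyncio.create_task(make_request())
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay)
    if done:
        # The common case: the first request was fast, no hedge needed.
        return primary.result()

    # The first request is slow, so hedge with a second one.
    backup = asyncio.create_task(make_request())
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # drop the loser so it stops using resources
    return done.pop().result()


def make_fake_backend(latencies):
    """Hypothetical backend for the demo: the n-th call sleeps for
    latencies[n] seconds, then returns n."""
    calls = iter(range(len(latencies)))

    async def request():
        n = next(calls)
        await asyncio.sleep(latencies[n])
        return n

    return request


def demo():
    # First call is slow (0.5s), second is fast (0.01s); hedge after 0.05s.
    backend = make_fake_backend([0.5, 0.01])
    return asyncio.run(hedged_request(backend, hedge_delay=0.05))
```

The hedge delay is what keeps the extra load small: a second request only goes out for the few requests that are already slower than the delay.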
I built Code Monster and Code Maven to get more kids interested in programming. Why is programming important?

Computers are a powerful tool. They let you do things that would be hard or impossible without them. Trying to find a name that might be misspelled in a million names would take weeks to do by hand, but takes mere moments with a computer program. Computers can run calculations and transformations of data in seconds that would be impossible to do yourself in any amount of time. People can only keep about seven things in their mind at once; computers excel at looking at millions of pieces of data and discovering correlations in them.

Being able to fully use a computer requires programming. If you can program, you can do things others can't. You can do things faster, and you can do things that otherwise would be impossible. You are more powerful.

Looking two decades out, when my kids are grown and well into their careers, I expect people who can fully use computers will have a major force multiplier. A blend of computer science and another field -- medicine, microbiology, genetics, economics, astronomy, journalism, business, almost anything -- will enable you to do things others in that field can't.

Already you can see this. Breakthroughs in genetics came from a collaboration between computer scientists and geneticists working to create new algorithms for massive scale approximate string matching. During the 2012 elections, Nate Silver redefined what it meant to be a journalist (and attracted huge amounts of traffic) by combining computing and large amounts of polling data in a new way. Astronomy is becoming a field of big data, with computers analyzing huge amounts of data from a worldwide network of telescopes, pulling out promising patterns, then having humans look over the candidates to find new discoveries. Robotic probes and the massive data streams they produce are not only taking over space exploration, but also making inroads on sea exploration, marine biology, and climatology as well. Already, if you can program, you can do things others cannot, find things others cannot.

Over the coming years, the collaboration between humans and computers is only going to grow. Computers will do what they are good at: large scale data processing, computation, and analysis. Humans will do what they are good at: finding patterns, intuiting promising paths forward despite noise and missing data, and collaborative problem solving. Those who can fully use computers, and especially those who can program computers, will be more productive. Computers are a powerful tool for those who can wield them.

Sadly, many kids today think of programming as hard. As not fun. As not for them. The problem is particularly acute for girls, leading to the awful fact that only 14% of the computer science degrees in the US are awarded to women. So many kids not getting a chance to get excited about programming is not just unfortunate, it's deeply harmful, for their future and for ours.

Code Monster and Code Maven from Crunchzilla are designed to make programming easy. Make it fun. Make programming for everyone. In the couple of months since launch, they have been used in schools and have been getting rave reviews from both girls and boys. One girl "got totally into it" and "when she came up for air", she asked her parents, "Are there jobs you can get working with computers?" And a teacher who used this in a school told me, "A couple 6th grade girls who were not interested in programmers tore through Code Monster then started on Code Academy. It was unexpected and cool!"

If you get a chance to try your children on Code Monster or Code Maven, or you use either in a school, please let me know what you think.
I finally got a chance to read "Processing a Trillion Cells per Mouse Click", a paper out of Google presented at the recent VLDB 2012 conference. It describes the rather cool PowerDrill column-oriented database at Google that is optimized for speed, 10-100 times faster than other column-oriented databases, and several orders of magnitude faster than MapReduce/Hadoop. But, of course, there are tradeoffs to get those speed gains, and the tradeoff PowerDrill makes is that it keeps a lot in memory, so it can only contain a fraction of the data of the other systems.

What is so interesting about this, and what other companies need to learn from this, is the way Google builds so many databases to analyze its massive log data. The goal is to let people find stuff in the logs as fast as possible. That means you need many tools, the right tool for the job.

Hadoop and similar systems allow you to scan massive amounts of log data but, c'mon, all of us know that the vast majority of Hadoop jobs ignore almost all of the data. Every one of these jobs starts by selecting out a couple of the columns, the same columns almost everyone else wants, and dropping everything else. Fire up your job, waste hours of time waiting for almost all the data from a full table scan to be thrown out, and finally you get the result.

Dremel and other column-oriented databases help a lot with this. If almost all log processing jobs only want a couple of columns, a column-oriented database is designed to pull out just a few columns quickly, and it's going to be a lot faster.

PowerDrill goes a step further. If almost all log processing jobs only want the most recent logs and only a few of the columns, just create a database with only the most recent logs and a few of the columns. Add in a lot of carefully designed compression, sharding across a medium-sized cluster, and the ability to skip over much of the data when it isn't needed (instead of doing full table scans all the time), and you've got yourself the ability to answer most questions people ask of the logs in seconds, not hours.

And that's the point. Build a system that can answer 90% of the questions people ask of the logs in seconds. Build another that can answer 90% of the remaining, harder questions people ask of the logs in minutes. Then have a system that primarily archives all the logs, but also can answer, given enough time and power, much more complicated questions people very rarely ask.

Those Google guys have many databases for asking questions of their logs. Maybe you should too.

Some excerpts from the PowerDrill paper:
The column-store developed as part of PowerDrill is tailored to support a few selected datasets and tuned for speed ... Our column-store relies on having as much data in memory as possible ... PowerDrill can run interactive single queries over more rows than Dremel, however the total amount of data it can serve is much smaller. Consider a typical use case such as triggering 20 SQL queries with a single click in the UI. In our production system on average these queries process 782 billion cells in 30-40 seconds (under 2 seconds per query) .... Each month it is used by more than 800 users sending out about 4 million SQL queries ... scanning [the equivalent of] 525 trillion cells .... One of our top users ... [in] 6 hours ... [executed about] 12 thousand queries .... Our production system is running on well over 1000 machines, the distributed servers altogether using over 4T of main memory. [PowerDrill] pushes the "interactivity limit" out significantly ... The majority of queries are fairly discriminative, similar, and uniform ... The store has only a few but often explored tables (as opposed to many tables that are not used very often) ... [For many common queries] our techniques push the limit of interactivity out by one or two orders of magnitude.
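
To make the column-oriented idea and the "skip data you don't need" idea concrete, here is a toy sketch in Python. It is my own illustration, not PowerDrill's actual data structures: the log fields, `build_chunks`, and `sum_where` are all made-up names, and keeping a set of distinct values per chunk is just one simple way to decide that a chunk can be skipped.

```python
def make_rows():
    """Hypothetical log records, already ordered by day."""
    rows = []
    for day in ("2012-11-01", "2012-11-02", "2012-11-03"):
        for i in range(4):
            rows.append({
                "day": day,
                "country": "US" if i % 2 == 0 else "DE",
                "latency_ms": 100 + i,
                "user_agent": "agent-%d" % i,  # a wide column most queries ignore
            })
    return rows


def build_chunks(rows, chunk_size):
    """Store rows as chunks of columns (one list per field), and remember
    the distinct values each column takes inside each chunk."""
    chunks = []
    for start in range(0, len(rows), chunk_size):
        columns = {}
        for row in rows[start:start + chunk_size]:
            for name, value in row.items():
                columns.setdefault(name, []).append(value)
        distinct = {name: set(values) for name, values in columns.items()}
        chunks.append({"columns": columns, "distinct": distinct})
    return chunks


def sum_where(chunks, filter_col, filter_val, sum_col):
    """Sum sum_col over rows where filter_col == filter_val.
    Returns (total, number of chunks actually scanned)."""
    total = 0
    scanned = 0
    for chunk in chunks:
        # Skip the whole chunk if the value cannot occur in it.
        if filter_val not in chunk["distinct"][filter_col]:
            continue
        scanned += 1
        # Only the two columns the query needs are touched;
        # user_agent and the rest are never read.
        keys = chunk["columns"][filter_col]
        values = chunk["columns"][sum_col]
        total += sum(v for k, v in zip(keys, values) if k == filter_val)
    return total, scanned
```

Calling `sum_where(build_chunks(make_rows(), 4), "day", "2012-11-03", "latency_ms")` reads only one of the three chunks, and only two of the four columns inside it. A filter on `country`, whose values are spread across every chunk, still has to scan all three, which is why how the data is partitioned matters so much for this trick.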
More of what caught my attention recently:

* Android now has 72.4% of the mobile market, up from 52.5% a year ago
* Google's new Nexus 4 smartphone is in high demand and for good reason: "The idea that a Nexus quad-core smartphone is hitting the market ... [at] $300 is simply stunning. Even more so is that it's available without any contract or carrier locks, which means you can use it virtually anywhere in the world ... The price of freedom has never been more reasonable."
* Google and Amazon aim to destroy Apple's high margin business model, selling hardware at cost and making money off content instead
* "Amazon is a black hole threatening to devour corporate America"
* "The ground is shifting beneath ... tech titans because of a major force: the rise of mobile devices"
* Mobile/tablets are being used for about 16% of online sales, but sales from referrals out of Twitter and Facebook are near 0%
* Google expects that 50% of traffic to Google.com will come from mobile in 2013. I wonder what that implies for Google, since it almost certainly does not mean 50% of revenue comes from mobile in 2013.
* Google's latest Chromebook laptop and Nexus 7 tablet are both in high demand, and Google is "massively ramping production". Meanwhile, Microsoft is cutting production of its Surface hybrid tablet because of low demand.
* Tablets mostly are used in the evening and for games and entertainment
* Surprising data (at least to me) on browser market share. I thought IE was falling rapidly, but no. Data says IE is steady, Chrome growth is stalled, and Firefox is no longer falling, actually climbing slightly.
* "Giving users the choice to view (or not view) may actually increase this advertising effectiveness"
* Experimental data is poised to kill off a big chunk of the last three decades of work in theoretical physics
* Good overview of current state of autonomous flying robots. Lots of breakthroughs recently.
* "It's actually more natural for humans to think logarithmically than linearly"
* If you don't need the actual location right away, it's three orders of magnitude cheaper (in energy use) to collect raw GPS data and process it later (in the cloud) than it is to process it immediately on the mobile device
* Startups would love to get their hands on Google Fiber (especially the upload speeds) but can't. Cities should be thinking about encouraging Google Fiber (or similar) as a way to encourage startups.
* Key question is: "Do patents, in fact, provide a net incentive for innovation in the software industry?"
* Crazy data about the incredibly low cost of renting botnets, paying for someone to take out websites with DDoS attacks, sending spam, and buying various types of trojans
* "We can't be afraid to let them actually take charge and ship"
* "Only a handful of startups that are big successes. What happens along the way that causes such failure? It's like there's a tunnel full of monsters that kill them along the way. I'm going to tell you what these monsters are so you know to avoid them."
* "By far the most common mistake startups make is to solve problems no one has"
* Dilbert summarizes the advice from most business books
* "People with lots of authority tend to behave like neurological patients with a damaged orbito-frontal lobe, a brain area that's crucial for empathy and decision-making"
* "Studies of the human brain demonstrate that .... some people seem to think about their future selves in the same way that they think about complete strangers"
* On why PC sales are flat: "Norvig's Law: Any technology that surpasses 50% penetration will never double again (in any number of months)."
* "To the surprise of pundits, numbers continue to be best system for determining which of two things is larger"
I recently launched Code Maven from Crunchzilla. It helps teens learn a little about what they can do if they learn more about programming.

Please try Code Maven, tell your friends about it, and, if you have suggestions or feedback, please e-mail me at email@example.com.

Code Maven builds on Code Monster. Code Monster is for kids ages 9-14 (but many even younger have enjoyed it too, especially with a little help). Code Maven is for teens ages 13-18 (and curious adults too, especially adults who have never programmed before). Because Code Maven is built for older kids, it assumes a longer attention span, and so is a bit harder, has more explanation, and has some additional fun projects.

Pick which one you like based on the age of your kids and your interest. You can try them both at crunchzilla.com!
What caught my attention recently:

* The latest Google and Microsoft earnings show damage from a tech disruption, a shift to mobile that is impacting both badly but for different reasons. Google needs to crack mobile ads. Microsoft needs to get share in mobile computing.
* Now "there are almost as many mobile phone subscriptions in the world as people"
* Google is getting aggressive, releasing a $99 tablet and a $250 laptop
* Amazon prices their tablet at cost
* And decent tablets in the $50 range are already widely available in China
* But Microsoft prices its new tablet above the cost of an iPad.
* Meg "Whitman liberally mixed metaphors to describe her awakening to just how screwed HP was"
* "Prepare for Windows 8 induced user rage"
* "The argument that C.E.O.s will leave if they aren't compensated well, perhaps even lavishly, is bogus"
* "FTC puts a bounty on the heads of robo-telemarketers"
* On Amazon EC2, testing performance of the instances and rejecting ones with weak performance can make a huge difference
* Good article in The Atlantic about the considerable lengths Google is willing to go to increase the quality of Google Maps
* "We read Apple's secret Genius Training Manual from cover to cover. It's a penetrating look inside Apple: psychological mastery, banned words, roleplaying -- you've never seen anything like it."
* Big data is "a process that uses data to refine our thinking. But it doesn't work without some thinking first."
* Surprisingly detailed talks on Netflix and LinkedIn's recommender systems
* Great talk on A/B testing, especially how to do A/B testing at large companies
* Amazing speaker list at a workshop on big data for personalized education, slides from many of the talks are available
* Sometimes research just confirms what we already know (or should know), in this case, that simpler websites with familiar themes in the design do better
* "Savvy Internet users know that all the great stuff they get from the Internet us for free -- the searches, the social networks, the games, even the news -- isn't really free. It's an exchange, where companies are able to take user data, sell it to advertisers, and make money."
* In the US, "80% of teens ... have a game console"
* Meanwhile, in Estonia, "a new education program that will have 100 percent of publicly educated students learning to write code"
* Xkcd on dinosaurs
* Good TED talk on publication bias, which is caused by not publishing negative results
* Got willpower depletion? One study claims, if you believe willpower depletion exists, it does, otherwise it doesn't.
* "Ever heard of the marshmallow test? The outcome may have more to do with conditioning from a child's environment"
* "Is playtesting essential to making a good game? Yes ... [But] playtesting is like an engraved invitation that reads: You are cordially invited to tell me why I suck. Bring a friend - Refreshments served. The whole point of playtesting is to make clear to you that some of the decisions you made ... are completely wrong."
I recently launched Code Monster from Crunchzilla. It helps parents teach a little programming to their kids. Please try Code Monster. It's free and it's fun. If you have kids (especially ages 9-14), please have them try it. If you know people who have kids (or adults who are young at heart and might want to dabble in programming), please tell them about it (and share on Facebook, Google+, and Twitter too). I'd love to get the word out about it, and it's all for a good cause: teaching kids to program. Finally, if you have any suggestions or find it useful for your kids, please post a comment here or e-mail me at firstname.lastname@example.org. I'd enjoy knowing how you like it and how I can make it better.
I just bet Professor Daniel Lemire $100 that they won't. At least, not any time soon. The specific terms of the bet are, "In some quarter of 2015, the unit sales of tablets will be at least twice the unit sales of traditional PCs, in the USA." Loser donates $100 USD to the charity of the winner's choice.

How did I get to this point? About a year and a half ago, I wrote a blog post for CACM, "Who needs a tablet?" The purposely inflammatory title overstates the main point, which is that rather than replace PCs, people are mostly buying tablets in addition to their PC. Even so, predictions in the article have already proven wrong. Tablet sales did not "stall around the same level where netbook sales stalled". Netbook sales peaked and stalled around 40M units/year worldwide. Tablet sales passed 60M units/year worldwide in 2011 and are projected to be twice that this year. So, tablets show no sign of stalling where netbooks did, but they are still being bought in addition to, not in replacement of, PCs. While many are taking some of the time they would have spent on their PC and spending it on their mobile or tablet instead, they still own and spend time on a laptop or PC.

This bet doesn't quite say what I want to say. What I want to say is that PCs aren't going away any time soon. They definitely are not going away by the end of 2015. Eventually, yes, but the change is not going to happen in less than three years. What the bet actually says is more about how fast people in the US will buy new tablets in 2015 compared to replacing PCs. Projections I've seen put PC unit sales in the US around 16M units/quarter and mostly flat through 2015, and tablet unit sales currently at 7M/quarter in the US and growing rapidly (projections vary from 10-16M/quarter by 2016). It seems unlikely that the projections would be that far off, so I took the bet.

But the more interesting questions are:
* What will it take to get people to stop using PCs?
* Will the tablet market continue to be dominated by expensive devices (like the $600 iPad) or convert almost entirely to low priced tablets (currently $200 with the Nexus 7 and Kindle Fire, but probably soon around $100)?
* Will anything coming in the next five years, including tablets, get people to stop buying and using PCs entirely? Or will people continue to buy and use multiple computing devices?

I've said what I think (breakthroughs in input/output, almost all $100 tablets, no). What do you think?
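For anyone who wants to check the arithmetic behind the bet, here is a rough sketch. It uses only the projections quoted above (PCs around 16M units/quarter, tablets at 7M/quarter now and 10-16M/quarter by 2016); the function name and the assumption that PC sales stay flat are illustrative, not part of the bet's terms.

```python
def tablets_needed_to_win(pc_units_per_quarter, multiple=2.0):
    """Tablet units per quarter required to meet the bet's 'at least twice' condition."""
    return pc_units_per_quarter * multiple


# Projections quoted in the post (US unit sales per quarter).
PC_UNITS = 16e6            # assumed flat through 2015
TABLET_NOW = 7e6           # current tablet sales
TABLET_HIGH_2016 = 16e6    # top of the projected 10-16M range

needed = tablets_needed_to_win(PC_UNITS)

# How many times larger than today's tablet sales that would be.
growth_needed = needed / TABLET_NOW

# How far above even the most optimistic projection that would be.
shortfall_factor = needed / TABLET_HIGH_2016

print(f"Tablets needed per quarter: {needed / 1e6:.0f}M")
print(f"Growth needed from today:   {growth_needed:.1f}x")
print(f"Versus high projection:     {shortfall_factor:.1f}x")
```

Under these assumptions, tablets would need to reach 32M units in a quarter, which is double even the high end of the projections.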
Some of what has caught my attention lately:
* Pump and dump, both at the Facebook and Groupon IPOs.
* "The thrilling demise of Groupon's crummy business model"
* Dave McClure says, "Returns for venture capital absolutely suck ... even worse ... most VCs are insufferable, arrogant, fucking assholes."
* And good advice here, also from Dave McClure: "Don't do a startup, you idiot!"
* Remember all the startups in desktop search a few years ago? They all disappeared when Microsoft fixed desktop search in Windows. Likewise, cloud storage is increasingly becoming part of the operating system (in MacOS, Windows, and Ubuntu), and that likely will kill off startups like Dropbox.
* This is the end of the customizable home page hype, also a popular startup idea a few years ago
* "Once valued at more than $160 million, [Digg] is selling for the deeply discounted price of about $500,000"
* I wonder why we don't see engineers leave en masse for another company. Lack of organization? Fear of being sued?
* After saying "Windows 8 is terrible for desktops", a reviewer goes on to predict, "Windows 7, with its 630 million licenses sold, will remain an incredibly popular OS for the next 10 years -- just like Windows XP."
* WinXP amazingly still has 26% market share but, in a bizarre twist on top of that, Microsoft decided not to support IE9 on WinXP; WinXP users have to use Chrome or Firefox if they want a modern browser.
* Brutal (and long) Vanity Fair article on Microsoft. To summarize, stack ranking and Ballmer's repeated errors killed confidence, morale, and the company's performance in the last decade. This quote captures the dysfunction: "People responsible for features will openly sabotage other people's efforts. One of the most valuable things I learned was to give the appearance of being courteous while withholding just enough information from colleagues to ensure they didn't get ahead of me on the rankings."
* Microsoft has a decent phone out now, but it's priced so high, no one sees the point of getting it. You can't have a product consumers see as inferior to an Apple product but charge Apple-level prices; people will just get the Apple product.
* Others had the same idea as the iPhone, just no one but Apple was willing to piss off the carriers and partners and launch it
* A change that may have widespread impact: current smartphones are getting powerful enough that people are waiting longer before replacing them; they're happy with what they already have. A similar thing happened a while ago with PCs, with dramatic impact on that industry, and it could be just starting for smartphones.
* Of course you can sacrifice customer service in the short-term to boost short-term profitability. Customers take a while to learn that the service is not what it once was; you're essentially drawing down from past investment in your brand. After a few years, your brand becomes soiled, retention rates fall, customer acquisition costs rise, and profitability plummets. This has happened many times in the past, and is happening again right now.
* A hybrid recommender, using both content and behavior data, wins A/B tests on Forbes.com articles. Why does that sound familiar (cough, Findory, cough)?
* Cute idea, default local search results not to where you are, but where you are likely going, based on your current trajectory
* Nice example of how better hardware in your database can be faster and cheaper than expanding your caching layer
* What we introverts have to go through to act like extroverts
* If you have ever worked with software engineers and thought, "Why are they so grumpy?", this article provides insight, understanding, and solutions.
* Long article from Steve Yegge, but with some thought-provoking points about liberal (risk embracing) and conservative (risk avoiding) programmers.
* A start on personalized education, recommendations for courses
* An interesting difference between Coursera and Udacity is that Udacity is sticking mostly to computer science. I think Udacity is right to do so, but I am also curious how well Coursera manages to do in fields outside of CS.
* Love DragonBox, a game that is primarily a fun puzzle game, but also teaches algebra. The math is subtle; the puzzles involve matching and moving things on two sides of the screen that, as it turns out, represent two sides of the equation, and all your moves are the same as moving things between two sides of an equation. Great for kids, really fun and addictive to play, love this, more like this please.
* Nice example of A/B testing in the physical world
* Google App Engine launched at the top of the stack (write code and don't know where or how it is running) and Amazon EC2 at the bottom (just providing virtual machines). It's been interesting to watch both of them move toward each other, Amazon launching more and more features on top of EC2 (like CloudFront and Elastic MapReduce) and Google launching lower level services (like this new move to allow you to run your own virtual machine in Google's cloud).
* Why read research papers? "These papers often foreshadow where the rest of the world is going."
* I like this search quality metric, WTF! @ k. Colorfully useful.
* Google Research and their hybrid research model blends research and engineering (to maximize impact and avoid the problematic tech transfer from research) and keeps projects short (but still does long-term research by iterating).
* Cow Clicker is a very amusing (and bizarrely successful) deconstructive satire of Zynga games, reduced to just clicking, waiting, and buying your way out of waiting, hilarious. Also worth seeing is Nekogames Parameters, which breaks down Diablo-like games to their core elements.
* NNet guru Geoffrey Hinton says, "The brain is confronted by a buzzing, blooming confusion. It needs to fit many different models and use wisdom of the crowds." He then goes on to show the surprising benefits of massive NNets that drop out hidden units randomly.
* "We oversimplify because, simply, there is no other way of getting by in the world"
* "It's not that our memory is a glitchy wetware version of computer flash memory; it's that the computer metaphor just doesn't apply ... We store only bits and pieces of what happened—a smattering of impressions we weave together into what feels like a seamless narrative. When we retrieve a memory, we also rewrite it, so that the next time we go to remember it, we don't retrieve the original memory but the last one we recollected."
* Amazing technology, a camera fast enough to catch light moving, can see around corners using clever algorithms, well worth watching this short talk
* Another amazing technology, very clever algorithms allowing an autonomous plane to fly at high speed in a constrained space. Go robots! Well worth watching this too, also short.
* Yet another impressive video, worth watching. Simple idea that breaks an assumption, solves a long standing problem with robot grippers, very effective, clever.
Marissa Mayer as CEO of Yahoo may be a test of a new style of executive leadership, the optimizing CEO. She is not the first computer scientist to lead a major company, but she is the first computer scientist (MSCS or higher) hired in as CEO to a Fortune 500 company. Many computer scientists view everything as an optimization problem. People, work, politics, life, everything is a search (often of a dynamic space) to find a maximum near the global maximum. Marissa Mayer is an important test of a new style of CEO. She is not a Neutron Jack or Carly Fiorina, the strong military general style of bold decisions, loyalty-first, follow me, right or wrong. She is not going to be a charismatic cheerleading, press-focused CEO, the type that views their job solely as managing the message and marketing and selling the company and themselves. She is not going to be the mad visionary of Steve Jobs, yelling at everyone while single-handedly designing breakthrough products. She is a computer scientist and appears to be leading like one. I suspect she views the company, people at the company, the products, even her own role, all as an optimization process, a search to find the most productive and most useful outcomes. The most common degree of CEOs hired into Fortune 500 companies is an MBA. Marissa appears to be the first computer scientist. This may be a test of a new style of leadership. Will Marissa Mayer be the start of companies hiring optimizing CEOs?
A fun upcoming KDD 2012 paper out of Microsoft, "Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained" (PDF), has a lot of great insights into A/B testing and real issues you hit with A/B testing. It's a light and easy read, definitely worthwhile. Selected excerpts:
I want to share more of the ideas I've been exploring. First, let me start with this, an early version of a game I'm calling Stick Portal. Click on the image to play. It is written in CoffeeScript using HTML5 canvas. You just need a browser to play, and it works pretty well on mobile devices (add it to your home screen and it'll even go full screen and behave like a free app). The idea is to create a simplified puzzle game with a level editor where kids could share levels they created. The current version has ten levels that are the tutorials to teach players how to play the game. I've just started on the level editor that will, eventually, allow people to create their own levels easily and share them with others. The motivation for this came from seeing what Valve did with Portal 2. Portal 2 had a level editor called Hammer that was amazing but incredibly hard to use. Kids were using Hammer to create puzzles for each other that they could play in Portal 2 -- which is great exposure to CAD-like modeling tools and a nice spatial reasoning workout -- but it was really painful. Valve just launched a much easier-to-use editor for Portal 2 that is truly fantastic; I highly recommend it. Stick Portal is free to play, open source (MIT license), and the code is available on GitHub. The source might be useful to people working on similar games as it contains examples of ways to use the Box2Djs physics engine, handling touch and multi-touch (and accelerometer) on mobile devices, how to make your web page look like an app, plenty of examples of working with HTML5 Canvas, crazy things like a way to automatically resize the canvas when the browser window changes or a device rotates, and a lot of other goodies. I won't claim it's the most beautiful code ever, but it is well commented and was fun to write. I hope it is useful. I plan to keep working on this and extend it to include an editor, but I've been sitting on this long enough so, in the spirit of launch early and often, I'm putting it out now.
Please let me know what you think in the comments, and I'd love it if you'd drop me a note if your kids like the game or if the examples in the source turn out to be useful to you. UPDATE: A couple people have told me they have gotten stuck not being able to guess the controls in the tutorial. It's WASD or arrow keys for movement and mouse button and mouse movement for the portal gun. On mobile, it's hold down your finger to run toward your finger and hold down above you to jump, tap to aim and fire the portal gun, and second finger (multi-touch) to move the portal gun without firing (like to maneuver a held box). I also should have said more explicitly that one very cool thing is that the game doesn't use Flash, it's just HTML5. So, it works on all modern browsers without a plug-in, which is neat-o. Also interesting is that it is a fairly complicated HTML5 game running smoothly in the browser on PCs and mobile, almost looking like a native app, but not a native app. Finally, let me add that I did this game mostly to learn about making games fun. That's a surprisingly hard thing to do. If you're interested in that topic too, there's nothing like trying to do it yourself, but I'd also recommend the books "A Theory of Fun for Game Design" and "The Art of Game Design: A Book of Lenses". And, if you find Stick Portal fun or don't find it fun, please let me know!
More of what has caught my attention recently:
* $1B for Instagram was silly and caused by fear, but it is impressive the scale Instagram built with just three engineers
* Felix Salmon at Reuters writes that Twitter is under revenue pressure and will start doing things that make the site much less pleasant to use. I'd say Facebook is under similar pressure. Both likely will make increasingly aggressive attempts to sell their users to advertisers and may face a backlash.
* Google has millions of machines, so many that "a performance improvement of even 1% can results in millions of dollars saved", which explains why they spend so much time on the details, like how threads run on cores and estimating disk space needed
* Great recent talk by Googler Jeff Dean on problems due to hitting occasional latency in large scale distributed systems, some surprising and useful advice here.
* While 89% of ad clicks are incremental (visit wouldn't have happened without the ad), only 50% of ad clicks on the top ad are incremental. Is that due to ads on navigational queries? And does Google effectively force companies to buy those ads (so competitors don't get them) even though the ads are not very effective?
* "In this two-part blog post, we will open the doors of one of the most valued Netflix assets: our recommendation system."
* "Yahoo's Chief Product Officer Blake Irving resigns" over disagreements on strategy, in particular he was "concerned about the massive engineering and research talent exodus of late, especially in Yahoo's vaunted Labs arm."
* The field of astronomy appears to be going through a major shift to large scale analysis of truly massive data sets
* Amazing to me that Walmart has taken this long to ramp up online against Amazon. Amazon even has been called the "Walmart of the Web"; you going to take that, Walmart?
* A clever analysis deduces that Amazon has 450k machines in AWS.
* A video out of Microsoft Research shows how different interacting with a tablet would feel if touch response times could be made faster. Very compelling.
* Other work out of Microsoft Research demos a Kinect-like gesture interface built using what is essentially echolocation via a laptop's built-in microphone and speaker, no other hardware required. (video and CHI 2012 paper)
What has caught my attention lately:
* Videos showing Windows 8 is horribly painful for most people, looks likely to be another Windows Vista-like flop. Really worth watching the videos or trying it yourself
* "38% of the ads are never in view to a user" and another 12% "of the ads are in view for less than 0.5 seconds"
* "Many more ads" are coming on Facebook, "a lot more advertising ... [on] Facebook's traditionally clean interface." Could this mean Facebook is having revenue trouble already?
* Not only is the iPhone over half of Apple's revenue, it is more than 70% of their profits. Apple really is a mobile phone manufacturer with a few other businesses attached.
* Coming soon, a "voice-activated assistant that remembers everything you say ... systems that are more conversational, that have the ability to ask more sophisticated followup questions and adapt to the individual ... [with] short-term and long-term memory."
* "Microsoft tries to find pockets of unrealized revenue and then figures out what to make. Apple is just the opposite: It thinks of great products, then sells them."
* "The best way to get the most out of engineers is to surround them with other great engineers."
* "It's positively de-motivating to work for a company where your job is just to shut up and take orders. In tech startup land, we all understand instinctively that we have to hire super smart people, but we forget that we then have to organize the workforce so that those people can use their brains."
* Programmers want to learn new skills and technology while working in a team of people they respect, and over 90% of programmers said they are willing to take a lower paying job to get that.
* Netflix's streaming catalog continues to deteriorate, is now down to only 853 good movies, of which only 155 were released within the last five years
* "This class is about setting you on the path to developing good taste as a programmer" (free, from Udacity, taught by Googler and AI guru Peter Norvig, starts Apr 16)
* Could this be the business model for Udacity? Offer free classes online, then send companies candidates pre-screened for machine learning programming ability?
* If my blog is any indication, the only RSS feeder still being used is Google Reader. Are all others dead now?
* Huge and wide open opportunity in personalized advertising for online news. Amazes me Yahoo and Amazon haven't gone after this, and that Google hasn't done a better job going after it.
* Paper with fascinating statistics on Groupon and other daily deal sites. Most dramatic, it costs restaurants half a star in their Yelp rating if they offer Groupon deals.
A remarkably detailed paper, "Web-Scale User Modeling for Targeting" (PDF), will be presented at WWW 2012 that gives many insights into how Yahoo does personalized advertising. In summary, the researchers describe a system used in production at Yahoo that does daily builds of large user profiles. Each profile contains tens of thousands of features that summarize the interests of each user from the web pages they have viewed, searches they made, and ads they have viewed, clicked on, and converted (bought something) on. They explain how important it is to use conversions, not just ad clicks, to train the system. They measure the importance of using recent history (what you did in the last couple days), of using fine-grained data (detailed categories and even some specific pages and queries), of using large profiles, and of including data about ad views (which is a huge and low quality data source since there are multiple ad views per page view), and find all those significantly help performance. Some excerpts from the paper:
We present the experiences from building a web-scale user modeling platform for optimizing display advertising targeting at Yahoo .... Our work ... [looks] into understanding the effect of different user activities on prediction, [gives] insights about the temporal aspect of user behavior (recency vs. long-term trends), and [explores] different variants (user representation and target label) through large offline and online experiments .... We deployed our platform to production and achieved a [large] boost in online metrics, such as eCPA, compared to the old system. Our objective is to refine the targeting constraints using the past behavior of the users ... [so] we can improve the number of conversions per ad impression without greatly increasing the number of impressions. User profiles are aggregated logs from different systems/products (e.g. user logs of Yahoo News, Yahoo Finance, etc.) .... We consider several different events ... [including] pages visited .. the category of the page ... searches issued, clicks on search links, clicks on search advertising links ... [and] the category of the search query ... [and] views and clicks on ads ... [and] the ad category ... from an existing hierarchical ad categorizer. Our results show [a] large performance loss incurred in favoring long-term history over short-term history. This is obvious as the recent history clearly communicates with a high probability the current interest of the user ... Although recent history is more important than older history, we still need to include older history to get the most complete idea about the user. Results show ... many of our raw features are completely non-discriminative. However, a small percentage of these features are actually important ... [For example, just] ... dropping all raw ad views ... [or if] we drop all raw features and only keep categorical features ... [causes] dropping [of] the weighted AUC measure by 3.69% and 4.26%, respectively ... In production ... 
we apply a coarse feature selection through mutual information, then we apply a rigorous feature selection through l1 regularization.

Very interesting. A couple things I am left wondering: First, they found recent history is very effective, yet only update the profiles daily. Wouldn't their results on the value of recent behavior (which others found too) suggest that there would be benefit from hourly or, even better, real-time updates of the profiles (perhaps with a second memory-based, unreliable, and partial coverage system supplementing the data in the more complete and more accurate older profiles)? That would allow the system to adapt immediately when someone, for example, starts looking at information for a vacation to Hawaii and show relevant offers immediately instead of only being able to do it the next day when it is usually too late. Unfortunately, I suspect we're not going to see really big gains in relevance and usefulness of ads without real-time updates to profiles of fine-grained interests; results that show that data only 24 hours old is better than data a week old may only be a tease of the gains to be seen with data only seconds old. Second, they find that features based on individual search queries and pages viewed ("raw features") usually have no value, but occasionally have enough value that it is important to include some. Wouldn't that suggest that the categorization scheme for pages viewed, searches made, and ads needs to be more fine-grained (e.g. not just the category "pants", but the category "men's boot cut jeans")? Or, better, perhaps more fine-grained while also correctly cross correlated (interest in "men's boot cut jeans" not only shows in the data a weak interest in all pants, but also maybe has been shown to indicate a fairly strong interest in "men's flannel shirts")?
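To make the recency point concrete, here is a minimal sketch of a profile where each event's weight decays with age. This is not the method from the paper; the exponential decay, the `half_life_hours` parameter, and the function name are all assumptions chosen only to illustrate how recent behavior can dominate while older history still contributes.

```python
from collections import defaultdict


def decayed_profile(events, now_hours, half_life_hours=24.0):
    """Build a {category: weight} profile from (timestamp_hours, category) events.

    Each event contributes 0.5 ** (age / half_life), so an event one
    half-life old counts half as much as one that just happened.
    The decay form and half-life are illustrative assumptions.
    """
    profile = defaultdict(float)
    for timestamp_hours, category in events:
        age = max(0.0, now_hours - timestamp_hours)
        profile[category] += 0.5 ** (age / half_life_hours)
    return dict(profile)


# One old page view about pants, then two recent views about a Hawaii vacation.
events = [
    (0.0, "pants"),
    (47.0, "hawaii vacation"),
    (48.0, "hawaii vacation"),
]
profile = decayed_profile(events, now_hours=48.0)
top_interest = max(profile, key=profile.get)
print(top_interest, profile)
```

A profile like this could be updated as events arrive rather than once a day, which is the kind of real-time supplement to the daily build that I am wondering about above.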
If you are interested in this paper, you might also want to look at another recent paper out of Yahoo Research, "Learning to Target: What Works for Behavioral Advertising" (ACM), which is referenced multiple times by this paper and describes the features used in the user profile in a bit more detail, as well as the results of some other experiments. Please see also my 2007 post, "What to advertise when there is no commercial intent?"
More of what has caught my attention lately:
* Laptops with Kinect sensors are coming. Worth paying attention to, gesturing in air to issue commands, a very different UX could be built on top of this
* "Each streaming subscriber is worth only $2.40 in profit each quarter to Netflix, compared to $17.32 for each DVD subscriber. The old business was very lucrative. The new business kind of sucks."
* "You're not going to get content owners to license ... for less than what they get from the cable companies ... [if you will] use that cheap content to destroy the cable companies' business model."
* "Federal officials approached Google with evidence of its employees' wrongdoing ... Google agreed to pay $500 million to ... ward off criminal charges against the company."
* Google is spending nearly $1B every quarter buying new servers and data centers. That buys a lot of machines.
* Education startups are suddenly very, very hot.
* "Tiny directional antennas at the top of each rack ... send and receive data. A central controller monitors traffic patterns, finds network bottlenecks, configures the antennas and turns on the wireless links when more bandwidth is required ... The design sped up traffic by at least 45 percent."
* "Wimpy cores are fine, but if you go down to the wimpiest range, your gains really have to be enormous if you want to consider all the aggravation -- and the hit to their productivity -- that your software engineers face."
* A Facebook engineer explains why it is actually the right thing for Facebook to produce buggy code
* "How sex, bombs, and burgers shaped our world"
* "There is a monolithic view that this generation of technology I.P.O.s is completely broken."
* Just three engineers built and run Instagram, which has 14 million users, 150 million photos, several terabytes of data, and hundreds of machines.
* Startup founders "say that if they'd known when they were starting their company about the obstacles they'd have to overcome, they might never have started it."
* Two 17-year-olds used a weather balloon to send a little Lego astronaut and a video camera 15 miles into the stratosphere. Very fun.
Some of what has caught my attention recently:
* Security guru Bruce Schneier predicts "smart phones are going to become the primary platform of attack for cybercriminals" soon
* If, next, Amazon does a smartphone, I hope it is WiFi-based, like Steve Jobs originally wanted to do with the iPhone
* iPhone owners love Siri despite its flaws
* Valve, makers of Steam, talks about their pricing experiments: "Without making announcements, we varied the price ... pricing was perfectly elastic ... Then we did this different experiment where we did a sale ... a highly promoted event ... a 75 percent price reduction ... gross revenue [should] remain constant. Instead what we saw was our gross revenue increased by a factor of 40. Not 40 percent, but a factor of 40 ... completely not predicted by our previous experience with silent price variation."
* An idea whose time has come, profiling code based not on the execution time required, but the power consumed
* Grumpy about work and dreaming about doing a startup? Some food for thought for those romanticizing startup life.
* Yahoo discovers toolbar data (the URLs people click on and browse to) helps a lot for web crawling
* Google Personalized Search adds explanations. Explanations not only add credibility to recommendations, but also make people more accepting of recommendations they don't like.
* "Until now, many education studies have been based on populations of a few dozen students. Online technology can capture every click: what students watched more than once, where they paused, what mistakes they made ... [massive] data ... for understanding the learning process and figuring out which strategies really serve students best."
* Andrew Ng's machine learning class at Stanford was excellent; I highly recommend it. If you missed it the first time, it is being offered again (for free again) next quarter.
* Microsoft giving up on its version of Hadoop? Surprising.
* The NYT did a fun experiment crowdsourcing predictions. The results are worth a look.
* Web browsers (Firefox and Chrome) will be a gaming platform soon
A recent paper out of Yahoo, "Discovering URLs through User Feedback" (ACM), describes the value from using what pages people browse to and click on (which is in Yahoo's toolbar logs) to inform their web crawler about new pages to crawl and index. From the paper:
Major commercial search engines provide a toolbar software that can be deployed on users' Web browsers. These toolbars provide additional functionality to users, such as quick search option, shortcuts to popular sites, and malware detection. However, from the perspective of the search engine companies, their main use is on branding and collecting marketing statistics. A typical toolbar tracks some of the actions that the user performs on the browser (e.g., typing a URL, clicking on a link) and reports these actions to the search engine, where they are stored in a log file. A Web crawler continuously discovers new URLs and fetches their content ... to build an inverted index to serve [search] queries. Even though the basic mechanism of a crawler is simple, crawling efficiently and effectively is a difficult problem ... The crawler not only has to continuously enlarge its repository by expanding its frontier, but also needs to refresh previously fetched pages to incorporate in its index the changes on those pages. In practice, crawlers prioritize the pages to be fetched, taking into account various constraints: available network bandwidth, peak processing capacity of the backend system, and politeness constraints of Web servers ... The delay to discover a Web page can be quite long after its creation and some Web sites may be only partially crawled. Another important challenge is the discovery of hidden Web content ... often ... backed by a database. Our work is the first to evaluate the benefits of using the URLs collected from a Web browser toolbar as a form of user feedback to the crawling process .... On average, URLs accessed by the users are more important than those found ... [by] the crawler ... The crawler has a significant delay in discovering URLs that are first accessed by the users ...
Finally, we [show] that URL discovery via toolbar [has a] positive impact on search result quality, especially for queries seeking recently created content and tail content.
The paper goes on to quantify the surprisingly large number of URLs found by the toolbar that are useful, not private, and not excluded by robots.txt. Importantly, a lot of these are deep web pages, only visible by doing a query on a database, and hard to ferret out of that database any way but looking at the pages people actually look at. Also interesting are the metrics on pages the toolbar data finds first. People often send links to new web pages by e-mail or text message. Eventually, those links might appear on the web, but eventually can be a long time, and many of the URLs found first in the toolbar data ("more than 60%") are found way before the crawler manages to discover them ("at least 90 days earlier than the crawler"). Great paper out of Yahoo Research and a great example of how useful behavior data can be. It is using big data to help people help others find what they found.
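To make the filtering step concrete, here is a minimal Python sketch of how toolbar URLs might be screened before being handed to a crawler. It is not the paper's method. The function names, the `robots_by_host` mapping, and the privacy heuristic are all hypothetical, and the sketch assumes robots.txt contents have already been fetched for each host.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that was already fetched (no network access here)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def looks_private(url: str) -> bool:
    """Crude, hypothetical privacy heuristic: skip URLs with session-like parameters."""
    query = urlsplit(url).query.lower()
    return any(key in query for key in ("session", "token", "password"))


def new_crawl_candidates(toolbar_urls, known_urls, robots_by_host, user_agent="*"):
    """Keep toolbar URLs the crawler has not seen, that do not look private,
    and that the host's robots.txt (when we have one) allows us to fetch."""
    candidates = []
    seen = set(known_urls)
    for url in toolbar_urls:
        if url in seen or looks_private(url):
            continue
        seen.add(url)  # also drops later duplicates within the toolbar log
        robots = robots_by_host.get(urlsplit(url).netloc)
        if robots is not None and not robots.can_fetch(user_agent, url):
            continue
        candidates.append(url)
    return candidates
```

A real system would also need to rank the surviving URLs, since the crawler still has limited bandwidth. The sketch covers only the "useful, not private, and not excluded by robots.txt" screening.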
A recent paper out of Google, "Extracting Patterns From Location History" (PDF), is interesting not only for confirming that Google is studying how to use location data from mobile devices for a variety of purposes, but also for its description of the data they can get. From the paper:
Google Latitude periodically sends his location to a server which shares it with his registered friends. A user's location history can be used to provide several useful services. We can cluster the points to determine where he frequents and how much time he spends at each place. We can determine the common routes the user drives on, for instance, his daily commute to work. This analysis can be used to provide useful services to the user. For instance, one can use real-time traffic services to alert the user when there is traffic on the route he is expected to take and suggest an alternate route. Much previous work assumes clean location data sampled at very high frequency ... [such as] one GPS reading per second. This is impractical with today's mobile devices due to battery usage ... [Inferring] locations by listening to RF-emissions from known wi-fi access points ... requires less power than GPS ... Real-world data ... [also] often has missing and noisy data. 17% of our data points are from GPS and these have an accuracy in the 10 meter range. Points derived from wifi signatures have an accuracy in the 100 meter range and represent 57% of our data. The remaining 26% of our points are derived from cell tower triangulation and these have an accuracy in the 1000 meter range.
The paper goes on to describe how they clean the data and pin noisy location trails to roads. But the most interesting tidbit for me was how few of their data points come from GPS and how much they have to rely on less accurate cell tower and WiFi hotspot triangulation. A lot of people have assumed mobile devices would provide nice trails of accurate and frequently sampled locations. But, if the Googlers' data is typical, it sounds like location data from mobile devices is going to be very noisy and very sparse for a long time.
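As a rough illustration of clustering noisy points to find the places someone frequents, here is a small Python sketch. It is not the paper's algorithm. The greedy clustering, the 200 meter radius, the weighting by inverse accuracy, and every name below are assumptions chosen for illustration. Only the per-source accuracy figures come from the quote above.

```python
import math

# Approximate accuracy in meters per source, from the figures quoted above.
ACCURACY_M = {"gps": 10.0, "wifi": 100.0, "cell": 1000.0}


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def cluster_places(points, radius_m=200.0):
    """Greedy clustering: a point joins the first cluster whose center is within radius_m.

    points: iterable of (lat, lon, source), with source a key of ACCURACY_M.
    Centers are weighted means with weight 1 / accuracy, so GPS points count most.
    Returns clusters sorted by number of points, largest first.
    """
    clusters = []
    for lat, lon, source in points:
        w = 1.0 / ACCURACY_M[source]
        for c in clusters:
            if haversine_m(lat, lon, c["lat"], c["lon"]) <= radius_m:
                total = c["weight"] + w
                c["lat"] = (c["lat"] * c["weight"] + lat * w) / total
                c["lon"] = (c["lon"] * c["weight"] + lon * w) / total
                c["weight"] = total
                c["count"] += 1
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append({"lat": lat, "lon": lon, "weight": w, "count": 1})
    return sorted(clusters, key=lambda c: c["count"], reverse=True)
```

With most points accurate only to 100 or 1000 meters, a fixed radius like this will merge nearby places and split others, which is one reason the cleaning the paper describes matters.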
Even more of what has caught my attention recently:
* Spooky but cool research: "Electrical pulses to the brain and muscles ... activate and deactivate the insect's flying mechanism, causing it to take off and land ... Stimulating certain muscles behind the wings ... cause the beetle to turn left or right on command."
* Good rant: "Our hands feel things, and our hands manipulate things. Why aim for anything less than a dynamic medium that we can see, feel, and manipulate? ... Pictures Under Glass is old news ... Do you seriously think the Future Of Interaction should be a single finger?"
* Googler absolutely shreds traditional QA and argues that the important thing is getting a good product, not implementing a bad product correctly to spec. Long talk; if you're short on time, the talk starts at 6:00, the meat of the talk starts at 13:00, and the don't-miss parts are at 17:00 and 21:00.
* "There has been very little demand for Chromebooks since Acer and Samsung launched their versions back in June. The former company reportedly only sold 5,000 units by the end of July, and the latter Samsung was said to have sold even less than that in the same timeframe."
* With the price change to offer Kindles at $79, Amazon is now selling them below cost
* Personalization applied to education, using the "combined data power of millions of students to provide uniquely personalized learning to each."
* It is common to use human intuition to choose algorithms and tune their parameters, but this is the first I've ever heard of using games to crowdsource algorithm design and tuning
* Great slides from a Recsys tutorial by Daniel Tunkelang that really capture the importance of UX and HCIR in building recommendation and personalization features
* Bing finally figured out that when judges disagree with clicks, clicks are probably right
* Easy to forget, but the vast majority of US mobile devices still are dumbphones
* Finally, finally, Microsoft produces a decent mobile phone
* Who needs a touch screen when any surface can be a touch interface?
* Impressive augmented reality research demo using Microsoft Kinect technology
* Very impressive new technique for adding objects to photographs, reproducing lighting, shadows, and reflections, and requiring just a few corrections and hints from a human about the geometry of the room. About as magical as the new technology for reversing camera shake to restore out-of-focus pictures to focus.
* Isolation isn't complete in the cloud -- your neighbors can hurt you by hammering the disk or network -- and some startups have decided to go old school back to owning the hardware
* "The one thing that Siri cannot do, apparently, is converse with Scottish people."
* Amazon grew from under 25,000 employees to over 50,000 in two years
* Google Chrome is pushing Mozilla into bed with Microsoft? Really?
* Is advice Steve Jobs gave to Larry Page the reason Google is killing so many products lately?
* Why does almost everyone use the default software settings? Research says it appears to be a combination of minimizing effort, an assumption of implied endorsement, and (bizarrely) loss aversion.
More of what has caught my attention recently:
* The first Kindle was so ugly because Jeff Bezos so loved his BlackBerry
* "Sometimes it takes Bad Steve to bring products to market. Real artists ship."
* "The Mac sleep indicator is timed to glow at the average breathing rate of an adult: 12 breaths per minute." Beautiful example of attention to design.
* "A one-star increase in Yelp rating leads to a 5-9 percent increase in revenue"
* Facebook games, rather than try to be fun, try to be addictive. They feed on the compulsive until they give up their cash. The most addicted spend $10k in one game in less than a year.
* "The Like and Recommend buttons Facebook provides to other Web sites send information about your visit back to Facebook, even if you don't click on them ... Facebook can find out an awful lot about what you do online."
* A new automated attack on CAPTCHAs that can break them in an average of three tries. Even so, paying people to break CAPTCHAs is so cheap that that is probably what the bad guys will continue to do.
* Online backup and storage is now basically free. I expect this to be part of the operating system soon (it nearly is in Windows and Ubuntu) and all profits in online backup to drop to zero.
* Prices for Netflix acquiring their streaming content appear to be going way up. Netflix just paid $1B over eight years for some CW network shows, and Starz rejected $300M/year -- a 10x increase -- for their movies.
* Someone spun up a truly massive cluster on Amazon EC2, "30,472 cores, 26.7TB of RAM and 2PB (petabytes) of disk space."
* "Google's brain [is] like a baby's, an omnivorous sponge that [is] always getting smarter from the information it [soaks] up."
Some of what has caught my attention recently:
* "60 percent of Netflix views are a result of Netflix's personalized recommendations" and "35 percent of [Amazon] product sales result from recommendations"
* When doing personalization and recommendations, implicit ratings (like clicks or purchases) are much less work and turn out to be highly correlated to what people would say their preferences are if you did ask
* Good defaults are important. 95% won't change the default configuration even in cases where they clearly should.
* MSR says 68% of mobile local searches occur while people are actually in motion, usually in a car or bus. Most are looking for the place they want to go, usually a restaurant.
* Google paper on Tenzing, a SQL layer on top of MapReduce that appears similar in functionality to Microsoft's Scope or Michael Stonebraker's Vertica. Most interesting part is the performance optimizations.
* Googler Luiz Barroso talks data centers, including giving no love to using flash storage and talking about upcoming networking tech that might change the game.
* High quality workers on MTurk are much cheaper than they should be
* Most newspapers should focus on being the definitive source for local news and the primary channel to get to small local advertisers
* Text messaging charges are unsustainable. Only question is when and how they break.
* "If you want to create an educational game focus on building a great game in the first place and then add your educational content to it. If the game does not make me want to come back and play another round to beat my high-score or crack the riddle, your educational content can be as brilliant as it can be. No one will care."
* A few claims that it is not competitors' failures, but Apple's skillful dominance of supply chains, that prevents Apple's competitors from successfully copying Apple products. I'm not convinced, but worth reading nonetheless.
* Surprising amount of detail about the current state of Amazon's supply chain in some theses out of MIT. Long reads, but good reads.
* If you want to do e-commerce in a place like India, you have to build out your own delivery service.
* Like desktop search in 2005, Dropbox and other cloud storage products exist because Microsoft's product is broken. Microsoft made desktop search go away in 2006 by launching desktop search that works, and it will make the cloud storage opportunity go away by launching a cloud drive that works.
* Just like in 2005, merging two failing businesses doesn't make a working business. Getting AOL all over you isn't going to fix you, Yahoo.
* Good rant on how noreply@ e-mail addresses are bad customer service. And then the opposite point of view from Google's Sergey Brin.
* Google founder Sergey Brin proposed taking Google's entire marketing budget and allocating it "to inoculate Chechen refugees against cholera"
* Brilliant XKCD comic on passwords and how websites should ask people to pick passwords