One hundred twenty-three years ago my great-grandmother's first husband died in a hotel in Kansas City, asphyxiated because the gas had been left on overnight (the hotel did not yet have electric lighting). A letter was hastily written on a piece of hotel stationery to be delivered to his wife in the neighboring farming community where she lived.
It is fortunate for me that someone thought to hang on to that note, since I have become interested in genealogy and this was a fairly significant event in family history (had he not died I don't suppose I would be around, since it was her second marriage that gave me my grandfather).
I long for scraps of anything that my dead relatives wrote, created, etc. It connects me better to the past: the lives they lived, how they lived them. It somehow grounds me a little better ... well, it's rather hard to explain the draw of genealogy.
Sadly very little of the ephemera of everyday life was kept. I get it. It might have seemed like hanging on to junk mail — like you were a hoarder or whatever, but in this digital era we should be able to hold terabytes of what may appear to be ephemera.
I'm doing what I can – not for ego, I think, but for future generations that may find a connection to their past interesting.
30 years ago there was no digital world. Nearly all information was in physical artifacts. The things worth saving haven't really changed, but the amount of noise they are buried in has. Imagine if that letter was kept in a two ton pile of ad fliers. Sure, someone would find some of those fliers interesting, but you'd have been much less likely to even know about the letter.
Well, I remember a lot of great stuff on Usenet circa 1994, but it looks like Google shut down access to it via Google Groups, which used to archive it in a searchable way.
There was a ton of great stuff 30 years ago, and I think it's definitely worth saving.
The Internet was a very different place, but it was quite real 30 years ago, and I think the idea that the further back you go the more valuable this kind of thing is is the right way of looking at it.
An aside about ad spam from companies that I occasionally buy from:
Often ad spam comes from the same mailbox as order receipts and includes words like “order”, while messages with receipts never include the word “receipt”. When you're inundated with ad spam daily, or sometimes several times a day, from the same company, it becomes very difficult to filter out everything that isn't a receipt when cleaning a neglected inbox.
After I’m gone, I fully expect my family just to delete it all because the signal to noise is so low.
I don't have anyone to do anything after I'm gone, so I just delete the emails myself. I do keep the notable ones, like registration information and some payment receipts but otherwise everything goes to the trash.
Bonus points:
I don't need a 30/50/100 GB mailbox (and the associated mailbox cost nowadays).
Search is not only fast, but if I didn't find something, then it simply isn't in the mailbox.
It's mentally pleasurable to log in once in a while and throw a bunch of unneeded stuff into the trash bin, quite similar to cleaning a room in real life.
Fortunately Gmail tabs go a lot of the way to letting you mass delete junk you don’t care about. Assuming you do even a modicum of labeling stuff you might like to refer to or act on, deleting at least older promotions and updates eliminates a lot of things.
I haven't used Gmail for years, but back then the labels were not quite up to the task.
Thankfully the FastMail interface makes 'search from this address' and 'search to this address' (I'm using per-service addresses) and then the 'select all' and 'delete' actions a breeze.
A selection of 74 items over a 10-year period is a different proposition from, e.g., keeping two tons of ad fliers from November 17th 1907 (and every other thing, every other day, all the time).
Ads range from a (necessary, in a capitalist society) nuisance to a scourge, and people justly put up increasingly thick boundaries to shield themselves from their influence. When waning cultural relevance or whatever dilutes that influence, you can more easily see the ads for what they are: often manipulative marketing tactics implemented through often genuinely beautiful art and design. Both aspects are fascinating to consider and the art can be quite enjoyable. Early modernist posters from Paris are beautiful. Watching collections of mid-century television ads in the Prelinger Archives is fun, and tells us a lot about the ways we are influenced by modern ads speaking to current perspectives, fashions, and concerns.
Capitalism would work 100% fine without ads because people naturally compare and contrast options when buying stuff.
All that's necessary is making it possible for people seeking out your type of product to find you. And for revolutionary products, there's word of mouth.
If anything I think capitalism would function better without ads, because I would argue that advertising overall results in less informed customers, especially the modern lifestyle/brand type of advertising that's clearly quite effective at manipulating people.
It’s an interesting question I guess (and slightly worrying that I can more easily imagine the end of the world than the end of advertising). Especially if we take it to the extreme and imagine sponsored listings also don’t exist. I guess incumbents would have a big advantage.
There are second order effects of ads that we’d need to consider. Facebook and Google wouldn’t exist as we know them. Maybe that means some of their research doesn’t happen?
If incumbents would be favored then it stands to reason that total advertising spend would be at least loosely proportional to market decentralization. Yet in America advertising spend has increased many orders of magnitude since the 50s, while the market has simultaneously become dramatically more centralized.
By contrast in parts of the world with relatively negligible advertising, markets tend to be heavily decentralized.
And I think this makes far more sense if you think about it. If you make a soft drink that is rated far higher than Coke in blind tests, or perhaps one that is nearly indistinguishable in flavor but cheaper, you stand very little chance of competing successfully. There's a reason Coke spends billions per year on advertising, and it has nothing to do with reaching the three remaining people who are not aware Coke exists.
And yeah, without advertising the "free" services on the internet wouldn't exist, replaced by a mixture of genuinely free services and for-pay ones. This would IMO be a dramatically better state of affairs. Businesses whose actual customers are not the people using their product/service lead to such dystopian nonsense.
There's a similar question that crosses my mind occasionally: 'how would capitalism work if there were no brand names and no advertising, but only product reviews?'
Would it even be possible to safeguard the product reviews system from bribery? The current systems we use for product reviews obviously would be unsuitable.
At any rate, I commiserate about the role of branding and advertising today. It's as often noise as it is information.
If there were no ads, how would people know that products existed? Would they just see the products on store shelves? What about services? Would labels be ads? Would how stores merchandise things be advertisements? Could businesses negotiate for specific product placement? How would you find out about stores? Would store signs be ads? How about really big ones? How about at the edge of their property along a road or highway? Could the sign say what the store sold? If you were to start a product guide to help people find what they need, how could you possibly afford to buy enough products to be useful and up-to-date enough while slow-crawl word of mouth got the business off the ground? Would asking people to tell their friends be an ad? If not, could you pay someone to spread the word about your product? Would traveling sales reps be ads? What if they wore head-to-toe logo gear? Could you just pay people to do that without selling things? Ads suck but I don’t see how a capitalist society could survive without them.
I think the definition would have to be an exchange of something of value for telling other people about a product. There are some companies that got off the ground with no paid advertising but I think they’re an exception. Generally people are not seeking out new products.
But the whole point of a capitalist society is that competitors that do things better/cheaper start taking customers so the capital moves to the best and most efficient system.
I don’t think that’s true? Tons of stuff from that era had been digitized, even before newer more relevant stuff and older rarer stuff, because the acid paper had a short shelf life and there were so many ads in printed stuff then. I might have a skewed perspective from working in the digitization world for quite some time. I think they’re selling what they sell with all their other content— discovery, curation, preparation, and easy delivery.
It’s not like you currently go to a webpage and save all the images onto deep storage for archival… I’m not sure what relevance things being digital has on identifying noise.
If the ancestor before you is hoarding anything that comes across their path, be it digital ads or every physical greeting card they’ve ever gotten, the problem is with the person’s collection habits, not the medium.
What about robots reading each flier and checking if something is odd about that particular one? It could find the letter and report it to you. Even easier if it was all digital information.
> ...well, it's rather hard to explain the draw of genealogy.
I've noticed people becoming more interested in genealogy when they - let me phrase this delicately - reach a certain age. My speculation is that it is a component of grappling with one's own mortality. As the grays and wrinkles multiply, some obsess over healthy eating and exercise, some wealthier ones invest in immortality research, some get blood boys, and the rest feel an urgent need to research our genealogy; any detritus that shows our progenitors existed proves some trace of us having been here will remain, and perhaps our existence means something, as time cruelly keeps marching on.
People's interests change over time. It's not necessarily because folks are grappling with their own mortality. For instance, lots of older folks seem to get into bird watching.
I also want to point out that saying "let me phrase this delicately" to the person who is the subject of the sentence is not tactful. It's honestly kind of rude when you're on the receiving end. If you're going to judge me to my face, just say the words.
A lot of young people who don't know their ancestors are interested in genealogy. (Adoption, immigration, war refugee). People who have so much genealogy already built into their life via a large family don't need to be consciously into genealogy, because they're already immersed.
This reminds me of a recent flea market experience. There at some stand were boxes of old used post cards and 100-year-old family photos. Photos of people posed on a porch in their Sunday best. Or just mundanely standing around a car, not everyone looking at the camera.
It's hard to assign a value to these things. They are simultaneously junk and treasure. I think about the journey these items took to find their way to that flea market table. It was too diverse a collection to have come from one place. So I imagine all the paths each individual item traversed. The joy of the recipient reading a post card, holding on to it, rediscovering it on spring cleaning days. Or the photo living in an album or framed on a wall somewhere for a lifetime.
I'm not sure what the value of it all is if it just gets lugged around to various flea markets and sold piecemeal for $1 each.
> There at some stand were boxes of old used post cards and 100-year-old family photos. Photos of people posed on a porch in their Sunday best. Or just mundanely standing around a car, not everyone looking at the camera.
> I'm not sure what the value of it all is if it just gets lugged around to various flea markets and sold piecemeal for $1 each.
I purchase, scan, and resell those kinds of things. I'd love to have a centralized, public repository in which to store the data. As our tech gets better at extracting data from that material more and more interesting applications could be developed. Imagine being able to find 100+ year old photos of your ancestors via facial recognition and extracted metadata searches.
I wish I could come up with a non-profit business model that worked for preserving that kind of stuff. I would love to gather up the historical ephemera that's being lost, catalog it via manual and automated processes, and make it available to the public. (Yes, I am aware there are privacy concerns. It's a pie-in-the-sky idea. I just hate to see all of the previously captured and curated effort that went into ephemera cast to the winds.)
I've been thinking about the same topic recently, with a specific focus on antiquarian books.
There is a local business I think of as a "Book Butcher". They buy an old book with beautiful engravings, cut out the pages and resell the individual pages as interior design pictures for hanging. Imagine if we could get them to scan & archive each book before pulling it apart...
Another idea is simply archiving the graphics from eBay listings. Sometimes there is valuable information in the pictures that accompany listing, but they disappear forever once the item sells.
I'd be glad to connect with anyone who's interested in this stuff.
At the same flea market, there was a stand selling pages ripped from a book (I think it was a dictionary) and fed through a printer to add a picture to the page. I found the concept interesting, but the pictures they used were mostly bad. There were a few good ones, but most were pixelated Star Wars clip art kind of stuff or Jack Daniels logo.
Regarding genealogy, it is worth looking at the work The Church of Jesus Christ of Latter-day Saints has been doing, which helps genealogical researchers around the globe [1], well beyond that specific church.
Sure, there are a ton of reasons to archive. And if it's cheap to do (in terms of money, yes, but also in terms of time, effort, mental health, etc.) then I am of the mind that we should archive everything.
But, it often isn't cheap to do, and in that case, it makes sense to prioritize. The high priority items for me are the things that I might want to share, the ideas I want to amplify for my contemporaries and future generations that might examine my life. Stuff like [1] [2] and [3] which has influenced my thinking fundamentally, that I hope to build upon so that others can build upon what I have built.
I'd argue that you do this intuitively: you're mentioning a letter from your family's past because it is a high priority item--it's relevant because it was the last written words of your great-grandmother's first husband.
But, there's a lot that isn't worth keeping. My first form of archiving as a teenager was keeping ticket stubs for movies and concerts--a decade later I was going through my pile and found that I didn't even remember most of them. The better movies, I remembered--and I had them on DVD. The better concerts, I remembered--and I also had journal entries and CDs to remember the experience and the music. It's not important to me where/when I saw Everything, Everywhere, All At Once in theaters, but I have it on DVD and I can't wait to show it to my niece when she's older. And sure, I saw Amigo the Devil live, but frankly, he's not an artist you need to see in concert--the greatest impact of Cocaine and Abel[4] on me was when I listened to it alone in my room. The ticket stubs simply don't matter to me.
It's funny you mention ticket stubs, because I also have a similar collection, and I kind of treasure it. From before Google was tracking my every step, before Twitter, I have some record of what I was doing at exceedingly specific times and dates, and it matters more as the years go by. It helps to structure my memories a bit more than I'd otherwise be able to. I scanned them all at once (in several pages), and it's sort of a map of my adolescence. I can jump across time. I would be sad to lose it. (Along with the photo of the tickets for my make-shift - and first - double feature of Everything Everywhere/Dr. Strange. Multiverse-themed, doncha know?)
These days whenever I read an interesting article, I will take 2 minutes to copy and paste it into my Obsidian vault under my Articles folder. I'll take care to paste the images as images (and not links) and make sure I've got the author and source URL at the top, and have my separate notes section link to it. It's a bit silly and obsessive, but given how transient content on the Internet is, I think it's necessary to make a copy of anything you care about.
I set it to tolerate longer processing times, and to open the file after saving so I can sanity check that it got everything. Works great at faithfully saving a page with images as it appears in browser, and saves so much time.
Also, I believe by default the files are saved as plain html (with resources being base64 encoded), so search tools which can index the contents of html files will work.
There is also the option to have the contents compressed, and (a separate option) to keep the plaintext of the file uncompressed, which will likewise still allow indexing to work while saving space.
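To illustrate what "resources being base64 encoded" means in practice, here is a minimal sketch of inlining an image into an HTML page as a data: URI. This is only my illustration of the general technique, not SingleFile's actual code; the function names and the sample bytes are made up.

```python
import base64


def data_uri(data: bytes, mime: str) -> str:
    """Encode raw bytes as a data: URI that can be used in place of a URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def inline_image_tag(data: bytes, mime: str, alt: str = "") -> str:
    """Build an <img> tag whose pixels live inside the HTML file itself."""
    return f'<img alt="{alt}" src="{data_uri(data, mime)}">'


# Hypothetical stand-in for real image bytes read from disk.
fake_png = b"\x89PNG not a real image"
page = "<p>Article text stays as plain, searchable HTML.</p>" + inline_image_tag(
    fake_png, "image/png", alt="figure"
)
print(page)
```

The article text remains ordinary text in the file, which is why indexing tools can still search it; only the binary resources get turned into long base64 strings.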
I built Obsidian Web Clipper to automate that process. It also allows you to save web pages as nicely formatted Markdown files with YAML properties even if you don't use Obsidian.
I do something similar but with Discord. I made a server accessible only by me, and I have a few different channels like work, life, music, ideas, etc. I also send all screenshots I take into a separate channel, and set up a chrome extension that sends whatever page I'm on as a link.
Unfortunately it's not super easy to get data out of Discord either. Last I checked, one needs to carefully set up a bot, then script the bot to download messages to CSV, etc., but if you're not careful with the account and bot setup, the export process itself could lead to a ban.
They can but generally that includes any Javascript on the same page which sometimes does funny stuff when you open it up offline or after the remote server goes away.
It's not perfect, but Edge will let one take a simple full page screenshot with Ctrl+Shift+S. It results in a hefty PNG but at least it's a visual copy of everything which might suffice for a certain set of purposes (e.g. links will be lost, so it's not good for that).
I can still right-click > Save any page as .html, but that doesn't guarantee server streamed stuff, media, images, etc. will be preserved correctly.
> I got a picture of my great grandfather, thing took six hours to take your picture. [...] Every guy had one picture back then. And it's just him like, "[grimacing] I gotta get back, feed them hogs!" Now, in the future of course it'll be different. 50 years from now, people will be going like, "Hey! You wanna see a hundred thousand pictures of my great grandfather? I got 'em right here plus everything he did every day of his life." --Norm Macdonald[1]
There is certainly a quantity of stuff online that is absolutely worth saving, but there's a considerably larger proportion that's just redundant to the point of being unremarkable and pointless. The trick is filtering, which can be capital-H Hard. That's why some may want to err on the side of over-collecting to reduce the possibility of missing something that will actually be important someday.
Yeah, this is a good point. Isn't it better we save too much, as tooling for filtering stuff out will always get better, rather than saving too little? The latter has no workaround (today at least).
I DVR the nightly news with NextPVR, more as a convenience in case I'm doing something when it's on, want to pause/rewind, want to watch it the next morning instead, etc.
Come 2020, I was convinced that the world was going to end. So I simply... turned off the retention rule. One hour of news is around 5GB, but that's a very-high-bitrate MPEG-2 stream with an extra audio channel in Spanish. So I instead wrote a cron job to take that week's news, drop the stuff I don't care about, and H.264 the entire set of them down to 4.7GB, then burn them to a DVD for offline storage, since there's not much value to keeping them online.
By 2022, it was obvious the world was not, in fact, ending, but I never stopped this practice because of how simple it was, and how unobtrusive to store they are. I just make sure a fresh DVD is in the NAS every week, and put the DVDs on a spindle - they collectively take up about as much room as a toaster. I could make that even smaller and simpler if I opted for a portable hard drive.
Occasionally I'll manually toss something interesting in, like the presidential debates, or special coverage of some newsworthy event.
In 20 years, when it comes time to re-burn the earliest of them, maybe I'll make a value judgment on whether that's worth it, but for now it feels like I'd be losing something for not much of a good reason.
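For anyone curious about the arithmetic behind fitting a week of news on one disc, here is a rough sketch. The inputs are my assumptions, not the poster's actual settings: seven one-hour recordings with nothing trimmed, 4.7 GB taken as 4.7 × 10^9 bytes, one 128 kbit/s audio track, and 2% set aside for container overhead.

```python
def video_bitrate_kbps(total_seconds, disc_bytes=4.7e9, audio_kbps=128, overhead=0.02):
    """Average video bitrate (kbit/s) that lets the recordings fill one disc."""
    usable_bits = disc_bytes * 8 * (1 - overhead)   # leave room for muxing overhead
    total_kbps = usable_bits / total_seconds / 1000  # budget for video + audio
    return total_kbps - audio_kbps                   # what is left for video


def source_bitrate_kbps(bytes_per_hour=5e9):
    """Bitrate of the original recording, from the '5GB per hour' figure."""
    return bytes_per_hour * 8 / 3600 / 1000


week_seconds = 7 * 3600  # assumed: seven full one-hour broadcasts
target = video_bitrate_kbps(week_seconds)
print(f"target video bitrate: {target:.0f} kbit/s")
print(f"source bitrate:       {source_bitrate_kbps():.0f} kbit/s")
```

Under those assumptions the budget comes out to roughly 1.3 Mbit/s of video, against about 11 Mbit/s for the source stream. Dropping segments first, as described above, shortens the total duration and raises the per-second budget.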
Any information created by humans is part of our "culture". You may consider it of no value, but someone else may beg to differ.
I went to a fantastic talk a few years ago at the British Library about digitizing a substantial quantity of historic Australian newspapers. It was amazing to be able to read funeral announcements, product advertisements and other signals from the past showing us Australian culture from the 1800s.
Since we leave much less behind in terms of physical assets (personal letters, postcards, personal diaries), we should at least aspire to archive more from the digital realm, or to future historians we'd look like a blank century.
Maths at the PhD level has been described to me by several people as research which can be understood by only a literal handful of people, a full five if you're extraordinarily lucky. Is that knowledge cultural?
I've put some thought into what it takes for a specific skill, art, craft, or technology to be considered "alive". This presumes not only current practitioners, but a new generation who will learn, practice, and pass on that knowledge. Possibly additionally the cultural infrastructure (schools, businesses, markets, etc.) which are necessary to support, sustain, and reward the practice.
One approach to this is the SingleFile browser plugin [1], configured to save pages to a GitHub repository - it saves the whole web page as a single HTML file in the repo. (Ok it's probably closer to archiving than bookmarking... but it's not too far off)
I was thinking the other day about the longevity of useless data. One idea that floated around in my head was self expiring emails.
I recently deleted about 40,000 emails. Most of them were identical, duplicate marketing emails. I was forced to do this to free up storage.
That's when I realized something. I am paying my email provider the full price for every byte of "represented" data. In reality, their distributed file systems could deduplicate an arbitrary number of copies of these emails and consume only the space that one email takes. So 100,000 duplicate emails on the server are consolidated into one representation of the data, but each customer has to pay for each byte that is represented.
The vendor stores a file once and charges full price every time they reproduce it for someone. If you have 10,000 copies of a file they only have to store it once, but you will pay for every byte in all 10,000 copies.
This is the Dropbox business model, especially when they encourage using their service to share files and it counts as space used in source and destination accounts.
There were some early blog posts by the single person running mailinator.
Since they only stored text, they would make a single db entry for each unique line of text that came in and just made more and more references to that.
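A toy version of that line-level deduplication might look like the sketch below. It is my own guess at the general shape, not Mailinator's actual design: the class name, the use of SHA-256 as the key, and the in-memory dicts are all assumptions for illustration.

```python
import hashlib


class LineStore:
    """Store each unique line of text once; a message is a list of line keys."""

    def __init__(self):
        self.lines = {}     # sha256 hex digest -> the line's text
        self.messages = {}  # message id -> ordered list of digests

    def put(self, msg_id, text):
        digests = []
        for line in text.splitlines():
            digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
            self.lines.setdefault(digest, line)  # only the first copy is kept
            digests.append(digest)
        self.messages[msg_id] = digests

    def get(self, msg_id):
        # Rebuilds the message; a trailing newline in the input is not preserved.
        return "\n".join(self.lines[d] for d in self.messages[msg_id])

    def stored_chars(self):
        return sum(len(line) for line in self.lines.values())


store = LineStore()
promo = "Big sale today\nUse code SAVE10\nUnsubscribe here"
for i in range(1000):
    store.put(i, promo)  # a thousand identical marketing emails

print(len(store.lines), "unique lines stored for", len(store.messages), "messages")
print(store.stored_chars(), "characters stored vs", 1000 * len(promo), "represented")
```

The per-message reference lists still cost something, so the saving is not free, but the text itself is held once no matter how many mailboxes received it.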
There are many things in life that have immense personal value and zero value to nearly everyone else. This creates a lot of misunderstanding and incentive misalignment.
Most likely it is not worth it. But people should not be doing only things that are “worth doing”. Then again, if something brought you joy but was a complete waste of time, it was worth it.
Hate the dementors who tell you otherwise. It is a limited lifetime, but it is yours. You should be helpful to others, but doing only “what is worth it” sucks the beauty out of existence.
Well, except future historians who may find value in "personal" information (although I guess we've got such a surfeit of recorded "personal" information these days compared to even just 50 years ago, it may not be quite as useful as when they find, e.g., some Babylonian tablet with a shopping list on. But you never know!)
I started collecting thousands of URLs (as resources for notes) about 16 years ago. In use, I'd estimate that link-rot has affected about 1 or 2 out of 4 in that time. Of those fails, I expect I'd recover about half by feeding the URL to Wayback and asking for their oldest save. That tends to be independent of age.
Most successes are from reliable sources.[1] When those sources fail, it's usually because the resource is still there (in identical or revised form) but has a modified URL. They're often recoverable by searching by 'site:www.abcd.efg ' and a string of terms from the original.
Each fail requires that I consider whether the time to recover is worth it. I also imagine how difficult it'd be to have saved each resource, and have to search for it. Ideally each resource would have an ID and permanent online home; unlikely.
[1] Each fail continues to teach me which specific resources are most reliable over the long term. Short answer: References and WayBack.
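The "feed the URL to Wayback" step can be scripted. Below is a small sketch using the Wayback Machine availability endpoint at archive.org/wayback/available. As far as I know it returns the snapshot closest to the timestamp you pass, so asking for a very early date is my approximation of "oldest save", not a guarantee. The function names and the default date are my own choices.

```python
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp="19960101"):
    """Build the query URL; an early timestamp biases toward the oldest snapshot."""
    query = urllib.parse.urlencode({"url": page_url, "timestamp": timestamp})
    return f"{API}?{query}"


def closest_snapshot(response):
    """Pull the archived URL out of a decoded JSON response, or None if absent."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def lookup(page_url):
    """Ask the Wayback Machine for a snapshot of page_url (needs network access)."""
    with urllib.request.urlopen(availability_url(page_url), timeout=30) as resp:
        return closest_snapshot(json.load(resp))


print(availability_url("example.com"))
```

Calling `lookup()` on each dead link in a notes collection gives either an archived URL to substitute or None, which is the cue to fall back to the site: search described above.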
Yes. Sometimes when I'm doing research into the recent history of why certain technical decisions were made, and the arguments for or against, I find archive.org invaluable for piecing together a line of thought from back then. Recently, this was to look up what the debate between React's Functional components vs Signals was.
Also, it's helpful to get perspective on the attitudes for or against a new technology in recent history. I remembered there were people that said "If you aren't writing a kernel, you don't have their problems, so you don't need git." Turns out that's not true. Now that git is everywhere, it's harder to remember whether or even if there was pushback against it.
I often reference it, and if it wasn't still up, I'd have only web archive to rely on.
So for me, lots of the stuff I look at online (mainly blog posts) is worth saving. Sometimes, if the discussion is on a Twitter thread, that too. Which makes me fear for the day Microsoft decides to do GitHub in, and we'd lose all the issues and comments.
I've recently come back to a PC game (B-17 the Mighty Eighth) from 2000 that, quite unexpectedly, is getting a remaster and potentially a port to VR. It had a thriving community for several years, with many mods and guides and knowledge contained in the single dominant forum (bombs-away.net). When it shuttered, the vast majority of that information was lost. Old workarounds for bugs in the engine and detailed instructions on exactly how certain mechanics work are unavailable. One popular YouTuber who continued playing through at least 2010 maintained a Dropbox that had most of the mods that were ever available, but not the forum posts explaining them. So, for example, there's a mod that survives there to let you replace a generic 'sign on the dotted line' handwriting with your own - but gone are the instructions on exactly how to apply it.
When I had returned to the game after bombs-away.net had gone defunct, I posted my own personal archive to the GoG forum for the game. Now that I've returned to the Redux version I find my own files, with my personal notes, shared by a single other soul who had similarly maintained an archive, and apparently had collected mine at some point. I'm very glad to have helped preserve knowledge - but not everything of mine was there. Now that I've noticed the 2024 remaster effort and joined that community, I've been able to share files that were otherwise apparently completely lost - in particular, a set of images showing dimensions of certain common features in bombing targets, that allow estimating the total size of the target.
Unfortunately, my own personal archive included many forum topics that I just dragged off shortcuts to. I can see the old titles of the pages from the surviving shortcut files. I remember the questions I had (and now have again) that those shortcuts held the answers to. But because I didn't save the page itself, it's.. gone. That's immensely frustrating.
Yes, things are worth saving. Especially for topics with extensive information among a small niche audience that have a single point of failure. I've found an extension (SingleFileZ) that does a good job of archiving a web page with all embedded content into what's a zip file under the hood - so futureproof even if the extension disappears and it becomes difficult to simply open the file directly in browser.
EDIT - montebicyclelo mentions SingleFile, which apparently is a continuation of SingleFileZ, with new features. SingleFileZ already allowed automatically saving every visited page in a tab (or even among all tabs), batch archiving of a list of URLs, etc., so presumably SingleFile has all these capabilities and more.
There is a wealth of live performances on YouTube that individuals have uploaded and that likely violate MPAA copyright crap.
IMO, this content is of high cultural value and I fear it won’t be long that the goog suffers us to watch “their” content without infecting it with ads.
I wish there was an easier way to self host this content with a way to organize and browse using tags.
5 years ago I was working on a semi-novel crowdfunding platform that relied on video presentations. First iteration we used the YouTube API because hosting our own video seemed daunting and that worked fine for a bit. Over time we started to run into limits/errors/interruptions/audits at inconvenient times until one weekend I was like “screw it, let’s find out what the problems of self transcoding and hosting are.” Spent some time learning to use ffmpeg and throwing the results on our static resource pile. Tagging was a fairly straightforward lift. Honestly worked better than I’d have anticipated and was much less hassle. I’m sure we would have hit the problems if we’d reached a critical scaling point buuut that didn’t happen inside our year or so of having clients.
I often find myself revisiting old posts and stories. As with any human artifacts, most things aren't worth revisiting or are only meaningful in the moment. If they're gone, few people miss them.
I'm a link hoarder myself (over 13k links on Pinboard: https://pinboard.in/u:pmigdal/). While I don't revisit most of them, some have proven invaluable for re-reading and sharing. I'm not sure about the typical half-life of internet content, but a lot disappears—whether because people stop paying for domains, official websites get reorganized (or their content removed), or other reasons.
This is where the Internet Archive steps in, doing the essential work of a digital librarian. I often share links from its Wayback Machine, which has been a link-saver more times than I can count.
From an age perspective (but the crowd here will not like that): I used to trust that I could always find something again, so I didn't need to save it. Now I can't anymore, but I don't care so much.
Personally, I like that the internet is ephemeral. It matches real life in that way. I would rather see the internet as a means of connecting people over large distances (across space, Mars, etc.); maintaining 20,000 copies of every irrelevant thing is just silly.
The problem is that not everything it has replaced was originally ephemeral.
In a way, the Internet is both too ephemeral (self-hosted blogs disappear, YouTube videos get taken down) and too persistent at the same time; I don't think that most Twitter posts of non-public figures would need to remain public forever by default, for example, and I don't think I need to mention various data breaches.
The Internet Archive somewhat mitigates the first issue, but it makes me pretty nervous that there's essentially just one organization doing what used to be much more distributed to various physical libraries.
For the second one, I hope we'll see better solutions (both technical and social) as the technology and our interactions with it mature.
> Personally, I like that the internet is ephemeral.
It is not. It is only ephemeral for us normal people. But for the companies which log our lives in order to then capitalize on them, the internet is not ephemeral. They have copies of videos, pages, podcasts, whatever can be found there.
Why would you want those companies to know more about yourself than you do?
>Why would you want those companies to know more about yourself than you do?
That's not a question of wants, companies will always know more about you than you, for the simple reason that even if you had all their data you have no means to extract any meaning from it. It requires immense organization and resources, increasingly so as the rate of data production increases.
For that reason the correct response isn't to engage in the same hoarding and privacy abuse of the companies, it's like bringing a knife to a tank fight, but to 1. make sure you don't produce that data to begin with through privacy protections and technical means and 2. create environments in which you have ownership of your data, instead of businesses.
How do you back up websites? I mean, it sounds trivial, but I kinda still haven't figured out what is the way. I sometimes think that I'd like some script to automatically make a copy of every webpage I ever link in my notes (it really happens quite often that a blog I linked some years ago is no more), and maybe even replace links to that mirror of my own, but all websites I've actually backed up by now are either "old-web" that are trivial to mirror, or basically required some custom grabber to be written by myself. If you just want to copy a webpage, often it either has some broken CSS&JS and missing images, because it was "too shallow", or otherwise it is too deep and has a ton of tiny unnecessary files that are honestly just quite painful to keep on your filesystem as it grows. Add to that Cloudflare, Captchas, ads (that I don't see when browsing with uBlock and ideally wouldn't want in my mirrored sites either), cookie warning splash-screens, all sorts of really simple (but still above wget's paygrade) anti-scraping measures, you get the idea.
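The "script that copies every webpage I link in my notes" idea has an easy half and a hard half. The easy half, finding the links and deciding where each mirrored copy should live, can be sketched as below. This is only a sketch under assumptions: the `mirror/` directory layout, the function names, and the URL regex are all made up for illustration, and the actual fetching (the part that runs into Cloudflare, Captchas, and broken CSS) is deliberately left out.

```python
import hashlib
import re
from urllib.parse import urlparse

# Deliberately simple URL pattern; real notes may need something stricter.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")
TRAILING_PUNCTUATION = ".,;:!?"


def extract_links(text):
    """Return the unique URLs found in the text, in order of first appearance."""
    seen = []
    for match in URL_RE.finditer(text):
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if url not in seen:
            seen.append(url)
    return seen


def mirror_path(url, root="mirror"):
    """Map a URL to a stable local file name (hypothetical layout)."""
    host = urlparse(url).netloc or "unknown-host"
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"{root}/{host}/{digest}.html"


def annotate_links(text, root="mirror"):
    """Append the local mirror path after every URL in the text."""
    def repl(match):
        raw = match.group(0)
        url = raw.rstrip(TRAILING_PUNCTUATION)
        trailing = raw[len(url):]
        return f"{url} (mirror: {mirror_path(url, root)}){trailing}"
    return URL_RE.sub(repl, text)
```

Hashing the URL keeps file names short and filesystem-safe; whatever tool does the saving (a browser extension, a custom grabber) would write its output to the path returned by `mirror_path`.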
For saving a webpage you have open, I use a browser extension called SingleFile, I've been using it for a while (IIRC I discovered it on HN's front page a few years ago), in my experience it "just works", works really well.
You click the "browser action" icon/button of the extension and it saves a single HTML file that looks exactly like the webpage you have open.
From its FAQ[1] on GitHub:
# What does SingleFile do?
SingleFile is a browser extension designed to help users save web pages as complete, self-contained files. The extension's primary function is to capture an entire web page, including its HTML, CSS, JavaScript, images, and other resources, and package them into a single HTML file.
# I am a web archivist, is it ok to use SingleFile to archive content?
No, SingleFile is not a tool used by professionals to archive content on the Web, especially in the academic field. Professionals prefer to rely on tools based on the WARC specification instead.
Yeah, pretty much all browsers on all OSes have print-to-PDF/save-to-PDF, I prefer saving an HTML file over saving a PDF file for 3 reasons:
1. SingleFile allows me to save an HTML file that looks exactly like the webpage I saved. I never used a save-to-PDF functionality in any browser that allowed me to save a PDF that looks exactly like the webpage I was saving/printing. I wish browsers would implement that. Somebody did that once: they patched chromium to save a web page as SVG[1], and AFAIK if you can save to SVG you can also save to PDF with not much modification to the code. Unfortunately the fork is not maintained anymore.
2. The HTML files that SingleFile creates are responsive (just like the webpage you had open), PDF is not responsive.
I like that because it makes it easier to read the webpage I saved on my phone later, with a PDF file you saved on your desktop, you have to pinch to zoom and pan while you read it on your phone.
3. HTML-files/Webpages are accessible to screen readers and my browser's extensions work on them, extensions don't work on PDF files (they _can_ work on HTML files opened from disk, if you allow/enable it in the extension's settings).
There are extensions like "Save Page WE" that will dump the current state of the DOM to an HTML file, including CSS and Images, but these are static and don't make the scripting work.
We have become so cloud-native (god forbid!). Just recently I realised that I can save an interesting page to my hard drive instead of saving its link. What a wonderful world has opened since! It's so liberating to live without all these bs tools.
When I save things, I try to make sure that it'll be immediately useful to me once I find it again.
I'll highlight, summarise and take notes of what I save. Or some combination of those. If I don't find anything new or directly applicable to my life, I'll let it pass by.
This approach isn't good for archival purposes, but I hesitate to save a lot of things that I'll never read again.
I'm going through my file cabinets right now. I'll keep a few things that catch my eye but I'll likely throw out most of it. The odd 25 year old computer magazine is probably interesting but not all of them collectively for the most part. And I'm certainly not going to index them in a way that they'd be useful to me.
That's more or less what I did over time. No harm in a few folders of clippings for old times sake but not really boxes and boxes of magazines that will probably never be looked at.
You can probably sell or donate those old magazines to a collector, or a kid interested in that stuff. At the very least drop them off at a thrift store instead of just dumping them.
Thrift stores don't want a ton of old paper. There are a lot of things that someone somewhere would probably like but I'm not going to track them down or get them there. Mostly it's not magazines anyway. It's a bunch of articles I ripped out over the years.
The one thing I have in my garage I know someone would want is a big pile of laserdiscs. But, again, a thrift shop (or my library) wouldn't want them and I live pretty far out from a major city. Probably will try Craigslist post-winter though as I'm trying to declutter.
Laserdiscs appear and gradually disappear at my local thrift, so someone must be buying them. Now in the vinyl records pile, there are copies of Mantovani, Jim Nabors, and Herb Alpert which have been there for years, but anything classic rock or newer sells the same day.
I was just thinking yesterday, wanting some Christmas music to get into the spirit while wrapping presents, remembering being a kid, when my mom would put on Jim Nabors' Christmas album.
Luckily there are (currently) multiple playlists of it on Youtube.
In the spring I'll probably list the whole pile on Craigslist, take it or leave it, at a nominal price and, if that doesn't work, just take it up to the local thrift and I'll at least have tried.
I have trouble letting go of things, and I found it interesting to read through.
There's a part of me that thinks "It would be so useful to 100% automatically log and cache everything I do and be able to search it". But I think maybe being healthy means not doing it.
One thing that is worth saving is the PDF manuals for physical products that you own.
These sometimes disappear from the Web. Or disappear except for some third-party site that modifies and/or paywalls them.
Also, save the occasional important support info Web pages for those products. You'll know it when you see it. And if you don't save it now, it might be gone when you need it.
You don't need a fancy system for this. I just made a directory `~/doc/`, and started dropping files into it. Someday, I'll take the time to merge this with `~/wiki/`, but for now, I'm capturing the information with low friction, which is most important.
And even when they don't disappear, they still end up dozens of weird pages deep that none of the on-site help text or search points to correctly due to the various pointless redesigns the site has gone through.
Sometimes you have strange obsessions or a strange mindset related to your technological habits. And you might easily think that it is only you who is weird and not thinking straight. If you are the only one doing something, you are probably wrong.
And then, hopefully, there are nice personal blog posts like this one, showing you that you are not alone having some peculiar habits and so that it might make sense even if most people don't even think about it.
I have the exact same feeling when I discover through HN, blog posts and events that I'm not the only one having my web browsers full of tabs. Literally having thousands of tabs.
I created a local-only web app to wrap up some of my favorite web haunts, with HN being one of them. It allows me to look at the headlines, and save any of them in a local SQLite db that the app maintains.
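A local SQLite store for saved headlines needs very little code. The sketch below is not the commenter's app; the table name `saved_items`, its columns, and the function names are assumptions chosen for illustration, using only Python's standard `sqlite3` module.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database; ':memory:' keeps it in RAM for testing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS saved_items (
               url      TEXT PRIMARY KEY,
               title    TEXT NOT NULL,
               source   TEXT NOT NULL,
               saved_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def save_item(conn, url, title, source):
    """Save a headline; return False if the URL was already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO saved_items (url, title, source) VALUES (?, ?, ?)",
        (url, title, source),
    )
    conn.commit()
    return cur.rowcount == 1


def search(conn, term):
    """Return (title, url) pairs whose title contains the term."""
    rows = conn.execute(
        "SELECT title, url FROM saved_items WHERE title LIKE ? ORDER BY title",
        (f"%{term}%",),
    )
    return rows.fetchall()
```

Passing a file path such as `saved.db` (a hypothetical name) to `open_store` instead of the default makes the store persistent. Using the URL as the primary key means saving the same headline twice is a harmless no-op.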
This feels like confirmation bias to me. The author seemed to genuinely consider the question, but didn't think critically about how little value he got from two decades of bookmarking and instead focused on how he could use this archive in the future.
This resonates so strongly with me.
I worked a job where I needed to use outdated Microsoft toolchains to build plugins for software, and the documentation was just -- gone. Good luck. I've been almost compulsively saving the things that feel important to me, while seldom browsing them for years -- all the while hunting for a faster and more intuitive recall system that lets me find them later.
My ex, however, had a much more fluid relationship with the internet and media in general. They liked new things, and didn't particularly care if they enjoyed something and it faded into obscurity. I feel like that's the winning mentality, but I just can't bring myself to embrace it.
Stuff online is absolutely worth saving. It is a window into the past - what people concerned themselves with, what they loved and hated.
Scholars will write papers on this era, speculating what it was like and how it fit into what came after.
The web documents the massive societal changes underway which do not relate to the internet directly. Things like changes in transportation technology, medicine, sexuality and gender, and how your average people felt about all of it. Scholars will data mine those opinions to understand who felt what ways and why, with the benefit of hindsight. New knowledge will come of it.
I’m the opposite of most of the “archivists” on HN. I delete everything and save nothing. I have maybe 25 sheets of paper in my apartment, including social security card and birth certificate.
Saving stuff just isn’t fun or useful for me. Never for more than a passing moment have I thought, “Boy I wish I had saved that whatever.”
Old people are the worst about this stuff. They think/hope somebody will want it and then just make it the next generation’s problem.
I told my dad if he thinks it has value, give it away while he’s alive. I have neither the interest nor the space to deal with it so it’s going straight into the trash.
I mean one man's trash is another man's treasure. We can't save it all but if we save even 10% of it, that would be great for future AIs to learn from our mistakes and what humans once did on a day to day basis before they were replaced.
I suppose it comes down to what the purpose of such archiving is.
I think it's the preservation of information, but I also believe 90% is absolutely pointless. There is just so much of it, and data storage so cheap, that it makes sense to just save everything.
That data storage is also ephemeral. None of it will last as long as a paper note, unless some human goes to the trouble of copying it all onto new drives with new software every ten years or so.
With a proper NAS and RAID10 for double parity, it's a bit like the Ship of Theseus. Just keep swapping out drives when they become unhealthy and you never have to rebuild or migrate.
Eventually the controller will die and eventually compatible ones will no longer be produced or will at least be inconvenient to obtain or commission and hence expensive.
Paper lasts for centuries without any attention beyond keeping it moderately dry and away from things that eat it.
Whether you're using hardware RAID or not you still need a hardware storage controller of some type which accepts the new disks you can buy and works with the NAS. What they are saying is eventually that'll be more $ and time than just migrating off the system would be. From ENIAC to now could fit in one lifespan, would you still be maintaining a home floppy drive backup system in the 2040s or just save the time and effort with a migration?
Well... storage is cheap, but not cheap enough to save everything, with just usenet being in the 400TB/day range these days. Sure, it's cheap enough to save every webpage you visit during your life, but probably not cheap enough to save every video you click on youtube or watch on a streaming-service, and all the music you listen to all day.
Though just the music compressed in opus at 128kbit might work ok, 60 years of 24/7 128kbit is 30TB, so that would fit on 1 large HDD currently.
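The 30TB figure checks out. A quick back-of-the-envelope calculation, assuming decimal units (1 kbit = 1000 bits, 1 TB = 10^12 bytes) and 365.25 days per year:

```python
def bytes_for_stream(kbit_per_s, years, days_per_year=365.25):
    """Total bytes produced by a constant-bitrate stream running non-stop."""
    seconds = years * days_per_year * 24 * 60 * 60
    bytes_per_second = kbit_per_s * 1000 / 8
    return bytes_per_second * seconds


total_bytes = bytes_for_stream(128, 60)
terabytes = total_bytes / 1e12
print(f"{terabytes:.1f} TB")  # about 30.3 TB
```

That is 16,000 bytes per second over roughly 1.89 billion seconds.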
Music is actually an ideal candidate. I don't listen to music all day, and when I do listen to it, it's often something I've listened to before. My current collection is about 200GB and that includes a ton of stuff I've never listened to; it seems reasonable that a full life's worth of music could fit in 1TB, easily.
Data rots though, you can't just save it once and be done with it. You have to migrate it across storage mediums, formats etc. It's a recurrent effort/cost.
If it is 1860 and you want to see that your data is preserved for 164 years, then you start by keeping it in geographically diverse places where people will look after it and tend its needs for 164 years.
As a concept, that's really not different at all from holding on to today's data here in 2024.
> I have 1 terabyte of data in 1860, how do I make sure the storage medium is still intact in 2024?
Storage keeps growing and the price of storage keeps going down.
My DOS and even some C64 source code made it to this day on backups (DVDs, HDDs, SSDs, USB memory sticks, etc., both online and offline) and to ZFS pools. Media that didn't exist in the 80s/early 90s.
Some people are going to complain about the naming but I have all my emails, except for six months' worth, going back to when I started using the Internet. And I still have nearly all of my data since I started using computers. 8-bit computers.
Do you?
I don't care about naming much. "search, don't sort".
We've got emulators for just about every and any system. My vintage arcade cab has both real PCBs and a Pi running an emulator with thousands of arcade games on it.
You can already, today, emulate, say, the Raspberry Pi model you want using QEMU. There are container files that'll gladly do that for you.
Unless civilization ends there's simply not a world in which, say, PNG, JPG and x265 files aren't readable. This just won't happen.
FWIW I'm paranoid about the integrity of my data: I've got my own naming scheme where a cryptographic hash is added to many of my files.
For example:
DSC_91394-b3-ae4f2877d3.jpg
This means "This file's Blake3 checksum begins with ae4f2877d3".
I then have a script doing statistical sampling: I enter a percentage and that percentage of files where a cryptographic hash is part of the filename are checked, randomly (if I enter 100 then 100% of the files are tested).
If I enter for example '7', then 7% of the files are tested and then there's high probability all checksums are correct.
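The sampling check described above can be sketched roughly as follows. This is not the commenter's script. BLAKE3 is not in Python's standard library, so the sketch substitutes `hashlib.blake2b` and uses a `b2` tag in the file name instead of `b3`; the function names and the `-b2-<hex>` pattern are assumptions for illustration.

```python
import hashlib
import random
import re
from pathlib import Path


def embedded_prefix(filename, tag="b2"):
    """Return the hex prefix embedded as '-<tag>-<hex>.<ext>', or None."""
    match = re.search(rf"-{re.escape(tag)}-([0-9a-f]+)\.[^.]+$", filename)
    return match.group(1) if match else None


def digest_hex(path, new_hasher=hashlib.blake2b):
    """Hash a file in 1 MiB chunks (BLAKE2b as a stand-in for BLAKE3)."""
    hasher = new_hasher()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_sample(root, percent, tag="b2", rng=None):
    """Check a random `percent` of tagged files; return those that fail."""
    rng = rng or random.Random()
    tagged = sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and embedded_prefix(p.name, tag)
    )
    if not tagged or percent <= 0:
        return []
    count = min(len(tagged), max(1, round(len(tagged) * percent / 100)))
    failures = []
    for path in rng.sample(tagged, count):
        if not digest_hex(path).startswith(embedded_prefix(path.name, tag)):
            failures.append(path)
    return failures
```

One caveat on the statistics: a sample is good at detecting widespread damage, but any single corrupted file only has a chance equal to the sampling fraction of being picked in a given run, so a clean 7% run says little about isolated bit rot. Repeated runs, or an occasional 100% pass, cover that case.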
In general, I am pro-turnover where there is rivalry: ceteris paribus keep the newer thing. However, information is so cheap as to be effectively non-rivalrous so I am considering running my own archival and to keep kagi's small sites etc. alive. Unfortunately, there is not a good tool for this that matches whatever Archive.org has. ArchiveHub needs routine management to keep the feed up and viewing it is not that easy. I'm sure we'll come up with stuff.
The other thing is that searching for the long tail is near impossible. The big sites dominate Google, so I need something like marginalia to actually get to the old stuff that it used to be so easy to find. Because of the median user having simple queries, some questions are no longer answerable on Google: they are dominated by the median user and never show up.
This does still happen. Microsoft may nuke a git repo and someone has to figure out who has the latest version of the entire repo with all the latest commits of every branch.
They don't allow uploading large binary blobs either, though, and steganographically storing gigabytes of data with probably terabytes of overhead sounds like a quick way to get banned.
There was a story making rounds a few years back that Wikipedia was being used as a file / media sharing platform, particularly in countries with limited other options for such services.
I can't find a link currently, though I'm pretty sure I'm not hallucinating this.
This supposes a broader question: what is the purpose of recording information generally?
The alternatives are observing the here and now (boys, obIsland), conversing, or musing / meditating / cogitating on topics.
I'm not hostile to recorded information of many forms, and am in fact something of a packrat myself. But as I acquire archives and observe the increasing enshittification of the Internet as a whole, the value of investing time in searching and saving information only found online and created for online consumption seems ... increasingly questionable.
As I've mentioned in numerous earlier HN comments,[1] my preferred information forms are increasingly more traditionally-published books and articles, highly prepared discussions, or conversations amongst experts within a field (or with a strongly discriminating host and an expert). It's worth noting that the origin of much (though of course not all) online content is in fact discussions based around such works: author interviews or excerpts from books, discussion of articles, etc.
Another principle, and one that seems worth considering in the context of online content, is that false leads and faulty assumptions are absolutely toxic to learning or training in a domain. This turns up in AI contexts, but is also evident to, say, a survey of the history of philosophy, particularly the roughly 2,000 year period in Western philosophy of the dominance of Christian theological philosophy, much of which was based on utterly misguided premises. It wasn't until such assumptions were dropped that scientific and technological understanding really began advancing.[2] One wonders to what extent our present accumulated document[3] trove might actually weigh down future progress.
Or succinctly: What's the balance between retention and study, on the one hand, and observation and reason on the other? Tradition vs. empiricism.
2. I'm not writing this as an utter dismissal of all theological thought, nor am I asserting that what I'm terming "misguided" was utterly useless. There are useful ideas which emerge, there are religiously-aligned philosophers who arrived at keen insights, the process of working through arguments, even if founded on utterly counterfactual premises, can lead to useful developments in reasoning and logic, etc., etc. That's even without getting to "it can always serve as a bad example". But in this and many other areas it becomes clear that relying on early authorities (in the classical sense of that term) heavily retarded intellectual advances.
3. In the Otletian sense of any type of recorded work: books, articles, images, plastic sculpture, video, etc.
The rise of LLMs has really devalued saving stuff online. What is the point of saving an article if I could just ask ChatGPT to create it and it would probably do a pretty good job? It’s still worth keeping notes and stuff that may be hard to find but the majority of things online can easily be reproduced and are not worth saving.
I just asked ChatGPT where I could find info about a specific tabletop game that was heavily discussed online some years back and had multiple fan sites with house rules, forum, mailing list, etc.
It couldn't help me except to reference a few sites that no longer exist, one that's HTTP only so now causes browser warnings and was mostly only links to more sites that also no longer exist, and an old general gaming forum that doesn't even have a search function.
It didn't mention the mailing list archive site which has 21 years of discussion, indexed and searchable, still available. Or the still-active site where some fans archived all that they could from the old sites some years back, along with emulators, binaries, and instructions to get some of the old fan-made software running again on modern systems.
Neither of those sites are new since ChatGPT was trained, they have been around much longer than ChatGPT. But it knows nothing about them or their content.
I then asked it what it could tell me about the topic of a blog that ran for 10 years, with 314 posts, most of which had around 100+ comments, and the site it most often linked to. ChatGPT's answer was simply: "It seems like I can’t do more browsing right now. Please try again later."
So no, you can't "just ask ChatGPT." Contrary to popular belief, it doesn't know much about what is or was on the internet, even at the time it was trained, nor about many topics.
Given the way the web has developed over time, it seems quite likely to have huge gaps on anything related to the small web, any niche hobbies or interests, etc. All that non-commercial stuff that you can't easily find in modern search engines, ChatGPT doesn't know about it either.
Those big chunks of content that wink out of existence whenever a hosting company goes under or someone just stops paying the bill for a site that they used to love but haven't actively maintained in awhile? It doesn't know any of that either.
Entire online communities rose, developed, created a great deal of stuff, then slowly atrophied, and eventually disappeared. ChatGPT knows nothing of them.
I think you are right. But I think the answer goes deeper: we have encouraged a culture where the most supported information is also the most superficial. The essence of individual experience itself has long been discouraged on the web in favour of SEO and the trashy news and the trivial.
So the fact that ChatGPT can replace much of the web actually says less about the marvel of ChatGPT and more about the lack of anything really worthwhile because the profound just happens to be the least economically valuable.
One hundred twenty-three years ago my great grandmother's first husband died in a hotel in Kansas City from asphyxiation from the gas having been left on over night (the hotel did not yet have electric lighting). A letter was hastily written on a piece of hotel stationary to be delivered to his wife in the neighboring farming community where she lived.
It is fortunate to me that someone thought to hang on to that note since I have become interested in genealogy and this was a fairly significant event in family history (had he not died I don't suppose I would be around since it was her second marriage that gave me my grandfather).
I long for scraps of anything that my dead relatives, wrote, created, etc. It connects me better to the past — the lives they lived, how they lived them. It somehow grounds me a little better ... well, it's rather hard to explain the draw of genealogy.
Sadly very little of the ephemera of everyday life was kept. I get it. It might have seemed like hanging on to junk mail — like you were a hoarder or whatever, but in this digital era we should be able to hold terabytes of what may appear to be ephemera.
I'm doing what I can – not for ego, I think, but for future generations that may find a connection to their past interesting.
30 years ago there was no digital world. Nearly all information was in physical artifacts. The things worth saving haven't really changed, but the amount of noise they are buried in has. Imagine if that letter was kept in a two ton pile of ad fliers. Sure, someone would find some of those fliers interesting, but you'd have been much less likely to even know about the letter.
Well, I remember a lot of great stuff on Usenet circa 1994, but it looks like Google shut down access to it via Google Groups, which used to archive it in a searchable way.
There was a ton of great stuff 30 years ago, and I think it's definitely worth saving.
The Internet was a very different place, but it was quite real 30 years ago, and I think the idea that the further back you go the more valuable this kind of thing is is the right way of looking at it.
This. Usenet. Then IRC. I think back to full on conversations I had with friends on IRC that I never kept. Then people stopped using IRC. All gone.
Also slashdot. Go back 20 years and start clicking links. Most of them are broken.
>but it looks like Google shut down access to it via Google Groups
Huh? It still looks accessible to me.
https://groups.google.com/g/sci.med.radiology.interventional
Disclosure: I work at Google, but not on Groups.
An aside about ad spam from companies that I occasionally buy from:
Often ad spam comes from the same mailbox as order receipts and includes words like “order”, while messages with receipts never include the word “receipt”. When you are inundated daily, or sometimes several times a day, with ad spam from the same company, it becomes very difficult to filter out everything except the receipts when cleaning up a neglected inbox.
After I’m gone, I fully expect my family just to delete it all because the signal to noise is so low.
Sorting through twenty years of spammy email is one of those things that seem like an llm would actually be good for.
Some might say, that years of spammy emails drove the creation of the llms we know today. It's easy to forget how fast some things have moved: https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering
I don't have anyone to do anything after I'm gone, so I just delete the emails myself. I do keep the notable ones, like registration information and some payment receipts but otherwise everything goes to the trash.
Bonus points:
I don't need a 30/50/100GB mailbox (and the associated mailbox cost nowadays).
Search is not only fast, but if I didn't find something, then it simply isn't in the mailbox.
It's mentally pleasurable to log in once in a while and throw a bunch of unneeded stuff into the trash bin, quite similar to a real life room cleaning.
Fortunately Gmail tabs go a lot of the way to letting you mass delete junk you don’t care about. Assuming you do even a modicum of labeling stuff you might like to refer to or act on, deleting at least older promotions and updates eliminates a lot of things.
Didn't use GMail for years but the labels were not quite up to the task.
Thankfully FastMail interface makes 'search from this address' and 'search to this address' (I'm using per-service addresses) and then 'select all', 'delete' actions a breeze.
>...a two ton pile of ad fliers
Alamy is selling scans of ad prints from the 1850s.
https://www.alamy.com/stock-photo/1850s-advert.html
A selection of 74 items over a 10 year period is a different proposal compared to e.g. keeping two tons of ad fliers from November 17th 1907 (and every other thing, every other day, all the time).
Ads range from a (necessary, in a capitalist society) nuisance to a scourge, and people justly put up increasingly thick boundaries to shield themselves from their influence. When waning cultural relevance or whatever dilutes that influence, you can more easily see the ads for what they are— often manipulative marketing tactics implemented through often genuinely beautiful art and design. Both aspects are fascinating to consider and the art can be quite enjoyable. Early modernist posters from Paris are beautiful. Watching collections of mid century television ads in the prelinger archives is fun, and tells us a lot about the ways we are influenced by modern ads speaking to current perspectives, fashions, and concerns.
Capitalism would work 100% fine without ads because people naturally compare and contrast options when buying stuff.
All that's necessary is making it possible for people seeking out your type of product to find you. And for revolutionary products, there's word of mouth.
If anything I think capitalism would function better without ads, because I would argue that advertising overall results in less informed customers, especially the modern lifestyle/brand type of advertising that's clearly quite effective at manipulating people.
It’s an interesting question I guess (and slightly worrying that I can more easily imagine the end of the world than the end of advertising). Especially if we take it to the extreme and imagine sponsored listings also don’t exist. I guess incumbents would have a big advantage.
There are second order effects of ads that we’d need to consider. Facebook and Google wouldn’t exist as we know them. Maybe that means some of their research doesn’t happen?
If incumbents would be favored then it stands to reason that total advertising spend would be at least loosely proportional to market decentralization. Yet in America advertising spend has increased many orders of magnitude since the 50s, while the market has simultaneously become dramatically more centralized.
By contrast in parts of the world with relatively negligible advertising, markets tend to be heavily decentralized.
And I think this makes far more sense if you think about it. If you make a soft drink that is rated far higher than Coke in blind testing, or perhaps one that is nearly indistinguishable in flavor but cheaper, you stand very little chance of competing successfully. There's a reason Coke spends billions per year on advertising, and it has nothing to do with reaching the three remaining people who are not aware Coke exists.
And yeah, without advertising the "free" services on the internet wouldn't exist; they'd be replaced by a mixture of genuinely free and for-pay services. This would IMO be a dramatically better state of affairs. Businesses whose actual customers are not the people using their product/service lead to such dystopic nonsense.
There's a similar question that crosses my mind occasionally: 'how would capitalism work if there were no brand names and no advertising, but only product reviews?'
Would it even be possible to safeguard the product reviews system from bribery? The current systems we use for product reviews obviously would be unsuitable.
At any rate, I commiserate about the role of branding and advertising today. It's as often noise as it is information.
If there were no ads, how would people know that products existed? Would they just see the products on store shelves? What about services? Would labels be ads? Would how stores merchandise things be advertisements? Could businesses negotiate for specific product placement? How would you find out about stores? Would store signs be ads? How about really big ones? How about at the edge of their property along a road highway? Could the sign say what the store sold? If you were to start a product guide to help people find what they need, how could you possibly afford to buy enough products to be useful and up-to-date enough while slow crawl word of mouth got the business off the ground? Would asking people to tell their friends be an ad? If not, could you pay someone to spread the word about your product? Would traveling sales reps be ads? What if they wore head to toe logo gear? Could you just pay people to do that without selling things? Ads suck but I don’t see how a capitalist society could survive without them.
I think the definition would have to be an exchange of something of value for telling other people about a product. There are some companies that got off the ground with no paid advertising but I think they’re an exception. Generally people are not seeking out new products.
But the whole point of a capitalist society is that competitors that do things better/cheaper start taking customers so the capital moves to the best and most efficient system.
Because they are rare
I don’t think that’s true? Tons of stuff from that era had been digitized, even before newer more relevant stuff and older rarer stuff, because the acid paper had a short shelf life and there were so many ads in printed stuff then. I might have a skewed perspective from working in the digitization world for quite some time. I think they’re selling what they sell with all their other content— discovery, curation, preparation, and easy delivery.
It’s not like you currently go to a webpage and save all the images onto deep storage for archival… I’m not sure what relevance things being digital has on identifying noise.
If the ancestor before you is hoarding anything that comes across their path, be it digital ads or every physical greeting card they’ve ever gotten, the problem is with the person’s collection habits, not the medium.
What about robots reading each flier and checking if something is odd about that particular one? It could find the letter and report it to you. Even easier if it was all digital information.
If only we had search algorithms...
A two-ton pile of ad fliers? Sounds like Ted Nelson's Junk Mail collection, https://archive.org/details/tednelsonjunkmail .
> ...well, it's rather hard to explain the draw of genealogy.
I've noticed people becoming more interested in genealogy when they - let me phrase this delicately - reach a certain age. My speculation is that it is a component of grappling with one's own mortality. As the grays and wrinkles multiply, some obsess over healthy eating and exercise, some wealthier ones invest in immortality research, some get blood boys, and the rest feel an urgent need to research our genealogy; any detritus that shows our progenitors existed proves some trace of us having been here will remain, and perhaps our existence means something, as time cruelly keeps marching on.
People's interests change over time. It's not necessarily because folks are grappling with their own mortality. For instance, lots of older folks seem to get into bird watching.
I also want to point out that saying "let me phrase this delicately" to the person who is the subject of the sentence is not tactful. It's honestly kind of rude when you're on the receiving end. If you're going to judge me to my face, just say the words.
I respectfully disagree that it's rude. I will also point out that you had no idea what my age is when you assumed I was judging you.
A lot of young people who don't know their ancestors are interested in genealogy. (Adoption, immigration, war refugee). People who have so much genealogy already built into their life via a large family don't need to be consciously into genealogy, because they're already immersed.
This reminds me of a recent flea market experience. There at some stand was boxes of old used post cards and 100 year old family photos. Photos of people posed on a porch in their Sunday best. Or just mundanely standing around a car not everyone looking at the camera.
It's hard to assign a value to these things. They are simultaneously junk and treasure. I think about the journey these items took to find their way to that flea market table. It was too diverse a collection to have come from one place. So I imagine all the paths each individual item traversed. The joy of the recipient reading a post card, holding on to it, rediscovering it on spring cleaning days. Or the photo living in an album or framed on a wall somewhere for a lifetime.
I'm not sure what the value of it all is if it just gets lugged around to various flea markets and sold piecemeal for $1 each.
> There at some stand was boxes of old used post cards and 100 year old family photos. Photos of people posed on a porch in their Sunday best. Or just mundanely standing around a car not everyone looking at the camera.
> I'm not sure what the value of it all is if it just gets lugged around to various flea markets and sold piecemeal for $1 each.
I purchase, scan, and resell those kinds of things. I'd love to have a centralized, public repository in which to store the data. As our tech gets better at extracting data from that material more and more interesting applications could be developed. Imagine being able to find 100+ year old photos of your ancestors via facial recognition and extracted metadata searches.
I wish I could come up with a non-profit business model that worked for preserving that kind of stuff. I would love to gather up the historical ephemera that's being lost, catalog it via manual and automated processes, and make it available to the public. (Yes, I am aware there are privacy concerns. It's a pie-in-the-sky idea. I just hate to see all of the previously captured and curated effort that went into ephemera cast to the winds.)
I've been thinking about the same topic recently, with a specific focus on antiquarian books.
There is a local business I think of as a "Book Butcher". They buy an old book with beautiful engravings, cut out the pages and resell the individual pages as interior design pictures for hanging. Imagine if we could get them to scan & archive each book before pulling it apart...
Another idea is simply archiving the graphics from eBay listings. Sometimes there is valuable information in the pictures that accompany listing, but they disappear forever once the item sells.
I'd be glad to connect with anyone who's interested in this stuff.
At the same flea market, there was a stand selling pages ripped from a book (I think it was a dictionary) and fed through a printer to add a picture to the page. I found the concept interesting, but the pictures they used were mostly bad. There were a few good ones, but most were pixelated Star Wars clip art kind of stuff or Jack Daniels logo.
Genealogy has a way of grounding us, doesn’t it?
Regarding genealogy, it is worth looking at the work The Church of Jesus Christ of Latter-day Saints has been doing to help genealogical researchers around the globe [1], well beyond that specific church.
[1] https://newsroom.churchofjesuschrist.org/topic/genealogy
Now it's easier to save stuff, but there's more stuff to save. YouTubes and TikToks instead of text notes. Chat messages instead of letters.
Sure, there are a ton of reasons to archive. And if it's cheap to do (in terms of money, yes, but also in terms of time, effort, mental health, etc.) then I am of the mind that we should archive everything.
But, it often isn't cheap to do, and in that case, it makes sense to prioritize. The high priority items for me are the things that I might want to share, the ideas I want to amplify for my contemporaries and future generations that might examine my life. Stuff like [1] [2] and [3] which has influenced my thinking fundamentally, that I hope to build upon so that others can build upon what I have built.
I'd argue that you do this intuitively: you're mentioning a letter from your family's past because it is a high priority item--it's relevant because it was the last written words of your great-grandmother's first husband.
But, there's a lot that isn't worth keeping. My first form of archiving as a teenager was keeping ticket stubs for movies and concerts--a decade later I was going through my pile and found that I didn't even remember most of them. The better movies, I remembered--and I had them on DVD. The better concerts, I remembered--and I also had journal entries and CDs to remember the experience and the music. It's not important to me where/when I saw Everything, Everywhere, All At Once in theaters, but I have it on DVD and I can't wait to show it to my niece when she's older. And sure, I saw Amigo the Devil live, but frankly, he's not an artist you need to see in concert--the greatest impact of Cocaine and Abel[4] on me was when I listened to it alone in my room. The ticket stubs simply don't matter to me.
[1] https://www.viridiandesign.org/notes/451-500/the_last_viridi...
[2] https://www.ted.com/talks/brene_brown_the_power_of_vulnerabi...
[3] https://digital.wpi.edu/pdfviewer/wm117p10z
[4] https://www.youtube.com/watch?v=ZzjtLm0G49E
EDIT: All the things linked above, I have backed up in one form or another. Notably, the Schutt paper isn't at its original URL.
It's funny you mention ticket stubs, because I also have a similar collection, and I kind of treasure it. Before my Google tracking my every step, before Twitter, as the years go by, I have some record of what I was doing at exceedingly specific times and dates. It helps to structure my memories a bit more than I'd otherwise be able to. I scanned them all at once (in several pages), and it's sort of a map of my adolescence. I can jump across time. I would be sad to lose it. (Along with the photo of the tickets for my make-shift - and first - double feature of Everything Everywhere/Dr. Strange. Multiverse-themed, doncha know?)
These days whenever I read an interesting article, I will take 2 minutes to copy and paste it into my Obsidian vault under my Articles folder. I'll take care to paste the images as images (and not links) and make sure I've got the author and source URL at the top, and have my separate notes section link to it. It's a bit silly and obsessive, but given how transient content on the Internet is, I think it's necessary to make a copy of anything you care about.
I use https://github.com/gildas-lormeau/SingleFile
I set it to tolerate longer processing times, and to open the file after saving so I can sanity check that it got everything. Works great at faithfully saving a page with images as it appears in browser, and saves so much time.
You might also have a look at https://github.com/ArchiveBox/ArchiveBox
Also, I believe by default the files are saved as plain html (with resources being base64 encoded), so search tools which can index the contents of html files will work.
There is also the option to have the contents compressed, and (a separate option) to keep the plaintext of the file uncompressed, which will likewise still allow indexing to work while saving space.
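To make the "resources being base64 encoded" part concrete, here is a minimal sketch of how an image can be inlined into an HTML file as a `data:` URI. This is not SingleFile's actual code; the helper names `to_data_uri` and `inline_image` and the sample page are made up for illustration.

```python
import base64
import mimetypes


def to_data_uri(data: bytes, filename: str) -> str:
    """Encode raw bytes as a data: URI, guessing the MIME type from the filename."""
    mime, _ = mimetypes.guess_type(filename)
    mime = mime or "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def inline_image(html: str, src: str, data: bytes) -> str:
    """Replace one src="..." reference with an inline data URI (hypothetical helper)."""
    return html.replace(f'src="{src}"', f'src="{to_data_uri(data, src)}"')


# Illustrative page and placeholder bytes, not a real image.
page = '<p>Hello archive</p><img src="logo.png">'
saved = inline_image(page, "logo.png", b"\x89PNG placeholder bytes")
```

The visible text ("Hello archive") stays as plain text in the saved file, which is why a tool that indexes HTML can still find it even though the image has become one long base64 string.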
I built Obsidian Web Clipper to automate that process. It also allows you to save web pages as nicely formatted Markdown files with YAML properties even if you don't use Obsidian.
https://github.com/obsidianmd/obsidian-clipper
Wow this is awesome, really love the AI features!
I noticed a web clipper was just released for Obsidian last month. Maybe that'd cut down those two minutes for you.
Yes! The Obsidian Web Clipper is pretty neat. I just published an article about it: https://www.dsebastien.net/supercharge-your-knowledge-captur...
I am using monolith to just save the whole page to disk.
https://github.com/Y2Z/monolith
I do something similar but with Discord. I made a server accessible only by me, and I have a few different channels like work, life, music, ideas, etc. I also send all screenshots I take into a separate channel, and set up a chrome extension that sends whatever page I'm on as a link.
What if discord goes away. I would think you want the data local.
terrible idea. people get their discord accounts banned randomly without warning
Unfortunately it's not super easy to get data out of Discord either. Last I checked, one needs to carefully set up a bot, then script the bot to download messages to CSV, etc., but if you're not careful with the account and bot setup, the export process itself could lead to a ban.
like recently they banned the entire country of germany by accident
How often do you reference your vault?
Agreed. I think you could automate some of that too, could save time if you do it often.
In my day browsers could save an archive of a page
Is this still the case?
They can but generally that includes any Javascript on the same page which sometimes does funny stuff when you open it up offline or after the remote server goes away.
SingleFile can make a snapshot with just content/styling
It's not perfect, but Edge will let one take a simple full page screenshot with Ctrl+Shift+S. It results in a hefty PNG but at least it's a visual copy of everything which might suffice for a certain set of purposes (e.g. links will be lost, so it's not good for that).
I can still right-click > Save any page as .html, but that doesn't guarantee server streamed stuff, media, images, etc. will be preserved correctly.
Thank you for this! I pressed Ctrl+Shift+S in Firefox just to see if it would work and it has the same functionality.
for the lazy, I think the web archive safari exports is standardised and gives you a good website backup.
> I got a picture of my great grandfather, thing took six hours to take your picture. [...] Every guy had one picture back then. And it's just him like, "[grimacing] I gotta get back, feed them hogs!" Now, in the future of course it'll be different. 50 years from now, people will be going like, "Hey! You wanna see a hundred thousand pictures of my great grandfather? I got 'em right here plus everything he did every day of his life." --Norm Macdonald[1]
There is certainly a quantity of stuff online that is absolutely worth saving, but there's a considerably larger proportion that's just redundant to the point of being unremarkable and pointless. The trick is filtering, which can be capital-H Hard. That's why some may want to err on the side of over-collecting to reduce the possibility of missing something that will actually be important someday.
[1]: https://www.youtube.com/watch?v=sY6SjMITHrQ
Yeah, this is a good point. Isn't it better we save too much, as tooling for filtering stuff out will always get better, rather than saving too little? The latter has no workaround (today at least).
Another funny take from Macfarlan
Definitely no smiling:
https://youtu.be/8SslNMLO0tw
I DVR the nightly news with NextPVR, more as a convenience in case I'm doing something when it's on, want to pause/rewind, want to watch it the next morning instead, etc.
Come 2020, I was convinced that the world was going to end. So I simply... turned off the retention rule. One hour of news is around 5GB, but that's a very-high-bitrate MPEG-2 stream with an extra audio channel in Spanish. So I instead wrote a cron job to take that week's news, drop the stuff I don't care about, and H.264 the entire set of them down to 4.7GB, then burn them to a DVD for offline storage, since there's not much value to keeping them online.
By 2022, it was obvious the world was not, in fact, ending, but I never stopped this practice because of how simple it was, and how unobtrusive to store they are. I just make sure a fresh DVD is in the NAS every week, and put the DVDs on a spindle - they collectively take up about as much room as a toaster. I could make that even smaller and simpler if I opted for a portable hard drive.
Occasionally I'll manually toss something interesting in, like the presidential debates, or special coverage of some newsworthy event.
In 20 years, when it comes time to re-burn the earliest of them, maybe I'll make a value judgment on whether that's worth it, but for now it feels like I'd be losing something for not much of a good reason.
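For anyone wanting to try something similar, the arithmetic for squeezing a week of recordings onto one disc can be sketched roughly as below. The numbers are assumptions, not the setup described above: seven one-hour broadcasts with nothing trimmed, 128 kbit/s audio, a 2% safety margin, and a 4.7 GB (decimal) disc. The file names are hypothetical, and the command list is only built, not run.

```python
def video_kbps(capacity_bytes, duration_s, audio_kbps=128, safety=0.98):
    """Video bitrate (kbit/s) that fits duration_s seconds into capacity_bytes."""
    usable_bits = capacity_bytes * 8 * safety    # leave a little headroom
    total_kbps = usable_bits / duration_s / 1000
    return total_kbps - audio_kbps               # what is left for the video stream


def ffmpeg_args(src, dst, v_kbps, audio_kbps=128):
    """Build an ffmpeg argument list for an H.264 encode at a target bitrate."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264", "-b:v", f"{int(v_kbps)}k",
        "-c:a", "aac", "-b:a", f"{audio_kbps}k",
        dst,
    ]


week_seconds = 7 * 3600                  # assumed: seven hours of footage
rate = video_kbps(4.7e9, week_seconds)   # assumed: 4.7e9-byte disc
command = ffmpeg_args("week.ts", "week.mp4", rate)  # hypothetical file names
```

Under those assumptions the video budget comes out somewhere between 1.3 and 1.4 Mbit/s; trimming segments you don't care about raises the budget for what remains.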
Reminded me of the story of Marion Marguerite Stokes who recorded TV news from 1977 to her passing in 2012.
https://blog.archive.org/2013/11/22/a-dream-to-preserve-tv-n...
Any information created by humans is part of our "culture". You may consider it of no value, but someone else may beg to differ.
I went to a fantastic talk a few years ago at the British Library about digitizing a substantial quantity of historic Australian newspapers. It was amazing to be able to read funeral announcements, product advertisements and other signals from the past showing us Australian culture from the 1800s.
Since we leave much less behind in terms of physical assets (personal letters, postcards, personal diaries), we should at least aspire to archive more from the digital realm, or to future historians we'd look like a blank century.
You raise an interesting question: when is recorded knowledge actually cultural?
What of the zettabytes of data which today are written but never read? (The old saw of a WORN drive: write once, read never, has never been more apt.)
What of knowledge that is the province of a single individual? A recipe, poem, memory, craft, even languages (which are being lost at the rate of several per year: <https://en.wikipedia.org/wiki/List_of_languages_by_time_of_e...>)?
Maths at the PhD level has been described to me by several people as research which can be understood by only a literal handful of people, a full five if you're extraordinarily lucky. Is that knowledge cultural?
I've put some thought into what it takes for a specific skill, art, craft, or technology to be considered "alive". This presumes not only current practitioners, but a new generation who will learn, practice, and pass on that knowledge. Possibly additionally the cultural infrastructure (schools, businesses, markets, etc.) which are necessary to support, sustain, and reward the practice.
What's the threshold of truly cultural knowledge?
One approach to this is the SingleFile browser plugin [1], configured to save pages to a GitHub repository - it saves the whole web page as a single HTML file in the repo. (Ok it's probably closer to archiving than bookmarking... but it's not too far off)
[1] https://github.com/gildas-lormeau/SingleFile
I was thinking the other day about the longevity of useless data. One idea that floated around in my head was self expiring emails.
I recently deleted about 40,000 emails. Most of them were identical, duplicate marketing emails. I was forced to do this to free up storage.
That's when I realized something. I am paying my email provider for the full price for every byte of "represented" data. In reality, their distributed file systems could compress an arbitrary number of copies of these emails and only consume the amount of space that one email consumes. So 100,000 duplicate emails on the server are consolidated into one representation of the data, but each customer has to pay for each byte that is represented.
The vendor stores a file once and charges full price every time they reproduce it for someone. If you have 10,000 copies of a file they only have to store it once but you will pay for every byte in all 10,000 copies.
This is the Dropbox business model, especially when they encourage using their service to share files and it counts as space used in source and destination accounts.
There were some early blog posts by the single person running mailinator.
Since they only stored text, they would make a single db entry for each unique line of text that came in and just made more and more references to that.
Even different emails… were mostly the same.
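The gap between stored bytes and billed bytes described in this subthread can be sketched with a tiny content-addressed store. This is an illustration only, not how any particular provider (or Mailinator) actually works; the `DedupStore` class and the sample message are invented.

```python
import hashlib


class DedupStore:
    """Keeps each distinct message once, keyed by its SHA-256 digest."""

    def __init__(self):
        self.blobs = {}      # digest -> message bytes, stored once
        self.mailboxes = {}  # user -> list of digests (references only)

    def deliver(self, user, message: bytes):
        digest = hashlib.sha256(message).hexdigest()
        self.blobs.setdefault(digest, message)  # no-op if already stored
        self.mailboxes.setdefault(user, []).append(digest)

    def physical_bytes(self):
        """Bytes the provider actually has to keep."""
        return sum(len(blob) for blob in self.blobs.values())

    def billed_bytes(self, user):
        """Bytes counted against one user's quota."""
        return sum(len(self.blobs[d]) for d in self.mailboxes.get(user, []))


store = DedupStore()
promo = b"Subject: 50% off everything\n\nBuy now!"
for user in ("alice", "bob", "carol"):
    store.deliver(user, promo)
```

After the loop, `store.physical_bytes()` equals the size of one copy, while each of the three users is billed for a full copy. Deduplicating per line of text, as in the Mailinator description, is the same idea with a finer-grained key.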
There are many things in life that have immense personal value and zero value to nearly everyone else. This creates a lot of misunderstanding and incentive misalignment.
Sounds about the same as what I was going to write.
Most likely it is not worth it. But people should not be doing only things that are “worth doing”. Then again, if something brought you joy but was a complete waste of time, it was worth it.
I hate the dementors who tell you otherwise. It is a limited lifetime, but it is yours. You should be helpful to others, but doing only “what is worth it” sucks the beauty out of existence.
> zero value to nearly everyone else
Well, except future historians who may find value in "personal" information (although I guess we've got such a surfeit of recorded "personal" information these days compared to even just 50 years ago, it may not be quite as useful as when they find, e.g., some Babylonian tablet with a shopping list on. But you never know!)
I started collecting thousands of URLs (as resources for notes) about 16 years ago. In use, I'd estimate that link-rot has affected about 1 or 2 out of four in that time. Out of those two fails, I expect I'd recover about half by feeding the URL to Wayback and asking for their oldest save. That tends to be independent of age.
Most successes are from reliable sources.[1] When those sources fail, it's usually because the resource is still there (in identical or revised form) but has a modified URL. They're often recoverable by searching by 'site:www.abcd.efg ' and a string of terms from the original.
Each fail requires that I consider whether the time to recover is worth it. I also imagine how difficult it'd be to have saved each resource, and have to search for it. Ideally each resource would have an ID and permanent online home; unlikely.
[1] Each fail continues to teach me which specific resources are most reliable over the long term. Short answer: References and WayBack.
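The "feed the URL to Wayback" step can be scripted. A minimal sketch, assuming the Internet Archive's public availability endpoint, which returns the capture closest to a given timestamp (so an early timestamp tends to surface an early capture, though not necessarily the very oldest). The function names are made up, and `sample` is a hand-written illustration of the response shape, not a real response.

```python
import json
from urllib.parse import urlencode

API = "https://archive.org/wayback/available"


def availability_query(url, timestamp="19960101"):
    """Build the lookup URL; an early timestamp asks for an early capture."""
    return f"{API}?{urlencode({'url': url, 'timestamp': timestamp})}"


def closest_snapshot(response_text):
    """Return the archived URL from an availability response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


query = availability_query("http://example.com/page")

# Hand-written sample of the response shape, for illustration only.
sample = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20020101000000/http://example.com/page",
            "timestamp": "20020101000000",
            "status": "200",
        }
    }
})
recovered = closest_snapshot(sample)
```

Fetching `query` with any HTTP client and passing the body to `closest_snapshot` would give a candidate replacement link, or `None` when nothing was archived.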
If you're interested in that sort of thing, come hang out with ArchiveTeam:
https://wiki.archiveteam.org/
Yes. Sometimes when I'm doing research into recent history of why certain technical decisions were made, and the arguments for or against, I find archive.org invaluable for piecing a line of thought back then. Recently, this was to look up what the debate between React's Functional components vs Signals was.
Also, it's helpful to get perspective on the attitudes for or against a new technology in recent history. I remembered there were people that said "If you aren't writing a kernel, you don't have their problems, so you don't need git." Turns out that's not true. Now that git is everywhere, it's harder to remember whether or even if there was pushback against it.
This was written about the insights from using git that he needed to highlight to people back then. https://keithp.com/blog/Repository_Formats_Matter/
I often reference it, and if it wasn't still up, I'd have only web archive to rely on.
So for me, lots of stuff I look at online (mainly blog posts) are worth saving. Sometimes, if the discussion is on a twitter thread, that too. Which makes me fear for the day Microsoft decides to do Github in, and we'd lose all the issues and comments.
I've recently come back to a PC game (B-17 the Mighty Eighth) from 2000 that, quite unexpectedly, is getting a remaster and potentially a port to VR. It had a thriving community for several years, with many mods and guides and knowledge contained in the single dominant forum (bombs-away.net). When it shuttered, the vast majority of that information was lost. Old workarounds for bugs in the engine and detailed instructions of exactly how certain mechanics work are unavailable. One popular youtuber who continued playing through at least 2010 maintained a dropbox that had most of the mods that were ever available, but not the forum posts explaining them. So, for example, there's a mod that survives there to let you replace a generic 'sign on the dotted line' handwriting with your own - but gone are the instructions of exactly how to apply it.
When I had returned to the game after bombs-away.net had gone defunct, I posted my own personal archive to the GoG forum for the game. Now that I've returned to the Redux version I find my own files, with my personal notes, shared by a single other soul who had similarly maintained an archive, and apparently had collected mine at some point. I'm very glad to have helped preserve knowledge - but not everything of mine was there. Now that I've noticed the 2024 remaster effort and joined that community, I've been able to share files that were otherwise apparently completely lost - in particular, a set of images showing dimensions of certain common features in bombing targets, that allow estimating the total size of the target.
Unfortunately, my own personal archive included many forum topics that I just dragged off shortcuts to. I can see the old titles of the pages from the surviving shortcut files. I remember the questions I had (and now have again) that those shortcuts held the answers to. But because I didn't save the page itself, it's.. gone. That's immensely frustrating.
Yes, things are worth saving. Especially for topics with extensive information among a small niche audience that have a single point of failure. I've found an extension (SingleFileZ) that does a good job of archiving a web page with all embedded content into what's a zip file under the hood - so futureproof even if the extension disappears and it becomes difficult to simply open the file directly in browser.
EDIT - montebicyclelo mentions SingleFile, which apparently is a continuation of SingleFileZ, with new features. SingleFileZ already allowed automatically saving every visited page in a tab (or even among all tabs), batch archiving of a list of urls, etc, so presumably SingleFile has all these capabilities and more.
There is a wealth of live performances on youtube that individuals have uploaded and that likely violate mpaa copyright crap.
IMO, this content is of high cultural value and I fear it won’t be long that the goog suffers us to watch “their” content without infecting it with ads.
I wish there was an easier way to self host this content with a way to organize and browse using tags.
5 years ago I was working on a semi-novel crowdfunding platform that relied on video presentations. First iteration we used the YouTube API because hosting our own video seemed daunting and that worked fine for a bit. Over time we started to run into limits/errors/interruptions/audits at inconvenient times until one weekend I was like “screw it, let’s find out what the problems of self transcoding and hosting are.” Spent some time learning to use ffmpeg and throwing the results on our static resource pile. Tagging was a fairly straightforward lift. Honestly worked better than I’d have anticipated and was much less hassle. I’m sure we would have hit the problems if we’d reached a critical scaling point buuut that didn’t happen inside our year or so of having clients.
I often find myself revisiting old posts and stories. As with any human artifacts, most things aren't worth revisiting or are only meaningful in the moment. If they're gone, few people miss them.
I'm a link hoarder myself (over 13k links on Pinboard: https://pinboard.in/u:pmigdal/). While I don't revisit most of them, some have proven invaluable for re-reading and sharing. I'm not sure about the typical half-life of internet content, but a lot disappears—whether because people stop paying for domains, official websites get reorganized (or their content removed), or other reasons.
This is where the Internet Archive steps in, doing the essential work of a digital librarian. I often share links from its Wayback Machine, which has been a link-saver more times than I can count.
From an age perspective (but the crowd here will not like that): before I trusted myself I could always find it back so I don't need to save it. Now I can't anymore, but I don't care so much.
Personally, I like that the internet is ephemeral. It matches real life in that way. I would rather see the internet as a means of connecting people over large distances (across space, Mars, etc), maintaining 20,000 copies of every irrelevant thing is just silly.
The problem is that not everything it has replaced was originally ephemeral.
In a way, the Internet is both too ephemeral (self-hosted blogs disappear, Youtube videos get taken down) and too persistent at the same time; I don't think that most Twitter posts of non-public figures would need to remain public forever by default, for example, and I don't think I need to mention various data breaches.
The Internet Archive somewhat mitigates the first issue, but it makes me pretty nervous that there's essentially just one organization doing what used to be much more distributed to various physical libraries.
For the second one, I hope we'll see better solutions (both technical and social) as the technology and our interactions with it mature.
> Personally, I like that the internet is ephemeral.
It is not. It is only for us normal people. But the companies which log our lives in order to then capitalize on it, for them the internet is not ephemeral. They have copies of videos, pages, podcasts, whatever it is what can be found there.
Why would you want those companies to know more about yourself than you do?
Archive.org or Google can cache more of the internet than I do while still having the majority of the content be ephemeral.
I'd also hazard to guess most people in this camp would want these companies to also not store these things the same as they don't want people to.
Maybe I don't, but the solution for that is to destroy what they have, not torture myself by making myself store and know more about myself.
I am not a company. As a human, my needs are different.
>Why would you want those companies to know more about yourself than you do?
That's not a question of wants, companies will always know more about you than you, for the simple reason that even if you had all their data you have no means to extract any meaning from it. It requires immense organization and resources, increasingly so as the rate of data production increases.
For that reason the correct response isn't to engage in the same hoarding and privacy abuse of the companies, it's like bringing a knife to a tank fight, but to 1. make sure you don't produce that data to begin with through privacy protections and technical means and 2. create environments in which you have ownership of your data, instead of businesses.
How are you determining what is and isn't relevant?
How do you back up websites? I mean, it sounds trivial, but I kinda still haven't figured out what is the way. I sometimes think that I'd like some script to automatically make a copy of every webpage I ever link in my notes (it really happens quite often that a blog I linked some years ago is no more), and maybe even replace links to that mirror of my own, but all websites I've actually backed up by now are either "old-web" that are trivial to mirror, or basically required some custom grabber to be written by myself. If you just want to copy a webpage, often it either has some broken CSS&JS, missing images, because it was "too shallow", or otherwise it is too deep and has a ton of tiny unnecessary files that are honestly just quite painful to keep on your filesystem as it grows. Add to that Cloudflare, Captchas, ads (that I don't see when browsing with ublock and ideally wouldn't want them in my mirrored sites as well), cookie warning splash-screens, all sorts of really simple (but still above wget's paygrade) anti-scraping measures, you get the idea.
Is there something that "just works"?
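The bookkeeping half of that wish (find every link in a note, decide where its local copy would live, and rewrite the note to point there) is the easy part and can be sketched as below. It deliberately does not fetch anything, so none of the Cloudflare/Captcha/JS problems are solved here. The regex, the `mirror/` directory and the hash-based file names are all assumptions for illustration.

```python
import hashlib
import re

# Rough URL matcher: stops at whitespace and common closing brackets.
URL_RE = re.compile(r"https?://[^\s)>\]]+")


def mirror_name(url):
    """Stable local file name for a URL (hypothetical naming scheme)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html"


def rewrite_links(note, mirror_dir="mirror"):
    """Return (rewritten note, list of original URLs that need archiving)."""
    urls = []

    def swap(match):
        raw = match.group(0)
        url = raw.rstrip(".,;:!?")   # keep sentence punctuation out of the URL
        tail = raw[len(url):]
        urls.append(url)
        return f"{mirror_dir}/{mirror_name(url)}{tail}"

    return URL_RE.sub(swap, note), urls


note = "See [post](https://example.com/a) and https://example.org/b."
rewritten, to_archive = rewrite_links(note)
```

`to_archive` is then the work list to hand to whatever actually saves pages (one of the tools mentioned in the replies, for instance), writing each result to the path `mirror_name` produced.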
For saving a webpage you have open, I use a browser extension called SingleFile, I've been using it for a while (IIRC I discovered it on HN's front page a few years ago), in my experience it "just works", works really well.
You click the "browser action" icon/button of the extension and it saves a single HTML file that looks exactly like the webpage you have open.
From its FAQ[1] on GitHub:
[1] https://github.com/gildas-lormeau/SingleFile/blob/master/faq...
> For saving a webpage you have open
There's also print-to-PDF that most OSes now have.
Yeah, pretty much all browsers on all OSes have print-to-PDF/save-to-PDF, I prefer saving an HTML file over saving a PDF file for 3 reasons:
1. SingleFile allows me to save an HTML file that looks exactly like the webpage I saved. I have never used a save-to-PDF function in any browser that produced a PDF looking exactly like the webpage I was saving/printing. I wish browsers would implement that. Somebody did it once: they patched chromium to save a web page as SVG[1]. AFAIK if you can save to SVG you can also save to PDF with not much modification to the code; unfortunately the fork is not maintained anymore.
2. The HTML files that SingleFile creates are responsive (just like the webpage you had open), PDF is not responsive. I like that because it makes it easier to read the webpage I saved on my phone later, with a PDF file you saved on your desktop, you have to pinch to zoom and pan while you read it on your phone.
3. HTML-files/Webpages are accessible to screen readers and my browser's extensions work on them, extensions don't work on PDF files (they _can_ work on HTML files opened from disk, if you allow/enable it in the extension's settings).
[1] https://news.ycombinator.com/item?id=33584941
I use WebScrapBook, an extension for Firefox. It seems to save a whole page in one file, and I can tweak a lot of the settings.
Sometimes I wonder if there's an even easier browser-builtin function that does the same?
There are extensions like "Save Page WE" that will dump the current state of the DOM to an HTML file, including CSS and Images, but these are static and don't make the scripting work.
We have become so cloud-native (god forbid!). Just recently I realised that I can save an interesting page to my hard drive instead of saving its link. What a wonderful world has opened since! It's so liberating to live without all these bs tools.
When I save things, I try to make sure that it'll be immediately useful to me once I find it again.
I'll highlight, summarise and take notes of what I save. Or some combination of those. If I don't find anything new or directly applicable to my life, I'll let it pass by.
This approach isn't good for archival purposes, but I hesitate to save a lot of things that I'll never read again.
I'm going through my file cabinets right now. I'll keep a few things that catch my eye but I'll likely throw out most of it. The odd 25 year old computer magazine is probably interesting but not all of them collectively for the most part. And I'm certainly not going to index them in a way that they'd be useful to me.
I went through mine and cut out the stuff that was still relevant, threw out the rest of the magazine.
Ended up with a single folder of clippings about x86 assembly etc
That's more or less what I did over time. No harm in a few folders of clippings for old times sake but not really boxes and boxes of magazines that will probably never be looked at.
You can probably sell or donate those old magazines to a collector, or a kid interested in that stuff. At the very least drop them off at a thrift store instead of just dumping them.
Thrift stores don't want a ton of old paper. There are a lot of things that someone somewhere would probably like but I'm not going to track them down or get them there. Mostly it's not magazines anyway. It's a bunch of articles I ripped out over the years.
The one thing I have in my garage I know someone would want is a big pile of laserdiscs. But, again, a thrift shop (or my library) wouldn't want them and I live pretty far out from a major city. Probably will try Craigslist post-winter though as I'm trying to declutter.
Laserdiscs appear and gradually disappear at my local thrift, so someone must be buying them. Now in the vinyl records pile, there are copies of Mantovani, Jim Nabors, and Herb Alpert which have been there for years, but anything classic rock or newer sells the same day.
I was just thinking yesterday, wanting some Christmas music to get into the spirit while wrapping presents, remembering being a kid, when my mom would put on Jim Nabors' Christmas album.
Luckily there are (currently) multiple playlists of it on Youtube.
But they might not be there next year.
In the spring I'll probably list the whole pile on Craigslist, take it or leave it, at a nominal price and, if that doesn't work, just take it up to the local thrift and I'll at least have tried.
I think people don't get another perspective on this until someone dies.
Mostly it just goes away at death.
It might be interesting to read:
https://en.wikipedia.org/wiki/Digital_hoarding
I have trouble letting go of things, and I found it interesting to read through.
There's a part of me that thinks "It would be so useful to 100% automatically log and cache everything I do and be able to search it". But I think maybe being healthy means not doing it.
One thing that is worth saving is the PDF manuals for physical products that you own.
These sometimes disappear from the Web. Or disappear except for some third-party site that modifies and/or paywalls them.
Also, save the occasional important support info Web pages for those products. You'll know it when you see it. And if you don't save it now, it might be gone when you need it.
You don't need a fancy system for this. I just made a directory `~/doc/`, and started dropping files into it. Someday, I'll take the time to merge this with `~/wiki/`, but for now, I'm capturing the information with low friction, which is most important.
And even when they don't disappear, they still end up dozens of weird pages deep that none of the on-site help text or search points to correctly due to the various pointless redesigns the site has gone through.
But hey, there's more whitespace now.
The ephemeral nature of the web feels both liberating and tragic: so much creativity and insight, gone with a domain expiry or a server crash
Sometimes you have strange obsessions or a strange mindset related to your technological habits. And you might easily think that it is only you that is weird, not thinking straight. If you are the only one doing something, you are probably wrong.
And then, hopefully, there are nice personal blog posts like this one, showing you that you are not alone having some peculiar habits and so that it might make sense even if most people don't even think about it.
I have the exact same feeling when I discover through HN, blog posts and events that I'm not the only one having my web browsers full of tabs. Literally having thousands of tabs.
I created a local-only web app to wrap up some of my favorite web haunts, with HN being one of them. It allows me to look at the headlines, and save any of them in a local SQLite db that the app maintains.
https://i.postimg.cc/v8znk92x/ycomb-hn.png
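For anyone wanting to try the same idea, here is a minimal sketch of the "save a headline to a local SQLite db" part using Python's built-in `sqlite3` module. This is not the app described above; the table name, columns, and function names are assumptions chosen for illustration.

```python
import sqlite3

def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS saved_headlines (
               url      TEXT PRIMARY KEY,
               title    TEXT NOT NULL,
               saved_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn

def save_headline(conn, url, title):
    """Save a headline; saving the same URL twice keeps the first copy."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO saved_headlines (url, title) VALUES (?, ?)",
            (url, title),
        )

def search(conn, term):
    """Substring search over saved titles (LIKE ignores ASCII case in SQLite)."""
    rows = conn.execute(
        "SELECT title, url FROM saved_headlines WHERE title LIKE ? ORDER BY title",
        (f"%{term}%",),
    )
    return rows.fetchall()

conn = open_db()  # pass a file path instead of ":memory:" to keep the data
save_headline(conn, "https://example.com/a", "Is stuff online worth saving?")
save_headline(conn, "https://example.com/a", "Is stuff online worth saving?")  # ignored
results = search(conn, "WORTH")
print(results)
```

Using the URL as the primary key is the simplest way to avoid duplicates; a real app would probably also want to store the page content, not just the link.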
This feels like confirmation bias to me. The author seemed to genuinely consider the question, but didn't think critically about how little value he got from two decades of bookmarking and instead focused on how he could use this archive in the future.
This resonates so strongly with me. I worked a job where I needed to use outdated Microsoft toolchains to build plugins for software, and the documentation was just -- gone. Good luck. I've been almost compulsively saving the things that feel important to me, while seldom browsing them for years -- all the while hunting for a faster and more intuitive recall system that lets me find them later.
My ex, however had a much more fluid relationship with the internet and media in general. They liked new things, and didn't particularly care if they enjoyed something and it faded into obscurity. I feel like that's the winning mentality, but I just can't bring myself to embrace it.
Stuff online is absolutely worth saving. It is a window into the past - what people concerned themselves with, what they loved and hated.
Scholars will write papers on this era, speculating what it was like and how it fit into what came after.
The web documents the massive societal changes underway which do not relate to the internet directly. Things like changes in transportation technology, medicine, sexuality and gender, and how your average people felt about all of it. Scholars will data mine those opinions to understand who felt what ways and why, with the benefit of hindsight. New knowledge will come of it.
So yeah! It is all worth saving.
It reminds me of the cool links page I see now and then.
Is this the classic webpage that you're referring to? https://www.w3.org/Provider/Style/URI
"stuff online" is an exceptionally coarse filter to deem something worthy of saving.
I think some stuff is -- the stuff that is crucial to rebuilding all the other stuff.
Instead of saving them as PDFs, I started saving web pages using a Chrome extension called Single File [1] (after testing it, of course).
To my dismay, some saved files (.htm extension) didn't open when I wanted to open them.
So I'm glad people are discussing ways to archive web pages that reproduce the original page faithfully.
[1] https://chromewebstore.google.com/detail/singlefile/mpiodijh...
I used to think so and then I ran out of space
I’m the opposite of most of the “archivists” on HN. I delete everything and save nothing. I have maybe 25 sheets of paper in my apartment, including social security card and birth certificate.
Saving stuff just isn’t fun or useful for me. Never for more than a passing moment have I thought, “Boy I wish I had saved that whatever.”
Old people are the worst about this stuff. They think/hope somebody will want it and then just make it the next generation’s problem.
I told my dad if he thinks it has value, give it away while he’s alive. I have neither the interest nor the space to deal with it so it’s going straight into the trash.
I mean one man's trash is another man's treasure. We can't save it all but if we save even 10% of it, that would be great for future AIs to learn from our mistakes and what humans once did on a day to day basis before they were replaced.
I suppose it comes down to what the purpose of such archiving is.
I think it's the preservation of information, but I also believe 90% is absolutely pointless. There is just so much of it, and data storage so cheap, that it makes sense to just save everything.
That data storage is also ephemeral. None of it will last as long as a paper note, unless some human goes to the trouble of copying it all onto new drives with new software every ten years or so.
With a proper NAS and RAID10 for double parity, it's a bit like Theseus ship. Just keep swapping out drives when they become unhealthy and you never have to rebuild or migrate
Eventually the controller will die and eventually compatible ones will no longer be produced or will at least be inconvenient to obtain or commission and hence expensive.
Paper lasts for centuries without any attention beyond keeping it moderately dry and away from things that eat it.
No sane person uses hardware RAID in 2024, if that's what you're referring to.
Whether you're using hardware RAID or not you still need a hardware storage controller of some type which accepts the new disks you can buy and works with the NAS. What they are saying is eventually that'll be more $ and time than just migrating off the system would be. From ENIAC to now could fit in one lifespan, would you still be maintaining a home floppy drive backup system in the 2040s or just save the time and effort with a migration?
sure, you can always move the old storage mechanism to something new if it is too cumbersome.
why still back up floppies if you could just move the data to a single dvd, or throw it on the SAN?
RAID is just algorithms, the actual transport doesn't matter (i.e. spinning platter and solid state both use SATA connectors).
Well... storage is cheap, but not cheap enough to save everything, with just usenet being in the 400TB/day range these days. Sure, it's cheap enough to save every webpage you visit during your life, but probably not cheap enough to save every video you click on youtube or watch on a streaming-service, and all the music you listen to all day.
Though just the music compressed in opus at 128kbit might work ok, 60 years of 24/7 128kbit is 30TB, so that would fit on 1 large HDD currently.
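The 30TB figure is easy to check. A quick sketch of the arithmetic, assuming a constant 128 kbit/s stream, an average year of 365.25 days, and decimal terabytes (the way drives are sold):

```python
BITRATE_BITS_PER_S = 128_000            # 128 kbit/s
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # average year, including leap days
YEARS = 60

total_bytes = BITRATE_BITS_PER_S / 8 * SECONDS_PER_YEAR * YEARS
print(f"{total_bytes / 1e12:.1f} TB")   # prints 30.3 TB
```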
Music is actually an ideal candidate. I don't listen to music all day, and when I do listen to it, it's often something I've listened to before. My current collection is about 200GB and that includes a ton of stuff I've never listened to; it seems reasonable that a full life's worth of music could fit in 1TB, easily.
If that much data comes across Usenet daily then how do services afford the storage to offer years of retention?
You can't dedupe the large binary files because they're encoded in small parts likely differently every time they're posted.
Data rots though, you can't just save it once and be done with it. You have to migrate it across storage mediums, formats etc. It's a recurrent effort/cost.
More planning for less effort.
Do your research first. Use standards.
E.g.: html, pdf, h264/h265/av1 in mp4 container, chd, zip and so on depending on what you are storing.
On what physical medium?
I have 1 terabyte of data in 1860, how do I make sure the storage medium is still intact in 2024?
If it is 1860 and you want to see that your data is preserved for 164 years, then you start by keeping it in geographically diverse places where people will look after it and tend its needs for 164 years.
As a concept, that's really not different at all from holding on to today's data here in 2024.
> I have 1 terabyte of data in 1860, how do I make sure the storage medium is still intact in 2024?
Storage keeps growing and the price of storage keeps going down.
My DOS and even some C64 source code made it to this day on backups (DVDs, HDDs, SSDs, USB memory sticks, etc., both online and offline) and to ZFS pools. Media that didn't exist in the 80s/early 90s.
Floppy disks -> 40 MB HDD -> 6.4 GB HDD -> 80 GB HDD -> 500 GB HDD -> 240 GB SSD -> 1 TB NVMe SSD.
You get the idea.
The way you make sure you still have your data is by not focusing on the medium but by focusing on the fact that data is data.
Media come and go. Data can (and should) be copied to new media.
Not unlike:
Some people are going to complain about the naming, but I have all my emails, except for six months' worth, going back to when I started using the Internet. And I still have nearly all of my data since I started using computers. 8-bit computers. Do you?
I don't care about naming much. "search, don't sort".
We've got emulators for just about every and any system. My vintage arcade cab has both real PCBs and a Pi running an emulator with thousands of arcade games on it.
You can already, today, emulate, say, the Raspberry Pi model you want using QEMU. There are container files that'll gladly do that for you.
Unless civilization ends there's simply not a world in which, say, PNG, JPG and x265 files aren't readable. This just won't happen.
FWIW I'm paranoid about the integrity of my data: I've got my own naming scheme where a cryptographic hash is added to many of my files.
For example:
This means "This file's Blake3 checksum begins with ae4f2877d3". I then have a script doing statistical sampling: I enter a percentage and that percentage of files where a cryptographic hash is part of the filename are checked, randomly (if I enter 100 then 100% of the files are tested).
If I enter, for example, '7', then 7% of the files are tested, and if they all pass there's a high probability that all the checksums are correct.
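A sampling checker along those lines might look like the sketch below. It is not the commenter's script. Two things are assumptions made for illustration: the filename pattern (`<name>.<10 hex digits>.<ext>`), and the hash, where BLAKE2b from Python's standard library stands in for BLAKE3, which needs a third-party package.

```python
import hashlib
import random
import re
import tempfile
from pathlib import Path

# Hypothetical naming scheme: "<name>.<10 hex digits>.<ext>", e.g. "notes.ae4f2877d3.txt".
TAGGED = re.compile(r"\.([0-9a-f]{10})\.[^.]+$")

def file_digest(path):
    """Hex digest of a file's contents (BLAKE2b as a stdlib stand-in for BLAKE3)."""
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def check_sample(root, percent, rng=random):
    """Verify a random `percent` of the tagged files under `root`; return mismatches."""
    tagged = sorted(p for p in Path(root).rglob("*")
                    if p.is_file() and TAGGED.search(p.name))
    # Always check at least one file when any tagged file exists.
    count = min(len(tagged), max(1, round(len(tagged) * percent / 100))) if tagged else 0
    bad = []
    for path in rng.sample(tagged, count):
        expected = TAGGED.search(path.name).group(1)
        if not file_digest(path).startswith(expected):
            bad.append(path)
    return bad

# Small demo in a throwaway directory: tag a file, verify it, then corrupt it.
with tempfile.TemporaryDirectory() as d:
    plain = Path(d) / "a.txt"
    plain.write_bytes(b"hello")
    tagged_file = Path(d) / f"a.{file_digest(plain)[:10]}.txt"
    plain.rename(tagged_file)
    clean = check_sample(d, 100)
    tagged_file.write_bytes(b"bit rot")
    damaged = check_sample(d, 100)
print(len(clean), len(damaged))
```

A passing sample is only evidence against widespread damage: if a fraction f of files were corrupt and n files are sampled, the chance of missing all of them is roughly (1 - f)^n, so a single bad file among thousands will usually slip through a small sample.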
> On what physical medium?
That is the wrong question.
At this point in history, we can't tell which 90% is absolutely pointless.
In general, I am pro-turnover where there is rivalry: ceteris paribus keep the newer thing. However, information is so cheap as to be effectively non-rivalrous, so I am considering running my own archive to keep kagi's small sites etc. alive. Unfortunately, there is not a good tool for this that matches whatever Archive.org has. ArchiveHub needs routine management to keep the feed up and viewing it is not that easy. I'm sure we'll come up with stuff.
The other thing is that searching for the long tail is near impossible. The big sites dominate Google, so I need something like marginalia to actually get to the old stuff that it used to be so easy to find. Because of the median user having simple queries, some questions are no longer answerable on Google: they are dominated by the median user and never show up.
Curve smoothing. Chaikin's algorithm and Jarek's tweak etc. Very clever and nice way of making angular geometry curvy. Constructive geometry stuff.
There were like a dozen algs. I kept links to nice papers with diagrams. Then they started disappearing. Now I'd be pressed to find 2.
This is really useful info that is apparently disappearing. So yes, it happens, and maybe you should save that stuff.
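In the spirit of saving it: the basic Chaikin corner-cutting step is small enough to write down. The sketch below is the textbook version as I understand it (each segment is replaced by the points 1/4 and 3/4 of the way along it), with the common convention of keeping the endpoints of an open polyline. "Jarek's tweak" is not shown, since nothing here describes it. The function name and parameters are my own.

```python
def chaikin(points, iterations=1, closed=False):
    """Smooth a 2D polyline by Chaikin corner cutting.

    Each segment P->Q is replaced by the two points at 1/4 and 3/4 along it.
    For an open polyline the original endpoints are kept so the curve
    still starts and ends where the input did.
    """
    pts = [(float(x), float(y)) for x, y in points]
    for _ in range(iterations):
        if len(pts) < 3:
            break  # nothing to round off
        # A closed polygon also has a segment from the last point back to the first.
        nxt = pts[1:] + pts[:1] if closed else pts[1:]
        new_pts = []
        for (x0, y0), (x1, y1) in zip(pts, nxt):
            new_pts.append((0.75 * x0 + 0.25 * x1, 0.75 * y0 + 0.25 * y1))
            new_pts.append((0.25 * x0 + 0.75 * x1, 0.25 * y0 + 0.75 * y1))
        if not closed:
            new_pts = [pts[0]] + new_pts + [pts[-1]]
        pts = new_pts
    return pts

corner = [(0, 0), (4, 0), (4, 4)]
smoothed = chaikin(corner)
print(smoothed)
```

Each iteration roughly doubles the number of points, and repeated iterations converge toward a quadratic B-spline, so two or three passes are usually enough for display purposes.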
Digital storage is free; yes, save it all
Please do share where I can reliably store my backups for free!
> Backups are for wimps. Real men upload their data to an FTP site and have everyone else mirror it.
— Linus Torvalds
This does still happen. Microsoft may nuke a git repo and someone has to figure out who has the latest version of the entire repo with all the latest commits of every branch.
The vast majority of people aren't privileged enough to have anyone mirror their data.
But how do I get everyone to mirror my gigabytes of encrypted photo backups?
Title it "(current star name) Leaked Nudes.zip" and seed a torrent? Every few years, change the title to keep it current.
just upload them to social media accounts. AFAIK twitter, facebook, and youtube do not have storage limits. no deletion for inactivity either.
They don't allow uploading large binary blobs either, though, and steganographically storing gigabytes of data with probably terabytes of overhead sounds like a quick way to get banned.
dump it on Wikipedia. AFAIK wiki never removes anything. it just gets buried in an edit history. or Wikimedia image files
That obviously can't be true, or spammers would be all over it, using Wikimedia as a free image host.
There was a story making rounds a few years back that Wikipedia was being used as a file / media sharing platform, particularly in countries with limited other options for such services.
I can't find a link currently, though I'm pretty sure I'm not hallucinating this.
Are you thinking of this by any chance? https://phabricator.wikimedia.org/T273741
That wasn’t really somebody using Wikipedia as a file storage, but rather as a web connectivity test.
Nope, though that's interesting and somewhat expected.
I remember when web-connectivity test was spelled "Slashdot".
Is stuff online worth saving?
This supposes a broader question: what is the purpose of recording information generally?
The alternatives are observing the here and now (boys, obIsland), conversing, or musing / meditating / cogitating on topics.
I'm not hostile to recorded information of many forms, and am in fact something of a packrat myself. But as I acquire archives and observe the increasing enshittification of the Internet as a whole, the value of investing time in searching and saving information only found online and created for online consumption seems ... increasingly questionable.
As I've mentioned in numerous earlier HN comments,[1] my preferred information forms are increasingly more traditionally-published books and articles, highly prepared discussions, or conversations amongst experts within a field (or with a strongly discriminating host and an expert). It's worth noting that the origin of much (though of course not all) online content is in fact discussions based around such works: author interviews or excerpts from books, discussion of articles, etc.
Another principle, and one that seems worth considering in the context of online content, is that false leads and faulty assumptions are absolutely toxic to learning or training in a domain. This turns up in AI contexts, but is also evident in, say, a survey of the history of philosophy, particularly the roughly 2,000 year period in Western philosophy of the dominance of Christian theological philosophy, much of which was based on utterly misguided premises. It wasn't until such assumptions were dropped that scientific and technological understanding really began advancing.[2] One wonders to what extent our present accumulated document[3] trove might actually weigh down future progress.
Or succinctly: What's the balance between retention and study, on the one hand, and observation and reason on the other? Tradition vs. empiricism.
--------------------------------
Notes:
1. Searching my handle with "published books articles" turns up a few dozen of these: <https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...>
2. I'm not writing this as an utter dismissal of all theological thought, nor am I asserting that what I'm terming "misguided" was utterly useless. There are useful ideas which emerge, there are religiously-aligned philosophers who arrived at keen insights, the process of working through arguments, even if founded on utterly counterfactual premises, can lead to useful developments in reasoning and logic, etc., etc. That's even without getting to "it can always serve as a bad example". But in this and many other areas it becomes clear that relying on early authorities (in the classical sense of that term) heavily retarded intellectual advances.
3. In the Otletian sense of any type of recorded work: books, articles, images, plastic sculpture, video, etc.
The rise of LLMs has really devalued saving stuff online. What is the point of saving an article if I could just ask ChatGPT to create it and it would probably do a pretty good job? It’s still worth keeping notes and stuff that may be hard to find but the majority of things online can easily be reproduced and are not worth saving.
I just asked ChatGPT where I could find info about a specific tabletop game that was heavily discussed online some years back and had multiple fan sites with house rules, forum, mailing list, etc.
It couldn't help me except to reference a few sites that no longer exist, one that's HTTP only so now causes browser warnings and was mostly only links to more sites that also no longer exist, and an old general gaming forum that doesn't even have a search function.
It didn't mention the mailing list archive site which has 21 years of discussion, indexed and searchable, still available. Or the still-active site where some fans archived all that they could from the old sites some years back, along with emulators, binaries, and instructions to get some of the old fan-made software running again on modern systems.
Neither of those sites are new since ChatGPT was trained, they have been around much longer than ChatGPT. But it knows nothing about them or their content.
I then asked it what it could tell me about the topic of a blog that ran for 10 years, with 314 posts, most of which had around 100+ comments, and the site it most often linked to. ChatGPT's answer was simply: "It seems like I can’t do more browsing right now. Please try again later."
So no, you can't "just ask ChatGPT." Contrary to popular belief, it doesn't know much about what is or was on the internet, even at the time it was trained, nor about many topics.
Given the way the web has developed over time, it seems quite likely to have huge gaps on anything related to the small web, any niche hobbies or interests, etc. All that non-commercial stuff that you can't easily find in modern search engines, ChatGPT doesn't know about it either.
Those big chunks of content that wink out of existence whenever a hosting company goes under or someone just stops paying the bill for a site that they used to love but haven't actively maintained in a while? It doesn't know any of that either.
Entire online communities rose, developed, created a great deal of stuff, then slowly atrophied, and eventually disappeared. ChatGPT knows nothing of them.
I think you are right. But I think the answer goes deeper: we have encouraged a culture where the most supported information is also the most superficial. The essence of individual experience itself has long been discouraged on the web in favour of SEO and the trashy news and the trivial.
So the fact that ChatGPT can replace much of the web actually says less about the marvel of ChatGPT and more about the lack of anything really worthwhile because the profound just happens to be the least economically valuable.