
Venture Capital


Albert Tai

Downloading More Ram!
Legacy Platinum Member
Joined
Feb 9, 2009
Messages
748
Reaction score
0
I'm wondering about the best way to acquire venture capital funding for a site I'm running.
For example:
Twitter, Baidu, Skype. There are many more web 2.0 sites that have had venture capital.

I was wondering what the best way to go about this is.

Right now I'm building a search engine. It's not out yet, but the site design, favicon, etc. are done.

The problem is servers. I own a dedicated server at the moment, but it is simply not fast enough to index the whole web. Google, for example, has crazy servers, not to mention tons of bots running at the same time.

Any ideas?
 

BidNo

Level 8
Legacy Platinum Member
Joined
Feb 5, 2006
Messages
1,155
Reaction score
3
Just one thought: with Google's market cap, there are no venture capital firms large enough to compete with them head-to-head at their own game. You're going to need to demonstrate a distinct niche/angle, or the exploitation of a Google weakness, to attract funds. Good luck!
 

Albert Tai

Downloading More Ram!
Legacy Platinum Member
Joined
Feb 9, 2009
Messages
748
Reaction score
0
Just one thought: with Google's market cap, there are no venture capital firms large enough to compete with them head-to-head at their own game. You're going to need to demonstrate a distinct niche/angle, or the exploitation of a Google weakness, to attract funds. Good luck!
Thank you for the feedback.
I was wondering if you know of any particular venture capitalists who are in this market. That would be good.
As for the niche/angle, I think all search engines are profitable. That is the reason Microsoft alone invested $100 million just in advertising for Bing.com.
Etc., etc.
Should I contact them from a .com domain email or just Gmail?
 

Adapt Web

Level 5
Legacy Platinum Member
Joined
May 18, 2008
Messages
305
Reaction score
0
Most web 2.0 companies that have VC backing list the backers in an "about us" or "investors" section. Gather their names and reach out. Don't use Gmail; IMO it's cheesy.
 

Melly

Pink Lover!
Legacy Exclusive Member
Joined
Jun 20, 2007
Messages
2,929
Reaction score
64
Definitely don't contact them from a free email address. Having a paid email shows that you are professional. Maybe use an address at the domain you need funding for?

With the market today, even though there are investors out there with money to burn, they want to be convinced that they are going to earn a decent ROI before they commit. GL!
 

Albert Tai

Downloading More Ram!
Legacy Platinum Member
Joined
Feb 9, 2009
Messages
748
Reaction score
0
Thanks guys. I'll keep that in mind.
 

tristanperry

Domainer & Web/Software Dev
Legacy Exclusive Member
Joined
Jan 5, 2007
Messages
1,584
Reaction score
6
Good luck with the venture capital pitch and all. Depending on your budget, it could be worth having a solid business plan drawn up for you.

Also, think about trying to optimise your crawler. I don't know how complicated it is, however as an example I built a tool which crawls a website and draws up a list of the total file size of the webpage (it crawls the HTML, then downloads the CSS and JS, crawls those and picks out any images referenced in the CSS, etc.) - it runs quite quickly* (it's really just a case of some regular expressions and cURL work) on a single-processor dedi, so perhaps you may be able to save some time and money with code-level optimisations. Also, are you on a shared connection port? This may be what's slowing the crawler. Benchmarking and finding the slow bits of code could be invaluable :)

(* By "quite quickly" I mean the whole process of crawling the page and downloading the CSS, JS, and any images on the page tends not to take more than about 0.5 seconds, but much of that is simply the bandwidth/transfer rate - the actual processing flies by.)
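For readers curious what that kind of page-size crawl looks like, here is a minimal sketch (regexes plus cURL). It is not the actual tool's code: fetchUrl() and pageAssetSizes() are made-up helper names, and asset URLs are assumed to be absolute.

PHP:
<?php
// Rough page-size crawl: fetch the HTML, pull the referenced CSS/JS/images out
// with regexes, download each one, and total up the bytes transferred.
function fetchUrl($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body === false ? '' : $body;
}

function pageAssetSizes($url) {
    $html  = fetchUrl($url);
    $sizes = array('(html)' => strlen($html));

    // Very rough regexes for stylesheets, scripts and images referenced in the HTML.
    preg_match_all('/<link[^>]+href=["\']([^"\']+\.css[^"\']*)["\']/i', $html, $css);
    preg_match_all('/<script[^>]+src=["\']([^"\']+)["\']/i', $html, $js);
    preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $img);

    // Download each asset once and record its size (assumes absolute URLs).
    foreach (array_unique(array_merge($css[1], $js[1], $img[1])) as $asset) {
        $sizes[$asset] = strlen(fetchUrl($asset));
    }
    return $sizes;   // file => size in bytes
}

print_r(pageAssetSizes('http://www.example.com/'));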
 

cleverlyslick

Level 6
Legacy Platinum Member
Joined
Jan 22, 2008
Messages
722
Reaction score
0
IMO you need Bing.com money to make a search engine worth a VC's time and money. If you feel your search engine is of that caliber, contact the VCs up in Silicon Valley; I hear a lot of peeps over there are willing to invest if the idea is right.
 

Albert Tai

Downloading More Ram!
Legacy Platinum Member
Joined
Feb 9, 2009
Messages
748
Reaction score
0
Good luck with the venture capital pitch and all. Depending on your budget, it could be worth having a solid business plan drawn up for you.

Also, think about trying to optimise your crawler. I don't know how complicated it is, however as an example I built a tool which crawls a website and draws up a list of the total file size of the webpage (it crawls the HTML, then downloads the CSS and JS, crawls those and picks out any images referenced in the CSS, etc.) - it runs quite quickly* (it's really just a case of some regular expressions and cURL work) on a single-processor dedi, so perhaps you may be able to save some time and money with code-level optimisations. Also, are you on a shared connection port? This may be what's slowing the crawler. Benchmarking and finding the slow bits of code could be invaluable :)

(* By "quite quickly" I mean the whole process of crawling the page and downloading the CSS, JS, and any images on the page tends not to take more than about 0.5 seconds, but much of that is simply the bandwidth/transfer rate - the actual processing flies by.)
Right now my spider does not crawl the site and save a cache (like you said yours does - let me know if I'm wrong). My crawler goes through the HTML and takes an extract, and it can use the meta description as the description of the site. Sadly, however, some adult sites like to fake their meta keywords and description, so I have turned that feature off. Instead, like Google, I take the description and keywords from an extract of the page itself.
My crawler also grabs the images and makes a thumbnail of each for image search.
The only things I'd say I don't have, compared to Google, are the cache system, and the algorithm isn't that good yet.

It spiders around 2,000-3,000 links per hour. That was from dmoz.org, which is a pretty large site. I've spidered around 23,000 links from it so far. I know there are more.

I'm thinking of getting 10 more servers, but lower-spec than the one I have right now. I know you can get a low-end server pretty cheap with the provider I'm using now. I'm Asian :) I can get good deals.
I can probably afford 10 more servers for 2-3 months.

The thing is, this development is going to take a while, and I can't really afford tens of servers for more than a few months. I have already been running the server for a few months now, but just the one. As I need faster speeds, I would need more.

I just saw that VCs invest great sums of money into new web 2.0 things. A lot of them are like Twitter; another one that is closing its round this month (a search engine, I forget the name) got something like 50-60 million in VC funding right away.

I'm not sure how VC works, but I really wanted to start this off, and I have.

I've already got a really awesome customized template, plus the server, etc.

Just improving the spider at the moment, and having the rest of the template coded into HTML.

Whoah, that was pretty long. :)


As for your spider, how fast does it spider? And from the sound of it, it saves a cache.

The thing is, I was going to add caching and so on, but it makes the spider slower.

Not to mention Google has bots for different uses: image bots that pull the images out, then the regular crawler, and also the cache bot (I think).

Right now I have one bot doing both images and crawling, which makes it slower.


Then again, Google has 50 bots of each, not to mention super servers.

Heh.
 

tristanperry

Domainer & Web/Software Dev
Legacy Exclusive Member
Joined
Jan 5, 2007
Messages
1,584
Reaction score
6
Right now my spider does not crawl the site and save a cache (like you said yours does - let me know if I'm wrong). My crawler goes through the HTML and takes an extract, and it can use the meta description as the description of the site. Sadly, however, some adult sites like to fake their meta keywords and description, so I have turned that feature off. Instead, like Google, I take the description and keywords from an extract of the page itself.
My crawler also grabs the images and makes a thumbnail of each for image search.
The only things I'd say I don't have, compared to Google, are the cache system, and the algorithm isn't that good yet.
Hmm that makes sense. Yeah getting an overall view of the page via the keywords on it sounds the best way. It's always difficult to create such a thing (and there is the potential for abuse like people writing white text on a white background to cheat the search engines, etc), although I guess just try and start small and go from there (unless the VC thing works out ;))

It spiders around 2,000-3,000 links per hour. That was from dmoz.org, which is a pretty large site. I've spidered around 23,000 links from it so far. I know there are more.
That's not too bad - I'd imagine that may just be due to the connection/bandwidth rate. As below, having a separate bot for the images and the text/content would probably make things easier, in that the text bot would be very quick (i.e. just the text, which would d/l quite quickly and also be processed quickly)

I'm thinking of getting 10 more servers, but lower-spec than the one I have right now. I know you can get a low-end server pretty cheap with the provider I'm using now. I'm Asian :) I can get good deals.
I can probably afford 10 more servers for 2-3 months.

The thing is, this development is going to take a while, and I can't really afford tens of servers for more than a few months. I have already been running the server for a few months now, but just the one. As I need faster speeds, I would need more.
That makes sense. I wouldn't try and compete with the big guys at the moment, perhaps just try and initially spider the biggest websites and expand out from there.

I just saw that VCs invest great sums of money into new web 2.0 things. A lot of them are like Twitter; another one that is closing its round this month (a search engine, I forget the name) got something like 50-60 million in VC funding right away.

I'm not sure how VC works, but I really wanted to start this off, and I have.
Sounds awesome - good luck in going ahead with it.

I've already got a really awesome customized template, plus the server, etc.

Just improving the spider at the moment, and having the rest of the template coded into HTML.

Whoah, that was pretty long. :)
Yeah the template is awesome, it really is a nice project :) Obviously a large-scale project but an interesting one nonetheless.

As for your spider, how fast does it spider? And from the sound of it, it saves a cache.

The thing is, I was going to add caching and so on, but it makes the spider slower.

Not to mention Google has bots for different uses: image bots that pull the images out, then the regular crawler, and also the cache bot (I think).

Right now I have one bot doing both images and crawling, which makes it slower.
There's no caching - it literally just downloads the HTML code, then downloads the CSS. From there it'll create a list of all the files it'll need to download (so images in the CSS, iframes, JavaScript, <img src>s) and finds their file size after downloading them.

It can take 3-4 seconds on a larger site - for example the apple.com homepage which is about 420 Kilobytes overall takes about 4 seconds to run. Most of this is from the actual downloading of the files - I assume this is the slowest point of your script too; i.e. just a bandwidth thing.

Anywhoo as a test I crawled:

http://www.dnforum.com/f12/

And ran a basic benchmark between various points in the script. The results were:

1: 0.585776805878 - time taken to download the initial HTML file

2: 0.00144219398499 - a trivial bit; a regex is run to get a list of all imported CSS files (i.e. via <link href="..">)

3: 0.0282170772552 - this surprised me at how quick it was. This downloads each of the CSS files, and then makes a list of all the images referenced in the CSS (i.e. via background-image: url(...); for example)

4: 0.000219106674194 - a trivial bit; generates a list of all the JS files and image files (etc) it needs to download (so it can check their size)

5: 3.9039940834 - this downloads the JS and image files (etc). This is the slowest bit, I assume mainly due to the connection/transfer times

All times are in seconds.

So yeah, I think the main issue here is the time it takes to download all the images/JS files. At least for me it is :)
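For anyone wanting to reproduce that kind of stage-by-stage measurement, a minimal sketch using microtime() might look like the following. It is not the actual script: the stage names and URL are illustrative, and relative CSS URLs would need resolving before they could be downloaded.

PHP:
<?php
// Time each stage of a fetch-and-parse run with microtime(true).
$times = array();
$start = microtime(true);

$html = file_get_contents('http://www.example.com/');              // 1: download the HTML
$times['1 download html'] = microtime(true) - $start;

$mark = microtime(true);
preg_match_all('/<link[^>]+href=["\']([^"\']+\.css[^"\']*)["\']/i', $html, $cssFiles);
$times['2 find css links'] = microtime(true) - $mark;              // 2: list the CSS files

$mark = microtime(true);
foreach ($cssFiles[1] as $cssUrl) {
    @file_get_contents($cssUrl);   // 3: download each stylesheet (relative URLs skipped silently)
}
$times['3 download css'] = microtime(true) - $mark;

print_r($times);   // seconds per stage, like the numbers quoted above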

Then again, Google has 50 bots of each, not to mention super servers.

Heh.
Hehe that's very true ;)
 

south

DNF Addict
Legacy Exclusive Member
Joined
Dec 31, 2006
Messages
4,688
Reaction score
168
I have played a little with the open source spiders / front ends. Did not have enough time to fully get into them though. The biggest challenge I saw was after your database gets a lot of entries, it becomes difficult to query in a timely fashion. How's your frontend holding up to queries when your database has millions of records? Do you really want to conquer the world right at the start? Perhaps it might be easier to focus on a niche or topic where your database doesn't need to contain billions or trillions of records, and your search engine could be the "G" of that topic. Then when you get big, sell it to Newscorp or Google. :yes:
 

Albert Tai

Downloading More Ram!
Legacy Platinum Member
Joined
Feb 9, 2009
Messages
748
Reaction score
0
Hmm that makes sense. Yeah getting an overall view of the page via the keywords on it sounds the best way. It's always difficult to create such a thing (and there is the potential for abuse like people writing white text on a white background to cheat the search engines, etc), although I guess just try and start small and go from there (unless the VC thing works out ;))
That's where the spider is awesome. It checks for the words on the page, and I can change it any time I want. I can make it so that if at least 5 words are found on the page, the spider will index it; if 5 words aren't found on the page, the spider skips it.

Ohhh, nvm, I know what you're saying now... the white text thing people used to do to cheat Google. Yeah, I was thinking about what to do about that. Maybe parse the CSS and figure out all the white text. :(



That's not too bad - I'd imagine that may just be due to the connection/bandwidth rate. As below, having a separate bot for the images and the text/content would probably make things easier, in that the text bot would be very quick (i.e. just the text, which would d/l quite quickly and also be processed quickly)
Possible. My server isn't running on a quick line; pretty sure it's a shared bandwidth line. My server is 10 Mbps unmetered. Pretty slow, but it's alright.


That makes sense. I wouldn't try and compete with the big guys at the moment, perhaps just try and initially spider the biggest websites and expand out from there.
My spider has a little problem with big sites. Little sites take just a few seconds. With dmoz.org, for example, after I get to around 20,000 to 30,000 links my spider dies: MySQL crashes, and sometimes the server crashes. Part of the reason is that my spider is a bit faster because it uses temp files, and the temp files make the server slower. Also, the spider never takes a rest: on a smaller site it finishes quickly and then rests for a while before I crawl another, but on big sites it just keeps running, and sometimes it kills the server badly. I wish I had those servers I saw a few datacenters bragging about. Something like 1.5 TB of RAM? Wow.

Sounds awesome - good luck in going ahead with it.
Thanks

Yeah the template is awesome, it really is a nice project :) Obviously a large-scale project but an interesting one nonetheless.
Yeah, the template was the easiest part of this whole project. I love the template; it's really awesome. Not sure if you've seen the mascot too. I decided to change from an FBI-guy-with-a-search-thingy custom mascot to a polar bear mascot.

There's no caching - it literally just downloads the HTML code, then downloads the CSS. From there it'll create a list of all the files it'll need to download (so images in the CSS, iframes, JavaScript, <img src>s) and finds their file size after downloading them.
When it downloads the HTML code, does it download it as a .html file or just into the script? Because if it were downloading the media, CSS, HTML, etc., it could act like a cache afterwards if the script doesn't delete it.
Oh, there's also a little problem with your crawler. Would it keep downloading the same media file if it started crawling subpages? You'd end up with tons of copies of the same media files (imagine the logo being downloaded over and over again because the homepage and the subpages share the same logo).
It can take 3-4 seconds on a larger site - for example the apple.com homepage which is about 420 Kilobytes overall takes about 4 seconds to run. Most of this is from the actual downloading of the files - I assume this is the slowest point of your script too; i.e. just a bandwidth thing.
4 seconds crawling the homepage. I guess not that bad.

Anywhoo as a test I crawled:

http://www.dnforum.com/f12/

And ran a basic benchmark between various points in the script. The results were:

1: 0.585776805878 - time taken to download the initial HTML file

2: 0.00144219398499 - a trivial bit; a regex is run to get a list of all imported CSS files (i.e. via <link href="..">)

3: 0.0282170772552 - this surprised me at how quick it was. This downloads each of the CSS files, and then makes a list of all the images referenced in the CSS (i.e. via background-image: url(...); for example)

4: 0.000219106674194 - a trivial bit; generates a list of all the JS files and image files (etc) it needs to download (so it can check their size)

5: 3.9039940834 - this downloads the JS and image files (etc). This is the slowest bit, I assume mainly due to the connection/transfer times

All times are in seconds.

So yeah, I think the main issue here is the time it takes to download all the images/JS files. At least for me it is :)
Yeah, the media files are definitely taking a while. I think it might be faster if your crawler weren't downloading the same media files over and over again.
I have played a little with the open source spiders / front ends. Did not have enough time to fully get into them though. The biggest challenge I saw was after your database gets a lot of entries, it becomes difficult to query in a timely fashion. How's your frontend holding up to queries when your database has millions of records? Do you really want to conquer the world right at the start? Perhaps it might be easier to focus on a niche or topic where your database doesn't need to contain billions or trillions of records, and your search engine could be the "G" of that topic. Then when you get big, sell it to Newscorp or Google. :yes:
Oh yes, I have fixed that problem. MySQL wasn't caching.
If you search a keyword with over 30,000, 40,000, or sometimes 50,000 links - for example "car" - the first search takes around 3.3 seconds (sometimes 1, it depends). Then, if you search it again, it takes only 0.3 seconds.
My search engine caches the keywords. Don't worry, I've got around 1.5 TB of space for it to cache.

Thing is, I haven't crawled a lot of sites yet, so when I reindex, or when I index a site with the same keyword...

Well, the cache is cleared (that's good, or else it would have the same links forever and new sites would never show up).







I typed a lot :)
Hopefully my quote thingy wasn't messed up.
 

hugegrowth

Level 10
Legacy Exclusive Member
Joined
Mar 28, 2005
Messages
5,992
Reaction score
150
Fortune magazine has had some articles in the last year or so on VC funds, who the main players are, etc. Plus if you search "venture capital" along with 'twitter', 'digg', 'facebook', etc, you will find more.

The last thing I read was about the guy who developed the Netscape browser, Marc Andreessen; he is in a recent Fortune article with a list of things he is helping to fund right now. There are names of other VCs in the article. You would want someone with cash, but also someone with connections who can give you good advice.

Side note: In the article it mentions how Andreessen is dealing with people now who are too young to remember the Netscape browser, lol, it was only what, 10 years ago?!

Here is the article, a good read as well:

http://money.cnn.com/2009/07/02/technology/marc_andreessen_venture_fund.fortune/index.htm?postversion=2009070605

and good luck!

---
 

sashas

DNF Addict
Legacy Exclusive Member
Joined
Jun 23, 2007
Messages
1,838
Reaction score
29
I'm wondering about the best way to acquire venture capital funding for a site I'm running.
For example:
Twitter, Baidu, Skype. There are many more web 2.0 sites that have had venture capital.

I was wondering what the best way to go about this is.

Right now I'm building a search engine. It's not out yet, but the site design, favicon, etc. are done.

The problem is servers. I own a dedicated server at the moment, but it is simply not fast enough to index the whole web. Google, for example, has crazy servers, not to mention tons of bots running at the same time.

Any ideas?

This is actually a very complicated, very difficult process.

Every month, THOUSANDS of entrepreneurs virtually beg venture capitalists to invest in their startup.

The only thing that will grab their attention is one of the following:
a) A solid, very marketable product
b) A great team with a great track record
c) Clever marketing
d) Proven performance

You might not necessarily need a working prototype, but your pitch has to be perfect.

Visit YCombinator.com - they provide early-stage seed money to startups (usually $20k)

Keep an eye on blogs like Mashable.com and TechCrunch.com

Getting venture capital firms to invest in your startup is one of the hardest parts of the business. Your market (search) is monopolized and very saturated with the big players. The costs of entering the search engine market are just too high for new players. Not to burst your bubble, but I doubt you'll be able to index even 1/100000th of the web with one single dedicated server; you'll need a server farm the size of half a city block.

Also, the technology has to be very, very good, revolutionary even.

Think of all the engineers that Microsoft has - even with them, it hasn't been able to make a search engine that performs better than Google (okay, Bing is awesome, but still..it took them a decade)
 

tristanperry

Domainer & Web/Software Dev
Legacy Exclusive Member
Joined
Jan 5, 2007
Messages
1,584
Reaction score
6
That's where the spider is awesome. It checks for the words on the page, and I can change it any time I want. I can make it so that if at least 5 words are found on the page, the spider will index it; if 5 words aren't found on the page, the spider skips it.
That sounds a nice system :) It's always best to code with flexibility in mind.

Ohhh, nvm, I know what you're saying now... the white text thing people used to do to cheat Google. Yeah, I was thinking about what to do about that. Maybe parse the CSS and figure out all the white text. :(
Hehe no worries. Yeah I'm not too sure how best to solve it, although it's not a major practise (anymore; it was a black hat SEO method about 5 years ago) so TBH I wouldn't worry about it now. Although I guess a basic CSS check would be possible, but not a massive biggie.

Possible. My server isn't running on a quick line; pretty sure it's a shared bandwidth line. My server is 10 Mbps unmetered. Pretty slow, but it's alright.
That might be it. I'm either on a 10 mbps dedicated line or 100 mbps shared line (capped at 3 TB) - either way it's sort of quick, but is still the 'weak link' :yes:

My spider has a little problem with big sites. Little sites take just a few seconds. With dmoz.org, for example, after I get to around 20,000 to 30,000 links my spider dies: MySQL crashes, and sometimes the server crashes. Part of the reason is that my spider is a bit faster because it uses temp files, and the temp files make the server slower. Also, the spider never takes a rest: on a smaller site it finishes quickly and then rests for a while before I crawl another, but on big sites it just keeps running, and sometimes it kills the server badly. I wish I had those servers I saw a few datacenters bragging about. Something like 1.5 TB of RAM? Wow.
Ahh, I see. It basically tries to do it all in one go (ish), hence driving up server loads? Hmm - I guess the only solution is to have it check server loads and temporarily stop execution of the program until the load goes back below (say) 0.5

Remember that Google won't index a site all in one go. In fact, I heard of someone who purchased a site with 20k pages of unique content. They submitted their sitemap to Google and it took Google 12-18 months before it indexed all the pages (well, 99% of them)

I'm not sure how Google and all would figure out naturally which are the most important pages, and index them first, but this must be what Google does.
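A minimal sketch of that load-check idea, assuming a Linux box (sys_getloadavg() isn't available on Windows); the 0.5 threshold, the batch size, and the crawlPage() stub are illustrative, not the real crawler:

PHP:
<?php
// Pause the crawl whenever the 1-minute load average is above a threshold.
function waitForLowLoad($threshold = 0.5, $sleepSeconds = 30) {
    while (true) {
        $load = sys_getloadavg();           // array(1min, 5min, 15min)
        if ($load[0] < $threshold) {
            return;
        }
        sleep($sleepSeconds);               // let the server catch up before resuming
    }
}

function crawlPage($url) {
    // Stand-in for the real work: fetch the page, extract text/links, store in the DB.
    echo "crawling $url\n";
}

$queue = array('http://example.com/', 'http://example.org/');   // hypothetical URL queue

foreach (array_chunk($queue, 100) as $batch) {
    waitForLowLoad();                       // don't start a batch while the box is busy
    foreach ($batch as $url) {
        crawlPage($url);
    }
}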

Yeah, the template was the easiest part of this whole project. I love the template; it's really awesome. Not sure if you've seen the mascot too. I decided to change from an FBI-guy-with-a-search-thingy custom mascot to a polar bear mascot.
Yeah the template is class :) Nah, I haven't seen the new mascot - it sounds really nice.

When it downloads the HTML code, does it download it as a .html file or just into the script? Because if it were downloading the media, CSS, HTML, etc., it could act like a cache afterwards if the script doesn't delete it.
Oh, there's also a little problem with your crawler. Would it keep downloading the same media file if it started crawling subpages? You'd end up with tons of copies of the same media files (imagine the logo being downloaded over and over again because the homepage and the subpages share the same logo).
Yeah, the media files are definitely taking a while. I think it might be faster if your crawler weren't downloading the same media files over and over again.
I think I explained it a bit badly (although I completely agree with the points you made)

Basically the tool (I know I'm self-linking, I hope that's okay mods!) is:

http://www.cogah.com/index.php/WebsiteSize/

Say I enter in:

http://www.dnforum.com/f31/venture-capital-thread-381363.html

It'll scan the HTML, fetch the data, and return a table with the exact size of all files which make up the page.

It's not brilliant yet (it doesn't really support frames yet or flash files, and doesn't always download JS files correctly etc) but the basic idea is okay IMO.

I quite agree, though, that if it did crawl multiple pages, it could have the potential to download the same file over and over again. I do try and run array_unique() over the array of files/URLs I need to download (to check their size), although this isn't perfect.
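For what it's worth, a tiny sketch of that dedupe step; normalising the URLs before array_unique() (an extra step, not necessarily what the tool above does) catches near-duplicates as well:

PHP:
<?php
// Hypothetical list of asset URLs gathered while crawling several pages.
$urls = array(
    'http://example.com/logo.png',
    'http://example.com/logo.png?v=2',
    'HTTP://EXAMPLE.COM/logo.png',
    'http://example.com/style.css',
);

// Normalise (lowercase the scheme/host, drop the query string) so near-duplicates
// collapse too, then dedupe so each file is only downloaded once.
$normalised = array();
foreach ($urls as $u) {
    $p = parse_url($u);
    $normalised[] = strtolower($p['scheme'] . '://' . $p['host'])
                  . (isset($p['path']) ? $p['path'] : '/');
}
print_r(array_unique($normalised));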

To answer your first bit, it downloads the HTML in raw form. An extract of the code is:

PHP:
// Start the cURL stuff
$ch = curl_init();

// Return the response as a string instead of printing it straight out
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// Give up after 5 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 5);

// The page to fetch
curl_setopt($ch, CURLOPT_URL, $url);

// $content holds the raw HTML of the page
$content = curl_exec($ch);

// Page size in bytes
$pageSize = strlen($content);

// HTTP status code, used to catch 404s etc.
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

// Tidy up the handle
curl_close($ch);

So the $content var will simply contain the HTML you'd get if you go to a page and click "View Source"

(The $httpcode is used just in-case a 404 error etc is returned; then I'd output an error message)

Oh yes, I have fixed that problem. MySQL wasn't caching.
If you search a keyword with over 30,000, 40,000, or sometimes 50,000 links - for example "car" - the first search takes around 3.3 seconds (sometimes 1, it depends). Then, if you search it again, it takes only 0.3 seconds.
My search engine caches the keywords. Don't worry, I've got around 1.5 TB of space for it to cache.
Sounds a nice system :) Yes having a cache/index system is the best way around an issue like this.

By the way, I came across the following site earlier:

http://ask.metafilter.com/65244/How-Does-a-Google-Query-Work

Some of the comments further down have some nice information about how this may (sort of) work on a large scale.

Still quite difficult to get ones head around though :)

Thing is, I haven't crawled a lot of sites yet, so when I reindex, or when I index a site with the same keyword...

Well, the cache is cleared (that's good, or else it would have the same links forever and new sites would never show up).
Hmm yeah, I see what you mean. I guess one way around it is to have a sort of 'array' of results (i.e. sites) for each keyword. Then the 'array' (well, database) can be sorted based on the quality of the site, but this wouldn't be easy.
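A rough sketch of what such a per-keyword result store could look like, assuming a hypothetical keyword_results table with a precomputed quality column (none of this is from the actual engine, and the credentials are placeholders):

PHP:
<?php
// Hypothetical keyword -> results cache, keyed on the keyword and sorted by a
// precomputed "quality" score.
$pdo = new PDO('mysql:host=localhost;dbname=search', 'user', 'pass');

function cachedResults(PDO $pdo, $keyword, $limit = 20) {
    $stmt = $pdo->prepare(
        'SELECT url, title FROM keyword_results
         WHERE keyword = :kw
         ORDER BY quality DESC
         LIMIT ' . (int) $limit
    );
    $stmt->execute(array(':kw' => $keyword));
    return $stmt->fetchAll(PDO::FETCH_ASSOC);   // fast path: an indexed lookup, no crawling
}

// The first searcher pays the indexing cost; everyone after hits this query.
print_r(cachedResults($pdo, 'car'));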

I typed a lot :)
Hopefully my quote thingy wasn't messed up.
Seems fine to me ;)
 

hugegrowth

Level 10
Legacy Exclusive Member
Joined
Mar 28, 2005
Messages
5,992
Reaction score
150
Think of all the engineers that Microsoft has - even with them, it hasn't been able to make a search engine that performs better than Google (okay, Bing is awesome, but still..it took them a decade)

A lot of great ideas often come from one or two guys working in their garage. Google came along while Microsoft and Yahoo already existed. YouTube came along with G, M and Y already established. Twitter started small and grew. I agree that there are probably thousands of people pitching ideas out there, but don't let that stop you if you think you have something good.

If your issue is servers, you may just need to find a partner with access to lots of server capacity.
 

GUA

Gremlin
Legacy Exclusive Member
Joined
Aug 22, 2006
Messages
1,495
Reaction score
5
Not to rain on your parade, but VCs, banks... whatever kind of investment you are looking for, they won't look at you if you are under 18.

A business plan is a must-have too.

Read a book called 'Boo Hoo'... see my blog for a review.

The search engine market is way too competitive. Why should I use yours over Google?

And the best advice I can give you: if you take investment, you do not own your idea any longer. You really need to assess this: do you really need funding? Think about it, then wait a week, and think about it again.
 

Albert Tai

Downloading More Ram!
Legacy Platinum Member
Joined
Feb 9, 2009
Messages
748
Reaction score
0
Not to rain on your parade, but VCs, banks... whatever kind of investment you are looking for, they won't look at you if you are under 18.

A business plan is a must-have too.

Read a book called 'Boo Hoo'... see my blog for a review.

The search engine market is way too competitive. Why should I use yours over Google?

And the best advice I can give you: if you take investment, you do not own your idea any longer. You really need to assess this: do you really need funding? Think about it, then wait a week, and think about it again.
I'll reply to the quotes above later, as I am playing volleyball now :) (need exercise every day)
I'm 16. =/ So I'm guessing that's not beneficial?
I'll try to put together a business plan.
Basically the funds would go into servers, company registration, and development.
Nothing else.


And as for uniqueness, the first step is having an amazing template; some people on DNForum have seen it.

Hopefully it loads fast. But I love it. It's amazing :) Even better than Live.com's template (Live > Bing).

Well, at first I wasn't going to rely on VCs at all.

I was thinking about doing this all with my own funds, but then it was mostly the server problem that killed me.

If I wanted to create a successful Facebook- or YouTube-type site, I could start off with just one server and then add more as visitors come.

As for a search engine, you have to start with a lot of servers or you will never index the whole web (I think it would take me a year - that's a long time).

Fortune magazine has had some articles in the last year or so on VC funds, who the main players are, etc. Plus if you search "venture capital" along with 'twitter', 'digg', 'facebook', etc, you will find more.

The last thing I read was about the guy who developed the Netscape browser, Marc Andreessen; he is in a recent Fortune article with a list of things he is helping to fund right now. There are names of other VCs in the article. You would want someone with cash, but also someone with connections who can give you good advice.

Side note: In the article it mentions how Andreessen is dealing with people now who are too young to remember the Netscape browser, lol, it was only what, 10 years ago?!

Here is the article, a good read as well:

http://money.cnn.com/2009/07/02/tec...fund.fortune/index.htm?postversion=2009070605

and good luck!

---
I think it's longer than 10 years, as I don't remember Netscape either.
But I did try it out a few years ago (the newer one) with the little fireworks. I did read up on its history, though - how it was once the best browser (I was really surprised) and how it got killed off when Microsoft started shipping IE with all the computers.
*reading the article at the moment, thanks*
This is actually a very complicated, very difficult process.

Every month, THOUSANDS of entrepreneurs virtually beg venture capitalists to invest in their startup.

The only thing that will grab their attention is one of the following:
a) A solid, very marketable product
b) A great team with a great track record
c) Clever marketing
d) Proven performance

You might not necessarily need a working prototype, but your pitch has to be perfect.

Visit YCombinator.com - they provide early-stage seed money to startups (usually $20k)

Keep an eye on blogs like Mashable.com and TechCrunch.com

Getting venture capital firms to invest in your startup is one of the hardest parts of the business. Your market (search) is monopolized and very saturated with the big players. The costs of entering the search engine market are just too high for new players. Not to burst your bubble, but I doubt you'll be able to index even 1/100000th of the web with one single dedicated server; you'll need a server farm the size of half a city block.

Also, the technology has to be very, very good, revolutionary even.

Think of all the engineers that Microsoft has - even with them, it hasn't been able to make a search engine that performs better than Google (okay, Bing is awesome, but still..it took them a decade)
First, thanks for the link.
As for b), I think that's about proving to the VC how successful you've been, or your potential. I guess I could prove that I had started up something like Microsoft - a software company. Well, not really a company; I didn't register anything. I don't know why I get these big, big ideas. I had in fact been sort of successful: 35k Alexa, PageRank 5 (back in the days when that was really hard). I was only 11 then. I had to close the site because, well, when you're 11 and you start a site... you normally don't start the site with nothing. The person who sponsored me then decided to list my site on Sedo, even though I BUILT IT - but I never had the domain under my control, as I was sponsored. Foolish. He sold it, though I'm not sure for how much; I'm pretty sure it was in the thousands, as my software was on Download.com, Tucows, etc.
Yeah, I've heard it's really hard to get startup money, but I assure you search is one of the most profitable things. And I can probably index a lot of sites, just not the whole web, with a single dedi. I could probably do it in a year (I think, by my calculations). Of course, that is still a long time. And a server farm isn't really needed - I'm actually going to rent a block of servers rather than buy them. That will reduce the cost a lot.

The thing is, did Bing take them a decade? Not really. Microsoft wasn't really focusing on the search niche. After a while they realized Google was earning tons of money and they wanted to get into this field. It actually took something like one year (correct me if I'm wrong) to get Bing up and going. Microsoft wanted a part of that share.

Google makes billions each year, so even a little fraction of the pie means Microsoft could earn billions.
That sounds a nice system :) It's always best to code with flexibility in mind.
Yep. Something I think was a good feature of it.


Hehe no worries. Yeah I'm not too sure how best to solve it, although it's not a major practise (anymore; it was a black hat SEO method about 5 years ago) so TBH I wouldn't worry about it now. Although I guess a basic CSS check would be possible, but not a massive biggie.
Yeah, I really don't see anyone doing it anymore. Anyhow, I could probably just hand out penalties if anyone did it. I have a blacklist system on my crawler. :) Anyone who does that gets blacklisted.

That might be it. I'm either on a 10 mbps dedicated line or 100 mbps shared line (capped at 3 TB) - either way it's sort of quick, but is still the 'weak link' :yes:
Yeah. I really can't afford the thousands of dollars for a gigabit dedicated uplink. I really don't know how much it affects things anyhow.
I would rather have 10 low-end dedicated servers running crawlers at the same time (like how Google has 50 running at the same time on super servers, lmfao) and then have it all joined into one database. I just need to try it and see if it can be joined together. I am going to try setting it up in different locations on the server I have right now, to see if it will work with 10 different low-end dedis. That way 2,000-3,000 links per hour could become x10, or possibly x100 when I have more money, since low-end servers are way cheaper. Now that would be fast and could hopefully index the whole web in a few months instead of years.


Ahh, I see. It basically tries to do it all in one go (ish), hence driving up server loads? Hmm - I guess the only solution is to have it check server loads and temporarily stop execution of the program until the load goes back below (say) 0.5
I saw another crawler someone had (another friend also tried to start a search engine). After 3,000 or 4,000 links the crawler stops for 5 minutes, then starts again. I need to add something like that to my crawler. Mine just keeps going and going, and if it hits an error it's really awkward for me, because the crawler stops. If I let it start again it just creates a sitemap file (my crawler creates sitemaps so it's faster to index and reindex), except that the crawler isn't done indexing the whole site. So I only have the links gathered before the crawler hit the error - for example, if it dies 24k links into crawling the site, it will have 24k, unless I delete the site, the temp files, the sitemap, and the media files. The positive thing is that only about 10% of sites (I'm throwing a number out there) have more pages than that. Almost every little site I'm crawling - ones I see on DP now and then, just to test it out - takes 1-10 minutes and finishes crawling with no errors. Big sites like dmoz.org are the ones that kill it.

Remember that Google won't index a site all in one go. In fact, I heard of someone who purchased a site with 20k pages of unique content. They submitted their sitemap to Google and it took Google 12-18 months before it indexed all the pages (well, 99% of them)
Maybe I should do that.

I'm not sure how Google and all would figure out naturally which are the most important pages, and index them first, but this must be what Google does.
No, I can do that. I can set the depth to which the crawler indexes a site. For example, I can set it so it only indexes one level: if I index the homepage, it only indexes the links on the homepage and then stops; it does not follow the links on the subpages and keep going until all the sites are done. I believe that's how Google did it. The problem is I don't want to go through the trouble of deleting the site, the sitemap, and the media files just to index the site properly afterwards, as I don't have a way to index it fully after a partial index.
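For illustration, a minimal sketch of a depth-limited crawl of that sort; fetchLinks() and indexPage() are stand-ins for the real fetching and indexing, not the actual crawler's code:

PHP:
<?php
// Breadth-first crawl that stops following links beyond $maxDepth.
function fetchLinks($url) {
    // Stand-in: fetch the page and return the absolute URLs it links to.
    $html = @file_get_contents($url);
    preg_match_all('/<a[^>]+href=["\'](https?:\/\/[^"\']+)["\']/i', (string) $html, $m);
    return array_unique($m[1]);
}

function indexPage($url) {
    // Stand-in: extract the text/keywords from the page and store them in the index.
    echo "indexing $url\n";
}

function crawl($startUrl, $maxDepth = 1) {
    $seen  = array($startUrl => true);
    $queue = array(array($startUrl, 0));            // (url, depth) pairs

    while ($queue) {
        list($url, $depth) = array_shift($queue);
        indexPage($url);
        if ($depth >= $maxDepth) {
            continue;                               // depth 1 = homepage links only
        }
        foreach (fetchLinks($url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = array($link, $depth + 1);
            }
        }
    }
}

crawl('http://www.example.com/', 1);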

Yeah the template is class :) Nah, I haven't seen the new mascot - it sounds really nice.
Here's the new template with the mascot at the bottom. The drop-down box is the keyword suggestion, like on Google.
(Mods can remove this if I'm not allowed to link.)
http://img.brivy.com/images/m5ryd7no5liogn8s069.jpg (that's the mascot at the bottom - nice clean look)
Results page: http://img.brivy.com/images/4vh2r1syhorhrs1vhgn.jpg
I've got more results pages (just go back to the main image hosting site and press "public gallery" to see them). Most of them are first drafts, like the coming-soon page; the final drafts will be done soon.


I think I explained it a bit badly (although I completely agree with the points you made)

Basically the tool (I know I'm self-linking, I hope that's okay mods!) is:

http://www.cogah.com/index.php/WebsiteSize/

Say I enter in:

http://www.dnforum.com/f31/venture-capital-thread-381363.html

It'll scan the HTML, fetch the data, and return a table with the exact size of all files which make up the page.
I tested out a site I own. I think your tool is pretty good. One recommendation: make a separate tool, just like that code, but have it calculate the time it takes to load the site, then divide the size by the time (or the other way around). Then you can work out KB per second - how fast the server is. There are tools out there like that, and they really help me, since they tell me not just what to make smaller but how fast the server is - how fast it is from its server to yours.
It's not brilliant yet (it doesn't really support frames yet or flash files, and doesn't always download JS files correctly etc) but the basic idea is okay IMO.
Ah, my crawler also downloads media like Flash and video, but sometimes there are problems with it. It downloads music too.
I quite agree, though, that if it did crawl multiple pages, it could have the potential to download the same file over and over again. I do try and run array_unique() over the array of files/URLs I need to download (to check their size), although this isn't perfect.
Well, it's pretty good at the moment. I suggested up there ^ how the tool could be modified, or how you could make a new one with the same features but even more useful.

To answer your first bit, it downloads the HTML in raw form. An extract of the code is:

PHP:
// Start the cURL stuff
$ch = curl_init();

// Return the response as a string instead of printing it straight out
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// Give up after 5 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 5);

// The page to fetch
curl_setopt($ch, CURLOPT_URL, $url);

// $content holds the raw HTML of the page
$content = curl_exec($ch);

// Page size in bytes
$pageSize = strlen($content);

// HTTP status code, used to catch 404s etc.
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

// Tidy up the handle
curl_close($ch);
So the $content var will simply contain the HTML you'd get if you go to a page and click "View Source"

(The $httpcode is used just in-case a 404 error etc is returned; then I'd output an error message)
Ah, I see. Does the HTML stay on your server, or is it deleted after the results are shown?


Sounds a nice system :) Yes having a cache/index system is the best way around an issue like this.
Yes, it's much, much faster. And it's way better in that if one person searches "cars", everyone else in the world who searches "cars" after him gets a quicker result. It will be fast.

But I was also thinking about a cache like Google's cache - how they store copies of webpages on their site. I was thinking that if there were a script (or I built one) that stores a cache of the site - basically storing the whole webpage - it might be faster than crawling it: I could then send the crawler to my own copy, which would crawl faster. But then again, ugh, it's a bad idea somehow. I will need a cache system though, as I think it's a great thing. And I already store thumbnails of the images anyway, so I might as well just download the whole site together.

By the way, I came across the following site earlier:

http://ask.metafilter.com/65244/How-Does-a-Google-Query-Work

Some of the comments further down have some nice information about how this may (sort of) work on a large scale.

Still quite difficult to get ones head around though :)
Reading it right now. I'm really not surprised. They have thousands of servers with super hardware installed in them. The thing is, they've been around since 1998, once hosted on Stanford's servers. I mean, if this were 1996 my search engine would kill theirs, as they didn't get image search until 2002 (per archive.org).
I've already got image search. But they've got a 13-year head start on me. I was 3 when Google started. LOL :(


Hmm yeah, I see what you mean. I guess one way around it is to have a sort of 'array' of results (i.e. sites) for each keyword. Then the 'array' (well, database) can be sorted based on the quality of the site, but this wouldn't be easy.
Yeah, definitely. But once I have most of the sites crawled, it won't be a problem. Every time I reindex a site and the cache gets cleared, only a "lucky" few will get the 3- or 4-second search slowness. Everyone after them searching the same keyword will never see it.

Seems fine to me ;)
Yep :)
I'm a master of quoting now.
A lot of great ideas often come from one or two guys working in their garage. Google came along while Microsoft and Yahoo already existed. YouTube came along with G, M and Y already established. Twitter started small and grew. I agree that there are probably thousands of people pitching ideas out there, but don't let that stop you if you think you have something good.

If your issue is servers, you may just need to find a partner with access to lots of server capacity.
I know it's not a unique idea. But as agreed above, neither were Google, Gmail, etc.
Good examples of sites/software that weren't unique ideas but were still successful:
New vs. old (format)
IE vs. Netscape (read about it on the internet)
Mozilla Firefox vs. IE
Gmail vs. Yahoo Mail, Hotmail
YouTube vs. Google Video, Yahoo Video, etc.
Twitter vs. Facebook, Myspace, etc.

The problem with search engines, as agreed above, is the servers.
For example, when Microsoft launched Bing.com they had just built two massive datacenters.


Before I had the money to buy my own dedicated server, I contacted a few hosting companies to ask. An old friend of mine who runs a successful hosting company in the UK - he's known me since I was 12 and knew I'd built the decent software site mentioned above - sponsored me with a dedicated server. The rest, the big ones - you can Google any big dedi company - don't even bother. Ask for sponsorship at any big dedicated server company and you will find rejection after rejection.

I don't know if it's the recession or not, but most companies don't do that. Oh yes, I did get a server from the UK friend I know, but I ended up buying a better one.

I might hit him up and ask him for ten dedis, though I doubt it would work.

I don't know, but I'll keep focusing on the development of the site rather than the servers, as I really want to perfect the script.

I'll look at the sites you've given me so far. I'm also preparing for the SAT next month and ;) lifeguarding.

Thanks for all the suggestions so far. I really do appreciate it.
 

tristanperry

Domainer & Web/Software Dev
Legacy Exclusive Member
Joined
Jan 5, 2007
Messages
1,584
Reaction score
6
Yeah, I really don't see anyone doing it anymore. Anyhow, I could probably just hand out penalties if anyone did it. I have a blacklist system on my crawler. :) Anyone who does that gets blacklisted.
Yep, that seems a nice ability to have :)

Yeah. I really can't afford the thousands of dollars for a gigabit dedicated uplink. I really don't know how much it affects things anyhow.
I would rather have 10 low-end dedicated servers running crawlers at the same time (like how Google has 50 running at the same time on super servers, lmfao) and then have it all joined into one database. I just need to try it and see if it can be joined together. I am going to try setting it up in different locations on the server I have right now, to see if it will work with 10 different low-end dedis. That way 2,000-3,000 links per hour could become x10, or possibly x100 when I have more money, since low-end servers are way cheaper. Now that would be fast and could hopefully index the whole web in a few months instead of years.
Yeah, a bunch of smaller servers would be best. Having one giant server is always a worry in case it crashes, and then you're screwed until it's brought back up again.

I saw another crawler someone had (another friend also tried to start a search engine). After 3,000 or 4,000 links the crawler stops for 5 minutes, then starts again. I need to add something like that to my crawler. Mine just keeps going and going, and if it hits an error it's really awkward for me, because the crawler stops. If I let it start again it just creates a sitemap file (my crawler creates sitemaps so it's faster to index and reindex), except that the crawler isn't done indexing the whole site. So I only have the links gathered before the crawler hit the error - for example, if it dies 24k links into crawling the site, it will have 24k, unless I delete the site, the temp files, the sitemap, and the media files. The positive thing is that only about 10% of sites (I'm throwing a number out there) have more pages than that. Almost every little site I'm crawling - ones I see on DP now and then, just to test it out - takes 1-10 minutes and finishes crawling with no errors. Big sites like dmoz.org are the ones that kill it.
That sounds a nice way of going about it. I do think that staggering it is best.

I know that SMF (the forum software) sets things up (obviously slightly differently since it's web-based and a crawler is a behind-the-scenes cron job) so that maintenance tools on a large forum will run a set number of queries (etc) and then stop. It stops it by outputting a basic HTML page that has a meta refresh of around 5 seconds on it. Once it refreshes it then goes to say index.php?action=...&step=10000 or whatever which then resumes it from step 10000

This way it won't kill the server. Perhaps you may get higher than 2k or 3k per hour speeds if you stagger it too. High server loads are really harmful to CPU efficiency.
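A rough sketch of that resume-by-step pattern as a plain command-line cron script (SMF does it via a meta refresh in the browser); the step file, the urls.txt queue, the batch size, and the crawlPage() stub are all made up for illustration:

PHP:
<?php
// Process the crawl queue in fixed-size batches, persisting the current step so a
// crashed or interrupted run can resume where it left off.
function crawlPage($url) {
    echo "crawling $url\n";                         // stand-in for fetch + index
}

$stepFile  = '/tmp/crawl_step.txt';                 // progress marker
$batchSize = 1000;
$step      = is_file($stepFile) ? (int) file_get_contents($stepFile) : 0;
$urls      = is_file('urls.txt') ? file('urls.txt', FILE_IGNORE_NEW_LINES) : array();

while ($step < count($urls)) {
    foreach (array_slice($urls, $step, $batchSize) as $url) {
        crawlPage($url);
    }
    $step += $batchSize;
    file_put_contents($stepFile, $step);            // like &step=10000 in the SMF URL
    sleep(5);                                       // give the server a breather
}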

No, I can do that. I can set the depth to which the crawler indexes a site. For example, I can set it so it only indexes one level: if I index the homepage, it only indexes the links on the homepage and then stops; it does not follow the links on the subpages and keep going until all the sites are done. I believe that's how Google did it. The problem is I don't want to go through the trouble of deleting the site, the sitemap, and the media files just to index the site properly afterwards, as I don't have a way to index it fully after a partial index.
Wow that sounds awesome :) Fair play, you sound a very smart coder and person - this search engine sounds very powerful. I think with the template (as below) it has the potential to take off quite nicely.

Here's the new template with the mascot at the bottom. The drop-down box is the keyword suggestion, like on Google.
(Mods can remove this if I'm not allowed to link.)
http://img.brivy.com/images/m5ryd7no5liogn8s069.jpg (that's the mascot at the bottom - nice clean look)
Results page: http://img.brivy.com/images/4vh2r1syhorhrs1vhgn.jpg
I've got more results pages (just go back to the main image hosting site and press "public gallery" to see them). Most of them are first drafts, like the coming-soon page; the final drafts will be done soon.
I really like the changes you've made :) The new mascot is very nice. Out of interest; will the BG image (currently different colours and all) behind the "BRIVY beta" logo be changing like Bing or stay as-is? Either way I like it a lot. The results page seems nice too.

I tested out a site I own. I think your tool is pretty good. One recommendation: make a separate tool, just like that code, but have it calculate the time it takes to load the site, then divide the size by the time (or the other way around). Then you can work out KB per second - how fast the server is. There are tools out there like that, and they really help me, since they tell me not just what to make smaller but how fast the server is - how fast it is from its server to yours.
Those sound like good suggestions, thanks a lot :) Yes, I'll definitely look into that. I'm having a bit of a coding hiatus at the moment (I'm going back to Uni soon - bah! - so I'll be doing a bit less for a while), but I am planning on adding a couple more tools, so I'll make this one of the new tools soon enough. I agree that it'd make sense to add something like this.

Ah, my crawler also downloads media like Flash and video, but sometimes there are problems with it. It downloads music too.
Hehe awesome stuff :)

Well, it's pretty good at the moment. I suggested up there ^ how the tool could be modified, or how you could make a new one with the same features but even more useful.
Thanks :) And yes, I'll definitely implement something like that.

Ah, I see. Does the HTML stay on your server, or is it deleted after the results are shown?
At the moment the results/HTML are deleted (i.e. they're just stored in the variable until the end of the script's processing), however I am thinking of caching the results or something if and When Cogah becomes more popular.

Yes, it's much, much faster. And it's way better in that if one person searches "cars", everyone else in the world who searches "cars" after him gets a quicker result. It will be fast.
Makes sense and sounds good. You could even, when finished, look at running a cron that will automatically search for various keywords (i.e. simulate that a real person is searching) that haven't been queried before, meaning that no-one will have the slow loading times. Although such a cron could be quite resource intensive unless it runs very slow, hmm.

But I was also thinking about a cache like Google's cache - how they store copies of webpages on their site. I was thinking that if there were a script (or I built one) that stores a cache of the site - basically storing the whole webpage - it might be faster than crawling it: I could then send the crawler to my own copy, which would crawl faster. But then again, ugh, it's a bad idea somehow. I will need a cache system though, as I think it's a great thing. And I already store thumbnails of the images anyway, so I might as well just download the whole site together.
Hmm yes I know what you mean. It could be worth doing, but as you say there will be some issues with it. Another potential issue with this is that you'd still need to re-index the site itself every so often since they may update it and all. So a cache system could be nice for behind-the-scenes testing and all (I guess like what Google Caffeine has been doing), but the whole sites would still need indexing.

Reading it right now. I'm really not surprised. They have thousands of servers with super hardware installed in them. The thing is, they've been around since 1998, once hosted on Stanford's servers. I mean, if this were 1996 my search engine would kill theirs, as they didn't get image search until 2002 (per archive.org).
I've already got image search. But they've got a 13-year head start on me. I was 3 when Google started. LOL :(
Hehe yeah, I know what you mean. The only plus point is that CPU power has probably increased by 3-4 times since a decade ago (or more), although they've still had a head start lol.

Yeah, definitely. But once I have most of the sites crawled, it won't be a problem. Every time I reindex a site and the cache gets cleared, only a "lucky" few will get the 3- or 4-second search slowness. Everyone after them searching the same keyword will never see it.
Yep that makes sense.
 

Albert Tai

Downloading More Ram!
Legacy Platinum Member
Joined
Feb 9, 2009
Messages
748
Reaction score
0
Yep, that seems a nice ability to have :)
mmmmm


Yeah, a bunch of smaller servers would be best. Having one giant server is always a worry in case it crashes, and then you're screwed until it's brought back up again.
yep agreed.


That sounds a nice way of going about it. I do think that staggering it is best.

I know that SMF (the forum software) sets things up (obviously slightly differently since it's web-based and a crawler is a behind-the-scenes cron job) so that maintenance tools on a large forum will run a set number of queries (etc) and then stop. It stops it by outputting a basic HTML page that has a meta refresh of around 5 seconds on it. Once it refreshes it then goes to say index.php?action=...&step=10000 or whatever which then resumes it from step 10000

This way it won't kill the server. Perhaps you may get higher than 2k or 3k per hour speeds if you stagger it too. High server loads are really harmful to CPU efficiency.
Yeah, I never really looked at the CPU load when indexing.


Wow that sounds awesome :) Fair play, you sound a very smart coder and person - this search engine sounds very powerful. I think with the template (as below) it has the potential to take off quite nicely.
Thank you. Truth is, I don't code. :) It's an open-source script at the moment, with me hiring programmers. I only program in Visual Basic (I <3 it).

I really like the changes you've made :) The new mascot is very nice. Out of interest; will the BG image (currently different colours and all) behind the "BRIVY beta" logo be changing like Bing or stay as-is? Either way I like it a lot. The results page seems nice too.
Yes, there's a big chance it might. The problem with doing it like Bing is that I'll have to search the web for full-rights images I can put on the site. If you look at Bing, they use images they have the rights to (there are free ones on the web). Also, mine is a bit different from Bing, as my background is full-page and Bing's is a little box. If I do have backgrounds changing daily like Bing, I'll have the mascot with them.


Those sound like good suggestions, thanks a lot :) Yes, I'll definitely look into that. I'm having a bit of a coding hiatus at the moment (I'm going back to Uni soon - bah! - so I'll be doing a bit less for a while), but I am planning on adding a couple more tools, so I'll make this one of the new tools soon enough. I agree that it'd make sense to add something like this.
High school here :). But seriously maybe you could consider selling the tools on DP or any webmaster forum. It could be a potential money maker. But make sure I'm not sure if I would do this possibly encrypt the php as there's alot of people on dp and everywhere who loves to claim their work as their own and sell it even though they have no rights to do it.


Hehe awesome stuff :)
:)
Thanks :) And yes, I'll definitely implement something like that.
I'll test it out then, as I'll be able to see the KB-per-second transfer rate for the site.


At the moment the results/HTML are deleted (i.e. they're just stored in the variable until the end of the script's processing), however I am thinking of caching the results or something if and When Cogah becomes more popular.
Something that would be cool, but possibly more work: maybe you could make a sort of giant Google cache. :) Of course, if you do, I'll have my search engine feed off your cache.

Makes sense and sounds good. You could even, when finished, look at running a cron that will automatically search for various keywords (i.e. simulate that a real person is searching) that haven't been queried before, meaning that no-one will have the slow loading times. Although such a cron could be quite resource intensive unless it runs very slow, hmm.
Or possibly not. I don't think the first person searching car would mind :D

Hmm yes I know what you mean. It could be worth doing, but as you say there will be some issues with it. Another potential issue with this is that you'd still need to re-index the site itself every so often since they may update it and all. So a cache system could be nice for behind-the-scenes testing and all (I guess like what Google Caffeine has been doing), but the whole sites would still need indexing.
Yep.

Hehe yeah, I know what you mean. The only plus point is that CPU power has probably increased by 3-4 times since a decade ago (or more), although they've still had a head start lol.
Too much of a head start. They own around 450k servers.
:)

Yep that makes sense.
:)






Anywhoo... the HTML coder is a bit slow, as he's having trouble opening the 270 MB PSD. LOL :p

Anyhow, it will be done soon - the homepage, subpages, etc. The script is the next thing I'll conquer.
 